Re: 7.4RC2 regression failur and not running stats collector process

Started by Derek Morr, about 22 years ago, 15 messages
#1Derek Morr
dvm105@psu.edu

I think I have some more information on the statistics collector startup
problem on Solaris.

I inserted the following into pgstat.c:

    if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
    {
        /* what type of socket are we trying to bind? */
        fprintf(stderr, "Address family is %d\n",
                addr->ai_addr->sa_family);

        ...
    }

This returns a value of 26, which on Solaris is AF_INET6. But the
machine I'm using (a V880 running 2.8) has no IPv6 address on any of its
interfaces. And addr->ai_addr->sa_data is empty, so it's no surprise why
bind() is failing.

I'm not sure why Solaris is giving getaddrinfo_all an IPv6 address, though.

-derek
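
The check Derek describes can be reproduced outside the server by listing what the resolver returns for a host. The following is only a sketch under stated assumptions: it calls the standard getaddrinfo() directly rather than PostgreSQL's getaddrinfo_all() wrapper, and family_name() and dump_addresses() are hypothetical helper names invented for this illustration. The numeric value printed for AF_INET6 differs between platforms (26 on Solaris, per the message above).

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Map an address family constant to a readable name (illustrative helper). */
static const char *
family_name(int family)
{
    switch (family)
    {
        case AF_INET:
            return "AF_INET";
        case AF_INET6:
            return "AF_INET6";
        case AF_UNIX:
            return "AF_UNIX";
        default:
            return "other";
    }
}

/*
 * Print every datagram address that getaddrinfo() returns for "host".
 * Returns the number of addresses found, or -1 if the lookup fails.
 */
static int
dump_addresses(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *addrs = NULL;
    struct addrinfo *addr;
    int         count = 0;
    int         rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;        /* accept IPv4 and IPv6 results */
    hints.ai_socktype = SOCK_DGRAM;     /* the stats collector uses UDP */

    rc = getaddrinfo(host, NULL, &hints, &addrs);
    if (rc != 0)
    {
        fprintf(stderr, "getaddrinfo failed: %s\n", gai_strerror(rc));
        return -1;
    }

    for (addr = addrs; addr != NULL; addr = addr->ai_next)
    {
        printf("family %d (%s), addrlen %d\n",
               addr->ai_family, family_name(addr->ai_family),
               (int) addr->ai_addrlen);
        count++;
    }

    freeaddrinfo(addrs);
    return count;
}
```

Calling dump_addresses("localhost") from a small main() on an affected machine should show whether an AF_INET6 entry is listed ahead of the AF_INET one, which is the situation reported here.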

#2Kurt Roeckx
Q@ping.be
In reply to: Derek Morr (#1)

On Thu, Nov 13, 2003 at 04:04:23PM -0500, Derek Morr wrote:

the
machine I'm using (a V880 running 2.8) has no IPv6 address on any of its
interfaces.

So the for loop over the addresses that are returned should go
over both socket() and bind() instead of only socket(). And
probably connect() too.

The code now assumes if you create a socket of a certain type you
can actually use it.

Kurt
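
Kurt's suggestion can be sketched as a standalone function: each candidate address has to survive socket(), bind(), getsockname() and connect() before it is accepted, and a failure at any step moves on to the next address. This is a minimal illustration, not the pgstat.c code. It assumes plain POSIX getaddrinfo() and close() in place of PostgreSQL's getaddrinfo_all() and closesocket(), reports failures with perror() instead of ereport(), and open_connected_dgram() is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/*
 * Return a UDP socket bound to a kernel-assigned port on "host" and
 * connected to its own address, or -1 if no returned address works.
 */
static int
open_connected_dgram(const char *host)
{
    struct addrinfo hints;
    struct addrinfo *addrs = NULL;
    struct addrinfo *addr;
    struct sockaddr_storage self;
    socklen_t   selflen;
    int         sock = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_DGRAM;

    /* A NULL service leaves the port at 0, so the kernel picks one. */
    if (getaddrinfo(host, NULL, &hints, &addrs) != 0)
        return -1;

    for (addr = addrs; addr != NULL; addr = addr->ai_next)
    {
        sock = socket(addr->ai_family, SOCK_DGRAM, 0);
        if (sock < 0)
        {
            perror("socket");
            continue;           /* try the next address */
        }

        if (bind(sock, addr->ai_addr, addr->ai_addrlen) < 0)
        {
            perror("bind");
            close(sock);
            sock = -1;
            continue;
        }

        /* Find out which port the kernel assigned. */
        selflen = sizeof(self);
        if (getsockname(sock, (struct sockaddr *) &self, &selflen) < 0)
        {
            perror("getsockname");
            close(sock);
            sock = -1;
            continue;
        }

        /* Connect the socket to its own address. */
        if (connect(sock, (struct sockaddr *) &self, selflen) < 0)
        {
            perror("connect");
            close(sock);
            sock = -1;
            continue;
        }

        break;                  /* this address works end to end */
    }

    freeaddrinfo(addrs);
    return sock;                /* -1 if every address failed */
}
```

The design point is that a successful socket() call proves nothing about bind() or connect(), so the socket is closed and reset to -1 on every failure path and the final value of sock tells the caller whether any address worked.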

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kurt Roeckx (#2)

Kurt Roeckx <Q@ping.be> writes:

So the for loop over the addresses that are returned should go
over both socket() and bind() instead of only socket(). And
probably connect() too.
The code now assumes if you create a socket of a certain type you
can actually use it.

Ah, light dawns... the postmaster socket code does this correctly,
but pgstat.c doesn't.

Too bad we didn't figure this out yesterday. We are now in code freeze
for 7.4 release, and I'm hesitant to apply a fix for what is arguably a
broken platform. Core guys, time for a vote ... do we fix, or hold this
for 7.4.1?

regards, tom lane

#4Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#3)
Re: [CORE] 7.4RC2 regression failur and not running stats collector process

Tom,

Too bad we didn't figure this out yesterday. We are now in code freeze
for 7.4 release, and I'm hesitant to apply a fix for what is arguably a
broken platform. Core guys, time for a vote ... do we fix, or hold this
for 7.4.1?

One thing I've not seen an answer to: does Postgres run acceptably on other
people's Solaris boxes? If this bug is preventing running on Solaris at
all, I'd say fix it ... Solaris is a major platform. If it only affects
users of one particular Solaris patch version, then we do a big warning and
save it for 7.4.1.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#5Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#3)
Re: 7.4RC2 regression failur and not running stats collector

Tom Lane wrote:

Kurt Roeckx <Q@ping.be> writes:

So the for loop over the addresses that are returned should go
over both socket() and bind() instead of only socket(). And
probably connect() too.
The code now assumes if you create a socket of a certain type you
can actually use it.

Ah, light dawns... the postmaster socket code does this correctly,
but pgstat.c doesn't.

Too bad we didn't figure this out yesterday. We are now in code freeze
for 7.4 release, and I'm hesitant to apply a fix for what is arguably a
broken platform. Core guys, time for a vote ... do we fix, or hold this
for 7.4.1?

Must fix, I believe, especially if it is the same function call sequence
used by the postmaster so we have a high probability it will work on all
platforms.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#4)
Re: [CORE] 7.4RC2 regression failur and not running stats collector process

Josh Berkus <josh@agliodbs.com> writes:

One thing I've not seen an answer to: does Postgres run acceptably on other
people's Solaris boxes? If this bug is preventing running on Solaris at
all, I'd say fix it ... Solaris is a major platform. If it only affects
users of one particular Solaris patch version, then we do a big warning and
save it for 7.4.1.

I'm sure it depends on which Solaris version you're running, and
possibly on local configuration issues as well. We should not however
assume that the problem occurs *only* on Solaris. My take on a lot of
the IPv6 funnies we've turned up is that they are kernel/userland
compatibility issues (v6-ready libc on non-v6-ready kernel or vice
versa), and that's surely at least as likely on Linux as Solaris.

The regression test that detects the problem was only put in as of
7.4beta3. I'm not sure how many of our existing port reports were
based on prior betas ...

regards, tom lane

#7P.J. "Josh" Rovero
rovero@sonalysts.com
In reply to: Bruce Momjian (#5)
Re: 7.4RC2 regression failur and not running stats collector

Solaris (5.7, 5.8, 5.9) on many different
workstation/server types is very important to us...

I agree with Bruce....

Bruce Momjian wrote:

Must fix, I believe, especially if it is the same function call sequence
used by the postmaster so we have a high probability it will work on all
platforms.

--
P. J. "Josh" Rovero Sonalysts, Inc.
Email: rovero@sonalysts.com www.sonalysts.com 215 Parkway North
Work: (860)326-3671 or 442-4355 Waterford CT 06385
***********************************************************************

#8Christopher Browne
cbbrowne@libertyrms.info
In reply to: Derek Morr (#1)
Re: [CORE] 7.4RC2 regression failur and not running stats collector process

josh@agliodbs.com (Josh Berkus) writes:

Too bad we didn't figure this out yesterday. We are now in code freeze
for 7.4 release, and I'm hesitant to apply a fix for what is arguably a
broken platform. Core guys, time for a vote ... do we fix, or hold this
for 7.4.1?

One thing I've not seen an answer to: does Postgres run acceptably on other
people's Solaris boxes? If this bug is preventing running on Solaris at
all, I'd say fix it ... Solaris is a major platform. If it only affects
users of one particular Solaris patch version, then we do a big warning and
save it for 7.4.1.

For what it's worth, I have been running regression on Solaris with
numerous of the betas, and RC1 and [just now] RC2, with NO problems.

If the patch is deemed vital for others, it's possible that all I'm
reporting is one of the statistics that will be outnumbered by others.
(And in that case, I would be quick to test the patch to ensure it
causes no adverse side-effects.)

But it's not apparent that it is _vital_ here right now.
--
let name="cbbrowne" and tld="libertyrms.info" in name ^ "@" ^ tld;;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Browne (#8)
Re: [CORE] 7.4RC2 regression failur and not running stats collector process

Christopher Browne <cbbrowne@libertyrms.info> writes:

For what it's worth, I have been running regression on Solaris with
numerous of the betas, and RC1 and [just now] RC2, with NO problems.

It seems clear that some Solaris installations are affected and some
are not. Presumably there is some version difference or some local
configuration difference ... but since we don't know what the critical
factor is, we have no basis for guessing what fraction of Solaris
installations will see the problem.

(And in that case, I would be quick to test the patch to ensure it
causes no adverse side-effects.)

Here is the proposed patch --- please test it ASAP if you can.
This is against RC2.

regards, tom lane

#10Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#9)
Re: [CORE] 7.4RC2 regression failur and not running stats

I can fire up our solaris machine and let you have access to it if you
want to do some destructive testing.

Tom Lane wrote:

Christopher Browne <cbbrowne@libertyrms.info> writes:

For what it's worth, I have been running regression on Solaris with
numerous of the betas, and RC1 and [just now] RC2, with NO problems.

It seems clear that some Solaris installations are affected and some
are not. Presumably there is some version difference or some local
configuration difference ... but since we don't know what the critical
factor is, we have no basis for guessing what fraction of Solaris
installations will see the problem.

(And in that case, I would be quick to test the patch to ensure it
causes no adverse side-effects.)

Here is the proposed patch --- please test it ASAP if you can.
This is against RC2.

regards, tom lane

------------------------------------------------------------------------

*** src/backend/postmaster/pgstat.c.orig	Fri Nov  7 16:55:50 2003
--- src/backend/postmaster/pgstat.c	Fri Nov 14 15:02:14 2003
***************
*** 203,208 ****
--- 203,216 ----
goto startup_failed;
}
+ 	/*
+ 	 * On some platforms, getaddrinfo_all() may return multiple addresses
+ 	 * only one of which will actually work (eg, both IPv6 and IPv4 addresses
+ 	 * when kernel will reject IPv6).  Worse, the failure may occur at the
+ 	 * bind() or perhaps even connect() stage.  So we must loop through the
+ 	 * results till we find a working combination.  We will generate LOG
+ 	 * messages, but no error, for bogus combinations.
+ 	 */
for (addr = addrs; addr; addr = addr->ai_next)
{
#ifdef HAVE_UNIX_SOCKETS
***************
*** 210,262 ****
if (addr->ai_family == AF_UNIX)
continue;
#endif
! 		if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) >= 0)
! 			break;
! 	}

! if (!addr || pgStatSock < 0)
! {
! ereport(LOG,
! (errcode_for_socket_access(),
! errmsg("could not create socket for statistics collector: %m")));
! goto startup_failed;
! }

! /*
! * Bind it to a kernel assigned port on localhost and get the assigned
! * port via getsockname().
! */
! if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
! {
! ereport(LOG,
! (errcode_for_socket_access(),
! errmsg("could not bind socket for statistics collector: %m")));
! goto startup_failed;
! }

! freeaddrinfo_all(hints.ai_family, addrs);
! addrs = NULL;

! alen = sizeof(pgStatAddr);
! if (getsockname(pgStatSock, (struct sockaddr *) & pgStatAddr, &alen) < 0)
! {
! ereport(LOG,
! (errcode_for_socket_access(),
! errmsg("could not get address of socket for statistics collector: %m")));
! goto startup_failed;
}

! /*
! * Connect the socket to its own address. This saves a few cycles by
! * not having to respecify the target address on every send. This also
! * provides a kernel-level check that only packets from this same
! * address will be received.
! */
! if (connect(pgStatSock, (struct sockaddr *) & pgStatAddr, alen) < 0)
{
ereport(LOG,
(errcode_for_socket_access(),
! errmsg("could not connect socket for statistics collector: %m")));
goto startup_failed;
}

--- 218,285 ----
if (addr->ai_family == AF_UNIX)
continue;
#endif
! 		/*
! 		 * Create the socket.
! 		 */
! 		if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) < 0)
! 		{
! 			ereport(LOG,
! 					(errcode_for_socket_access(),
! 					 errmsg("could not create socket for statistics collector: %m")));
! 			continue;
! 		}

! /*
! * Bind it to a kernel assigned port on localhost and get the assigned
! * port via getsockname().
! */
! if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
! {
! ereport(LOG,
! (errcode_for_socket_access(),
! errmsg("could not bind socket for statistics collector: %m")));
! closesocket(pgStatSock);
! pgStatSock = -1;
! continue;
! }

! alen = sizeof(pgStatAddr);
! if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
! {
! ereport(LOG,
! (errcode_for_socket_access(),
! errmsg("could not get address of socket for statistics collector: %m")));
! closesocket(pgStatSock);
! pgStatSock = -1;
! continue;
! }

! /*
! * Connect the socket to its own address. This saves a few cycles by
! * not having to respecify the target address on every send. This also
! * provides a kernel-level check that only packets from this same
! * address will be received.
! */
! if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
! {
! ereport(LOG,
! (errcode_for_socket_access(),
! errmsg("could not connect socket for statistics collector: %m")));
! closesocket(pgStatSock);
! pgStatSock = -1;
! continue;
! }

! /* If we get here, we have a working socket */
! break;
}

! /* Did we find a working address? */
! if (!addr || pgStatSock < 0)
{
ereport(LOG,
(errcode_for_socket_access(),
! errmsg("disabling statistics collector for lack of working socket")));
goto startup_failed;
}

***************
*** 284,289 ****
--- 307,314 ----
errmsg("could not create pipe for statistics collector: %m")));
goto startup_failed;
}
+ 
+ 	freeaddrinfo_all(hints.ai_family, addrs);

return;

------------------------------------------------------------------------

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-222-2783 - jd@commandprompt.com - http://www.commandprompt.com
Editor-N-Chief - PostgreSQl.Org - http://www.postgresql.org
#11Derek Morr
dvm105@psu.edu
In reply to: Joshua D. Drake (#10)
Re: [CORE] 7.4RC2 regression failur and not running stats

On Fri, 14 Nov 2003 11:07:43, Josh Berkus wrote:

One thing I've not seen an answer to: does Postgres run acceptably on
other people's Solaris boxes? If this bug is preventing running on
Solaris at all, I'd say fix it ... Solaris is a major platform. If it
only affects users of one particular Solaris patch version, then we do a
big warning and save it for 7.4.1.

I've confirmed this bug on Solaris 2.8 on a V880 and Solaris 2.9 on an e4500,
here at PSU. Neither machine has IPv6 addresses, and neither machine loads the
stats collector.

-derek
Derek Morr
GPG public key: http://marion02.marion.psu.edu/~derek/public.key

#12Glenn Wiorek
gwiorek@jmlafferty.com
In reply to: Derek Morr (#1)
Re: [CORE] 7.4RC2 regression failur and not running stats collector process

Hmm I know it's been a while since I used patch but I seem to be having
problems applying it. Perhaps my patch is outdated??

patch -b pgstat.c < patchfile
Looks like a new-style context diff.
Hunk #2 failed at line 203.
Hunk #2 failed at line 210.
Hunk #3 failed at line 284.
3 out of 3 hunks failed: saving reject to pgstat.c.rej

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Christopher Browne" <cbbrowne@libertyrms.info>
Cc: <pgsql-hackers@postgresql.org>
Sent: Friday, November 14, 2003 2:42 PM
Subject: Re: [HACKERS] [CORE] 7.4RC2 regression failur and not running stats
collector process

Christopher Browne <cbbrowne@libertyrms.info> writes:

For what it's worth, I have been running regression on Solaris with
numerous of the betas, and RC1 and [just now] RC2, with NO problems.

It seems clear that some Solaris installations are affected and some
are not. Presumably there is some version difference or some local
configuration difference ... but since we don't know what the critical
factor is, we have no basis for guessing what fraction of Solaris
installations will see the problem.

(And in that case, I would be quick to test the patch to ensure it
causes no adverse side-effects.)

Here is the proposed patch --- please test it ASAP if you can.
This is against RC2.

regards, tom lane


#13Marc G. Fournier
scrappy@postgresql.org
In reply to: Josh Berkus (#4)
Re: [CORE] 7.4RC2 regression failur and not running stats

On Fri, 14 Nov 2003, Josh Berkus wrote:

Tom,

Too bad we didn't figure this out yesterday. We are now in code freeze
for 7.4 release, and I'm hesitant to apply a fix for what is arguably a
broken platform. Core guys, time for a vote ... do we fix, or hold this
for 7.4.1?

One thing I've not seen an answer to: does Postgres run acceptably on other
people's Solaris boxes? If this bug is preventing running on Solaris at
all, I'd say fix it ... Solaris is a major platform. If it only affects
users of one particular Solaris patch version, then we do a big warning and
save it for 7.4.1.

I agree with Josh on this ...

#14Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Glenn Wiorek (#12)
Re: [CORE] 7.4RC2 regression failur and not running stats

Check that you don't need to use the -p option at all.

Also, make sure you remove any ^M (DOS CR) characters from the line
endings. That always happens to me if I receive the email on a Windows
machine and save the attachment; Windows sometimes likes to rewrite all
the line endings, causing the problem below.

Chris

Glenn Wiorek wrote:

Hmm I know it's been a while since I used patch but I seem to be having
problems applying it. Perhaps my patch is outdated??

patch -b pgstat.c < patchfile
Looks like a new-style context diff.
Hunk #2 failed at line 203.
Hunk #2 failed at line 210.
Hunk #3 failed at line 284.
3 out of 3 hunks failed: saving reject to pgstat.c.rej

#15Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Derek Morr (#11)
Re: [CORE] 7.4RC2 regression failur and not running stats

Derek Morr wrote:

On Fri, 14 Nov 2003 11:07:43, Josh Berkus wrote:

One thing I've not seen an answer to: does Postgres run acceptably on
other people's Solaris boxes? If this bug is preventing running on
Solaris at all, I'd say fix it ... Solaris is a major platform. If it
only affects users of one particular Solaris patch version, then we do a
big warning and save it for 7.4.1.

I've confirmed this bug on Solaris 2.8 on a V880 and Solaris 2.9 on an e4500,
here at PSU. Neither machine has IPv6 addresses, and neither machine loads the
stats collector.

Would you please check CVS HEAD? Tom has just applied the patch to fix
this, and we could use more testers.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073