Stats Collector Error 7.4beta1 and 7.4beta2
I was attempting to get pg_autovacuum to work on my database, and after much hammering at it I discovered that the stats system was not working. I tried it with both 7.4beta1 and 7.4beta2; in both cases the number of tuples inserted, deleted and updated remained at 0 no matter what database activity occurred.
Matthew T. O'Connor looked at my system, recompiled postgres, and checked my config files, but was unable to solve my problem.
I am using a default install of RedHat 9.0 on a VIA Samuel 2 processor. Anyone have any ideas what could be causing my problem? I have had hardware problems in the past, but this is a new machine that I ran memtest86 and badblocks with the destructive tests on before I installed, so I'm pretty sure that's not the issue. Additionally, I made a copy of the hard drive and placed it in another server and saw the same thing happen.
I did not have this problem with 7.3.3 so I think it has something to do with 7.4.
--- Adam Kavan
--- American Amuesments
--- akavan@cox.net
--- 402-499-5145
Adam Kavan <akavan@cox.net> writes:
I was attempting to get pg_autovacuum to work on my database and after
much hammering at it I discovered that the stats system was not
working.
Does 'ps' show that the stats collector and stats buffer postmaster
child processes are alive? Are there any suggestive complaints in
the postmaster's log?
regards, tom lane
On Wed, 2003-09-03 at 23:50, Tom Lane wrote:
Does 'ps' show that the stats collector and stats buffer postmaster
child processes are alive? Are there any suggestive complaints in
the postmaster's log?
As Adam mentioned, I took a look at his system since the initial report
was about a problem with pg_autovacuum. Anyway, yes, ps shows the two
stats collector related processes running, and no, the log files don't
show anything helpful; however, I didn't try to change any logging
settings. Initially I saw an error in the logs about an IPv6 address,
but after I recompiled everything with a simple ./configure
--prefix=/home/user/somethingelse/ I didn't get the IPv6 error in the
logs anymore.
Short answer is I have no idea why this is happening, and I didn't see
any obvious configuration problems that might cause this (make check
passed all tests).
Tom, Adam was able to give me a login to his machine, maybe he would do
the same for you.
Anyway, that is all I was able to see, hence the call for more help on
hackers :-)
Matthew
"Matthew T. O'Connor" <matthew@zeut.net> writes:
... Initially I saw an error in the logs about an IPv6 address
error but after I recompiled everything with a simple ./configure
--prefix=/home/user/somethingelse/ I didn't get the IPv6 error in the
logs anymore.
Hm. Could it be an IPv6 issue --- that is, the stats collector is alive
and faithfully listening on some UDP port, but it's not the same port
the backends try to send to? Given the discussion over the past couple
of days about bizarre interpretations of loopback addresses in
pg_hba.conf, I could sure believe there's some similar kind of issue for
the stats collector.
regards, tom lane
On Thu, 2003-09-04 at 01:23, Tom Lane wrote:
Hm. Could it be an IPv6 issue --- that is, the stats collector is alive
and faithfully listening on some UDP port, but it's not the same port
the backends try to send to? Given the discussion over the past couple
of days about bizarre interpretations of loopback addresses in
pg_hba.conf, I could sure believe there's some similar kind of issue for
the stats collector.
I had a similar thought, but I have no idea how I would verify this.
The thing is, when I recompiled postgresql myself, I left pg_hba.conf at
default settings, and it's running on RH9, which I am running and have
not had a problem with...
Tom Lane wrote:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
... Initially I saw an error in the logs about an IPv6 address
error but after I recompiled everything with a simple ./configure
--prefix=/home/user/somethingelse/ I didn't get the IPv6 error in the
logs anymore.
Hm. Could it be an IPv6 issue --- that is, the stats collector is alive
and faithfully listening on some UDP port, but it's not the same port
the backends try to send to? Given the discussion over the past couple
of days about bizarre interpretations of loopback addresses in
pg_hba.conf, I could sure believe there's some similar kind of issue for
the stats collector.
Doesn't the stats collector use unix domain sockets, not IP?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Doesn't the stats collector use unix domain sockets, not IP?
No. IIRC, we deliberately chose IP/UDP because it had buffering
behavior we liked.
There are pipes involved in the stats stuff too, but the weak link
in my mind is the backend-to-stats-buffer-process hop, which is UDP.
regards, tom lane
On Thu, 4 Sep 2003, Bruce Momjian wrote:
Tom Lane wrote:
"Matthew T. O'Connor" <matthew@zeut.net> writes:
... Initially I saw an error in the logs about an IPv6 address
error but after I recompiled everything with a simple ./configure
--prefix=/home/user/somethingelse/ I didn't get the IPv6 error in the
logs anymore.
Hm. Could it be an IPv6 issue --- that is, the stats collector is alive
and faithfully listening on some UDP port, but it's not the same port
the backends try to send to? Given the discussion over the past couple
of days about bizarre interpretations of loopback addresses in
pg_hba.conf, I could sure believe there's some similar kind of issue for
the stats collector.
Doesn't the stats collector use unix domain sockets, not IP?
Nup.
for (addr = addrs; addr; addr = addr->ai_next)
{
#ifdef HAVE_UNIX_SOCKETS
/* Ignore AF_UNIX sockets, if any are returned. */
if (addr->ai_family == AF_UNIX)
continue;
#endif
if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) >= 0)
break;
}
One thing I haven't seen asked: is there a packet filter blocking
local<->local UDP traffic by any chance?
Thanks,
Gavin
One thing I haven't seen asked: is there a packet filter blocking
local<->local UDP traffic by any chance?
Iptables is set to accept everything. If it would help I can give you all
log in information to poke around yourselves. I appreciate your help.
--- Adam Kavan
--- akavan@cox.net
On Thu, Sep 04, 2003 at 01:39:04AM -0400, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Doesn't the stats collector use unix domain sockets, not IP?
No. IIRC, we deliberately chose IP/UDP because it had buffering
behavior we liked.
Once you said it was because not all platforms have unix domain
sockets. I asked why we weren't using something like
socketpair().
Kurt
Kurt Roeckx wrote:
On Thu, Sep 04, 2003 at 01:39:04AM -0400, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Doesn't the stats collector use unix domain sockets, not IP?
No. IIRC, we deliberately chose IP/UDP because it had buffering
behavior we liked.
Once you said it was because not all platforms have unix domain
sockets. I asked why we weren't using something like
socketpair().
The reason to use INET UDP is that this is the only connection type that
simply drops packets if the stupid collector daemon isn't able to keep
up with the traffic. Think of a 64 processor SMP machine where 60
backends utilize their own CPU and the poor little collector gets
buried in packets; you don't want it to slow down the whole system, do you?
And I agree with Tom that it is very likely that the IPV4/IPV6 stuff is
the reason. IIRC the postmaster creates the socket and no one ever does
bind(2) on it - so it uses its dynamically assigned port number. Both
the collector and the backends inherit that socket via fork(2). The
backends use this socket with its own sockname to send the stats out,
and the collector reads it with recvfrom(2) and verifies that the from
address is identical to its sockname ... that way no one can inject
faked stat packets. Now this is a lot of sockname usage that could lead
to either the packets not arriving in the collector, or being thrown
away by the collector because of failing to see them coming from itself.
Jan
Kurt
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
On Thu, Sep 04, 2003 at 04:04:38PM -0400, Jan Wieck wrote:
And I agree with Tom that it is very likely that the IPV4/IPV6 stuff is
the reason. IIRC the postmaster creates the socket and no one ever does
bind(2) on it - so it uses its dynamically assigned port number. Both
the collector and the backends inherit that socket via fork(2).
Actually, it does a bind (to localhost), but sets the port to 0,
so it gets a random port.
Then it connects to itself. I don't get the logic behind that,
however.
It does:
pgStatSock = socket(...);
bind(pgStatSock, ...);
getsockname(pgStatSock, ...);
connect(pgStatSock, ...);
So it creates a socket, binds to it, asks what address/port it's
bound to, and connects to that port.
I don't see the logic behind that connect(), how it can work, and
how it would block anybody from sending to it, but it seems to
work.
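The four calls listed above can be tried outside the server. What follows is a minimal sketch, not code from pgstat.c: the helper names open_self_connected_udp() and loopback_roundtrip() are made up for illustration, and it assumes IPv4 loopback is available. On a datagram socket, connect() only records a default peer; the kernel then delivers to that socket only datagrams whose source matches the peer, which is why connecting a socket to its own address filters out other senders.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: create a UDP socket bound to 127.0.0.1 on a
 * kernel-chosen port and connect it to its own address.  Returns the
 * descriptor, or -1 on failure; *self receives the bound address.
 */
int
open_self_connected_udp(struct sockaddr_in *self)
{
    socklen_t   len = sizeof(*self);
    int         fd;

    assert(self != NULL);

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(self, 0, sizeof(*self));
    self->sin_family = AF_INET;
    self->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    self->sin_port = htons(0);  /* port 0: let the kernel pick one */

    if (bind(fd, (struct sockaddr *) self, sizeof(*self)) < 0 ||
        getsockname(fd, (struct sockaddr *) self, &len) < 0 ||
        connect(fd, (struct sockaddr *) self, sizeof(*self)) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Send one datagram through the socket and read it back.  Returns 0 if
 * the same bytes arrive and the reported source port is the socket's own.
 */
int
loopback_roundtrip(void)
{
    struct sockaddr_in self;
    struct sockaddr_in from;
    socklen_t   fromlen = sizeof(from);
    char        buf[16];
    int         fd = open_self_connected_udp(&self);
    int         ok;

    if (fd < 0)
        return -1;

    ok = send(fd, "stats", 5, 0) == 5 &&
        recvfrom(fd, buf, sizeof(buf), 0,
                 (struct sockaddr *) &from, &fromlen) == 5 &&
        memcmp(buf, "stats", 5) == 0 &&
        from.sin_port == self.sin_port;

    close(fd);
    return ok ? 0 : -1;
}
```

Note that the sketch compares only the port of the returned source address, not the whole structure.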
The
backends use this socket with its own sockname to send the stats out,
and the collector reads it with recvfrom(2) and verifies that the from
address is identical to its sockname ... that way no one can inject
faked stat packets. Now this is a lot of sockname usage that could lead
to either the packets not arriving in the collector, or being thrown
away by the collector because of failing to see them coming from itself.
I'm trying to think of some kernel bug that sends packets
using the wrong source address ..., but I think that when
connecting to a local address it always showed the loopback
address.
It could be useful to have a warning at the following line:
if (memcmp(&fromaddr, &pgStatAddr, fromlen))
continue;
That way you can rule out that that is a problem.
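A rough sketch of what such a warning could look like, written as standalone C rather than as a patch to pgstat.c. The names dump_hex() and check_source_addr() are hypothetical, and a real patch would presumably go through the server's own logging rather than a FILE pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print len bytes of buf as hex, prefixed by label. */
void
dump_hex(FILE *out, const char *label, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t      i;

    assert(out != NULL && label != NULL && buf != NULL);

    fprintf(out, "%s:", label);
    for (i = 0; i < len; i++)
        fprintf(out, " %02x", p[i]);
    fputc('\n', out);
}

/*
 * Same byte-for-byte test as the memcmp() above, but instead of silently
 * skipping the packet it reports both addresses.  Returns 1 if the
 * addresses match, 0 if the packet would be discarded.
 */
int
check_source_addr(FILE *out, const void *fromaddr, const void *ownaddr,
                  size_t len)
{
    if (memcmp(fromaddr, ownaddr, len) == 0)
        return 1;

    fprintf(out, "WARNING: stats packet source does not match own address\n");
    dump_hex(out, "from", fromaddr, len);
    dump_hex(out, "own ", ownaddr, len);
    return 0;
}
```

Printing the raw bytes of both structures makes it possible to tell a genuinely foreign sender from a difference in bytes that carry no address information.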
Anyway, I still didn't see the error message he got in the first
place. Maybe we're looking at the wrong thing?
Kurt
Kurt Roeckx wrote:
It could be useful to have a warning at the following line:
if (memcmp(&fromaddr, &pgStatAddr, fromlen))
continue;
That way you can rule out that that is a problem.
Anyway, I still didn't see the error message he got in the first
place. Maybe we're looking at the wrong thing?
I think it's more this piece of code in postmaster/pgstat.c
/*
* The source address of the packet must be our own socket.
* This ensures that only real hackers or our own backends
* tell us something. (This should be redundant with a
* kernel-level check due to having used connect(), but let's
* do it anyway.)
*/
if (memcmp(&fromaddr, &pgStatAddr, fromlen))
continue;
Jan
On Thu, Sep 04, 2003 at 05:01:54PM -0400, Jan Wieck wrote:
Kurt Roeckx wrote:
It could be useful to have a warning at the following line:
if (memcmp(&fromaddr, &pgStatAddr, fromlen))
continue;
That way you can rule out that that is a problem.
Anyway, I still didn't see the error message he got in the first
place. Maybe we're looking at the wrong thing?
I think it's more this piece of code in postmaster/pgstat.c
And what do you think I pasted?
Kurt
It could be useful to have a warning at the following line:
if (memcmp(&fromaddr, &pgStatAddr, fromlen))
continue;
That way you can rule out that that is a problem.
Anyway, I still didn't see the error message he got in the first
place. Maybe we're looking at the wrong thing?
Kurt
This is the very line that is giving me problems. I commented it out and
recompiled and now the stats system works. Of course I have to assume that
it's bad to go around without that check...
--- Adam Kavan
--- akavan@cox.net
Adam Kavan <akavan@cox.net> writes:
if (memcmp(&fromaddr, &pgStatAddr, fromlen))
continue;
This is the very line that is giving me problems. I commented it out and
recompiled and now the stats system works. Of course I have to assume that
it's bad to go around without that check...
Hmm. Could you look and see what the actual values are in each address?
regards, tom lane
At 06:49 PM 9/4/03 -0400, Tom Lane wrote:
Hmm. Could you look and see what the actual values are in each address?
regards, tom lane
I don't really know the layout of these structures so I dumped them to a
file and attached them. The first 16 bytes is from fromaddr and the second
is from pgStatAddr.
--- Adam Kavan
--- akavan@cox.net
Kurt Roeckx wrote:
On Thu, Sep 04, 2003 at 05:01:54PM -0400, Jan Wieck wrote:
Kurt Roeckx wrote:
It could be useful to have a warning at the following line:
if (memcmp(&fromaddr, &pgStatAddr, fromlen))
continue;
That way you can rule out that that is a problem.
Anyway, I still didn't see the error message he got in the first
place. Maybe we're looking at the wrong thing?
I think it's more this piece of code in postmaster/pgstat.c
And what do you think I pasted?
Hmmm ... good question ... How can I know what I think before I read
what I write?
Jan :-)
Kurt
They are both structures of type sockaddr_in (sin_family 2 is AF_INET
whereas sin_family 10 would've been AF_INET6), and all relevant fields
of the structure look the same to me. The problem lies in the padding
bytes that make sockaddr_in the same size as sockaddr.
Since the static structure pgStatAddr is supposed to be initialized to
nul bytes by the compiler and now does not contain those in the padding
area, my guess would be that getsockname() is actually writing garbage
into that padding area. This is a nasty change, as one cannot compare
two addresses for equality with memcmp() any more just because of
sloppy programming in the IP stack.
Well, the correct fix would be to compare only the relevant parts of the
addresses, depending on the address family type.
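A minimal sketch of such a family-aware comparison, assuming only AF_INET and AF_INET6 need to be handled. The function name sockaddr_same() is hypothetical and this is not the fix that went into pgstat.c; it only illustrates comparing the meaningful fields and ignoring the padding.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: return 1 if a and b name the same endpoint,
 * looking only at family, port and address.  Padding bytes such as
 * sin_zero are never inspected.  The IPv6 branch ignores sin6_flowinfo
 * and sin6_scope_id, which is a simplification made for this sketch.
 */
int
sockaddr_same(const struct sockaddr *a, const struct sockaddr *b)
{
    assert(a != NULL && b != NULL);

    if (a->sa_family != b->sa_family)
        return 0;

    switch (a->sa_family)
    {
        case AF_INET:
        {
            const struct sockaddr_in *a4 = (const struct sockaddr_in *) a;
            const struct sockaddr_in *b4 = (const struct sockaddr_in *) b;

            return a4->sin_port == b4->sin_port &&
                a4->sin_addr.s_addr == b4->sin_addr.s_addr;
        }
        case AF_INET6:
        {
            const struct sockaddr_in6 *a6 = (const struct sockaddr_in6 *) a;
            const struct sockaddr_in6 *b6 = (const struct sockaddr_in6 *) b;

            return a6->sin6_port == b6->sin6_port &&
                memcmp(&a6->sin6_addr, &b6->sin6_addr,
                       sizeof(struct in6_addr)) == 0;
        }
        default:
            /* Unknown family: treat as not equal rather than guess. */
            return 0;
    }
}
```

With this kind of check, two sockaddr_in values that differ only in their padding compare equal, while a different port or address is still rejected.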
I personally wouldn't worry too much about removing the check entirely.
If you've got a hacker wasting his time and bandwidth with screwing up your
statistics collector daemon by sending faked UDP packets to some guessed
port number (it's only visible in the netstat output on your local
machine), I think he's done with all the rest of his TODO for the day
and you'll soon face other problems than that.
Jan
Adam Kavan wrote:
At 06:49 PM 9/4/03 -0400, Tom Lane wrote:
Hmm. Could you look and see what the actual values are in each address?
regards, tom lane
I don't really know the layout of these structures so I dumped them to a
file and attached them. The first 16 bytes is from fromaddr and the second
is from pgStatAddr.
--- Adam Kavan
--- akavan@cox.net
Adam Kavan <akavan@cox.net> writes:
I don't really know the layout of these structures so I dumped them to a
file and attached them. The first 16 bytes is from fromaddr and the second
is from pgStatAddr.
More legibly:
0000000 0200 8016 7f00 0001 0000 0000 0000 0000
0000010 0200 8016 7f00 0001 0000 0000 f001 0000
The 7f000001 is the IP loopback address, sure enough. I wonder what the
f001 (or it might be little-endian 01f0) is.
regards, tom lane
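The two rows can be decoded field by field. The sketch below assumes that the dump lists bytes in memory order (which fits reading 7f00 0001 as 127.0.0.1) and that the machine uses the usual Linux/x86 sockaddr_in layout: a 2-byte little-endian family, a 2-byte port in network order, a 4-byte address, then 8 padding bytes. DecodedAddr and decode_sockaddr_in() are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef struct DecodedAddr
{
    unsigned int  family;        /* 2 would be AF_INET on Linux */
    unsigned int  port;          /* host-order port number */
    unsigned char ip[4];         /* dotted-quad octets */
    int           padding_clean; /* 1 if bytes 8..15 are all zero */
} DecodedAddr;

/*
 * Decode 16 raw bytes as a sockaddr_in under the layout assumptions
 * stated above.  Nothing here touches the real struct definition, so
 * the result does not depend on the machine the sketch runs on.
 */
DecodedAddr
decode_sockaddr_in(const unsigned char *raw)
{
    DecodedAddr d;
    size_t      i;

    assert(raw != NULL);

    d.family = raw[0] | ((unsigned int) raw[1] << 8);   /* little-endian */
    d.port = ((unsigned int) raw[2] << 8) | raw[3];     /* network order */
    for (i = 0; i < 4; i++)
        d.ip[i] = raw[4 + i];

    d.padding_clean = 1;
    for (i = 8; i < 16; i++)
    {
        if (raw[i] != 0)
            d.padding_clean = 0;
    }
    return d;
}
```

Under those assumptions both rows decode to family 2, port 32790 (0x8016) and address 127.0.0.1. They differ only at byte offsets 12 and 13, which fall inside the padding, so the stray f001 carries no address information.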