Stats collector frozen?
I've noticed that my tables are not being auto vacuumed or analyzed
regularly, even though I have very aggressive autovacuum settings.
The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.
I see error such as these in the log every now and then - not sure if
they are related. These have been discussed at length in other posts
and seems to have something to do with PG holding onto old file handles
(Windows specific):
2007-01-24 06:24:16 ERROR: could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:17 ERROR: could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:18 ERROR: could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:19 ERROR: could not open relation 1663/16404/333779:
Permission denied
I'm running PG 8.2.1 on Windows. Here is some of the output from "show
all":
"autovacuum";"on"
"autovacuum_analyze_scale_factor";"0.02"
"autovacuum_analyze_threshold";"250"
"autovacuum_freeze_max_age";"200000000"
"autovacuum_naptime";"1min"
"autovacuum_vacuum_cost_delay";"-1"
"autovacuum_vacuum_cost_limit";"-1"
"autovacuum_vacuum_scale_factor";"0.08"
"autovacuum_vacuum_threshold";"500"
"stats_block_level";"on"
"stats_command_string";"on"
"stats_reset_on_server_start";"off"
"stats_row_level";"on"
"stats_start_collector";"on"
"vacuum_cost_delay";"20ms"
"vacuum_cost_limit";"200"
"vacuum_cost_page_dirty";"20"
"vacuum_cost_page_hit";"1"
"vacuum_cost_page_miss";"10"
"vacuum_freeze_min_age";"100000000"
"Jeremy Haile" <jhaile@fastmail.fm> writes:
The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.
Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?
regards, tom lane
pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.
Coincidentally (I think not) - the last auto-analyze was performed at
2007-01-22 12:24:11.424-05.
The logs for 1/22 are empty - so no errors or anything like that to give
clues...
Thanks!
Jeremy Haile
Show quoted text
On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?regards, tom lane
Searching the archives, I found a couple of 2006 posts that seem
somewhat related to my problem (although I don't see any solutions
listed...):
http://archives.postgresql.org/pgsql-hackers/2006-09/msg00415.php
http://archives.postgresql.org/pgsql-hackers/2006-04/msg00127.php
Tom, since you were involved in these - did you ever figure out how to
resolve the issues of the stats collector getting stuck in Windows?
Thanks, Jeremy Haile
On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:
Show quoted text
pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.
Coincidentally (I think not) - the last auto-analyze was performed at
2007-01-22 12:24:11.424-05.The logs for 1/22 are empty - so no errors or anything like that to give
clues...Thanks!
Jeremy HaileOn Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
Tom,
Did this information shed any light on what the problem might be? Any
solution or workaround?
Thanks!
Jeremy Haile
On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:
Show quoted text
pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.
Coincidentally (I think not) - the last auto-analyze was performed at
2007-01-22 12:24:11.424-05.The logs for 1/22 are empty - so no errors or anything like that to give
clues...Thanks!
Jeremy HaileOn Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
"Jeremy Haile" <jhaile@fastmail.fm> writes:
Did this information shed any light on what the problem might be?
It seems to buttress Magnus' theory that the intermittent (or not so
intermittent) stats-test buildfarm failures we've been seeing have to
do with the stats collector actually freezing up, rather than just
not reacting fast enough as most of us (or me anyway) thought. But
why that is happening remains anyone's guess. I don't suppose you
have debugging tools that would let you get a stack trace from the
collector process?
regards, tom lane
Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).
Please let me know if there is anything else I can do to help debug this
problem.
Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?
Thanks,
Jeremy Haile
Show quoted text
On Thu, 25 Jan 2007 12:42:11 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
Did this information shed any light on what the problem might be?
It seems to buttress Magnus' theory that the intermittent (or not so
intermittent) stats-test buildfarm failures we've been seeing have to
do with the stats collector actually freezing up, rather than just
not reacting fast enough as most of us (or me anyway) thought. But
why that is happening remains anyone's guess. I don't suppose you
have debugging tools that would let you get a stack trace from the
collector process?regards, tom lane
"Jeremy Haile" <jhaile@fastmail.fm> writes:
Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).
Sorry, I don't know anything about Windows debugging either. Can you
put together a test case that would let one of the Windows-using hackers
reproduce it?
Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?
On Unix you can just 'kill -TERM' the collector process and the postmaster
will start a new one without engaging in a database panic cycle. Dunno
what the equivalent is on Windows but it's probably possible.
regards, tom lane
Tom Lane wrote:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).Sorry, I don't know anything about Windows debugging either. Can you
put together a test case that would let one of the Windows-using hackers
reproduce it?
That would help a lot.
Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?On Unix you can just 'kill -TERM' the collector process and the postmaster
will start a new one without engaging in a database panic cycle. Dunno
what the equivalent is on Windows but it's probably possible.
You can use pg_ctl to send the int signal. If it's completely hung, that
may not work. In that case you can kill it from task manager, but that's
equiv of a kill -9, which means that the postmaster will restart all
backends.
//Magnus
Magnus Hagander <magnus@hagander.net> writes:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?
You can use pg_ctl to send the int signal. If it's completely hung, that
may not work. In that case you can kill it from task manager, but that's
equiv of a kill -9, which means that the postmaster will restart all
backends.
No, the postmaster does not care how badly the stats collector croaks,
because the collector's not connected to shared memory, so there's no
risk of collateral damage. It'll just start a new one without bothering
the backends. So you can do whatever you have to do to kill the stuck
collector --- I was just not sure how to do that under Windows.
regards, tom lane
Tom Lane wrote:
Magnus Hagander <magnus@hagander.net> writes:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?You can use pg_ctl to send the int signal. If it's completely hung, that
may not work. In that case you can kill it from task manager, but that's
equiv of a kill -9, which means that the postmaster will restart all
backends.No, the postmaster does not care how badly the stats collector croaks,
because the collector's not connected to shared memory, so there's no
risk of collateral damage. It'll just start a new one without bothering
the backends. So you can do whatever you have to do to kill the stuck
collector --- I was just not sure how to do that under Windows.
Oh, ok.
Then just pick it up in Task Manager or Process Explorer or whatever and
kill it off. Just make sure you pick the right process.
//Magnus
Then just pick it up in Task Manager or Process Explorer or whatever and
kill it off. Just make sure you pick the right process.
I mentioned earlier that killing off the collector didn't work - however
I was wrong. I just wasn't giving it enough time. If I kill the
"postgres.exe -forkcol" process, it does gets restarted, although
sometimes it takes a minute.
Since it only seems to update pgstat.stat once after restarting, I'd
need to kill it once-a-minute to keep my statistics up to date =) So,
unfortunately it's not a great workaround to my problem.
Jeremy Haile
I'll try to put together a test case for hackers, although I'm not sure
what exactly causes it.
Basically, when I fire up PostgreSQL - after about a minute the stats
collector runs once (pgstat.stat is updated, autovacuum fires up, etc.)
- and then the collector seems to hang. If I watch it's performance
information, it does not read or write to disk again and pgstat.stat is
never updated again. It never updates pgstat.stat more than once after
restart. There are no errors in the log
I tried killing the collector a variety of ways on Windows, but it seems
to terminate indefinitely. I don't see a kill program for windows that
lets me specify the signal to use. So other than restarting PostgreSQL,
I'm not sure how to workaround this problem.
If anyone else is experiencing similar problems, please post your
situation.
Show quoted text
On Thu, 25 Jan 2007 12:51:31 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:
"Jeremy Haile" <jhaile@fastmail.fm> writes:
Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).Sorry, I don't know anything about Windows debugging either. Can you
put together a test case that would let one of the Windows-using hackers
reproduce it?Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?On Unix you can just 'kill -TERM' the collector process and the
postmaster
will start a new one without engaging in a database panic cycle. Dunno
what the equivalent is on Windows but it's probably possible.regards, tom lane
Jeremy Haile wrote:
I'll try to put together a test case for hackers, although I'm not sure
what exactly causes it.Basically, when I fire up PostgreSQL - after about a minute the stats
collector runs once (pgstat.stat is updated, autovacuum fires up, etc.)
- and then the collector seems to hang. If I watch it's performance
information, it does not read or write to disk again and pgstat.stat is
never updated again. It never updates pgstat.stat more than once after
restart. There are no errors in the logI tried killing the collector a variety of ways on Windows, but it seems
to terminate indefinitely. I don't see a kill program for windows that
lets me specify the signal to use. So other than restarting PostgreSQL,
I'm not sure how to workaround this problem.If anyone else is experiencing similar problems, please post your
situation.
All the Windows buildfarm machines are, apparently.
We verified this with Magnus. He found that the tenk2 table does not
seem to get stat updates -- the numbers are all zero, at all times. I
thought I had blogged about this ... oh yeah, it's here
http://www.advogato.org/person/alvherre/diary.html?start=11
AFAIR (Magnus can surely confirm) there were some other tables that
weren't showing stats as all zeros -- but there's no way to know whether
those numbers were put there before the collector had "frozen" (if
that's really what's happening).
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
Jeremy Haile wrote:
If anyone else is experiencing similar problems, please post your
situation.
All the Windows buildfarm machines are, apparently.
Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.
regards, tom lane
AFAIR (Magnus can surely confirm) there were some other tables that
weren't showing stats as all zeros -- but there's no way to know whether
those numbers were put there before the collector had "frozen" (if
that's really what's happening).
Yeah - I have numbers that updated before the stats collector started
freezing. Do you know which version of PG this started with? I have
upgraded 8.1.3, 8.1.4, 8.2, and 8.2.1 in the past months and I didn't
have the collector enabled until 8.2.1 - so I'm not sure how long this
has been a problem.
I might try rolling back to a previous version - it's either that or
setup a scheduled vacuum analyze until we figure out this problem. I'm
having to manually run it every day now... =) I think this is a pretty
critical problem since it cripples autovacuum on Windows.
Are you guys in a position to debug the collector process and see where
it is freezing (ala Tom's earlier comment)? Anything I can do to help
debug this problem faster?
Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Jeremy Haile wrote:
If anyone else is experiencing similar problems, please post your
situation.All the Windows buildfarm machines are, apparently.
Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.
Must've been asleep when reading and writing in this thread. Didn't
realize it was the same issue as the buildfarm-killer. Will do the
debugger+stacktrace tomorrow on my VC install.
//Magnus
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Jeremy Haile wrote:
If anyone else is experiencing similar problems, please post your
situation.All the Windows buildfarm machines are, apparently.
Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.
Done some checking here. What happens is that suddenly the pgstats
socket stops receiving data. select() (pgstat.c line 1802) returns after
timeout, so got_data is always zero.
Interesting note: I just ran the serial regression tests, and they pass
fine. With the parallel tests, it always stops receiving data somewhere
during the first parallel group.
//Magnus
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Jeremy Haile wrote:
If anyone else is experiencing similar problems, please post your
situation.All the Windows buildfarm machines are, apparently.
Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.
Another update - what broke it was not the enabling of autovacuum, it
was the enabling of row level stats. If I disable stats_row_level,
parallel tests pass again.
That doesn't actually tell us *why* it's broken, I think, but it does
tell us why the autovac patch caused it.
I don't have 8.1 or 8.2 around on win32, but it'd be interesting to see if the
same issue happens if you run the tests on that with stats_row_level
enabled. Most likely the same thing happens.
//Magnus
On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Jeremy Haile wrote:
If anyone else is experiencing similar problems, please post your
situation.All the Windows buildfarm machines are, apparently.
Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.
Maybe I should finish testing before I send my emails.
Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
if I put a #undef select right before the select in pgstat.c, the
regression tests pass.
I guess the bug is shown because with row level stats we simply have
more data to process. And it appears only to happen on UDP sockets from
what I can tell.
Now, what Iwould *like* to do is to re-implement that part of the code
using the Win32 APIs instead of going through select(). Since it's very
isolated code. I'm going to try that and see how invasive it is, then
see if it'll get accepted :-)
(This would of course give us better performance in general in that
codepath, since all the emulation stuff wouldn't be needed, so there's a
point to doing that other than finding the obscure UDP-related bug in
pgwin23_select)
//Magnus