Stats collector frozen?

Started by Jeremy Haileabout 19 years ago38 messagesgeneral
Jump to latest
#1Jeremy Haile
jhaile@fastmail.fm

I've noticed that my tables are not being auto vacuumed or analyzed
regularly, even though I have very aggressive autovacuum settings.

The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.

I see error such as these in the log every now and then - not sure if
they are related. These have been discussed at length in other posts
and seems to have something to do with PG holding onto old file handles
(Windows specific):
2007-01-24 06:24:16 ERROR: could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:17 ERROR: could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:18 ERROR: could not open relation 1663/16404/333779:
Permission denied
2007-01-24 06:24:19 ERROR: could not open relation 1663/16404/333779:
Permission denied

I'm running PG 8.2.1 on Windows. Here is some of the output from "show
all":
"autovacuum";"on"
"autovacuum_analyze_scale_factor";"0.02"
"autovacuum_analyze_threshold";"250"
"autovacuum_freeze_max_age";"200000000"
"autovacuum_naptime";"1min"
"autovacuum_vacuum_cost_delay";"-1"
"autovacuum_vacuum_cost_limit";"-1"
"autovacuum_vacuum_scale_factor";"0.08"
"autovacuum_vacuum_threshold";"500"
"stats_block_level";"on"
"stats_command_string";"on"
"stats_reset_on_server_start";"off"
"stats_row_level";"on"
"stats_start_collector";"on"
"vacuum_cost_delay";"20ms"
"vacuum_cost_limit";"200"
"vacuum_cost_page_dirty";"20"
"vacuum_cost_page_hit";"1"
"vacuum_cost_page_miss";"10"
"vacuum_freeze_min_age";"100000000"

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeremy Haile (#1)
Re: Stats collector frozen?

"Jeremy Haile" <jhaile@fastmail.fm> writes:

The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.

Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?

regards, tom lane

#3Jeremy Haile
jhaile@fastmail.fm
In reply to: Tom Lane (#2)
Re: Stats collector frozen?

pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.

Coincidentally (I think not) - the last auto-analyze was performed at
2007-01-22 12:24:11.424-05.

The logs for 1/22 are empty - so no errors or anything like that to give
clues...

Thanks!
Jeremy Haile

Show quoted text

On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.

Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?

regards, tom lane

#4Jeremy Haile
jhaile@fastmail.fm
In reply to: Jeremy Haile (#3)
Re: Stats collector frozen?

Searching the archives, I found a couple of 2006 posts that seem
somewhat related to my problem (although I don't see any solutions
listed...):
http://archives.postgresql.org/pgsql-hackers/2006-09/msg00415.php
http://archives.postgresql.org/pgsql-hackers/2006-04/msg00127.php

Tom, since you were involved in these - did you ever figure out how to
resolve the issues of the stats collector getting stuck in Windows?

Thanks, Jeremy Haile

On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:

Show quoted text

pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.

Coincidentally (I think not) - the last auto-analyze was performed at
2007-01-22 12:24:11.424-05.

The logs for 1/22 are empty - so no errors or anything like that to give
clues...

Thanks!
Jeremy Haile

On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.

Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

#5Jeremy Haile
jhaile@fastmail.fm
In reply to: Jeremy Haile (#3)
Re: Stats collector frozen?

Tom,

Did this information shed any light on what the problem might be? Any
solution or workaround?

Thanks!
Jeremy Haile

On Wed, 24 Jan 2007 14:19:05 -0500, "Jeremy Haile" <jhaile@fastmail.fm>
said:

Show quoted text

pgstat.stat was last updated 1/22 12:25pm - there is no pgstat.tmp.

Coincidentally (I think not) - the last auto-analyze was performed at
2007-01-22 12:24:11.424-05.

The logs for 1/22 are empty - so no errors or anything like that to give
clues...

Thanks!
Jeremy Haile

On Wed, 24 Jan 2007 14:00:52 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

The stats collector appears to still be running, since I can see a
postgres.exe process with -forkcol. However, I never notice it using
I/O or CPU usage. Also, querying the pg_stat_user_tables view shows no
change in table stats even for tables that change very frequently.

Is $PGDATA/global/pgstat.stat getting updated? (watch the file mod time
at least) Do you see any pgstat.tmp file in there?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeremy Haile (#5)
Re: Stats collector frozen?

"Jeremy Haile" <jhaile@fastmail.fm> writes:

Did this information shed any light on what the problem might be?

It seems to buttress Magnus' theory that the intermittent (or not so
intermittent) stats-test buildfarm failures we've been seeing have to
do with the stats collector actually freezing up, rather than just
not reacting fast enough as most of us (or me anyway) thought. But
why that is happening remains anyone's guess. I don't suppose you
have debugging tools that would let you get a stack trace from the
collector process?

regards, tom lane

#7Jeremy Haile
jhaile@fastmail.fm
In reply to: Tom Lane (#6)
Re: Stats collector frozen?

Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).

Please let me know if there is anything else I can do to help debug this
problem.

Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?

Thanks,
Jeremy Haile

Show quoted text

On Thu, 25 Jan 2007 12:42:11 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

Did this information shed any light on what the problem might be?

It seems to buttress Magnus' theory that the intermittent (or not so
intermittent) stats-test buildfarm failures we've been seeing have to
do with the stats collector actually freezing up, rather than just
not reacting fast enough as most of us (or me anyway) thought. But
why that is happening remains anyone's guess. I don't suppose you
have debugging tools that would let you get a stack trace from the
collector process?

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeremy Haile (#7)
Re: Stats collector frozen?

"Jeremy Haile" <jhaile@fastmail.fm> writes:

Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).

Sorry, I don't know anything about Windows debugging either. Can you
put together a test case that would let one of the Windows-using hackers
reproduce it?

Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?

On Unix you can just 'kill -TERM' the collector process and the postmaster
will start a new one without engaging in a database panic cycle. Dunno
what the equivalent is on Windows but it's probably possible.

regards, tom lane

#9Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#8)
Re: Stats collector frozen?

Tom Lane wrote:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).

Sorry, I don't know anything about Windows debugging either. Can you
put together a test case that would let one of the Windows-using hackers
reproduce it?

That would help a lot.

Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?

On Unix you can just 'kill -TERM' the collector process and the postmaster
will start a new one without engaging in a database panic cycle. Dunno
what the equivalent is on Windows but it's probably possible.

You can use pg_ctl to send the int signal. If it's completely hung, that
may not work. In that case you can kill it from task manager, but that's
equiv of a kill -9, which means that the postmaster will restart all
backends.

//Magnus

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#9)
Re: Stats collector frozen?

Magnus Hagander <magnus@hagander.net> writes:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?

You can use pg_ctl to send the int signal. If it's completely hung, that
may not work. In that case you can kill it from task manager, but that's
equiv of a kill -9, which means that the postmaster will restart all
backends.

No, the postmaster does not care how badly the stats collector croaks,
because the collector's not connected to shared memory, so there's no
risk of collateral damage. It'll just start a new one without bothering
the backends. So you can do whatever you have to do to kill the stuck
collector --- I was just not sure how to do that under Windows.

regards, tom lane

#11Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#10)
Re: Stats collector frozen?

Tom Lane wrote:

Magnus Hagander <magnus@hagander.net> writes:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?

You can use pg_ctl to send the int signal. If it's completely hung, that
may not work. In that case you can kill it from task manager, but that's
equiv of a kill -9, which means that the postmaster will restart all
backends.

No, the postmaster does not care how badly the stats collector croaks,
because the collector's not connected to shared memory, so there's no
risk of collateral damage. It'll just start a new one without bothering
the backends. So you can do whatever you have to do to kill the stuck
collector --- I was just not sure how to do that under Windows.

Oh, ok.
Then just pick it up in Task Manager or Process Explorer or whatever and
kill it off. Just make sure you pick the right process.

//Magnus

#12Jeremy Haile
jhaile@fastmail.fm
In reply to: Magnus Hagander (#11)
Re: Stats collector frozen?

Then just pick it up in Task Manager or Process Explorer or whatever and
kill it off. Just make sure you pick the right process.

I mentioned earlier that killing off the collector didn't work - however
I was wrong. I just wasn't giving it enough time. If I kill the
"postgres.exe -forkcol" process, it does gets restarted, although
sometimes it takes a minute.

Since it only seems to update pgstat.stat once after restarting, I'd
need to kill it once-a-minute to keep my statistics up to date =) So,
unfortunately it's not a great workaround to my problem.

Jeremy Haile

#13Jeremy Haile
jhaile@fastmail.fm
In reply to: Tom Lane (#8)
Re: Stats collector frozen?

I'll try to put together a test case for hackers, although I'm not sure
what exactly causes it.

Basically, when I fire up PostgreSQL - after about a minute the stats
collector runs once (pgstat.stat is updated, autovacuum fires up, etc.)
- and then the collector seems to hang. If I watch it's performance
information, it does not read or write to disk again and pgstat.stat is
never updated again. It never updates pgstat.stat more than once after
restart. There are no errors in the log

I tried killing the collector a variety of ways on Windows, but it seems
to terminate indefinitely. I don't see a kill program for windows that
lets me specify the signal to use. So other than restarting PostgreSQL,
I'm not sure how to workaround this problem.

If anyone else is experiencing similar problems, please post your
situation.

Show quoted text

On Thu, 25 Jan 2007 12:51:31 -0500, "Tom Lane" <tgl@sss.pgh.pa.us> said:

"Jeremy Haile" <jhaile@fastmail.fm> writes:

Unfortunately I don't have any debugging tools installed that would work
against postgres - although I'd be glad to do something if you could
tell me the steps involved. I can reproduce the issue quite easily on
two different Windows machines (one is XP, the other is 2003).

Sorry, I don't know anything about Windows debugging either. Can you
put together a test case that would let one of the Windows-using hackers
reproduce it?

Do you know of any workaround other than restarting the whole server?
Can the collector be restarted individually?

On Unix you can just 'kill -TERM' the collector process and the
postmaster
will start a new one without engaging in a database panic cycle. Dunno
what the equivalent is on Windows but it's probably possible.

regards, tom lane

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jeremy Haile (#13)
Re: Stats collector frozen?

Jeremy Haile wrote:

I'll try to put together a test case for hackers, although I'm not sure
what exactly causes it.

Basically, when I fire up PostgreSQL - after about a minute the stats
collector runs once (pgstat.stat is updated, autovacuum fires up, etc.)
- and then the collector seems to hang. If I watch it's performance
information, it does not read or write to disk again and pgstat.stat is
never updated again. It never updates pgstat.stat more than once after
restart. There are no errors in the log

I tried killing the collector a variety of ways on Windows, but it seems
to terminate indefinitely. I don't see a kill program for windows that
lets me specify the signal to use. So other than restarting PostgreSQL,
I'm not sure how to workaround this problem.

If anyone else is experiencing similar problems, please post your
situation.

All the Windows buildfarm machines are, apparently.

We verified this with Magnus. He found that the tenk2 table does not
seem to get stat updates -- the numbers are all zero, at all times. I
thought I had blogged about this ... oh yeah, it's here

http://www.advogato.org/person/alvherre/diary.html?start=11

AFAIR (Magnus can surely confirm) there were some other tables that
weren't showing stats as all zeros -- but there's no way to know whether
those numbers were put there before the collector had "frozen" (if
that's really what's happening).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#14)
Re: Stats collector frozen?

Alvaro Herrera <alvherre@commandprompt.com> writes:

Jeremy Haile wrote:

If anyone else is experiencing similar problems, please post your
situation.

All the Windows buildfarm machines are, apparently.

Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.

regards, tom lane

#16Jeremy Haile
jhaile@fastmail.fm
In reply to: Alvaro Herrera (#14)
Re: Stats collector frozen?

AFAIR (Magnus can surely confirm) there were some other tables that
weren't showing stats as all zeros -- but there's no way to know whether
those numbers were put there before the collector had "frozen" (if
that's really what's happening).

Yeah - I have numbers that updated before the stats collector started
freezing. Do you know which version of PG this started with? I have
upgraded 8.1.3, 8.1.4, 8.2, and 8.2.1 in the past months and I didn't
have the collector enabled until 8.2.1 - so I'm not sure how long this
has been a problem.

I might try rolling back to a previous version - it's either that or
setup a scheduled vacuum analyze until we figure out this problem. I'm
having to manually run it every day now... =) I think this is a pretty
critical problem since it cripples autovacuum on Windows.

Are you guys in a position to debug the collector process and see where
it is freezing (ala Tom's earlier comment)? Anything I can do to help
debug this problem faster?

#17Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#15)
Re: Stats collector frozen?

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Jeremy Haile wrote:

If anyone else is experiencing similar problems, please post your
situation.

All the Windows buildfarm machines are, apparently.

Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.

Must've been asleep when reading and writing in this thread. Didn't
realize it was the same issue as the buildfarm-killer. Will do the
debugger+stacktrace tomorrow on my VC install.

//Magnus

#18Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#15)
Re: Stats collector frozen?

On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Jeremy Haile wrote:

If anyone else is experiencing similar problems, please post your
situation.

All the Windows buildfarm machines are, apparently.

Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.

Done some checking here. What happens is that suddenly the pgstats
socket stops receiving data. select() (pgstat.c line 1802) returns after
timeout, so got_data is always zero.

Interesting note: I just ran the serial regression tests, and they pass
fine. With the parallel tests, it always stops receiving data somewhere
during the first parallel group.

//Magnus

#19Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#15)
Re: Stats collector frozen?

On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Jeremy Haile wrote:

If anyone else is experiencing similar problems, please post your
situation.

All the Windows buildfarm machines are, apparently.

Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.

Another update - what broke it was not the enabling of autovacuum, it
was the enabling of row level stats. If I disable stats_row_level,
parallel tests pass again.

That doesn't actually tell us *why* it's broken, I think, but it does
tell us why the autovac patch caused it.

I don't have 8.1 or 8.2 around on win32, but it'd be interesting to see if the
same issue happens if you run the tests on that with stats_row_level
enabled. Most likely the same thing happens.

//Magnus

#20Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#15)
Re: Stats collector frozen?

On Thu, Jan 25, 2007 at 04:29:58PM -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Jeremy Haile wrote:

If anyone else is experiencing similar problems, please post your
situation.

All the Windows buildfarm machines are, apparently.

Can't anyone with a debugger duplicate this and get a stack trace for
us? If the stats collector is indeed freezing up, a stack trace showing
where it's stuck would be exceedingly helpful.

Maybe I should finish testing before I send my emails.

Apparantly there is a bug lurking somewhere in pgwin32_select(). Because
if I put a #undef select right before the select in pgstat.c, the
regression tests pass.

I guess the bug is shown because with row level stats we simply have
more data to process. And it appears only to happen on UDP sockets from
what I can tell.

Now, what Iwould *like* to do is to re-implement that part of the code
using the Win32 APIs instead of going through select(). Since it's very
isolated code. I'm going to try that and see how invasive it is, then
see if it'll get accepted :-)

(This would of course give us better performance in general in that
codepath, since all the emulation stuff wouldn't be needed, so there's a
point to doing that other than finding the obscure UDP-related bug in
pgwin23_select)

//Magnus

#21Teodor Sigaev
teodor@sigaev.ru
In reply to: Magnus Hagander (#20)
#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jeremy Haile (#7)
#23Magnus Hagander
magnus@hagander.net
In reply to: Teodor Sigaev (#21)
#24Magnus Hagander
magnus@hagander.net
In reply to: Magnus Hagander (#23)
#25Jeremy Haile
jhaile@fastmail.fm
In reply to: Magnus Hagander (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#20)
#27Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#26)
#28Jeremy Haile
jhaile@fastmail.fm
In reply to: Magnus Hagander (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#24)
#30Jeremy Haile
jhaile@fastmail.fm
In reply to: Tom Lane (#29)
#31Magnus Hagander
magnus@hagander.net
In reply to: Jeremy Haile (#30)
#32Jeremy Haile
jhaile@fastmail.fm
In reply to: Magnus Hagander (#31)
#33Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#29)
#34Magnus Hagander
magnus@hagander.net
In reply to: Jeremy Haile (#32)
#35Jeremy Haile
jhaile@fastmail.fm
In reply to: Magnus Hagander (#34)
#36Jeremy Haile
jhaile@fastmail.fm
In reply to: Jeremy Haile (#35)
#37Michael Meskes
meskes@postgresql.org
In reply to: Magnus Hagander (#33)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#27)