vacuumdb hanging database cluster
When I run:
vacuumdb --full --all --analyze --quiet
on my database cluster it will complete in < 2 minutes (this cluster
is a few million total rows and ~2GB).
After testing, I set this up as an off-hours cron job and it worked
fine for several days then hung the whole database. After my pager
pulled me from bed I found the vacuumdb process still running but the
vacuum process on the first database (alphabetically) was showing it
was waiting:
postgres: vacuumdb --full --all --analyze --quiet
postgres: postgres firstdb [local] VACUUM waiting
A couple hundred processes were showing as "startup waiting" and one
was "idle in transaction". The process in the "VACUUM waiting" state
was the only one connected to that database - all other connections
were to other databases.
CPU and disk utilization were essentially zero. Suspecting a lock
problem I attempted to use a pre-existing connection to view pg_locks
but it would not respond.
I killed the vacuum process and all the processes in the "waiting"
states cleared within a second or two and system returned to normal.
The pg_locks query also returned but showed no useful info.
I tracked down the process that was "idle in transaction" and it was a
pg_dump process running on another machine. This process does a
periodic dump of one very small table and should complete in a
fraction of a second but was still waiting since the previous day -
apparently without deleterious effects.
There was no useful info in the log.
I've stopped running the vacuum full job via cron till I can trust it.
Any ideas on how to track/prevent this behavior? Server is version
7.4.1 and my web searches have proved futile.
Cheers,
Steve
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of
Steve Crawford
Sent: Monday, July 26, 2004 1:23 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] vacuumdb hanging database clusterWhen I run:
vacuumdb --full --all --analyze --quiet
on my database cluster it will complete in < 2 minutes (this cluster
is a few million total rows and ~2GB).After testing, I set this up as an off-hours cron job and it worked
fine for several days then hung the whole database. After my pager
pulled me from bed I found the vacuumdb process still running but the
vacuum process on the first database (alphabetically) was showing it
was waiting:
postgres: vacuumdb --full --all --analyze --quiet
postgres: postgres firstdb [local] VACUUM waitingA couple hundred processes were showing as "startup waiting" and one
was "idle in transaction". The process in the "VACUUM waiting" state
was the only one connected to that database - all other connections
were to other databases.CPU and disk utilization were essentially zero. Suspecting a lock
problem I attempted to use a pre-existing connection to view pg_locks
but it would not respond.I killed the vacuum process and all the processes in the "waiting"
states cleared within a second or two and system returned to normal.
The pg_locks query also returned but showed no useful info.I tracked down the process that was "idle in transaction" and
it was a
pg_dump process running on another machine. This process does a
periodic dump of one very small table and should complete in a
fraction of a second but was still waiting since the previous day -
apparently without deleterious effects.There was no useful info in the log.
I've stopped running the vacuum full job via cron till I can
trust it.
Any ideas on how to track/prevent this behavior? Server is version
7.4.1 and my web searches have proved futile.
I have seen problematic behavior when one vacuum starts after another is
already running.
It might be a good idea to semaphore vacuum operations.
But my experience is with an older version of PostgreSQL, so the
problems you are seeing might be totally unrelated.
Import Notes
Resolved by subject fallback
Steve Crawford <scrawford@pinpointresearch.com> writes:
A couple hundred processes were showing as "startup waiting" and one
was "idle in transaction". The process in the "VACUUM waiting" state
was the only one connected to that database - all other connections
were to other databases.
I suspect what must have happened is that the vacuum process was trying
to vacuum one of the shared catalogs (pg_database or pg_shadow), and was
blocked trying to get exclusive lock because someone else (the "idle in
transaction" guy) was holding some lock on that table. At this point
all incoming connections, to any database, will block behind the VACUUM
until the idle guy closes his transaction and thereby releases his lock.
I tracked down the process that was "idle in transaction" and it was a
pg_dump process running on another machine. This process does a
periodic dump of one very small table and should complete in a
fraction of a second but was still waiting since the previous day -
apparently without deleterious effects.
What was it waiting on? Since it was idle instead of waiting, the
problem must have been on the client side. I've not heard of pg_dump
just going to sleep for no reason...
I've stopped running the vacuum full job via cron till I can trust it.
Any ideas on how to track/prevent this behavior? Server is version
7.4.1 and my web searches have proved futile.
My recommendation would be to lose the --full. If you're doing
sufficiently frequent vacuuming you have no need for that, and getting
rid of it means vacuum doesn't take exclusive table locks. That means
it will neither block nor be blocked by ordinary readers and writers.
regards, tom lane
On Monday 26 July 2004 2:18 pm, Tom Lane wrote:
Steve Crawford <scrawford@pinpointresearch.com> writes:
A couple hundred processes were showing as "startup waiting" and
one was "idle in transaction". The process in the "VACUUM
waiting" state was the only one connected to that database - all
other connections were to other databases.I suspect what must have happened is that the vacuum process was
trying to vacuum one of the shared catalogs (pg_database or
pg_shadow), and was blocked trying to get exclusive lock because
someone else (the "idle in transaction" guy) was holding some lock
on that table. At this point all incoming connections, to any
database, will block behind the VACUUM until the idle guy closes
his transaction and thereby releases his lock.
I agree that it has the "smell" of a system-level lock - I just
couldn't get any info from pg_locks till it was cleared.
I tracked down the process that was "idle in transaction" and it
was a pg_dump process running on another machine. This process
does a periodic dump of one very small table and should complete
in a fraction of a second but was still waiting since the
previous day - apparently without deleterious effects.What was it waiting on? Since it was idle instead of waiting, the
problem must have been on the client side. I've not heard of
pg_dump just going to sleep for no reason...
Beats the heck out of me. We periodically dump some selected small
tables via a script using:
pg_dump -i -h $dbhost -U $dbuser -t $dumptable > dumpfile
It's very vanilla and generally works fine but sometimes (perhaps 1
per 1000+ runs) ends up idle in transaction. I'm going to take a much
closer look at pg_locks next time it happens.
The -i is because pg_dump on the client machine is 7.4.2 and the
server is 7.4.1 but that doesn't seem to be a problem.
I've stopped running the vacuum full job via cron till I can
trust it. Any ideas on how to track/prevent this behavior? Server
is version 7.4.1 and my web searches have proved futile.My recommendation would be to lose the --full. If you're doing
sufficiently frequent vacuuming you have no need for that, and
getting rid of it means vacuum doesn't take exclusive table locks.
That means it will neither block nor be blocked by ordinary readers
and writers.
Yes, that's where I've headed. I'll save the occasional full vacuum
for manual running when I can watch it.
Thanks,
Steve
Steve Crawford <scrawford@pinpointresearch.com> writes:
I tracked down the process that was "idle in transaction" and it
was a pg_dump process running on another machine.What was it waiting on?
Beats the heck out of me. We periodically dump some selected small
tables via a script using:
pg_dump -i -h $dbhost -U $dbuser -t $dumptable > dumpfile
It's very vanilla and generally works fine but sometimes (perhaps 1
per 1000+ runs) ends up idle in transaction. I'm going to take a much
closer look at pg_locks next time it happens.
If it is "idle in transaction" and not "<something> waiting" then it is
not blocked waiting for someone's lock, so pg_locks is unlikely to yield
the answer. I think it's got to be something funny on the pg_dump side.
Or maybe a communications problem?
regards, tom lane