Startup process thrashing

Started by Phillip Berry · over 17 years ago · 12 messages · general
#1 Phillip Berry
pberry@stellaconcepts.com

Hello Everyone,

I've got a bit of a problem. It started last night when postgres (8.1.9) went down citing the need
for a vacuum full to be done due to the transaction log needing to wraparound.

So I stopped the server, logged in using a standalone backend and started a vacuum full analyze on
all dbs (one being 156GB).

During the vacuum of the larger of the databases, a few hours in, it filled up the 18GB
pg_xlog partition with over 1000 WAL files. Due to running out of space the vacuum failed.

When I came in this morning I attempted to start postgres using the normal init script, and now it's
stuck. The startup process is thrashing the disks and working hard, pg_controldata says it's in
recovery, but it's been going for over two hours.

My question is where I should go from here? Should I kill the startup script, clear out the excess
wal files, start the standalone server and try vacuum again? Or should I just wait and see if the
startup process sorts itself out?

The startup process is responding to login attempts with FATAL: the database system is starting up
and logging these attempts so I assume it's still alive and working...

I appreciate any help and advice, I really hope it's not going to turn into lost data (gulp).

Output from pg_controldata:

pg_control version number: 812
Catalog version number: 200510211
Database system identifier: 5142157718116482999
Database cluster state: in recovery
pg_control last modified: Wed 10 Dec 2008 05:55:52 PM CST
Current log file ID: 811
Next log file segment: 221
Latest checkpoint location: 327/8AE2BED0
Prior checkpoint location: 327/8AE2BE80
Latest checkpoint's REDO location: 327/8AE2BED0
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 2146484231
Latest checkpoint's NextOID: 123620
Latest checkpoint's NextMultiXactId: 806872
Latest checkpoint's NextMultiOffset: 1766404
Time of latest checkpoint: Wed 10 Dec 2008 06:01:01 AM CST
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Date/time type storage: floating-point numbers
Maximum length of locale name: 128
LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

#2 Greg Smith
gsmith@gregsmith.com
In reply to: Phillip Berry (#1)
Re: Startup process thrashing

On Thu, 11 Dec 2008, Phillip Berry wrote:

I've got a bit of a problem. It started last night when postgres
(8.1.9) went down citing the need for a vacuum full to be done due to
the transaction log needing to wraparound.

Not exactly. What it said was "To avoid a database shutdown, execute a
full-database VACUUM". In that context, "full" means you vacuum
everything in the database, but only regular VACUUM is needed. VACUUM
FULL, as you learned the hard way, is a more intensive operation, and it's
not needed to resolve the problem you started with. It's a somewhat
unfortunately worded HINT.

During the vacuum of the larger of the databases, a few hours in, it
filled up the 18GB pg_xlog partition with over 1000 WAL files. Due to
running out of space the vacuum failed.

Ouch. Are you running PITR recovery by setting archive_command? Did you
set checkpoint_segments to some very high value? 1000 WAL files is not
normal, curious how you ended up with so many of them.

When I came in this morning I attempted to start postgres using the normal init script, and now it's
stuck. The startup process is thrashing the disks and working hard, pg_controldata says it's in
recovery, but it's been going for over two hours.

It takes a long time to sort through 1000 WAL files and figure out if the
database is consistent for every transaction mentioned there. If your
VACUUM FULL ran for several hours and kicked out 1000 of them, it would be
reasonable to expect the cleanup to also take many hours.

My question is where I should go from here? Should I kill the startup
wal files, start the standalone server and try vacuum again?

Deleting the WAL files like that will leave your database completely
trashed. The utility that's provided to do the job you're considering is
pg_resetxlog:

http://www.postgresql.org/docs/8.1/static/app-pgresetxlog.html

Which is slightly safer, but note the dire warnings there. You are very
likely to get some sort of data corruption if you do that, and you won't
know where it is. You'll be facing a dump and restore to sort that out,
and if you think the server startup is taking a while on a 156GB database
you're really not going to be happy with how long a restore takes.

The safest thing you can do here is just wait for the server to finish
recovery so it starts up. Watch the system activity with something like
vmstat. If the server process is busy using the CPU and it's doing stuff
with the disks, if you have evidence it's making progress, you'll be hard
pressed to execute any manual recovery that's any safer or more efficient
than that is.
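
One rough way to put numbers on "making progress", beyond watching vmstat: check which pg_xlog segment the startup process last read. A small sketch, under the assumption that the filesystem updates access times (i.e. it isn't mounted noatime) and with a hypothetical data-directory path:

```shell
# Print the most recently accessed file in a directory.  During recovery,
# point this at pg_xlog: the name it prints should advance through the
# segment files as WAL replay proceeds.
latest_read_segment() {
    ls -ut "$1" | head -1
}

# Hypothetical usage while recovery runs (path is an assumption):
#   latest_read_segment /var/lib/pgsql/data/pg_xlog
```

If the printed segment name keeps changing over a few minutes, replay is moving forward.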

Someone else may be able to point you toward better estimating how far
it's got left to go, I haven't ever been stuck in your position for long
enough before to figure that out myself. Good luck.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#3 Phillip Berry
pberry@stellaconcepts.com
In reply to: Greg Smith (#2)
Re: Startup process thrashing

Hi Greg,

I appreciate the reply. Fortunately within the last 10 minutes it has finished the recovery...and
then promptly shut itself down again.

The exact error is in fact:

FATAL: database is not accepting commands to avoid wraparound data loss in database "aim"
2008-12-10 06:00:02 CST [21392]: [4-1] HINT: Stop the postmaster and use a standalone backend to
vacuum database "aim".

So I'm back to where I was last night which is a slight relief. The difference being that the
pg_xlog is still full and when I attempt to start Postgres in single user mode it fails because of
that.

I'm not running PITR and checkpoint_segments is set to 100 as this is home to a very write intensive
app. The db has been running for over a year and in that time has been vacuumed many times without
issue.

So now I guess I have two questions:

1. Will copying the pg_xlog dir onto a larger partition cause any problems?
2. Is there any reason that vacuuming would create so many WAL files? I've looked around and can't
find any explanation, though it's a problem that seems to have happened before.
Actually make that three questions...
3. I installed pgbouncer yesterday for connection pooling, does the fact that it holds connections
open have any bearing on how rapidly the transaction IDs are used up? As in, are transaction IDs in
any way related to connections? It's a pretty interesting coincidence.

Cheers
Phil

#4 Greg Smith
gsmith@gregsmith.com
In reply to: Phillip Berry (#3)
Re: Startup process thrashing

On Thu, 11 Dec 2008, Phillip Berry wrote:

I'm not running PITR and checkpoint_segments is set to 100 as this is
home to a very write intensive app.

That's weird then. It shouldn't ever keep around more than 201 WAL
segments. I've heard one report of a similarly mysterious excess of them,
from Robert Treat, but that was probably caused by a hardware failure.

1. Will copying the pg_xlog dir onto a larger partition cause any
problems?

Ah, I didn't realize the disk it's on was still 100% full from what you
said. You're not going to get anywhere until that's sorted out. You
absolutely can move them all to another disk, you just can't delete them
altogether or split them across logical disks. The usual technique is
something like this:

cd $PGDATA
mv pg_xlog $BIGGER
ln -s $BIGGER/pg_xlog

And then fire the server up in single-user mode to issue the VACUUM it's
begging for. Once it's back to running in regular mode again, issue a
"CHECKPOINT" command, and it should erase most of those WAL files. Then
you could stop the server and reverse the above:

cd $PGDATA
unlink pg_xlog
mv $BIGGER/pg_xlog .
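
For anyone following along, the steps in between the two blocks above might look roughly like this. A sketch only: the data-directory path is an assumption, "aim" is the database named in Phillip's log, and in 8.1 the `postgres` binary is itself the standalone backend:

```shell
# Run the standalone backend against the wraparound-threatened database,
# as the server's HINT asked for.
postgres -D /var/lib/pgsql/data aim
# At the backend> prompt, type:
#   VACUUM;
# then Ctrl-D to exit when it finishes.

# Restart normally, then force a checkpoint so the excess segments
# become recyclable:
pg_ctl -D /var/lib/pgsql/data start
psql -d aim -c "CHECKPOINT;"
```

Only after the CHECKPOINT has trimmed pg_xlog back down would you stop the server and reverse the symlink.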

While filling that disk and grinding to a halt is bad, it's not
necessarily any worse than the alternative if you didn't have it on a
dedicated filesystem--this out-of-control xlog creation might have
created an even greater backlog of segments.

2. Is there any reason that vacuuming would create so many WAL files?
I've looked around and can't find any explanation, though it's a
problem that seems to have happened before.

Nothing should make that many WAL files, normally the only way you can get
into that position is if you're running PITR recovery but not archiving
the segments--so they just keep piling up but the server can't recycle
them. What's supposed to happen is that every time there is a checkpoint,
it deletes anything beyond 2*checkpoint_segments+1 worth of segments,
that's where the 201 figure I mentioned comes from.
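
Spelling out that arithmetic, using the 16MB segment size shown in the pg_controldata output earlier in the thread:

```shell
# Steady-state pg_xlog ceiling under the recycling rule described above:
# anything beyond 2*checkpoint_segments+1 segments is removed at
# checkpoint time.
checkpoint_segments=100          # Phillip's setting
segment_mb=16                    # "Bytes per WAL segment: 16777216"
max_segments=$(( 2 * checkpoint_segments + 1 ))
max_mb=$(( max_segments * segment_mb ))
echo "ceiling: $max_segments segments (~$max_mb MB)"
# With 1000+ segments actually present, that's roughly 16GB of WAL --
# enough on its own to fill the 18GB partition -- so something was
# clearly preventing recycling.
```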

3. I installed pgbouncer yesterday for connection pooling, does the fact
that it holds connections open have any bearing on how rapidly the
transaction IDs are used up? As in are transaction IDs in anyway related
to connections? It's a pretty interesting coincidence.

I'm not familiar enough with pgbouncer to comment on the specifics of what
it does. In general, keeping transactions open longer will impact the
pg_clog section of the database, but that shouldn't cause the WAL to grow
out of control. Once you get this back into production, you certainly
should carefully monitor how much disk space is being taken up on the WAL
disk moving forward. If the count of them there gets back over 201 again,
that's an indication something is happening that shouldn't be, and it's
the sort of thing you'll want to gather more information about while it
is still going on.
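
A simple check of that sort might look like the following sketch; the 24-hex-digit name pattern matches 8.1 WAL segment files, and the path in the usage comment is an assumption:

```shell
# Count WAL segment files in a pg_xlog directory.  Segment names are
# 24 uppercase hex digits; anything else in there is ignored.
count_wal_segments() {
    ls "$1" 2>/dev/null | grep -cE '^[0-9A-F]{24}$'
}

# Hypothetical cron-able usage, with the 201-segment ceiling from above:
#   n=$(count_wal_segments /var/lib/pgsql/data/pg_xlog)
#   [ "$n" -le 201 ] || echo "WARNING: $n WAL segments in pg_xlog"
```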

This isn't a very popular time of day for this list, so some more
ideas might show up once our many European members start posting in
volume. As a general aside, if you ever find yourself in this position
again, where you've got an urgent database problem, something you might do
in parallel with posting here is trying the IRC channel:
http://wiki.postgresql.org/wiki/IRC2RWNames

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#2)
Re: Startup process thrashing

Greg Smith <gsmith@gregsmith.com> writes:

Not exactly. What it said was "To avoid a database shutdown, execute a
full-database VACUUM". In that context, "full" means you vacuum
everything in the database, but only regular VACUUM is needed. VACUUM
FULL, as you learned the hard way, is a more intensive operation, and it's
not needed to resolve the problem you started with. It's a somewhat
unfortunately worded HINT.

Maybe we could rephrase it as "whole-database VACUUM"?

regards, tom lane

#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: Startup process thrashing

Tom Lane wrote:

Greg Smith <gsmith@gregsmith.com> writes:

Not exactly. What it said was "To avoid a database shutdown, execute a
full-database VACUUM". In that context, "full" means you vacuum
everything in the database, but only regular VACUUM is needed. VACUUM
FULL, as you learned the hard way, is a more intensive operation, and it's
not needed to resolve the problem you started with. It's a somewhat
unfortunately worded HINT.

Maybe we could rephrase it as "whole-database VACUUM"?

"database-wide VACUUM"?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#6)
Re: Startup process thrashing

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

Maybe we could rephrase it as "whole-database VACUUM"?

"database-wide VACUUM"?

Yeah, that's probably better, because I think we use that phrase in
the documentation already.

regards, tom lane

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#4)
Re: Startup process thrashing

Greg Smith <gsmith@gregsmith.com> writes:

On Thu, 11 Dec 2008, Phillip Berry wrote:

I'm not running PITR and checkpoint_segments is set to 100 as this is
home to a very write intensive app.

That's weird then. It shouldn't ever keep around more than 201 WAL
segments. I've heard one report of a similarly mysterious excess of them,
from Robert Treat, but that was probably caused by a hardware failure.

AFAIK the only non-PITR reason for WAL files to not get recycled is if
checkpoints were failing. Do you still have the postmaster log from
before the original crash, and if so is there anything in there about
checkpoint failures?

regards, tom lane

#9 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Tom Lane (#8)
Re: Startup process thrashing

On Thu, Dec 11, 2008 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Smith <gsmith@gregsmith.com> writes:

On Thu, 11 Dec 2008, Phillip Berry wrote:

I'm not running PITR and checkpoint_segments is set to 100 as this is
home to a very write intensive app.

That's weird then. It shouldn't ever keep around more than 201 WAL
segments. I've heard one report of a similarly mysterious excess of them,
from Robert Treat, but that was probably caused by a hardware failure.

AFAIK the only non-PITR reason for WAL files to not get recycled is if
checkpoints were failing. Do you still have the postmaster log from
before the original crash, and if so is there anything in there about
checkpoint failures?

Don't forget that the OP mentioned earlier that he had very long help
open connections with possible long help open transactions.

#10 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Scott Marlowe (#9)
Re: Startup process thrashing

On Thu, Dec 11, 2008 at 10:09 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

Don't forget that the OP mentioned earlier that he had very long help
open connections with possible long help open transactions.

Long held. held. not help.

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Scott Marlowe (#9)
Re: Startup process thrashing

"Scott Marlowe" <scott.marlowe@gmail.com> writes:

On Thu, Dec 11, 2008 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

AFAIK the only non-PITR reason for WAL files to not get recycled is if
checkpoints were failing. Do you still have the postmaster log from
before the original crash, and if so is there anything in there about
checkpoint failures?

Don't forget that the OP mentioned earlier that he had very long help
open connections with possible long help open transactions.

Open transactions don't prevent checkpoints, and thus don't prevent WAL
truncation either. You're confusing this with vacuuming.

regards, tom lane

#12 Phillip Berry
pberry@stellaconcepts.com
In reply to: Tom Lane (#8)
Re: Startup process thrashing

On Friday 12 December 2008 03:59:42 Tom Lane wrote:

Greg Smith <gsmith@gregsmith.com> writes:

On Thu, 11 Dec 2008, Phillip Berry wrote:

I'm not running PITR and checkpoint_segments is set to 100 as this is
home to a very write intensive app.

That's weird then. It shouldn't ever keep around more than 201 WAL
segments. I've heard one report of a similarly mysterious excess of
them, from Robert Treat, but that was probably caused by a hardware
failure.

AFAIK the only non-PITR reason for WAL files to not get recycled is if
checkpoints were failing. Do you still have the postmaster log from
before the original crash, and if so is there anything in there about
checkpoint failures?

No, the crash happened in single-user mode, so there were (AFAIK) no logs of what happened outside
of the console. I've grepped all the recent logs anyway and there's no mention of checkpoint
problems.

Cheers
Phil