Trimming transaction logs after extended WAL archive failures
Hi everyone,
I have a Postgres 9.3.3 database machine. Due to some intelligent work on the part of someone who shall remain nameless, the WAL archive command included a ‘> /dev/null 2>&1’ which masked archive failures until the disk entirely filled with 400GB of pg_xlog entries.
I have fixed the archive command and can see WAL segments being shipped off of the server, however the xlog remains at a stable size and is not shrinking. In fact, it’s still growing at a (much slower) rate.
I’ve seen references to people just deleting “old” segment files or using pg_resetxlog to fix this situation, however I already know that the response from the mailing list will be “that’s insane, don’t do that”.
So what is the correct solution to pursue here? The steady state of the machine should have enough space, I just need to reclaim some of it... Thanks for any guidance!
Steven
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 03/25/2014 01:56 PM, Steven Schlansker wrote:
Hi everyone,
I have a Postgres 9.3.3 database machine. Due to some intelligent work on the part of someone who shall remain nameless, the WAL archive command included a ‘> /dev/null 2>&1’ which masked archive failures until the disk entirely filled with 400GB of pg_xlog entries.
I have fixed the archive command and can see WAL segments being shipped off of the server, however the xlog remains at a stable size and is not shrinking. In fact, it’s still growing at a (much slower) rate.
So what is wal_keep_segments set at in postgresql.conf?
--
Adrian Klaver
adrian.klaver@aklaver.com
On Mar 25, 2014, at 3:52 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
So what is wal_keep_segments set at in postgresql.conf?
5000. There are currently about 18000 WAL segments in pg_xlog.
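As a rough cross-check on those numbers (a back-of-the-envelope sketch only; 16 MiB is the default build-time WAL segment size in 9.3):

```python
# Approximate pg_xlog sizing: segment count times the default 16 MiB
# segment size. The counts below are the ones reported in this thread.
SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size

def wal_dir_gib(segment_count):
    """Approximate pg_xlog size in GiB for a given number of segments."""
    return segment_count * SEGMENT_BYTES / 1024.0 ** 3

print(wal_dir_gib(18000))  # 281.25 GiB currently on disk
print(wal_dir_gib(5000))   # 78.125 GiB implied by wal_keep_segments alone
```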
On 03/25/2014 03:54 PM, Steven Schlansker wrote:
So what is wal_keep_segments set at in postgresql.conf?
5000. There are currently about 18000 WAL segments in pg_xlog.
I guess what I should have also asked previously: what exactly are you doing? Are you streaming as well as archiving?
--
Adrian Klaver
adrian.klaver@aklaver.com
On Mar 25, 2014, at 4:02 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
I guess what I should have also asked previously is what exactly are you doing, are you streaming as well as archiving?
Yes, we have both enabled. Here are some hopefully relevant configuration stanzas and details:
checkpoint_segments = 1024 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 10min # range 30s-1h
checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 8min # 0 disables
archive_mode = on # allows archiving to be done
archive_command = 'rsync -q %p pgbackup@d0028.nessops.net:./wal/prd-db1a/%f' # command to use to archive a logfile segment
archive_timeout = 0 # force a logfile segment switch after this
# number of seconds; 0 disables
# These settings are ignored on a standby server
max_wal_senders = 10 # max number of walsender processes
wal_keep_segments = 5000 # in logfile segments, 16MB each; 0 disables
vacuum_defer_cleanup_age = 0 # number of xacts by which cleanup is delayed
wal_sender_timeout = 60s # in milliseconds; 0 disables
synchronous_standby_names = '' # standby servers that provide sync rep
# comma-separated list of application_name
# from standby(s); '*' = all
# - Standby Servers -
# These settings are ignored on a master server
hot_standby = on # "on" allows queries during recovery
max_standby_archive_delay = 30s # max delay before canceling queries
# when reading WAL from archive;
# -1 allows indefinite delay
max_standby_streaming_delay = 30s # max delay before canceling queries
# when reading streaming WAL;
# -1 allows indefinite delay
wal_receiver_status_interval = 10s # send replies at least this often
# 0 disables
hot_standby_feedback = off # send info from standby to prevent
# query conflicts
[root@prd-db1a data]# ls -1 pg_xlog | wc -l
20168
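For what it's worth, the 9.3 documentation gives a rough ceiling on pg_xlog: usually no more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 files, whichever is larger. Plugging in the settings above (illustrative arithmetic, not an exact bound):

```python
# Estimate the expected steady-state ceiling on pg_xlog segment count,
# using the rule of thumb from the 9.3 docs. Not an exact bound.
def expected_max_segments(checkpoint_segments, wal_keep_segments,
                          completion_target):
    by_checkpoints = int((2 + completion_target) * checkpoint_segments) + 1
    by_keep = checkpoint_segments + wal_keep_segments + 1
    return max(by_checkpoints, by_keep)

limit = expected_max_segments(1024, 5000, 0.9)
print(limit)               # 6025 segments expected at steady state
print(limit * 16 // 1024)  # roughly 94 GiB, versus the ~20000 segments seen
```

The ~20000 segments actually on disk are more than three times that ceiling, which is consistent with a large archive backlog still being worked off.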
I have verified that WAL segments are being archived to the archive destination, and that the slave is connected and receiving segments.
Thanks!
On 03/25/2014 04:17 PM, Steven Schlansker wrote:
I have verified that WAL segments are being archived to the archive destination, and that the slave is connected and receiving segments.
Some more questions, as things begin to dawn on me :)
You said the disk filled up entirely with log files, yet currently the number (size) of logs is growing.
So did you grow the disk, move the logs, or find some way to reduce the number?
What happened to the server when the disk filled up?
In other words, do the log entries at the time show it recovered gracefully?
If not, what did you do to get it running again?
The concern being whether the server has actually fully recovered.
Thanks!
--
Adrian Klaver
adrian.klaver@aklaver.com
On Mar 25, 2014, at 4:45 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
Some more questions, what happens when things begin to dawn on me:)
You said the disk filled up entirely with log files, yet currently the number (size) of logs is growing.
It’s holding stable now. I tried vacuuming to reclaim some space, which turned out to generate more pg_xlog activity than the space it saved, and (I assume) the archiver fell behind; that was the source of the growing log. There haven’t been any new segments since I stopped doing that.
So did you grow the disk, move the logs or find some way to reduce the number?
I used tune2fs to temporarily reclaim some of the “reserved” filesystem space. I was too scared to move log segments away, as this is a production database.
What happened to the server when the disk filled up?
PostgreSQL PANICed due to failed writes:
Mar 25 22:46:41 prd-db1a postgres[18995]: [12-1] db=checkin,user=postgres PANIC: could not write to file "pg_xlog/xlogtemp.18995": No space left on device
In other words do the log entries at the time show it recovered gracefully?
The database is currently up and running, although I do not have much time until it fails again, there are only a few precious GB free.
If not what did you do to get it running again?
I ran tune2fs and restarted Postgres.
The concern being that the server is actually fully recovered.
I believe it is. Our production site is back up and running seemingly normally, the postgres log has no obvious complaining.
On Tuesday, March 25, 2014, Steven Schlansker <steven@likeness.com> wrote:
Hi everyone,
I have a Postgres 9.3.3 database machine. Due to some intelligent work on
the part of someone who shall remain nameless, the WAL archive command
included a '> /dev/null 2>&1' which masked archive failures until the disk
entirely filled with 400GB of pg_xlog entries.
PostgreSQL itself should be logging archive failures to the server log, regardless of whether the archive command reports them itself.
I have fixed the archive command and can see WAL segments being shipped
off of the server, however the xlog remains at a stable size and is not
shrinking. In fact, it's still growing at a (much slower) rate.
The leading edge of the log files should be archived as soon as they fill up, and recycled/deleted two checkpoints later. The trailing edge should be archived upon checkpoints and then recycled or deleted. I think there is a throttle on how many off the trailing edge are archived each checkpoint. So issue a bunch of "CHECKPOINT;" commands for a while and see if that clears it up.
Cheers,
Jeff
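Jeff's point about the trailing edge can be made concrete with a little segment-name arithmetic. This is an illustrative sketch only: it assumes 9.3-style 24-hex-digit names (8 digits each of timeline, "log id", and segment number, with 256 segments of 16 MiB per log id) and ignores timeline switches, so don't use it to decide what to delete.

```python
# Given the newest WAL segment name and wal_keep_segments, compute the
# oldest segment the server must still keep for streaming standbys.
# Simplified sketch: single timeline, 9.3 naming (256 segments per log id).
SEGS_PER_LOGID = 256  # 4 GiB of WAL per "log id" at 16 MiB per segment

def parse_segment(name):
    """Split a 24-hex-digit segment name into (timeline, linear index)."""
    tli = int(name[0:8], 16)
    logid = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return tli, logid * SEGS_PER_LOGID + seg

def oldest_kept(current, wal_keep_segments):
    """Name of the oldest segment retained by wal_keep_segments."""
    tli, idx = parse_segment(current)
    idx = max(idx - wal_keep_segments, 0)
    return "%08X%08X%08X" % (tli, idx // SEGS_PER_LOGID, idx % SEGS_PER_LOGID)

# Example: 5000 keep-segments back from a hypothetical current segment.
print(oldest_kept("000000010000004200000080", 5000))
```

Anything older than that horizon (and two checkpoints old, and successfully archived) is what checkpoints should be recycling or removing.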
On 03/25/2014 04:52 PM, Steven Schlansker wrote:
Some more questions, what happens when things begin to dawn on me:)
You said the disk filled up entirely with log files yet currently the number(size) of logs is growing.
It’s holding stable now. I tried to vacuum up to clean some space which turned out to generate more pg_xlog activity than it saved space, and (I assume) the archiver fell behind and that was the source of the growing log. There haven’t been any new segments since I stopped doing that.
Yea, vacuum just marks space as available for reuse; it does not actually free space.
--
Adrian Klaver
adrian.klaver@aklaver.com
On Tue, Mar 25, 2014 at 6:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Actually my description is rather garbled, mixing up what I saw when wal_keep_segments was lowered, not when recovering from a long-lasting archive failure. Nevertheless, checkpoints are what provoke the removal of excessive WAL files. Are you logging checkpoints? What do they say?
Also, what is in pg_xlog/archive_status?
Cheers,
Jeff
On Mar 25, 2014, at 7:58 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 03/25/2014 04:52 PM, Steven Schlansker wrote:
Some more questions, what happens when things begin to dawn on me:)
You said the disk filled up entirely with log files yet currently the number(size) of logs is growing.
It’s holding stable now. I tried to vacuum up to clean some space which turned out to generate more pg_xlog activity than it saved space, and (I assume) the archiver fell behind and that was the source of the growing log. There haven’t been any new segments since I stopped doing that.
Yea, vacuum just marks space as available for reuse it does not actually free space.
I even knew that. Funny what you’ll forget when the system is down and you’re in a panic.
This is actually something that has bitten me on more than one occasion: if you accidentally use too much space temporarily, it is *very* hard to back out of the situation. It seems that the only ways to actually release space to the system are VACUUM FULL, CLUSTER, or DROPping objects. None of these can be executed without severe disruption to a running database. A CLUSTER operation on any of our tables that is large enough to matter can easily run through the night.
I can only keep wishing for a CLUSTER CONCURRENTLY or VACUUM FULL CONCURRENTLY that can run without a temporary copy of the entire table...
On Mar 26, 2014, at 9:04 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Mar 25, 2014 at 6:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
The leading edge of the log files should be archived as soon as they fill up, and recycled/deleted two checkpoints later. The trailing edge should be archived upon checkpoints and then recycled or deleted. I think there is a throttle on how many off the trailing edge are archived each checkpoint. So issues a bunch of "CHECKPOINT;" commands for a while and see if that clears it up.
Indeed, forcing a bunch of CHECKPOINTS started to get things moving again.
Actually my description is rather garbled, mixing up what I saw when wal_keep_segments was lowered, not when recovering from a long lasting archive failure. Nevertheless, checkpoints are what provoke the removal of excessive WAL files. Are you logging checkpoints? What do they say? Also, what is in pg_xlog/archive_status ?
I do log checkpoints, but most of them recycle and don’t remove:
Mar 26 16:09:36 prd-db1a postgres[29161]: [221-1] db=,user= LOG: checkpoint complete: wrote 177293 buffers (4.2%); 0 transaction log file(s) added, 0 removed, 56 recycled; write=539.838 s, sync=0.049 s, total=539.909 s; sync files=342, longest=0.015 s, average=0.000 s
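A hypothetical helper for watching progress: pull the added/removed/recycled counts out of log lines like the one above (the pattern matches the 9.3 log_checkpoints format as shown; the helper itself is just a sketch):

```python
# Extract WAL file accounting from a "checkpoint complete" log line.
# Returns (added, removed, recycled), or None if the line doesn't match.
import re

CHECKPOINT_RE = re.compile(
    r"(\d+) transaction log file\(s\) added, "
    r"(\d+) removed, (\d+) recycled")

def checkpoint_counts(line):
    m = CHECKPOINT_RE.search(line)
    if m is None:
        return None
    added, removed, recycled = (int(g) for g in m.groups())
    return added, removed, recycled

line = ("checkpoint complete: wrote 177293 buffers (4.2%); "
        "0 transaction log file(s) added, 0 removed, 56 recycled")
print(checkpoint_counts(line))  # (0, 0, 56)
```

A rising "removed" count across successive checkpoints is the sign the backlog is actually shrinking rather than merely being recycled.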
That said, after letting the db run / checkpoint / archive overnight, the xlog did indeed start to slowly shrink. The pace at which it is shrinking is somewhat unsatisfying, but at least we are making progress now!
I guess if I had just been patient I could have saved some mailing list traffic. But patience is hard when your production database system is running at 0% free disk :)
Thanks everyone for the help, if the log continues to shrink, I should be out of the woods now.
Best,
Steven
On Thu, Mar 27, 2014 at 1:42 AM, Steven Schlansker <steven@likeness.com> wrote:
This is actually something that has bit me on more than one occasion -- if you accidentally temporarily use too much space, it is *very* hard to back out of the situation. It seems that the only way to actually release space to the system are VACUUM FULL, CLUSTER, or to DROP objects. None of these can be executed without severe disruption to a running database. A cluster operation on any of our tables that are large enough to matter can easily run through the night.
Yep, depending on your application needs, you could actually avoid periodic VACUUM FULL-like operations that need an exclusive lock on the objects being cleaned by making autovacuum more aggressive. This keeps your house cleaner by dropping the garbage at a higher frequency.
--
Michael
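For reference, a sketch of what "more aggressive" might look like in postgresql.conf. The values are illustrative starting points, not recommendations; they can also be set per table via storage parameters (9.3 defaults shown in the comments):

```
autovacuum_naptime = 30s                # check for work more often; default 1min
autovacuum_vacuum_scale_factor = 0.05   # vacuum after 5% of a table changes; default 0.2
autovacuum_vacuum_cost_delay = 10ms     # let workers do more per cycle; default 20ms
```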
On Mar 27, 2014, at 5:29 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
Yep, depending on your application needs you could actually avoid any
periodic VACUUM FULL-like operations that need an exclusive lock on
the objects it is cleaning by making autovacuum more aggressive. This
makes your house cleaner by dropping the garbage at a higher
frequency.
Yes, although this will not save you in a couple of notable cases. We have run into it at least through:
* Accidentally leaving transactions open for days
* A runaway process inserting data until the disk fills
So yes, we should autovacuum more, but it is not a total solution.