Pg_xlog increase due to postgres crash (disk full)

Started by Cliff de Carteretabout 13 years ago12 messagesgeneral

cliffdecarteret@googlemail.com

about 13 years ago

My database crashed a couple of days ago during an upgrade several seconds
after committing a large transaction to the database. Eventually we found
out that this was due to the disk being full as the transaction had created
several gigs of data. A day or so later the disk is full again and
PostgreSQL crashes due to the pg_xlog file taking up all of the disk space.
I have cleaned up the drive to have so extra space which allows PostgreSQL
to start again but the xlogs are still increasing. I have two errors in my
pg_log:

"WARNING: transaction log file "00000001000000A800000078" could not be
archived: too many failures" and

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
pg_xlog/00000001000000A800000078
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078"

Postgres version 9.0.3 conf:

-
wal_level = hot_standby
archive_mode = true
archive_command = 'test ! -f /opt/postgres/remote_pgsql/wal_archive/%f
&& cp %p /opt/postgres/remote_pgsql/wal_archive/%f' # command to use to
archive a logfile segment
archive_timeout = 1800
max_wal_senders = 1
max_standby_archive_delay = 900s
max_standby_streaming_delay = 900s
default_statistics_target = 50 # pgtune wizard 2010-11-18
maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
constraint_exclusion = on # pgtune wizard 2010-11-18
checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
effective_cache_size = 5632MB # pgtune wizard 2010-11-18
work_mem = 48MB # pgtune wizard 2010-11-18
wal_buffers = 8MB # pgtune wizard 2010-11-18
checkpoint_segments = 16 # pgtune wizard 2010-11-18
shared_buffers = 1920MB # pgtune wizard 2010-11-18
max_connections = 80 # pgtune wizard 2010-11-18

I've tried stopping postgres and then deleting the
00000001000000A800000078.ready file and starting postgres but this appears
to be recreated instantly and the error is still in the log file.

I've read about the pg_reset_xlog() command but with having to pg_dump our
db with a large amount of blobs and restoring it again is highly
problematic as the pg_restore has struggled to restore.

Will setting zero_damaged_pages (true) work in 9.0.1 and would this resolve
the issue?

Would creating an empty file and replacing the offending xlog work, would
this need to be a specific size?

Any ideas?

Adrian Klaver

adrian.klaver@aklaver.com

about 13 years ago

In reply to: Cliff de Carteret (#1)

Re: Pg_xlog increase due to postgres crash (disk full)

On 01/22/2013 03:12 AM, Cliff de Carteret wrote:

My database crashed a couple of days ago during an upgrade several
seconds after committing a large transaction to the database. Eventually
we found out that this was due to the disk being full as the transaction
had created several gigs of data. A day or so later the disk is full
again and PostgreSQL crashes due to the pg_xlog file taking up all of
the disk space. I have cleaned up the drive to have so extra space which
allows PostgreSQL to start again but the xlogs are still increasing. I
have two errors in my pg_log:

"WARNING: transaction log file "00000001000000A800000078" could not be
archived: too many failures" and

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
pg_xlog/00000001000000A800000078
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078"

I am not seeing it below, but just to be complete is wal_keep_segments
set to something greater than 0?

Just to be clear which xlogs are filling up, the ones in the original
data directory or in the archive directory?

Postgres version 9.0.3 conf:

*

wal_level = hot_standby
archive_mode = true
archive_command = 'test ! -f
/opt/postgres/remote_pgsql/wal_archive/%f && cp %p
/opt/postgres/remote_pgsql/wal_archive/%f' # command to use to
archive a logfile segment
archive_timeout = 1800
max_wal_senders = 1
max_standby_archive_delay = 900s
max_standby_streaming_delay = 900s
default_statistics_target = 50 # pgtune wizard 2010-11-18
maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
constraint_exclusion = on # pgtune wizard 2010-11-18
checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
effective_cache_size = 5632MB # pgtune wizard 2010-11-18
work_mem = 48MB # pgtune wizard 2010-11-18
wal_buffers = 8MB # pgtune wizard 2010-11-18
checkpoint_segments = 16 # pgtune wizard 2010-11-18
shared_buffers = 1920MB # pgtune wizard 2010-11-18
max_connections = 80 # pgtune wizard 2010-11-18

--
Adrian Klaver
adrian.klaver@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Cliff de Carteret

cliffdecarteret@googlemail.com

about 13 years ago

In reply to: Adrian Klaver (#2)

Re: Pg_xlog increase due to postgres crash (disk full)

Hello, thanks for your response.

The xlogs that are filling up are in the original data directory and the
wal_keep_segments is commented out on my master and slave postgresql.conf

Cliff

On 22 January 2013 14:48, Adrian Klaver <adrian.klaver@gmail.com> wrote:

Show quoted text

On 01/22/2013 03:12 AM, Cliff de Carteret wrote:

My database crashed a couple of days ago during an upgrade several
seconds after committing a large transaction to the database. Eventually
we found out that this was due to the disk being full as the transaction
had created several gigs of data. A day or so later the disk is full
again and PostgreSQL crashes due to the pg_xlog file taking up all of
the disk space. I have cleaned up the drive to have so extra space which
allows PostgreSQL to start again but the xlogs are still increasing. I
have two errors in my pg_log:

"WARNING: transaction log file "00000001000000A800000078" could not be
archived: too many failures" and

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/**wal_archive/**00000001000000A800000078 && cp
pg_xlog/**00000001000000A800000078
/opt/postgres/remote_pgsql/**wal_archive/**00000001000000A800000078"

I am not seeing it below, but just to be complete is wal_keep_segments set
to something greater than 0?

Just to be clear which xlogs are filling up, the ones in the original data
directory or in the archive directory?

Postgres version 9.0.3 conf:

*

wal_level = hot_standby
archive_mode = true
archive_command = 'test ! -f
/opt/postgres/remote_pgsql/**wal_archive/%f && cp %p
/opt/postgres/remote_pgsql/**wal_archive/%f' # command to use to
archive a logfile segment
archive_timeout = 1800
max_wal_senders = 1
max_standby_archive_delay = 900s
max_standby_streaming_delay = 900s
default_statistics_target = 50 # pgtune wizard 2010-11-18
maintenance_work_mem = 480MB # pgtune wizard 2010-11-18
constraint_exclusion = on # pgtune wizard 2010-11-18
checkpoint_completion_target = 0.9 # pgtune wizard 2010-11-18
effective_cache_size = 5632MB # pgtune wizard 2010-11-18
work_mem = 48MB # pgtune wizard 2010-11-18
wal_buffers = 8MB # pgtune wizard 2010-11-18
checkpoint_segments = 16 # pgtune wizard 2010-11-18
shared_buffers = 1920MB # pgtune wizard 2010-11-18
max_connections = 80 # pgtune wizard 2010-11-18

--
Adrian Klaver
adrian.klaver@gmail.com

Adrian Klaver

adrian.klaver@aklaver.com

about 13 years ago

In reply to: Cliff de Carteret (#3)

Re: Pg_xlog increase due to postgres crash (disk full)

On 01/22/2013 06:59 AM, Cliff de Carteret wrote:

Hello, thanks for your response.

The xlogs that are filling up are in the original data directory and the
wal_keep_segments is commented out on my master and slave postgresql.conf

It is expected that there may a certain increase in WAL files.
In particular: checkpoint_segments = 16

http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-CHECKPOINT-SEGMENTS
"
checkpoint_segments (integer)
Maximum number of log file segments between automatic WAL checkpoints
(each segment is normally 16 megabytes). The default is three segments.
Increasing this parameter can increase the amount of time needed for
crash recovery. This parameter can only be set in the postgresql.conf
file or on the server command line.
"

For more information that explains the above see:
http://www.postgresql.org/docs/9.2/static/wal-configuration.html

More questions:

Does it look like any WAL files in the pg_xlog directory are being recycled?

Re your archive error from your previous post, is your archive directory
full?

Assuming the archive directory is remote, is it reachable?

Cliff

--
Adrian Klaver
adrian.klaver@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Cliff de Carteret

cliffdecarteret@googlemail.com

about 13 years ago

In reply to: Adrian Klaver (#4)

Re: Pg_xlog increase due to postgres crash (disk full)

There are no files in the pg_xlog directory from before the database crash
except one file which ends in .backup, there are 759 files. In the
pg_xlog/archive_status folder there are 755 files.
The local ../wal_archive folder has no files in it at all whereas the
remote wal_archive folder has 147 files present.

The remote archive folder is not full and has ~14G whereas the local
archive folder was previously full but now it has been moved so it has 100G+

The archive directory is a mount and as the user postgres I can copy a file
into it successfully.

The current setup has been working successfully for several years until the
recent database crash

On 22 January 2013 15:34, Adrian Klaver <adrian.klaver@gmail.com> wrote:

Show quoted text

On 01/22/2013 06:59 AM, Cliff de Carteret wrote:

Hello, thanks for your response.

The xlogs that are filling up are in the original data directory and the
wal_keep_segments is commented out on my master and slave postgresql.conf

It is expected that there may a certain increase in WAL files.
In particular: checkpoint_segments = 16

http://www.postgresql.org/**docs/9.2/static/runtime-**config-wal.html#GUC-
**CHECKPOINT-SEGMENTS<http://www.postgresql.org/docs/9.2/static/runtime-config-wal.html#GUC-CHECKPOINT-SEGMENTS>
"
checkpoint_segments (integer)
Maximum number of log file segments between automatic WAL checkpoints
(each segment is normally 16 megabytes). The default is three segments.
Increasing this parameter can increase the amount of time needed for crash
recovery. This parameter can only be set in the postgresql.conf file or on
the server command line.
"

For more information that explains the above see:
http://www.postgresql.org/**docs/9.2/static/wal-**configuration.html<http://www.postgresql.org/docs/9.2/static/wal-configuration.html>

More questions:

Does it look like any WAL files in the pg_xlog directory are being
recycled?

Re your archive error from your previous post, is your archive directory
full?

Assuming the archive directory is remote, is it reachable?

Cliff

--
Adrian Klaver
adrian.klaver@gmail.com

Kevin Grittner

Kevin.Grittner@wicourts.gov

about 13 years ago

In reply to: Cliff de Carteret (#5)

Re: Pg_xlog increase due to postgres crash (disk full)

Cliff de Carteret wrote:

The current setup has been working successfully for several years
until the recent database crash

What file does the server log say it is trying to archive? What
error are you getting? Does that filename already exist on the
archive (or some intermediate location used by the archive command
or script)?

-Kevin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Import Notes

Resolved by subject fallback

Adrian Klaver

adrian.klaver@aklaver.com

about 13 years ago

In reply to: Cliff de Carteret (#5)

Re: Pg_xlog increase due to postgres crash (disk full)

On 01/22/2013 07:57 AM, Cliff de Carteret wrote:

There are no files in the pg_xlog directory from before the database
crash except one file which ends in .backup, there are 759 files. In the
pg_xlog/archive_status folder there are 755 files.
The local ../wal_archive folder has no files in it at all whereas the
remote wal_archive folder has 147 files present.

What is this local wal_archive directory?
From a previous post:

I see only /opt/postgres/remote_pgsql/wal_archive which I assumed was
the remote. I am not sure where the local one fits in?

The remote archive folder is not full and has ~14G whereas the local
archive folder was previously full but now it has been moved so it has 100G+

The archive directory is a mount and as the user postgres I can copy a
file into it successfully.

The current setup has been working successfully for several years until
the recent database crash

So what did you do to get the database running again?

--
Adrian Klaver
adrian.klaver@gmail.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Cliff de Carteret

cliffdecarteret@googlemail.com

about 13 years ago

In reply to: Adrian Klaver (#7)

Re: Pg_xlog increase due to postgres crash (disk full)

The local wal_archive directory is a directory named "wal_archive" which is
at the same file system level as the data directory. The time stamp states
that it has not been used for several years so it looks to be redundant.

To get the database running again I moved an old postgres installation to a
backups folder which gave me ~2G of space to start up again until I was
able to mount a new disk and move the pg_xlog to a separate partition using
a symbolic link.

On 22 January 2013 16:15, Adrian Klaver <adrian.klaver@gmail.com> wrote:

Show quoted text

On 01/22/2013 07:57 AM, Cliff de Carteret wrote:

There are no files in the pg_xlog directory from before the database
crash except one file which ends in .backup, there are 759 files. In the
pg_xlog/archive_status folder there are 755 files.
The local ../wal_archive folder has no files in it at all whereas the
remote wal_archive folder has 147 files present.

What is this local wal_archive directory?
From a previous post:

"LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/**wal_archive/**00000001000000A800000078 && cp
pg_xlog/**00000001000000A800000078 /opt/postgres/remote_pgsql/**
wal_archive/**00000001000000A800000078

I see only /opt/postgres/remote_pgsql/**wal_archive which I assumed was
the remote. I am not sure where the local one fits in?

The remote archive folder is not full and has ~14G whereas the local
archive folder was previously full but now it has been moved so it has
100G+

The archive directory is a mount and as the user postgres I can copy a
file into it successfully.

The current setup has been working successfully for several years until
the recent database crash

So what did you do to get the database running again?

--
Adrian Klaver
adrian.klaver@gmail.com

Kevin Grittner

Kevin.Grittner@wicourts.gov

about 13 years ago

In reply to: Cliff de Carteret (#8)

Re: Pg_xlog increase due to postgres crash (disk full)

[Please keep the list copied, and put your reply in-line instead
of at the top.]

Cliff de Carteret wrote:

On 22 January 2013 16:07, Kevin Grittner <kgrittn@mail.com> wrote:

Cliff de Carteret wrote:

The current setup has been working successfully for several years
until the recent database crash

What file does the server log say it is trying to archive? What
error are you getting? Does that filename already exist on the
archive (or some intermediate location used by the archive command
or script)?

The sever log is (repeated constantly):

LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
pg_xlog/00000001000000A800000078
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078
WARNING: transaction log file "00000001000000A800000078" could not be
archived: too many failures

The file 00000001000000A800000078 exists in the remote archive's
wal_archive directory. I read a post saying to copy the file over to the
archive and then delete the .ready file to get postgres to move onto the
next file but this ended up logging out saying that a log file was missing.
There are more recent files in this directory but they end at the point
where I reverted all of the changes I made last night when time was running
out and the database had to be put back to a known state.

I would have deleted (or renamed) the copy in the archive
directory. Archiving should have then resumed and cleaned up the
pg_xlog directory.

-Kevin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Import Notes

Resolved by subject fallback

#10

Cliff de Carteret

cliffdecarteret@googlemail.com

about 13 years ago

In reply to: Kevin Grittner (#9)

Re: Pg_xlog increase due to postgres crash (disk full)

On 22 January 2013 16:43, Kevin Grittner <kgrittn@mail.com> wrote:

[Please keep the list copied, and put your reply in-line instead
of at the top.]

Cliff de Carteret wrote:

On 22 January 2013 16:07, Kevin Grittner <kgrittn@mail.com> wrote:

Cliff de Carteret wrote:

The current setup has been working successfully for several years
until the recent database crash

What file does the server log say it is trying to archive? What
error are you getting? Does that filename already exist on the
archive (or some intermediate location used by the archive command
or script)?

The sever log is (repeated constantly):

LOG: archive command failed with exit code 1
DETAIL: The failed archive command was: test ! -f
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078 && cp
pg_xlog/00000001000000A800000078
/opt/postgres/remote_pgsql/wal_archive/00000001000000A800000078
WARNING: transaction log file "00000001000000A800000078" could not be
archived: too many failures

The file 00000001000000A800000078 exists in the remote archive's
wal_archive directory. I read a post saying to copy the file over to the
archive and then delete the .ready file to get postgres to move onto the
next file but this ended up logging out saying that a log file was

missing.

There are more recent files in this directory but they end at the point
where I reverted all of the changes I made last night when time was

running

out and the database had to be put back to a known state.

I would have deleted (or renamed) the copy in the archive
directory. Archiving should have then resumed and cleaned up the
pg_xlog directory.

I have now deleted the copy on the remote wal_archive folder and the
archiving is now functioning and sending the logs from the local to the
remote folder. The remote database does not startup and the following is in
the log:

LOG: database system was shut down in recovery at 2013-01-22 10:54:48 GMT
LOG: entering standby mode
LOG: restored log file "00000001000000AB00000051" from archive
LOG: invalid resource manager ID in primary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 22350) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure

00000001000000AB00000051 is in my remote database's pg_xlog folder

Thanks for your help already!

Show quoted text

-Kevin

#11

Kevin Grittner

Kevin.Grittner@wicourts.gov

about 13 years ago

In reply to: Cliff de Carteret (#10)

Re: Pg_xlog increase due to postgres crash (disk full)

Cliff de Carteret wrote:

I have now deleted the copy on the remote wal_archive folder and the
archiving is now functioning and sending the logs from the local to the
remote folder. The remote database does not startup and the following is in
the log:

LOG: database system was shut down in recovery at 2013-01-22 10:54:48 GMT
LOG: entering standby mode
LOG: restored log file "00000001000000AB00000051" from archive
LOG: invalid resource manager ID in primary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 22350) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure

00000001000000AB00000051 is in my remote database's pg_xlog folder

Any chance that there was a pg_start_backup() call on the master without
a matching pg_stop_backup() call?

-Kevin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Import Notes

Resolved by subject fallback

#12

Cliff de Carteret

cliffdecarteret@googlemail.com

about 13 years ago

In reply to: Kevin Grittner (#11)

Re: Pg_xlog increase due to postgres crash (disk full)

I had not run these commands on the master as I was only doing sql updates
~1mil of them