Warm standby stall -- what debug info would help?
We're running a number of warm standby instances on one server. One of them stalled on Saturday. When I found that this morning, I confirmed that the files were in the directory from which it should be pulling the WAL files. The logs showed normal processing up to the stall, with no messages afterwards. I tried a restart and it resumed warm standby status and caught up quickly.
It seems like this is probably a PostgreSQL bug of some sort, although maybe someone can spot a problem in our recovery script. If it happens again, what information should I gather before the restart to help find the cause?
-Kevin
PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # tail postgresql-2007-07-07_000000.log
[2007-07-07 09:44:41.392 CDT] 5962 LOG: restored log file "000000010000000C000000D2" from archive
[2007-07-07 12:24:53.597 CDT] 5962 LOG: restored log file "000000010000000C000000D3" from archive
[2007-07-07 12:24:53.984 CDT] 5962 LOG: restored log file "000000010000000C000000D4" from archive
[2007-07-07 12:24:54.351 CDT] 5962 LOG: restored log file "000000010000000C000000D5" from archive
[2007-07-07 14:10:42.208 CDT] 5962 LOG: restored log file "000000010000000C000000D6" from archive
[2007-07-07 14:10:42.634 CDT] 5962 LOG: restored log file "000000010000000C000000D7" from archive
[2007-07-07 15:23:41.717 CDT] 5962 LOG: restored log file "000000010000000C000000D8" from archive
[2007-07-07 18:24:26.933 CDT] 5962 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_000000.log
[2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
[2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
[2007-07-09 08:21:50.485 CDT] 5904 LOG: startup process (PID 5962) exited with exit code 1
[2007-07-09 08:21:50.485 CDT] 5904 LOG: aborting startup due to startup process failure
[2007-07-09 08:21:50.555 CDT] 5960 LOG: logger shutting down
PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_082151.log | grep -v 'starting up'
[2007-07-09 08:21:51.718 CDT] 19076 LOG: database system was interrupted while in recovery at log time 2007-07-07 15:00:02 CDT
[2007-07-09 08:21:51.718 CDT] 19076 HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
[2007-07-09 08:21:51.718 CDT] 19076 LOG: starting archive recovery
[2007-07-09 08:21:51.725 CDT] 19076 LOG: restore_command = "/usr/local/backup/recovery.sh %f %p"
[2007-07-09 08:21:52.079 CDT] 19076 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-09 08:21:52.079 CDT] 19076 LOG: checkpoint record is at C/D9000020
[2007-07-09 08:21:52.079 CDT] 19076 LOG: redo record is at C/D9000020; undo record is at 0/0; shutdown FALSE
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next transaction ID: 0/10700115; next OID: 1387338
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2007-07-09 08:21:52.079 CDT] 19076 LOG: automatic recovery in progress
[2007-07-09 08:21:52.081 CDT] 19076 LOG: redo starts at C/D9000068
[2007-07-09 08:21:52.429 CDT] 19076 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-09 08:21:52.860 CDT] 19076 LOG: restored log file "000000010000000C000000DB" from archive
PGBACKUP:/var/pgsql/data/county/ozaukee/data # grep -Ev '^([[:space:]]+)?($|#)' postgresql.conf
listen_addresses = '*' # what IP address(es) to listen on;
port = 5445 # (change requires restart)
max_connections = 50 # (change requires restart)
shared_buffers = 160MB # min 128kB or max_connections*16kB
temp_buffers = 10MB # min 800kB
work_mem = 30MB # min 64kB
maintenance_work_mem = 160MB # min 1MB
max_fsm_pages = 204800 # min max_fsm_relations*16, 6 bytes each
bgwriter_lru_percent = 20.0 # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 200 # 0-1000 buffers max written/round
bgwriter_all_percent = 10.0 # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 600 # 0-1000 buffers max written/round
wal_buffers = 1MB # min 32kB
checkpoint_segments = 10 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min # range 30s-1h
archive_timeout = 3600 # force a logfile segment switch after this
seq_page_cost = 0.1 # measured on an arbitrary scale
random_page_cost = 0.1 # same scale as above
effective_cache_size = 3GB
redirect_stderr = on # Enable capturing of stderr into log
log_line_prefix = '[%m] %p %q<%u %d %r> ' # Special values:
stats_block_level = on
stats_row_level = on
autovacuum = on # enable autovacuum subprocess?
autovacuum_naptime = 10s # time between autovacuum runs
autovacuum_vacuum_threshold = 1 # min # of tuple updates before
autovacuum_analyze_threshold = 1 # min # of tuple updates before
datestyle = 'iso, mdy'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
escape_string_warning = off
standard_conforming_strings = on
sql_inheritance = off
PGBACKUP:/var/pgsql/data/county/ozaukee/data # pg_controldata /var/pgsql/data/county/ozaukee/data/
pg_control version number: 822
Catalog version number: 200611241
Database system identifier: 5062210045799347511
Database cluster state: in archive recovery
pg_control last modified: Mon 09 Jul 2007 08:21:52 AM CDT
Current log file ID: 8
Next log file segment: 150
Latest checkpoint location: C/D9000020
Prior checkpoint location: C/D9000020
Latest checkpoint's REDO location: C/D9000020
Latest checkpoint's UNDO location: C/D9000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/10700115
Latest checkpoint's NextOID: 1387338
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Time of latest checkpoint: Sat 07 Jul 2007 03:00:02 PM CDT
Minimum recovery ending location: 8/9500A530
Maximum data alignment: 4
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C
PGBACKUP:/var/pgsql/data/county/ozaukee/data # cat /usr/local/backup/recovery.sh
#! /bin/bash
# Pick out county name from the back of the path.
# The value of $PWD will be: /var/pgsql/data/county/<countyName>/data
countyName=`dirname $PWD`
countyName=`basename $countyName`
while [ ! -f /var/pgsql/data/county/$countyName/wal-files/$1.gz \
        -a ! -f /var/pgsql/data/county/$countyName/DONE \
        -o -f /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress ]
do
    if ! ( echo "$1" | grep -Eq '^[0-9A-F]{24}$' ) ; then
        exit 1
    fi
    sleep 10 # /* wait for ~10 sec */
done
gunzip < /var/pgsql/data/county/$countyName/wal-files/$1.gz > "$2"
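
For next time, here is a rough sketch of the state I'd try to capture before
restarting a stalled instance (same layout as above; which log file to tail
is just a guess at the most recent one):

#! /bin/bash
# Sketch only: snapshot the state of a stalled warm standby before restarting.
county=ozaukee
base=/var/pgsql/data/county/$county
out=/tmp/standby-debug-$county-$(date +%Y%m%d%H%M%S).txt
{
  date
  # Is the startup process still there, and is recovery.sh asleep in its loop?
  ps -ef | grep -E 'postgres|recovery\.sh' | grep -v grep
  # What has actually arrived in the archive directory, and are the
  # flag files recovery.sh looks at present?
  ls -l $base/wal-files | tail -20
  ls -l $base/wal-files/rsync-in-progress $base/DONE
  # Control file state and the last few server log lines.
  pg_controldata $base/data
  tail -50 $base/data/pg_log/$(ls -t $base/data/pg_log | head -1)
} > "$out" 2>&1
echo "wrote $out"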
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
[2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
[2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
[2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
Evidently it was waiting for the restore_command script to give it back
a file. So the problem is within your restore script. Eyeing the
script, the only obvious thing that could block it is existence of
/var/pgsql/data/county/$countyName/wal-files/rsync-in-progress
regards, tom lane
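
(A side note on reading the script: -a binds tighter than -o in test(1), so
the wait condition groups as in the restatement below -- the loop keeps
waiting while the .gz file is missing and DONE is absent, or whenever the
rsync-in-progress flag exists. The regex check inside the loop is omitted
here for brevity; this is only a restatement, not a proposed change.)

# recovery.sh wait condition, restated with explicit grouping
while { [ ! -f /var/pgsql/data/county/$countyName/wal-files/$1.gz ] \
        && [ ! -f /var/pgsql/data/county/$countyName/DONE ]; } \
      || [ -f /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress ]
do
    sleep 10
done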
On Mon, Jul 9, 2007 at 10:54 AM, in message
<469213FB.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:
We're running a number of warm standby instances on one server. One of them
stalled on Saturday. When I found that this morning, I confirmed that the
files were in the directory from which it should be pulling the WAL files.
The logs showed normal processing up to the stall, with no messages
afterwards. I tried a restart and it resumed warm standby status and caught
up quickly.
No, actually it stalled on the same WAL file. I've got another one in the
same state that I haven't touched yet. I'll work on gathering what info I
can think of, but if there's something in particular you would like to see,
let me know.
More in a bit.
Should this be on the bugs list instead of hackers?
-Kevin
On Mon, 2007-07-09 at 12:36 -0400, Tom Lane wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
[2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
[2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
[2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
Evidently it was waiting for the restore_command script to give it back
a file. So the problem is within your restore script.
Agreed.
Kevin, can you try pg_standby please? If that doesn't work either, it
would help me diagnose the problem faster. TIA.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
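
(For reference, a minimal sketch of what the recovery.conf might look like
with pg_standby -- the paths and trigger-file name are illustrative only, and
since the archive directory above holds gzipped segments, the compression
would have to be dropped or undone before pg_standby could find the files it
is asked for:)

# recovery.conf -- sketch only
restore_command = 'pg_standby -d -s 10 -t /var/pgsql/data/county/ozaukee/pgsql.trigger /var/pgsql/data/county/ozaukee/wal-files %f %p 2>> /var/pgsql/data/county/ozaukee/data/pg_log/standby.log'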
On Mon, Jul 9, 2007 at 11:36 AM, in message <23173.1183999016@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
[2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
[2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
[2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
Evidently it was waiting for the restore_command script to give it back
a file. So the problem is within your restore script. Eyeing the
script, the only obvious thing that could block it is existence of
/var/pgsql/data/county/$countyName/wal-files/rsync-in-progress
Sorry for the noise. It wasn't the rsync file but something even more
obvious that I managed to misread. The rsync was failing to copy the files
from the counties to the directory where recovery reads them. I could have
sworn I checked that, but I clearly messed up.
-Kevin
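
(Footnote for anyone hitting the same thing: the failure mode here was an
rsync that quietly stopped delivering files. A hypothetical wrapper on the
sending side along the lines below -- the source path, destination, and log
location are assumptions, not the actual setup, and the handling of the
rsync-in-progress flag file is omitted -- would at least have made the
failure visible:)

#! /bin/bash
# Hypothetical push-side wrapper: copy compressed WAL segments to the standby
# and log loudly on failure instead of failing silently.
src=/var/pgsql/wal-archive/ozaukee/
dest=PGBACKUP:/var/pgsql/data/county/ozaukee/wal-files/
log=/var/log/wal-push-ozaukee.log

if ! rsync -a "$src" "$dest" >> "$log" 2>&1 ; then
    echo "$(date): rsync of WAL segments to $dest FAILED" >> "$log"
    # hook for mail/nagios alerting would go here
    exit 1
fi
echo "$(date): rsync of WAL segments to $dest ok" >> "$log"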