Recovery_target_time misinterpreted?

Started by Klaus Ita over 12 years ago | 5 messages | bugs, general
#1 Klaus Ita
koki.eml@gmail.com

Hello PG Experts!

I have restored a database cluster with recovery_target_time set to

recovery_target_time = '2013-07-27 21:20:17.127664+00'
recovery_target_inclusive = false

but the restore seems to have ended at a much earlier point in time (the 18th
rather than the 27th). Is there an explanation for this huge gap? Is that the
last 'consistent state'?
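
(For context: both parameters go into recovery.conf next to restore_command. A
minimal 9.1-style recovery.conf for this kind of point-in-time recovery might
look like the sketch below; the archive path is a placeholder. Note that
recovery_target_time is matched against transaction commit timestamps in the
WAL, so replay can only stop at a commit or abort record.)

restore_command = 'cp /path/to/wal_archive/%f %p'        # placeholder: how to fetch archived segments
recovery_target_time = '2013-07-27 21:20:17.127664+00'   # stop at the last commit before this time
recovery_target_inclusive = false                        # stop just before the target, not after it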

pg version:
xaxos_ch=> select version();

                                           version
----------------------------------------------------------------------------------------------
 PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit
(1 row)

This is probably somewhat irrelevant, as my question is more general.

thank you for your help!

klaus

#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Klaus Ita (#1)
Re: Recovery_target_time misinterpreted?

Klaus Ita wrote:

> I have restored a database cluster with recovery_target_time set to
>
> recovery_target_time = '2013-07-27 21:20:17.127664+00'
> recovery_target_inclusive = false
>
> but the restore seems to have ended at a much earlier point in time (the 18th rather
> than the 27th). Is there an explanation for this huge gap? Is that the last
> 'consistent state'?

Maybe the log entries created during restore can answer the question.
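
(Something like the following would pull out just the recovery-related lines;
the log path is an assumption based on a Debian default installation:)

grep -E 'recovery|redo|restored log file' /var/log/postgresql/postgresql-9.1-main.log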

Yours,
Laurenz Albe


#3 Klaus Ita
koki.eml@gmail.com
In reply to: Laurenz Albe (#2)
Re: Recovery_target_time misinterpreted?

2013-07-30 11:15:15 UTC <%> LOG: starting point-in-time recovery to 2013-07-27 21:20:17.127664+00
2013-07-30 11:15:15 UTC <%> LOG: restored log file "00000001000002300000005C" from archive
2013-07-30 11:15:15 UTC <%> LOG: restored log file "00000001000002300000005A" from archive
2013-07-30 11:15:15 UTC <%> LOG: redo starts at 230/5ACD7CC0
...
...
...
2013-07-30 14:28:45 UTC <%> LOG: restored log file "000000010000026400000002" from archive
2013-07-30 14:28:45 UTC <%> LOG: unexpected pageaddr 263/C706C000 in log file 612, segment 2, offset 442368
2013-07-30 14:28:45 UTC <%> LOG: redo done at 264/20698A8
2013-07-30 14:28:45 UTC <%> LOG: last completed transaction was at log time 2013-07-18 11:42:22.121512+00
2013-07-30 14:28:45 UTC <%> LOG: restored log file "000000010000026400000002" from archive
cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000002.history*': No such file or directory
mv: cannot stat `/tmp/00000002.history': No such file or directory
2013-07-30 14:28:45 UTC <%> LOG: selected new timeline ID: 2
cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000001.history*': No such file or directory
mv: cannot stat `/tmp/00000001.history': No such file or directory
2013-07-30 14:28:45 UTC <%> LOG: archive recovery complete
2013-07-30 14:29:09 UTC <%> LOG: autovacuum launcher started
2013-07-30 14:29:09 UTC <%> LOG: database system is ready to accept connections

Well, that does not indicate anything to me.

On Wed, Jul 31, 2013 at 9:37 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:

> Klaus Ita wrote:
>
>> I have restored a database cluster with recovery_target_time set to
>>
>> recovery_target_time = '2013-07-27 21:20:17.127664+00'
>> recovery_target_inclusive = false
>>
>> but the restore seems to have ended at a much earlier point in time (the 18th rather
>> than the 27th). Is there an explanation for this huge gap? Is that the last
>> 'consistent state'?
>
> Maybe the log entries created during restore can answer the question.
>
> Yours,
> Laurenz Albe

#4 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Klaus Ita (#3)
Re: Recovery_target_time misinterpreted?

Klaus Ita wrote:

>>> I have restored a database cluster with recovery_target_time set to
>>>
>>> recovery_target_time = '2013-07-27 21:20:17.127664+00'
>>> recovery_target_inclusive = false
>>>
>>> but the restore seems to have ended at a much earlier point in time (the 18th rather
>>> than the 27th). Is there an explanation for this huge gap? Is that the last
>>> 'consistent state'?
>>
>> Maybe the log entries created during restore can answer the question.
>
> 2013-07-30 11:15:15 UTC <%> LOG: starting point-in-time recovery to 2013-07-27 21:20:17.127664+00
> 2013-07-30 11:15:15 UTC <%> LOG: restored log file "00000001000002300000005C" from archive
> 2013-07-30 11:15:15 UTC <%> LOG: restored log file "00000001000002300000005A" from archive
> 2013-07-30 11:15:15 UTC <%> LOG: redo starts at 230/5ACD7CC0
> ...
> ...
> ...
> 2013-07-30 14:28:45 UTC <%> LOG: restored log file "000000010000026400000002" from archive
> 2013-07-30 14:28:45 UTC <%> LOG: unexpected pageaddr 263/C706C000 in log file 612, segment 2, offset 442368
> 2013-07-30 14:28:45 UTC <%> LOG: redo done at 264/20698A8
> 2013-07-30 14:28:45 UTC <%> LOG: last completed transaction was at log time 2013-07-18 11:42:22.121512+00
> 2013-07-30 14:28:45 UTC <%> LOG: restored log file "000000010000026400000002" from archive
> cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000002.history*': No such file or directory
> mv: cannot stat `/tmp/00000002.history': No such file or directory
> 2013-07-30 14:28:45 UTC <%> LOG: selected new timeline ID: 2
> cp: cannot stat `/var/tmp/xlogs_recovered_2013-07-30/wal_files/00000001.history*': No such file or directory
> mv: cannot stat `/tmp/00000001.history': No such file or directory
> 2013-07-30 14:28:45 UTC <%> LOG: archive recovery complete
> 2013-07-30 14:29:09 UTC <%> LOG: autovacuum launcher started
> 2013-07-30 14:29:09 UTC <%> LOG: database system is ready to accept connections
>
> Well, that does not indicate anything to me.

To me it indicates that log file 000000010000026400000002 might be corrupt.

PostgreSQL stops replaying WAL after it detects a corrupt WAL record.

Do you have a second copy of the WAL file?
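
(If an independently taken copy exists, a quick checksum comparison would show
whether the archived file itself got damaged; both paths below are placeholders:)

# compare the suspect segment against an independently taken copy
gunzip -c /archive-a/000000010000026400000002.gz | sha1sum
gunzip -c /archive-b/000000010000026400000002.gz | sha1sum
# differing sums point at a damaged copy; identical sums suggest the
# damage happened before the file was archived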

Yours,
Laurenz Albe


#5 Klaus Ita
koki.eml@gmail.com
In reply to: Laurenz Albe (#4)
Re: Recovery_target_time misinterpreted?

No. It's super frustrating: during recovery it reports reaching a consistent
recovery state, and I just cannot find a way to convince PostgreSQL to stop at
that state:

2013-08-02 09:23:25 GMT DEBUG: postgres: PostmasterMain: initial environment dump:
2013-08-02 09:23:25 GMT DEBUG: -----------------------------------------
2013-08-02 09:23:25 GMT DEBUG: PG_GRANDPARENT_PID=9077
2013-08-02 09:23:25 GMT DEBUG: PGLOCALEDIR=/usr/share/locale
2013-08-02 09:23:25 GMT DEBUG: PGSYSCONFDIR=/etc/postgresql-common
2013-08-02 09:23:25 GMT DEBUG: LANG=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG: PWD=/var/lib/postgresql
2013-08-02 09:23:25 GMT DEBUG: PGDATA=/var/lib/postgresql/9.1/main
2013-08-02 09:23:25 GMT DEBUG: LC_COLLATE=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG: LC_CTYPE=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG: LC_MESSAGES=en_US.utf8
2013-08-02 09:23:25 GMT DEBUG: LC_MONETARY=C
2013-08-02 09:23:25 GMT DEBUG: LC_NUMERIC=C
2013-08-02 09:23:25 GMT DEBUG: LC_TIME=C
2013-08-02 09:23:25 GMT DEBUG: -----------------------------------------

2013-08-02 11:23:26 CEST DEBUG: invoking IpcMemoryCreate(size=32399360)
2013-08-02 11:23:26 CEST DEBUG: removing file "pg_notify/0000"
2013-08-02 11:23:26 CEST DEBUG: max_safe_fds = 982, usable_fds = 1000, already_open = 8
2013-08-02 11:23:26 CEST LOG: database system was interrupted while in recovery at log time 2013-07-29 11:45:24 CEST
2013-08-02 11:23:26 CEST HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-08-02 11:23:26 CEST DEBUG: restore_command = 'cp /home/validad-pg-backups/pgmaster/wal_files/%f.gz /tmp/%f.gz && gunzip /tmp/%f.gz && mv /tmp/%f %p'
2013-08-02 11:23:26 CEST DEBUG: trigger_file = '/var/lib/postgresql/9.1/main/stop_replication_trigger'
2013-08-02 11:23:26 CEST LOG: starting archive recovery
2013-08-02 11:23:26 CEST DEBUG: executing restore command "cp /home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002C.gz /tmp/000000010000027A0000002C.gz && gunzip /tmp/000000010000027A0000002C.gz && mv /tmp/000000010000027A0000002C pg_xlog/RECOVERYXLOG"
2013-08-02 11:23:26 CEST DEBUG: forked new backend, pid=9090 socket=9
2013-08-02 11:23:26 CEST LOG: incomplete startup packet
2013-08-02 11:23:26 CEST DEBUG: shmem_exit(0): 0 callbacks to make
2013-08-02 11:23:26 CEST DEBUG: proc_exit(0): 1 callbacks to make
2013-08-02 11:23:26 CEST DEBUG: exit(0)
2013-08-02 11:23:26 CEST DEBUG: shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:26 CEST DEBUG: proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:26 CEST DEBUG: reaping dead processes
2013-08-02 11:23:26 CEST DEBUG: server process (PID 9090) exited with exit code 0
2013-08-02 11:23:26 CEST LOG: restored log file "000000010000027A0000002C" from archive
2013-08-02 11:23:26 CEST DEBUG: got WAL segment from archive
2013-08-02 11:23:26 CEST DEBUG: checkpoint record is at 27A/2CB77750
2013-08-02 11:23:26 CEST DEBUG: redo record is at 27A/2CB77750; shutdown TRUE
2013-08-02 11:23:26 CEST DEBUG: next transaction ID: 0/381985248; next OID: 1201662
2013-08-02 11:23:26 CEST DEBUG: next MultiXactId: 130079; next MultiXactOffset: 272843
2013-08-02 11:23:26 CEST DEBUG: oldest unfrozen transaction ID: 197713560, in database 331065
2013-08-02 11:23:26 CEST DEBUG: transaction ID wrap limit is 2345197207, limited by database with OID 331065
2013-08-02 11:23:26 CEST DEBUG: resetting unlogged relations: cleanup 1 init 0
2013-08-02 11:23:26 CEST LOG: redo starts at 27A/2CB777A8
2013-08-02 11:23:26 CEST DEBUG: executing restore command "cp /home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002D.gz /tmp/000000010000027A0000002D.gz && gunzip /tmp/000000010000027A0000002D.gz && mv /tmp/000000010000027A0000002D pg_xlog/RECOVERYXLOG"
2013-08-02 11:23:27 CEST DEBUG: forked new backend, pid=9098 socket=9
2013-08-02 11:23:27 CEST FATAL: the database system is starting up
2013-08-02 11:23:27 CEST DEBUG: shmem_exit(1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: proc_exit(1): 1 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: exit(1)
2013-08-02 11:23:27 CEST DEBUG: shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: reaping dead processes
2013-08-02 11:23:27 CEST DEBUG: server process (PID 9098) exited with exit code 1
2013-08-02 11:23:27 CEST LOG: restored log file "000000010000027A0000002D" from archive
2013-08-02 11:23:27 CEST DEBUG: got WAL segment from archive
2013-08-02 11:23:27 CEST DEBUG: executing restore command "cp /home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002E.gz /tmp/000000010000027A0000002E.gz && gunzip /tmp/000000010000027A0000002E.gz && mv /tmp/000000010000027A0000002E pg_xlog/RECOVERYXLOG"
2013-08-02 11:23:27 CEST DEBUG: forked new backend, pid=9105 socket=9
2013-08-02 11:23:27 CEST FATAL: the database system is starting up
2013-08-02 11:23:27 CEST DEBUG: shmem_exit(1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: proc_exit(1): 1 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: exit(1)
2013-08-02 11:23:27 CEST DEBUG: shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:27 CEST DEBUG: reaping dead processes
2013-08-02 11:23:27 CEST DEBUG: server process (PID 9105) exited with exit code 1
2013-08-02 11:23:28 CEST LOG: restored log file "000000010000027A0000002E" from archive
2013-08-02 11:23:28 CEST DEBUG: got WAL segment from archive
2013-08-02 11:23:28 CEST LOG: consistent recovery state reached at 27A/2E3F42E8
2013-08-02 11:23:28 CEST PANIC: _bt_restore_page: cannot add item to page
2013-08-02 11:23:28 CEST CONTEXT: xlog redo split_r: rel 1663/16405/797541 left 4743, right 18008, next 9681, level 0, firstright 194
2013-08-02 11:23:28 CEST DEBUG: reaping dead processes
2013-08-02 11:23:28 CEST LOG: startup process (PID 9086) was terminated by signal 6: Aborted
2013-08-02 11:23:28 CEST LOG: terminating any other active server processes
2013-08-02 11:23:28 CEST DEBUG: sending SIGQUIT to process 9092
2013-08-02 11:23:28 CEST DEBUG: shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:28 CEST DEBUG: proc_exit(-1): 0 callbacks to make
2013-08-02 11:23:28 CEST DEBUG: reaping dead processes
2013-08-02 11:23:28 CEST DEBUG: shmem_exit(1): 3 callbacks to make
2013-08-02 11:23:28 CEST DEBUG: proc_exit(1): 3 callbacks to make
2013-08-02 11:23:28 CEST DEBUG: exit(1)
2013-08-02 11:23:28 CEST DEBUG: shmem_exit(-1): 0 callbacks to make
2013-08-02 11:23:28 CEST DEBUG: proc_exit(-1): 0 callbacks to make

This run is with the debug5 log level.

I have three different copies of the WAL files, but they were all generated by
pg in the same way: the backup script always takes the original and copies it
to three different locations. So the sha1sum of the files is always the same :(
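
(Since all three copies descend from the same source file, matching sha1sums
cannot rule out damage that happened before archiving. One cheap sanity check,
assuming the default 16 MB WAL segment size, is that every archived segment
should decompress to exactly 16777216 bytes:)

gunzip -c /home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002E.gz | wc -c   # expect 16777216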

I don't mind that so much; we have recovered some backups and replayed from
another standby system. But I need to understand why I cannot reactivate this
cluster at this state:

'2013-08-02 11:23:28 CEST LOG: consistent recovery state reached at 27A/2E3F42E8'
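
(For reference: 9.1 has no way to name "the first consistent point" as a
recovery target. Later releases, 9.4 and up, accept exactly that; a sketch of
the recovery.conf line, not applicable to 9.1:)

recovery_target = 'immediate'   # 9.4+ only: stop recovery as soon as consistency is reached

(On 9.1 the nearest approximation would be a recovery_target_time or
recovery_target_xid just past the consistent point, but no target helps here:
the PANIC aborts redo before any target check is reached.)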

I am posting to -bugs too, as I am starting to think this is one.

Regards,
k