BUG #8543: Standby recovery use incorrect timeline to determine WAL length
The following bug has been logged on the website:
Bug reference: 8543
Logged by: Alfonso Vicente
Email address: alfonso.vicente@logos.com.uy
PostgreSQL version: 9.1.3
Operating system: OpenSUSE 11
Description:
Environment: PostgreSQL 9.1.3 / OpenSUSE 11
The use case is a switchover and switchback test, using a standby cluster
with Streaming Replication. We have 2 datacenters: main-site and
backup-site
The cluster in main-site has timeline ID 1, and we perform the switchover to
backup-site without any trouble. When we execute the pg_ctl promote, the
cluster sets the new timeline ID 2 (OK).
But... we don't stop the cluster in main-site. So, the cluster in main-site
continue to archive logs in /archlogs directory, and archives the WAL
00000001000000FE00000002 of 16MB.
Immediately, we perform a backup in the backup-site, to prepare for
switchback. The pg_stop_backup() force a WAL switch, and generate a WAL
00000002000000FE00000002 of 2MB. Then, the archive_command script perform
the log shipping to the main-site and store the 00000002000000FE00000002 in
the same /archlogs directory
When we execute the pg_ctl promote in the main-site, the postgresql.log show
the following error:
2013-10-19 22:39:59 UYST LOG: database system was shut down in recovery
at 2013-10-19 21:47:07 UYST
sh: /home/postgres/9.1/backups/archlogs/00000002.history.gz: No existe el
fichero o el directorio
2013-10-19 22:39:59 UYST LOG: starting archive recovery
2013-10-19 22:39:59 UYST FATAL: archive file "00000002000000FE00000002"
has wrong size: 2293760 instead of 16777216
2013-10-19 22:39:59 UYST LOG: startup process (PID 10636) exited with
exit code 1
2013-10-19 22:39:59 UYST LOG: aborting startup due to startup process
failure
The WAL 00000002000000FE00000002 is of 2293760 bytes, this is not an error.
Why PostgreSQL think WAL 00000002000000FE00000002 should be of 16MB ? ... I
think the cause can be the WAL 00000001000000FE00000002 (timeline ID 1)
stored in the /archlogs directory
If we drop all the timeline ID 1 archived WALs, the failback executes
succesfully
There is an easy workaround, but please, confirm if this is a bug. The
documentation says: "The default behavior of recovery is to recover along
the same timeline that was current when the base backup was taken". The base
backup in backup-site was timeline ID 2
Thanks in advance
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On Mon, Oct 21, 2013 at 11:31 PM, <alfonso.vicente@logos.com.uy> wrote:
PostgreSQL version: 9.1.3
This version is outdated, and 9.1.9 even contains important security
fixes, so you should update to the latest minor release of 9.1 which
is 9.1.10.
But... we don't stop the cluster in main-site. So, the cluster in main-site
continue to archive logs in /archlogs directory, and archives the WAL
00000001000000FE00000002 of 16MB.
This is fine, the old and new WAL files are not going to overlap
because of their different timeline ID.
Immediately, we perform a backup in the backup-site, to prepare for
switchback. The pg_stop_backup() force a WAL switch, and generate a WAL
00000002000000FE00000002 of 2MB. Then, the archive_command script perform
the log shipping to the main-site and store the 00000002000000FE00000002 in
the same /archlogs directory
As far as I recall, the size of the WAL segment archived after
pg_stop_backup should be 16MB. Isn't this a problem?
When we execute the pg_ctl promote in the main-site, the postgresql.log show
the following error:
sh: /home/postgres/9.1/backups/archlogs/00000002.history.gz: No existe el
fichero o el directorio
2013-10-19 22:39:59 UYST LOG: starting archive recovery
2013-10-19 22:39:59 UYST FATAL: archive file "00000002000000FE00000002"
has wrong size: 2293760 instead of 16777216If we drop all the timeline ID 1 archived WALs, the failback executes
succesfully
9.1.8 and 9.1.10 include a couple fixes related to timeline handling
of WAL files and detection of a database consistency point. Hence I
would recommend trying once again with a latest version.
The documentation says: "The default behavior of recovery is to recover along
the same timeline that was current when the base backup was taken". The base
backup in backup-site was timeline ID 2
True.
--
Michael
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs