Hot Standby has PANIC: WAL contains references to invalid pages
Hi All,
We are having a thorny problem I'm hoping someone will be able to help with.
We have a pair of machines set up as an active / hot standby pair. The database they contain is quite large, approx. 9TB. They were working fine on 9.1, and we recently upgraded the active DB to 9.2.1.
After upgrading the active DB, we re-mirrored the standby (using pg_basebackup) and started it up. It began replaying the WAL files as expected.
After a few hours this happened:
WARNING: page 1 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1123460086 is uninitialized
CONTEXT: xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
PANIC: WAL contains references to invalid pages
CONTEXT: xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
LOG: startup process (PID 24195) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
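When comparing several of these failures it can help to pull the record type, relation and block out of the CONTEXT line. The following is only a minimal sketch: the regular expression is written against the two CONTEXT lines quoted in this thread, and the function name and dictionary keys are hypothetical labels, not anything PostgreSQL provides.

```python
import re

# Matches CONTEXT lines of the form quoted in this thread, e.g.
#   "xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0"
# The three numbers after "rel" are tablespace OID / database OID / relfilenode.
REDO_CONTEXT = re.compile(
    r"xlog redo (?P<record>\w+): "
    r"rel (?P<spc>\d+)/(?P<db>\d+)/(?P<rel>\d+); "
    r"blk (?P<blk>\d+)"
)


def parse_redo_context(line):
    """Return the redo record type, relation identifiers and block, or None."""
    m = REDO_CONTEXT.search(line)
    if m is None:
        return None
    return {
        "record": m.group("record"),
        "tablespace_oid": int(m.group("spc")),
        "database_oid": int(m.group("db")),
        "relfilenode": int(m.group("rel")),
        "block": int(m.group("blk")),
    }


line = "CONTEXT: xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0"
print(parse_redo_context(line))
```

Running this over the log of each attempt makes it easy to see whether the startup process dies on the same record type and block every time.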
We tried starting it up again and the same thing happened.
After some googling and re-reading the release notes, we noticed the mention in the 9.2.1 release notes about the potential for corrupted visibility maps, so as per the recommendation we did a full VACUUM of the whole database (with vacuum_freeze_table_age set to zero), then re-mirrored the standby again.
After re-mirroring was completed we started the standby again. Strangely it reached consistency after only 33 WAL files - since the base backup took 5 days to complete this does not seem right to me. Anyway, WAL recovery continued, with occasional warnings like this:
[2013-02-04 10:30:51 EST] 13546@ WARNING: xlog min recovery request 1A13A/9BC425A0 is past current point 19F1E/725043E8
[2013-02-04 10:30:51 EST] 13546@ CONTEXT: writing block 0 of relation pg_tblspc/16408/PG_9.2_201204301/16409/12525_vm
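To get a feel for how far apart the two WAL locations in that warning are, they can be converted to integers and compared. This is a rough sketch only: `parse_lsn` is a hypothetical helper, and treating the location as a flat 64-bit byte position is an assumption for 9.2, so the gap should be read as an estimate. The ordering comparison itself does not depend on that assumption.

```python
def parse_lsn(text):
    """Convert an 'X/Y' WAL location (two hex numbers) into a single integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


requested = parse_lsn("1A13A/9BC425A0")  # "xlog min recovery request"
current = parse_lsn("19F1E/725043E8")    # "current point"

# True means the requested minimum recovery point is ahead of replay.
print(requested > current)

# Rough size of the gap, assuming a flat 64-bit byte position.
gap = requested - current
print(f"approximate gap: {gap / 2**40:.2f} TiB")
```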
After a few hours, this happened:
[2013-02-04 13:43:24 EST] 13538@ WARNING: page 1248 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1128746393 does not exist
[2013-02-04 13:43:24 EST] 13538@ CONTEXT: xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:24 EST] 13538@ PANIC: WAL contains references to invalid pages
[2013-02-04 13:43:24 EST] 13538@ CONTEXT: xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:25 EST] 13532@ LOG: startup process (PID 13538) was terminated by signal 6: Aborted
[2013-02-04 13:43:25 EST] 13532@ LOG: terminating any other active server processes
Looks similar to the first case, but a different context. We thought that perhaps an index had become corrupted (apparently also a possibility with the bug mentioned above) however the file mentioned belongs to a normal table, not an index. And 'redo visible' sounds like it might be to do with the visibility map?
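For anyone wanting to repeat the table-versus-index check, here is a minimal sketch that splits a relation path from the log into its parts and builds a catalog query for it. The helper names are hypothetical. The query relies on the `relfilenode` and `relkind` columns of `pg_class` ('r' is an ordinary table, 'i' an index) and has to be run while connected to the database whose OID appears in the path.

```python
import re

# Layout seen in the log lines above:
#   pg_tblspc/<tablespace oid>/<version dir>/<database oid>/<relfilenode>[_<fork>]
TBLSPC_PATH = re.compile(
    r"pg_tblspc/(?P<spc>\d+)/(?P<verdir>PG_[^/]+)/(?P<db>\d+)/"
    r"(?P<node>\d+)(?:_(?P<fork>\w+))?$"
)


def split_relation_path(path):
    """Split a tablespace-relative relation path into its components."""
    m = TBLSPC_PATH.search(path)
    if m is None:
        raise ValueError(f"unrecognised relation path: {path!r}")
    return {
        "tablespace_oid": int(m.group("spc")),
        "version_dir": m.group("verdir"),
        "database_oid": int(m.group("db")),
        "relfilenode": int(m.group("node")),
        # No suffix means the main fork; "_vm" is the visibility map.
        "fork": m.group("fork") or "main",
    }


def catalog_lookup_sql(relfilenode):
    """Build a query showing which relation owns a relfilenode, and its kind."""
    return (
        "SELECT relname, relkind FROM pg_class "
        f"WHERE relfilenode = {int(relfilenode)};"
    )


parts = split_relation_path("pg_tblspc/16408/PG_9.2_201204301/16409/1128746393")
print(parts)
print(catalog_lookup_sql(parts["relfilenode"]))
```

The same function applied to the earlier `12525_vm` path reports the fork as `vm`, which is how the visibility map files are named.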
We restarted it again with debugging cranked up. It didn't reveal anything more interesting. We then upgraded the standby to 9.2.2 and started it again. Again no dice. In each case it fails at exactly the same point with the same error.
Any ideas for a next troubleshooting step?
Regards // Mike
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tuesday, February 05, 2013 6:05 AM Michael Harris wrote:
Any ideas for a next troubleshooting step?
This may be the problem discussed in the thread "[BUG?] lag of minRecoveryPont in archive recovery", which has been fixed recently.
Please check the following link for more details. It may help.
/messages/by-id/20121206.130458.170549097.horiguchi.kyotaro@lab.ntt.co.jp
Regards,
Hari babu.
Hi Hari,
Thanks for the tip. We tried applying that patch, but the error recurred exactly as before.
Regards // Mike
Maybe pg_basebackup can't handle such a big database. Try
rsync, pg_start_backup, rsync, pg_stop_backup; it always works fine for us. Our
instance is about 2TB and we use pg9.1.x.
jov
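The sequence suggested above could be written down roughly as follows. This is a sketch under stated assumptions, not a tested procedure: the directory arguments, the backup label and the rsync options are illustrative choices, and the function only builds the command strings in order without running anything. `pg_start_backup()` and `pg_stop_backup()` are the backup control functions in the 9.x releases discussed here.

```python
def rsync_base_backup_steps(primary_data_dir, standby_target, label="standby_rebuild"):
    """Return, in order, the commands for an rsync-based base backup.

    Nothing is executed. Paths and label are hypothetical placeholders.
    """
    rsync = (
        "rsync -a --delete --exclude=postmaster.pid "
        f"{primary_data_dir}/ {standby_target}/"
    )
    return [
        rsync,  # first pass, before backup mode, moves the bulk of the data
        f"psql -c \"SELECT pg_start_backup('{label}');\"",
        rsync,  # second pass, in backup mode, copies what changed meanwhile
        "psql -c \"SELECT pg_stop_backup();\"",
    ]


for step in rsync_base_backup_steps("/pgdata", "standby:/pgdata"):
    print(step)
```

Tablespaces outside the data directory, such as the one under pg_tblspc in the logs earlier in this thread, would need the same two rsync passes for their own directories.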
On Thu, Feb 7, 2013 at 7:39 AM, amutu <zhao6014@gmail.com> wrote:
Maybe pg_basebackup can't handle such a big database. Try
rsync, pg_start_backup, rsync, pg_stop_backup; it always works fine for us. Our
instance is about 2TB and we use pg9.1.x.
It really should handle that without problem, but sure, it might be
worth trying that one. If you can show that the problem is in
pg_basebackup, that's a very clear bug (either in pg_basebackup or in
the backend supporting code), so that would be good to know.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Hi,
We suspect the problem is not in that area, because we used pg_basebackup on the same database pair under 9.1 and did not have any such problems.
Looking at the context of the crashes, they seem to relate to the handling of visibility maps during WAL replay. Going by the 9.2 release notes, that is an area that was changed in 9.2 to allow index-only scans.
Also, we can see that 9.2.3 has been released now and has a number of fixes relating to WAL replay, so we have decided to try again using that. We will scrub the standby and make a fresh copy using pg_basebackup. If that doesn't work then we may try using rsync instead.
We'll let you all know the result.
Regards // Mike
Hi,
I am pleased to be able to report that the problem seems to be fixed after upgrading to 9.2.3.
We upgraded only the standby server to 9.2.3, rebuilt the standby using pg_basebackup, and then started it up. It replayed all the outstanding WAL files with no problems.
Regards // Mike