BUG #13739: Recurring corrupted page pointer panics on hot-standby replica

Started by Michael Robinson, over 10 years ago, 3 messages, bugs
#1 Michael Robinson
michael@snupps.com

The following bug has been logged on the website:

Bug reference: 13739
Logged by: Michael Robinson
Email address: michael@snupps.com
PostgreSQL version: 9.4.4
Operating system: Ubuntu 14.04
Description:

Three days ago, we started getting corrupted page pointer panics on a hot
standby replica (logs below).

The replica is running on a dedicated EC2 instance, and has been running
without any problems for several months. The build version is
9.4.4-1.pgdg14.04+1 from the apt repository, running on Ubuntu 14.04 Trusty.
The database is around 440GB, and is under constant moderate read-only load
(100-1000 queries per second).

There have been no issues with the master database, nor have there been any
database shutdowns other than the panics.

2015-10-24 14:16:46.489 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-24 14:16:46.490 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/254063; dead 11796080; left 1365037; right 3024097; btpo_xact
64542957; leaf 2456241; leafleft 11130443; leafright 1350594; topparent
4294967295
2015-10-26 04:51:40.530 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-26 04:51:40.530 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/254063; dead 9922828; left 2449142; right 3415026; btpo_xact
64982371; leaf 2290440; leafleft 5120238; leafright 1903321; topparent
4294967295
2015-10-26 10:24:02.613 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-26 10:24:02.613 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/401628; dead 2348571; left 2348281; right 2351431; btpo_xact
65010718; leaf 2348740; leafleft 2348434; leafright 2351568; topparent
4294967295
2015-10-26 15:19:01.151 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-26 15:19:01.151 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/254063; dead 5840652; left 8909980; right 4074914; btpo_xact
65065726; leaf 3644712; leafleft 5129511; leafright 2786892; topparent
4294967295
2015-10-26 15:23:28.954 UTC LOG: unexpected pageaddr FD0/8738C000 in log
segment 0000000100000FD0000000FD, offset 3719168
2015-10-26 15:26:03.937 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-26 15:26:03.937 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/254063; dead 5740451; left 9572720; right 1029949; btpo_xact
65067057; leaf 2948357; leafleft 5225678; leafright 805064; topparent
4294967295
2015-10-26 15:28:24.802 UTC LOG: unexpected pageaddr FD0/BBD18000 in log
segment 0000000100000FD100000027, offset 13729792
2015-10-26 21:20:00.019 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-26 21:20:00.019 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/401628; dead 2706485; left 2706197; right 2706779; btpo_xact
65166073; leaf 2706583; leafleft 2706465; leafright 2706652; topparent
4294967295
2015-10-27 08:43:54.211 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-27 08:43:54.211 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/401628; dead 5287405; left 5287118; right 5287692; btpo_xact
65266215; leaf 5287560; leafleft 5287263; leafright 5287575; topparent
4294967295

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2 Michael Robinson
michael@snupps.com
In reply to: Michael Robinson (#1)
Re: BUG #13739: Recurring corrupted page pointer panics on hot-standby replica

I am somewhat surprised that BUG #13668 merits a response, and this one
does not.

Not a bug?
Won't fix?
Insufficient information?
Go away stop bothering us?
Something else?
Anything?

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Robinson (#2)
Re: Re: BUG #13739: Recurring corrupted page pointer panics on hot-standby replica

Michael Robinson wrote:

I am somewhat surprised that BUG #13668 merits a response, and this one
does not.

Clearly, that one was much easier to answer than yours.

Not a bug?
Won't fix?
Insufficient information?
Go away stop bothering us?
Something else?
Anything?

Most likely, it's either a bug in a WAL record or its replay code, or the
platform has failed to keep its promises (flipped bits, etc.).

On Tue, Oct 27, 2015 at 09:22:08AM +0000, michael(at)snupps(dot)com wrote:

2015-10-24 14:16:46.489 UTC PANIC: corrupted page pointers: lower = 17,
upper = 0, special = 8176
2015-10-24 14:16:46.490 UTC CONTEXT: xlog redo unlink_page: rel
1663/16416/254063; dead 11796080; left 1365037; right 3024097; btpo_xact
64542957; leaf 2456241; leafleft 11130443; leafright 1350594; topparent
4294967295

The code expects "upper" to be at least as large as "lower", per this check
in PageAddItem:

    /*
     * Be wary about corrupted page pointers
     */
    if (phdr->pd_lower < SizeOfPageHeaderData ||
        phdr->pd_lower > phdr->pd_upper ||
        phdr->pd_upper > phdr->pd_special ||
        phdr->pd_special > BLCKSZ)
        ereport(PANIC,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
                        phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));

It's likely that some previous operation set the pd_upper value to 0 --
maybe replay of an earlier WAL record.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
