12.3 replicas falling over during WAL redo

Started by Ben · over 5 years ago · 19 messages · general
#1Ben
bench@silentmedia.com

We have a few hundred postgres servers in AWS EC2, all of which do
streaming replication to at least two replicas. As we've transitioned
our fleet from 9.5 to 12.3, we've noticed an alarming increase in the
frequency of a streaming replica dying during replay. Postgres will log
something like:

2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1

...or:

2020-07-31T10:43:35.520066+00:00 hostB postgres[11836]: [12-1] db=,user= WARNING: will not overwrite a used ItemId
2020-07-31T10:43:35.520144+00:00 hostB postgres[11836]: [12-2] db=,user= CONTEXT: WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0
2020-07-31T10:43:35.520180+00:00 hostB postgres[11836]: [13-1] db=,user= PANIC: failed to add tuple
2020-07-31T10:43:35.520220+00:00 hostB postgres[11836]: [13-2] db=,user= CONTEXT: WAL redo at AD1/6948E9E0 for Heap/HOT_UPDATE: off 4 xmax 128308492 flags 0x10 ; new off 2 xmax 0

...or:

2020-07-30T14:59:36.839243+00:00 hostC postgres[24338]: [45253-1] db=,user= WARNING: specified item offset is too large
2020-07-30T14:59:36.839307+00:00 hostC postgres[24338]: [45253-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 hostC postgres[24338]: [45254-1] db=,user= PANIC: btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 hostC postgres[24338]: [45254-2] db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 hostC postgres[24337]: [11-1] db=,user= LOG: startup process (PID 24338) was terminated by signal 6: Aborted

Each time, a simple restart of the postgres service will bring the
database back to a happy state and it will merrily replicate past the
LSN where it had died before. This has never (yet) happened on a primary
db, and (so far) always on only one of the replicas the primary is
replicating to, leading me to think there isn't anything actually wrong
with my data. Still, this is no way to run a database. We never saw this
problem in 9.5, but it happened 3 times just on Friday. We have taken
the opportunity to enable checksumming with our move to 12, but to my
untrained eye this doesn't appear to be related to that.

A lot of suggestions I've heard to fix problems which sound like this
involve reindexing, and while we haven't yet tried that, it doesn't seem
likely to help. We upgraded from 9.5 to 12 using pglogical, so all our
indices were created from 12.3 code. That said, we were running
pg_repack in an automated fashion for a bit, which seemed to be causing
issues in 12 that we haven't had time to track down and so have
currently disabled.

With the non-durability of the problem, I'm out of ideas of what to look
for. Suggestions?

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ben (#1)
Re: 12.3 replicas falling over during WAL redo

On 2020-Aug-01, Ben Chobot wrote:

We have a few hundred postgres servers in AWS EC2, all of which do streaming
replication to at least two replicas. As we've transitioned our fleet
from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a
streaming replica dying during replay. Postgres will log something like:

2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1

I've never seen this one.

Can you find out what the index is being modified by those LSNs -- is it
always the same index? Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Ben
bench@silentmedia.com
In reply to: Alvaro Herrera (#2)
Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera wrote on 8/1/20 9:35 AM:

On 2020-Aug-01, Ben Chobot wrote:

We have a few hundred postgres servers in AWS EC2, all of which do streaming
replication to at least two replicas. As we've transitioned our fleet
from 9.5 to 12.3, we've noticed an alarming increase in the frequency of a
streaming replica dying during replay. Postgres will log something like:

2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1

I've never seen this one.

Can you find out what the index is being modified by those LSNs -- is it
always the same index? Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

I'd be happy to, if you tell me how. :)

We're using xfs for our postgres filesystem, on ubuntu bionic. Of course
it's always possible there's something wrong in the filesystem or the
EBS layer, but that is one thing we have not changed in the migration
from 9.5 to 12.3.

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ben (#3)
Re: 12.3 replicas falling over during WAL redo

At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot <bench@silentmedia.com> wrote in

Alvaro Herrera wrote on 8/1/20 9:35 AM:

On 2020-Aug-01, Ben Chobot wrote:

We have a few hundred postgres servers in AWS EC2, all of which do
streaming replication to at least two replicas. As we've transitioned
our fleet from 9.5 to 12.3, we've noticed an alarming increase in the
frequency of a streaming replica dying during replay. Postgres will log
something like:

2020-07-31T16:55:22.602488+00:00 hostA postgres[31875]: [19137-1] db=,user= LOG: restartpoint starting: time
2020-07-31T16:55:24.637150+00:00 hostA postgres[24076]: [15754-1] db=,user= FATAL: incorrect index offsets supplied
2020-07-31T16:55:24.637261+00:00 hostA postgres[24076]: [15754-2] db=,user= CONTEXT: WAL redo at BCC/CB7AF8B0 for Btree/VACUUM: lastBlockVacuumed 1720
2020-07-31T16:55:24.642877+00:00 hostA postgres[24074]: [8-1] db=,user= LOG: startup process (PID 24076) exited with exit code 1

I've never seen this one.

Can you find out what the index is being modified by those LSNs -- is
it
always the same index? Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

I'd be happy to, if you tell me how. :)

We're using xfs for our postgres filesystem, on ubuntu bionic. Of
course it's always possible there's something wrong in the filesystem
or the EBS layer, but that is one thing we have not changed in the
migration from 9.5 to 12.3.

All of the cited log lines seem to suggest a relation to deleted btree
page items. One possibility I can guess at: that can happen if the pages
were flushed out during a vacuum after the last checkpoint and
full-page writes didn't restore the page to the state before the
index-item deletion happened (that is, if full_page_writes were set to
off). (If that is found to be the cause, I'm not sure why it didn't
happen on 9.5.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Ben
bench@silentmedia.com
In reply to: Kyotaro Horiguchi (#4)
Re: 12.3 replicas falling over during WAL redo

Kyotaro Horiguchi wrote on 8/2/20 9:39 PM:

At Sat, 1 Aug 2020 09:58:05 -0700, Ben Chobot <bench@silentmedia.com> wrote in
All of the cited log lines seem to suggest a relation to deleted btree
page items. One possibility I can guess at: that can happen if the pages
were flushed out during a vacuum after the last checkpoint and
full-page writes didn't restore the page to the state before the
index-item deletion happened (that is, if full_page_writes were set to
off). (If that is found to be the cause, I'm not sure why it didn't
happen on 9.5.)

regards.

We have always had full_page_writes enabled.

#6Peter Geoghegan
In reply to: Kyotaro Horiguchi (#4)
Re: 12.3 replicas falling over during WAL redo

On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

All of the cited log lines seem to suggest a relation to deleted btree
page items. One possibility I can guess at: that can happen if the pages
were flushed out during a vacuum after the last checkpoint and
full-page writes didn't restore the page to the state before the
index-item deletion happened (that is, if full_page_writes were set to
off). (If that is found to be the cause, I'm not sure why it didn't
happen on 9.5.)

There is also a Heap/HOT_UPDATE log line with similar errors.

--
Peter Geoghegan

#7Ben
bench@silentmedia.com
In reply to: Ben (#3)
Re: 12.3 replicas falling over during WAL redo

Ben Chobot wrote on 8/1/20 9:58 AM:

Alvaro Herrera wrote on 8/1/20 9:35 AM:

On 2020-Aug-01, Ben Chobot wrote:

Can you find out what the index is being modified by those LSNs -- is it
always the same index?  Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

One possibility is that the storage forgot a previous write.

So here is me fumbling how to do this. Hopefully this is useful?

The error for this host was:

2020-07-30T14:59:36.839243+00:00 host postgres[24338]: [45253-1]
db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 host postgres[24338]: [45253-2]
db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 host postgres[24338]: [45254-1]
db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 host postgres[24338]: [45254-2]
db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 host postgres[24337]: [11-1] db=,user=
LOG:  startup process (PID 24338) was terminated by signal 6: Aborted
2020-07-30T14:59:37.587266+00:00 host postgres[24337]: [12-1] db=,user=
LOG:  terminating any other active server processes

So, starting from a filesystem snapshot where the db had crashed and not
yet been started, I did:

/usr/lib/postgresql/12/bin/pg_waldump -p
/var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB
0000000100000A0A000000AD | grep A0A/AC4204A0

This shows me:

rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0:
rel 16605/16613/60529051 blk 6501

So then I did:

/usr/lib/postgresql/12/bin/pg_waldump -p
/var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB
0000000100000A0A000000AD | grep 16605/16613/60529051
rmgr: Btree       len (rec/tot):     53/   813, tx:   76393363, lsn:
A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0:
rel 16605/16613/60529051 blk 6089 FPW
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393390, lsn:
A0A/AB2BE870, prev A0A/AB2BD6E0, desc: INSERT_LEAF off 128, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0:
rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393992, lsn:
A0A/AB5FA848, prev A0A/AB5FA7F0, desc: INSERT_LEAF off 133, blkref #0:
rel 16605/16613/60529051 blk 5999
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394938, lsn:
A0A/ABE49738, prev A0A/ABE47AF8, desc: INSERT_LEAF off 189, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394977, lsn:
A0A/ABEB0330, prev A0A/ABEB02D8, desc: INSERT_LEAF off 204, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0:
rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396075, lsn:
A0A/AC427518, prev A0A/AC4274C0, desc: INSERT_LEAF off 211, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     53/  2529, tx:   76396250, lsn:
A0A/AC5352F8, prev A0A/AC533A00, desc: INSERT_LEAF off 87, blkref #0:
rel 16605/16613/60529051 blk 5639 FPW
pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource
manager ID 110 at A0A/AC5411E0

...and I have no idea what I'm looking at. I assume/hope the error at
the end is due to the db shutting down, and nothing to be particularly
worried about?

I have the specific index that postgres tripped on here and I'll spend
some time today trying to see if the other crashes were the same index.

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ben (#7)
Re: 12.3 replicas falling over during WAL redo

On 2020-Aug-03, Ben Chobot wrote:

rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
16605/16613/60529051 blk 6501

So then I did:

/usr/lib/postgresql/12/bin/pg_waldump -p /var/lib/postgresql/12/main/pg_wal/
0000000100000A0A000000AB 0000000100000A0A000000AD | grep
16605/16613/60529051

Yep. Looking at the ones in block 6501,

rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
16605/16613/60529051 blk 6501

rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
16605/16613/60529051 blk 6501

My question was whether the block has received the update that added the
item in offset 41; that is, is the LSN in the crashed copy of the page
equal to A0A/AB2C43D0? If it's an older value, then the write above was
lost for some reason.

pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource
manager ID 110 at A0A/AC5411E0

...and I have no idea what I'm looking at. I assume/hope the error at the
end is due to the db shutting down, and nothing to be particularly worried
about?

Yeah.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Ben
bench@silentmedia.com
In reply to: Alvaro Herrera (#8)
Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera wrote on 8/3/20 12:34 PM:

On 2020-Aug-03, Ben Chobot wrote:

Yep. Looking at the ones in block 6501,

rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
16605/16613/60529051 blk 6501

My question was whether the block has received the update that added the
item in offset 41; that is, is the LSN in the crashed copy of the page
equal to A0A/AB2C43D0? If it's an older value, then the write above was
lost for some reason.

How do I tell?

#10Ben
bench@silentmedia.com
In reply to: Peter Geoghegan (#6)
Re: 12.3 replicas falling over during WAL redo

Peter Geoghegan wrote on 8/3/20 11:25 AM:

On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

All of the cited log lines seem to suggest a relation to deleted btree
page items. One possibility I can guess at: that can happen if the pages
were flushed out during a vacuum after the last checkpoint and
full-page writes didn't restore the page to the state before the
index-item deletion happened (that is, if full_page_writes were set to
off). (If that is found to be the cause, I'm not sure why it didn't
happen on 9.5.)

There is also a Heap/HOT_UPDATE log line with similar errors.

Yes, and I have the pg_waldump output for it. But, that table is quite
large, and the transaction that contains the LSN in the error log is
1,752 waldump lines long. I'm happy to share what would be useful to
help debug it but I'm guessing it should be a subset of that.

#11Ben
bench@silentmedia.com
In reply to: Ben (#7)
Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera wrote on 8/1/20 9:35 AM:

On 2020-Aug-01, Ben Chobot wrote:

Can you find out what the index is being modified by those LSNs --
is it
always the same index?  Can you have a look at nearby WAL records that
touch the same page of the same index in each case?

They turn out to be different indices on different tables.

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ben (#9)
Re: 12.3 replicas falling over during WAL redo

On 2020-Aug-03, Ben Chobot wrote:

Alvaro Herrera wrote on 8/3/20 12:34 PM:

On 2020-Aug-03, Ben Chobot wrote:

Yep. Looking at the ones in block 6501,

rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
16605/16613/60529051 blk 6501

My question was whether the block has received the update that added the
item in offset 41; that is, is the LSN in the crashed copy of the page
equal to A0A/AB2C43D0? If it's an older value, then the write above was
lost for some reason.

How do I tell?

You can use pageinspect's page_header() function to obtain the page's
LSN. You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

then put that binary file in a bytea column, perhaps like

create table page (raw bytea);
insert into page select pg_read_binary_file('/tmp/page');

and with that you can run page_header:

create extension pageinspect;
select h.* from page, page_header(raw) h;

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Peter Geoghegan
In reply to: Alvaro Herrera (#12)
Re: 12.3 replicas falling over during WAL redo

On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

You can use pageinspect's page_header() function to obtain the page's
LSN. You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

Ben might find this approach to dumping out a single page image
easier, since it doesn't involve relfilenodes or filesystem files:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

--
Peter Geoghegan

#14Ben
bench@silentmedia.com
In reply to: Peter Geoghegan (#13)
Re: 12.3 replicas falling over during WAL redo

Peter Geoghegan wrote on 8/3/20 3:04 PM:

On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

You can use pageinspect's page_header() function to obtain the page's
LSN. You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

Ben might find this approach to dumping out a single page image
easier, since it doesn't involve relfilenodes or filesystem files:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

I don't think that works, because it sounds like this requires I connect
to the db, but that means I have to start it up. I am (wrongly?)
assuming we should get what we can while the db is still in a
just-crashed state.

Still, I seem to have mangled Alvaro's suggestion somehow, because this
doesn't look right:

# select h.* from public.page, page_header(raw) h;
 lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----+----------+-------+-------+-------+---------+----------+---------+-----------
 0/0 |        0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
(1 row)

I'll keep trying though.
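For reference, since the instance can't be started in its crashed state, the page header can also be decoded offline from the file that dd produced, without pageinspect at all. A minimal Python sketch, assuming PostgreSQL's standard 24-byte PageHeaderData layout on a little-endian machine; the file path and function name are illustrative:

```python
import struct

PAGE_SIZE = 8192

def read_page_header(page: bytes) -> dict:
    """Decode PageHeaderData, the first 24 bytes of a heap/index page.

    Layout (little-endian): pd_lsn as two uint32 halves, then
    pd_checksum, pd_flags, pd_lower, pd_upper, pd_special,
    pd_pagesize_version (all uint16), then pd_prune_xid (uint32).
    Note: the checksum comes back unsigned here, while pageinspect
    displays it as a signed smallint.
    """
    (xlogid, xrecoff, checksum, flags,
     lower, upper, special, pagesize_version,
     prune_xid) = struct.unpack_from("<IIHHHHHHI", page, 0)
    return {
        "lsn": f"{xlogid:X}/{xrecoff:X}",   # same %X/%X notation as pg_lsn
        "checksum": checksum,
        "flags": flags,
        "lower": lower,
        "upper": upper,
        "special": special,
        "pagesize": pagesize_version & 0xFF00,
        "version": pagesize_version & 0x00FF,
        "prune_xid": prune_xid,
    }

# Usage against the page extracted with dd (path is illustrative):
# with open("/tmp/page.6501", "rb") as f:
#     print(read_page_header(f.read(PAGE_SIZE)))
```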

#15Ben
bench@silentmedia.com
In reply to: Alvaro Herrera (#12)
Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera wrote on 8/3/20 2:34 PM:

On 2020-Aug-03, Ben Chobot wrote:

Alvaro Herrera wrote on 8/3/20 12:34 PM:

On 2020-Aug-03, Ben Chobot wrote:

Yep. Looking at the ones in block 6501,

rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
16605/16613/60529051 blk 6501

My question was whether the block has received the update that added the
item in offset 41; that is, is the LSN in the crashed copy of the page
equal to A0A/AB2C43D0? If it's an older value, then the write above was
lost for some reason.

How do I tell?

You can use pageinspect's page_header() function to obtain the page's
LSN. You can use dd to obtain the page from the file,

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

If I use skip instead of seek....

then put that binary file in a bytea column, perhaps like

create table page (raw bytea);
insert into page select pg_read_binary_file('/tmp/page');

and with that you can run page_header:

create extension pageinspect;
select h.* from page, page_header(raw) h;

     lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
 A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192 |       4 |         0

As I understand what we're looking at, this means the WAL stream was
assuming this page was last touched by A0A/AB2C43D0, but the page itself
thinks it was last touched by A0A/99BA11F8, which means at least one
write to the page is missing?
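The comparison described here can be made mechanical: a pg_lsn value such as A0A/AB2C43D0 is a 64-bit WAL position written as two hex halves, so parsing both sides as integers confirms the ordering. A small Python sketch (the helper name is mine):

```python
def parse_lsn(lsn: str) -> int:
    """Convert a pg_lsn string like 'A0A/AB2C43D0' into a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

page_lsn = parse_lsn("A0A/99BA11F8")    # pd_lsn from page_header()
redo_lsn = parse_lsn("A0A/AB2C43D0")    # WAL record that inserted offset 41

# The on-disk page predates a record that redo assumes was already
# applied, so at least one write to this block was lost.
assert page_lsn < redo_lsn
```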

#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ben (#15)
Re: 12.3 replicas falling over during WAL redo

On 2020-Aug-03, Ben Chobot wrote:

Alvaro Herrera wrote on 8/3/20 2:34 PM:

On 2020-Aug-03, Ben Chobot wrote:

dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

If I use skip instead of seek....

Argh, yes, I did correct that in my test and forgot to copy and paste.

     lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
 A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192 |       4 |         0

As I understand what we're looking at, this means the WAL stream was
assuming this page was last touched by A0A/AB2C43D0, but the page itself
thinks it was last touched by A0A/99BA11F8, which means at least one write
to the page is missing?

Yeah, that's exactly what we're seeing. Somehow an older page version
was resurrected. Of course, this should never happen.

So my theory has been proved. What now?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#16)
Re: 12.3 replicas falling over during WAL redo

On 2020-Aug-03, Alvaro Herrera wrote:

     lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
 A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192 |       4 |         0

As I understand what we're looking at, this means the WAL stream was
assuming this page was last touched by A0A/AB2C43D0, but the page itself
thinks it was last touched by A0A/99BA11F8, which means at least one write
to the page is missing?

Yeah, that's exactly what we're seeing. Somehow an older page version
was resurrected. Of course, this should never happen.

... although, the block should have been in shared buffers, and it is
there that the previous WAL record would have updated -- not necessarily
flushed to disk.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alvaro Herrera (#17)
Re: 12.3 replicas falling over during WAL redo

At Tue, 4 Aug 2020 09:53:36 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in

On 2020-Aug-03, Alvaro Herrera wrote:

     lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
 A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192 |       4 |         0

As I understand what we're looking at, this means the WAL stream was
assuming this page was last touched by A0A/AB2C43D0, but the page itself
thinks it was last touched by A0A/99BA11F8, which means at least one write
to the page is missing?

Yeah, that's exactly what we're seeing. Somehow an older page version
was resurrected. Of course, this should never happen.

... although, the block should have been in shared buffers, and it is
there that the previous WAL record would have updated -- not necessarily
flushed to disk.

Yeah. On the other hand, the WAL records shown upthread don't have an FPW.

rmgr: Btree len (rec/tot): 72/ 72, tx: 76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel
16605/16613/60529051 blk 6501

rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel
16605/16613/60529051 blk 6501

There must be a record for page 6501 conveying an FPW after the last
checkpoint. If none is found, something went wrong in deciding whether
to attach an FPW.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
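Kyotaro's check can be scripted against the pg_waldump text output: list every record touching the suspect block and flag which of them, if any, carry a full-page image. A Python sketch run against the records quoted upthread (the helper name is mine):

```python
import re

def records_for_block(waldump_lines, rel, blk):
    """Yield (lsn, has_fpw) for each pg_waldump text line whose
    blkref touches the given relfilenode/block."""
    pat = re.compile(rf"lsn: ([0-9A-F/]+),.*rel {re.escape(rel)} blk {blk}\b(.*)")
    for line in waldump_lines:
        m = pat.search(line)
        if m:
            yield m.group(1), "FPW" in m.group(2)

# Records quoted upthread (whitespace condensed):
sample = [
    "rmgr: Btree len (rec/tot): 53/813, tx: 76393363, lsn: A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0: rel 16605/16613/60529051 blk 6089 FPW",
    "rmgr: Btree len (rec/tot): 72/72, tx: 76393394, lsn: A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0: rel 16605/16613/60529051 blk 6501",
    "rmgr: Btree len (rec/tot): 72/72, tx: 76396065, lsn: A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0: rel 16605/16613/60529051 blk 6501",
]

# Neither record touching block 6501 carries a full-page image,
# even though one should exist after the last checkpoint:
for lsn, has_fpw in records_for_block(sample, "16605/16613/60529051", 6501):
    print(lsn, "FPW" if has_fpw else "no FPW")
```

In practice the input would be the full pg_waldump listing from the restartpoint onward, not just these three lines.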

#19Ben
bench@silentmedia.com
In reply to: Alvaro Herrera (#16)
Re: 12.3 replicas falling over during WAL redo

Alvaro Herrera wrote on 8/3/20 4:54 PM:

On 2020-Aug-03, Ben Chobot wrote:

Alvaro Herrera wrote on 8/3/20 2:34 PM:

On 2020-Aug-03, Ben Chobot wrote:
dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501

If I use skip instead of seek....

Argh, yes, I did correct that in my test and forgot to copy and paste.

     lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
 A0A/99BA11F8 |     -215 |     0 |   180 |  7240 |    8176 |     8192 |       4 |         0

As I understand what we're looking at, this means the WAL stream was
assuming this page was last touched by A0A/AB2C43D0, but the page itself
thinks it was last touched by A0A/99BA11F8, which means at least one write
to the page is missing?

Yeah, that's exactly what we're seeing. Somehow an older page version
was resurrected. Of course, this should never happen.

So my theory has been proved. What now?

Just to close the loop on this, we haven't seen the issue since we've
stopped expanding our filesystems by moving LVM extents between devices,
so while I don't know exactly where the bug lies, I feel it's quite
likely not in Postgres.