"invalid contrecord" error on replica

Started by JOULAUD Françoisalmost 7 years ago8 messagesgeneral
Jump to latest
#1JOULAUD François
Francois.JOULAUD@radiofrance.com

Hello,

I encountered a problem on replicas after the primary crashed for lack of disk space.

After the problem I had a constant flow of "invalid contrecord" logs and replication ceased working.

The only way I found to make it work again was to completely restart the replica.

The logs:

Jun 11 12:44:28 myhostreplica postgres[20302]: [7-1] 2019-06-11 14:44:28 CEST [20302-2] FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
Jun 11 12:44:28 myhostreplica postgres[19166]: [6-1] 2019-06-11 14:44:28 CEST [19166-1] FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode
Jun 11 12:44:33 myhostreplica postgres[19167]: [6-1] 2019-06-11 14:44:33 CEST [19167-1] LOG: started streaming WAL from primary at 2C/6A000000 on timeline 1
Jun 11 12:44:33 myhostreplica postgres[20298]: [10-1] 2019-06-11 14:44:33 CEST [20298-6] LOG: invalid contrecord length 4571 at 2C/69FFF648
Jun 11 12:44:33 myhostreplica postgres[19167]: [7-1] 2019-06-11 14:44:33 CEST [19167-2] FATAL: terminating walreceiver process due to administrator command
Jun 11 12:44:33 myhostreplica postgres[20298]: [11-1] 2019-06-11 14:44:33 CEST [20298-7] LOG: invalid contrecord length 4571 at 2C/69FFF648
Jun 11 12:44:33 myhostreplica postgres[20298]: [12-1] 2019-06-11 14:44:33 CEST [20298-8] LOG: invalid contrecord length 4571 at 2C/69FFF648

François J.

#2Adrien Nayrat
adrien.nayrat@anayrat.info
In reply to: JOULAUD François (#1)
Re: "invalid contrecord" error on replica

Hello,

I encountered a similar issue with pg 13.

TL;DR: The secondary did not received a wal record (CHECKPOINT_SHUTDOWN) which
corrupted the wal and he failed when he tried to replay it.

For a personal project I have a primary and a secondary with streaming
replication and replication_slot.

I updated pgbackrest only on primary side, which broke archiving due to a
different pgbackrest version on primary and secondary.
It filled the filesystem until the primary crashed. So, I updated pgbackrest on
the secondary, archiving caught up and I did not look at secondary. I thought it
will caught up too.

Unfortunately not:

FATAL: could not connect to the primary server: FATAL: the database system is
in recovery mode
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFC20
[...]
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFC20
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFC20
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
LOG: invalid contrecord length 1523 at AA/A0FFFB70
FATAL: terminating walreceiver process due to administrator command
LOG: invalid contrecord length 1523 at AA/A0FFFB70
LOG: invalid contrecord length 1523 at AA/A0FFFB70
[ previous line keep repeating ]

And several hours later:
FATAL: terminating walreceiver process due to administrator command

No other messages arround, just the same "invalid contrecord length 1523 at
AA/A0FFFB70" and from time to time "terminating walreceiver process due to
administrator command".

During all this time, the primary kept WAL files because replication slot was
inactive, until FS was full again.

Then, I decided to restart both (I just removed some useless files to have more
place). Hoping the secondary be able to start streaming replication.

Unfortunately, the secondary asked for an old WAL file which where already
recycled on primary:

LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: shutting down
LOG: database system is shut down
LOG: starting PostgreSQL 13.2 (Debian 13.2-1.pgdg90+1) on x86_64-pc-linux-gnu,
compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
LOG: listening on IPv4 address "127.0.0.1", port 5433
LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
LOG: database system was shut down in recovery at 2021-05-02 20:23:49 CEST
LOG: entering standby mode
LOG: redo starts at AA/9F29F7C8
LOG: consistent recovery state reached at AA/A0FFFB70
LOG: invalid contrecord length 1523 at AA/A0FFFB70
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at AA/A0000000 on timeline 1
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment
00000001000000AA000000A0 has already been removed

(Notice, he started replication at AA/A0000000 whereas, before the restart, he
started at AA/A1000000)

I dig a little bit, on the primary, the restart_lsn of the replication slot was:
AA/A1004B78

pg_controldata on secondary told:
Latest checkpoint location: AA/9FC68B48
Latest checkpoint's REDO location: AA/9F29F7C8
Latest checkpoint's REDO WAL file: 00000001000000AA0000009F

I understood why he asked for a WAL before its restart_lsn:

He reached AA/A1004B78, after pg_ctl restart by me, he start replaying from
"REDO location" at AA/9F29F7C8 until he reached AA/A0FFFB70 then asked for the
WAL A0 and boom.

Why he stopped at AA/A0FFFB70 whereas the restart_lsn is AA/A1004B78 ?

I looked at 00000001000000AA000000A0 on the secondary, it end by:
rmgr: Transaction len (rec/tot): 34/ 34, tx: 55621934, lsn:
AA/A0FFFAF8, prev AA/A0FFE7F8, desc: COMMIT 2021-05-01 17:25:51.949124 CEST
rmgr: Heap len (rec/tot): 76/ 76, tx: 55621935, lsn:
AA/A0FFFB20, prev AA/A0FFFAF8, desc: HOT_UPDATE off 83 xmax 55621935 flags 0x20
; new off 19 xmax 0, blkref #0: rel 1663/167087/167363 blk 2662

Fortunately I archive wal, and on the primary, the end of this wal is:
rmgr: Transaction len (rec/tot): 34/ 34, tx: 55621934, lsn:
AA/A0FFFAF8, prev AA/A0FFE7F8, desc: COMMIT 2021-05-01 17:25:51.949124 CEST
rmgr: Heap len (rec/tot): 76/ 76, tx: 55621935, lsn:
AA/A0FFFB20, prev AA/A0FFFAF8, desc: HOT_UPDATE off 83 xmax 55621935 flags 0x20
; new off 19 xmax 0, blkref #0: rel 1663/167087/167363 blk 2662
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn:
AA/A0FFFB70, prev AA/A0FFFB20, desc: CHECKPOINT_SHUTDOWN redo AA/A0FFFB70; tli
1; prev tli 1; fpw true; xid 0:55621936; oid 1401408; multi 401; offset 802;
oldest xid 478 in DB 13397; oldest multi 1 in DB 13397; oldest/newest commit
timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn:
AA/A0FFFBE8, prev AA/A0FFFB70, desc: RUNNING_XACTS nextXid 55621936
latestCompletedXid 55621935 oldestRunningXid 55621936

I also dumped 00000001000000AA000000A1 on the secondary and it contains all the
records until AA/A1004018.

It is really weird, I don't understand how the secondary can miss the last 2
records of A0? It seems he did not received the CHECKPOINT_SHUTDOWN record?

Any idea?

Thanks!

--
Adrien NAYRAT

#3Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Adrien Nayrat (#2)
Re: "invalid contrecord" error on replica

On Sun, 2021-05-02 at 22:43 +0200, Adrien Nayrat wrote:

LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8

You are trying to stream from a server whose WAL position is *behind* your
own. That cannot work.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#4Adrien Nayrat
adrien.nayrat@anayrat.info
In reply to: Laurenz Albe (#3)
Re: "invalid contrecord" error on replica

On 03/05/2021 10:43, Laurenz Albe wrote:

On Sun, 2021-05-02 at 22:43 +0200, Adrien Nayrat wrote:

LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting
point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8

You are trying to stream from a server whose WAL position is *behind* your
own. That cannot work.

Yours,
Laurenz Albe

Yes, it was during primary's recovery as it crashed, it restarted from previous
checkpoint.

--
Adrien NAYRAT

#5Adrien Nayrat
adrien.nayrat@anayrat.info
In reply to: Adrien Nayrat (#2)
Re: "invalid contrecord" error on replica

Oh, I forgot to tell I was able to recover the secondary by replacing the
00000001000000AA000000A0 from the archives into pg_wal. Then the secondary were
able to finish recovery, start streaming replication and fetch subsequent wals.

I wondered why there was a CHECKPOINT_SHUTDOWN record. I dig a little more:

First, the filesystem on primary were full and I got :

PANIC: could not write to file "pg_wal/xlogtemp.305": No space left on device

The instance crashed and restarted in recovery mode.
At the end of the recovery I got:
checkpoint starting: end-of-recovery immediate
checkpoint complete: ...

Then a FATAL message:
FATAL: could not write to file "pg_wal/xlogtemp.9405": No space left on device

Followed by:
aborting startup due to process failure

Maybe it is this checkpoint which were not replicated? The primary had enough
space for this record. But I don't understand how the secondary received records
beginning by AA/A1...

I googled about this and I found other similar issues:
/messages/by-id/15938-8591df7e95064538@postgresql.org

/messages/by-id/CAMp7vw97871F21X7FHHdmU2FXGME4HTgMYxkAubMdCU2xevmxQ@mail.gmail.com
/messages/by-id/E73F4CFB-E322-461E-B1EC-82FAA808FEE6@lifetrenz.com

/messages/by-id/15398-b4896eebf0bed218@postgresql.org
/messages/by-id/15412-f9a89b026e6774d1@postgresql.org

--
Adrien NAYRAT
https://blog.anayrat.info

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Adrien Nayrat (#2)
Re: "invalid contrecord" error on replica

At Sun, 2 May 2021 22:43:44 +0200, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote in

I also dumped 00000001000000AA000000A1 on the secondary and it
contains all the records until AA/A1004018.

It is really weird, I don't understand how the secondary can miss the
last 2 records of A0? It seems he did not received the
CHECKPOINT_SHUTDOWN record?

Any idea?

This seems like stepping on the same issue with [1], in short, the
secondary having received an incomplete record but the primary forgot
of the record after restart.

Specifically, primary was writing a WAL record that starts at A0FFFB70
and continues to A1xxxxxx segment. The secondary successfully received
the first half of the record but the primary failed to write (then
send) the last half of the record due to disk full.

At this time it seems that the primary's last completed record ended
at A0FFB70. Then the CHECKPOINT_SHUTDOWN record overwrote the
already-halfly-sent record up to A0FFBE8 while restarting.

On the secondary side, there's only the first half of the record,
which had been forgotten by the primary and the last half starting at
LSN A1000000 was still the future in the new history on the primary.

After some time the primary reaches A1000000 but the first record in
the segment is of course disagrees with the history of the secondary.

1: /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Adrien Nayrat
adrien.nayrat@anayrat.info
In reply to: Kyotaro Horiguchi (#6)
Re: "invalid contrecord" error on replica

On 5/6/21 7:37 AM, Kyotaro Horiguchi wrote:

At Sun, 2 May 2021 22:43:44 +0200, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote in

I also dumped 00000001000000AA000000A1 on the secondary and it
contains all the records until AA/A1004018.

It is really weird, I don't understand how the secondary can miss the
last 2 records of A0? It seems he did not received the
CHECKPOINT_SHUTDOWN record?

Any idea?

This seems like stepping on the same issue with [1], in short, the
secondary having received an incomplete record but the primary forgot
of the record after restart.

Specifically, primary was writing a WAL record that starts at A0FFFB70
and continues to A1xxxxxx segment. The secondary successfully received
the first half of the record but the primary failed to write (then
send) the last half of the record due to disk full.

At this time it seems that the primary's last completed record ended
at A0FFB70. Then the CHECKPOINT_SHUTDOWN record overwrote the
already-halfly-sent record up to A0FFBE8 while restarting.

On the secondary side, there's only the first half of the record,
which had been forgotten by the primary and the last half starting at
LSN A1000000 was still the future in the new history on the primary.

After some time the primary reaches A1000000 but the first record in
the segment is of course disagrees with the history of the secondary.

1: /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com

regards.

Hello,

Thanks for your reply and your explanation! Now, I understand, it's good to know
it is a known issue.
I'll follow this thread, I hope we will find a solution. It's annoying that your
secondary breaks when your primary crash and the only solution is to either
fetch an archived WAL file and replace it on the secondary, or completely
rebuild your secondary.

Thanks

--
Adrien NAYRAT

#8Adrien Nayrat
adrien.nayrat@anayrat.info
In reply to: Adrien Nayrat (#7)
Re: "invalid contrecord" error on replica

I received a question about this error.
Just for the record, if someone encounter the same issue.
It has been fixed in ff9f111bce24

TL;DR : update your instance :)

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24

Fix WAL replay in presence of an incomplete record

Physical replication always ships WAL segment files to replicas once
they are complete. This is a problem if one WAL record is split across
a segment boundary and the primary server crashes before writing down
the segment with the next portion of the WAL record: WAL writing after
crash recovery would happily resume at the point where the broken record
started, overwriting that record ... but any standby or backup may have
already received a copy of that segment, and they are not rewinding.
This causes standbys to stop following the primary after the latter
crashes:
LOG: invalid contrecord length 7262 at A8/D9FFFBC8
because the standby is still trying to read the continuation record
(contrecord) for the original long WAL record, but it is not there and
it will never be. A workaround is to stop the replica, delete the WAL
file, and restart it -- at which point a fresh copy is brought over from
the primary. But that's pretty labor intensive, and I bet many users
would just give up and re-clone the standby instead.

A fix for this problem was already attempted in commit 515e3d84a0b5, but
it only addressed the case for the scenario of WAL archiving, so
streaming replication would still be a problem (as well as other things
such as taking a filesystem-level backup while the server is down after
having crashed), and it had performance scalability problems too; so it
had to be reverted.

This commit fixes the problem using an approach suggested by Andres
Freund, whereby the initial portion(s) of the split-up WAL record are
kept, and a special type of WAL record is written where the contrecord
was lost, so that WAL replay in the replica knows to skip the broken
parts. With this approach, we can continue to stream/archive segment
files as soon as they are complete, and replay of the broken records
will proceed across the crash point without a hitch.

Because a new type of WAL record is added, users should be careful to
upgrade standbys first, primaries later. Otherwise they risk the standby
being unable to start if the primary happens to write such a record.

A new TAP test that exercises this is added, but the portability of it
is yet to be seen.

This has been wrong since the introduction of physical replication, so
backpatch all the way back. In stable branches, keep the new
XLogReaderState members at the end of the struct, to avoid an ABI
break.