Physical slot restart_lsn advances incorrectly requiring restore from archive

Started by Hsu, John · over 5 years ago · 3 messages
#1 Hsu, John
hsuchen@amazon.com
1 attachment(s)

Hi hackers,

We believe we’re seeing a problem with how a physical slot’s restart_lsn is advanced, leading to replicas needing to restore from archive in order for replication to resume.
The logs below are from reproductions against 10.13. I’m still working on reproducing it on 12.3.

1. A WAL write spans two WAL segments.
2. The write to the first WAL segment is complete, but the write to the second segment is not.
3. The first WAL segment is acknowledged as flushed by the Postgres replica.
4. The primary restarts before the write to the second segment completes, which means the complete WAL record was never written.
5. Crash recovery finishes at a record before the incomplete WAL write.
6. The replica, however, now starts the slot at the next WAL segment, since the previous segment was already acknowledged as flushed.

Primary crashes because it ran out of space:
2020-07-10 01:23:31.399 UTC:10.15.4.83(56430):replication_user@[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2BFD2000 apply 0/216ACCD0
2020-07-10 01:23:31.401 UTC:10.15.4.83(56430):replication_user@[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2C000000 apply 0/216AE728
2020-07-10 01:23:31.504 UTC::@:[4548]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:31.511 UTC::@:[4548]:PANIC: could not write to file "pg_wal/xlogtemp.4548": No space left on device
2020-07-10 01:23:31.518 UTC::@:[4543]:DEBUG: reaping dead processes

Crash recovery begins:
2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
..
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: starting up replication slots
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:23:36.078 UTC::@:[8677]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000 (extra debug logs I added)
2020-07-10 01:23:36.081 UTC::@:[8677]:LOG: redo starts at 0/2A65AE08
...
2020-07-10 01:23:36.325 UTC::@:[8677]:LOG: redo done at 0/2BFFFFB0
...
2020-07-10 01:23:36.330 UTC::@:[8677]:LOG: checkpoint starting: end-of-recovery immediate
2020-07-10 01:23:36.332 UTC::@:[8677]:DEBUG: performing replication slot checkpoint
...
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=13 file=base/13934/2662 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=14 file=base/13934/2663 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=15 file=base/13934/24586 time=0.001 msec
2020-07-10 01:23:36.385 UTC::@:[8677]:LOG: could not signal for checkpoint: checkpointer is not running
2020-07-10 01:23:36.385 UTC::@:[8677]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:36.397 UTC::@:[8677]:PANIC: could not write to file "pg_wal/xlogtemp.8677": No space left on device

Primary runs out of space during crash recovery. Space is freed up afterwards and crash recovery begins again.

2020-07-10 01:32:45.804 UTC::@:[16329]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
...
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: starting up replication slots
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:32:45.806 UTC::@:[16329]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000
...
2020-07-10 01:32:45.809 UTC::@:[16329]:LOG: redo starts at 0/2A65AE08
2020-07-10 01:32:46.043 UTC::@:[16329]:DEBUG: could not open file "pg_wal/00000001000000000000002C": No such file or directory
2020-07-10 01:32:46.043 UTC::@:[16329]:LOG: redo done at 0/2BFFFFB0

Redo finishes at 0/2BFFFFB0 even though the flush we received from the replica is already at 0/2C000000.
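
To make the boundary arithmetic concrete, here is a small standalone sketch (my own illustration, not PostgreSQL code, assuming the default 16MB WAL segment size and mirroring the arithmetic behind XLByteToSeg()/XLByteInSeg()) that decomposes those two LSNs:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)   /* default 16MB segments */

static void
show(const char *label, uint64_t lsn)
{
    /* segment number and byte offset within that segment */
    printf("%s -> segment %" PRIX64 ", offset 0x%" PRIX64 "\n",
           label, lsn / WAL_SEGMENT_SIZE, lsn % WAL_SEGMENT_SIZE);
}

int
main(void)
{
    show("redo done 0/2BFFFFB0", 0x2BFFFFB0);   /* segment 2B, 0x50 bytes before the boundary */
    show("flush ack 0/2C000000", 0x2C000000);   /* segment 2C, offset 0: start of the next segment */
    return 0;
}

The acknowledged flush sits exactly at the start of segment 0x2C, while redo ends inside segment 0x2B, just short of the boundary.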

This is problematic because the replica reconnects to the slot telling it to start past the new redo point.

2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_1" 0/2C000000 TIMELINE 1
2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: "walreceiver" has now caught up with upstream server
2020-07-10 01:32:50.774 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2BFFFFF0 apply 0/2BFFFFF0
2020-07-10 01:32:50.775 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2C000B80 apply 0/2BFFFFF0

This leads to a mismatch between what the primary has at the end of segment 0/2B and what was streamed to the replica.

Replica logs:
2020-07-10 01:32:50.671 UTC::@:[24899]:LOG: started streaming WAL from primary at 0/2C000000 on timeline 1
...
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: invalid contrecord length 90 at 0/2BFFFFF0
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: switched WAL source from archive to stream after failure

Now the physical slot has advanced past segment 0/2B, which is what the replica actually needs.

postgres=> select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------------------------------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
physical_slot_1 | | physical | | | f | f | | | | 0/2C000B80 |
(1 row)

Segment 0/2B can now rotate out on the primary, and restoring from archive is required in order for replication to resume.

The attached patch (against 10) attempts to address this by keeping track of the first flushLsn in the current segNo, and waiting until we receive one past it before updating. This prevents the WAL from rotating out on the primary, and a restart of the replica will fix it instead of needing to restore from archive.
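
To restate the check in a condensed, standalone form (the real change lives in PhysicalConfirmReceivedLocation() and updates the slot under its spinlock; the confirm_flush() helper and hard-coded 16MB segment size below are only for illustration):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

static uint64_t restartSegNo = 0;
static uint64_t firstLsnInRestartSegNo = 0;     /* stands in for InvalidXLogRecPtr */
static uint64_t restart_lsn = 0;

static void
confirm_flush(uint64_t lsn)
{
    if (lsn / WAL_SEGMENT_SIZE != restartSegNo)         /* !XLByteInSeg(lsn, restartSegNo) */
    {
        restartSegNo = lsn / WAL_SEGMENT_SIZE;          /* XLByteToSeg(lsn, restartSegNo) */
        firstLsnInRestartSegNo = lsn;
        printf("0x%" PRIX64 ": first flush seen in segment %" PRIX64 ", hold restart_lsn\n",
               lsn, restartSegNo);
        return;
    }
    if (lsn <= firstLsnInRestartSegNo)
    {
        printf("0x%" PRIX64 ": not past the first flush in this segment, hold restart_lsn\n", lsn);
        return;
    }
    restart_lsn = lsn;
    printf("0x%" PRIX64 ": restart_lsn advanced\n", lsn);
}

int
main(void)
{
    /* flush sequence reported just before the crash in the logs above */
    confirm_flush(0x2BFD2000);      /* first flush seen in segment 2B: held */
    confirm_flush(0x2C000000);      /* first flush seen in segment 2C: held */
    printf("final restart_lsn: 0x%" PRIX64 "\n", restart_lsn);
    return 0;
}

Fed the flush sequence from the failing run, neither reply advances restart_lsn into segment 0x2C, which is what keeps segment 0/2B from rotating out before the replica reconnects.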

With the patch:

Primary goes into crash recovery and we avoid updating the restart_lsn of the slot:

2020-07-10 18:50:12.686 UTC::@:[6160]:LOG: redo starts at 0/2D417108
2020-07-10 18:50:12.965 UTC::@:[6160]:DEBUG: could not open file "pg_wal/00000001000000000000002F": No such file or directory
2020-07-10 18:50:12.965 UTC::@:[6160]:LOG: redo done at 0/2EFFFF90
...
2020-07-10 18:59:32.987 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_2" 0/2F000000 TIMELINE 1
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2EFFFFD0 apply 0/2EFFFFD0
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2F020000 apply 0/2EFFFFD0
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match

Replica logs:
2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: could not restore file "00000001000000000000002F" from archive: child process exited with exit code 1
2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: invalid contrecord length 58 at 0/2EFFFFD0

Since the flush LSN hasn't advanced past the first one we saw in restartSegNo, restart_lsn doesn't get updated in subsequent checkpoints.

postgres=> select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------------------------------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
physical_slot_2 | | physical | | | f | f | | | | 0/2DFF8000 |
(1 row)

Rebooting the replica allows replication to resume from the slot and restart_lsn advances normally.

Thanks,

John H

Attachments:

0001-delay-slot-restart-lsn.patch (application/octet-stream)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c17c6be040..ac95240366 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -157,6 +157,13 @@ static XLogRecPtr sendTimeLineValidUpto = InvalidXLogRecPtr;
  */
 static XLogRecPtr sentPtr = 0;
 
+/*
+ * These variables are used to avoid updating a physical slot's restart_lsn
+ * when the flushed record crosses a segment boundary.
+ */
+static XLogSegNo restartSegNo = 0;
+static XLogRecPtr firstLsnInRestartSegNo = InvalidXLogRecPtr;
+
 /* Buffers for constructing outgoing messages and processing reply messages. */
 static StringInfoData output_message;
 static StringInfoData reply_message;
@@ -1729,6 +1736,31 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
 	ReplicationSlot *slot = MyReplicationSlot;
 
 	Assert(lsn != InvalidXLogRecPtr);
+
+	/*
+	 * Avoid updating restart_lsn if the lsn we received is for a record that
+	 * is in a new segNo. This is to avoid the restart_lsn incorrectly
+	 * advancing if the previous record gets rewritten after an engine crash.
+	 */
+	if (!XLByteInSeg(lsn, restartSegNo))
+	{
+		XLByteToSeg(lsn, restartSegNo);
+		firstLsnInRestartSegNo = lsn;
+		elog(DEBUG2, "lsn is not in restartSegNo, update to match");
+		return;
+	}
+
+	if (lsn <= firstLsnInRestartSegNo)
+	{
+		/*
+		 * The flush LSN hasn't advanced past the first one we received in the
+		 * current restartSegNo, so the record may have crossed a WAL segment
+		 * boundary. Avoid updating restart_lsn until a later flush arrives.
+		 */
+		elog(DEBUG2, "lsn is not greater than firstLsnInRestartSegNo, wait for a new record before updating slot's restart_lsn");
+		return;
+	}
+
 	SpinLockAcquire(&slot->mutex);
 	if (slot->data.restart_lsn != lsn)
 	{
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Hsu, John (#1)
Re: Physical slot restart_lsn advances incorrectly requiring restore from archive

Hello, John.

At Fri, 10 Jul 2020 20:44:30 +0000, "Hsu, John" <hsuchen@amazon.com> wrote in

Hi hackers,

We believe we’re seeing a problem with how a physical slot’s restart_lsn is advanced, leading to replicas needing to restore from archive in order for replication to resume.
The logs below are from reproductions against 10.13. I’m still working on reproducing it on 12.3.

1. A WAL write spans two WAL segments.
2. The write to the first WAL segment is complete, but the write to the second segment is not.
3. The first WAL segment is acknowledged as flushed by the Postgres replica.
4. The primary restarts before the write to the second segment completes, which means the complete WAL record was never written.
5. Crash recovery finishes at a record before the incomplete WAL write.
6. The replica, however, now starts the slot at the next WAL segment, since the previous segment was already acknowledged as flushed.

...

Redo finishes at 0/2BFFFFB0 even though the flush we received from
the replica is already at 0/2C000000.
This is problematic because the replica reconnects to the slot
telling it to start past the new redo point.

Yeah, that is a problem not only related to restart_lsn. The same
cause leads to another issue of inconsistent archives, as discussed
in [1].

1: /messages/by-id/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com

The attached patch (against 10) attempts to address this by keeping
track of the first flushLsn in the current segNo, and waiting until
we receive one past it before updating. This prevents the WAL from
rotating out on the primary, and a restart of the replica will fix
it instead of needing to restore from archive.

On the other hand, we can and should advance restart_lsn when we know
that the last record is complete. I think a patch in the thread [2]
would fix your issue. With the patch, the primary doesn't send a
continuation record at the end of a segment until the whole record is
flushed into the WAL file.

2: /messages/by-id/20200625.153532.379700510444980240.horikyota.ntt@gmail.com
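
As a toy model of that rule only, with made-up names and a hypothetical record (the actual patch in [2] may be implemented quite differently inside the walsender): if the newest record crosses a segment boundary and is not yet fully flushed, hold back the portion of it sitting at the end of the previous segment.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * rec_start/rec_end describe the newest WAL record; flushed is how far WAL
 * has actually reached disk.  Returns the LSN up to which it is safe to
 * stream to a standby under this rule.
 */
static uint64_t
safe_send_upto(uint64_t rec_start, uint64_t rec_end, uint64_t flushed)
{
    int crosses = (rec_start / WAL_SEGMENT_SIZE) != ((rec_end - 1) / WAL_SEGMENT_SIZE);

    /* Normally everything flushed so far may be streamed. */
    if (!crosses || flushed >= rec_end)
        return flushed;

    /* Cross-boundary record not fully flushed: hold back from its start. */
    return rec_start < flushed ? rec_start : flushed;
}

int
main(void)
{
    /* hypothetical record starting just before the 0x2C segment boundary */
    uint64_t rec_start = 0x2BFFFFF0;
    uint64_t rec_end = 0x2C000090;

    printf("safe to send up to 0x%" PRIX64 "\n",
           safe_send_upto(rec_start, rec_end, 0x2C000000));     /* prints 0x2BFFFFF0 */
    return 0;
}

Under that rule the standby should never confirm a flush at 0/2C000000 before the whole record exists on the primary's disk, so the scenario above should not arise.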

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Hsu, John
hsuchen@amazon.com
In reply to: Hsu, John (#1)
Re: Physical slot restart_lsn advances incorrectly requiring restore from archive

Hi Horiguchi-san,

I'll take a look at that thread and see if I can reproduce with that patch applied.
It seems like it would directly address this issue. Thanks for taking a look.

Cheers,
John H
