[BUG] Panic due to incorrect missingContrecPtr after promotion
On 13.5, a WAL flush PANIC is encountered after a standby is promoted.
With debugging, it was found that when a standby skips a missing continuation record on recovery, the missingContrecPtr is not invalidated after the record is skipped. Therefore, when the standby is promoted to a primary it writes an overwrite_contrecord with an LSN of the missingContrecPtr, which is now in the past. At flush time, this causes a PANIC. From what I can see, this failure scenario can only occur after a standby is promoted.
The overwrite_contrecord was introduced in 13.5 with https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.
Attached is a patch and a TAP test to handle this condition. The patch ensures that an overwrite_contrecord is only created if the missingContrecPtr is ahead of the last WAL record.
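The proposed condition can be sketched as a standalone model (hypothetical helper name and simplified types; the real change lives in xlog.c):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's types; a model of the proposed
 * guard, not the actual server code. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Write an OVERWRITE_CONTRECORD only when the missing contrecord position
 * is at or beyond the end of recovered WAL.  On a freshly promoted standby
 * the skipped contrecord lies behind the end of WAL (the old primary
 * already wrote the overwrite record), so writing another one would point
 * into already-flushed WAL and lead to the flush PANIC described above.
 */
static bool
should_write_overwrite_contrecord(XLogRecPtr endOfLog,
                                  XLogRecPtr missingContrecPtr)
{
    if (missingContrecPtr == InvalidXLogRecPtr)
        return false;           /* no incomplete record was skipped */
    return missingContrecPtr >= endOfLog;
}
```

With the numbers from the log below (end of WAL 0/201EC70, missing contrecord 0/1FFC620), this guard suppresses the duplicate record.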
To reproduce:
Run the new tap test recovery/t/029_overwrite_contrecord_promotion.pl without the attached patch
2022-02-22 18:38:15.526 UTC [31138] LOG: started streaming WAL from primary at 0/2000000 on timeline 1
2022-02-22 18:38:15.535 UTC [31105] LOG: successfully skipped missing contrecord at 0/1FFC620, overwritten at 2022-02-22 18:38:15.136482+00
2022-02-22 18:38:15.535 UTC [31105] CONTEXT: WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC620; time 2022-02-22 18:38:15.136482+00
…
…..
2022-02-22 18:38:15.575 UTC [31103] PANIC: xlog flush request 0/201EC70 is not satisfied --- flushed only to 0/2000088
2022-02-22 18:38:15.575 UTC [31101] LOG: checkpointer process (PID 31103) was terminated by signal 6: Aborted
….
…..
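For context on why the flush can never be satisfied: an LSN printed as "%X/%X" is one 64-bit position split into 32-bit halves, and the requested position 0/201EC70 is numerically past 0/2000088, the end of the WAL the promoted node actually generated. A small sketch (the helper is made up for illustration):

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Recombine the two 32-bit halves of the "%X/%X" LSN display format. */
static XLogRecPtr
lsn_from_parts(uint32_t hi, uint32_t lo)
{
    return ((XLogRecPtr) hi << 32) | lo;
}
```

Comparing lsn_from_parts(0x0, 0x201EC70) with lsn_from_parts(0x0, 0x2000088) shows the checkpointer being asked to flush WAL that was never written.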
With the patch, running the same tap test succeeds and a PANIC is not observed.
Thanks
Sami Imseih
Amazon Web Services
Attachments:
0001-Fix-missing-continuation-record-after-standby-promot.patch (application/octet-stream)
On 2022-Feb-22, Imseih (AWS), Sami wrote:
On 13.5 a wal flush PANIC is encountered after a standby is promoted.
With debugging, it was found that when a standby skips a missing
continuation record on recovery, the missingContrecPtr is not
invalidated after the record is skipped. Therefore, when the standby
is promoted to a primary it writes an overwrite_contrecord with an LSN
of the missingContrecPtr, which is now in the past. On flush time,
this causes a PANIC. From what I can see, this failure scenario can
only occur after a standby is promoted.
Ooh, nice find and diagnosis. I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.
I attach the same patch, with the test file put in its final place
rather than as a patch. Due to recent xlog.c changes this needs a bit of
work to apply to back branches; I'll see about getting it in all
branches soon.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"I'm impressed how quickly you are fixing this obscure issue. I came from
MS SQL and it would be hard for me to put into words how much of a better job
you all are doing on [PostgreSQL]."
Steve Midgley, http://archives.postgresql.org/pgsql-sql/2008-08/msg00000.php
Attachments:
v2-0001-Fix-missing-continuation-record-after-standby-pro.patch (text/x-diff; charset=utf-8)
Ooh, nice find and diagnosis. I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.
I attach the same patch, with the test file put in its final place
rather than as a patch. Due to recent xlog.c changes this needs a bit of
work to apply to back branches; I'll see about getting it in all
branches soon.
Thank you for reviewing!
Sami
At Wed, 23 Feb 2022 02:58:07 +0000, "Imseih (AWS), Sami" <simseih@amazon.com> wrote in
Ooh, nice find and diagnosis. I can confirm that the test fails as you
described without the code fix, and doesn't fail with it.
I attach the same patch, with the test file put in its final place
rather than as a patch. Due to recent xlog.c changes this needs a bit of
work to apply to back branches; I'll see about getting it in all
branches soon.
Thank you for reviewing!
Nice catch! However, I'm not sure I like the patch.
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
+ *
+ * If the last wal record is ahead of the missing contrecord, this is a
+ * recently promoted primary and we should not write an overwrite
+ * contrecord.
Just before that part, the same comment reads as shown below.
* Actually, if WAL ended in an incomplete record, skip the parts that
So, actually, the WAL did not end in an incomplete record. I think
FinishWalRecovery is the last place to do that. (But it could be
earlier.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 24 Feb 2022 16:26:42 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
So, actually, the WAL did not end in an incomplete record. I think
FinishWalRecovery is the last place to do that. (But it could be
earlier.)
After some investigation, I finally concluded that we should reset
abortedRecPtr and missingContrecPtr at processing
XLOG_OVERWRITE_CONTRECORD.
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1953,6 +1953,11 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
timestamptz_to_str(xlrec.overwrite_time))));
+ /* We have safely skipped the aborted record */
+ abortedRecPtr = InvalidXLogRecPtr;
+ missingContrecPtr = InvalidXLogRecPtr;
+
/* Verifying the record should only happen once */
record->overwrittenRecPtr = InvalidXLogRecPtr;
}
The last check in the test against "resetting aborted record" is no
longer useful since it is already checked by
026_overwrite_contrecord.pl.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Nice catch! However, I'm not sure I like the patch.
* made it through and start writing after the portion that persisted.
* (It's critical to first write an OVERWRITE_CONTRECORD message, which
* we'll do as soon as we're open for writing new WAL.)
+ *
+ * If the last wal record is ahead of the missing contrecord, this is a
+ * recently promoted primary and we should not write an overwrite
+ * contrecord.
Before the part, the comment follows the part shown below.
* Actually, if WAL ended in an incomplete record, skip the parts that
So, actually, the WAL did not end in an incomplete record. I think
FinishWalRecovery is the last place to do that. (But it could be
earlier.)
Thanks for the feedback.
## On the primary, an OVERWRITE_CONTRECORD is written. The aborted record (abortedRecPtr) and the insert position of the missing contrecord (missingContrecPtr) are set during FinishWalRecovery; after recovery, but before writes are accepted, the OVERWRITE_CONTRECORD is written.
## on the primary logs
2022-02-25 02:25:16.208 UTC [7397] LOG: redo done at 0/1FFD2B8 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
2022-02-25 02:25:16.208 UTC [7397] DEBUG: resetting unlogged relations: cleanup 0 init 1
2022-02-25 02:25:16.209 UTC [7397] DEBUG: creating and filling new WAL file
2022-02-25 02:25:16.217 UTC [7397] DEBUG: done creating and filling new WAL file
2022-02-25 02:25:16.217 UTC [7397] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2022-02-25 02:25:16.217 UTC [7397] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1
2022-02-25 02:25:16.217 UTC [7397] DEBUG: OVERWRITE_CONTRECORD inserted at 0/2000028 for aborted record 0/1FFD2E0 <<------- Attached V3 of patch that adds logging which shows the overwrite record created on the primary
2022-02-25 02:25:16.217 UTC [7395] LOG: checkpoint starting: end-of-recovery immediate wait
2022-02-25 02:25:16.217 UTC [7395] DEBUG: performing replication slot checkpoint
2022-02-25 02:25:16.218 UTC [7395] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000
2022-02-25 02:25:16.218 UTC [7395] LOG: checkpoint complete: wrote 131 buffers (102.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=11381 kB, estimate=11381 kB
2022-02-25 02:25:16.219 UTC [7394] DEBUG: starting background worker process "logical replication launcher"
2022-02-25 02:25:16.219 UTC [7394] LOG: database system is ready to accept connections
## A downstream standby now skips over this OVERWRITE_CONTRECORD, but the value for missingContrecPtr is not invalidated afterwards.
## on the standby logs
2022-02-25 02:25:16.616 UTC [7413] DEBUG: sending hot standby feedback xmin 0 epoch 0 catalog_xmin 0 catalog_xmin_epoch 0
2022-02-25 02:25:16.616 UTC [7387] LOG: successfully skipped missing contrecord at 0/1FFD2E0, overwritten at 2022-02-25 02:25:16.2175+00
2022-02-25 02:25:16.616 UTC [7387] CONTEXT: WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFD2E0; time 2022-02-25 02:25:16.2175+00
## After promotion, the standby attempts to write the overwrite_contrecord again using the missingContrecPtr LSN which is now in the past.
## on the standby logs
2022-02-25 02:25:16.646 UTC [7387] LOG: invalid record length at 0/201EC70: wanted 24, got 0
2022-02-25 02:25:16.646 UTC [7387] LOG: redo done at 0/201EC48 system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.55 s
2022-02-25 02:25:16.646 UTC [7387] LOG: last completed transaction was at log time 2022-02-25 02:25:16.301554+00
2022-02-25 02:25:16.646 UTC [7387] DEBUG: resetting unlogged relations: cleanup 0 init 1
2022-02-25 02:25:16.646 UTC [7387] LOG: selected new timeline ID: 2
2022-02-25 02:25:16.646 UTC [7387] DEBUG: updated min recovery point to 0/201EC70 on timeline 1
2022-02-25 02:25:16.656 UTC [7387] DEBUG: could not remove file "pg_wal/000000020000000000000002": No such file or directory
2022-02-25 02:25:16.656 UTC [7387] LOG: archive recovery complete
2022-02-25 02:25:16.656 UTC [7387] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2022-02-25 02:25:16.656 UTC [7387] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1
2022-02-25 02:25:16.656 UTC [7387] DEBUG: OVERWRITE_CONTRECORD inserted at 0/2000028 for aborted record 0/1FFD2E0 <<------- The same overwrite record is written on the recently promoted standby
2022-02-25 02:25:16.656 UTC [7387] DEBUG: attempting to remove WAL segments newer than log file 000000020000000000000001
2022-02-25 02:25:16.656 UTC [7387] DEBUG: recycled write-ahead log file "000000010000000000000002"
2022-02-25 02:25:16.656 UTC [7387] DEBUG: release all standby locks
2022-02-25 02:25:16.656 UTC [7385] LOG: checkpoint starting: force
2022-02-25 02:25:16.656 UTC [7385] DEBUG: performing replication slot checkpoint
2022-02-25 02:25:16.656 UTC [7385] LOG: request to flush past end of generated WAL; request 0/201EC70, current position 0/2000088
2022-02-25 02:25:16.656 UTC [7385] PANIC: xlog flush request 0/201EC70 is not satisfied --- flushed only to 0/2000088
2022-02-25 02:25:16.657 UTC [7384] LOG: checkpointer process (PID 7385) was terminated by signal 6: Aborted
The purpose of the patch is to check whether the current LSN is beyond missingContrecPtr and, if so, to skip writing an overwrite contrecord that was already written by the old writer.
--
Sami Imseih
Amazon Web Services
Attachments:
v3-0001-Fix-missing-continuation-record-after-standby-pro.patch (application/octet-stream)
After some investigation, I finally concluded that we should reset
abortedRecPtr and missingContrecPtr at processing
XLOG_OVERWRITE_CONTRECORD.
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1953,6 +1953,11 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
timestamptz_to_str(xlrec.overwrite_time))));
+ /* We have safely skipped the aborted record */
+ abortedRecPtr = InvalidXLogRecPtr;
+ missingContrecPtr = InvalidXLogRecPtr;
+
/* Verifying the record should only happen once */
record->overwrittenRecPtr = InvalidXLogRecPtr;
}
The last check in the test against "resetting aborted record" is no
longer useful since it is already checked by
026_overwrite_contrecord.pl.
regards.
+1 for this. Resetting abortedRecPtr and missingContrecPtr after the broken record is skipped is cleaner. I also think it would make sense to move the "successfully skipped missing contrecord" message to after we invalidate the aborted and missing record pointers, like below.
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1948,15 +1948,15 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
LSN_FORMAT_ARGS(record->overwrittenRecPtr));
+ /* We have safely skipped the aborted record */
+ abortedRecPtr = InvalidXLogRecPtr;
+ missingContrecPtr = InvalidXLogRecPtr;
+
ereport(LOG,
(errmsg("successfully skipped missing contrecord at %X/%X, overwritten at %s",
LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
timestamptz_to_str(xlrec.overwrite_time))));
Also, instead of a new test, the current 026_overwrite_contrecord.pl test can include a promotion test at the end.
Attaching a new patch for review.
--
Sami Imseih
Amazon Web Services
Attachments:
v4-0001-Fix-missing-continuation-record-after-standby-promot.patch (application/octet-stream)
I have gone ahead and backpatched this all the way to 10 as well.
--
Sami Imseih
Amazon Web Services
Attachments:
v4-0001-Fix-missing-continuation-record-after-standby-promot-v10.patch (application/octet-stream)
v4-0001-Fix-missing-continuation-record-after-standby-promot-v11.patch (application/octet-stream)
v4-0001-Fix-missing-continuation-record-after-standby-promot-v12.patch (application/octet-stream)
v4-0001-Fix-missing-continuation-record-after-standby-promot-v13.patch (application/octet-stream)
v4-0001-Fix-missing-continuation-record-after-standby-promot-v14.patch (application/octet-stream)
On 2022-Mar-07, Imseih (AWS), Sami wrote:
I have gone ahead and backpatched this all the way to 10 as well.
Thanks! I pushed this now. I edited the test though: I don't
understand why you went to the trouble of setting stuff in order to call
'pg_ctl promote' (in different ways for older branches), when
$node_standby->promote does the same and is simpler to call. So I
changed the tests to do that. (I did verify that without the code fix,
the PANIC indeed is thrown.)
Thanks again,
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
We see another occurrence of this bug with the last patch applied in 13.7.
After a promotion we observe the following in the logs:
2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A8000060
2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR: current transaction is aborted, commands ignored until end of transaction block
However, the logs do not show "LOG: successfully skipped missing contrecord",
therefore we know that VerifyOverwriteContrecord
is not being called to invalidate the missingContrecPtr.
VerifyOverwriteContrecord(xl_overwrite_contrecord *xlrec, XLogReaderState *state)
{
if (xlrec->overwritten_lsn != state->overwrittenRecPtr)
elog(FATAL, "mismatching overwritten LSN %X/%X -> %X/%X",
(uint32) (xlrec->overwritten_lsn >> 32),
(uint32) xlrec->overwritten_lsn,
(uint32) (state->overwrittenRecPtr >> 32),
(uint32) state->overwrittenRecPtr);
/* We have safely skipped the aborted record */
abortedRecPtr = InvalidXLogRecPtr;
missingContrecPtr = InvalidXLogRecPtr;
ereport(LOG,
(errmsg("successfully skipped missing contrecord at %X/%X, overwritten at %s",
(uint32) (xlrec->overwritten_lsn >> 32),
(uint32) xlrec->overwritten_lsn,
timestamptz_to_str(xlrec->overwrite_time))));
We think this behavior occurs because VerifyOverwriteContrecord
was never called.
Are there other places where missingContrecPtr
should be invalidated, such as after a successful promotion?
--
Sami Imseih
Amazon Web Services
On Tue, Feb 22, 2022 at 07:20:55PM +0000, Imseih (AWS), Sami wrote:
The overwrite_contrecord was introduced in 13.5 with https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ff9f111bce24.
Attached is a patch and a TAP test to handle this condition. The
patch ensures that an overwrite_contrecord is only created if the
missingContrecPtr is ahead of the last wal record.
The test you are introducing to force a complete segment to be filled
is funky, and kind of nice actually while being cheap. This part
particularly makes the test predictable:
++unlink $node->basedir . "/pgdata/pg_wal/$endfile"
++ or die "could not unlink " . $node->basedir
. "/pgdata/pg_wal/$endfile: $!";
I really like that.
With the patch, running the same tap test succeeds and a PANIC is
not observed.
This needs a very close lookup, I'll try to check all that except if
somebody beats me to it.
--
Michael
On Fri, May 27, 2022 at 08:53:03AM +0900, Michael Paquier wrote:
This needs a very close lookup, I'll try to check all that except if
somebody beats me to it.
Please ignore that.. I need more coffee, and likely a break.
--
Michael
At Thu, 26 May 2022 19:57:41 +0000, "Imseih (AWS), Sami" <simseih@amazon.com> wrote in
We see another occurrence of this bug with the last patch applied in 13.7.
After a promotion we observe the following in the logs:
...
We think it's because VerifyOverwriteContrecord was not
called, which is why we see this behavior.
Are there other places where missingContrecPtr
should be invalidated, such as after a successful promotion?
The only cause known to me for EndOfLog being moved to such a location
is missingContrecPtr. But if the next record is not
XLOG_OVERWRITE_CONTRECORD, recovery should have stopped there. And if
XLOG_OVERWRITE_CONTRECORD is there, VerifyOverwriteContrecord should
have been called.
Could you inspect the WAL files of that environment and see whether the first
record of the '7/A8' segment is an OVERWRITE_CONTRECORD? I'm not saying that
alone makes progress on this, but it could be the first step.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
After further research, we found the following.
Testing on 13.6 with the attached patch we see
that the missingContrecPtr is being incorrectly
set on the standby and the promote in the tap
test fails.
Per the comments in xlog.c, the
missingContrecPtr should not be set when
in standby mode.
/*
* When not in standby mode we find that WAL ends in an incomplete
* record, keep track of that record. After recovery is done,
* we'll write a record to indicate downstream WAL readers that
* that portion is to be ignored.
*/
if (!StandbyMode &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
elog(LOG, "missingContrecPtr == %ld", missingContrecPtr);
}
If StandbyModeRequested is checked instead, which
checks for the presence of a standby signal file,
the missingContrecPtr is not set on the
standby and the test succeeds.
if (!StandbyModeRequested &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
elog(LOG, "missingContrecPtr == %ld", missingContrecPtr);
}
If this is a bug, as it appears to be, the original patch
to resolve this issue is not needed, and the ideal fix
is to ensure that a standby does not set
missingContrecPtr.
Would like to see what others think.
Thanks
--
Sami Imseih
Amazon Web Services
Attachments:
0001-Testing_patch[2][2].patch (application/octet-stream)
At Fri, 27 May 2022 02:01:27 +0000, "Imseih (AWS), Sami" <simseih@amazon.com> wrote in
After further research, we found the following.
Testing on 13.6 with the attached patch we see
that the missingContrecPtr is being incorrectly
set on the standby and the promote in the tap
test fails.
Per the comments in xlog.c, the
missingContrecPtr should not be set when
in standby mode.
/*
* When not in standby mode we find that WAL ends in an incomplete
* record, keep track of that record. After recovery is done,
* we'll write a record to indicate downstream WAL readers that
* that portion is to be ignored.
*/
if (!StandbyMode &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
elog(LOG, "missingContrecPtr == %ld", missingContrecPtr);
}
If StandbyModeRequested is checked instead, which
checks for the presence of a standby signal file,
the missingContrecPtr is not set on the
standby and the test succeeds.
if (!StandbyModeRequested &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
elog(LOG, "missingContrecPtr == %ld", missingContrecPtr);
}
If this is a bug, as it appears to be, the original patch
to resolve this issue is not needed, and the ideal fix
is to ensure that a standby does not set
missingContrecPtr.
Would like to see what others think.
Due to lack of information, I don't have a clear idea of what is
happening here. If that change "fixed" the "issue", that seems to mean
that crash recovery passed an immature contrecord (@6/A8000060) but
recovery did not stop there and continued until 10/B1FA3D88. A
possibility is that crash recovery ended at the immature contrecord, then
the server moved to archive recovery mode and was able to continue
from the same (aborted) LSN on the archived WAL. This means the 6/A8
segment in pg_wal had somehow been removed after being archived, or the
segment 6/A7 (not A8) in pg_wal and in the archive are from different
histories, which seems to me to be what we wanted to prevent with
XLOG_OVERWRITE_CONTRECORD. Didn't you use the same archive content
for repeated recovery testing?
Also, entering crash recovery means the cluster didn't have an idea of
how far it should recover to reach consistency; that information is
contained in the backup label and control file. That could happen
when a crashed primary is reused as-is as the new standby of the new
primary.
So.. I'd like to hear exactly what you did as the testing.
When standby mode is requested, if crash recovery fails with an immature
contrecord, I think we shouldn't continue recovery. But I'm not sure
we need to explicitly reject that case. Further study is needed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
So.. I'd like to hear exactly what you did as the testing.
When standby mode is requested, if crash recovery fails with an immature
contrecord, I think we shouldn't continue recovery. But I'm not sure
we need to explicitly reject that case. Further study is needed.
Here are more details about my findings and testing.
Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff,
we still see the issue, with the post-promotion checkpoint
resulting in "request to flush past end of generated WAL";
i.e.
2022-05-25 00:35:38 UTC::@:[371]:LOG: checkpoint starting: immediate force wait wal time
2022-05-25 00:35:38 UTC::@:[371]:LOG: request to flush past end of generated WAL; request 10/B1FA3D88, currpos 7/A8000060
2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A8000060
2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR: current transaction is aborted, commands ignored until end of transaction block
The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff
was to make sure the standby skips the overwrite contrecord.
However, we still see that "missingContrecPtr" is being set
on the standby before promotion; after the instance is
promoted, the missingContrecPtr is written to WAL
and the subsequent flush throws a "PANIC: xlog flush request".
To reproduce using TAP tests:
1) apply the attached patch 0001-Patch_to_repro.patch to the head branch.
This patch adds logging when an OverWriteContRecord
is created and comments out the invalidation code
added in commit 9d92582abf918215d27659d45a4c9e78bda50aff
2) Run a tap test under "test/recovery"
make check PROVE_TESTS='t/026_overwrite_contrecord.pl'
3) What is observed in the logfiles for both the standby
and primary instances in the tap test is
that an overwrite contrecord is created on the primary,
which is correct, but also on the standby after promotion,
which is incorrect, as the contrecord was already skipped during recovery.
simseih@88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo'
2022-05-27 13:17:50.843 CDT [98429] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
simseih@88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo'
2022-05-27 13:17:51.361 CDT [98421] LOG: received promote request
2022-05-27 13:17:51.394 CDT [98421] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
simseih@88665a22795f recovery %
What we found:
1. missingContrecPtr is set when
StandbyMode is false, therefore
only a writer should set this value
and a record is then sent downstream.
But a standby going through crash
recovery will always have StandbyMode = false,
causing the missingContrecPtr to be incorrectly
set.
2. If StandbyModeRequested is checked instead,
we ensure that a standby will not set a
missingContrecPtr.
3. After applying the patch below, the tap test succeeded
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5ee90b6..a727aaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
* we'll write a record to indicate to downstream WAL readers that
* that portion is to be ignored.
*/
- if (!StandbyMode &&
+ if (!StandbyModeRequested &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
So, it might be that the best fix is not the commit in 9d92582abf918215d27659d45a4c9e78bda50aff
but to check StandbyModeRequested = false before setting
missingContrecPtr.
Thank you
---
Sami Imseih
Amazon Web Services
Attachments:
0001-Patch_to_repro.patch (application/octet-stream)
Hi,
As an update we have a test suite running for the last few days with
constant workload and failovers/promotions. With the proposed change
from Sami we no longer see PANICs due to this.
The comment indicates that the abortedRecPtr and missingContRecPtr
should only be set when the database is a writer since it's used to
write a new record to downstream readers. !StandbyMode is a poor proxy
for this, as Sami mentioned, since there are cases when a replica going
through crash recovery would then set this record.
Thanks,
John H
On Fri, May 27, 2022 at 07:01:37PM +0000, Imseih (AWS), Sami wrote:
What we found:
1. missingContrecPtr is set when
StandbyMode is false, therefore
only a writer should set this value
and a record is then sent downstream.
But a standby going through crash
recovery will always have StandbyMode = false,
causing the missingContrecPtr to be incorrectly
set.
That stands as true as far as I know, StandbyMode would be switched
only once we get out of crash recovery, and only if archive recovery
completes when there is a restore_command.
2. If StandbyModeRequested is checked instead,
we ensure that a standby will not set a
missingContrecPtr.
3. After applying the patch below, the tap test succeeded
Hmm. I have not looked at that in depth, but if the intention is to
check that the database is able to write WAL, looking at
XLogCtl->SharedRecoveryState would be the way to go because that's the
flip switching between crash recovery, archive recovery and the end of
recovery (when WAL can be safely written).
The check in xlogrecovery_redo() still looks like a good thing to have
anyway, because we know that we can safely skip the contrecord. Now,
for any patch produced, could the existing TAP test be extended so that
we are able to get a PANIC even if we keep around the sanity check in
xlogrecovery_redo()? That would likely involve an immediate shutdown
of a standby followed by a start sequence.
--
Michael
At Mon, 20 Jun 2022 16:13:43 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Fri, May 27, 2022 at 07:01:37PM +0000, Imseih (AWS), Sami wrote:
What we found:
1. missingContrecPtr is set when StandbyMode is false, therefore only a
writer should set this value and a record is then sent downstream. But a
standby going through crash recovery will always have StandbyMode = false,
causing the missingContrecPtr to be incorrectly set.

That stands as true as far as I know, StandbyMode would be switched
only once we get out of crash recovery, and only if archive recovery
completes when there is a restore_command.
Anyway, the change:

- abortedRecPtr = InvalidXLogRecPtr;
- missingContrecPtr = InvalidXLogRecPtr;
+ //abortedRecPtr = InvalidXLogRecPtr;
+ //missingContrecPtr = InvalidXLogRecPtr;

is injecting a bug where an invalid "aborted contrecord" record can be
inserted under certain conditions. If a bug is intentionally injected,
it's quite natural that some behavior gets broken.
2. If StandbyModeRequested is checked instead, we ensure that a standby
will not set a missingContrecPtr.

3. After applying the patch below, the tap test succeeded.
Hmm. I have not looked at that in depth, but if the intention is to
check that the database is able to write WAL, looking at
XLogCtl->SharedRecoveryState would be the way to go because that's the
flip switching between crash recovery, archive recovery and the end of
recovery (when WAL can be safely written).
What we are checking there is "we are going to write WAL", not "we are
writing".

(!StandbyMode && StandbyModeRequested) means the server has been
running crash recovery before starting archive recovery. In that
case, the server is supposed to continue with archived WAL without
inserting a record. However, if no archived WAL is available and the
server is promoted, the server needs to insert an "aborted contrecord"
record again. (I'm not sure how that case happens in the field,
though.)
So I don't think !StandbyModeRequested is the right thing
here. Actually the attached test fails with the fix.
The check in xlogrecovery_redo() still looks like a good thing to have
anyway, because we know that we can safely skip the contrecord. Now,
for any patch produced, could the existing TAP test be extended so as
we are able to get a PANIC even if we keep around the sanity check in
xlogrecovery_redo(). That would likely involve an immediate shutdown
of a standby followed by a start sequence?
Thus, I still don't see what happened on Imseih's side, but I can
cause a PANIC with some rather tricky steps, which I don't think are
valid. This is why I wanted to know the exact steps that cause the
PANIC.

The first attachment is a PoC of the TAP test (it uses system()..), and
the second is a tentative fix for it. (I don't like the fix either,
though...)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Jun 20, 2022 at 7:28 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Hmm. I have not looked at that in depth, but if the intention is to
check that the database is able to write WAL, looking at
XLogCtl->SharedRecoveryState would be the way to go because that's the
flip switching between crash recovery, archive recovery and the end of
recovery (when WAL can be safely written).

What we are checking there is "we are going to write WAL", not "we are
writing".

(!StandbyMode && StandbyModeRequested) means the server has been
running crash recovery before starting archive recovery. In that
case, the server is supposed to continue with archived WAL without
inserting a record. However, if no archived WAL is available and the
server is promoted, the server needs to insert an "aborted contrecord"
record again. (I'm not sure how that case happens in the field,
though.)

So I don't think !StandbyModeRequested is the right thing here.
Actually the attached test fails with the fix.
It seems to me that what we want to do is: if we're about to start
allowing WAL writes, then consider whether to insert an aborted
contrecord record. Now, if we are about to start allowing WAL writes,
we must determine the LSN at which the first such record will be
written. Then there are two possibilities: either that LSN is on an
existing record boundary, or it isn't. In the former case, no aborted
contrecord record is required. In the latter case, we're writing at an
LSN which would have been in the middle of a previous record, except
that the record in question was never finished, or at least we don't
have all of it. So the first WAL we insert at our chosen starting LSN
must be an aborted contrecord record.
I'm not quite sure I understand the nature of the remaining bug here,
but this logic seems quite sketchy to me:
/*
* When not in standby mode we find that WAL ends in an incomplete
* record, keep track of that record. After recovery is done,
* we'll write a record to indicate to downstream WAL readers that
* that portion is to be ignored.
*/
if (!StandbyMode &&
!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
{
abortedRecPtr = xlogreader->abortedRecPtr;
missingContrecPtr = xlogreader->missingContrecPtr;
}
I don't see what StandbyMode has to do with anything here. The
question is whether the place where we're going to begin writing WAL
is on an existing record boundary or not. If it so happens that when
StandbyMode is turned on we can never decide to promote in the middle
of a record, then maybe there's no need to track this information when
StandbyMode = true, but I don't see a good reason why we should make
that assumption. What if in the future somebody added a command that
says "don't keep trying to read more WAL, just promote RIGHT HERE?". I
think this logic would surely be incorrect in that case. It feels to
me like the right thing to do is to always keep track if WAL ends in
an incomplete record, and then when we promote, we write an aborted
contrecord record if WAL ended in an incomplete record, full stop.
Now, we do need to keep in mind that, while in StandbyMode, we might
reach the end of WAL more than once, because we have a polling loop
that looks for more WAL. So it does not work to just set the values
once and then assume that's the whole truth forever. But why not
handle that by storing the abortedRecPtr and missingContrecPtr
unconditionally after every call to XLogPrefetcherReadRecord()? They
will go back and forth between XLogRecPtrIsInvalid and other values
many times but the last value should -- I think -- be however things
ended up at the point where we decided to stop reading WAL.
I haven't really dived into this issue much so I might be totally off
base, but it just doesn't feel right to me that this should depend on
whether we're in standby mode. That seems at best incidentally related
to whether an aborted contrecord record is required.
P.S. "aborted contrecord record" is really quite an awkward turn of phrase!
--
Robert Haas
EDB: http://www.enterprisedb.com