Infinite loop in XLogPageRead() on standby
Hello hackers,
I spent some time debugging an issue with standby not being able to
continue streaming after failover.
The problem manifests itself in the following log messages:
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to "port=58669
host=/tmp/dn20WVmNqF"
LOG: restored log file "000000010000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000010000000000000003,
LSN 0/301A000, offset 106496
LOG: fetching timeline history file for timeline 2 from primary server
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3019158.
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "00000002.history" from archive
LOG: new target timeline is 2
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: started streaming WAL from primary at 0/3000000 on timeline 2
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
FATAL: terminating walreceiver process due to administrator command
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: waiting for WAL to become available at 0/301A04E
LOG: restored log file "000000020000000000000003" from archive
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
LOG: invalid magic number 0000 in WAL segment 000000020000000000000003,
LSN 0/301A000, offset 106496
The problem happens when standbys have received only the first part of a
WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record
and writes END_OF_RECOVERY instead. If, in addition to that, someone calls
pg_switch_wal(), there is a chance that the SWITCH record will also fit on
the page where the discarded part used to be. As a result, the other
standby (the one that wasn't promoted) will infinitely keep trying to
decode a WAL record that spans multiple pages by reading the next page,
which is filled with zero bytes. And this next page will never be written,
because the new primary will be writing to a new WAL file after
pg_switch_wal().
Restarting the stuck standby fixes the problem, because it will first read
the history file and therefore will never read the incomplete WAL file from
the old timeline. All major versions starting from v13 are affected
(including the master branch), because they allow changing the
primary_conninfo GUC with a reload.
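For reference, the LSN and offset in the log above are consistent with the
default WAL geometry: 0/301A000 lands exactly on a page boundary 106496
bytes into segment 3. A tiny standalone illustration of that arithmetic
(this assumes the default 16MB wal_segment_size and 8kB XLOG_BLCKSZ; it is
not PostgreSQL code):

    /*
     * Standalone illustration, not PostgreSQL code: map the LSN from the
     * log above to its segment number, in-segment offset, and page start,
     * assuming the default 16MB wal_segment_size and 8kB XLOG_BLCKSZ.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define WAL_SEGMENT_SIZE (16u * 1024 * 1024)
    #define XLOG_BLCKSZ 8192

    int
    main(void)
    {
        uint64_t lsn = 0x0301A000;    /* "0/301A000" from the log */

        printf("segment %llu, offset %llu, page start %llX\n",
               (unsigned long long) (lsn / WAL_SEGMENT_SIZE),    /* 3 */
               (unsigned long long) (lsn % WAL_SEGMENT_SIZE),    /* 106496 */
               (unsigned long long) (lsn - lsn % XLOG_BLCKSZ));  /* 301A000 */
        return 0;
    }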
Please find attached the TAP test that reproduces the problem.
To be honest, I don't know yet how to fix it nicely. I am thinking about
returning XLREAD_FAIL from XLogPageRead() if it has suddenly switched to a
new timeline while trying to read a page and the page is invalid.
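Roughly, such a check could look like the sketch below; it is untested and
uses stand-in names for the real xlogrecovery.c state, so treat it as an
illustration of the idea only:

    /*
     * Untested sketch of the idea only; the names are stand-ins for the
     * real xlogrecovery.c state, not PostgreSQL code.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef enum
    {
        SKETCH_SUCCESS,    /* page is valid, keep going */
        SKETCH_RETRY,      /* invalid page, retry the same source (status quo) */
        SKETCH_FAIL        /* invalid page after a TLI switch: give up, so the
                            * caller re-reads the record on the new timeline */
    } SketchOutcome;

    static SketchOutcome
    page_read_outcome(bool page_header_valid,
                      uint32_t file_tli,       /* TLI of the file being read */
                      uint32_t recovery_tli)   /* latest TLI recovery knows of */
    {
        if (page_header_valid)
            return SKETCH_SUCCESS;
        if (recovery_tli > file_tli)
            return SKETCH_FAIL;   /* would map to XLREAD_FAIL in XLogPageRead() */
        return SKETCH_RETRY;      /* today's behavior: keep retrying the page */
    }

    int
    main(void)
    {
        /* an invalid page from a TLI 1 file after recovery moved to TLI 2 */
        printf("%d\n", page_read_outcome(false, 1, 2));    /* SKETCH_FAIL */
        return 0;
    }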
--
Regards,
--
Alexander Kukushkin
On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
I spent some time debugging an issue with standby not being able to
continue streaming after failover. The problem happens when standbys have
received only the first part of a WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record
and writes END_OF_RECOVERY instead. If, in addition to that, someone calls
pg_switch_wal(), there is a chance that the SWITCH record will also fit on
the page where the discarded part used to be. As a result, the other
standby (the one that wasn't promoted) will infinitely keep trying to
decode a WAL record that spans multiple pages by reading the next page,
which is filled with zero bytes. And this next page will never be written,
because the new primary will be writing to a new WAL file after
pg_switch_wal().
Wow. Have you seen that in an actual production environment?
I was just trying your TAP test to see it looping on a single record
as you mentioned:
2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in
WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
2024-02-29 12:57:44.884 JST [2555] LOG: invalid magic number 0000 in
WAL segment 000000020000000000000003, LSN 0/301A000, offset 106496
Restarting the stuck standby fixes the problem, because it will first read
the history file and therefore will never read the incomplete WAL file from
the old timeline. All major versions starting from v13 are affected
(including the master branch), because they allow changing the
primary_conninfo GUC with a reload.
Still, that's not nice at a large scale, because you would not know
about the problem until your monitoring tools raise alarms because
some nodes in your cluster start lagging behind.
Please find attached the TAP test that reproduces the problem.
my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
    "\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);
So you are emulating a failure by filling with zeros the second page,
where the last emit_message() generated a record, and the page before
that includes the continuation record. Then you abuse WAL archiving to
force the replay of the last record. That's kind of cool.
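In effect the test does the equivalent of the following (a rough C
rendering, assuming the default 8kB page size; the segment name in main()
is just the one from the log upthread, the Perl helpers above are the real
thing):

    /*
     * Rough C rendering of what the Perl helpers do, assuming the default
     * 8kB XLOG_BLCKSZ: overwrite one page of a WAL segment file with zeros.
     */
    #include <stdio.h>
    #include <string.h>

    #define XLOG_BLCKSZ 8192

    static int
    zero_page(const char *path, long page_offset)
    {
        char  zeros[XLOG_BLCKSZ];
        FILE *f = fopen(path, "r+b");

        if (f == NULL)
            return -1;
        memset(zeros, 0, sizeof(zeros));
        if (fseek(f, page_offset, SEEK_SET) != 0 ||
            fwrite(zeros, 1, sizeof(zeros), f) != sizeof(zeros))
        {
            fclose(f);
            return -1;
        }
        return fclose(f);
    }

    int
    main(void)
    {
        /* hypothetical path; segment name and offset from the log upthread */
        return zero_page("000000010000000000000003", 106496) == 0 ? 0 : 1;
    }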
To be honest, I don't know yet how to fix it nicely. I am thinking about
returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
timeline while trying to read a page and if this page is invalid.
Hmm. I suspect that you may be right about a TLI change when reading a
page. There are a bunch of side cases with continuation records and
header validation around XLogReaderValidatePageHeader(). Perhaps you
have an idea of a patch to show your point?
Nit. In your test, it seems to me that you should not call
set_standby_mode and enable_restoring directly; just rely on
has_restoring with the standby option included.
--
Michael
At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
I spent some time debugging an issue with standby not being able to
continue streaming after failover. The problem happens when standbys have
received only the first part of a WAL record that spans multiple pages.
In this case the promoted standby discards the first part of the WAL record
and writes END_OF_RECOVERY instead. If, in addition to that, someone calls
pg_switch_wal(), there is a chance that the SWITCH record will also fit on
the page where the discarded part used to be. As a result, the other
standby (the one that wasn't promoted) will infinitely keep trying to
decode a WAL record that spans multiple pages by reading the next page,
which is filled with zero bytes. And this next page will never be written,
because the new primary will be writing to a new WAL file after
pg_switch_wal().
In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.
I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.
In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment. But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming. A similar problem
to the issue here was seen at segment boundaries, before we introduced
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag, which prevents overwriting
a WAL file that is already archived. However, in this case, the second
standby won't see the broken record because it cannot be in a
non-partial segment in the archive, and the new primary streams
END_OF_RECOVERY instead of the broken record.
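For reference, here are the page-header bits involved, in a simplified
sketch loosely mirroring access/xlog_internal.h (the struct is a stand-in,
not the exact XLogPageHeaderData definition):

    /*
     * Simplified sketch of the WAL page header, loosely mirroring
     * access/xlog_internal.h; not the exact XLogPageHeaderData definition.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define XLP_FIRST_IS_CONTRECORD           0x0001
    #define XLP_LONG_HEADER                   0x0002
    #define XLP_BKP_REMOVABLE                 0x0004
    #define XLP_FIRST_IS_OVERWRITE_CONTRECORD 0x0008

    typedef struct
    {
        uint16_t xlp_magic;      /* 0x0000 on the zero-filled pages in the log */
        uint16_t xlp_info;       /* XLP_* flag bits */
        uint32_t xlp_tli;
        uint64_t xlp_pageaddr;
        uint32_t xlp_rem_len;    /* bytes of a record continued from the
                                  * previous page */
    } PageHeaderSketch;

    /* true when the page starts with the tail of a record begun earlier */
    static bool
    page_continues_record(const PageHeaderSketch *hdr)
    {
        return (hdr->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0;
    }

    int
    main(void)
    {
        PageHeaderSketch zeroed = {0};    /* an all-zero page: magic 0, no flags */

        return page_continues_record(&zeroed) ? 1 : 0;
    }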
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Michael,
On Thu, 29 Feb 2024 at 06:05, Michael Paquier <michael@paquier.xyz> wrote:
Wow. Have you seen that in an actual production environment?
Yes, we see it regularly, and it is reproducible in test environments as
well.
my $start_page = start_of_page($end_lsn);
my $wal_file = write_wal($primary, $TLI, $start_page,
    "\x00" x $WAL_BLOCK_SIZE);
# copy the file we just "hacked" to the archive
copy($wal_file, $primary->archive_dir);

So you are emulating a failure by filling with zeros the second page,
where the last emit_message() generated a record, and the page before
that includes the continuation record. Then you abuse WAL archiving to
force the replay of the last record. That's kind of cool.
Right, at this point it is easier than causing an artificial crash on
the primary right after it has finished writing just one page.
To be honest, I don't know yet how to fix it nicely. I am thinking about
returning XLREAD_FAIL from XLogPageRead() if it has suddenly switched to a
new timeline while trying to read a page and the page is invalid.
Hmm. I suspect that you may be right about a TLI change when reading a
page. There are a bunch of side cases with continuation records and
header validation around XLogReaderValidatePageHeader(). Perhaps you
have an idea of a patch to show your point?
Not yet, but hopefully I will get something done next week.
Nit. In your test, it seems to me that you should not call
set_standby_mode and enable_restoring directly; just rely on
has_restoring with the standby option included.
Thanks, I'll look into it.
--
Regards,
--
Alexander Kukushkin
Hi Kyotaro,
On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.
Sure, I know. But in this case the async standby received and flushed
exactly the same amount of WAL as the promoted one.
I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.
It easily happens if the primary crashed and the standbys didn't receive
another page with the continuation record.
In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment.
Well, maybe, but apparently the standby is busy trying to decode a record
that spans multiple pages, and it is just waiting forever for the next
page to arrive. Also, the restart "fixes" the problem because, indeed, it
is then reading the file from the beginning.
But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.
As I already said, this is a simple way to emulate a primary crash while
the standbys are receiving WAL.
It could easily happen that a record spanning multiple pages is not fully
received and flushed.
--
Regards,
--
Alexander Kukushkin
On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

Sure, I know. But in this case the async standby received and flushed
exactly the same amount of WAL as the promoted one.
Ugh. If it can happen transparently to the user without the user
knowing directly about it, that does not sound good to me. I did not
look very closely at monitoring tools available out there, but if both
standbys flushed the same WAL locations a rewind should not be
required. It is not something that monitoring tools would be able to
detect because they just look at LSNs.
In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment.

Well, maybe, but apparently the standby is busy trying to decode a record
that spans multiple pages, and it is just waiting forever for the next
page to arrive. Also, the restart "fixes" the problem because, indeed, it
is then reading the file from the beginning.
What happens if the continuation record spans across multiple segment
file boundaries in this case? We would go back to the beginning of
the segment where the record spanning multiple segments began,
right? (I may recall this part of contrecords incorrectly, feel free
to correct me if necessary.)
But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.

As I already said, this is a simple way to emulate a primary crash while
the standbys are receiving WAL.
It could easily happen that a record spanning multiple pages is not fully
received and flushed.
I think that's OK to do so at test level to force a test in the
backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
proved that this can be designed to be cheap and stable across the
buildfarm fleet.
For anything like that, the result had better have solid test
coverage, where perhaps we'd better refactor some of the routines of
039_end_of_wal.pl into a module to use them here, rather than
duplicate the code. The other test has a few assumptions with the
calculation of page boundaries, and I'd rather not duplicate that
across the tree.
Seeing that Alexander is a maintainer of Patroni, which is very
probably used by his employer across a large set of PostgreSQL
instances, if he says that he's seen that in the field, that's good
enough for me to respond to the problem, especially if it involves
reconnecting a standby to a promoted node where both have flushed the
same LSN. Now the level of response also depends on the invasiveness of
the change, and we need a very careful evaluation here.
--
Michael
At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:

On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

Sure, I know. But in this case the async standby received and flushed
exactly the same amount of WAL as the promoted one.

Ugh. If it can happen transparently to the user without the user
knowing directly about it, that does not sound good to me. I did not
look very closely at monitoring tools available out there, but if both
standbys flushed the same WAL locations a rewind should not be
required. It is not something that monitoring tools would be able to
detect because they just look at LSNs.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switching, replication is expected to continue
from the beginning of the last replayed segment.

Well, maybe, but apparently the standby is busy trying to decode a record
that spans multiple pages, and it is just waiting forever for the next
page to arrive. Also, the restart "fixes" the problem because, indeed, it
is then reading the file from the beginning.

What happens if the continuation record spans across multiple segment
file boundaries in this case? We would go back to the beginning of
the segment where the record spanning multiple segments began,
right? (I may recall this part of contrecords incorrectly, feel free
to correct me if necessary.)
After reading this, I came up with the possibility that walreceiver
recovers more quickly than the interval between calls to
WaitForWALToBecomeAvailable(). If walreceiver disconnects after a call
to the function, and then somehow recovers the connection before the
next call, the function doesn't notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
after the last call to the function.
But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.

As I already said, this is a simple way to emulate a primary crash while
the standbys are receiving WAL.
It could easily happen that a record spanning multiple pages is not fully
received and flushed.

I think that's OK to do so at test level to force a test in the
backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
proved that this can be designed to be cheap and stable across the
buildfarm fleet.
Yeah, I agree that it clearly illustrates the final state after the
issue happened, but if my assumption above is correct, the test
doesn't manifest the real issue.
For anything like that, the result had better have solid test
coverage, where perhaps we'd better refactor some of the routines of
039_end_of_wal.pl into a module to use them here, rather than
duplicate the code. The other test has a few assumptions with the
calculation of page boundaries, and I'd rather not duplicate that
across the tree.
I agree on that point.
Seeing that Alexander is a maintainer of Patroni, which is very
probably used by his employer across a large set of PostgreSQL
instances, if he says that he's seen that in the field, that's good
enough for me to respond to the problem, especially if it involves
reconnecting a standby to a promoted node where both have flushed the
same LSN. Now the level of response also depends on the invasiveness of
the change, and we need a very careful evaluation here.
I don't mean to say that we should respond with DNF to this "issue" at
all. I simply wanted to make the real issue clear.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
After reading this, I came up with the possibility that walreceiver
recovers more quickly than the interval between calls to
WaitForWALToBecomeAvailable(). If walreceiver disconnects after a call
to the function, and then somehow recovers the connection before the
next call, the function doesn't notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
after the last call to the function.
That was my mistake. The function performs reconnection by itself.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 01 Mar 2024 12:04:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
After reading this, I came up with the possibility that walreceiver
recovers more quickly than the interval between calls to
WaitForWALToBecomeAvailable(). If walreceiver disconnects after a call
to the function, and then somehow recovers the connection before the
next call, the function doesn't notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
after the last call to the function.

That was my mistake. The function performs reconnection by itself.
Anyway, our current policy here is to avoid record rereads beyond
source switches. However, fixing this seems to require that source
switches cause record rereads, unless some additional information is
available to know whether the replay LSN needs to back up.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Anyway, our current policy here is to avoid record rereads beyond
source switches. However, fixing this seems to require that source
switches cause record rereads, unless some additional information is
available to know whether the replay LSN needs to back up.
It seems to me that the error messages are related to commit 0668719801.
XLogPageRead:

    /*
     * Check the page header immediately, so that we can retry immediately if
     * it's not valid. This may seem unnecessary, because ReadPageInternal()
     * validates the page header anyway, and would propagate the failure up to
     * ReadRecord(), which would retry. However, there's a corner case with
     * continuation records, if a record is split across two pages such that
     * we would need to read the two pages from different sources. For
     ...
    if (StandbyMode &&
        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
    {
        /*
         * Emit this error right now then retry this page immediately. Use
         * errmsg_internal() because the message was already translated.
         */
        if (xlogreader->errormsg_buf[0])
            ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
                    (errmsg_internal("%s", xlogreader->errormsg_buf)));
This code intends to prevent a page-header error from causing a record
reread when a record needs to be read from multiple sources. We could
restrict this to fire only at segment boundaries. At segment boundaries,
we won't let the LSN back up, by using XLP_FIRST_IS_CONTRECORD.
Having thought up to this point, I now believe that we should
completely prevent the LSN from going backward in any case. One drawback
is that such a fix cannot be back-patched.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello Michael, Kyotaro,
Please find attached the patch fixing the problem and the updated TAP
test that addresses your nit.
--
Regards,
--
Alexander Kukushkin
At Tue, 5 Mar 2024 09:36:44 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
Please find attached the patch fixing the problem and the updated TAP
test that addresses your nit.
Record-level retries happen when the upper layer detects errors. In my
previous mail, I cited code that is intended to prevent this at
segment boundaries. However, the resulting code applies to all page
boundaries, since we judged that the difference doesn't significantly
affect the outcome.
* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because ReadPageInternal()
* validates the page header anyway, and would propagate the failure up to
So, the following (tentative) change should also work.
xlogrecovery.c:

@@ -3460,8 +3490,10 @@ retry:
      * responsible for the validation.
      */
     if (StandbyMode &&
+        targetPagePtr % 0x100000 == 0 &&
         !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
     {
Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro,
Oh, now I understand what you mean. Is the retry supposed to happen only
when we are reading the very first page from the WAL file?
On Wed, 6 Mar 2024 at 09:57, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
xlogrecovery.c:

@@ -3460,8 +3490,10 @@ retry:
      * responsible for the validation.
      */
     if (StandbyMode &&
+        targetPagePtr % 0x100000 == 0 &&
         !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
     {
Hmm, I think you meant to use wal_segment_size, because 0x100000 is just
1MB. As a result, it currently works for you by accident.
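A quick standalone check illustrates the difference (assuming the default
16MB wal_segment_size; not PostgreSQL code):

    /*
     * Standalone check, assuming the default 16MB wal_segment_size: the
     * 0x100000 constant also fires on page addresses inside a segment.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define WAL_SEGMENT_SIZE (16u * 1024 * 1024)

    int
    main(void)
    {
        uint64_t lsns[] = {0x3000000, 0x3100000, 0x4000000};

        for (int i = 0; i < 3; i++)
            printf("0x%07llX: 1MB check %s, segment check %s\n",
                   (unsigned long long) lsns[i],
                   lsns[i] % 0x100000 == 0 ? "fires" : "quiet",
                   lsns[i] % WAL_SEGMENT_SIZE == 0 ? "fires" : "quiet");
        return 0;
    }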
Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.
Thank you for spending your time on it!
--
Regards,
--
Alexander Kukushkin
At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
Hmm, I think you meant to use wal_segment_size, because 0x100000 is just
1MB. As a result, it currently works for you by accident.
Oh, I once saw the fix work, but it seems not to have worked after some
point. The new issue was corruption of received WAL records on the
first standby, and it may be related to the setting.
Thus, I managed to reproduce precisely the same situation as you
described utilizing your script with modifications and some core
tweaks, and with the change above, I saw that the behavior was
fixed. However, for reasons unclear to me, it shows another issue, and
I am running out of time and need more caffeine. I'll continue
investigating this tomorrow.

Thank you for spending your time on it!
You're welcome, but I apologize for the delay in the work.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Mar 11, 2024 at 04:43:32PM +0900, Kyotaro Horiguchi wrote:
At Wed, 6 Mar 2024 11:34:29 +0100, Alexander Kukushkin <cyberdemn@gmail.com> wrote in
Thank you for spending your time on it!
You're welcome, but I apologize for the delay in the work.
Thanks for spending time on this. Everybody is busy with the last
commit fest, and the next minor release set is planned for May, so
there should still be time even after the feature freeze:
https://www.postgresql.org/developer/roadmap/
I should be able to come back to this thread around the beginning of
April. If you are able to make some progress in between, that would
surely be helpful.
Thanks,
--
Michael
At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Oh, I once saw the fix work, but it seems not to have worked after some
point. The new issue was corruption of received WAL records on the
first standby, and it may be related to the setting.
I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record to the end while the first standby was
promoting (but it had not yet been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline has
increased but the first segment of the new timeline is not yet available.
In this case, the second standby successfully reads the page-spanning
record on the old timeline even after noticing that the timeline ID has
increased, thanks to the robustness of XLogFileReadAnyTLI().
I think the primary change to XLogPageRead() that I suggested is correct
(assuming the use of wal_segment_size instead of the constant). However,
XLogFileReadAnyTLI() still has a chance to read the segment from the old
timeline after the second standby notices a timeline switch, leading to
the second issue. The second issue was fixed by preventing
XLogFileReadAnyTLI() from reading segments from timelines older than
those suggested by the latest timeline history (in other words, disabling
the "AnyTLI" part).
I recall that there was a discussion around commit 4bd0ad9e44 about the
objective of allowing segments to be read from timelines older than what
the timeline history suggests. In my faint memory, we concluded to
postpone the decision on removing that behavior due to uncertainty about
its objective. If there's no clear reason to continue using
XLogFileReadAnyTLI(), I suggest we stop using it and instead adopt
XLogFileReadOnTLHistory(), which would read segments that align precisely
with the timeline history.
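In rough terms, the check would be as below. XLogFileReadOnTLHistory()
does not exist yet, and the struct is a simplified stand-in for
TimeLineHistoryEntry, so this is only a sketch of the intended restriction:

    /*
     * Rough sketch only: the restriction a hypothetical
     * XLogFileReadOnTLHistory() would enforce. The struct is a simplified
     * stand-in for TimeLineHistoryEntry, not the real definition.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint32_t tli;
        uint64_t begin;    /* first LSN belonging to this timeline */
        uint64_t end;      /* switchpoint; UINT64_MAX for the current timeline */
    } HistoryEntrySketch;

    /*
     * May a segment covering [seg_start, seg_end) be read from timeline
     * `tli` according to the history? XLogFileReadAnyTLI() instead tries
     * any older timeline, which is what lets the stale segment in.
     */
    static bool
    segment_matches_history(uint64_t seg_start, uint64_t seg_end, uint32_t tli,
                            const HistoryEntrySketch *history, int n)
    {
        for (int i = 0; i < n; i++)
            if (history[i].tli == tli &&
                seg_start < history[i].end && seg_end > history[i].begin)
                return true;
        return false;
    }

    int
    main(void)
    {
        /* TLI 1 ends at 0/3019158, as in the log upthread; TLI 2 follows */
        HistoryEntrySketch hist[] = {
            {1, 0, 0x3019158},
            {2, 0x3019158, UINT64_MAX},
        };

        /* reading past the switchpoint on TLI 1 is no longer legitimate */
        printf("%d\n", segment_matches_history(0x3019158, 0x4000000, 1,
                                               hist, 2));    /* 0 */
        return 0;
    }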
Of course, regardless of the changes above, if recovery on the second
standby had reached the end of the page-spanning record before
redirection to the first standby, it would need pg_rewind to connect
to the first standby.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro,
On Wed, 13 Mar 2024 at 03:56, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record to the end while the first standby was
promoting (but it had not yet been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline has
increased but the first segment of the new timeline is not yet available.
In this case, the second standby successfully reads the page-spanning
record on the old timeline even after noticing that the timeline ID has
increased, thanks to the robustness of XLogFileReadAnyTLI().
Hmm, I don't think it can really be prevented.
There is always a chance that a standby that is not ahead of the other
standbys could be promoted, for reasons like:
1. The HA configuration doesn't allow certain nodes to be promoted.
2. It is an async standby (its name isn't listed in
synchronous_standby_names) and it was ahead of the promoted sync standby,
so there is no data loss from the client's point of view.
Of course, regardless of the changes above, if recovery on the second
standby had reached the end of the page-spanning record before
redirection to the first standby, it would need pg_rewind to connect
to the first standby.
Correct, IMO pg_rewind is the right way of solving it.
Regards,
--
Alexander Kukushkin
On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Oh, I once saw the fix work, but it seems not to have worked after some
point. The new issue was corruption of received WAL records on the
first standby, and it may be related to the setting.

I identified the cause of the second issue. When I tried to replay the
issue, the second standby accidentally received the old timeline's
last page-spanning record to the end while the first standby was
promoting (but it had not yet been read by recovery). In addition to that,
on the second standby, there's a time window where the timeline has
increased but the first segment of the new timeline is not yet available.
In this case, the second standby successfully reads the page-spanning
record on the old timeline even after noticing that the timeline ID has
increased, thanks to the robustness of XLogFileReadAnyTLI().

I think the primary change to XLogPageRead() that I suggested is correct
(assuming the use of wal_segment_size instead of the constant). However,
XLogFileReadAnyTLI() still has a chance to read the segment from the old
timeline after the second standby notices a timeline switch, leading to
the second issue. The second issue was fixed by preventing
XLogFileReadAnyTLI() from reading segments from timelines older than
those suggested by the latest timeline history (in other words, disabling
the "AnyTLI" part).

I recall that there was a discussion around commit 4bd0ad9e44 about the
objective of allowing segments to be read from timelines older than what
the timeline history suggests. In my faint memory, we concluded to
postpone the decision on removing that behavior due to uncertainty about
its objective. If there's no clear reason to continue using
XLogFileReadAnyTLI(), I suggest we stop using it and instead adopt
XLogFileReadOnTLHistory(), which would read segments that align precisely
with the timeline history.
This sounds very similar to the problem described in [1], and I think
both will be resolved by that change.

[1]: /messages/by-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com
Hi Michael and Kyotaro,
Now that beta1 has been released, I hope you are not so busy anymore, and
hence I would like to follow up on this problem.
Regards,
--
Alexander Kukushkin
On Tue, Jun 04, 2024 at 04:16:43PM +0200, Alexander Kukushkin wrote:
Now that beta1 has been released, I hope you are not so busy anymore, and
hence I would like to follow up on this problem.
I am still working on something for the v18 cycle that I'd like to
present before the beginning of the next commit fest, so I am a bit busy
getting that out first. Fingers crossed not to have open items to look
at. This thread is one of the things I have marked as an item to look
at, yes.
--
Michael