001_rep_changes.pl fails due to publisher stuck on shutdown
Hello hackers,
As a recent buildfarm test failure [1] shows:
[14:33:02.374](0.333s) ok 23 - update works with dropped subscriber column
### Stopping node "publisher" using mode fast
# Running: pg_ctl -D
/home/bf/bf-build/adder/HEAD/pgsql.build/testrun/subscription/001_rep_changes/data/t_001_rep_changes_publisher_data/pgdata
-m fast stop
waiting for server to shut down.. ... ... ... .. failed
pg_ctl: server does not shut down
# pg_ctl stop failed: 256
# Postmaster PID for node "publisher" is 2222549
[14:39:04.375](362.001s) Bail out! pg_ctl stop failed
001_rep_changes_publisher.log
2024-05-16 14:33:02.907 UTC [2238704][client backend][4/22:0] LOG: statement: DELETE FROM tab_rep
2024-05-16 14:33:02.925 UTC [2238704][client backend][:0] LOG: disconnection: session time: 0:00:00.078 user=bf
database=postgres host=[local]
2024-05-16 14:33:02.939 UTC [2222549][postmaster][:0] LOG: received fast shutdown request
2024-05-16 14:33:03.000 UTC [2222549][postmaster][:0] LOG: aborting any active transactions
2024-05-16 14:33:03.049 UTC [2222549][postmaster][:0] LOG: background worker "logical replication launcher" (PID
2223110) exited with exit code 1
2024-05-16 14:33:03.062 UTC [2222901][checkpointer][:0] LOG: shutting down
2024-05-16 14:39:04.377 UTC [2222549][postmaster][:0] LOG: received immediate shutdown request
2024-05-16 14:39:04.382 UTC [2222549][postmaster][:0] LOG: database system is shut down
the publisher node may hang on stopping.
I reproduced the failure (with aggressive autovacuum) locally and
discovered that it happens because:
1) checkpointer calls WalSndInitStopping() (which sends
PROCSIG_WALSND_INIT_STOPPING to walsender), and then spins inside
WalSndWaitStopping() indefinitely, because:
2) walsender receives the signal, sets got_STOPPING = true, but can't exit
WalSndLoop():
3) it never sets got_SIGUSR2 (to get into WalSndDone()) in
XLogSendLogical():
4) it never sets WalSndCaughtUp to true:
5) logical_decoding_ctx->reader->EndRecPtr can't reach flushPtr in
XLogSendLogical():
6) EndRecPtr doesn't advance in XLogNextRecord():
7) XLogDecodeNextRecord() fails to decode a record that crosses a page
boundary:
8) ReadPageInternal() (commented "Wait for the next page to become
available") constantly returns XLREAD_FAIL:
9) state->routine.page_read()/logical_read_xlog_page() constantly returns
-1:
10) flushptr = WalSndWaitForWal() stops advancing, because
got_STOPPING == true (see 2).
That is, the walsender doesn't let itself catch up if it gets the stop
signal while it's lagging behind and decoding a record requires reading
the next WAL page.
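To make the chain above concrete, here is a condensed, annotated sketch of the
two checks involved, pieced together from the walsender code quoted later in
this thread; it only illustrates the interaction and is not the literal source:
/* In logical_read_xlog_page() (walsender.c): the page_read callback
 * refuses to return data beyond what WalSndWaitForWal() reports as
 * flushed, and WalSndWaitForWal() stops advancing once got_STOPPING
 * is set (step 10). */
flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
if (flushptr < targetPagePtr + reqLen)
    return -1;          /* step 9: the continuation page is never read */
/* In XLogSendLogical() (walsender.c): the walsender only considers
 * itself caught up once the reader has consumed WAL up to flushPtr,
 * which cannot happen while the page-crossing record stays undecodable. */
if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
    WalSndCaughtUp = true;      /* never reached here (steps 4-5) */
if (WalSndCaughtUp && got_STOPPING)
    got_SIGUSR2 = true;         /* so WalSndDone() is never called (step 3) */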
Please look at the attached reproducing test (based on 001_rep_changes.pl).
It fails for me as below:
# 17
Bailout called. Further testing stopped: pg_ctl stop failed
FAILED--Further testing stopped: pg_ctl stop failed
make: *** [Makefile:21: check] Ошибка 255
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-05-16%2014%3A22%3A38
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-04-24%2014%3A38%3A35 (apparently the same)
Best regards,
Alexander
Attachments:
On Wed, 29 May 2024 at 16:30, Alexander Lakhin <exclusion@gmail.com> wrote:
Thank you, Alexander, for sharing the script. I was able to reproduce
the issue using the provided script. Furthermore, while investigating
its origins, I discovered that this problem persists across all
branches up to PG10 (the script needs slight adjustments to run it on
older versions). It's worth noting that this issue isn't a result of
recent version changes.
Regards,
Vignesh
On Thu, May 30, 2024 at 2:09 AM vignesh C <vignesh21@gmail.com> wrote:
Hi,
FWIW, using the provided script I was also able to reproduce the
problem on HEAD, but for me it was rarer: the script passed all 100
iterations three times, and it eventually failed on the 4th run, on the
75th iteration.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Wed, May 29, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
Hi Alexander,
FYI, by injecting a lot of logging, I’ve confirmed your findings: in
the failing scenario, the ‘got_SIGUSR2’ flag never gets set to true,
meaning WalSndLoop() cannot finish. Furthermore, I agree with your
step 8 finding that, when it fails, the ReadPageInternal() call (the
one in XLogDecodeNextRecord with the comment "Wait for the next page
to become available") constantly returns -1.
I will continue digging next week...
======
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi, I have reproduced this multiple times now.
I confirmed the initial post/steps from Alexander. i.e. The test
script provided [1] gets itself into a state where function
ReadPageInternal (called by XLogDecodeNextRecord and commented "Wait
for the next page to become available") constantly returns
XLREAD_FAIL. Ultimately the test times out because WalSndLoop() loops
forever, since it never calls WalSndDone() to exit the walsender
process.
~~~
I've made a patch to inject lots of logging, and when the test script
fails a cycle of function failures can be seen. I don't know how to
fix it yet, so I'm attaching my log results, hoping the information
may be useful for anyone familiar with this area of the code.
~~~
Attachment #1 "v1-0001-DEBUG-LOGGING.patch" -- Patch to inject some
logging. Be careful if you apply this because the resulting log files
can be huge (e.g. 3G)
Attachment #2 "bad8_logs_last500lines.txt" -- This is the last 500
lines of a 3G logfile from a failing test run.
Attachment #3 "bad8_logs_last500lines-simple.txt" -- Same log file as
above, but it's a simplified extract in which I showed the CYCLES of
failure more clearly.
Attachment #4 "bad8_digram" -- Same execution path information as from
the log files, but in diagram form (just to help me visualise the
logic more easily).
~~~
Just so you know, the test script does not always cause the problem.
Sometimes it happens after just 20 script iterations. Or, sometimes it
takes a very long time and multiple runs (e.g. 400-500 script
iterations). Either way, when the problem eventually occurs the CYCLES
of the ReadPageInternal() failures always have the same pattern
shown in these attached logs.
======
[1]: OP - /messages/by-id/f15d665f-4cd1-4894-037c-afdbe369287e@gmail.com
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachments:
v1-0001-DEBUG-LOGGING.patch
From 165f7bc6888968db3e2f952585e3a5e6524d2cda Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Thu, 6 Jun 2024 12:35:41 +1000
Subject: [PATCH v1] DEBUG LOGGING
---
src/backend/access/transam/xlog.c | 2 +
src/backend/access/transam/xlogreader.c | 144 +++++++++++++++++++++++++++++++-
src/backend/replication/walsender.c | 127 ++++++++++++++++++++++++++++
3 files changed, 272 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058..d1b5fc2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6554,12 +6554,14 @@ ShutdownXLOG(int code, Datum arg)
/*
* Signal walsenders to move to stopping state.
*/
+ elog(LOG, ">>> ShutdownXLOG: calling WalSndInitStopping");
WalSndInitStopping();
/*
* Wait for WAL senders to be in stopping state. This prevents commands
* from writing new WAL.
*/
+ elog(LOG, ">>> ShutdownXLOG: calling WalSndWaitStopping");
WalSndWaitStopping();
if (RecoveryInProgress())
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 37d2a57..434cd2e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -32,13 +32,15 @@
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
#include "replication/origin.h"
-
#ifndef FRONTEND
#include "pgstat.h"
+#include "utils/elog.h"
#else
#include "common/logging.h"
#endif
+bool ps_readpageinternal2_failing = false;
+
static void report_invalid_record(XLogReaderState *state, const char *fmt,...)
pg_attribute_printf(2, 3);
static void allocate_recordbuf(XLogReaderState *state, uint32 reclength);
@@ -594,12 +596,23 @@ restart:
* byte to cover the whole record header, or at least the part of it that
* fits on the same page.
*/
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: before ReadPageInternal-1");
+ }
+#endif
readOff = ReadPageInternal(state, targetPagePtr,
Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
+ {
+#ifndef FRONTEND
+ elog(LOG, ">>> XLogDecodeNextRecord: called ReadPageInternal-1, which returned %d", readOff);
+#endif
goto err;
+ }
/*
* ReadPageInternal always returns at least the page header, so we can
@@ -723,6 +736,12 @@ restart:
targetPagePtr += XLOG_BLCKSZ;
/* Wait for the next page to become available */
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: before ReadPageInternal-2");
+ }
+#endif
readOff = ReadPageInternal(state, targetPagePtr,
Min(total_len - gotlen + SizeOfXLogShortPHD,
XLOG_BLCKSZ));
@@ -730,7 +749,13 @@ restart:
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
+ {
+#ifndef FRONTEND
+ elog(LOG, ">>> XLogDecodeNextRecord: called ReadPageInternal-2, which returned %d", readOff);
+ ps_readpageinternal2_failing = true;
+#endif
goto err;
+ }
Assert(SizeOfXLogShortPHD <= readOff);
@@ -779,8 +804,22 @@ restart:
pageHeaderSize = XLogPageHeaderSize(pageHeader);
if (readOff < pageHeaderSize)
+ {
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: before ReadPageInternal-3");
+ }
+#endif
readOff = ReadPageInternal(state, targetPagePtr,
pageHeaderSize);
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: called ReadPageInternal-3, which returned %d", readOff);
+ }
+#endif
+ }
Assert(pageHeaderSize <= readOff);
@@ -790,8 +829,22 @@ restart:
len = pageHeader->xlp_rem_len;
if (readOff < pageHeaderSize + len)
+ {
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: before ReadPageInternal-4");
+ }
+#endif
readOff = ReadPageInternal(state, targetPagePtr,
pageHeaderSize + len);
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: called ReadPageInternal-4, which returned %d", readOff);
+ }
+#endif
+ }
memcpy(buffer, (char *) contdata, len);
buffer += len;
@@ -843,12 +896,26 @@ restart:
else
{
/* Wait for the record data to become available */
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: before ReadPageInternal-5");
+ }
+#endif
readOff = ReadPageInternal(state, targetPagePtr,
Min(targetRecOff + total_len, XLOG_BLCKSZ));
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
+ {
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> XLogDecodeNextRecord: called ReadPageInternal-5, which returned %d", readOff);
+ }
+#endif
goto err;
+ }
/* Record does not cross a page boundary */
if (!ValidXLogRecord(state, record, RecPtr))
@@ -1012,7 +1079,15 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
/* check whether we have all the requested data already */
if (targetSegNo == state->seg.ws_segno &&
targetPageOff == state->segoff && reqLen <= state->readLen)
+ {
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: return #1 readLen=%d", state->readLen);
+ }
+#endif
return state->readLen;
+ }
/*
* Invalidate contents of internal buffer before read attempt. Just set
@@ -1037,9 +1112,22 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
{
XLogRecPtr targetSegmentPtr = pageptr - targetPageOff;
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: before calling routine.page_read #1");
+ }
+#endif
readLen = state->routine.page_read(state, targetSegmentPtr, XLOG_BLCKSZ,
state->currRecPtr,
state->readBuf);
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: routine.page_read #1 returned readLen %d", readLen);
+ }
+#endif
+
if (readLen == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readLen < 0)
@@ -1050,16 +1138,36 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
state->readBuf))
+ {
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: XLogReadValidatePageHeader was false");
+ }
+#endif
goto err;
+ }
}
/*
* First, read the requested data length, but at least a short page header
* so that we can validate it.
*/
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: before calling routine.page_read #2");
+ }
+#endif
readLen = state->routine.page_read(state, pageptr, Max(reqLen, SizeOfXLogShortPHD),
state->currRecPtr,
state->readBuf);
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: routine.page_read #2 returned readLen %d", readLen);
+ }
+#endif
if (readLen == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readLen < 0)
@@ -1069,7 +1177,15 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
/* Do we have enough data to check the header length? */
if (readLen <= SizeOfXLogShortPHD)
+ {
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: readLen (%d) <= SizeOfXLogShortPHD (%ld)", readLen, SizeOfXLogShortPHD);
+ }
+#endif
goto err;
+ }
Assert(readLen >= reqLen);
@@ -1078,9 +1194,27 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
/* still not enough */
if (readLen < XLogPageHeaderSize(hdr))
{
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: still not enough");
+ }
+#endif
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: before calling routine.page_read #3");
+ }
+#endif
readLen = state->routine.page_read(state, pageptr, XLogPageHeaderSize(hdr),
state->currRecPtr,
state->readBuf);
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: routine.page_read #3 returned readLen %d", readLen);
+ }
+#endif
if (readLen == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readLen < 0)
@@ -1091,7 +1225,15 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
* Now that we know we have the full header, validate it.
*/
if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))
+ {
+#ifndef FRONTEND
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> ReadPageInternal: invalid page header");
+ }
+#endif
goto err;
+ }
/* update read state information */
state->seg.ws_segno = targetSegNo;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07..a2470fa 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -272,6 +272,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
TimeLineID *tli_p);
+static const char *WalSndGetStateString(WalSndState state);
/* Initialize walsender process before entering the main command loop */
void
@@ -1039,6 +1040,8 @@ StartReplication(StartReplicationCmd *cmd)
EndReplicationCommand("START_STREAMING");
}
+extern bool ps_readpageinternal2_failing; /* this is set in the PageReadInternal call in xlogreader.c */
+
/*
* XLogReaderRoutine->page_read callback for logical decoding contexts, as a
* walsender process.
@@ -1062,7 +1065,18 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
* timeline. This is needed to determine am_cascading_walsender accurately
* which is needed to determine the current timeline.
*/
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> logical_read_xlog_page: calling WalSndWaitForWal");
+ }
flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> logical_read_xlog_page: just called WalSndWaitForWal. flushptr=%X/%X, targetPagePtr=%X/%X, reqLen=%d",
+ LSN_FORMAT_ARGS(flushptr),
+ LSN_FORMAT_ARGS(targetPagePtr),
+ reqLen);
+ }
/*
* Since logical decoding is also permitted on a standby server, we need
@@ -1085,12 +1099,30 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
/* fail if not (implies we are going to shut down) */
if (flushptr < targetPagePtr + reqLen)
+ {
+ elog(LOG, ">>> logical_read_xlog_page: returning -1. flushptr=%X/%X, targetPagePtr=%X/%X, reqLen=%d",
+ LSN_FORMAT_ARGS(flushptr),
+ LSN_FORMAT_ARGS(targetPagePtr),
+ reqLen);
return -1;
+ }
if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
+ {
count = XLOG_BLCKSZ; /* more than one block available */
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> logical_read_xlog_page: set count = XLOG_BLCKSZ =%d (%X)", count, count);
+ }
+ }
else
+ {
count = flushptr - targetPagePtr; /* part of the page available */
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> logical_read_xlog_page: set count = part page = %d (%X)", count, count);
+ }
+ }
/* now actually read the data, we know it's there */
if (!WALRead(state,
@@ -1113,6 +1145,10 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
XLByteToSeg(targetPagePtr, segno, state->segcxt.ws_segsize);
CheckXLogRemoved(segno, state->seg.ws_tli);
+ if (ps_readpageinternal2_failing)
+ {
+ elog(LOG, ">>> logical_read_xlog_page: returning count=%d", count);
+ }
return count;
}
@@ -1820,6 +1856,7 @@ WalSndWaitForWal(XLogRecPtr loc)
int wakeEvents;
uint32 wait_event = 0;
static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
+ bool ps_logging = ps_readpageinternal2_failing;
/*
* Fast path to avoid acquiring the spinlock in case we already know we
@@ -1829,7 +1866,12 @@ WalSndWaitForWal(XLogRecPtr loc)
*/
if (!XLogRecPtrIsInvalid(RecentFlushPtr) &&
!NeedToWaitForWal(loc, RecentFlushPtr, &wait_event))
+ {
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: return #1 %X/%X", LSN_FORMAT_ARGS(RecentFlushPtr));
+
return RecentFlushPtr;
+ }
/*
* Within the loop, we wait for the necessary WALs to be flushed to disk
@@ -1863,7 +1905,11 @@ WalSndWaitForWal(XLogRecPtr loc)
* written, because walwriter has shut down already.
*/
if (got_STOPPING)
+ {
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: calling XLogBackgroundFlush");
XLogBackgroundFlush();
+ }
/*
* To avoid the scenario where standbys need to catch up to a newer
@@ -1874,9 +1920,17 @@ WalSndWaitForWal(XLogRecPtr loc)
if (wait_event != WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION)
{
if (!RecoveryInProgress())
+ {
RecentFlushPtr = GetFlushRecPtr(NULL);
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: 1. RecentFlushPtr assigned %X/%X", LSN_FORMAT_ARGS(RecentFlushPtr));
+ }
else
+ {
RecentFlushPtr = GetXLogReplayRecPtr(NULL);
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: 2. RecentFlushPtr assigned %X/%X", LSN_FORMAT_ARGS(RecentFlushPtr));
+ }
}
/*
@@ -1890,9 +1944,17 @@ WalSndWaitForWal(XLogRecPtr loc)
if (got_STOPPING)
{
if (NeedToWaitForStandbys(RecentFlushPtr, &wait_event))
+ {
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: got_STOPPING; wait_for_standby_at_stop=true");
wait_for_standby_at_stop = true;
+ }
else
+ {
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: got_STOPPING; break #1");
break;
+ }
}
/*
@@ -1914,7 +1976,11 @@ WalSndWaitForWal(XLogRecPtr loc)
*/
if (!wait_for_standby_at_stop &&
!NeedToWaitForWal(loc, RecentFlushPtr, &wait_event))
+ {
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: break #2");
break;
+ }
/*
* Waiting for new WAL or waiting for standbys to catch up. Since we
@@ -1935,7 +2001,11 @@ WalSndWaitForWal(XLogRecPtr loc)
*/
if (streamingDoneReceiving && streamingDoneSending &&
!pq_is_send_pending())
+ {
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: break #3");
break;
+ }
/* die if timeout was reached */
WalSndCheckTimeOut();
@@ -1964,6 +2034,8 @@ WalSndWaitForWal(XLogRecPtr loc)
/* reactivate latch so WalSndLoop knows to continue */
SetLatch(MyLatch);
+ if (ps_logging)
+ elog(LOG, ">>> WalSndWaitForWal: return at bottom of function %X/%X", LSN_FORMAT_ARGS(RecentFlushPtr));
return RecentFlushPtr;
}
@@ -2762,6 +2834,8 @@ WalSndCheckTimeOut(void)
}
}
+static bool ps_logging = false;
+
/* Main loop of walsender process that streams the WAL over Copy messages. */
static void
WalSndLoop(WalSndSendDataCallback send_data)
@@ -2779,6 +2853,20 @@ WalSndLoop(WalSndSendDataCallback send_data)
*/
for (;;)
{
+ ps_logging = got_STOPPING || got_SIGUSR2;
+ if (ps_logging)
+ {
+ static bool prev_got_STOPPING = false, prev_got_SIGUSR2 = false;
+
+ if (got_STOPPING != prev_got_STOPPING || got_SIGUSR2 != prev_got_SIGUSR2)
+ {
+ elog(LOG, ">>> WalSndLoop: flags have changed. got_STOPPING flag is %d, and got_SIGUSR2 flag is %d",
+ got_STOPPING, got_SIGUSR2);
+ prev_got_STOPPING = got_STOPPING;
+ prev_got_SIGUSR2 = got_SIGUSR2;
+ }
+ }
+
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -2822,6 +2910,8 @@ WalSndLoop(WalSndSendDataCallback send_data)
/* If nothing remains to be sent right now ... */
if (WalSndCaughtUp && !pq_is_send_pending())
{
+ elog(LOG, ">>> WalSndLoop: \tall caught up...");
+
/*
* If we're in catchup state, move to streaming. This is an
* important state change for users to know about, since before
@@ -2846,7 +2936,10 @@ WalSndLoop(WalSndSendDataCallback send_data)
* is not sure which.
*/
if (got_SIGUSR2)
+ {
+ elog(LOG, ">>> WalSndLoop: \tcalling WalSndDone");
WalSndDone(send_data);
+ }
}
/* Check for replication timeout. */
@@ -3386,6 +3479,11 @@ XLogSendLogical(void)
*/
static XLogRecPtr flushPtr = InvalidXLogRecPtr;
+ if (ps_logging)
+ {
+ elog(LOG, ">>> XLogSendLogical:");
+ }
+
/*
* Don't know whether we've caught up yet. We'll set WalSndCaughtUp to
* true in WalSndWaitForWal, if we're actually waiting. We also set to
@@ -3430,13 +3528,29 @@ XLogSendLogical(void)
if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
WalSndCaughtUp = true;
+ if (ps_logging)
+ {
+ elog(LOG, ">>> XLogSendLogical: EndRecPtr is %X/%X, flushPtr is %X/%X, WalSndCaughtUp=%d",
+ LSN_FORMAT_ARGS(logical_decoding_ctx->reader->EndRecPtr),
+ LSN_FORMAT_ARGS(flushPtr),
+ WalSndCaughtUp);
+ }
+
+ if (ps_logging && WalSndCaughtUp)
+ {
+ elog(LOG, ">>> XLogSendLogical: flags WalSndCaughtUp is %d, got_STOPPING is %d", WalSndCaughtUp, got_STOPPING);
+ }
+
/*
* If we're caught up and have been requested to stop, have WalSndLoop()
* terminate the connection in an orderly manner, after writing out all
* the pending data.
*/
if (WalSndCaughtUp && got_STOPPING)
+ {
got_SIGUSR2 = true;
+ elog(LOG, ">>> XLogSendLogical: just set the got_SIGUSR2 flag to true");
+ }
/* Update shared memory status */
{
@@ -3571,6 +3685,8 @@ HandleWalSndInitStopping(void)
kill(MyProcPid, SIGTERM);
else
got_STOPPING = true;
+
+ elog(LOG, ">>> HandleWalSndInitStopping: got_STOPPING=%d", got_STOPPING);
}
/*
@@ -3746,6 +3862,7 @@ WalSndInitStopping(void)
{
int i;
+ elog(LOG, ">>> WalSndInitStopping: tell active walsenders to stop");
for (i = 0; i < max_wal_senders; i++)
{
WalSnd *walsnd = &WalSndCtl->walsnds[i];
@@ -3755,9 +3872,12 @@ WalSndInitStopping(void)
pid = walsnd->pid;
SpinLockRelease(&walsnd->mutex);
+ elog(LOG, ">>> WalSndInitStopping: \tpid[%d] is %d", i, pid);
+
if (pid == 0)
continue;
+ elog(LOG, ">>> WalSndInitStopping: \tsending PROCSIG_WALSND_INIT_STOPPING signal to pid %d", pid);
SendProcSignal(pid, PROCSIG_WALSND_INIT_STOPPING, INVALID_PROC_NUMBER);
}
}
@@ -3775,6 +3895,7 @@ WalSndWaitStopping(void)
int i;
bool all_stopped = true;
+ elog(LOG, ">>> WalSndWaitStopping: waiting for all %d walsenders...", max_wal_senders);
for (i = 0; i < max_wal_senders; i++)
{
WalSnd *walsnd = &WalSndCtl->walsnds[i];
@@ -3784,6 +3905,7 @@ WalSndWaitStopping(void)
if (walsnd->pid == 0)
{
SpinLockRelease(&walsnd->mutex);
+ elog(LOG, ">>> WalSndWaitStopping: \twalsnd[%d] pid is 0", i);
continue;
}
@@ -3791,6 +3913,7 @@ WalSndWaitStopping(void)
{
all_stopped = false;
SpinLockRelease(&walsnd->mutex);
+ elog(LOG, ">>> WalSndWaitStopping: \twalsnd[%d] has state %d (%s)", i, walsnd->state, WalSndGetStateString(walsnd->state));
break;
}
SpinLockRelease(&walsnd->mutex);
@@ -3798,8 +3921,12 @@ WalSndWaitStopping(void)
/* safe to leave if confirmation is done for all WAL senders */
if (all_stopped)
+ {
+ elog(LOG, ">>> WalSndWaitStopping: \tALL STOPPED!!");
return;
+ }
+ elog(LOG, ">>> WalSndWaitStopping: \tsleep 10s before checking again");
pg_usleep(10000L); /* wait for 10 msec */
}
}
--
1.8.3.1
At Thu, 6 Jun 2024 12:49:45 +1000, Peter Smith <smithpb2250@gmail.com> wrote in
Thanks for the repro; I believe I understand what's happening here.
During server shutdown, the latter half of the last continuation
record may fail to be flushed. This is similar to what is described in
the commit message of commit ff9f111bce. While shutting down,
WalSndLoop() waits for XLogSendLogical() to consume WAL up to
flushPtr, but in this case, the last record cannot complete without
the continuation part starting from flushPtr, which is
missing. However, in such cases, xlogreader.missingContrecPtr is set
to the beginning of the missing part, but something similar to
So, I believe the attached small patch fixes the behavior. I haven't
come up with a good test script for this issue. Something like
026_overwrite_contrecord.pl might work, but this situation seems a bit
more complex than what it handles.
Versions back to 10 should suffer from the same issue and the same
patch will be applicable without significant changes.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Fix-infinite-loop-in-walsender-during-publisher-shut.patch
From 99cad7bd53a94b4b90937fb1eb2f37f2ebcadf6a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 6 Jun 2024 14:56:53 +0900
Subject: [PATCH] Fix infinite loop in walsender during publisher shutdown
When a publisher server is shutting down, there can be a case where
the last WAL record at that point is a continuation record with its
latter part not yet flushed. In such cases, the walsender attempts to
read this unflushed part and ends up in an infinite loop. To prevent
this situation, modify the logical WAL sender to consider itself
caught up in this case. The records that are not fully flushed at this
point are generally not significant, so simply ignoring them should
not cause any issues.
---
src/backend/replication/walsender.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07cf0..635396c138 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3426,8 +3426,15 @@ XLogSendLogical(void)
flushPtr = GetFlushRecPtr(NULL);
}
- /* If EndRecPtr is still past our flushPtr, it means we caught up. */
- if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+ /*
+ * If EndRecPtr is still past our flushPtr, it means we caught up. When
+ * the server is shutting down, the latter part of a continuation record
+ * may be missing. If got_STOPPING is true, assume we are caught up if the
+ * last record is missing its continuation part at flushPtr.
+ */
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+ (got_STOPPING &&
+ logical_decoding_ctx->reader->missingContrecPtr == flushPtr))
WalSndCaughtUp = true;
/*
--
2.43.0
On Thu, 6 Jun 2024 at 11:49, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
I tested the changes for PG 12 to master as we do not support prior versions.
The patch applied successfully for master and PG 16. I ran the test
provided in [1] multiple times and it ran successfully each time.
The patch did not apply on PG 15. I did a similar change for PG 15 and
created a patch. I ran the test multiple times and it was successful
every time.
The patch did not apply on PG 14 to PG 12. I did a similar change in
each branch, but the tests did not pass on those branches.
I have attached a patch which applies successfully on the PG 15 branch.
[1]: /messages/by-id/f15d665f-4cd1-4894-037c-afdbe369287e@gmail.com
Thanks and Regards,
Shlok Kyal
Attachments:
v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s.patch
From eea58a421a9a67ad725d708ecc32e694474f3ad2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 6 Jun 2024 14:56:53 +0900
Subject: [PATCH v2] Fix infinite loop in walsender during publisher shutdown
When a publisher server is shutting down, there can be a case where
the last WAL record at that point is a continuation record with its
latter part not yet flushed. In such cases, the walsender attempts to
read this unflushed part and ends up in an infinite loop. To prevent
this situation, modify the logical WAL sender to consider itself
caught up in this case. The records that are not fully flushed at this
point are generally not significant, so simply ignoring them should
not cause any issues.
---
src/backend/replication/walsender.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07cf0..635396c138 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3426,8 +3426,15 @@ XLogSendLogical(void)
flushPtr = GetFlushRecPtr(NULL);
}
- /* If EndRecPtr is still past our flushPtr, it means we caught up. */
- if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+ /*
+ * If EndRecPtr is still past our flushPtr, it means we caught up. When
+ * the server is shutting down, the latter part of a continuation record
+ * may be missing. If got_STOPPING is true, assume we are caught up if the
+ * last record is missing its continuation part at flushPtr.
+ */
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+ (got_STOPPING &&
+ logical_decoding_ctx->reader->missingContrecPtr == flushPtr))
WalSndCaughtUp = true;
/*
--
2.34.1
v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s-PG-15.patch
From 254d0eef0edcbc96c5cf23c6856bd0467dde4604 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Mon, 10 Jun 2024 11:29:00 +0530
Subject: [PATCH v2] Fix infinite loop in walsender during publisher shutdown
When a publisher server is shutting down, there can be a case where
the last WAL record at that point is a continuation record with its
latter part not yet flushed. In such cases, the walsender attempts to
read this unflushed part and ends up in an infinite loop. To prevent
this situation, modify the logical WAL sender to consider itself
caught up in this case. The records that are not fully flushed at this
point are generally not significant, so simply ignoring them should
not cause any issues.
---
src/backend/replication/walsender.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 269914bce2..6ae5b99ecd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3083,8 +3083,15 @@ XLogSendLogical(void)
else if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
flushPtr = GetFlushRecPtr(NULL);
- /* If EndRecPtr is still past our flushPtr, it means we caught up. */
- if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+ /*
+ * If EndRecPtr is still past our flushPtr, it means we caught up. When
+ * the server is shutting down, the latter part of a continuation record
+ * may be missing. If got_STOPPING is true, assume we are caught up if the
+ * last record is missing its continuation part at flushPtr.
+ */
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+ (got_STOPPING &&
+ logical_decoding_ctx->reader->missingContrecPtr == flushPtr))
WalSndCaughtUp = true;
/*
--
2.34.1
On Mon, 10 Jun 2024 at 15:10, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
By mistake, I had applied the wrong changes for PG 14 to PG 12. I
tested again for all versions, and the test ran successfully for all of
them down to PG 12.
I have also attached the patch which applies to PG 14 through PG 12.
Sorry for the inconvenience.
Thanks and Regards,
Shlok Kyal
Attachments:
v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s.patch
v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s-PG-15.patch
v2-0001-Fix-infinite-loop-in-walsender-during-publisher-s-PG-14.patch
From e0fd7f52a764a0f13d9ba7697e74220573aef7f7 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Mon, 10 Jun 2024 18:03:27 +0530
Subject: [PATCH v2] Fix infinite loop in walsender during publisher shutdown
When a publisher server is shutting down, there can be a case where
the last WAL record at that point is a continuation record with its
latter part not yet flushed. In such cases, the walsender attempts to
read this unflushed part and ends up in an infinite loop. To prevent
this situation, modify the logical WAL sender to consider itself
caught up in this case. The records that are not fully flushed at this
point are generally not significant, so simply ignoring them should
not cause any issues.
---
src/backend/replication/walsender.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 28f0a29473..b527a17c9f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2924,8 +2924,15 @@ XLogSendLogical(void)
else if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
flushPtr = GetFlushRecPtr();
- /* If EndRecPtr is still past our flushPtr, it means we caught up. */
- if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+ /*
+ * If EndRecPtr is still past our flushPtr, it means we caught up. When
+ * the server is shutting down, the latter part of a continuation record
+ * may be missing. If got_STOPPING is true, assume we are caught up if the
+ * last record is missing its continuation part at flushPtr.
+ */
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+ (got_STOPPING &&
+ logical_decoding_ctx->reader->missingContrecPtr == flushPtr))
WalSndCaughtUp = true;
/*
--
2.34.1
On Thu, Jun 06, 2024 at 03:19:20PM +0900, Kyotaro Horiguchi wrote:
During server shutdown, the latter half of the last continuation
record may fail to be flushed. This is similar to what is described in
the commit message of commit ff9f111bce. While shutting down,
WalSndLoop() waits for XLogSendLogical() to consume WAL up to
flushPtr, but in this case, the last record cannot complete without
the continuation part starting from flushPtr, which is
missing. However, in such cases, xlogreader.missingContrecPtr is set
to the beginning of the missing part, but something similar to
- /* If EndRecPtr is still past our flushPtr, it means we caught up. */
- if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
+ /*
+ * If EndRecPtr is still past our flushPtr, it means we caught up. When
+ * the server is shutting down, the latter part of a continuation record
+ * may be missing. If got_STOPPING is true, assume we are caught up if the
+ * last record is missing its continuation part at flushPtr.
+ */
+ if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr ||
+ (got_STOPPING &&
+ logical_decoding_ctx->reader->missingContrecPtr == flushPtr))
FWIW, I don't have a better idea than what you are proposing here. We
just cannot receive more data past the page boundary in a shutdown
sequence in this context, so checking after the missingContrecPtr
is a good compromise to ensure that we don't remain stuck when
shutting down a logical WAL sender. I'm surprised that we did not
hear about that more often on the lists, or perhaps we did but just
discarded it?
This is going to take some time to check across all the branches down
to v12 that this is stable, because this area of the code tends to
change slightly every year.. Well, that's my job.
So, I believe the attached small patch fixes the behavior. I haven't
come up with a good test script for this issue. Something like
026_overwrite_contrecord.pl might work, but this situation seems a bit
more complex than what it handles.
Hmm. Indeed you will not be able to reuse the same trick with the end
of a segment. Still you should be able to get a rather stable test by
using the same tricks as 039_end_of_wal.pl to spawn a record across
multiple pages, no?
--
Michael
On Thu, Jun 6, 2024 at 11:49 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
During server shutdown, the latter half of the last continuation
record may fail to be flushed. This is similar to what is described in
the commit message of commit ff9f111bce. While shutting down,
WalSndLoop() waits for XLogSendLogical() to consume WAL up to
flushPtr, but in this case, the last record cannot complete without
the continuation part starting from flushPtr, which is
missing.
Sorry, it is not clear to me why we failed to flush the last
continuation record in logical walsender? I see that we try to flush
the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
that not sufficient?
--
With Regards,
Amit Kapila.
At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
Sorry, it is not clear to me why we failed to flush the last
continuation record in logical walsender? I see that we try to flush
the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
that not sufficient?
It seems that it uses XLogBackgroundFlush(), which does not guarantee
flushing WAL until the end.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 11 Jun 2024 09:27:20 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Thu, Jun 06, 2024 at 03:19:20PM +0900, Kyotaro Horiguchi wrote:
So, I believe the attached small patch fixes the behavior. I haven't
come up with a good test script for this issue. Something like
026_overwrite_contrecord.pl might work, but this situation seems a bit
more complex than what it handles.
Hmm. Indeed you will not be able to reuse the same trick with the end
of a segment. Still you should be able to get a rather stable test by
using the same tricks as 039_end_of_wal.pl to spawn a record across
multiple pages, no?
With the trick, we could write a page-spanning record, but I'm not
sure we can control the behavior of XLogBackgroundFlush().
As Amit suggested, we have the option to create a variant of the
function that guarantees flushing WAL until the end.
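For illustration only, such a variant might be as simple as the following
sketch; the function name and its placement are assumptions for this
discussion, not an existing API:
/*
 * Hypothetical sketch of a "flush to the very end" variant.  Unlike
 * XLogBackgroundFlush(), which may stop short of WAL that does not yet
 * fill a complete page, this forces out everything inserted so far.
 */
static void
XLogFlushToEnd(void)
{
	/* GetXLogInsertRecPtr() gives the current end of inserted WAL */
	XLogFlush(GetXLogInsertRecPtr());
}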
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jun 11, 2024 at 12:34 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
Sorry, it is not clear to me why we failed to flush the last
continuation record in logical walsender? I see that we try to flush
the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
that not sufficient?
It seems that it uses XLogBackgroundFlush(), which does not guarantee
flushing WAL until the end.
What would it take to ensure the same? I am trying to explore this
path because currently logical WALSender sends any outstanding logs up
to the shutdown checkpoint record (i.e., the latest record) and waits
for them to be replicated to the standby before exit. Please take a
look at the comments where we call WalSndDone(). The fix you are
proposing will break that guarantee.
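Roughly, that guarantee amounts to an exit condition of the following shape. This is a paraphrased toy model with simplified names (caughtUp, sentPtr, replicatedPtr, sendPending), not the real WalSndDone() code:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Toy model: the walsender may only exit once everything it has sent has
 * been confirmed by the standby and nothing is left in its output buffer. */
static bool
can_exit(bool caughtUp, XLogRecPtr sentPtr, XLogRecPtr replicatedPtr,
         bool sendPending)
{
    return caughtUp && sentPtr == replicatedPtr && !sendPending;
}

int
main(void)
{
    /* The standby has not yet confirmed the last bytes: keep waiting. */
    printf("exit now? %s\n",
           can_exit(true, 41160, 41000, false) ? "yes" : "no");
    return 0;
}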
--
With Regards,
Amit Kapila.
At Tue, 11 Jun 2024 14:27:28 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
On Tue, Jun 11, 2024 at 12:34 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
Sorry, it is not clear to me why we failed to flush the last
continuation record in logical walsender? I see that we try to flush
the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
that not sufficient?
It seems that it uses XLogBackgroundFlush(), which does not guarantee
flushing WAL until the end.
What would it take to ensure the same? I am trying to explore this
path because currently logical WALSender sends any outstanding logs up
to the shutdown checkpoint record (i.e., the latest record) and waits
for them to be replicated to the standby before exit. Please take a
look at the comments where we call WalSndDone(). The fix you are
proposing will break that guarantee.
Shutdown checkpoint is performed after the walsender completed
termination since 086221cf6b, aiming to prevent walsenders from
generating competing WAL (by, for example, CREATE_REPLICATION_SLOT)
records with the shutdown checkpoint. Thus, it seems that the
walsender cannot see the shutdown record, and a certain amount of
bytes before it, as the walsender appears to have relied on the
checkpoint flushing its record, rather than on XLogBackgroundFlush().
If we approve of the walsender being terminated before the shutdown
checkpoint, we need to "fix" the comment, then provide a function to
ensure the synchronization of WAL records.
I'll consider this direction for a while.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Jun 12, 2024 at 6:43 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 11 Jun 2024 14:27:28 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
On Tue, Jun 11, 2024 at 12:34 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Tue, 11 Jun 2024 11:32:12 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
Sorry, it is not clear to me why we failed to flush the last
continuation record in logical walsender? I see that we try to flush
the WAL after receiving got_STOPPING in WalSndWaitForWal(), why is
that not sufficient?
It seems that it uses XLogBackgroundFlush(), which does not guarantee
flushing WAL until the end.
What would it take to ensure the same? I am trying to explore this
path because currently logical WALSender sends any outstanding logs up
to the shutdown checkpoint record (i.e., the latest record) and waits
for them to be replicated to the standby before exit. Please take a
look at the comments where we call WalSndDone(). The fix you are
proposing will break that guarantee.
Shutdown checkpoint is performed after the walsender completed
termination since 086221cf6b,
Yeah, but the commit you quoted later reverted by commit 703f148e98
and committed again as c6c3334364.
aiming to prevent walsenders from
generating competing WAL (by, for example, CREATE_REPLICATION_SLOT)
records with the shutdown checkpoint. Thus, it seems that the
walsender cannot see the shutdown record,
This is true of the logical walsender. The physical walsender does send
the shutdown checkpoint record before getting terminated.
and a certain amount of
bytes before it, as the walsender appears to have relied on the
checkpoint flushing its record, rather than on XLogBackgroundFlush().
If we approve of the walsender being terminated before the shutdown
checkpoint, we need to "fix" the comment, then provide a function to
ensure the synchronization of WAL records.
Which comment do you want to fix?
I'll consider this direction for a while.
Okay, thanks.
--
With Regards,
Amit Kapila.
At Thu, 13 Jun 2024 09:29:03 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
Yeah, but the commit you quoted later reverted by commit 703f148e98
and committed again as c6c3334364.
Yeah, right...
aiming to prevent walsenders from
generating competing WAL (by, for example, CREATE_REPLICATION_SLOT)
records with the shutdown checkpoint. Thus, it seems that the
walsender cannot see the shutdown record,
This is true of the logical walsender. The physical walsender does send
the shutdown checkpoint record before getting terminated.
Yes, I know. They differ in their blocking mechanisms.
and a certain amount of
bytes before it, as the walsender appears to have relied on the
checkpoint flushing its record, rather than on XLogBackgroundFlush().
If we approve of the walsender being terminated before the shutdown
checkpoint, we need to "fix" the comment, then provide a function to
ensure the synchronization of WAL records.
Which comment do you want to fix?
Yeah. The part you seem to think I was trying to fix is actually
fine. Instead, I have revised the comment on the modified section to
make its intention clearer.
I'll consider this direction for a while.
Okay, thanks.
The attached patch is it. It's for master only.
I decided not to create a new function because the simple code has
only one caller. I haven't seen the test script fail with this fix.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
0001-Ensure-WAL-is-written-out-when-terminating-a-logical.patch (text/x-patch; charset=us-ascii)
From 663bdeaf8d4d2f5a192dd3ecb6d2817d9b1311f1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 18 Jun 2024 16:36:43 +0900
Subject: [PATCH] Ensure WAL is written out when terminating a logical
walsender
In commit c6c3334364, XLogBackgroundFlush() was assumed to flush all
written WAL to the end, but this function may not flush the last
incomplete page in a single call. In such cases, if the last written
page ends with a continuation record, the latter part will not be
flushed, causing shutdown to wait indefinitely for that part. This
commit ensures that the written records are fully flushed to the end,
guaranteeing expected behavior.
---
src/backend/replication/walsender.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07cf0..5aa0f58238 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1858,12 +1858,13 @@ WalSndWaitForWal(XLogRecPtr loc)
ProcessRepliesIfAny();
/*
- * If we're shutting down, trigger pending WAL to be written out,
- * otherwise we'd possibly end up waiting for WAL that never gets
- * written, because walwriter has shut down already.
+ * If we're shutting down, all WAL-writing processes are gone, leaving
+ * only checkpointer to perform the shutdown checkpoint. Ensure that
+ * any pending WAL is written out here; otherwise, we'd possibly end up
+ * waiting for WAL that never gets written.
*/
- if (got_STOPPING)
- XLogBackgroundFlush();
+ if (got_STOPPING && !RecoveryInProgress())
+ XLogFlush(GetInsertRecPtr());
/*
* To avoid the scenario where standbys need to catch up to a newer
--
2.43.0
Dear Horiguchi-san,
Thanks for sharing the patch! I agree this approach (ensure WAL records are flushed)
is more proper than others.
I have an unclear point. According to the comment atop GetInsertRecPtr(), it just
returns the approximated value - the position of the last full WAL page [1].
If there is a continuation WAL record which crosses a page, will it return the
halfway point of the WAL record (end of the first WAL page)? If so, the proposed
fix seems not sufficient. We have to point out the exact end of the record.
[1]:
/*
 * GetInsertRecPtr -- Returns the current insert position.
 *
 * NOTE: The value *actually* returned is the position of the last full
 * xlog page. It lags behind the real insert position by at most 1 page.
 * For that, we don't need to scan through WAL insertion locks, and an
 * approximation is enough for the current usage of this function.
 */
XLogRecPtr
GetInsertRecPtr(void)
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
FYI - I applied this latest patch and re-ran the original failing test
script 10 times (i.e. 10 x 100 test iterations; it took 4+ hours).
There were zero failures observed in my environment.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Wed, Jun 19, 2024 at 05:14:50AM +0000, Hayato Kuroda (Fujitsu) wrote:
I have an unclear point. According to the comment atop GetInsertRecPtr(), it just
returns the approximated value - the position of the last full WAL page [1].
If there is a continuation WAL record which crosses a page, will it return the
halfway point of the WAL record (end of the first WAL page)? If so, the proposed
fix seems not sufficient. We have to point out the exact end of the record.
Yeah, that's a thing of the patch I am confused with. How are we sure
that this is the correct LSN to rely on? If that is the case, the
patch does not offer an explanation of why it is better.
WalSndWaitForWal() is called only in the context of the page callback for a
logical WAL sender. Shouldn't we make the flush conditional on what's
stored in XLogReaderState.missingContrecPtr? Aka, if we know that
we're in the middle of decoding a continuation record, we
should wait until we've dealt with it, no?
In short, I would imagine that WalSndWaitForWal() should know more
about what XLogReaderState is doing. But perhaps I'm missing something.
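Just to sketch the idea, not a patch: a mock of what such a conditional flush could look like. The stub types and names here (ReaderStateStub, flush_through, maybe_flush_on_shutdown) stand in for the real reader state and flush machinery; nothing below is actual backend code.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Stub: only the one field the idea relies on. */
typedef struct ReaderStateStub
{
    XLogRecPtr missingContrecPtr;   /* set while a continuation part is missing */
} ReaderStateStub;

/* Stub for a flush request, standing in for a real flush call. */
static void
flush_through(XLogRecPtr upto)
{
    printf("would flush WAL through %llu\n", (unsigned long long) upto);
}

/* The idea: during shutdown, flush only when the reader is known to be
 * stuck on a record whose continuation has not been written out yet. */
static void
maybe_flush_on_shutdown(bool got_STOPPING, const ReaderStateStub *reader,
                        XLogRecPtr currentInsertEnd)
{
    if (got_STOPPING && reader->missingContrecPtr != InvalidXLogRecPtr)
        flush_through(currentInsertEnd);
}

int
main(void)
{
    ReaderStateStub reader = { .missingContrecPtr = 40960 };

    maybe_flush_on_shutdown(true, &reader, 41160);
    return 0;
}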
--
Michael
On Wed, Jun 19, 2024 at 10:44 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Horiguchi-san,
Thanks for sharing the patch! I agree this approach (ensure WAL records are flushed)
is more proper than others.
I have an unclear point. According to the comment atop GetInsertRecPtr(), it just
returns the approximated value - the position of the last full WAL page [1].
If there is a continuation WAL record which crosses a page, will it return the
halfway point of the WAL record (end of the first WAL page)? If so, the proposed
fix seems not sufficient. We have to point out the exact end of the record.
You have a point, but if this theory is correct, why are we not able to
reproduce the issue after this patch? Also, how do we get the WAL
location up to which we need to flush? Is XLogCtlData->logInsertResult
the one we are looking for?
--
With Regards,
Amit Kapila.
At Fri, 21 Jun 2024 11:48:22 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
On Wed, Jun 19, 2024 at 10:44 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Horiguchi-san,
Thanks for sharing the patch! I agree this approach (ensure WAL records are flushed)
is more proper than others.
I have an unclear point. According to the comment atop GetInsertRecPtr(), it just
returns the approximated value - the position of the last full WAL page [1].
If there is a continuation WAL record which crosses a page, will it return the
halfway point of the WAL record (end of the first WAL page)? If so, the proposed
fix seems not sufficient. We have to point out the exact end of the record.
You have a point, but if this theory is correct, why are we not able to
reproduce the issue after this patch? Also, how do we get the WAL
location up to which we need to flush? Is XLogCtlData->logInsertResult
the one we are looking for?
It is not exposed, but of course logInsertResult is a more
straightforward source for the LSN.
The reason the patch works well is the following bit of code.
xlog.c:958, in XLogInsertRecord()
    /*
     * Update shared LogwrtRqst.Write, if we crossed page boundary.
     */
    if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        /* advance global request to include new block(s) */
        if (XLogCtl->LogwrtRqst.Write < EndPos)
            XLogCtl->LogwrtRqst.Write = EndPos;
        SpinLockRelease(&XLogCtl->info_lck);
        RefreshXLogWriteResult(LogwrtResult);
    }
The code, which has existed for a long time, ensures that
GetInsertRecPtr() returns the accurate end of a record when it spans
page boundaries. This would need to be written in the new comment
if we use GetInsertRecPtr().
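In toy numbers, the consequence of that branch: once a record crosses a page boundary, the shared write request is advanced to the record's real end, so a flush up to GetInsertRecPtr() covers the whole record instead of stopping at the page boundary. The values and variable names below are made up, not backend code.

#include <stdio.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192

int
main(void)
{
    uint64_t startPos = 5 * XLOG_BLCKSZ - 40;   /* record starts near the end of page 5 */
    uint64_t endPos   = 5 * XLOG_BLCKSZ + 200;  /* and ends 200 bytes into page 6       */
    uint64_t logwrtRqstWrite = 4 * XLOG_BLCKSZ; /* whatever the request was beforehand  */

    /* The quoted branch fires because StartPos and EndPos fall on different pages. */
    if (startPos / XLOG_BLCKSZ != endPos / XLOG_BLCKSZ && logwrtRqstWrite < endPos)
        logwrtRqstWrite = endPos;

    printf("write request advanced to %llu, which is the record end %llu\n",
           (unsigned long long) logwrtRqstWrite, (unsigned long long) endPos);
    return 0;
}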
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center