pg_logical_slot_get_changes waits continuously for a partial WAL record spanning two pages
Hi,
The buildfarm identified an issue at [1], where it could not detect
that a partial WAL record spanning two pages had been written due to
an immediate shutdown.
Consider a scenario where a WAL record is split across multiple WAL
pages. If the server crashes before the entire WAL record is written,
the recovery process detects the incomplete (broken) record. In such
cases, a special flag, XLP_FIRST_IS_OVERWRITE_CONTRECORD, is set in
the page header to indicate this condition. When we attempt to
retrieve changes using a logical replication slot, the system reads
WAL data based on the record header. In one specific instance, a total
of 8133 bytes were required to reconstruct the WAL record, of which
277 bytes were expected to be present in the subsequent WAL page.
However, only 248 bytes were available, because no new WAL records had
been generated, as no operations were performed on the system after
the crash. As a result, pg_logical_slot_get_changes continuously
checks whether the current WAL flush position has reached 0x29004115
(the offset corresponding to the required 277 bytes in the next WAL
page). Since no new WAL is being generated, the flush position always
returns 0x290040F8 (the 248-byte offset), causing the function to wait
indefinitely in the read_local_xlog_page_guts function.
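To make the stuck wait concrete, here is a minimal illustrative sketch
with invented names and simplified logic (not the actual
read_local_xlog_page_guts() code) of the condition that can never
become true:
```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/*
 * Numbers from the failure described above: the record needs WAL up to
 * 0x29004115 (277 bytes into the next page), but only 248 bytes of that
 * page were ever written, so the flush pointer stays at 0x290040F8.
 */
#define TARGET_PTR   ((XLogRecPtr) 0x29004115)
#define FLUSHED_PTR  ((XLogRecPtr) 0x290040F8)

/* Simplified form of the check that the page-read callback keeps retrying. */
static bool
wal_available(XLogRecPtr flushed, XLogRecPtr target)
{
	return flushed >= target;
}

int
main(void)
{
	/* 0x29004115 - 0x290040F8 = 29 bytes that were never written. */
	printf("missing bytes: %llu\n",
		   (unsigned long long) (TARGET_PTR - FLUSHED_PTR));

	/*
	 * With no new WAL generated after the crash, this stays false
	 * forever, so the real code loops indefinitely waiting for the
	 * flush position to advance.
	 */
	printf("wal available: %s\n",
		   wal_available(FLUSHED_PTR, TARGET_PTR) ? "yes" : "no");
	return 0;
}
```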
Currently, the logic attempts to read the complete WAL record based on
the size obtained before the crash, even though only a partial record
was written. It checks the page header to determine whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the
complete WAL record, in the XLogDecodeNextRecord function; but since
that much WAL data was never written, we never get a chance to check
the header. To address this issue, a more robust approach would be to
first read the page header, check whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed
to read the rest of the WAL record if it is not marked as a partial
overwrite. This would prevent the system from waiting for WAL data
that will never arrive. The attached partial_wal_record_fix.patch
implements this.
I don't have a consistent way to reproduce this issue; currently it
can be reproduced by running the 046_checkpoint_logical_slot test
about 50 times. This test was reverted recently after it caused a few
buildfarm failures, discussed at [2]. The same test is also attached
as reverted_test_046_checkpoint_logical_slot.patch.
Added members who were involved in the discussion of this issue.
Thoughts?
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-06-19+23%3A47%3A08
[2]: /messages/by-id/CAPpHfdsO9s5he3xHWNFtwvXtvsftu3nNz=PV9fdN32UOh-Z3tA@mail.gmail.com
Regards,
Vignesh
Attachments:
partial_wal_record_fix.patch
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 2790ade1f91..54d6c67a713 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -723,11 +723,14 @@ restart:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
- /* Wait for the next page to become available */
- readOff = ReadPageInternal(state, targetPagePtr,
- Min(total_len - gotlen + SizeOfXLogShortPHD,
- XLOG_BLCKSZ));
-
+ /*
+ * Read the page header before processing the WAL record data. This
+ * is necessary to correctly handle cases where the previous record
+ * ended as a partial record. Attempting to read the full
+ * record without checking the header may result in waiting
+ * indefinitely for data that doesn't exist.
+ */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
@@ -761,6 +764,16 @@ restart:
goto err;
}
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(total_len - gotlen + SizeOfXLogShortPHD,
+ XLOG_BLCKSZ));
+
+ if (readOff == XLREAD_WOULDBLOCK)
+ return XLREAD_WOULDBLOCK;
+ else if (readOff < 0)
+ goto err;
+
/*
* Cross-check that xlp_rem_len agrees with how much of the record
* we expect there to be left.
reverted_test_046_checkpoint_logical_slot.patch
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
new file mode 100644
index 00000000000..d67c5108d78
--- /dev/null
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -0,0 +1,136 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# This test verifies the case when the logical slot is advanced during
+# checkpoint. The test checks that the logical slot's restart_lsn still refers
+# to an existing WAL segment after an immediate restart.
+#
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf', "wal_level = 'logical'");
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Create the two slots we'll need.
+$node->safe_psql('postgres',
+ q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')}
+);
+$node->safe_psql('postgres',
+ q{select pg_create_physical_replication_slot('slot_physical', true)});
+
+# Advance both slots to the current position just to have everything "valid".
+$node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)}
+);
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Run checkpoint to flush current state to disk and set a baseline.
+$node->safe_psql('postgres', q{checkpoint});
+
+# Generate some transactions to get RUNNING_XACTS.
+my $xacts = $node->background_psql('postgres');
+$xacts->query_until(
+ qr/run_xacts/,
+ q(\echo run_xacts
+SELECT 1 \watch 0.1
+\q
+));
+
+$node->advance_wal(20);
+
+# Run another checkpoint to set a new restore LSN.
+$node->safe_psql('postgres', q{checkpoint});
+
+$node->advance_wal(20);
+
+# Run another checkpoint, this time in the background, and make it wait
+# on the injection point so that the checkpoint stops right before
+# removing old WAL segments.
+note('starting checkpoint');
+
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_safe(
+ q(select injection_points_attach('checkpoint-before-old-wal-removal','wait'))
+);
+$checkpoint->query_until(
+ qr/starting_checkpoint/,
+ q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint stops right before removing WAL segments.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal');
+note('injection_point is reached');
+
+# Try to advance the logical slot, but make it stop when it moves to the next
+# WAL segment (this has to happen in the background, too).
+my $logical = $node->background_psql('postgres');
+$logical->query_safe(
+ q{select injection_points_attach('logical-replication-slot-advance-segment','wait');}
+);
+$logical->query_until(
+ qr/get_changes/,
+ q(
+\echo get_changes
+select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1
+\q
+));
+
+# Wait until the slot's restart_lsn points to the next WAL segment.
+note('waiting for injection_point');
+$node->wait_for_event('client backend',
+ 'logical-replication-slot-advance-segment');
+note('injection_point is reached');
+
+# OK, we're in the right situation: time to advance the physical slot, which
+# recalculates the required LSN, and then unblock the checkpoint, which
+# removes the WAL still needed by the logical slot.
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Continue the checkpoint.
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+
+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');
+
+$node->start;
+
+eval {
+ $node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);}
+ );
+};
+is($@, '', "Logical slot still valid");
+
+done_testing();
On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
Currently, the logic attempts to read the complete WAL record based on
the size obtained before the crash, even though only a partial record
was written. It checks the page header to determine whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the
complete WAL record, in the XLogDecodeNextRecord function; but since
that much WAL data was never written, we never get a chance to check
the header. To address this issue, a more robust approach would be to
first read the page header, check whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed
to read the rest of the WAL record if it is not marked as a partial
overwrite. This would prevent the system from waiting for WAL data
that will never arrive. The attached partial_wal_record_fix.patch
implements this.
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?
I don't have a consistent way to reproduce this issue; currently it
can be reproduced by running the 046_checkpoint_logical_slot test
about 50 times. This test was reverted recently after it caused a few
buildfarm failures, discussed at [2]. The same test is also attached
as reverted_test_046_checkpoint_logical_slot.patch.
Seeing the noise that this originally created in the CFbot and the
buildfarm, even if the issue would only be triggered after a timeout,
I'd vote for relying on this test being good enough for the purpose of
this race condition.

Another reliable approach would be to make the code wait before
reading the record in the internal loop of ReadPageInternal() with an
injection point when we know that we have a contrecord, but I'm not
really excited about this prospect in xlogreader.c, which can also be
used in the frontend.
--
Michael
Dear Michael, Vignesh,
On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
Currently, the logic attempts to read the complete WAL record based on
the size obtained before the crash, even though only a partial record
was written. It checks the page header to determine whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the
complete WAL record, in the XLogDecodeNextRecord function; but since
that much WAL data was never written, we never get a chance to check
the header. To address this issue, a more robust approach would be to
first read the page header, check whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed
to read the rest of the WAL record if it is not marked as a partial
overwrite. This would prevent the system from waiting for WAL data
that will never arrive. The attached partial_wal_record_fix.patch
implements this.

So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?
I have a concern from the performance perspective. This approach must
read the page twice in all cases, right? The workaround is needed only
for the corner case, but it affects all code paths. Or is it actually
negligible?
Another reliable approach would be to make the code wait before
reading the record in the internal loop of ReadPageInternal() with an
injection point when we know that we have a contrecord, but I'm not
really excited about this prospect in xlogreader.c, which can also be
used in the frontend.
Per my understanding, an injection point must be added while flushing
a WAL record to emulate the incomplete-WAL-record issue. To confirm,
how could it be used in ReadPageInternal()?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jun 26, 2025 at 6:22 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
Currently, the logic attempts to read the complete WAL record based on
the size obtained before the crash, even though only a partial record
was written. It checks the page header to determine whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the
complete WAL record, in the XLogDecodeNextRecord function; but since
that much WAL data was never written, we never get a chance to check
the header. To address this issue, a more robust approach would be to
first read the page header, check whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed
to read the rest of the WAL record if it is not marked as a partial
overwrite. This would prevent the system from waiting for WAL data
that will never arrive. The attached partial_wal_record_fix.patch
implements this.
Yeah, this is a problem. At the moment I can't think of anything
better than just reading the header first and checking the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag.
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?
I did not get the point; IMHO it has to be validated after the record
on the next page has been read.
--
Regards,
Dilip Kumar
Google
On Thu, Jun 26, 2025 at 08:48:32AM +0530, Dilip Kumar wrote:
On Thu, Jun 26, 2025 at 6:22 AM Michael Paquier <michael@paquier.xyz> wrote:
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?

I did not get the point; IMHO it has to be validated after the record
on the next page has been read.
As proposed in the patch, the first ReadPageInternal() reads
SizeOfXLogShortPHD worth of data (right?), so we should have access to
XLogPageHeaderData.xlp_rem_len before attempting the second
ReadPageInternal() of Min(total_len - gotlen + SizeOfXLogShortPHD,
XLOG_BLCKSZ) worth of data. The check that fails with an "invalid
contrecord length %u" relies on xlp_rem_len, total_len, RecPtr and
gotlen. We know all this information before the second
ReadPageInternal().
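For reference, the cross-check in question looks roughly like this in
XLogDecodeNextRecord() (paraphrased from xlogreader.c; wording may
differ slightly across branches), and everything it consults is
available once the short page header has been read:
```
/*
 * Cross-check that xlp_rem_len agrees with how much of the record we
 * expect there to be left -- xlp_rem_len, total_len, gotlen and RecPtr
 * are all known as soon as the short page header has been read.
 */
if (pageHeader->xlp_rem_len == 0 ||
	total_len != (pageHeader->xlp_rem_len + gotlen))
{
	report_invalid_record(state,
						  "invalid contrecord length %u (expected %lld) at %X/%X",
						  pageHeader->xlp_rem_len,
						  ((long long) total_len) - gotlen,
						  LSN_FORMAT_ARGS(RecPtr));
	goto err;
}
```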
--
Michael
On Thu, Jun 26, 2025 at 11:47 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jun 26, 2025 at 08:48:32AM +0530, Dilip Kumar wrote:
On Thu, Jun 26, 2025 at 6:22 AM Michael Paquier <michael@paquier.xyz> wrote:
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?

I did not get the point; IMHO it has to be validated after the record
on the next page has been read.

As proposed in the patch, the first ReadPageInternal() reads
SizeOfXLogShortPHD worth of data (right?), so we should have access to
XLogPageHeaderData.xlp_rem_len before attempting the second
ReadPageInternal() of Min(total_len - gotlen + SizeOfXLogShortPHD,
XLOG_BLCKSZ) worth of data. The check that fails with an "invalid
contrecord length %u" relies on xlp_rem_len, total_len, RecPtr and
gotlen. We know all this information before the second
ReadPageInternal().
Yeah, you are right; this can be moved before the second
ReadPageInternal(), as we have already read the XLogPageHeader.
--
Regards,
Dilip Kumar
Google
Another reliable approach would be to make the code wait before
reading the record in the internal loop of ReadPageInternal() with an
injection point when we know that we have a contrecord, but I'm not
really excited about this prospect in xlogreader.c, which can also be
used in the frontend.

Per my understanding, an injection point must be added while flushing
a WAL record to emulate the incomplete-WAL-record issue. To confirm,
how could it be used in ReadPageInternal()?
I've spent some time on how we could use an injection point to cause
the same situation, which generates the OVERWRITE_CONTRECORD in the
middle of a page, but it seems difficult. XLogFlush()->XLogWrite() has
the responsibility to flush WAL records, but it does not write/flush
per page. It tries to write as many pages as possible and flushes the
result at once. A corner case is when the segment is changed, but that
is not the situation in which we observed the failure.
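A rough sketch of that behavior, with invented names and heavily
simplified logic (the real XLogWrite() is much more involved):
```
#include <stddef.h>
#include <unistd.h>

/*
 * Heavily simplified sketch of the behavior described above:
 * XLogWrite() pushes all contiguous ready pages with as few write()
 * calls as possible and flushes the result at once, so there is no
 * per-page write/flush boundary at which an injection point could
 * deterministically leave a cross-page record half-written.
 */
static void
xlog_write_sketch(int fd, const char *pages, size_t npages, size_t page_size)
{
	/* Write as many complete pages as possible in one go... */
	(void) write(fd, pages, npages * page_size);

	/* ...and flush the result at once. */
	(void) fsync(fd);
}
```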
So... I have no idea how to create a deterministic reproducer; it is
OK for me to use the 046 test for this purpose.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
Currently, the logic attempts to read the complete WAL record based on
the size obtained before the crash, even though only a partial record
was written. It checks the page header to determine whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the
complete WAL record, in the XLogDecodeNextRecord function; but since
that much WAL data was never written, we never get a chance to check
the header. To address this issue, a more robust approach would be to
first read the page header, check whether the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed
to read the rest of the WAL record if it is not marked as a partial
overwrite. This would prevent the system from waiting for WAL data
that will never arrive. The attached partial_wal_record_fix.patch
implements this.

So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?
Modified
I don't have a consistent way to reproduce this issue; currently it
can be reproduced by running the 046_checkpoint_logical_slot test
about 50 times. This test was reverted recently after it caused a few
buildfarm failures, discussed at [2]. The same test is also attached
as reverted_test_046_checkpoint_logical_slot.patch.

Seeing the noise that this originally created in the CFbot and the
buildfarm, even if the issue would only be triggered after a timeout,
I'd vote for relying on this test being good enough for the purpose of
this race condition.
I agree
Another reliable approach would be to make the code wait before
reading the record in the internal loop of ReadPageInternal() with an
injection point when we know that we have a contrecord, but I'm not
really excited about this prospect in xlogreader.c, which can also be
used in the frontend.
I could not find an easy way to simulate this scenario through an
injection point; I felt 046_checkpoint_logical_slot.pl should be
enough.
The attached v2 patch has the changes for this.
Regards,
Vignesh
Attachments:
v2-0001-Fix-infinite-wait-when-reading-partially-written-.patch
From 9029482c8dca9762b016743ca693b8e50396f253 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 26 Jun 2025 16:36:40 +0530
Subject: [PATCH v2] Fix infinite wait when reading partially written WAL
record after crash
If a crash occurs while writing a WAL record that spans multiple pages,
the recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag. However, logical decoding currently attempts to read the full WAL record
based on its expected size before checking this flag, which can lead to an
infinite wait if the remaining data is never written (e.g., no activity after crash).
This patch updates the logic to first read the page header and check for the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct the
full WAL record. If the flag is set, decoding correctly identifies the record
as incomplete and avoids waiting for WAL data that will never arrive.
---
src/backend/access/transam/xlogreader.c | 22 ++-
src/test/recovery/meson.build | 1 +
.../recovery/t/046_checkpoint_logical_slot.pl | 136 ++++++++++++++++++
3 files changed, 154 insertions(+), 5 deletions(-)
create mode 100644 src/test/recovery/t/046_checkpoint_logical_slot.pl
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 2790ade1f91..cba00f6149b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -723,11 +723,14 @@ restart:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
- /* Wait for the next page to become available */
- readOff = ReadPageInternal(state, targetPagePtr,
- Min(total_len - gotlen + SizeOfXLogShortPHD,
- XLOG_BLCKSZ));
-
+ /*
+ * Read the page header before processing the WAL record data. This
+ * is necessary to correctly handle cases where the previous record
+ * ended as a partial record. Attempting to read the full
+ * record without checking the header may result in waiting
+ * indefinitely for data that doesn't exist.
+ */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
@@ -775,6 +778,15 @@ restart:
LSN_FORMAT_ARGS(RecPtr));
goto err;
}
+
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(total_len - gotlen + SizeOfXLogShortPHD,
+ XLOG_BLCKSZ));
+ if (readOff == XLREAD_WOULDBLOCK)
+ return XLREAD_WOULDBLOCK;
+ else if (readOff < 0)
+ goto err;
/* Append the continuation from this page to the buffer */
pageHeaderSize = XLogPageHeaderSize(pageHeader);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 6e78ff1a030..52993c32dbb 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -54,6 +54,7 @@ tests += {
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
't/045_archive_restartpoint.pl',
+ 't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
't/048_vacuum_horizon_floor.pl'
],
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
new file mode 100644
index 00000000000..d67c5108d78
--- /dev/null
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -0,0 +1,136 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# This test verifies the case when the logical slot is advanced during
+# checkpoint. The test checks that the logical slot's restart_lsn still refers
+# to an existing WAL segment after an immediate restart.
+#
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf', "wal_level = 'logical'");
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Create the two slots we'll need.
+$node->safe_psql('postgres',
+ q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')}
+);
+$node->safe_psql('postgres',
+ q{select pg_create_physical_replication_slot('slot_physical', true)});
+
+# Advance both slots to the current position just to have everything "valid".
+$node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)}
+);
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Run checkpoint to flush current state to disk and set a baseline.
+$node->safe_psql('postgres', q{checkpoint});
+
+# Generate some transactions to get RUNNING_XACTS.
+my $xacts = $node->background_psql('postgres');
+$xacts->query_until(
+ qr/run_xacts/,
+ q(\echo run_xacts
+SELECT 1 \watch 0.1
+\q
+));
+
+$node->advance_wal(20);
+
+# Run another checkpoint to set a new restore LSN.
+$node->safe_psql('postgres', q{checkpoint});
+
+$node->advance_wal(20);
+
+# Run another checkpoint, this time in the background, and make it wait
+# on the injection point so that the checkpoint stops right before
+# removing old WAL segments.
+note('starting checkpoint');
+
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_safe(
+ q(select injection_points_attach('checkpoint-before-old-wal-removal','wait'))
+);
+$checkpoint->query_until(
+ qr/starting_checkpoint/,
+ q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint stops right before removing WAL segments.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal');
+note('injection_point is reached');
+
+# Try to advance the logical slot, but make it stop when it moves to the next
+# WAL segment (this has to happen in the background, too).
+my $logical = $node->background_psql('postgres');
+$logical->query_safe(
+ q{select injection_points_attach('logical-replication-slot-advance-segment','wait');}
+);
+$logical->query_until(
+ qr/get_changes/,
+ q(
+\echo get_changes
+select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1
+\q
+));
+
+# Wait until the slot's restart_lsn points to the next WAL segment.
+note('waiting for injection_point');
+$node->wait_for_event('client backend',
+ 'logical-replication-slot-advance-segment');
+note('injection_point is reached');
+
+# OK, we're in the right situation: time to advance the physical slot, which
+# recalculates the required LSN, and then unblock the checkpoint, which
+# removes the WAL still needed by the logical slot.
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Continue the checkpoint.
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+
+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');
+
+$node->start;
+
+eval {
+ $node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);}
+ );
+};
+is($@, '', "Logical slot still valid");
+
+done_testing();
--
2.43.0
On Thu, Jun 26, 2025 at 09:40:05AM +0000, Hayato Kuroda (Fujitsu) wrote:
I've spent some time on how we could use an injection point to cause
the same situation, which generates the OVERWRITE_CONTRECORD in the
middle of a page, but it seems difficult.
Hmm. I was wondering first about detecting an OVERWRITE_CONTRECORD
while looping in XLogDecodeNextRecord() with an
IS_INJECTION_POINT_ATTACHED(), then doing a wait. But this method
means that the injection point would be linked to the fix we are
trying to bundle here, because it would need to be placed between the
two ReadPageInternal() calls, as we need to read the header first.
Usually I try to place points so that they are generic enough, but
that's a vague philosophy.
So... I have no idea how to create a deterministic reproducer; it is
OK for me to use the 046 test for this purpose.
Let's reuse 046 and call it a day. It would not be the first
reproducer that fails on timeout if a problem arises. The important
fact is to be informed about the failure, while keeping tests fast in
the "normal" cases. We have a few tests like that already in the
tree, where a timeout is the best thing we can rely on when the code
fails its test.
--
Michael
On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?

Modified
It seems to me that this assert can be moved after the second page
read:
Assert(SizeOfXLogShortPHD <= readOff);
Coming back to Kuroda-san's point about performance, did you do some
checks related to that, and did you measure any difference? I suspect
not, because in most cases we are just going to fetch the next page,
and we would trigger the fast-exit path of ReadPageInternal() on the
second call when fetching the rest. I still need to get an idea of all
that myself, probably with various lengths of logical message records.
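For reference, the fast-exit path mentioned here is the short-circuit
at the top of ReadPageInternal(), approximately (paraphrased from
xlogreader.c):
```
/* check whether we have all the requested data already */
if (targetSegNo == state->seg.ws_segno &&
	targetPageOff == state->segoff &&
	reqLen <= state->readLen)
	return state->readLen;
```
Because the first, header-only call has already read the page into the
reader's buffer, the second call usually satisfies this check and
returns immediately.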
Perhaps this code could be improved in the future with fewer page
reads. Anyway, what you are doing here is simple enough that it is a
no-brainer for the back-branches, because we are just forcing our way
through with a new short header validation, so logically that's sound
as far as I can see.
--
Michael
On Fri, 27 Jun 2025 at 07:05, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?

Modified
It seems to me that this assert can be moved after the second page
read:
Assert(SizeOfXLogShortPHD <= readOff);
I felt this Assert can stay here to ensure we've read SizeOfXLogShortPHD
before checking the page header contents. For the second page read,
the following existing Assert should be enough:
Assert(pageHeaderSize <= readOff);
Coming back to Kuroda-san's point about performance, did you do some
checks related to that, and did you measure any difference? I suspect
not, because in most cases we are just going to fetch the next page,
and we would trigger the fast-exit path of ReadPageInternal() on the
second call when fetching the rest. I still need to get an idea of all
that myself, probably with various lengths of logical message records.
The test execution times are measured in microseconds. In the results
table below, the first row indicates the message size, and each value
represents the median of 5 test runs. The attached script was used to
run these tests. In the script, the MSG_SIZE variable should be
changed to 1000 (1 page), 10000 (~2 pages), 25000 (~3 pages), 50000
(~6 pages), 100000 (~12 pages), 1000000 (~122 pages), 10000000 (~1220
pages) and 100000000 (~12207 pages). Test execution times can be taken
from the run_*.dat files that will be generated.
Size  |   1000 |  10000 |   25000 |   50000 |  100000 | 1000000
------|--------|--------|---------|---------|---------|--------
HEAD  | 9297.1 | 9895.4 | 10844.2 | 12946.5 | 16945.1 | 86187.1
Patch | 9222.7 | 9889   | 10897.1 | 12904.2 | 16858.4 | 87115.5

Size  | 10000000 | 100000000
------|----------|----------
HEAD  | 804965.6 | 331639.7
Patch | 804942.6 | 321198.6
The performance results show that the patch does not introduce any
noticeable overhead across varying message sizes; I felt there was no
impact from the additional page header read.
Perhaps this code could be improved in the future with fewer page
reads. Anyway, what you are doing here is simple enough that it is a
no-brainer for the back-branches, because we are just forcing our way
through with a new short header validation, so logically that's sound
as far as I can see.
I was waiting for the review of the master branch patch to complete. I
will post the back branch patches soon.
Regards,
Vignesh
Attachments:
On Fri, 27 Jun 2025 at 07:05, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?

Modified

It seems to me that this assert can be moved after the second page
read:
Assert(SizeOfXLogShortPHD <= readOff);

Coming back to Kuroda-san's point about performance, did you do some
checks related to that, and did you measure any difference? I suspect
not, because in most cases we are just going to fetch the next page,
and we would trigger the fast-exit path of ReadPageInternal() on the
second call when fetching the rest. I still need to get an idea of all
that myself, probably with various lengths of logical message records.

Perhaps this code could be improved in the future with fewer page
reads. Anyway, what you are doing here is simple enough that it is a
no-brainer for the back-branches, because we are just forcing our way
through with a new short header validation, so logically that's sound
as far as I can see.
I was unable to reproduce the same test failure on the PG17 branch,
even after running the test around 500 times. However, on the master
branch, the failure consistently reproduces approximately once in
every 50 runs. I also noticed that while the buildfarm has reported
multiple failures for this test on the master branch, none of them
appear to be on the PG17 branch. I'm not yet sure why this discrepancy
exists.
Regards,
Vignesh
On Fri, Jun 27, 2025 at 9:29 PM vignesh C <vignesh21@gmail.com> wrote:
On Fri, 27 Jun 2025 at 07:05, Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
So you are suggesting the addition of an extra ReadPageInternal() that
forces a read of only the header, performs the checks on the header,
then reads the rest. After reading SizeOfXLogShortPHD worth of data,
shouldn't the checks on xlp_rem_len be done a bit earlier than what
you are proposing in this patch?

Modified
It seems to me that this assert can be moved after the second page
read:
Assert(SizeOfXLogShortPHD <= readOff);

I felt this Assert can stay here to ensure we've read SizeOfXLogShortPHD
before checking the page header contents. For the second page read,
the following existing Assert should be enough:
Assert(pageHeaderSize <= readOff);
Thanks, that makes sense.
Coming back to Kuroda-san's point about performance, did you do some
checks related to that, and did you measure any difference? I suspect
not, because in most cases we are just going to fetch the next page,
and we would trigger the fast-exit path of ReadPageInternal() on the
second call when fetching the rest. I still need to get an idea of all
that myself, probably with various lengths of logical message records.

The test execution times are measured in microseconds. In the results
table below, the first row indicates the message size, and each value
represents the median of 5 test runs. The attached script was used to
run these tests. In the script, the MSG_SIZE variable should be
changed to 1000 (1 page), 10000 (~2 pages), 25000 (~3 pages), 50000
(~6 pages), 100000 (~12 pages), 1000000 (~122 pages), 10000000 (~1220
pages) and 100000000 (~12207 pages). Test execution times can be taken
from the run_*.dat files that will be generated.

Size  |   1000 |  10000 |   25000 |   50000 |  100000 | 1000000
------|--------|--------|---------|---------|---------|--------
HEAD  | 9297.1 | 9895.4 | 10844.2 | 12946.5 | 16945.1 | 86187.1
Patch | 9222.7 | 9889   | 10897.1 | 12904.2 | 16858.4 | 87115.5

Size  | 10000000 | 100000000
------|----------|----------
HEAD  | 804965.6 | 331639.7
Patch | 804942.6 | 321198.6

The performance results show that the patch does not introduce any
noticeable overhead across varying message sizes; I felt there was no
impact from the additional page header read.
Yeah, that doesn't seem like a regression.
--
Regards,
Dilip Kumar
Google
Dear Vignesh,
I was unable to reproduce the same test failure on the PG17 branch,
even after running the test around 500 times. However, on the master
branch, the failure consistently reproduces approximately once in
every 50 runs. I also noticed that while the buildfarm has reported
multiple failures for this test on the master branch, none of them
appear to be on the PG17 branch. I'm not yet sure why this discrepancy
exists.
I was also unable to reproduce it as-is. After analyzing a bit more, I
found that on PG17 the workload cannot generate an FPI_FOR_HINT
record. Since that record type is longer than a page, on HEAD there
was a possibility that the WAL record could be flushed partially; on
PG17 that could not happen, so the OVERWRITE_CONTRECORD would not
appear.
I modified the test code like [1] and confirmed that the same hang
could happen on PG17. It generates a long record (roughly 9000 bytes,
longer than a WAL page) which can cross the page boundary and can be
flushed partially.
[1]:
```
--- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -123,6 +123,10 @@ $node->safe_psql('postgres',
$node->safe_psql('postgres',
q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+# Generate a long WAL record
+$node->safe_psql('postgres',
+ q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, 30 Jun 2025 at 17:41, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Vignesh,
I was unable to reproduce the same test failure on the PG17 branch,
even after running the test around 500 times. However, on the master
branch, the failure consistently reproduces approximately once in
every 50 runs. I also noticed that while the buildfarm has reported
multiple failures for this test on the master branch, none of them
appear to be on the PG17 branch. I'm not yet sure why this discrepancy
exists.

I was also unable to reproduce it as-is. After analyzing a bit more, I
found that on PG17 the workload cannot generate an FPI_FOR_HINT
record. Since that record type is longer than a page, on HEAD there
was a possibility that the WAL record could be flushed partially; on
PG17 that could not happen, so the OVERWRITE_CONTRECORD would not
appear.

I modified the test code like [1] and confirmed that the same hang
could happen on PG17. It generates a long record (roughly 9000 bytes,
longer than a WAL page) which can cross the page boundary and can be
flushed partially.

[1]:
```
--- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -123,6 +123,10 @@ $node->safe_psql('postgres',
 $node->safe_psql('postgres',
 	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+
+# Generate a long WAL record
+$node->safe_psql('postgres',
+	q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
```
Thanks, Kuroda-san. I've prepared a similar test that doesn't rely on
injection points. The issue reproduced consistently on all branches
back to PG13. You can use the attached
049_slot_get_changes_wait_continously_pg17.pl script (found in the
049_slot_get_changes_wait_continously_pg17.zip file) to verify this.
Just copy the script to src/test/recovery and run the test to observe
the problem.
The included patch addresses the issue. Use
v3_PG17-0001-Fix-infinite-wait-when-reading-partially-written-.patch
for PG17, PG16, and PG15, and
v3_PG14-0001-Fix-infinite-wait-when-reading-partially-written-.patch
for PG14 and PG13.
Regards,
Vignesh
Attachments:
v3_PG14-0001-Fix-infinite-wait-when-reading-partially-written-.patch
From 6824ac619506fdc762ae49726f95b1d5468c84dc Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Sat, 28 Jun 2025 20:14:40 +0530
Subject: [PATCH v3] Fix infinite wait when reading partially written WAL
record after crash
If a crash occurs while writing a WAL record that spans multiple pages,
the recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag. However, logical decoding currently attempts to read the full WAL record
based on its expected size before checking this flag, which can lead to an
infinite wait if the remaining data is never written (e.g., no activity after crash).
This patch updates the logic to first read the page header and check for the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct the
full WAL record. If the flag is set, decoding correctly identifies the record
as incomplete and avoids waiting for WAL data that will never arrive.
---
src/backend/access/transam/xlogreader.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f84569cdbab..11b1d8facaf 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -407,11 +407,14 @@ restart:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
- /* Wait for the next page to become available */
- readOff = ReadPageInternal(state, targetPagePtr,
- Min(total_len - gotlen + SizeOfXLogShortPHD,
- XLOG_BLCKSZ));
-
+ /*
+ * Read the page header before processing the WAL record data.
+ * This is necessary to correctly handle cases where the previous
+ * record ended as a partial record. Attempting to read the full
+ * record without checking the header may result in waiting
+ * indefinitely for data that doesn't exist.
+ */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff < 0)
goto err;
@@ -459,6 +462,13 @@ restart:
goto err;
}
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(total_len - gotlen + SizeOfXLogShortPHD,
+ XLOG_BLCKSZ));
+ if (readOff < 0)
+ goto err;
+
/* Append the continuation from this page to the buffer */
pageHeaderSize = XLogPageHeaderSize(pageHeader);
--
2.43.0
v3-0001-Fix-infinite-wait-when-reading-partially-written-.patch
From 6d9f8c360b6ebe652260b7fa7dfcd07e54e01db3 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 26 Jun 2025 16:36:40 +0530
Subject: [PATCH v3] Fix infinite wait when reading partially written WAL
record after crash
If a crash occurs while writing a WAL record that spans multiple pages,
the recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag. However, logical decoding currently attempts to read the full WAL record
based on its expected size before checking this flag, which can lead to an
infinite wait if the remaining data is never written (e.g., no activity after crash).
This patch updates the logic to first read the page header and check for the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct the
full WAL record. If the flag is set, decoding correctly identifies the record
as incomplete and avoids waiting for WAL data that will never arrive.
---
src/backend/access/transam/xlogreader.c | 22 ++-
src/test/recovery/meson.build | 1 +
.../recovery/t/046_checkpoint_logical_slot.pl | 136 ++++++++++++++++++
3 files changed, 154 insertions(+), 5 deletions(-)
create mode 100644 src/test/recovery/t/046_checkpoint_logical_slot.pl
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 2790ade1f91..bf4a13ba55f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -723,11 +723,14 @@ restart:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
- /* Wait for the next page to become available */
- readOff = ReadPageInternal(state, targetPagePtr,
- Min(total_len - gotlen + SizeOfXLogShortPHD,
- XLOG_BLCKSZ));
-
+ /*
+ * Read the page header before processing the WAL record data.
+ * This is necessary to correctly handle cases where the previous
+ * record ended as a partial record. Attempting to read the full
+ * record without checking the header may result in waiting
+ * indefinitely for data that doesn't exist.
+ */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
@@ -776,6 +779,15 @@ restart:
goto err;
}
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(total_len - gotlen + SizeOfXLogShortPHD,
+ XLOG_BLCKSZ));
+ if (readOff == XLREAD_WOULDBLOCK)
+ return XLREAD_WOULDBLOCK;
+ else if (readOff < 0)
+ goto err;
+
/* Append the continuation from this page to the buffer */
pageHeaderSize = XLogPageHeaderSize(pageHeader);
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 6e78ff1a030..52993c32dbb 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -54,6 +54,7 @@ tests += {
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
't/045_archive_restartpoint.pl',
+ 't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
't/048_vacuum_horizon_floor.pl'
],
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
new file mode 100644
index 00000000000..d67c5108d78
--- /dev/null
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -0,0 +1,136 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# This test verifies the case when the logical slot is advanced during
+# checkpoint. The test checks that the logical slot's restart_lsn still refers
+# to an existing WAL segment after an immediate restart.
+#
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf', "wal_level = 'logical'");
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Create the two slots we'll need.
+$node->safe_psql('postgres',
+ q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')}
+);
+$node->safe_psql('postgres',
+ q{select pg_create_physical_replication_slot('slot_physical', true)});
+
+# Advance both slots to the current position just to have everything "valid".
+$node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)}
+);
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Run checkpoint to flush current state to disk and set a baseline.
+$node->safe_psql('postgres', q{checkpoint});
+
+# Generate some transactions to get RUNNING_XACTS.
+my $xacts = $node->background_psql('postgres');
+$xacts->query_until(
+ qr/run_xacts/,
+ q(\echo run_xacts
+SELECT 1 \watch 0.1
+\q
+));
+
+$node->advance_wal(20);
+
+# Run another checkpoint to set a new restore LSN.
+$node->safe_psql('postgres', q{checkpoint});
+
+$node->advance_wal(20);
+
+# Run another checkpoint, this time in the background, and make it wait
+# on the injection point so that the checkpoint stops right before
+# removing old WAL segments.
+note('starting checkpoint');
+
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_safe(
+ q(select injection_points_attach('checkpoint-before-old-wal-removal','wait'))
+);
+$checkpoint->query_until(
+ qr/starting_checkpoint/,
+ q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint stops right before removing WAL segments.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal');
+note('injection_point is reached');
+
+# Try to advance the logical slot, but make it stop when it moves to the next
+# WAL segment (this has to happen in the background, too).
+my $logical = $node->background_psql('postgres');
+$logical->query_safe(
+ q{select injection_points_attach('logical-replication-slot-advance-segment','wait');}
+);
+$logical->query_until(
+ qr/get_changes/,
+ q(
+\echo get_changes
+select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1
+\q
+));
+
+# Wait until the slot's restart_lsn points to the next WAL segment.
+note('waiting for injection_point');
+$node->wait_for_event('client backend',
+ 'logical-replication-slot-advance-segment');
+note('injection_point is reached');
+
+# OK, we're in the right situation: time to advance the physical slot, which
+# recalculates the required LSN, and then unblock the checkpoint, which
+# removes the WAL still needed by the logical slot.
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Continue the checkpoint.
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+
+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');
+
+$node->start;
+
+eval {
+ $node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);}
+ );
+};
+is($@, '', "Logical slot still valid");
+
+done_testing();
--
2.43.0
v3_PG17-0001-Fix-infinite-wait-when-reading-partially-written-.patch
From c213cae3ded6467a7b56715105263ed49b2ae706 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Sat, 28 Jun 2025 18:53:29 +0530
Subject: [PATCH v3] Fix infinite wait when reading partially written WAL
record after crash
If a crash occurs while writing a WAL record that spans multiple pages,
the recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag. However, logical decoding currently attempts to read the full WAL record
based on its expected size before checking this flag, which can lead to an
infinite wait if the remaining data is never written (e.g., no activity after crash).
This patch updates the logic to first read the page header and check for the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct the
full WAL record. If the flag is set, decoding correctly identifies the record
as incomplete and avoids waiting for WAL data that will never arrive.
---
src/backend/access/transam/xlogreader.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a07f55dfbc9..d3dcccf4949 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -722,11 +722,14 @@ restart:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
- /* Wait for the next page to become available */
- readOff = ReadPageInternal(state, targetPagePtr,
- Min(total_len - gotlen + SizeOfXLogShortPHD,
- XLOG_BLCKSZ));
-
+ /*
+ * Read the page header before processing the WAL record data.
+ * This is necessary to correctly handle cases where the previous
+ * record ended as a partial record. Attempting to read the full
+ * record without checking the header may result in waiting
+ * indefinitely for data that doesn't exist.
+ */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
@@ -775,6 +778,15 @@ restart:
goto err;
}
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(total_len - gotlen + SizeOfXLogShortPHD,
+ XLOG_BLCKSZ));
+ if (readOff == XLREAD_WOULDBLOCK)
+ return XLREAD_WOULDBLOCK;
+ else if (readOff < 0)
+ goto err;
+
/* Append the continuation from this page to the buffer */
pageHeaderSize = XLogPageHeaderSize(pageHeader);
--
2.43.0
On Mon, Jun 30, 2025 at 01:21:24PM +0530, Dilip Kumar wrote:
On Fri, Jun 27, 2025 at 9:29 PM vignesh C <vignesh21@gmail.com> wrote:
The test execution times are measured in microseconds. In the results
table below, the first row indicates the message size, and each value
represents the median of 5 test runs. The attached script was used to
run these tests. In the script, the MSG_SIZE variable should be
changed to 1000 (1 page), 10000 (~2 pages), 25000 (~3 pages), 50000
(~6 pages), 100000 (~12 pages), 1000000 (~122 pages), 10000000 (~1220
pages) and 100000000 (~12207 pages). Test execution times can be taken
from the run_*.dat files that will be generated.

Size  |   1000 |  10000 |   25000 |   50000 |  100000 | 1000000
------|--------|--------|---------|---------|---------|--------
HEAD  | 9297.1 | 9895.4 | 10844.2 | 12946.5 | 16945.1 | 86187.1
Patch | 9222.7 | 9889   | 10897.1 | 12904.2 | 16858.4 | 87115.5

Size  | 10000000 | 100000000
------|----------|----------
HEAD  | 804965.6 | 331639.7
Patch | 804942.6 | 321198.6

The performance results show that the patch does not introduce any
noticeable overhead across varying message sizes; I felt there was no
impact from the additional page header read.
That's encouraging, yes, thanks for these numbers. I'll try to get
some numbers myself next week. One thing that really stresses me with
this patch is that we have to do this additional header check even in
the physical replication path, but as we're just adding that for the
case where a record needs to be reconstructed across two pages,
perhaps I'm just worrying too much.
Yeah, that doesn't seem like a regression.
Yep.
--
Michael
On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
Thanks, Kuroda-san. I've prepared a similar test that doesn't rely on
injection points. The issue reproduced consistently on all branches
back to PG13. You can use the attached
049_slot_get_changes_wait_continously_pg17.pl script (found in the
049_slot_get_changes_wait_continously_pg17.zip file) to verify this.
Just copy the script to src/test/recovery and run the test to observe
the problem.
+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');
While v3-0001 is good enough to reproduce the original issue after a
few dozen runs here, we cannot reintroduce the test without the proper
wait_for_log() logic making sure that the checkpoint is completed
before we stop the server.
I like the addition of an extra pg_logical_emit_message() in test 046
anyway, down to v17, for all the branches. Even if
the reproduction is sporadic, we have seen it pretty quickly in the CI
and in the buildfarm so it would not go unnoticed for a long time if
we mess up with this stuff again.
We are lucky enough that the existing test does not fail in v17, TBH,
so let's tackle all the issues on this thread step by step:
1) First, let's add the header check fix.
2) Let's reintroduce a fixed version of 046 on HEAD and on v18, with
an extra pg_logical_emit_message() that travels across WAL pages
forcing a reconstruction of the record and the extra header check.
3) Let's fix recovery test 046 currently in v17, for the checkpoint
wait logic and the extra pg_logical_emit_message().
4) Let's improve the stability of 047 everywhere for the checkpoint
wait logic, as already done by Alexander.
I have been doing some tests with the patch vs HEAD (thanks for the
script, it has saved some time), and I can summarize my results by
using a message size of 819200000, hence worth 100k pages of WAL or
so. Then, taking 20 runs, and eliminating the three highest and
lowest numbers to eliminate some of the variance, I am getting an
average of runtime when re-assembling the record of:
- 695473.09us for HEAD
- 695679.18us for the patch.
So I am not going to stress more on this point.
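For clarity, the trimmed average described above (20 runs, drop the three
highest and three lowest, average the remaining 14) corresponds to a
computation like the following sketch; the array contents are placeholders
to be filled with measured runtimes, and the helper is illustrative, not
part of any patch.

/*
 * Illustrative trimmed mean, matching the methodology described above:
 * sort the samples, drop the `trim` lowest and highest, average the rest.
 */
#include <stdio.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
    double      da = *(const double *) a;
    double      db = *(const double *) b;

    return (da > db) - (da < db);
}

static double
trimmed_mean(double *samples, int n, int trim)
{
    double      sum = 0.0;

    qsort(samples, n, sizeof(double), cmp_double);
    for (int i = trim; i < n - trim; i++)
        sum += samples[i];
    return sum / (n - 2 * trim);
}

int
main(void)
{
    double      runs[20] = {0}; /* placeholder: fill with measured runtimes (us) */

    printf("trimmed mean: %.2f us\n", trimmed_mean(runs, 20, 3));
    return 0;
}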
Attached is the result of all that for HEAD. There is also one extra
patch dedicated to v17, where I have checked that the extra
pg_logical_emit_message() is enough to reproduce the problem without
the patch, and that the problem is fixed with the patch. The patch
set for v17 is simpler of course, as test 046 still exists on
REL_17_STABLE. Note that I have moved this extra contrecord to be
generated before the checkpoint is completed, leading to the same
result.
With all that said, I'll move on with this stuff once the embargo for
v18 beta2 is lifted and the tag is pushed. That should happen in 24h
or so, I guess.
--
Michael
Attachments:
v4-0001-Fix-infinite-wait-when-reading-partially-written-.patch
From 4c920861f07e0f4e6e21e24e5f6a9060054c3232 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 15 Jul 2025 12:01:01 +0900
Subject: [PATCH v4 1/3] Fix infinite wait when reading partially-written WAL
record
If a crash occurs while writing a WAL record that spans multiple pages,
the recovery process marks the page with the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag. However, logical decoding
currently attempts to read the full WAL record based on its expected
size before checking this flag, which can lead to an infinite wait if
the remaining data is never written (e.g., no activity after crash).
This patch updates the logic to first read the page header and check for
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to
reconstruct the full WAL record. If the flag is set, decoding correctly
identifies the record as incomplete and avoids waiting for WAL data that
will never arrive.
---
src/backend/access/transam/xlogreader.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ac1f801b1eb0..2e5361ff4fdd 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -723,11 +723,12 @@ restart:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
- /* Wait for the next page to become available */
- readOff = ReadPageInternal(state, targetPagePtr,
- Min(total_len - gotlen + SizeOfXLogShortPHD,
- XLOG_BLCKSZ));
-
+ /*
+ * Read the page header before processing the record data, to be
+ * able to handle the case where the previous record ended as
+ * being a partial one.
+ */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
@@ -776,6 +777,15 @@ restart:
goto err;
}
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(total_len - gotlen + SizeOfXLogShortPHD,
+ XLOG_BLCKSZ));
+ if (readOff == XLREAD_WOULDBLOCK)
+ return XLREAD_WOULDBLOCK;
+ else if (readOff < 0)
+ goto err;
+
/* Append the continuation from this page to the buffer */
pageHeaderSize = XLogPageHeaderSize(pageHeader);
--
2.50.0
v4-0002-Improve-stability-of-recovery-test-047_checkpoint.patch
From 68fab6348a60320ff4dda943d69dc3388a5754e2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 15 Jul 2025 12:02:08 +0900
Subject: [PATCH v4 2/3] Improve stability of recovery test
047_checkpoint_physical_slot
Author: Alexander Korotkov <akorotkov@postgresql.org>
Backpatch-through: 17
---
src/test/recovery/t/047_checkpoint_physical_slot.pl | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/src/test/recovery/t/047_checkpoint_physical_slot.pl b/src/test/recovery/t/047_checkpoint_physical_slot.pl
index a1332b5d44cb..9e98383e30ea 100644
--- a/src/test/recovery/t/047_checkpoint_physical_slot.pl
+++ b/src/test/recovery/t/047_checkpoint_physical_slot.pl
@@ -94,9 +94,11 @@ $node->safe_psql('postgres',
q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
);
-# Continue the checkpoint.
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
$node->safe_psql('postgres',
q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
my $restart_lsn_old = $node->safe_psql('postgres',
q{select restart_lsn from pg_replication_slots where slot_name = 'slot_physical'}
@@ -104,8 +106,7 @@ my $restart_lsn_old = $node->safe_psql('postgres',
chomp($restart_lsn_old);
note("restart lsn before stop: $restart_lsn_old");
-# Abruptly stop the server (1 second should be enough for the checkpoint
-# to finish; it would be better).
+# Abruptly stop the server.
$node->stop('immediate');
$node->start;
--
2.50.0
v4-0003-Reintroduce-test-046_checkpoint_logical_slot.patch
From 4584664c18bb63a604c6d20eb22831ec6e59a49a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 15 Jul 2025 12:07:50 +0900
Subject: [PATCH v4 3/3] Reintroduce test 046_checkpoint_logical_slot
This patch is only for HEAD and v18, where the test has been removed.
The improvements to the checkpoint completion wait and the extra
pg_logical_emit_message() call, which forces the creation of a WAL
record that spans at least two pages, need to be backpatched down to
v17 to improve the stability and coverage of the original test.
Author: Alexander Korotkov <akorotkov@postgresql.org>
Co-authored-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
src/test/recovery/meson.build | 1 +
.../recovery/t/046_checkpoint_logical_slot.pl | 142 ++++++++++++++++++
2 files changed, 143 insertions(+)
create mode 100644 src/test/recovery/t/046_checkpoint_logical_slot.pl
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 6e78ff1a030b..52993c32dbba 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -54,6 +54,7 @@ tests += {
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
't/045_archive_restartpoint.pl',
+ 't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
't/048_vacuum_horizon_floor.pl'
],
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
new file mode 100644
index 000000000000..bdfca3ddf826
--- /dev/null
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -0,0 +1,142 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# This test verifies the case when the logical slot is advanced during
+# checkpoint. The test checks that the logical slot's restart_lsn still refers
+# to an existing WAL segment after an immediate restart.
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf', "wal_level = 'logical'");
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Create the two slots we'll need.
+$node->safe_psql('postgres',
+ q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')}
+);
+$node->safe_psql('postgres',
+ q{select pg_create_physical_replication_slot('slot_physical', true)});
+
+# Advance both slots to the current position just to have everything "valid".
+$node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)}
+);
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Run checkpoint to flush current state to disk and set a baseline.
+$node->safe_psql('postgres', q{checkpoint});
+
+# Generate some transactions to get RUNNING_XACTS.
+my $xacts = $node->background_psql('postgres');
+$xacts->query_until(
+ qr/run_xacts/,
+ q(\echo run_xacts
+SELECT 1 \watch 0.1
+\q
+));
+
+$node->advance_wal(20);
+
+# Run another checkpoint to set a new restore LSN.
+$node->safe_psql('postgres', q{checkpoint});
+
+$node->advance_wal(20);
+
+# Run another checkpoint, this time in the background, and make it wait
+# on the injection point) so that the checkpoint stops right before
+# removing old WAL segments.
+note('starting checkpoint');
+
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_safe(
+ q(select injection_points_attach('checkpoint-before-old-wal-removal','wait'))
+);
+$checkpoint->query_until(
+ qr/starting_checkpoint/,
+ q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint stops right before removing WAL segments.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal');
+note('injection_point is reached');
+
+# Try to advance the logical slot, but make it stop when it moves to the next
+# WAL segment (this has to happen in the background, too).
+my $logical = $node->background_psql('postgres');
+$logical->query_safe(
+ q{select injection_points_attach('logical-replication-slot-advance-segment','wait');}
+);
+$logical->query_until(
+ qr/get_changes/,
+ q(
+\echo get_changes
+select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1
+\q
+));
+
+# Wait until the slot's restart_lsn points to the next WAL segment.
+note('waiting for injection_point');
+$node->wait_for_event('client backend',
+ 'logical-replication-slot-advance-segment');
+note('injection_point is reached');
+
+# OK, we're in the right situation: time to advance the physical slot, which
+# recalculates the required LSN, and then unblock the checkpoint, which
+# removes the WAL still needed by the logical slot.
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Generate a long WAL record, spanning at least two pages for the follow-up
+# post-recovery check.
+$node->safe_psql('postgres',
+ q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
+
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+
+# Abruptly stop the server.
+$node->stop('immediate');
+
+$node->start;
+
+eval {
+ $node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);}
+ );
+};
+is($@, '', "Logical slot still valid");
+
+done_testing();
--
2.50.0
v4-PG17-0003-Improve-recovery-test-046_checkpoint_logical_slot.patch
From c09e67a8e074a3235499bcf891487fdc1293072d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 15 Jul 2025 12:25:37 +0900
Subject: [PATCH v4] Improve recovery test 046_checkpoint_logical_slot
This commit improves 046_checkpoint_logical_slot on two aspects:
- Add one pg_logical_emit_message() call, to force the creation of a
record that spawns across two pages.
- Make the logic wait for the checkpoint completion.
Author: Alexander Korotkov <akorotkov@postgresql.org>
Co-authored-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
src/test/recovery/t/046_checkpoint_logical_slot.pl | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
index 0468d4609e43..d2cf1cb4464e 100644
--- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -119,12 +119,18 @@ $node->safe_psql('postgres',
q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
);
+# Generate a long WAL record, spanning at least two pages for the follow-up
+# post-recovery check.
+$node->safe_psql('postgres',
+ q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
+
# Continue the checkpoint.
+my $log_offset = -s $node->logfile;
$node->safe_psql('postgres',
q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
-# Abruptly stop the server (1 second should be enough for the checkpoint
-# to finish; it would be better).
+# Abruptly stop the server.
$node->stop('immediate');
$node->start;
--
2.50.0
Dear Michael,
I like the addition of an extra pg_logical_emit_message() in test 046
anyway, down to v17, for all the branches. Even if
the reproduction is sporadic, we have seen it pretty quickly in the CI
and in the buildfarm so it would not go unnoticed for a long time if
we mess up with this stuff again.
I put pg_logical_emit_message() after injection_points_wakeup(), but your patch
puts it before. I verified that even your patch can reproduce the issue, but is
there a reason?
(This is not a comment that must be fixed; I just want to understand the reasoning.)
With all that said, I'll move on with this stuff once the embargo for
v18 beta2 is lifted and the tag is pushed. That should happen in 24h
or so, I guess.
The provided patches look good to me.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi, Michael!
On Tue, Jul 15, 2025 at 6:31 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
[...]

With all that said, I'll move on with this stuff once the embargo for
v18 beta2 is lifted and the tag is pushed. That should happen in 24h
or so, I guess.
Thank you for your efforts on this subject. Actually, I was planning
to work on pushing this after the release freeze ends for v18. I'd
like to do this at least for the tests, as they were initially
committed by me.
Anyway, please hold off on pushing this for ~1 day to let me do a
final review.
------
Regards,
Alexander Korotkov
Supabase
On Tue, Jul 15, 2025 at 09:19:50PM +0300, Alexander Korotkov wrote:
Thank you for your efforts on this subject. Actually, I was planning
to work on pushing this after the release freeze ends for v18. I'd
like to do this at least for the tests, as they were initially
committed by me.

Anyway, please hold off on pushing this for ~1 day to let me do a
final review.
If you want to apply all of this yourself, please feel free, of
course.
--
Michael
On Tue, Jul 15, 2025 at 09:14:27AM +0000, Hayato Kuroda (Fujitsu) wrote:
I put pg_logical_emit_message() after injection_points_wakeup(), but your patch
puts it before. I verified even your patch can reproduce the issue, but is there
a reason?
(This is not a comment which must be fixed, I just want to know the reason to study)
It's to increase the odds that the cross-page record finds its way to
disk in time for the recovery phase. I was wondering about forcing a
flush of pg_logical_emit_message() as well, but the reproduction rate
was still good enough without it.
With all that said, I'll move on with this stuff once the embargo for
v18 beta2 is lifted and the tag is pushed. That should happen in 24h
or so, I guess.

The provided patches look good to me.
Thanks for the reviews!
--
Michael
On Wed, Jul 16, 2025 at 1:12 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jul 15, 2025 at 09:19:50PM +0300, Alexander Korotkov wrote:
[...]

If you want to apply all of this yourself, please feel free, of
course.
Thank you!
I went through the patchset. Everything looks good to me. I only did
some improvements to comments and commit messages. I'm going to push
this if no objections.
------
Regards,
Alexander Korotkov
Supabase
Attachments:
v5-PG17-0003-Improve-recovery-test-046_checkpoint_logical.patch
From 753aa3c063c9fd9f690583cf0a7bde19624e69d1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 18 Jul 2025 00:04:52 +0300
Subject: [PATCH v5-PG17] Improve recovery test 046_checkpoint_logical_slot
This commit improves 046_checkpoint_logical_slot in two aspects:
- Add one pg_logical_emit_message() call to force the creation of a record
that spans across two pages.
- Make the logic wait for the checkpoint completion.
Discussion: https://postgr.es/m/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com
Author: Alexander Korotkov <akorotkov@postgresql.org>
Co-authored-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
---
src/test/recovery/t/046_checkpoint_logical_slot.pl | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
index 0468d4609e4..d2cf1cb4464 100644
--- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -119,12 +119,18 @@ $node->safe_psql('postgres',
q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
);
+# Generate a long WAL record, spanning at least two pages for the follow-up
+# post-recovery check.
+$node->safe_psql('postgres',
+ q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
+
# Continue the checkpoint.
+my $log_offset = -s $node->logfile;
$node->safe_psql('postgres',
q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
-# Abruptly stop the server (1 second should be enough for the checkpoint
-# to finish; it would be better).
+# Abruptly stop the server.
$node->stop('immediate');
$node->start;
--
2.39.5 (Apple Git-154)
v5-0002-Improve-the-stability-of-the-recovery-test-047_ch.patch
From 8de05d7864af9263ca5d14182d8b57407cb2ff95 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 17 Jul 2025 23:42:57 +0300
Subject: [PATCH v5 2/3] Improve the stability of the recovery test
047_checkpoint_physical_slot
Currently, the comment in 047_checkpoint_physical_slot shows an
incomplete intention to wait for checkpoint completion before performing
an immediate database stop. However, an immediate node stop can occur both
before and after checkpoint completion. Both cases should work correctly.
But we would like the test to be more stable and deterministic. This is why
this commit makes this test explicitly wait for the checkpoint completion
log message.
Discussion: https://postgr.es/m/CAPpHfdurV-j_e0pb%3DUFENAy3tyzxfF%2ByHveNDNQk2gM82WBU5A%40mail.gmail.com
Discussion: https://postgr.es/m/aHXLep3OaX_vRTNQ%40paquier.xyz
Author: Alexander Korotkov <akorotkov@postgresql.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Backpatch-through: 17
---
src/test/recovery/t/047_checkpoint_physical_slot.pl | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/src/test/recovery/t/047_checkpoint_physical_slot.pl b/src/test/recovery/t/047_checkpoint_physical_slot.pl
index a1332b5d44c..9e98383e30e 100644
--- a/src/test/recovery/t/047_checkpoint_physical_slot.pl
+++ b/src/test/recovery/t/047_checkpoint_physical_slot.pl
@@ -94,9 +94,11 @@ $node->safe_psql('postgres',
q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
);
-# Continue the checkpoint.
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
$node->safe_psql('postgres',
q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
my $restart_lsn_old = $node->safe_psql('postgres',
q{select restart_lsn from pg_replication_slots where slot_name = 'slot_physical'}
@@ -104,8 +106,7 @@ my $restart_lsn_old = $node->safe_psql('postgres',
chomp($restart_lsn_old);
note("restart lsn before stop: $restart_lsn_old");
-# Abruptly stop the server (1 second should be enough for the checkpoint
-# to finish; it would be better).
+# Abruptly stop the server.
$node->stop('immediate');
$node->start;
--
2.39.5 (Apple Git-154)
v5-0001-Fix-infinite-wait-when-reading-a-partially-writte.patch
From 629b7c23501465d8eac186f8f6a1f76f39e34780 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 17 Jul 2025 23:40:28 +0300
Subject: [PATCH v5 1/3] Fix infinite wait when reading a partially written WAL
record
If a crash occurs while writing a WAL record that spans multiple pages, the
recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD
flag. However, logical decoding currently attempts to read the full WAL
record based on its expected size before checking this flag, which can lead
to an infinite wait if the remaining data is never written (e.g., no activity
after crash).
This patch updates the logic to first read the page header and check for
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct
the full WAL record. If the flag is set, decoding correctly identifies
the record as incomplete and avoids waiting for WAL data that will never
arrive.
Discussion: https://postgr.es/m/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
---
src/backend/access/transam/xlogreader.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ac1f801b1eb..dcc8d4f9c1b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -723,11 +723,12 @@ restart:
/* Calculate pointer to beginning of next page */
targetPagePtr += XLOG_BLCKSZ;
- /* Wait for the next page to become available */
- readOff = ReadPageInternal(state, targetPagePtr,
- Min(total_len - gotlen + SizeOfXLogShortPHD,
- XLOG_BLCKSZ));
-
+ /*
+ * Read the page header before processing the record data, so we
+ * can handle the case where the previous record ended as being a
+ * partial one.
+ */
+ readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff == XLREAD_WOULDBLOCK)
return XLREAD_WOULDBLOCK;
else if (readOff < 0)
@@ -776,6 +777,15 @@ restart:
goto err;
}
+ /* Wait for the next page to become available */
+ readOff = ReadPageInternal(state, targetPagePtr,
+ Min(total_len - gotlen + SizeOfXLogShortPHD,
+ XLOG_BLCKSZ));
+ if (readOff == XLREAD_WOULDBLOCK)
+ return XLREAD_WOULDBLOCK;
+ else if (readOff < 0)
+ goto err;
+
/* Append the continuation from this page to the buffer */
pageHeaderSize = XLogPageHeaderSize(pageHeader);
--
2.39.5 (Apple Git-154)
v5-0003-Reintroduce-test-046_checkpoint_logical_slot.patch
From d5a4c098a4f7e7bf68b13805e554a15fe997ac3a Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 18 Jul 2025 00:05:22 +0300
Subject: [PATCH v5 3/3] Reintroduce test 046_checkpoint_logical_slot
This patch is only for HEAD and v18, where the test has been removed.
The improvements to the checkpoint completion wait, and the extra
pg_logical_emit_message() call to force the creation of a WAL record
that spans at least two pages, need to be backpatched down to v17 to
improve the stability and coverage of the original test.
Discussion: https://postgr.es/m/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com
Discussion: https://postgr.es/m/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com
Author: Alexander Korotkov <akorotkov@postgresql.org>
Co-authored-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
---
src/test/recovery/meson.build | 1 +
.../recovery/t/046_checkpoint_logical_slot.pl | 142 ++++++++++++++++++
2 files changed, 143 insertions(+)
create mode 100644 src/test/recovery/t/046_checkpoint_logical_slot.pl
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 6e78ff1a030..52993c32dbb 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -54,6 +54,7 @@ tests += {
't/043_no_contrecord_switch.pl',
't/044_invalidate_inactive_slots.pl',
't/045_archive_restartpoint.pl',
+ 't/046_checkpoint_logical_slot.pl',
't/047_checkpoint_physical_slot.pl',
't/048_vacuum_horizon_floor.pl'
],
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
new file mode 100644
index 00000000000..bdfca3ddf82
--- /dev/null
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -0,0 +1,142 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# This test verifies the case when the logical slot is advanced during
+# checkpoint. The test checks that the logical slot's restart_lsn still refers
+# to an existing WAL segment after an immediate restart.
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf', "wal_level = 'logical'");
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Create the two slots we'll need.
+$node->safe_psql('postgres',
+ q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')}
+);
+$node->safe_psql('postgres',
+ q{select pg_create_physical_replication_slot('slot_physical', true)});
+
+# Advance both slots to the current position just to have everything "valid".
+$node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)}
+);
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Run checkpoint to flush current state to disk and set a baseline.
+$node->safe_psql('postgres', q{checkpoint});
+
+# Generate some transactions to get RUNNING_XACTS.
+my $xacts = $node->background_psql('postgres');
+$xacts->query_until(
+ qr/run_xacts/,
+ q(\echo run_xacts
+SELECT 1 \watch 0.1
+\q
+));
+
+$node->advance_wal(20);
+
+# Run another checkpoint to set a new restore LSN.
+$node->safe_psql('postgres', q{checkpoint});
+
+$node->advance_wal(20);
+
+# Run another checkpoint, this time in the background, and make it wait
+# on the injection point, so that the checkpoint stops right before
+# removing old WAL segments.
+note('starting checkpoint');
+
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_safe(
+ q(select injection_points_attach('checkpoint-before-old-wal-removal','wait'))
+);
+$checkpoint->query_until(
+ qr/starting_checkpoint/,
+ q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint stops right before removing WAL segments.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal');
+note('injection_point is reached');
+
+# Try to advance the logical slot, but make it stop when it moves to the next
+# WAL segment (this has to happen in the background, too).
+my $logical = $node->background_psql('postgres');
+$logical->query_safe(
+ q{select injection_points_attach('logical-replication-slot-advance-segment','wait');}
+);
+$logical->query_until(
+ qr/get_changes/,
+ q(
+\echo get_changes
+select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1
+\q
+));
+
+# Wait until the slot's restart_lsn points to the next WAL segment.
+note('waiting for injection_point');
+$node->wait_for_event('client backend',
+ 'logical-replication-slot-advance-segment');
+note('injection_point is reached');
+
+# OK, we're in the right situation: time to advance the physical slot, which
+# recalculates the required LSN, and then unblock the checkpoint, which
+# removes the WAL still needed by the logical slot.
+$node->safe_psql('postgres',
+ q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Generate a long WAL record, spanning at least two pages for the follow-up
+# post-recovery check.
+$node->safe_psql('postgres',
+ q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
+
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
+$node->safe_psql('postgres',
+ q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+
+# Abruptly stop the server.
+$node->stop('immediate');
+
+$node->start;
+
+eval {
+ $node->safe_psql('postgres',
+ q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);}
+ );
+};
+is($@, '', "Logical slot still valid");
+
+done_testing();
--
2.39.5 (Apple Git-154)
Alexander Korotkov <aekorotkov@gmail.com> writes:
I went through the patchset. Everything looks good to me. I only did
some improvements to comments and commit messages. I'm going to push
this if no objections.
There's apparently something wrong in the v17 branch, as three
separate buildfarm members have now hit timeout failures in
046_checkpoint_logical_slot.pl [1][2][3]. I tried to reproduce
this locally, and didn't have much luck initially. However,
if I build with a configuration similar to grassquit's, it
will hang up maybe one time in ten:
export ASAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
export UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'
./configure ... usual flags plus ... CFLAGS='-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fsanitize=address -fno-sanitize-recover=all' --enable-injection-points
The fact that 046_checkpoint_logical_slot.pl is skipped in
non-injection-point builds is probably reducing the number
of buildfarm failures, since only a minority of animals
have that turned on yet.
I don't see anything obviously wrong in the test changes, and the
postmaster log from the failures looks pretty clearly like what is
hanging up is the pg_logical_slot_get_changes call:
2025-07-19 16:10:07.276 CEST [3458309][client backend][0/2:0] LOG: statement: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: starting logical decoding for slot "slot_logical"
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: Streaming transactions committing after 0/290000F8, reading WAL from 0/1540F40.
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: logical decoding found consistent point at 0/1540F40
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: There are no running transactions.
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:59:56.828 CEST [3458140][postmaster][:0] LOG: received immediate shutdown request
2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] LOG: could not send data to client: Broken pipe
2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:59:56.851 CEST [3458140][postmaster][:0] LOG: database system is shut down
So my impression is that the bug is not reliably fixed in 17.
One other interesting thing is that once it's hung, the test does
not stop after PG_TEST_TIMEOUT_DEFAULT elapses. You can see
above that olingo took nearly 50 minutes to give up, and in
manual testing it doesn't seem to stop either (though I've not
got the patience to wait 50 minutes...)
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2025-07-19%2014%3A07%3A23
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2025-07-19%2014%3A07%3A56
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2025-07-19%2016%3A29%3A32
On Sat, Jul 19, 2025 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
[...]

So my impression is that the bug is not reliably fixed in 17.
Thank you for pointing this out!
Apparently I've backpatched d3917d8f13e7 everywhere but not in
REL_17_STABLE. Will be fixed now.
------
Regards,
Alexander Korotkov
Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes:
On Sat, Jul 19, 2025 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
So my impression is that the bug is not reliably fixed in 17.
Thank you for pointing this out!
Apparently I've backpatched d3917d8f13e7 everywhere but not in
REL_17_STABLE. Will be fixed now.
Hah ... well, at least we know the test works!
regards, tom lane