[PATCH] Fix fragile walreceiver test.
The recovery/004_timeline_switch test has been failing for me on
Windows. The test is wrong.
The test does this:
$node_standby_2->restart;
# ... timeline switch happens ...
ok( !$node_standby_2->log_contains(
        "FATAL: .* terminating walreceiver process due to administrator command"
    ),
    'WAL receiver should not be stopped across timeline jumps');
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the log and
finds it.
The test has a comment claiming that "a new log file is used on node
restart". That is not the case: TAP tests start the server via pg_ctl
with a fixed log filename that is reused across restarts, and no log
rotation happens.
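The log-reuse behavior can be illustrated with a small Python sketch. This is hypothetical stand-in code, not part of the patch: it mimics what log_contains() effectively does (a whole-file regex search) against a log file that was appended to across a restart.

```python
# Hypothetical illustration (not PostgreSQL code): a log file reused
# across restarts keeps messages from the previous run, so a whole-file
# search still matches the expected shutdown message from restart().
import os
import re
import tempfile

fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

with open(log_path, "a") as log:
    # First run: expected shutdown written when restart() stops the server.
    log.write("FATAL:  terminating walreceiver process"
              " due to administrator command\n")
    # Second run, appended to the same file: walreceiver working normally.
    log.write("LOG:  started streaming WAL from primary"
              " at 0/03000000 on timeline 1\n")

# The same pattern the test passes to log_contains().
pattern = (r"FATAL: .* terminating walreceiver process"
           r" due to administrator command")
with open(log_path) as f:
    found = bool(re.search(pattern, f.read()))

print(found)  # True: the old message trips the "should not be stopped" check
os.unlink(log_path)
```

Because the FATAL from the pre-restart walreceiver is still in the file, the test's negated log check fails even though the post-restart walreceiver behaved correctly.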
I added logging to confirm what's actually happening. The walreceiver
works correctly - same PID handles both timelines:
2025-11-04 23:05:28.539 CST walreceiver[83824] LOG: started
streaming WAL from primary at 0/03000000 on timeline 1
2025-11-04 23:05:28.543 CST startup[42764] LOG: new target timeline
is 2
2025-11-04 23:05:28.544 CST walreceiver[83824] LOG: restarted WAL
streaming at 0/03000000 on timeline 2
That's PID 83824 throughout. Works fine.
Earlier in the same log, from the restart:
2025-11-04 23:05:27.261 CST walreceiver[52440] FATAL: terminating
walreceiver process due to administrator command
Different PID (52440), expected shutdown. This is what the test finds.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.
Patch attached.
--
Bryan Green
EDB: https://www.enterprisedb.com
Attachments:
0001-Fix-timing-dependent-failure-in-recovery-004_timelin.patch (text/plain)
From ef4b59760a910fbfd3ea88011307c3af188c2b42 Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Tue, 4 Nov 2025 23:41:29 -0600
Subject: [PATCH] Fix timing-dependent failure in recovery/004_timeline_switch.
The test verified walreceiver survival by searching logs for termination
messages. However, it called restart() before the timeline switch, which
kills walreceiver and logs the exact message being checked. TAP tests
reuse the same log file across restarts, so log_contains() finds that
message and the test fails.
Except sometimes it didn't fail, because the message wasn't flushed yet.
Recent changes to file handle inheritance altered I/O timing enough to
make this fail consistently while testing another patch.
Replace the log check with a PID comparison, which actually tests what
we care about.
---
src/test/recovery/t/004_timeline_switch.pl | 28 ++++++++++++++++------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 13874ff866..21e94a1de3 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -56,6 +56,16 @@ primary_conninfo='$connstr_1'
));
$node_standby_2->restart;
+# Wait for walreceiver to reconnect after the restart. The restart
+# will have killed the old walreceiver, which is expected. We want to
+# verify that after reconnection, the walreceiver stays alive during
+# the timeline switch.
+$node_standby_2->poll_query_until('postgres',
+ "SELECT EXISTS(SELECT 1 FROM pg_stat_wal_receiver)");
+my $wr_pid_before_switch = $node_standby_2->safe_psql('postgres',
+ "SELECT pid FROM pg_stat_wal_receiver");
+
+
# Insert some data in standby 1 and check its presence in standby 2
# to ensure that the timeline switch has been done.
$node_standby_1->safe_psql('postgres',
@@ -66,13 +76,17 @@ my $result =
$node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
is($result, qq(2000), 'check content of standby 2');
-# Check the logs, WAL receiver should not have been stopped while
-# transitioning to its new timeline. There is no need to rely on an
-# offset in this check of the server logs: a new log file is used on
-# node restart when primary_conninfo is updated above.
-ok( !$node_standby_2->log_contains(
- "FATAL: .* terminating walreceiver process due to administrator command"
- ),
+
+# Verify that the walreceiver process stayed alive across the timeline
+# switch. We recorded the PID after reconnection and now check that
+# it's the same after completing the timeline switch. This is more
+# reliable than checking log messages, which may include expected
+# shutdown messages from the restart.
+my $wr_pid_after_switch = $node_standby_2->safe_psql('postgres',
+ "SELECT pid FROM pg_stat_wal_receiver");
+
+is( $wr_pid_before_switch,
+ $wr_pid_after_switch,
'WAL receiver should not be stopped across timeline jumps');
# Ensure that a standby is able to follow a primary on a newer timeline
--
2.46.0.windows.1
On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the log and
finds it.
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
The test has a comment claiming "a new log file is used on node
restart". TAP tests use pg_ctl with a fixed filename that gets reused
across restarts. No log rotation.
I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.
Fun. And the WAL receiver never stops after the restart of standby_2
with the log entry present in the server logs generated before the
restart, right?
[1]: /messages/by-id/aQGfoKGgmAbPATp5@paquier.xyz -- Michael
--
Michael
Hi,
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the log and
finds it.
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
The test has a comment claiming "a new log file is used on node
restart". TAP tests use pg_ctl with a fixed filename that gets reused
across restarts. No log rotation.
I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
Thanks for catching and analyzing this!
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.Fun. And the WAL receiver never stops after the restart of standby_2
with the log entry present in the server logs generated before the
restart, right?
Best,
Xuneng
Attachments:
v2-0001-Fix-timing-issue-in-recovery-004_timeline_switch-.patch (application/octet-stream)
From 2503e5392198b11782fae85d269c048ab2ccfbbd Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 5 Nov 2025 15:03:46 +0800
Subject: [PATCH v2] Fix timing issue in recovery/004_timeline_switch test
The test searched the entire log for walreceiver termination messages,
but restart() kills walreceiver and logs that exact message. TAP tests
reuse log files, so the check would find the expected shutdown from
restart() and fail intermittently based on flush timing.
Fix by rotating to a fresh log file before restart(), so only messages
from the timeline switch itself are checked.
---
src/test/recovery/t/004_timeline_switch.pl | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 13874ff866f..7a855d68a5d 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -54,6 +54,12 @@ $node_standby_2->append_conf(
'postgresql.conf', qq(
primary_conninfo='$connstr_1'
));
+
+# Rotate to a fresh log file before restart. The restart will kill
+# walreceiver (logging a FATAL to the old file), but the new server
+# will log to the new file. This way we can check only the new file
+# for issues during the timeline switch.
+$node_standby_2->rotate_logfile();
$node_standby_2->restart;
# Insert some data in standby 1 and check its presence in standby 2
@@ -66,10 +72,8 @@ my $result =
$node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
is($result, qq(2000), 'check content of standby 2');
-# Check the logs, WAL receiver should not have been stopped while
-# transitioning to its new timeline. There is no need to rely on an
-# offset in this check of the server logs: a new log file is used on
-# node restart when primary_conninfo is updated above.
+# Check the new log file. WAL receiver should not have been stopped
+# while transitioning to its new timeline.
ok( !$node_standby_2->log_contains(
"FATAL: .* terminating walreceiver process due to administrator command"
),
--
2.51.0
On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
And one log rotation should be enough before the restart.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
That's exactly what I have done a couple of minutes ago, and noticed
your message before applying the fix, so I've listed you as a
co-author on this one.
I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases it should be able to do its work correctly.
--
Michael
Hi,
On Wed, Nov 5, 2025 at 3:56 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
And one log rotation should be enough before the restart.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
That's exactly what I have done a couple of minutes ago, and noticed
your message before applying the fix, so I've listed you as a
co-author on this one.
Thanks.
I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases it should be able to do its work correctly.
Checking the PID seems straightforward and makes sense to me mostly.
Best,
Xuneng
On 11/5/2025 1:55 AM, Michael Paquier wrote:
On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
Perfect timing on that drongo failure - validates exactly what I was
seeing with the O_CLOEXEC changes.
And one log rotation should be enough before the restart.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
That's exactly what I have done a couple of minutes ago, and noticed
your message before applying the fix, so I've listed you as a
co-author on this one.
Thanks for the quick turnaround and for including both Xuneng and me.
I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases it should be able to do its work correctly.
--
Michael
The rotate_logfile() fixes the fundamental issue (log file reuse), while
the PID check adds extra verification for the common case. Belt and
braces approach makes sense here.
Your point about slow machines is well taken - I hadn't considered the
window between restart and PID capture on heavily loaded systems.
This clears the path for my O_CLOEXEC work without worrying about
spurious test failures. Appreciate it.
On Wed, Nov 05, 2025 at 07:38:23AM -0600, Bryan Green wrote:
Thanks for the quick turnaround and for including both Xuneng and me.
I owe you one here for noticing the issue quickly. The buildfarm
took some time to reproduce the failure and we have a minor release
coming next week. It would be annoying to release something with a
known instability, at least for me.
--
Michael
On Thu, Nov 06, 2025 at 07:58:55AM +0900, Michael Paquier wrote:
I owe you one here for noticing the issue quickly. The buildfarm
took some time to reproduce the failure and we have a minor release
coming next week. It would be annoying to release something with a
known instability, at least for me.
drongo has gone back to green, so we should be good. See:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=3&member=drongo&filter=Submit
--
Michael