[PATCH] Fix fragile walreceiver test.
The recovery/004_timeline_switch test has been failing for me on
Windows. The test is wrong.
The test does this:
$node_standby_2->restart;
# ... timeline switch happens ...
ok( !$node_standby_2->log_contains(
        "FATAL: .* terminating walreceiver process due to administrator command"
    ),
    'WAL receiver should not be stopped across timeline jumps');
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the log and
finds it.
The test has a comment claiming that "a new log file is used on node
restart". That is not the case: TAP tests start the server via pg_ctl
with a fixed log filename that is reused across restarts, and no log
rotation happens.
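The log-reuse behavior can be illustrated with a small Python sketch. This is hypothetical stand-in code, not part of the patch: it mimics what log_contains() effectively does (a whole-file regex search) against a log file that was appended to across a restart.

```python
# Hypothetical illustration (not PostgreSQL code): a log file reused
# across restarts keeps messages from the previous run, so a whole-file
# search still matches the expected shutdown message from restart().
import os
import re
import tempfile

fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

with open(log_path, "a") as log:
    # First run: expected shutdown written when restart() stops the server.
    log.write("FATAL:  terminating walreceiver process"
              " due to administrator command\n")
    # Second run, appended to the same file: walreceiver working normally.
    log.write("LOG:  started streaming WAL from primary"
              " at 0/03000000 on timeline 1\n")

# The same pattern the test passes to log_contains().
pattern = (r"FATAL: .* terminating walreceiver process"
           r" due to administrator command")
with open(log_path) as f:
    found = bool(re.search(pattern, f.read()))

print(found)  # True: the old message trips the "should not be stopped" check
os.unlink(log_path)
```

Because the FATAL from the pre-restart walreceiver is still in the file, the test's negated log check fails even though the post-restart walreceiver behaved correctly.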
I added logging to confirm what's actually happening. The walreceiver
works correctly - same PID handles both timelines:
2025-11-04 23:05:28.539 CST walreceiver[83824] LOG: started
streaming WAL from primary at 0/03000000 on timeline 1
2025-11-04 23:05:28.543 CST startup[42764] LOG: new target timeline
is 2
2025-11-04 23:05:28.544 CST walreceiver[83824] LOG: restarted WAL
streaming at 0/03000000 on timeline 2
That's PID 83824 throughout. Works fine.
Earlier in the same log, from the restart:
2025-11-04 23:05:27.261 CST walreceiver[52440] FATAL: terminating
walreceiver process due to administrator command
Different PID (52440), expected shutdown. This is what the test finds.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.
Patch attached.
--
Bryan Green
EDB: https://www.enterprisedb.com
Attachments:
0001-Fix-timing-dependent-failure-in-recovery-004_timelin.patch (text/plain)
From ef4b59760a910fbfd3ea88011307c3af188c2b42 Mon Sep 17 00:00:00 2001
From: Bryan Green <dbryan.green@gmail.com>
Date: Tue, 4 Nov 2025 23:41:29 -0600
Subject: [PATCH] Fix timing-dependent failure in recovery/004_timeline_switch.
The test verified walreceiver survival by searching logs for termination
messages. However, it called restart() before the timeline switch, which
kills walreceiver and logs the exact message being checked. TAP tests
reuse the same log file across restarts, so log_contains() finds that
message and the test fails.
Except sometimes it didn't fail, because the message wasn't flushed yet.
Recent changes to file handle inheritance altered I/O timing enough to
make this fail consistently while testing another patch.
Replace the log check with a PID comparison, which actually tests what
we care about.
---
src/test/recovery/t/004_timeline_switch.pl | 28 ++++++++++++++++------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 13874ff866..21e94a1de3 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -56,6 +56,16 @@ primary_conninfo='$connstr_1'
));
$node_standby_2->restart;
+# Wait for walreceiver to reconnect after the restart. The restart
+# will have killed the old walreceiver, which is expected. We want to
+# verify that after reconnection, the walreceiver stays alive during
+# the timeline switch.
+$node_standby_2->poll_query_until('postgres',
+ "SELECT EXISTS(SELECT 1 FROM pg_stat_wal_receiver)");
+my $wr_pid_before_switch = $node_standby_2->safe_psql('postgres',
+ "SELECT pid FROM pg_stat_wal_receiver");
+
+
# Insert some data in standby 1 and check its presence in standby 2
# to ensure that the timeline switch has been done.
$node_standby_1->safe_psql('postgres',
@@ -66,13 +76,17 @@ my $result =
$node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
is($result, qq(2000), 'check content of standby 2');
-# Check the logs, WAL receiver should not have been stopped while
-# transitioning to its new timeline. There is no need to rely on an
-# offset in this check of the server logs: a new log file is used on
-# node restart when primary_conninfo is updated above.
-ok( !$node_standby_2->log_contains(
- "FATAL: .* terminating walreceiver process due to administrator command"
- ),
+
+# Verify that the walreceiver process stayed alive across the timeline
+# switch. We recorded the PID after reconnection and now check that
+# it's the same after completing the timeline switch. This is more
+# reliable than checking log messages, which may include expected
+# shutdown messages from the restart.
+my $wr_pid_after_switch = $node_standby_2->safe_psql('postgres',
+ "SELECT pid FROM pg_stat_wal_receiver");
+
+is( $wr_pid_before_switch,
+ $wr_pid_after_switch,
'WAL receiver should not be stopped across timeline jumps');
# Ensure that a standby is able to follow a primary on a newer timeline
--
2.46.0.windows.1
On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the log and
finds it.
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
The test has a comment claiming "a new log file is used on node
restart". TAP tests use pg_ctl with a fixed filename that gets reused
across restarts. No log rotation.
I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.
Fun. And the WAL receiver never stops after the restart of standby_2
with the log entry present in the server logs generated before the
restart, right?
[1]: /messages/by-id/aQGfoKGgmAbPATp5@paquier.xyz -- Michael
--
Michael
Hi,
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Nov 05, 2025 at 12:03:29AM -0600, Bryan Green wrote:
Problem: restart() kills the walreceiver (as it should), which writes
that exact FATAL message to the log. The test then searches the log and
finds it.
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
The test has a comment claiming "a new log file is used on node
restart". TAP tests use pg_ctl with a fixed filename that gets reused
across restarts. No log rotation.
I've fat-fingered this assumption, indeed, missing that one would need
to do an extra rotate_logfile() before the restart.
The fix is obvious: check that the walreceiver PID stays constant.
That's what we actually care about anyway.
Thanks for catching and analyzing this!
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
This matters because changes to I/O behavior elsewhere in the code can
make this test fail spuriously. I hit it while working on O_CLOEXEC
handling for Windows.Fun. And the WAL receiver never stops after the restart of standby_2
with the log entry present in the server logs generated before the
restart, right?
Best,
Xuneng
Attachments:
v2-0001-Fix-timing-issue-in-recovery-004_timeline_switch-.patch (application/octet-stream)
From 2503e5392198b11782fae85d269c048ab2ccfbbd Mon Sep 17 00:00:00 2001
From: alterego655 <824662526@qq.com>
Date: Wed, 5 Nov 2025 15:03:46 +0800
Subject: [PATCH v2] Fix timing issue in recovery/004_timeline_switch test
The test searched the entire log for walreceiver termination messages,
but restart() kills walreceiver and logs that exact message. TAP tests
reuse log files, so the check would find the expected shutdown from
restart() and fail intermittently based on flush timing.
Fix by rotating to a fresh log file before restart(), so only messages
from the timeline switch itself are checked.
---
src/test/recovery/t/004_timeline_switch.pl | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 13874ff866f..7a855d68a5d 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -54,6 +54,12 @@ $node_standby_2->append_conf(
'postgresql.conf', qq(
primary_conninfo='$connstr_1'
));
+
+# Rotate to a fresh log file before restart. The restart will kill
+# walreceiver (logging a FATAL to the old file), but the new server
+# will log to the new file. This way we can check only the new file
+# for issues during the timeline switch.
+$node_standby_2->rotate_logfile();
$node_standby_2->restart;
# Insert some data in standby 1 and check its presence in standby 2
@@ -66,10 +72,8 @@ my $result =
$node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
is($result, qq(2000), 'check content of standby 2');
-# Check the logs, WAL receiver should not have been stopped while
-# transitioning to its new timeline. There is no need to rely on an
-# offset in this check of the server logs: a new log file is used on
-# node restart when primary_conninfo is updated above.
+# Check the new log file. WAL receiver should not have been stopped
+# while transitioning to its new timeline.
ok( !$node_standby_2->log_contains(
"FATAL: .* terminating walreceiver process due to administrator command"
),
--
2.51.0
On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
And one log rotation should be enough before the restart.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
That's exactly what I have done a couple of minutes ago, and noticed
your message before applying the fix, so I've listed you as a
co-author on this one.
I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases it should be able to do its work correctly.
--
Michael
Hi,
On Wed, Nov 5, 2025 at 3:56 PM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
And one log rotation should be enough before the restart.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
That's exactly what I have done a couple of minutes ago, and noticed
your message before applying the fix, so I've listed you as a
co-author on this one.
Thanks.
I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases it should be able to do its work correctly.
Checking the PID seems straightforward and makes sense to me mostly.
Best,
Xuneng
On 11/5/2025 1:55 AM, Michael Paquier wrote:
On Wed, Nov 05, 2025 at 03:30:30PM +0800, Xuneng Zhou wrote:
On Wed, Nov 5, 2025 at 2:50 PM Michael Paquier <michael@paquier.xyz> wrote:
Timing issue then, the buildfarm has not been complaining on this one
AFAIK, there have been no recoveryCheck failures reported:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl
drongo has just reported one failure, so I stand corrected:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-11-05%2003%3A50%3A50
Perfect timing on that drongo failure - validates exactly what I was
seeing with the O_CLOEXEC changes.
And one log rotation should be enough before the restart.
Hmm. The reason why I didn't use a PID matching check (mentioned at
[1]) is that this is not entirely bullet-proof. On a very slow
machine, one could assume that standby_1 generates some records and
that these are replayed by standby_2 *before* the PID of the WAL
receiver is retrieved. This could lead to false positives in some
cases, and a bunch of buildfarm members are very slow. You have a
point that these would unlikely happen in normal runs, so a PID
matching check would be relevant most of the time anyway, even if the
original PID has been fetched after the TLI jump has been processed in
standby_2. I'd rather keep the log check, TBH, bypassing it with an
extra rotate_logfile() before the restart of standby_2.
I’ve also prepared a patch for this method.
That's exactly what I have done a couple of minutes ago, and noticed
your message before applying the fix, so I've listed you as a
co-author on this one.
Thanks for the quick turnaround and for including both Xuneng and me.
I have also kept the PID check after pondering a bit about it. A TLI
jump could be replayed before we grab the initial PID, but in most
cases it should be able to do its work correctly.
--
Michael
The rotate_logfile() fixes the fundamental issue (log file reuse), while
the PID check adds extra verification for the common case. Belt and
braces approach makes sense here.
Your point about slow machines is well taken - I hadn't considered the
window between restart and PID capture on heavily loaded systems.
This clears the path for my O_CLOEXEC work without worrying about
spurious test failures. Appreciate it.
On Wed, Nov 05, 2025 at 07:38:23AM -0600, Bryan Green wrote:
Thanks for the quick turnaround and for including both Xuneng and me.
I owe you one here for noticing the issue quickly. The buildfarm
took some time to reproduce the failure and we have a minor release
coming next week. It would be annoying to release something with a
known instability, at least for me.
--
Michael
On Thu, Nov 06, 2025 at 07:58:55AM +0900, Michael Paquier wrote:
I owe you one here for noticing the issue quickly. The buildfarm
took some time to reproduce the failure and we have a minor release
coming next week. It would be annoying to release something with a
known instability, at least for me.
drongo has gone back to green, so we should be good. See:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=3&member=drongo&filter=Submit
--
Michael