004_timeline_switch TAP test may fail

Started by Sergey Tatarintsev8 days ago5 messageshackers

s.tatarintsev@postgrespro.ru

8 days ago

Hi hackers!

I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we
got little omission
in TAP test 004_timeline_switch:
...
my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
...
$node_primary->stop;

There is no guarantee that standby_1 and standby_2 was successfully
connected to primary and start
streaming before primary stopped.

I think we must ensure that primary knows about standby_1 and standby_2

--
With best regards,
Sergey Tatarintsev,
PostgresPro

Ewan Young

kdbase.hack@gmail.com

8 days ago

In reply to: Sergey Tatarintsev (#1)

Re: 004_timeline_switch TAP test may fail

Hi Sergey,

Thanks for the report and patch. I think the analysis is right, and the
fix is in the right place.

The gap traces back to commit 7185eddf, which deliberately dropped the
wait_for_catchup() and switched the primary from teardown_node() to a
clean stop(), on the grounds that a clean stop flushes all WAL to both
standbys before exiting. That's true, but only for standbys whose
walsender is *connected* at shutdown time -- and ->start() only waits
for the postmaster to accept connections, not for the standby's
walreceiver to have connected back to the primary. So if a standby
hasn't connected yet when the primary stops, the clean-shutdown flush
skips it, and we're back to the exact "standbys received different
amounts of WAL -> timeline fork on reconnect" failure that 7185eddf was
meant to fix.

Polling pg_stat_replication until both walsenders are present closes
that hole: it re-establishes the precondition the clean-stop design
silently assumed. And connection is enough here -- the walsender
shutdown path sends all WAL up to the shutdown checkpoint regardless of
catchup state -- so there's no need to additionally check
state = 'streaming'.

One small thing: the rest of this file uses count(*), so I'd write count(*) = 2
rather than count(1) = 2 just for local consistency. And the comment reads a
little better as something like "Wait until both standbys have
connected to the primary",
since by this point they've already started -- what we're waiting for is the
connection.

Regards,
Ewan

On Tue, Jun 16, 2026 at 4:01 PM Sergey Tatarintsev
<s.tatarintsev@postgrespro.ru> wrote:

Show quoted text

Hi hackers!

I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we
got little omission
in TAP test 004_timeline_switch:
...
my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
...
$node_primary->stop;

There is no guarantee that standby_1 and standby_2 was successfully
connected to primary and start
streaming before primary stopped.

I think we must ensure that primary knows about standby_1 and standby_2

--
With best regards,
Sergey Tatarintsev,
PostgresPro

Michael Paquier

michael@paquier.xyz

7 days ago

In reply to: Sergey Tatarintsev (#1)

Re: 004_timeline_switch TAP test may fail

On Tue, Jun 16, 2026 at 03:01:15PM +0700, Sergey Tatarintsev wrote:

I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we got
little omission
in TAP test 004_timeline_switch:
...
my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
...
$node_primary->stop;

7185eddf0522 rings a bell.

There is no guarantee that standby_1 and standby_2 was successfully
connected to primary and start
streaming before primary stopped.

Indeed. I assume that adding a conditional sleep that prevents the
startup process of standby1 or standby2 to connect to their primary
once they have reached a consistent state, before they are able to
replay the inserts of tab_int and before the primary is stopped would
be enough to make the test go rogue, with one or more standbys not
getting the records we want.

If standby2 gets ahead of standby1, we would fail the initial
poll_query_until() done after standby2 attempts to reconnect standby1,
failing the test on timeout. If standby1 gets ahead of standby2,
things would work; there is a wait step for standby2 to catch up with
standby1. So only the first pattern is problematic, not the second.

It does not seem like the buildfarm has complained on this one
(failures in latest 30 days for recoveryCheck report 026 and 035),
neither does the CI:
https://cfbot.cputube.org/highlights/all.html
--
Michael

Michael Paquier

michael@paquier.xyz

7 days ago

In reply to: Ewan Young (#2)

Re: 004_timeline_switch TAP test may fail

On Tue, Jun 16, 2026 at 06:51:48PM +0800, Ewan Young wrote:

One small thing: the rest of this file uses count(*), so I'd write count(*) = 2
rather than count(1) = 2 just for local consistency. And the comment reads a
little better as something like "Wait until both standbys have
connected to the primary",
since by this point they've already started -- what we're waiting for is the
connection.

Both queries work the same for this purpose, so any of them is fine.
The comment needs to be refined a bit, though.
--
Michael

Sergey Tatarintsev

s.tatarintsev@postgrespro.ru

7 days ago

In reply to: Ewan Young (#2)

Re: 004_timeline_switch TAP test may fail

Hi!
Thanks for review!
v2 patch attached
comment changed, count(1) replaced with count(*).

Hi Sergey,

Thanks for the report and patch. I think the analysis is right, and the
fix is in the right place.

The gap traces back to commit 7185eddf, which deliberately dropped the
wait_for_catchup() and switched the primary from teardown_node() to a
clean stop(), on the grounds that a clean stop flushes all WAL to both
standbys before exiting. That's true, but only for standbys whose
walsender is *connected* at shutdown time -- and ->start() only waits
for the postmaster to accept connections, not for the standby's
walreceiver to have connected back to the primary. So if a standby
hasn't connected yet when the primary stops, the clean-shutdown flush
skips it, and we're back to the exact "standbys received different
amounts of WAL -> timeline fork on reconnect" failure that 7185eddf was
meant to fix.

Polling pg_stat_replication until both walsenders are present closes
that hole: it re-establishes the precondition the clean-stop design
silently assumed. And connection is enough here -- the walsender
shutdown path sends all WAL up to the shutdown checkpoint regardless of
catchup state -- so there's no need to additionally check
state = 'streaming'.

One small thing: the rest of this file uses count(*), so I'd write count(*) = 2
rather than count(1) = 2 just for local consistency. And the comment reads a
little better as something like "Wait until both standbys have
connected to the primary",
since by this point they've already started -- what we're waiting for is the
connection.

Regards,
Ewan

On Tue, Jun 16, 2026 at 4:01 PM Sergey Tatarintsev
<s.tatarintsev@postgrespro.ru> wrote:

Hi hackers!

I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we
got little omission
in TAP test 004_timeline_switch:
...
my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
...
$node_primary->stop;

There is no guarantee that standby_1 and standby_2 was successfully
connected to primary and start
streaming before primary stopped.

I think we must ensure that primary knows about standby_1 and standby_2

--
With best regards,
Sergey Tatarintsev,
PostgresPro

--
With best regards,
Sergey Tatarintsev,
PostgresPro

004_timeline_switch TAP test may fail

Attachments:

Attachments: