Isn't wait_for_catchup slightly broken?

Started by Tom Laneover 4 years ago5 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

While trying to make sense of some recent buildfarm failures,
I happened to notice that the default query issued by
the TAP sub wait_for_catchup looks like

SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = '<whatever>';

ISTM there are two things wrong with this:

1. Since pg_current_wal_lsn() is re-evaluated each time, we're
effectively setting a moving target for the standby to reach.
Admittedly we're not going to be issuing any new DML while
waiting in wait_for_catchup, but background activity such as
autovacuum could be creating new WAL. Thus, the test is likely
to wait longer than it needs to. In the worst case, we'd never
catch up until the primary server has been totally quiescent
for awhile.

2. Aside from being slower than necessary, this also makes the
test squishy and formally incorrect, because the standby might
get the opportunity to replay more WAL than the test intends.

So I think we need to fix it to capture the target WAL position
at the start, as I've done in the attached patch. In principle
this might make things a bit slower because of the extra
transaction required, but I don't notice any above-the-noise
difference on my own workstation.

Another thing that is bothering me a bit is that a number of the
callers use $node->lsn('insert') as the target. This also seems
rather dubious, because that could be ahead of what's been written
out. These callers are just taking it on faith that something will
eventually cause that extra WAL to get written out (and become
available to the standby). Again, that seems to make the test
slower than it need be, with a worst-case scenario being that it
eventually times out. Admittedly this is unlikely to be a big
problem unless some background op issues an abortive transaction
at just the wrong time. Nonetheless, I wonder if we shouldn't
standardize on "thou shalt use the write position", because I
don't think the other alternatives have anything to recommend them.
I've not addressed that below, though I did tweak the comment about
that parameter.

Thoughts?

regards, tom lane

Attachments:

fix-wait-for-catchup-default-behavior.patchtext/x-diff; charset=us-ascii; name=fix-wait-for-catchup-default-behavior.patchDownload+8-11
#2Julien Rouhaud
rjuju123@gmail.com
In reply to: Tom Lane (#1)
Re: Isn't wait_for_catchup slightly broken?

On Mon, Jan 10, 2022 at 02:31:38PM -0500, Tom Lane wrote:

So I think we need to fix it to capture the target WAL position
at the start, as I've done in the attached patch.

+1, it looks sensible to me.

In principle
this might make things a bit slower because of the extra
transaction required, but I don't notice any above-the-noise
difference on my own workstation.

I'm wondering if the environments where this extra transaction could make
a noticeable difference are also environments where doing that extra
transaction can save some iteration(s), which would be at least as costly.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Isn't wait_for_catchup slightly broken?

I wrote:

Another thing that is bothering me a bit is that a number of the
callers use $node->lsn('insert') as the target. This also seems
rather dubious, because that could be ahead of what's been written
out. These callers are just taking it on faith that something will
eventually cause that extra WAL to get written out (and become
available to the standby). Again, that seems to make the test
slower than it need be, with a worst-case scenario being that it
eventually times out. Admittedly this is unlikely to be a big
problem unless some background op issues an abortive transaction
at just the wrong time. Nonetheless, I wonder if we shouldn't
standardize on "thou shalt use the write position", because I
don't think the other alternatives have anything to recommend them.

Here's a version that makes sure that callers specify a write position not
an insert position. I also simplified the callers wherever it turned
out that they could just use the default parameters.

regards, tom lane

Attachments:

clean-up-wait_for_catchup-usage.patchtext/x-diff; charset=us-ascii; name=clean-up-wait_for_catchup-usage.patchDownload+49-75
#4Julien Rouhaud
rjuju123@gmail.com
In reply to: Tom Lane (#3)
Re: Isn't wait_for_catchup slightly broken?

Hi,

On Sat, Jan 15, 2022 at 05:58:02PM -0500, Tom Lane wrote:

Here's a version that makes sure that callers specify a write position not
an insert position. I also simplified the callers wherever it turned
out that they could just use the default parameters.

LGTM, and passes make check-world on my machine.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Julien Rouhaud (#4)
Re: Isn't wait_for_catchup slightly broken?

Julien Rouhaud <rjuju123@gmail.com> writes:

On Sat, Jan 15, 2022 at 05:58:02PM -0500, Tom Lane wrote:

Here's a version that makes sure that callers specify a write position not
an insert position. I also simplified the callers wherever it turned
out that they could just use the default parameters.

LGTM, and passes make check-world on my machine.

Pushed, thanks for reviewing.

regards, tom lane