failures in t/031_recovery_conflict.pl on CI

Started by Andres Freund · about 4 years ago · 38 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
interesting, because I ran it there dozens if not hundreds of times before
commit, with - I think - only cosmetic changes.

I've reproduced it in a private branch, with more logging. And the results are
sure interesting.

https://cirrus-ci.com/task/6448492666159104
https://api.cirrus-ci.com/v1/artifact/task/6448492666159104/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log

The primary is waiting for 0/343A000 to be applied, which requires a recovery
conflict to be detected and resolved. On the standby there's the following
sequence (some omitted):

prerequisite for recovery conflict:
2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: BEGIN;
2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
2022-04-09 04:05:31.293 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor;

detecting the conflict:
2022-04-09 04:05:31.382 UTC [35038][startup] LOG: recovery still waiting after 28.821 ms: recovery conflict on buffer pin
2022-04-09 04:05:31.382 UTC [35038][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0

and then nothing until the timeout:
2022-04-09 04:09:19.317 UTC [35035][postmaster] LOG: received immediate shutdown request
2022-04-09 04:09:19.317 UTC [35035][postmaster] DEBUG: sending signal 3 to process 35071
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG: server process (PID 35071) exited with exit code 2

Afaics that has to mean something is broken around sending, receiving or
processing of recovery conflict interrupts.

All the failures so far were on freebsd, from what I can see. There were other
failures in other tests, but I think those were for since-reverted or since-fixed things.

Except for not previously triggering while the shmstats patch was in
development, it's hard to tell whether it's a regression or just a
longstanding bug - we never had tests for recovery conflicts...

I don't really see how recovery prefetching could play a role here, clearly
we've been trying to replay the record. So we're elsewhere...

Greetings,

Andres Freund

[1] https://cirrus-ci.com/github/postgres/postgres/master

#2 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-04-08 21:55:15 -0700, Andres Freund wrote:

> on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
> interesting, because I ran it there dozens if not hundreds of times before
> commit, with - I think - only cosmetic changes.

Scratch that part - I found an instance of the freebsd failure earlier, just
didn't notice because that run failed for other reasons as well. So this might
just have uncovered an older bug around recovery conflict handling,
potentially platform dependent.

I guess I'll try to reproduce it on freebsd...

Greetings,

Andres Freund

#3 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#2)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-04-08 22:05:01 -0700, Andres Freund wrote:

> On 2022-04-08 21:55:15 -0700, Andres Freund wrote:
>
>> on CI [1] the new t/031_recovery_conflict.pl is failing occasionally. Which is
>> interesting, because I ran it there dozens if not hundreds of times before
>> commit, with - I think - only cosmetic changes.
>
> Scratch that part - I found an instance of the freebsd failure earlier, just
> didn't notice because that run failed for other reasons as well. So this might
> just have uncovered an older bug around recovery conflict handling,
> potentially platform dependent.
>
> I guess I'll try to reproduce it on freebsd...

That failed.

Adding a bunch of debug statements, I think I might have found at least some
problems.

The code in LockBufferForCleanup() that actually interacts with other backends
is:

	/* Publish the bufid that Startup process waits on */
	SetStartupBufferPinWaitBufId(buffer - 1);
	/* Set alarm and then wait to be signaled by UnpinBuffer() */
	ResolveRecoveryConflictWithBufferPin();
	/* Reset the published bufid */
	SetStartupBufferPinWaitBufId(-1);

where ResolveRecoveryConflictWithBufferPin() sends procsignals to all other
backends and then waits with ProcWaitForSignal():

void
ProcWaitForSignal(uint32 wait_event_info)
{
	(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
					 wait_event_info);
	ResetLatch(MyLatch);
	CHECK_FOR_INTERRUPTS();
}

One problem is that we pretty much immediately get a SIGALRM whenever we're in
that WaitLatch(). Which does a SetLatch(), interrupting the WaitLatch(). The
startup process then proceeds to SetStartupBufferPinWaitBufId(-1).

In the unlucky cases, the backend holding the pin only processes the interrupt
(in RecoveryConflictInterrupt()) after the
SetStartupBufferPinWaitBufId(-1). The backend then sees that
HoldingBufferPinThatDelaysRecovery() returns false, and happily continues.
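
To make the ordering concrete, the unlucky interleaving looks roughly like this
(a sketch of the sequence just described, not actual code):

```
startup process                           backend holding the pin
---------------                           -----------------------
SetStartupBufferPinWaitBufId(buf)
ResolveRecoveryConflictWithBufferPin()
  sends conflict procsignals to the
  other backends
  WaitLatch()   <- SIGALRM fires almost
     immediately; the handler does
     SetLatch(), so WaitLatch() returns
SetStartupBufferPinWaitBufId(-1)
                                          RecoveryConflictInterrupt() only runs now:
                                            HoldingBufferPinThatDelaysRecovery()
                                            returns false (bufid already reset),
                                            so the backend keeps its pin and no
                                            conflict handling happens
```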

But that's not the whole story, I think. It's a problem that leads to conflicts
being handled more slowly, but on its own it shouldn't cause us to keep timing out.

Here's debugging output from a failing run, where I added a few debugging statements:
https://api.cirrus-ci.com/v1/artifact/task/6179111512047616/log/src/test/recovery/tmp_check/log/000_0recovery_conflict_standby.log
https://github.com/anarazel/postgres/commit/212268e753093861aa22a51657c6598c65eeb81b

Curiously, there's only
20644: received interrupt 11

Which is PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, not
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN.

I guess we've never gotten past the standby timeout.

20644: received interrupt 11
2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: one cycle of LockBufferForCleanup() iterating in HS
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.824 UTC [20630][startup] DEBUG: setting timeout() in 0 10000
2022-04-09 21:18:23.824 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting latch()
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
2022-04-09 21:18:23.835 UTC [20630][startup] DEBUG: setting timeout() in 0 3481
2022-04-09 21:18:23.835 UTC [20630][startup] CONTEXT: WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0
20644: received interrupt 11
2022-04-09 21:23:47.975 UTC [20631][walreceiver] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly

So we sent a conflict interrupt, and then waited. And nothing happened.

What are we expecting to wake the startup process up, once it does
SendRecoveryConflictWithBufferPin()?

It's likely not the problem here, because we never seem to have even reached
that path, but afaics once we've called disable_all_timeouts() at the bottom
of ResolveRecoveryConflictWithBufferPin() and then re-entered
ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind,
so clear a path as quickly as possible." path, there's no guarantee for any
timeout to be pending anymore?

If there's either no backend that we're still conflicting with (an entirely
possible race condition), or if there's e.g. a snapshot or database conflict,
there's afaics nobody setting the startup process's latch.

Greetings,

Andres Freund

#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#3)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-04-09 15:00:54 -0700, Andres Freund wrote:

> What are we expecting to wake the startup process up, once it does
> SendRecoveryConflictWithBufferPin()?
>
> It's likely not the problem here, because we never seem to have even reached
> that path, but afaics once we've called disable_all_timeouts() at the bottom
> of ResolveRecoveryConflictWithBufferPin() and then re-entered
> ResolveRecoveryConflictWithBufferPin(), and go into the "We're already behind,
> so clear a path as quickly as possible." path, there's no guarantee for any
> timeout to be pending anymore?
>
> If there's either no backend that we're still conflicting with (an entirely
> possible race condition), or if there's e.g. a snapshot or database conflict,
> there's afaics nobody setting the startup process's latch.

It's not that (although I still suspect it's a problem). It's a self-deadlock,
because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin()
*explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does
CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls,
if the deadlock timeout is reached.

To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and boom:

(gdb) bt
#0 __futex_abstimed_wait_common64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0, clockid=clockid@entry=0,
abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=true) at ../sysdeps/nptl/futex-internal.c:87
#1 0x00007fd4ce5a215b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fd4cb001138, expected=expected@entry=0,
clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ../sysdeps/nptl/futex-internal.c:123
#2 0x00007fd4ce59e44f in do_futex_wait (sem=sem@entry=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#3 0x00007fd4ce59e4e8 in __new_sem_wait_slow64 (sem=0x7fd4cb001138, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#4 0x00007fd4cf20d823 in PGSemaphoreLock (sema=0x7fd4cb001138) at pg_sema.c:327
#5 0x00007fd4cf2dd675 in LWLockAcquire (lock=0x7fd4cb001600, mode=LW_EXCLUSIVE) at /home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1324
#6 0x00007fd4cf2c36e7 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, conflictPending=false)
at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3638
#7 0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)
at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#8 0x00007fd4cf2cc69d in StandbyTimeoutHandler () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:911
#9 0x00007fd4cf4e68d7 in handle_sig_alarm (postgres_signal_arg=14) at /home/andres/src/postgresql/src/backend/utils/misc/timeout.c:421
#10 <signal handler called>
#11 0x00007fd4cddfffc4 in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffc6e5561c0) at ../sysdeps/unix/sysv/linux/select.c:71
#12 0x00007fd4cf52ea2a in pg_usleep (microsec=10000) at /home/andres/src/postgresql/src/port/pgsleep.c:56
#13 0x00007fd4cf2c36f1 in CancelDBBackends (databaseid=0, sigmode=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, conflictPending=false)
at /home/andres/src/postgresql/src/backend/storage/ipc/procarray.c:3640
#14 0x00007fd4cf2cc579 in SendRecoveryConflictWithBufferPin (reason=PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK)
at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:846
#15 0x00007fd4cf2cc50f in ResolveRecoveryConflictWithBufferPin () at /home/andres/src/postgresql/src/backend/storage/ipc/standby.c:820
#16 0x00007fd4cf2a996f in LockBufferForCleanup (buffer=43) at /home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:4336
#17 0x00007fd4ceec911c in XLogReadBufferForRedoExtended (record=0x7fd4d106a618, block_id=0 '\000', mode=RBM_NORMAL, get_cleanup_lock=true, buf=0x7ffc6e556394)
at /home/andres/src/postgresql/src/backend/access/transam/xlogutils.c:427
#18 0x00007fd4cee1aa41 in heap_xlog_prune (record=0x7fd4d106a618) at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:8634

it's reproducible on linux.

I'm lacking words I dare to put in an email to describe how bad an idea it is
to call CancelDBBackends() from within a timeout function, particularly when
the function enabling the timeout also calls that function itself. Before
disabling timers.

I ...

Greetings,

Andres Freund

#5 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-04-09 16:10:02 -0700, Andres Freund wrote:

> It's not that (although I still suspect it's a problem). It's a self-deadlock,
> because StandbyTimeoutHandler(), which ResolveRecoveryConflictWithBufferPin()
> *explicitly enables*, calls SendRecoveryConflictWithBufferPin(). Which does
> CancelDBBackends(). Which ResolveRecoveryConflictWithBufferPin() also calls,
> if the deadlock timeout is reached.
>
> To make it easier to hit, I put a pg_usleep(10000) in CancelDBBackends(), and boom:
>
> [... backtrace ... ]
>
> it's reproducible on linux.
>
> I'm lacking words I dare to put in an email to describe how bad an idea it is
> to call CancelDBBackends() from within a timeout function, particularly when
> the function enabling the timeout also calls that function itself. Before
> disabling timers.

It's been broken in different ways all the way back to 9.0, from what I can
see, but I didn't check every single version.

Afaics the fix is to nuke the idea of doing anything substantial in the signal
handler from orbit, and instead just set a flag in the handler. Then check
whether the timeout happened after ProcWaitForSignal() and call
SendRecoveryConflictWithBufferPin().

Also worth noting that the disable_all_timeouts() calls appear to break
STARTUP_PROGRESS_TIMEOUT.

Greetings,

Andres Freund

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#5)
Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund <andres@anarazel.de> writes:

> It's been broken in different ways all the way back to 9.0, from what I can
> see, but I didn't check every single version.
>
> Afaics the fix is to nuke the idea of doing anything substantial in the signal
> handler from orbit, and instead just set a flag in the handler.

+1. This is probably more feasible given the latch infrastructure
than it was when that code was first written.

regards, tom lane

#7 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-04-09 19:34:26 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>
>> It's been broken in different ways all the way back to 9.0, from what I can
>> see, but I didn't check every single version.
>
>> Afaics the fix is to nuke the idea of doing anything substantial in the signal
>> handler from orbit, and instead just set a flag in the handler.
>
> +1. This is probably more feasible given the latch infrastructure
> than it was when that code was first written.

What do you think about just reordering the disable_all_timeouts() to be
before the got_standby_deadlock_timeout check in the back branches? I think
that should close at least the most obvious hole. And fix it properly in
HEAD?

Greetings,

Andres Freund

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund <andres@anarazel.de> writes:

> On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
>
>> +1. This is probably more feasible given the latch infrastructure
>> than it was when that code was first written.
>
> What do you think about just reordering the disable_all_timeouts() to be
> before the got_standby_deadlock_timeout check in the back branches? I think
> that should close at least the most obvious hole. And fix it properly in
> HEAD?

I don't have much faith in that, and I don't see why we can't fix it
properly. Don't we just need to have the signal handler set MyLatch,
and then do the unsafe stuff back in the "if (got_standby_deadlock_timeout)"
stanza in mainline?

regards, tom lane

#9 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#8)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-04-12 15:05:22 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>
>> On 2022-04-09 19:34:26 -0400, Tom Lane wrote:
>
>>> +1. This is probably more feasible given the latch infrastructure
>>> than it was when that code was first written.
>
>> What do you think about just reordering the disable_all_timeouts() to be
>> before the got_standby_deadlock_timeout check in the back branches? I think
>> that should close at least the most obvious hole. And fix it properly in
>> HEAD?
>
> I don't have much faith in that, and I don't see why we can't fix it
> properly.

It's not too hard, agreed.

It's somewhat hard to test that it works though. The new recovery conflict
tests cover all types of recovery conflicts except deadlock ones, because those
aren't easy to trigger... But I think I nearly got it working reliably.

It's probably worth backpatching the tests, after stripping them of the stats
checks?

Three questions:

- For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
replay progress reporting. Is there a reason to keep them in the
backbranches? Hard to see how an extension or something could rely on it,
but ...?

- I named the variable set by the signal handler got_standby_delay_timeout,
because just got_standby_timeout looked weird besides the _deadlock_. Which
makes me think that we should rename STANDBY_TIMEOUT to
STANDBY_DELAY_TIMEOUT too?

- There's the following comment in ResolveRecoveryConflictWithBufferPin():

"We assume that only UnpinBuffer() and the timeout requests established
above can wake us up here."

That's bogus afaict? There's plenty of other things that can cause MyProc->latch
to be set. Is it worth doing something about this at the same time? Right
now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
succession initially.

Greetings,

Andres Freund

#10 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#9)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

Attached are patches for this issue.

It adds a test case for deadlock conflicts to make sure that case isn't
broken. I also tested the recovery conflict tests in the back branches, and
they work there with a reasonably small set of changes.

Questions:
- I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
preceding numbers are unused. It seems way more problematic to use a
different number in the backbranches than to have gaps?

- The test uses pump_until() and wait_for_log(), which don't exist in the
backbranches. For now I've just inlined the implementation, but I guess we
could also backpatch their introduction?

- There's a few incompatibilities in the test with older branches:
- older branches don't have allow_in_place_tablespaces - I think just
skipping tablespace conflicts is fine, they're comparatively
simple.

Eventually it might make sense to backpatch allow_in_place_tablespaces,
our test coverage in the area is quite poor.

- the stats tests can't easily be made reliable in the backbranches - which is
ok, as the conflict itself is verified via the log

- some branches don't have log_recovery_conflict_waits; since it's not
critical to the test, it's ok to just not include it there

I played with the idea of handling the differences using version comparisons
in the code, and having the test be identical across branches. Since it's
something we don't do so far, I'm leaning against it, but ...

> - For HEAD we have to replace the disable_all_timeouts() calls, it breaks the
> replay progress reporting. Is there a reason to keep them in the
> backbranches? Hard to see how an extension or something could rely on it,
> but ...?

I've left it as is for now, will start a separate thread.

> - There's the following comment in ResolveRecoveryConflictWithBufferPin():
>
> "We assume that only UnpinBuffer() and the timeout requests established
> above can wake us up here."
>
> That's bogus afaict? There's plenty of other things that can cause MyProc->latch
> to be set. Is it worth doing something about this at the same time? Right
> now we seem to call ResolveRecoveryConflictWithBufferPin() in rapid
> succession initially.

The comment is more recent than I had realized. I raised this separately in
/messages/by-id/20220429191815.xewxjlpmq7mxhsr2@alap3.anarazel.de

pgindent uses some crazy formatting nearby:
	SendRecoveryConflictWithBufferPin(
		PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I'm tempted to clean that up in passing by having just one
SendRecoveryConflictWithBufferPin() call instead of two, storing the type of
conflict in a local variable? Doesn't look entirely pretty either, but ...

I'm very doubtful of this claim above ResolveRecoveryConflictWithBufferPin(),
btw. But that'd be a non-backpatchable cleanup, I think:
* The ProcWaitForSignal() sleep normally done in LockBufferForCleanup()
* (when not InHotStandby) is performed here, for code clarity.

Greetings,

Andres Freund

Attachments:

vHEAD-0001-Fix-possibility-of-self-deadlock-in-ResolveRec.patch (text/x-diff; charset=us-ascii, +10 -11)
vHEAD-0002-Add-tests-for-recovery-deadlock-conflicts.patch (text/x-diff; charset=us-ascii, +79 -8)
vREL_10_STABLE-0001-Fix-possibility-of-self-deadlock-in-R.patch (text/x-diff; charset=us-ascii, +12 -11)
vREL_10_STABLE-0002-Backpatch-031_recovery_conflict.pl.patch (text/x-diff; charset=us-ascii, +417 -1)
#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#10)
Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund <andres@anarazel.de> writes:

> Questions:
> - I'm planning to backpatch the test as 031_recovery_conflict.pl, even though
> preceding numbers are unused. It seems way more problematic to use a
> different number in the backbranches than to have gaps?

+1

> - The test uses pump_until() and wait_for_log(), which don't exist in the
> backbranches. For now I've just inlined the implementation, but I guess we
> could also backpatch their introduction?

I'd backpatch --- seems unlikely this will be the last need for 'em.

> pgindent uses some crazy formatting nearby:
> 	SendRecoveryConflictWithBufferPin(
> 		PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);

I do not believe that that line break is pgindent's fault.
If you just fold it into one line it should stay that way.

regards, tom lane

#12 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#11)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-04-29 19:26:59 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>
>> - The test uses pump_until() and wait_for_log(), which don't exist in the
>> backbranches. For now I've just inlined the implementation, but I guess we
>> could also backpatch their introduction?
>
> I'd backpatch --- seems unlikely this will be the last need for 'em.

Done.

I ended up committing the extension of the test first, before the fix. I think
that's the cause of the failures on longfin and serinus. Let's hope the
situation improves with the now also committed (and backpatched) fix.

>> pgindent uses some crazy formatting nearby:
>> 	SendRecoveryConflictWithBufferPin(
>> 		PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
>
> I do not believe that that line break is pgindent's fault.

Oh - I'm fairly certain I've seen pgindent do that in the past. But you're
right, it's not. Perhaps it was an older version of pgindent?

Greetings,

Andres Freund

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund <andres@anarazel.de> writes:

> I ended up committing the extension of the test first, before the fix. I think
> that's the cause of the failures on longfin and serinus. Let's hope the
> situation improves with the now also committed (and backpatched) fix.

longfin's definitely not very happy: four out of six tries have ended with

psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
LINE 1: SELECT * FROM test_recovery_conflict_table2;
^
DETAIL: User was holding shared buffer pin for too long.
timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 358.

I can poke into that tomorrow, but are you sure that that isn't an
expectable result?

regards, tom lane

#14 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-05-02 23:44:32 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>
>> I ended up committing the extension of the test first, before the fix. I think
>> that's the cause of the failures on longfin and serinus. Let's hope the
>> situation improves with the now also committed (and backpatched) fix.
>
> longfin's definitely not very happy: four out of six tries have ended with

Too bad :(

> psql:<stdin>:8: ERROR: canceling statement due to conflict with recovery
> LINE 1: SELECT * FROM test_recovery_conflict_table2;
> ^
> DETAIL: User was holding shared buffer pin for too long.
> timed out waiting for match: (?^:User transaction caused buffer deadlock with recovery.) at t/031_recovery_conflict.pl line 358.
>
> I can poke into that tomorrow, but are you sure that that isn't an
> expectable result?

It's not expected. But I think I might see what the problem is:

$psql_standby{stdin} .= qq[
	BEGIN;
	-- hold pin
	DECLARE $cursor1 CURSOR FOR SELECT a FROM $table1;
	FETCH FORWARD FROM $cursor1;
	-- wait for lock held by prepared transaction
	SELECT * FROM $table2;
	];
ok( pump_until(
		$psql_standby{run}, $psql_timeout,
		\$psql_standby{stdout}, qr/^1$/m,),
	"$sect: cursor holding conflicting pin, also waiting for lock, established"
);

We wait for the FETCH (and thus the buffer pin to be acquired). But that
doesn't guarantee that the lock has been acquired. We can't check that with
pump_until() afaics, because there'll not be any output. But a query_until()
checking pg_locks should do the trick?

Greetings,

Andres Freund

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#14)
Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund <andres@anarazel.de> writes:

> On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
>
>> I can poke into that tomorrow, but are you sure that that isn't an
>> expectable result?
>
> It's not expected. But I think I might see what the problem is:
> We wait for the FETCH (and thus the buffer pin to be acquired). But that
> doesn't guarantee that the lock has been acquired. We can't check that with
> pump_until() afaics, because there'll not be any output. But a query_until()
> checking pg_locks should do the trick?

Irritatingly, it doesn't reproduce (at least not easily) in a manual
build on the same box. So it's almost surely a timing issue, and
your theory here seems plausible.

regards, tom lane

#16 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#12)
Re: failures in t/031_recovery_conflict.pl on CI

On 2022-May-02, Andres Freund wrote:

>>> pgindent uses some crazy formatting nearby:
>>> 	SendRecoveryConflictWithBufferPin(
>>> 		PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
>
>> I do not believe that that line break is pgindent's fault.
>
> Oh - I'm fairly certain I've seen pgindent do that in the past. But you're
> right, it's not. Perhaps it was an older version of pgindent?

No, it's never done that. We used to fold lines that way manually,
because pgindent used to push the argument to the left, but that changed
at commit 382ceffdf7f6.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"La grandeza es una experiencia transitoria. Nunca es consistente.
Depende en gran parte de la imaginación humana creadora de mitos"
(Irulan)

#17 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#15)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-05-03 01:16:46 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>
>> On 2022-05-02 23:44:32 -0400, Tom Lane wrote:
>
>>> I can poke into that tomorrow, but are you sure that that isn't an
>>> expectable result?
>
>> It's not expected. But I think I might see what the problem is:
>> We wait for the FETCH (and thus the buffer pin to be acquired). But that
>> doesn't guarantee that the lock has been acquired. We can't check that with
>> pump_until() afaics, because there'll not be any output. But a query_until()
>> checking pg_locks should do the trick?
>
> Irritatingly, it doesn't reproduce (at least not easily) in a manual
> build on the same box.

Odd, given how readily it seems to reproduce on the bf. I assume you built with

Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

> So it's almost surely a timing issue, and your theory here seems plausible.

Unfortunately I don't think my theory holds, because I actually had added a
defense against this into the test that I forgot about momentarily...

# just to make sure we're waiting for lock already
ok( $node_standby->poll_query_until(
'postgres', qq[
SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
], 'waiting'),
"$sect: lock acquisition is waiting");

and on longfin that step completes successfully.

I think what happens is that we get a buffer pin conflict, because these days
we can actually process buffer pin conflicts while waiting for a lock. The
easiest way to get around that is to increase the replay timeout for that
test, I think?

I think we need a restart, not a reload, because reloads aren't guaranteed to
be processed at any certain point in time :/.

Testing a fix in a variety of timing circumstances now...

Greetings,

Andres Freund

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#17)
Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund <andres@anarazel.de> writes:

> On 2022-05-03 01:16:46 -0400, Tom Lane wrote:
>
>> Irritatingly, it doesn't reproduce (at least not easily) in a manual
>> build on the same box.
>
> Odd, given how readily it seems to reproduce on the bf. I assume you built with
>
> Uses -fsanitize=alignment -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

Yeah, I copied all that stuff ...

>> So it's almost surely a timing issue, and your theory here seems plausible.
>
> Unfortunately I don't think my theory holds, because I actually had added a
> defense against this into the test that I forgot about momentarily...

Oh, hm. I can try harder to repro it.

regards, tom lane

#19 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
Re: failures in t/031_recovery_conflict.pl on CI

Hi,

On 2022-05-03 14:23:23 -0400, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
>
>>> So it's almost surely a timing issue, and your theory here seems plausible.
>
>> Unfortunately I don't think my theory holds, because I actually had added a
>> defense against this into the test that I forgot about momentarily...
>
> Oh, hm. I can try harder to repro it.

I've now reproduced it a couple times here running under rr, so it's probably
not worth putting too much effort into that.

Attached is a fix for the test that I think should avoid the problem. Couldn't
repro it with it applied, under both rr and valgrind.

My current problem is that I'm running into some IPC::Run issues (bug?). I get
"ack Broken pipe:" iff I add "SELECT pg_sleep(1);" after
"-- wait for lock held by prepared transaction"

It doesn't happen without that debugging thing, but it makes me worried that
it's something that'll come up in random situations.

It looks to me like it's a bug in IPC::Run - with a few changes I get the
failure to happen inside pump_nb(), which seems like it shouldn't error out
just because the child process exited...

I *think* it might not happen without the sleep. But I'm not at all confident.

In general I'm kinda worried about how much effectively unmaintained perl stuff
we're depending on :(

Greetings,

Andres Freund

Attachments:

0001-Fix-timing-issue-in-deadlock-recovery-conflict-test.patch (text/x-diff; charset=us-ascii, +11 -1)
#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#19)
Re: failures in t/031_recovery_conflict.pl on CI

Andres Freund <andres@anarazel.de> writes:

> Attached is a fix for the test that I think should avoid the problem. Couldn't
> repro it with it applied, under both rr and valgrind.

May I ask where we're at on this? Next week's back-branch release is
getting uncomfortably close, and I'm still seeing various buildfarm
animals erratically failing on 031_recovery_conflict.pl. Should we
just remove that test from the back branches for now?

Also, it appears that the back-patch of pump_until failed to remove
some pre-existing copies, eg check-world in v14 now reports

Subroutine pump_until redefined at t/013_crash_restart.pl line 248.
Subroutine pump_until redefined at t/022_crash_temp_files.pl line 272.

I didn't check whether these are exact duplicates.

regards, tom lane

#21 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#20)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#21)
#23 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#23)
#25 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#24)
#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#25)
#27 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#26)
#28 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#27)
#29 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#25)
#30 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#29)
#31 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#30)
#32 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#31)
#33 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#32)
#34 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#33)
#35 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#32)
#36 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#35)
#37 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#36)
#38 Aleksei Fakeev
a.fakeev@postgrespro.ru
In reply to: Andres Freund (#19)