Failure in test_slru for host gokiburi (REL_16_STABLE only)

Started by Michael Paquier, 2 months ago, 3 messages, hackers

michael@paquier.xyz

2 months ago

Hi all,

gokiburi has been failing on only REL_16_STABLE for the last few days,
for the tests of module test_slru. First failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gokiburi&dt=2026-05-13%2012%3A20%3A45

Set of changes associated with the first failure, which seem
completely innocent to me:
5f12d86dd76 Wed May 13 05:43:49 2026 UTC Add more tests for corrupted data with pglz_decompress()
d140237dab8 Wed May 13 02:46:17 2026 UTC Fix stale COPY progress during logical replication table sync

While the buildfarm runs don't show much, I have been able to
reproduce the failure on the buildfarm host when building with
-DEXEC_BACKEND. Here is a backtrace, showing that something is
broken with LWLock initialization:
2026-05-18 05:20:50.186 UTC client backend[870830] pg_regress/test_slru STATEMENT: SELECT test_slru_page_readonly(12377);
TRAP: failed Assert("LWLockHeldByMe(TestSLRULock)"), File: "test_slru.c", Line: 124, PID: 870830
postgres: popo contrib_regression [local] SELECT(ExceptionalCondition+0x16c) [0xaaaaabcf4d88]
/home/popo/lib/test_slru.so(test_slru_page_readonly+0xe4) [0xffffedf83060]
postgres: popo contrib_regression [local] SELECT(+0x885c40) [0xaaaaab325c40]
postgres: popo contrib_regression [local] SELECT(ExecInterpExprStillValid+0x84) [0xaaaaab329a4c]
postgres: popo contrib_regression [local] SELECT(+0x9405fc) [0xaaaaab3e05fc]
postgres: popo contrib_regression [local] SELECT(+0x9406d4) [0xaaaaab3e06d4]
postgres: popo contrib_regression [local] SELECT(+0x940b34) [0xaaaaab3e0b34]
postgres: popo contrib_regression [local] SELECT(+0x8b7ac0) [0xaaaaab357ac0]
postgres: popo contrib_regression [local] SELECT(+0x89de14) [0xaaaaab33de14]
postgres: popo contrib_regression [local] SELECT(+0x8a46c0) [0xaaaaab3446c0]
postgres: popo contrib_regression [local] SELECT(standard_ExecutorRun+0x2d0) [0xaaaaab33ec68]
postgres: popo contrib_regression [local] SELECT(ExecutorRun+0xb8) [0xaaaaab33e970]
postgres: popo contrib_regression [local] SELECT(+0xe550dc) [0xaaaaab8f50dc]
postgres: popo contrib_regression [local] SELECT(PortalRun+0x460) [0xaaaaab8f4958]
postgres: popo contrib_regression [local] SELECT(+0xe43150) [0xaaaaab8e3150]
postgres: popo contrib_regression [local] SELECT(PostgresMain+0x15e8) [0xaaaaab8f0560]
postgres: popo contrib_regression [local] SELECT(postmaster_forkexec+0x0) [0xaaaaab70f644]
postgres: popo contrib_regression [local] SELECT(SubPostmasterMain+0x6fc) [0xaaaaab7106d8]
postgres: popo contrib_regression [local] SELECT(main+0x6d0) [0xaaaaab463f6c]
/lib/aarch64-linux-gnu/libc.so.6(+0x2225c) [0xfffff725225c]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x9c) [0xfffff725233c]
postgres: popo contrib_regression [local] SELECT(_start+0x30) [0xaaaaaad3d4b0]

The server logs include the following, pointing to a broken state
(these two should not fail):
2026-05-18 05:20:50.184 UTC client backend[870830] pg_regress/test_slru ERROR: lock <unassigned:0> is not held
2026-05-18 05:20:50.184 UTC client backend[870830] pg_regress/test_slru STATEMENT: SELECT test_slru_page_write(12345, 'Test SLRU');

Note that the tests pass without -DEXEC_BACKEND.

While reading through the module, I think that the LWLock
initialization logic is borked, where we decide to do a
LWLockInitialize() more times than necessary, confusing the internal
states. Honestly, I have no clue why the test has suddenly been
failing, and why other buildfarm members don't complain. The host was
upgraded to the latest Debian a couple of days ago, but I also
had a few clean runs in the buildfarm before this began showing up.
What I do know is that the patch attached is able to make the tests of
the module pass for v16 on the problematic host with -DEXEC_BACKEND.
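
To illustrate the kind of problem suspected above, here is a minimal
standalone sketch of what can go wrong when a lock stored in shared
state is re-initialized on every process start, compared with
initializing it only once. This is not the attached patch and it does
not use the PostgreSQL LWLock API: DemoLock, DemoShmem and all the
demo_* functions are hypothetical names made up for this example, and
the "shared memory" is just a struct that survives across simulated
process starts.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a lock that lives in shared memory. */
typedef struct DemoLock
{
	int			tranche_id;		/* identifier assigned at initialization */
	int			holders;		/* number of processes holding the lock */
} DemoLock;

/* Hypothetical shared state that outlives any single process. */
typedef struct DemoShmem
{
	bool		initialized;		/* has the lock been set up already? */
	int			next_tranche_id;	/* counter used to hand out tranche IDs */
	DemoLock	lock;
} DemoShmem;

/* Allocate a new tranche ID and reset the lock to its pristine state. */
static void
demo_lock_init(DemoShmem *shmem)
{
	shmem->lock.tranche_id = shmem->next_tranche_id++;
	shmem->lock.holders = 0;
}

/* Startup path that initializes the lock on every process start. */
static void
demo_startup_unguarded(DemoShmem *shmem)
{
	demo_lock_init(shmem);
	shmem->initialized = true;
}

/* Startup path that initializes the lock only the first time. */
static void
demo_startup_guarded(DemoShmem *shmem)
{
	if (!shmem->initialized)
	{
		demo_lock_init(shmem);
		shmem->initialized = true;
	}
}

/*
 * Simulate nstarts process starts against the same shared state.  The
 * first process takes the lock right after its startup.  The state left
 * behind is returned in *result.
 */
static void
demo_run(bool guarded, int nstarts, DemoShmem *result)
{
	memset(result, 0, sizeof(*result));

	for (int i = 0; i < nstarts; i++)
	{
		if (guarded)
			demo_startup_guarded(result);
		else
			demo_startup_unguarded(result);

		if (i == 0)
			result->lock.holders = 1;	/* first process acquires the lock */
	}
}

/* Number of holders recorded in shared state after all starts. */
static int
demo_holders_after_starts(bool guarded, int nstarts)
{
	DemoShmem	shmem;

	demo_run(guarded, nstarts, &shmem);
	return shmem.lock.holders;
}

/* Number of tranche IDs consumed after all starts. */
static int
demo_tranche_ids_consumed(bool guarded, int nstarts)
{
	DemoShmem	shmem;

	demo_run(guarded, nstarts, &shmem);
	return shmem.next_tranche_id;
}
```

With three simulated starts, the unguarded variant wipes the holder
recorded by the first process and consumes three tranche IDs, while
the guarded variant keeps the holder and consumes a single ID. Whether
this matches the exact failure mode seen on gokiburi is an assumption;
the sketch only shows why repeated initialization can confuse lock
state.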

Comments or opinions?
--
Michael

Michael Paquier

michael@paquier.xyz

2 months ago

In reply to: Michael Paquier (#1)

Re: Failure in test_slru for host gokiburi (REL_16_STABLE only)

On Mon, May 18, 2026 at 08:41:45PM +0900, Michael Paquier wrote:

> While reading through the module, I think that the LWLock
> initialization logic is borked, where we decide to do a
> LWLockInitialize() more times than necessary, confusing the internal
> states. Honestly, I have no clue why the test has suddenly been
> failing, and why other buildfarm members don't complain. The host was
> upgraded to the latest Debian a couple of days ago, but I also
> had a few clean runs in the buildfarm before this began showing up.
> What I do know is that the patch attached is able to make the tests of
> the module pass for v16 on the problematic host with -DEXEC_BACKEND.

Hearing nothing, I have fixed that on REL_16_STABLE to bring back my
animal to green. The tranche ID allocation calls are wasteful in the
module for v17 and v18, but it does not really matter much for the
sake of the tests. HEAD is in a much better state by saving the same
names in shmem (pronounced as in "shmemme").
--
Michael

Michael Paquier

michael@paquier.xyz

2 months ago

In reply to: Michael Paquier (#2)

Re: Failure in test_slru for host gokiburi (REL_16_STABLE only)

On Tue, May 26, 2026 at 05:09:31PM +0900, Michael Paquier wrote:

> Hearing nothing, I have fixed that on REL_16_STABLE to bring back my
> animal to green. The tranche ID allocation calls are wasteful in the
> module for v17 and v18, but it does not really matter much for the
> sake of the tests. HEAD is in a much better state by saving the same
> names in shmem (pronounced as in "shmemme").

A short update here. The buildfarm client was still stuck and unable
to report on this branch, so I have done some manual cleanup and it
has finally been able to report back to green:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gokiburi&dt=2026-05-27%2000%3A05%3A03

It seemed like some on-disk state was interfering with the follow-up
updates, but well.
--
Michael
