Snapshot related assert failure on skink
Hi,
I just saw a BF failure on skink (valgrind) that asserts out.
Check the 002_compare_backups failure in:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-04%2017%3A35%3A01
TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
postgres: pitr2: bf postgres [local] SELECT(ExceptionalCondition+0x5f) [0x6e45e0]
postgres: pitr2: bf postgres [local] SELECT(+0x467c3d) [0x56fc3d]
postgres: pitr2: bf postgres [local] SELECT(GetSnapshotData+0x60) [0x570fbd]
postgres: pitr2: bf postgres [local] SELECT(GetNonHistoricCatalogSnapshot+0x53) [0x7285c5]
postgres: pitr2: bf postgres [local] SELECT(GetCatalogSnapshot+0x28) [0x72969b]
postgres: pitr2: bf postgres [local] SELECT(systable_beginscan+0xc9) [0x252602]
postgres: pitr2: bf postgres [local] SELECT(+0x5c1e56) [0x6c9e56]
postgres: pitr2: bf postgres [local] SELECT(+0x5c20e9) [0x6ca0e9]
postgres: pitr2: bf postgres [local] SELECT(SearchCatCache+0x18) [0x6ca71c]
postgres: pitr2: bf postgres [local] SELECT(SearchSysCache+0x21) [0x6deeae]
postgres: pitr2: bf postgres [local] SELECT(GetSysCacheOid+0x23) [0x6df219]
postgres: pitr2: bf postgres [local] SELECT(get_namespace_oid+0x2f) [0x2dbae0]
postgres: pitr2: bf postgres [local] SELECT(+0x1d3c13) [0x2dbc13]
postgres: pitr2: bf postgres [local] SELECT(+0x1d3e1b) [0x2dbe1b]
postgres: pitr2: bf postgres [local] SELECT(+0x1d3e72) [0x2dbe72]
postgres: pitr2: bf postgres [local] SELECT(FuncnameGetCandidates+0xaf) [0x2ddef1]
postgres: pitr2: bf postgres [local] SELECT(func_get_detail+0x96) [0x4bf521]
postgres: pitr2: bf postgres [local] SELECT(ParseFuncOrColumn+0x670) [0x4c13b9]
postgres: pitr2: bf postgres [local] SELECT(+0x3b1123) [0x4b9123]
postgres: pitr2: bf postgres [local] SELECT(+0x3b12ec) [0x4b92ec]
postgres: pitr2: bf postgres [local] SELECT(+0x3b2c60) [0x4bac60]
postgres: pitr2: bf postgres [local] SELECT(+0x3b12dc) [0x4b92dc]
postgres: pitr2: bf postgres [local] SELECT(transformExpr+0x20) [0x4b8faa]
postgres: pitr2: bf postgres [local] SELECT(transformTargetEntry+0x7c) [0x4cf4f9]
postgres: pitr2: bf postgres [local] SELECT(transformTargetList+0x96) [0x4cf5a6]
postgres: pitr2: bf postgres [local] SELECT(+0x39a5b9) [0x4a25b9]
postgres: pitr2: bf postgres [local] SELECT(transformStmt+0x141) [0x4a304b]
postgres: pitr2: bf postgres [local] SELECT(+0x39c390) [0x4a4390]
postgres: pitr2: bf postgres [local] SELECT(transformTopLevelStmt+0x19) [0x4a43cc]
postgres: pitr2: bf postgres [local] SELECT(parse_analyze_fixedparams+0x56) [0x4a4424]
postgres: pitr2: bf postgres [local] SELECT(pg_analyze_and_rewrite_fixedparams+0x40) [0x59b021]
postgres: pitr2: bf postgres [local] SELECT(+0x4937f6) [0x59b7f6]
postgres: pitr2: bf postgres [local] SELECT(PostgresMain+0x940) [0x59db54]
postgres: pitr2: bf postgres [local] SELECT(BackendMain+0x59) [0x596b38]
postgres: pitr2: bf postgres [local] SELECT(postmaster_child_launch+0xeb) [0x4ee282]
postgres: pitr2: bf postgres [local] SELECT(+0x3e9271) [0x4f1271]
postgres: pitr2: bf postgres [local] SELECT(+0x3eadb6) [0x4f2db6]
postgres: pitr2: bf postgres [local] SELECT(PostmasterMain+0x118b) [0x4f43b2]
postgres: pitr2: bf postgres [local] SELECT(main+0x1e2) [0x421da2]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x5a27ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x5a27d65]
postgres: pitr2: bf postgres [local] SELECT(_start+0x21) [0x1eb421]
2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] LOG: client backend (PID 3115649) was terminated by signal 6: Aborted
2025-03-04 19:11:37.941 UTC [3104796][postmaster][:0] DETAIL: Failed process was running: SELECT NOT pg_is_in_recovery();
2025-03-04 19:11:37.943 UTC [3104796][postmaster][:0] LOG: terminating any other active server processes
The code directly in question hasn't changed in a few years. I haven't seen
this crash before. The only recent-ish changes in the wider area are:
commit 952365cded6
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-12-23 12:42:39 +0200
Remove unnecessary GetTransactionSnapshot() calls
commit 578a7fe7b6f8484f6d7caa2fda288abb3fe87aa0
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2024-12-21 23:42:39 +0200
Update TransactionXmin when MyProc->xmin is updated
but neither is immediately yelling "here" to me.
I don't really see how pg_combinebackup could lead to this issue either.
One potentially interesting bit is that the statement is running *during* a
shutdown checkpoint:
2025-03-04 19:11:37.377 UTC [3106505][checkpointer][:0] LOG: checkpoint starting: end-of-recovery immediate wait
2025-03-04 19:11:37.454 UTC [3115649][client backend][2/2:0] LOG: statement: SELECT NOT pg_is_in_recovery();
TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
...
2025-03-04 19:11:37.585 UTC [3106505][checkpointer][:0] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.010 s, total=0.198 s; sync files=2, longest=0.006 s, average=0.004 s; distance=16384 kB, estimate=16384 kB; lsn=0/5000060, redo lsn=0/5000060
I wonder if the problem is related to a transaction first using a snapshot
started in recovery and then a "normal" snapshot?
Greetings,
Andres Freund
On 3/4/25 23:25, Andres Freund wrote:
Hi,
I just saw a BF failure on skink (valgrind) that asserts out.
Check the 002_compare_backups failure in:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-04%2017%3A35%3A01
TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)"), File: "../pgsql/src/backend/storage/ipc/procarray.c", Line: 2132, PID: 3115649
...
One potentially interesting bit is that the statement is running *during* a
shutdown checkpoint:
...
I wonder if the problem is related to a transaction first using a snapshot
started in recovery and then a "normal" snapshot?
Hi,
Alexander Lakhin pointed me to this thread after I posted about hitting
the same assert [1].
[1]: /messages/by-id/3aee3535-184d-423e-a5eb-05161e8b5617@vondra.me
It very much seems like the same issue - the backtrace seems to match
exactly, except for the query executed, of course. The exact mechanism may
be a bit different - in my stress test (restarting the primary/standby in
different ways, while running pgbench on both instances, etc.) there's
no end-of-recovery checkpoint, for example.
In my tests, 100% of the failures hit the assert immediately after these
messages in the server log (all failures are on the standby):
LOG: database system is ready to accept read-only connections
LOG: started streaming WAL from primary at 2/86000000 on timeline 1
And by "immediately" I mean it's the very next thing in the log. It
seems something happens (to the RecentXmin?) when opening the standby
for read-only queries that breaks an existing snapshot ... I'm not sure
if the end-of-recovery checkpoint does something similar.
I'm still fiddling with the script, trying to increase the probability
of the (apparent) race condition. On one machine (old Xeon) I can hit it
very easily/reliably, while on a different machine (new Ryzen) it's very
rare. I don't know if that's due to difference in speed of the CPU, or
fewer cores, ... I guess it changes the timing just enough.
I've also tried running the stress test on PG17, and I'm yet to see a
single failure there. Not even on the xeon machine, that hits it
reliably on 18. So this seems to be a PG18-only issue.
If needed, I can try adding more logging, or test a patch.
regards
--
Tomas Vondra
On 3/17/25 12:36, Tomas Vondra wrote:
...
I'm still fiddling with the script, trying to increase the probability
of the (apparent) race condition. On one machine (old Xeon) I can hit it
very easily/reliably, while on a different machine (new Ryzen) it's very
rare. I don't know if that's due to difference in speed of the CPU, or
fewer cores, ... I guess it changes the timing just enough.
I've also tried running the stress test on PG17, and I'm yet to see a
single failure there. Not even on the xeon machine, that hits it
reliably on 18. So this seems to be a PG18-only issue.
And of course, the moment I sent this, I got a failure on 17 too. But
it seems much harder to hit (compared to 18).
regards
--
Tomas Vondra
On Tue, Mar 18, 2025 at 12:59 AM Tomas Vondra <tomas@vondra.me> wrote:
On 3/17/25 12:36, Tomas Vondra wrote:
I'm still fiddling with the script, trying to increase the probability
of the (apparent) race condition. On one machine (old Xeon) I can hit it
very easily/reliably, while on a different machine (new Ryzen) it's very
rare. I don't know if that's due to difference in speed of the CPU, or
fewer cores, ... I guess it changes the timing just enough.
I've also tried running the stress test on PG17, and I'm yet to see a
single failure there. Not even on the xeon machine, that hits it
reliably on 18. So this seems to be a PG18-only issue.
And of course, the moment I sent this, I got a failure on 17 too. But
it seems much harder to hit (compared to 18).
Could there be a connection to this commit?
commit 119c23eb9819213551cbe7e7665c8b493c59ceee
Author: Nathan Bossart <nathan@postgresql.org>
Date: Tue Sep 5 13:59:06 2023 -0700
Replace known_assigned_xids_lck with memory barriers.
On 3/17/25 13:18, Thomas Munro wrote:
On Tue, Mar 18, 2025 at 12:59 AM Tomas Vondra <tomas@vondra.me> wrote:
...
And of course, the moment I sent this, I got a failure on 17 too. But
it seems much harder to hit (compared to 18).
Could there be a connection to this commit?
commit 119c23eb9819213551cbe7e7665c8b493c59ceee
Author: Nathan Bossart <nathan@postgresql.org>
Date: Tue Sep 5 13:59:06 2023 -0700
Replace known_assigned_xids_lck with memory barriers.
Doesn't seem to be the case. I reverted this (on master), and I still
get the assert failures (roughly the same number / loop).
regards
--
Tomas Vondra
I kept stress-testing this, and while the frequency massively increased
on PG18, I managed to reproduce this all the way back to PG14. I see
~100x more corefiles on PG18.
That is not proof the issue was introduced in PG14 - maybe it's just
the assert that was added there or something. Or maybe there's another
bug in PG18, making the impact worse.
But I'd suspect this is a bug in
commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
Author: Andres Freund <andres@anarazel.de>
Date: Mon Aug 17 21:07:10 2020 -0700
snapshot scalability: cache snapshots using a xact completion counter.
Previous commits made it faster/more scalable to compute snapshots. But not
building a snapshot is still faster. Now that GetSnapshotData() does not
maintain RecentGlobal* anymore, that is actually not too hard:
...
which is the main snapshot-related change in PG14. I added some logging
to all places changing TransactionXmin and RecentXmin in procarray.c,
and all the failures look like this:
------------------------
1742348688.896 2025-03-19 02:44:48.896 CET [2001770] [client
backend:[unknown]] [67da2190.1e8b6a:1] [2025-03-19 02:44:48 CET]
[13/1/0] LOG: GetSnapshotData TransactionXmin = 209485
1742348688.896 2025-03-19 02:44:48.896 CET [2001770] [client
backend:[unknown]] [67da2190.1e8b6a:2] [2025-03-19 02:44:48 CET]
[13/1/0] LOG: GetSnapshotData RecentXmin = 209485
1742348688.902 2025-03-19 02:44:48.902 CET [2001770] [client
backend:[unknown]] [67da2190.1e8b6a:3] [2025-03-19 02:44:48 CET]
[13/1/0] LOG: GetSnapshotDataReuse TransactionXmin = 209485
1742348688.902 2025-03-19 02:44:48.902 CET [2001770] [client
backend:[unknown]] [67da2190.1e8b6a:4] [2025-03-19 02:44:48 CET]
[13/1/0] LOG: GetSnapshotDataReuse RecentXmin = 209485
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client
backend:pgbench] [67da2190.1e8b6a:5] [2025-03-19 02:44:48 CET] [13/2/0]
LOG: GetSnapshotData TransactionXmin = 209508
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client
backend:pgbench] [67da2190.1e8b6a:6] [2025-03-19 02:44:48 CET] [13/2/0]
STATEMENT: SELECT abalance FROM pgbench_accounts WHERE aid = 987279;
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client
backend:pgbench] [67da2190.1e8b6a:7] [2025-03-19 02:44:48 CET] [13/2/0]
LOG: GetSnapshotData RecentXmin = 209508
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client
backend:pgbench] [67da2190.1e8b6a:8] [2025-03-19 02:44:48 CET] [13/2/0]
STATEMENT: SELECT abalance FROM pgbench_accounts WHERE aid = 987279;
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client
backend:pgbench] [67da2190.1e8b6a:9] [2025-03-19 02:44:48 CET] [13/2/0]
LOG: GetSnapshotDataReuse RecentXmin = 209485 at character 22
1742348688.922 2025-03-19 02:44:48.922 CET [2001770] [client
backend:pgbench] [67da2190.1e8b6a:10] [2025-03-19 02:44:48 CET] [13/2/0]
STATEMENT: SELECT abalance FROM pgbench_accounts WHERE aid = 987279;
TRAP: failed Assert("TransactionIdPrecedesOrEquals(TransactionXmin,
RecentXmin)"), File: "procarray.c", Line: 2136, PID: 2001770
------------------------
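(The logging itself was nothing fancy - just ad-hoc elog() calls next to the
assignments in procarray.c, roughly of this shape; treat it as a sketch of my
debugging patch, not the exact code:)
    elog(LOG, "GetSnapshotDataReuse TransactionXmin = %u", TransactionXmin);
    elog(LOG, "GetSnapshotDataReuse RecentXmin = %u", RecentXmin);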
These are all logs for pgbench backend PID 2001770. First, there's a
call to GetSnapshotData and then one to GetSnapshotDataReuse, which set
both variables to 209485.
LOG: GetSnapshotData TransactionXmin = 209485
LOG: GetSnapshotData RecentXmin = 209485
LOG: GetSnapshotDataReuse TransactionXmin = 209485
LOG: GetSnapshotDataReuse RecentXmin = 209485
I believe this is some very early setup, because both log lines have
[%b:%a] = [client backend:[unknown]]
And then we get another pair of these calls, where GetSnapshotData
advances the XIDs to 209508. But the following GetSnapshotDataReuse call
updates only RecentXmin to 209485, and that is before TransactionXmin.
Clearly, GetSnapshotDataReuse skips setting TransactionXmin because
it still has a valid XID in MyProc->xmin:
if (!TransactionIdIsValid(MyProc->xmin))
MyProc->xmin = TransactionXmin = snapshot->xmin;
I don't know if that's correct, and it seems a bit weird that RecentXmin
moves backwards.
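For reference, the surrounding reuse path looks roughly like this (paraphrased
from procarray.c, not a verbatim copy):
    /* cached snapshot is only considered valid while the counter matches */
    if (curXactCompletionCount != snapshot->snapXactCompletionCount)
        return false;
    /* TransactionXmin is only set if we don't have a valid xmin yet ... */
    if (!TransactionIdIsValid(MyProc->xmin))
        MyProc->xmin = TransactionXmin = snapshot->xmin;
    /* ... but RecentXmin is overwritten unconditionally */
    RecentXmin = snapshot->xmin;
    Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
So once a newer GetSnapshotData() has advanced TransactionXmin, reusing a
cached snapshot with an older xmin trips the Assert.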
But as I mentioned before, it's a bit weird this happens right after the
replica gets ready to handle read-only queries:
------------------------
1742348688.825 2025-03-19 02:44:48.825 CET [1981188] [startup:]
[67da2188.1e3b04:9] [2025-03-19 02:44:40 CET] [138/0/0] LOG: consistent
recovery state reached at 1/5FA884E8
1742348688.825 2025-03-19 02:44:48.825 CET [1981188] [startup:]
[67da2188.1e3b04:10] [2025-03-19 02:44:40 CET] [138/0/0] LOG: record
with incorrect prev-link 0/810 at 1/5FA884E8
1742348688.825 2025-03-19 02:44:48.825 CET [1981023] [postmaster:]
[67da2188.1e3a5f:5] [2025-03-19 02:44:40 CET] [/0] LOG: database system
is ready to accept read-only connections
1742348688.837 2025-03-19 02:44:48.837 CET [2001731] [walreceiver:]
[67da2190.1e8b43:1] [2025-03-19 02:44:48 CET] [/0] LOG: started
streaming WAL from primary at 1/5F000000 on timeline 1
------------------------
But there are multiple other concurrent pgbench backends, and those
don't have the same issue. None of them seems to advance the XID to 209508.
regards
--
Tomas Vondra
On 19/03/2025 04:22, Tomas Vondra wrote:
I kept stress-testing this, and while the frequency massively increased
on PG18, I managed to reproduce this all the way back to PG14. I see
~100x more corefiles on PG18.
...
But I'd suspect this is a bug in
commit 623a9ba79bbdd11c5eccb30b8bd5c446130e521c
Author: Andres Freund <andres@anarazel.de>
Date: Mon Aug 17 21:07:10 2020 -0700
snapshot scalability: cache snapshots using a xact completion counter.
...
Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
This can happen during hot standby:
1. Backend acquires snapshot A with xmin 1000
2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
3. Backend acquires snapshot B with xmin 1050
4. Backend releases snapshot A, updating TransactionXmin to 1050
5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
snapshot A's data.
Because xactCompletionCount is not updated in step 2, the
GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
has a lower xmin.
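Roughly like this in ExpireOldKnownAssignedTransactionIds(), and similarly in
ExpireAllKnownAssignedTransactionIds() (just a sketch, not a tested patch; on
older branches the counter lives in ShmemVariableCache rather than
TransamVariables):
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    ...
    /*
     * The XIDs removed here never produced commit/abort records, so
     * ProcArrayEndTransaction() never advanced xactCompletionCount for
     * them. Advance it here so cached snapshots don't get reused.
     */
    TransamVariables->xactCompletionCount++;
    KnownAssignedXidsRemovePreceding(xid);
    LWLockRelease(ProcArrayLock);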
--
Heikki Linnakangas
Neon (https://neon.tech)
On 3/19/25 08:17, Heikki Linnakangas wrote:
...
Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds()
and ExpireOldKnownAssignedTransactionIds() update xactCompletionCount?
This can happen during hot standby:
1. Backend acquires snapshot A with xmin 1000
2. Startup process calls ExpireOldKnownAssignedTransactionIds(),
3. Backend acquires snapshot B with xmin 1050
4. Backend releases snapshot A, updating TransactionXmin to 1050
5. Backend acquires new snapshot, calls GetSnapshotDataReuse(), reusing
snapshot A's data.
Because xactCompletionCount is not updated in step 2, the
GetSnapshotDataReuse() call will reuse the snapshot A. But snapshot A
has a lower xmin.
Could be. As an experiment I added xactCompletionCount advance to the
two functions you mentioned, and I ran the stress test again. I haven't
seen any failures so far, after ~1000 runs. Without the patch this
produced ~200 failures/core files.
regards
--
Tomas Vondra
On 3/19/25 13:27, Tomas Vondra wrote:
...
Could be. As an experiment I added xactCompletionCount advance to the
two functions you mentioned, and I ran the stress test again. I haven't
seen any failures so far, after ~1000 runs. Without the patch this
produced ~200 failures/core files.
I kept stress-testing this (without the fix), trying to figure out why
the frequency of failures got so much higher on PG18. While discussing
this off-list with Andres, he was wondering if maybe there's a second
independent bug on PG18, with the same symptoms.
It took quite a bit of time, but I managed to narrow this down to ~mid
December 2024, commit 952365cded6. The following table shows the number
of assert failures on a single run of the stress test, on 4 different
machines.
                xeon   ryzen   rpi5-32   rpi5-64
-------------------------------------------------
1585ff7387d      247      52        69        28
952365cded6      199      30         -         -
7ec4b9ff80d        1       7         -         -
7f97b4734f9        3      17         -         -
578a7fe7b6f        -       -         -         -
db448ce5ad3        -       -         -         -
1f81b48a9d5        -       -         -         -
d5a7bd5670c        0      11         1         0
There's a fair amount of randomness - partly because the bug is a race
condition and thus timing-sensitive, and partly because the test suite
itself is randomized.
But I think it shows a clear change across these commits. The rpi5
machines take much longer, so I only have the two results for now.
But it seems it changed in 952365cded6, which is:
commit 952365cded635e54c4177399c0280cb7a5e34c11
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon Dec 23 12:42:39 2024 +0200
Remove unnecessary GetTransactionSnapshot() calls
In get_database_list() and get_subscription_list(), the
GetTransactionSnapshot() call is not required because the catalog
table scans use the catalog snapshot, which is held until the end of
the scan. See table_beginscan_catalog(), which calls
RegisterSnapshot(GetCatalogSnapshot(relid)).
...
I'm not claiming the commit is buggy - it might be, but I think it's
more likely it just made the preexisting bug easier to hit. I base this
on the observation that incrementing the xactCompletionCount made the
assert failures go away.
regards
--
Tomas Vondra
On 21/03/2025 12:28, Tomas Vondra wrote:
But it seems it changed in 952365cded6, which is:
commit 952365cded635e54c4177399c0280cb7a5e34c11
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon Dec 23 12:42:39 2024 +0200
Remove unnecessary GetTransactionSnapshot() calls
In get_database_list() and get_subscription_list(), the
GetTransactionSnapshot() call is not required because the catalog
table scans use the catalog snapshot, which is held until the end of
the scan. See table_beginscan_catalog(), which calls
RegisterSnapshot(GetCatalogSnapshot(relid)).
...
I'm not claiming the commit is buggy - it might be, but I think it's
more likely it just made the preexisting bug easier to hit. I base this
on the observation that incrementing the xactCompletionCount made the
assert failures go away.
That commit removed three GetTransactionSnapshot() calls: one in
autovacuum, one in logical replication launcher, and one in
InitPostgres(). It must be the one in InitPostgres() that makes the
difference, because the other two are not executed in a standby.
I was able to reproduce this and to catch it in action with 'rr'. After
commit 952365cded6, the sequence of events looks like this:
1. At backend startup, GetCatalogSnapshot() sets
CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() sets
CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It tries to reuse the snapshot
with xmin=1000 and hits the assertion.
Before commit 952365cded6, which removed the GetTransactionSnapshot()
call from InitPostgres, this usually happens instead:
0. At backend startup, GetTransactionSnapshot() sets
CurrentSnapshotData.xmin=1000, xactCompletionCount=10
1. At backend startup, GetCatalogSnapshot() sets
CatalogSnapshotData.xmin=1000, xactCompletionCount=10
2. On first query, GetTransactionSnapshot() reuses the snapshot with
CurrentSnapshotData.xmin=1002, xactCompletionCount=10
3. GetCatalogSnapshot() is called again. It successfully reuses the
snapshot with xmin=1000
In other words, the GetTransactionSnapshot() call initializes the
CurrentSnapshotData with the snapshot with an earlier xmin. The first
GetCatalogSnapshot() call happens almost immediately after that, so
they're very likely - but not guaranteed - to get the same snapshot.
When I removed the GetTransactionSnapshot() call, the gap between the
first GetTransactionSnapshot() and GetCatalogSnapshot() calls in the
backend became much longer, making this more likely to happen.
In any case, the missing "xactCompletionCount++" in
ExpireAllKnownAssignedTransactionIds() and
ExpireOldKnownAssignedTransactionIds() is a clear bug and explains this.
I will prepare and commit patches to fix that.
Thanks for the testing!
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
On 2025-03-19 09:17:23 +0200, Heikki Linnakangas wrote:
On 19/03/2025 04:22, Tomas Vondra wrote:
...
Thanks for debugging and analyzing this!
Looking at the code, shouldn't ExpireAllKnownAssignedTransactionIds() and
ExpireOldKnownAssignedTransactionIds() update xactCompletionCount? This can
happen during hot standby:
...
I've swapped a lot of the KnownAssigned* code out:
Am I right in understanding that the only scenario (when in
STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds() would
"legally" remove a transaction, rather than the commit / abort records doing
so, is if the primary crash-restarted while transactions were ongoing?
Those transactions won't have commit/abort records and thus won't trigger
ExpireTreeKnownAssignedTransactionIds(), which otherwise would have updated
->xactCompletionCount?
When writing the snapshot caching patch, I tried to make sure that all the
places that maintain ->latestCompletedXid also update
->xactCompletionCount. Afaict that's still the case. Which, I think, means
that we're also missing calls to MaintainLatestCompletedXidRecovery()?
If latestCompletedXid is incorrect, visibility determinations end up wrong...
Greetings,
Andres Freund
On 21/03/2025 17:16, Andres Freund wrote:
Am I right in understanding that the only scenario (when in
STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds() would
"legally" remove a transaction, rather than the commit / abort records doing
so, is if the primary crash-restarted while transactions were ongoing?
Those transactions won't have commit/abort records and thus won't trigger
ExpireTreeKnownAssignedTransactionIds(), which otherwise would have updated
->xactCompletionCount?
Correct.
When writing the snapshot caching patch, I tried to make sure that all the
places that maintain ->latestCompletedXid also update
->xactCompletionCount. Afaict that's still the case. Which, I think, means
that we're also missing calls to MaintainLatestCompletedXidRecovery()?
Yep, I was just looking at that too.
If latestCompletedXid is incorrect, visibility determinations end up wrong...
I think it happens to work, because the transactions are effectively
aborted. latestCompletedXid is used to initialize xmax in
GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
still correctly considers XID 1001 as "not visible". As soon as a
transaction commits, latestCompletedXid is fixed.
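(That's because of the xmax check in XidInMVCCSnapshot(), which treats
anything at or beyond xmax as still in progress - paraphrased:)
    /* xids >= xmax are considered running, i.e. their effects not visible */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;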
AFAICS we could skip updating latestCompletedXid on aborts altogether
and rename it to latestCommittedXid. But it's hardly worth optimizing
for aborts.
For the same reason, I believe the assertion failure we're discussing
here is also harmless. Even though the reused snapshot has a smaller
xmin than expected, all those transactions aborted and are thus not
visible anyway.
In any case, let's be tidy and fix both latestCompletedXid and
xactCompletionCount.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 3/23/25 17:43, Heikki Linnakangas wrote:
...
I think it happens to work, because the transactions are effectively
aborted. latestCompletedXid is used to initialize xmax in
GetSnapshotData. If, for example, latestCompletedXid is incorrectly set
to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000
still correctly considers XID 1001 as "not visible". As soon as a
transaction commits, latestCompletedXid is fixed.
AFAICS we could skip updating latestCompletedXid on aborts altogether
and rename it to latestCommittedXid. But it's hardly worth optimizing
for aborts.
For the same reason, I believe the assertion failure we're discussing
here is also harmless. Even though the reused snapshot has a smaller
xmin than expected, all those transactions aborted and are thus not
visible anyway.
In any case, let's be tidy and fix both latestCompletedXid and
xactCompletionCount.
Thanks for looking into this and pushing the fix.
Would it make sense to add a comment documenting this reasoning about
not handling aborts? Otherwise someone will get to rediscover this in
the future ...
regards
--
Tomas Vondra
On 24/03/2025 16:56, Tomas Vondra wrote:
...
Thanks for looking into this and pushing the fix.
Would it make sense to add a comment documenting this reasoning about
not handling aborts? Otherwise someone will get to rediscover this in
the future ...
Well, we're currently _not_ relying on the reasoning about aborts not
changing visibility. So it's not like someone will forget the reasoning
and have a bug as a result. I guess someone could rediscover that
reasoning and think that they can skip advancing those counters on
aborts as an optimization, re-introducing the assertion failure. But I
believe that was not why we had this bug; it was just a simple omission.
So the comment would look like this:
"In principle, we would not need to advance xactCompletionCount and
latestCompletedXid because aborting transactions don't change
visibility. But that could make xmin within a transaction go backwards,
if a snapshot with an older xmin but same xactCompletionCount is reused,
triggering the assertion in GetSnapshotDataReuse()."
If we add that, I guess it should go into ProcArrayEndTransaction()
which currently just notes that it's used for commits and aborts
interchangeably. Do you think it's worth it? Want to propose a patch?
--
Heikki Linnakangas
Neon (https://neon.tech)
On 3/24/25 16:25, Heikki Linnakangas wrote:
...
Well, we're currently _not_ relying on the reasoning about aborts not
changing visibility. So it's not like someone will forget the reasoning
and have a bug as a result. I guess someone could rediscover that
reasoning and think that they can skip advancing those counters on
aborts as an optimization, re-introducing the assertion failure. But I
believe that was not why we had this bug; it was just a simple omission.
So the comment would look like this:
"In principle, we would not need to advance xactCompletionCount and
latestCompletedXid because aborting transactions don't change
visibility. But that could make xmin within a transaction go backwards,
if a snapshot with an older xmin but same xactCompletionCount is reused,
triggering the assertion in GetSnapshotDataReuse()."
If we add that, I guess it should go into ProcArrayEndTransaction()
which currently just notes that it's used for commits and aborts
interchangeably. Do you think it's worth it? Want to propose a patch?
I think the wording you suggested above is reasonable - I certainly
don't have a better one. I don't know how likely it is that someone
would break this, but I think it makes sense to mention it because the
assert is non-trivial to hit.
regards
--
Tomas Vondra