17.8 standby crashes during WAL replay from 17.5 primary: "could not access status of transaction"
PostgreSQL version: 17.8 (standby), 17.5 (primary)
Primary: PostgreSQL 17.5 (Debian 17.5-1.pgdg130+1) on
aarch64-unknown-linux-gnu
Standby: PostgreSQL 17.8 (Debian 17.8-1.pgdg13+1) on
aarch64-unknown-linux-gnu
Platform: Docker containers on macOS (Apple Silicon / aarch64), Docker
Desktop
Description
-----------
A PostgreSQL 17.8 standby crashes during WAL replay when streaming
from a 17.5 primary. The crash occurs after replaying a
MultiXact/TRUNCATE_ID record followed by a MultiXact/CREATE_ID
record.
Steps to reproduce
------------------
1. Start a 17.5 primary configured for streaming replication
2. Seed a database with ~2GB of data (tables with foreign key
constraints)
3. Start a 17.5 standby via pg_basebackup, confirm streaming
replication
4. Generate ~500K MultiXact IDs using concurrent SELECT ... FOR SHARE
/ FOR KEY SHARE on the same rows
5. Run VACUUM on the multixact-heavy tables (generates TRUNCATE_ID
WAL records)
6. Stop the 17.5 standby
7. Continue generating ~2M additional MultiXact IDs on the primary
(builds WAL backlog)
8. Start a 17.8 standby on the same data volume -- it begins
replaying the WAL backlog
9. Standby crashes during replay
An automated reproducer (Go program + shell scripts) is available at:
https://gist.github.com/sebastianwebber/2cd25d298bfe85cabcd8d41f83591acb
It requires Go 1.22+ and Docker. Typical runtime is ~10 minutes.
go run main.go --cleanup
Actual output (standby log)
----------------------------
The standby successfully replays multiple SLRU page boundaries with
this pattern:
DEBUG: next offsets page is not initialized, initializing it now
CONTEXT: WAL redo at 3/28C148D8 for MultiXact/CREATE_ID: 856063 offset
6680130 nmembers 9: ...
DEBUG: skipping initialization of offsets page 418 because it was
already initialized on multixid creation
CONTEXT: WAL redo at 3/28C149B8 for MultiXact/ZERO_OFF_PAGE: 418
This repeats for pages 408 through 418. Then a truncation occurs:
DEBUG: replaying multixact truncation: offsets [1, 490986), offsets
segments [0, 7), members [1, 3864017), members segments [0, 49)
CONTEXT: WAL redo at 3/29D6D548 for MultiXact/TRUNCATE_ID: offsets [1,
490986), members [1, 3864017)
The very next CREATE_ID crashes:
FATAL: could not access status of transaction 858112
DETAIL: Could not read from file "pg_multixact/offsets/000D" at offset
24576: read too few bytes.
CONTEXT: WAL redo at 3/2A3AB408 for MultiXact/CREATE_ID: 858111 offset
6695072 nmembers 5: 1048228 (sh) 1048271 (keysh) 1048316 (sh) 1048344
(keysh) 1048370 (sh)
LOG: startup process (PID 29) exited with exit code 1
LOG: shutting down due to startup process failure
Expected output
---------------
The standby should successfully replay all WAL records and reach a
consistent streaming state.
Configuration (non-default on primary)
--------------------------------------
wal_level = replica
max_wal_senders = 10
max_connections = 1200
shared_buffers = 256MB
wal_keep_size = 16GB
autovacuum_multixact_freeze_max_age = 100000
vacuum_multixact_freeze_min_age = 1000
vacuum_multixact_freeze_table_age = 50000
Standby configured with log_min_messages = debug1.
--
Sebastian Webber
Hi,
On Fri, Feb 13, 2026 at 05:31:18PM -0300, Sebastian Webber wrote:
Description
-----------
A PostgreSQL 17.8 standby crashes during WAL replay when streaming
from a 17.5 primary. The crash occurs after replaying a
MultiXact/TRUNCATE_ID record followed by a MultiXact/CREATE_ID
record.
Ouch.
For the record, somebody else (Jacob Bunk Nielsen) reported the same(?) issue
independently on Slack:
|We are running a PostgreSQL 17.7 primary with a number of physical replicas. We upgraded some of the replicas from 17.5 to 17.8 earlier today, and now 4 of them consistently die after a short time with errors like:
|
|Feb 13 13:36:01 srv2 postgres[13068]: [12-1] 2026-02-13 13:36:01 UTC::@:[13068]: FATAL: could not access status of transaction 155428864
|Feb 13 13:36:01 srv2 postgres[13068]: [12-2] 2026-02-13 13:36:01 UTC::@:[13068]: DETAIL: Could not read from file "pg_multixact/offsets/0943" at offset 172032: read too few bytes.
|Feb 13 13:36:01 srv2 postgres[13068]: [12-3] 2026-02-13 13:36:01 UTC::@:[13068]: CONTEXT: WAL redo at 9F79/27144740 for MultiXact/CREATE_ID: 155428863 offset 1359592572 nmembers 2: 448728610 (keysh) 448728612 (keysh)
|
|Did we hit a bug in the newly released 17.8?
Sebastian, as it looks like you are able to reproduce it - are you maybe in a
position to bisect from 17.5 to 17.8 to see which commit is responsible for this?
Thanks,
Michael
On 13/02/2026 22:31, Sebastian Webber wrote:
PostgreSQL version: 17.8 (standby), 17.5 (primary)
Primary: PostgreSQL 17.5 (Debian 17.5-1.pgdg130+1) on
aarch64-unknown-linux-gnu
Standby: PostgreSQL 17.8 (Debian 17.8-1.pgdg13+1) on
aarch64-unknown-linux-gnu
Platform: Docker containers on macOS (Apple Silicon / aarch64), Docker
Desktop

Description
-----------
A PostgreSQL 17.8 standby crashes during WAL replay when streaming
from a 17.5 primary. The crash occurs after replaying a
MultiXact/TRUNCATE_ID record followed by a MultiXact/CREATE_ID
record.
Thanks for the report, I can repro it with your script. It is indeed a
regression introduced in the latest minor release, in the logic to
replay multixact WAL generated on older minor versions. (Commit
8ba61bc063). Adding the folks from the thread that led to that commit.
The commit added this in RecordNewMultiXact():
/*
* Older minor versions didn't set the next multixid's offset in this
* function, and therefore didn't initialize the next page until the next
* multixid was assigned. If we're replaying WAL that was generated by
* such a version, the next page might not be initialized yet. Initialize
* it now.
*/
if (InRecovery &&
next_pageno != pageno &&
pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
{
elog(DEBUG1, "next offsets page is not initialized, initializing it now");
The idea is that if the next offset falls on a different page
(next_pageno != pageno), and we have not yet initialized the next page
(pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) ==
pageno), we initialize it now. However, that last check goes wrong after
a truncation record is replayed. Replaying a truncation record does this:
/*
* During XLOG replay, latest_page_number isn't necessarily set up
* yet; insert a suitable value to bypass the sanity test in
* SimpleLruTruncate.
*/
pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number,
pageno);
Thanks to that, latest_page_number moves backwards to a much older page
number. That breaks the "was the next offset page already initialized?"
test in RecordNewMultiXact().
I don't understand why that "bypass the sanity check" is needed. As far
as I can see, latest_page_number is tracked accurately during WAL
replay, and should already be set up. It's initialized in
StartupMultiXact(), and updated whenever the next page is initialized.
That was introduced a long time ago, in commit 4f627f8973, which in turn
was backpatched and had to deal with WAL that was generated before that
commit. I suspect it was necessary back then, for backwards
compatibility, but isn't necessary any more. Hence, I propose to remove
that "bypass the sanity check" code (attached). Does anyone see a
scenario where latest_page_number might not be set correctly?
If we want to play it even more safe -- and I guess that's the right
thing to do for backpatching -- we could set latest_page_number
*temporarily* while we do the truncation, and restore the old value
afterwards.
This fixes the bug. With this fix, you can replay WAL that's already
been generated.
- Heikki
Attachments:
0001-Don-t-reset-latest_page_number-when-replaying-multix.patch (text/x-patch)
Ouch...
I remember this place. For some reason I thought endTruncOff was the end of the offsets, which would make sense here. Now I see it's just the new oldest offset.
On 14 Feb 2026, at 16:42, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
If we want to play it even more safe -- and I guess that's the right thing to do for backpatching -- we could set latest_page_number *temporarily* while we do the the truncation, and restore the old value afterwards.
As far as I can see, the only relevant usage of latest_page_number is:
/*
* While we are holding the lock, make an important safety check: the
* current endpoint page must not be eligible for removal.
*/
if (ctl->PagePrecedes(shared->latest_page_number, cutoffPage))
{
LWLockRelease(shared->ControlLock);
ereport(LOG,
(errmsg("could not truncate directory \"%s\": apparent wraparound",
ctl->Dir)));
return;
}
Perhaps, we also can bump latest_page_number forward?
Best regards, Andrey Borodin.
On 14 Feb 2026, at 21:18, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Perhaps, we also can bump latest_page_number forward?
This is not a good idea: we don't want the "most accurate" latest_page_number, we need a precise value at any point after StartupMultiXact().
Removing the write in XLOG_MULTIXACT_TRUNCATE_ID replay seems correct to me everywhere on 14-18.
I'd also suggest updating this comment:
* this is not critical data, since we use it only to avoid swapping out
* the latest page.
It's absolutely critical now.
Best regards, Andrey Borodin.
On 14 Feb 2026, at 22:41, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Wiping write by XLOG_MULTIXACT_TRUNCATE_ID seems correct to me everywhere 14-18.
FWIW I've tried to create a TAP reproducer, but it's tricky in a controlled environment.
But I've created a TAP test that triggers a near-wraparound truncation:
2026-02-15 23:05:57.716 +05 [73950] DEBUG: replaying multixact truncation: offsets [1, 2147483648), offsets segments [0, 8000), members [1, 3), members segments [0, 0)
2026-02-15 23:05:57.716 +05 [73950] CONTEXT: WAL redo at 0/309CD70 for MultiXact/TRUNCATE_ID: offsets [1, 2147483648), members [1, 3)
2026-02-15 23:05:57.716 +05 [73950] DEBUG: MultiXactId wrap limit is 4294967295, limited by database with OID 1
2026-02-15 23:05:57.716 +05 [73950] CONTEXT: WAL redo at 0/309CD70 for MultiXact/TRUNCATE_ID: offsets [1, 2147483648), members [1, 3)
2026-02-15 23:05:57.716 +05 [73950] LOG: file "pg_multixact/offsets/8000" doesn't exist, reading as zeroes
And I observe no problems with "0001-Don-t-reset-latest_page_number-when-replaying-multix.patch" applied.
Best regards, Andrey Borodin.
Attachments:
0001-Test-Multixact-truncation-near-araparound.patch (application/octet-stream)
On Sat, 14 Feb 2026 at 16:42, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This fixes the bug. With this fix, you can replay WAL that's already
been generated.
Hi!
Patch LGTM. Let's wrap new minors with it?
--
Best regards,
Kirill Reshke
On 16/02/2026 08:45, Kirill Reshke wrote:
Patch LGTM.
Ok, thanks. I updated the comment to point out that 'latest_page_number'
is used for this backwards-compatibility hack, per Andrey's suggestion,
and committed.
I did some testing of this with Sebastian's script, but Andrey, if you
can verify with your TAP test, too, that'd be great.
Let's wrap new minors with it?
Yep, that's now planned for next week:
/messages/by-id/177125656521.788.2734531836137629391@wrigleys.postgresql.org
- Heikki
On 16 Feb 2026, at 21:01, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Andrey if you can verify with your TAP test, too, that'd be great.
Here's a hand-wavy test on top of REL_17_STABLE. It modifies binaries to simulate old WAL write behavior.
I tried to hack it with -DDEMO_SIMULATE_OLD_MULTIXACT_BEHAVIOR, but gave up and just hardcoded.
We are not going to commit it, are we?
If we comment out this line (as the patch does)
pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number,
pageno);
the test will pass.
Without that change, it will hang indefinitely because:
2026-02-18 13:44:12.238 +05 [52360] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2026-02-18 13:44:12.250 +05 [52359] FATAL: could not access status of transaction 4096
2026-02-18 13:44:12.250 +05 [52359] DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset 16384: read too few bytes.
2026-02-18 13:44:12.250 +05 [52359] CONTEXT: WAL redo at 0/30245E0 for MultiXact/CREATE_ID: 4095 offset 8189 nmembers 2: 4835 (sh) 4835 (upd)
Most hand-wavy part is test_multixact_write_truncate_wal(): truncation is synthetic.
FWIW, a lot of the calculations and comments were done by an LLM. Let me know if such verbosity hurts readability.
Best regards, Andrey Borodin.