Bug: PG 14 recovery failure

Started by Torsten Förtsch, 4 months ago, 3 messages, general

tfoertsch123@gmail.com

4 months ago

Hi,

for many years I have been running a backup check system for my databases
that constantly
- upgrades to the latest available PG minor version (Debian PGDG)
- restores a DB from a basebackup on S3
- replays all available WAL
- performs a ton of consistency checks
- repeats the same with the next DB
- starts from the beginning when all DBs are done

All the DBs are PG14. After 14.21 was released last week I saw some of our
bigger DBs failing after replaying a few thousand WAL files.

The error message reads like so:

2026-02-14 01:53:59.595 UTC [2441074] LOG: restored log file "0000000500017F8D0000004E" from archive
2026-02-14 01:53:59.605 UTC [2441074] FATAL: could not access status of transaction 2030956544
2026-02-14 01:53:59.605 UTC [2441074] DETAIL: Could not read from file "pg_multixact/offsets/790D" at offset 245760: read too few bytes.
2026-02-14 01:53:59.605 UTC [2441074] CONTEXT: WAL redo at 17F8D/4E1E03E8 for MultiXact/CREATE_ID: 2030956543 offset 1335629905 nmembers 2: 691151655 (keysh) 691151658 (keysh)

It does not happen every time. A freshly taken backup succeeded in
restoring ~3000 WAL files. In the next round it failed at ~5000 WAL files.
If it fails, it is reproducible. It will fail at the same multixact offset
again.

The multixact offset file where it fails does not exist in the base backup.
It is built during replay. In all cases I saw, the offset mentioned in the
error message is the length of the file. So, PG apparently wants to read
beyond the end of the file.
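
As a rough cross-check of that observation, the sketch below maps a multixact ID to its offsets file and page. It assumes the default build constants (8 kB blocks, 4-byte offset entries, 32 pages per SLRU segment file); these values and the function name are assumptions for illustration and are not taken from the report.

```python
# Assumed PostgreSQL 14 default build constants (not stated in the report):
BLCKSZ = 8192            # block size in bytes
OFFSET_SIZE = 4          # one multixact offset entry is a 32-bit integer
PAGES_PER_SEGMENT = 32   # SLRU pages per segment file

ENTRIES_PER_PAGE = BLCKSZ // OFFSET_SIZE
ENTRIES_PER_SEGMENT = ENTRIES_PER_PAGE * PAGES_PER_SEGMENT


def locate_multixact_offset(mxid: int) -> tuple[str, int, int]:
    """Hypothetical helper: return (segment file name, byte offset of the
    page within that file, index of the entry on that page)."""
    segment = mxid // ENTRIES_PER_SEGMENT
    page_in_segment = (mxid % ENTRIES_PER_SEGMENT) // ENTRIES_PER_PAGE
    entry_on_page = mxid % ENTRIES_PER_PAGE
    return f"{segment:04X}", page_in_segment * BLCKSZ, entry_on_page


if __name__ == "__main__":
    # Multixact named in the CREATE_ID record, then the one in the FATAL line.
    print(locate_multixact_offset(2030956543))  # ('790D', 237568, 2047)
    print(locate_multixact_offset(2030956544))  # ('790D', 245760, 0)
```

Under those assumptions, 2030956544 is the first entry of the page at byte 245760 in 790D, while 2030956543 (the multixact in the CREATE_ID record) is the last entry of the page before it. So the failing read appears to target the page immediately after the one being written, which matches the offset being equal to the file length.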

After rolling back to PG 14.20, everything started working again.

The release notes mention a few multixact changes from 14.20 to 14.21. I
can't claim to understand the change fully. But
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=81416e101
looks like the best culprit candidate to me.

All the best,
Torsten

David Rowley

dgrowleyml@gmail.com

4 months ago

In reply to: Torsten Förtsch (#1)

Re: Bug: PG 14 recovery failure

On Wed, 18 Feb 2026 at 21:40, Torsten Förtsch <tfoertsch123@gmail.com> wrote:

2026-02-14 01:53:59.595 UTC [2441074] LOG: restored log file "0000000500017F8D0000004E" from archive
2026-02-14 01:53:59.605 UTC [2441074] FATAL: could not access status of transaction 2030956544

There's a planned out-of-cycle release to fix this. See [1].

David

[1]: https://www.postgresql.org/about/news/out-of-cycle-release-scheduled-for-february-26-2026-3241/

Michael Paquier

michael@paquier.xyz

4 months ago

In reply to: David Rowley (#2)

Re: Bug: PG 14 recovery failure

On Wed, Feb 18, 2026 at 11:27:37PM +1300, David Rowley wrote:

On Wed, 18 Feb 2026 at 21:40, Torsten Förtsch <tfoertsch123@gmail.com> wrote:

2026-02-14 01:53:59.595 UTC [2441074] LOG: restored log file "0000000500017F8D0000004E" from archive
2026-02-14 01:53:59.605 UTC [2441074] FATAL: could not access status of transaction 2030956544

There's a planned out-of-cycle release to fix this. See [1].

In more detail, here is the specific fix in the v14~18 range:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=817f74600d0d

And its related thread:
/messages/by-id/20260214090150.GC2297@p46.dedyn.io;lightning.p46.dedyn.io
--
Michael