IPC/MultixactCreation on the Standby server

Started by Dmitry, 6 months ago, 61 messages
#1 Dmitry <dsy.075@yandex.ru>

Hi, hackers

The problem is as follows.
A replication cluster includes a primary server and one hot-standby replica.
The workload on the primary server consists of many queries generating
multixact IDs, while the hot-standby replica serves read-only queries.

After some time, all requests on the hot standby get stuck and never
finish.

The `pg_stat_activity` view on the replica reports that the processes are
stuck waiting on IPC/MultixactCreation; pg_cancel_backend and
pg_terminate_backend cannot cancel the queries, and SIGQUIT is the only
way to stop them.
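
The stuck sessions are visible with a query along these lines (the column
list matches the output shown below):

    select backend_type, state, wait_event_type, wait_event, query
    from pg_stat_activity;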

We tried:
- changing the `autovacuum_multixact_freeze_max_age` parameter,
- increasing `multixact_member_buffers` and `multixact_offset_buffers`,
- disabling `hot_standby_feedback`,
- switching the replica to synchronous and asynchronous mode,
- and much more.
But nothing helped.

We also ran the replica in recovery mode from a WAL archive, i.e. as a
warm standby; the result was the same.

We also tried building from the sources of the REL_17_5 branch with the
default configure settings:
    ./configure
    make
    make install
but had no luck.

Here is an example with a synthetic workload reproducing the problem.

Test system
===========

- Architecture: x86_64
- OS: Ubuntu 24.04.2 LTS (Noble Numbat)
- Tested postgres version(s):
    - latest 17 (17.5)
    - latest 18 (18-beta1)

The problem is not reproducible on PostgreSQL 16.9.

Steps to reproduce
==================

    postgres=# create table tbl (
        id int primary key,
        val int
    );
    postgres=# insert into tbl select i, 0 from generate_series(1,5) i;

The first and second scripts execute queries on the master server
-----------------------------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 200 -j 200 -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    select * from tbl where id = :id for key share;
    commit;
    EOF

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    update tbl set val = val+1 where id = :id;
    \sleep 10 ms
    commit;
    EOF

The following script is executed on the replica
-----------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 -T 300 -P 1 --file=/dev/stdin <<'EOF'
    begin;
    select sum(val) from tbl;
    \sleep 10 ms
    select sum(val) from tbl;
    \sleep 10 ms
    commit;
    EOF

    pgbench (17.5 (Ubuntu 17.5-1.pgdg24.04+1))
    progress: 1.0 s, 2606.8 tps, lat 33.588 ms stddev 13.316, 0 failed
    progress: 2.0 s, 3315.0 tps, lat 30.174 ms stddev 5.933, 0 failed
    progress: 3.0 s, 3357.0 tps, lat 29.699 ms stddev 5.541, 0 failed
    progress: 4.0 s, 3350.0 tps, lat 29.911 ms stddev 5.311, 0 failed
    progress: 5.0 s, 3206.0 tps, lat 30.999 ms stddev 6.343, 0 failed
    progress: 6.0 s, 3264.0 tps, lat 30.828 ms stddev 6.389, 0 failed
    progress: 7.0 s, 3224.0 tps, lat 31.099 ms stddev 6.197, 0 failed
    progress: 8.0 s, 3168.0 tps, lat 31.486 ms stddev 6.940, 0 failed
    progress: 9.0 s, 3118.0 tps, lat 32.004 ms stddev 6.546, 0 failed
    progress: 10.0 s, 3017.0 tps, lat 33.183 ms stddev 7.971, 0 failed
    progress: 11.0 s, 3157.0 tps, lat 31.697 ms stddev 6.624, 0 failed
    progress: 12.0 s, 3180.0 tps, lat 31.415 ms stddev 6.310, 0 failed
    progress: 13.0 s, 3150.9 tps, lat 31.591 ms stddev 6.280, 0 failed
    progress: 14.0 s, 3329.0 tps, lat 30.189 ms stddev 5.792, 0 failed
    progress: 15.0 s, 3233.6 tps, lat 30.852 ms stddev 5.723, 0 failed
    progress: 16.0 s, 3185.4 tps, lat 31.378 ms stddev 6.383, 0 failed
    progress: 17.0 s, 3035.0 tps, lat 32.920 ms stddev 7.390, 0 failed
    progress: 18.0 s, 3173.0 tps, lat 31.547 ms stddev 6.390, 0 failed
    progress: 19.0 s, 3077.0 tps, lat 32.427 ms stddev 6.634, 0 failed
    progress: 20.0 s, 3266.1 tps, lat 30.740 ms stddev 5.842, 0 failed
    progress: 21.0 s, 2990.9 tps, lat 33.353 ms stddev 7.019, 0 failed
    progress: 22.0 s, 3048.1 tps, lat 32.933 ms stddev 6.951, 0 failed
    progress: 23.0 s, 3148.0 tps, lat 31.769 ms stddev 6.077, 0 failed
    progress: 24.0 s, 1523.2 tps, lat 30.029 ms stddev 5.093, 0 failed
    progress: 25.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 26.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 27.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 28.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 29.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 30.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 31.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 32.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 33.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 34.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 35.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed

After some time, all requests on the replica hang waiting for
IPC/MultixactCreation.

Output from `pg_stat_activity`
------------------------------

         backend_type       | state  | wait_event_type |    wait_event     |           query
    ------------------------+--------+-----------------+-------------------+---------------------------
     client backend         | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend         | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend         | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend         | active | IPC             | MultixactCreation | select sum(val) from tbl;
    ...
     client backend         | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend         | active | IPC             | MultixactCreation | select sum(val) from tbl;
     client backend         | active | IPC             | MultixactCreation | select sum(val) from tbl;
     startup                |        | LWLock          | BufferContent     |
     checkpointer           |        | Activity        | CheckpointerMain  |
     background writer      |        | Activity        | BgwriterHibernate |
     walreceiver            |        | Activity        | WalReceiverMain   |

gdb session for `client backend` process
----------------------------------------

    (gdb) bt
    #0  0x00007f0e9872a007 in epoll_wait (epfd=5, events=0x57c4747fc458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
    #1  0x000057c440685033 in WaitEventSetWaitBlock (nevents=<optimized out>, occurred_events=0x7ffdaedc8360, cur_timeout=-1, set=0x57c4747fc3f0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1577
    #2  WaitEventSetWait (set=0x57c4747fc3f0, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdaedc8360, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1525
    #3  0x000057c44068541c in WaitLatch (latch=<optimized out>, wakeEvents=<optimized out>, timeout=<optimized out>, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:538
    #4  0x000057c44068d8c0 in ConditionVariableTimedSleep (cv=0x7f0cefc50ab0, timeout=-1, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:163
    #5  0x000057c440365a0c in ConditionVariableSleep (wait_event_info=134217765, cv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:98
    #6  GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
    #7  0x000057c4408adc6b in MultiXactIdGetUpdateXid.isra.0 (xmax=xmax@entry=45559845, t_infomask=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7478
    #8  0x000057c44031ecfa in HeapTupleGetUpdateXid (tuple=<error reading variable: Cannot access memory at address 0x0>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7519
    #9  HeapTupleSatisfiesMVCC (htup=<optimized out>, buffer=404, snapshot=0x57c474892ff0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1090
    #10 HeapTupleSatisfiesVisibility (htup=<optimized out>, snapshot=0x57c474892ff0, buffer=404) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1772
    #11 0x000057c44030c1cb in page_collect_tuples (check_serializable=<optimized out>, all_visible=<optimized out>, lines=<optimized out>, block=<optimized out>, buffer=<optimized out>, page=<optimized out>, snapshot=<optimized out>, scan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:480
    #12 heap_prepare_pagescan (sscan=0x57c47495b970) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:579
    #13 0x000057c44030cb59 in heapgettup_pagemode (scan=scan@entry=0x57c47495b970, dir=<optimized out>, nkeys=<optimized out>, key=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:999
    #14 0x000057c44030d1bd in heap_getnextslot (sscan=0x57c47495b970, direction=<optimized out>, slot=0x57c47494b278) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:1319
    #15 0x000057c4404f090a in table_scan_getnextslot (slot=0x57c47494b278, direction=ForwardScanDirection, sscan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/tableam.h:1072
    #16 SeqNext (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeSeqscan.c:80
    #17 0x000057c4404d5cfc in ExecProcNode (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #18 fetch_input_tuple (aggstate=aggstate@entry=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:561
    #19 0x000057c4404d848a in agg_retrieve_direct (aggstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2459
    #20 ExecAgg (pstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2179
    #21 0x000057c4404c2003 in ExecProcNode (node=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #22 ExecutePlan (dest=0x57c47483d548, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, queryDesc=0x57c474895010) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:1649
    #23 standard_ExecutorRun (queryDesc=0x57c474895010, direction=<optimized out>, count=0, execute_once=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:361
    ...

gdb session for `startup` process
---------------------------------

    (gdb) bt
    #0  0x00007f0e98698ce3 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:57
    #1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:87
    #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0ceb34e6b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
    #3  0x00007f0e986a4f1f in do_futex_wait (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
    #4  0x00007f0e986a4fb8 in __new_sem_wait_slow64 (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
    #5  0x00007f0e986a503d in __new_sem_wait (sem=sem@entry=0x7f0ceb34e6b8) at ./nptl/sem_wait.c:42
    #6  0x000057c440696166 in PGSemaphoreLock (sema=0x7f0ceb34e6b8) at port/pg_sema.c:327
    #7  LWLockAcquire (lock=0x7f0cefc58064, mode=LW_EXCLUSIVE) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/lwlock.c:1289
    #8  0x000057c44038f96a in LockBuffer (mode=2, buffer=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/buffer/bufmgr.c:5147
    #9  XLogReadBufferForRedoExtended (record=<optimized out>, block_id=<optimized out>, mode=RBM_NORMAL, get_cleanup_lock=false, buf=0x7ffdaedc8b4c) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:429
    #10 0x000057c440319969 in XLogReadBufferForRedo (buf=0x7ffdaedc8b4c, block_id=0 '\000', record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:317
    #11 heap_xlog_lock_updated (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10230
    #12 heap2_redo (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10362
    #13 0x000057c44038e1d2 in ApplyWalRecord (replayTLI=<synthetic pointer>, record=0x7f0e983908e0, xlogreader=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/xlog_internal.h:380
    #14 PerformWalRecovery () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogrecovery.c:1822
    #15 0x000057c44037bbf6 in StartupXLOG () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlog.c:5821
    #16 0x000057c4406155ed in StartupProcessMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/startup.c:258
    #17 0x000057c44060b376 in postmaster_child_launch (child_type=B_STARTUP, startup_data=0x0, startup_data_len=0, client_sock=0x0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/launch_backend.c:277
    #18 0x000057c440614509 in postmaster_child_launch (client_sock=0x0, startup_data_len=0, startup_data=0x0, child_type=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3934
    #19 StartChildProcess (type=type@entry=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3930
    #20 0x000057c44061480d in PostmasterStateMachine () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3392
    #21 0x000057c4408a3455 in process_pm_child_exit () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:2683
    #22 ServerLoop.isra.0 () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1667
    #23 0x000057c440616965 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1374
    #24 0x000057c4402bcd2d in main (argc=17, argv=0x57c4747fb140) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/main/main.c:199

Could you please help me fix the problem of the stuck 'client backend'
processes?

Any ideas and recommendations are most welcome!

Best regards,
Dmitry

#2 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Dmitry (#1)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:

#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483

Hi Dmitry!

This looks to be related to the work in my thread about multixacts [0]. It seems the CV sleep at /* Corner case 2: next multixact is still being filled in */ is not woken up by the ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.

If so, any subsequent multixact redo from WAL should unstick the read of the last MultiXact.

Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.
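
For context, the mechanism is the usual condition-variable pattern (a simplified paraphrase, not the actual multixact.c code; next_offset_known() is a hypothetical stand-in for the real check of the next multixact's offset):

    /* Reader side, corner case 2 in GetMultiXactIdMembers() (paraphrase): */
    ConditionVariablePrepareToSleep(&MultiXactState->nextoff_cv);
    while (!next_offset_known())
        ConditionVariableSleep(&MultiXactState->nextoff_cv,
                               WAIT_EVENT_MULTIXACT_CREATION);
    ConditionVariableCancelSleep();

    /* Writer side, after recording the next multixact's offset: */
    ConditionVariableBroadcast(&MultiXactState->nextoff_cv);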

Can you please check your reproduction with the patch attached to this message? The patch simply adds a timeout to the CV sleep, so in the worst case we fall back to the behavior of PG 16.

Best regards, Andrey Borodin.

Attachments:

0001-Make-next-multixact-sleep-timed.patch (application/octet-stream)
#3 Dmitry <dsy.075@yandex.ru>
In reply to: Andrey Borodin (#2)
Re: IPC/MultixactCreation on the Standby server

On 25.06.2025 12:34, Andrey Borodin wrote:

On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:

#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483

Hi Dmitry!

This looks to be related to the work in my thread about multixacts [0]. It seems the CV sleep at /* Corner case 2: next multixact is still being filled in */ is not woken up by the ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.

If so, any subsequent multixact redo from WAL should unstick the read of the last MultiXact.

Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.

Can you please check your reproduction with the patch attached to this message? The patch simply adds a timeout to the CV sleep, so in the worst case we fall back to the behavior of PG 16.

Best regards, Andrey Borodin.

Hi Andrey!

Thanks so much for your response.

A small comment on /* Corner case 2: ... */:
At this point in the code I tried to set trace points by emitting
messages through `elog()`, and I can say that the process does not always
get stuck in this part of the code; it happens from time to time, in an
unpredictable way.

Maybe this will help you a little.

To be honest, PostgreSQL performance is much better with this feature;
it would be a shame if we had to roll back to the behavior of version 16.

I will definitely try to reproduce the problem with your patch.

Best regards,
Dmitry.

#4 Dmitry <dsy.075@yandex.ru>
In reply to: Dmitry (#3)
Re: IPC/MultixactCreation on the Standby server

On 25.06.2025 16:44, Dmitry wrote:

I will definitely try to reproduce the problem with your patch.

Hi Andrey!

I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try to get the information about the members of the multixact again, in an endless loop.
At the same time, WAL is not replayed; the 'startup' process also hangs on LWLock/BufferContent.

Best regards,
Dmitry.

#5 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Dmitry (#4)
Re: IPC/MultixactCreation on the Standby server

On 26 Jun 2025, at 14:33, Dmitry <dsy.075@yandex.ru> wrote:

On 25.06.2025 16:44, Dmitry wrote:

I will definitely try to reproduce the problem with your patch.

Hi Andrey!

I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try to get the information about the members of the multixact again, in an endless loop.
At the same time, WAL is not replayed; the 'startup' process also hangs on LWLock/BufferContent.

My hypothesis is that MultiXactState->nextMXact is not filled often enough from the redo paths. So if you are unlucky enough, a corner case 2 read can deadlock with startup.
I need to verify it further, but if so, it's an ancient bug that just happens to be a few orders of magnitude more reproducible on 17 due to the performance improvements. Still a hypothesis, though.

Best regards, Andrey Borodin.

#6 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#5)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 26 Jun 2025, at 17:59, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

hypothesis

Dmitry, can you please retry your reproduction with the attached patch?

It should print nextMXact and tmpMXact. If my hypothesis is correct, nextMXact will precede tmpMXact.

Best regards, Andrey Borodin.

Attachments:

v2-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch (application/octet-stream)
#7 Dmitry <dsy.075@yandex.ru>
In reply to: Andrey Borodin (#6)
Re: IPC/MultixactCreation on the Standby server

On 26.06.2025 19:24, Andrey Borodin wrote:

If my hypothesis is correct, nextMXact will precede tmpMXact.

It seems that the hypothesis has not been confirmed.

Attempt #1
2025-06-26 23:47:24.821 MSK [220458] WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
2025-06-26 23:47:24.822 MSK [220540] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
2025-06-26 23:47:24.823 MSK [220548] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...

pgbench (17.5)
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 4.0 s, 482.2 tps, lat 820.293 ms stddev 1370.729, 0 failed
progress: 5.0 s, 886.0 tps, lat 112.463 ms stddev 8.506, 0 failed
progress: 6.0 s, 348.9 tps, lat 111.324 ms stddev 5.871, 0 failed
WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379

Attempt #2
2025-06-27 09:18:01.312 MSK [236187] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236225] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236178] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
...

pgbench (17.5)
progress: 1.0 s, 830.9 tps, lat 108.556 ms stddev 10.078, 0 failed
progress: 2.0 s, 839.0 tps, lat 118.358 ms stddev 19.708, 0 failed
progress: 3.0 s, 623.4 tps, lat 134.186 ms stddev 15.565, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
...
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744

Best regards,
Dmitry.

#8 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Dmitry (#7)
2 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 27 Jun 2025, at 11:41, Dmitry <dsy.075@yandex.ru> wrote:

It seems that the hypothesis has not been confirmed.

Indeed.

For some reason your reproduction does not work for me.
I tried to create a test from your workload description. PFA a patch with a very dirty prototype.

To run the test:

cd contrib/amcheck
PROVE_TESTS=t/006_MultiXact_standby.pl make check

To check whether the reproduction worked, read tmp_check/log/006_MultiXact_standby_standby_1.log and see if there are messages "Timed out: nextMXact %u tmpMXact %u".

If you could codify your reproduction into this TAP test, we could make it portable, so that I can debug the problem on my machine...

Either way, we can proceed with remote debugging via the mailing list :)

Thank you!

Best regards, Andrey Borodin.

Attachments:

v3-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch (application/octet-stream)
v3-0002-Test-concurrent-Multixact-reading-on-stadnby.patch (application/octet-stream)
#9 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#8)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 28 Jun 2025, at 00:37, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Indeed.

After some experiments I could get an unstable repro on my machine.
I've added some logging, and here's what I've found:

2025-06-28 23:03:40.598 +05 [40887] 006_MultiXact_standby.pl WARNING: Timed out: nextMXact 415832 tmpMXact 415827 pageno 203 prev_pageno 203 entryno 83 offptr[1] 831655 offptr[0] 0 offptr[-1] 831651

We are reading multi 415827-1 while 415827 is not filled in yet, but we are holding a buffer that prevents the next multi from being filled in.
This seems like a recovery conflict.
I'm somewhat surprised that 415827+1 is already filled in...

Can you please try your reproduction with the patch applied? It seems to fix the issue for me.

Best regards, Andrey Borodin.

Attachments:

v4-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch (application/octet-stream)
#10 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#9)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 28 Jun 2025, at 21:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

It seems to fix the issue for me.

ISTM I was wrong: there is a possible recovery conflict with a snapshot.

REDO:
frame #2: 0x000000010179a0c8 postgres`pg_usleep(microsec=1000000) at pgsleep.c:50:10
frame #3: 0x000000010144c108 postgres`WaitExceedsMaxStandbyDelay(wait_event_info=134217772) at standby.c:248:2
frame #4: 0x000000010144a63c postgres`ResolveRecoveryConflictWithVirtualXIDs(waitlist=0x0000000126008200, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=134217772, report_waiting=true) at standby.c:384:8
frame #5: 0x000000010144a4f4 postgres`ResolveRecoveryConflictWithSnapshot(snapshotConflictHorizon=1214, isCatalogRel=false, locator=(spcOid = 1663, dbOid = 5, relNumber = 16384)) at standby.c:490:2
frame #6: 0x0000000100e4d3f8 postgres`heap_xlog_prune_freeze(record=0x0000000135808e60) at heapam.c:9208:4
frame #7: 0x0000000100e4d204 postgres`heap2_redo(record=0x0000000135808e60) at heapam.c:10353:4
frame #8: 0x0000000100f1548c postgres`ApplyWalRecord(xlogreader=0x0000000135808e60, record=0x0000000138058060, replayTLI=0x000000016f0425b0) at xlogrecovery.c:1991:2
frame #9: 0x0000000100f13ff0 postgres`PerformWalRecovery at xlogrecovery.c:1822:4
frame #10: 0x0000000100ef7940 postgres`StartupXLOG at xlog.c:5821:3
frame #11: 0x0000000101364334 postgres`StartupProcessMain(startup_data=0x0000000000000000, startup_data_len=0) at startup.c:258:2

SELECT:
frame #10: 0x0000000102a14684 postgres`GetMultiXactIdMembers(multi=278, members=0x000000016d4f9498, from_pgupgrade=false, isLockOnly=false) at multixact.c:1493:6
frame #11: 0x0000000102991814 postgres`MultiXactIdGetUpdateXid(xmax=278, t_infomask=4416) at heapam.c:7478:13
frame #12: 0x0000000102985450 postgres`HeapTupleGetUpdateXid(tuple=0x00000001043e5c60) at heapam.c:7519:9
frame #13: 0x00000001029a0360 postgres`HeapTupleSatisfiesMVCC(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1090:10
frame #14: 0x000000010299fbc8 postgres`HeapTupleSatisfiesVisibility(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1772:11
frame #15: 0x0000000102982954 postgres`page_collect_tuples(scan=0x000000014b009648, snapshot=0x000000015b07b930, page="", buffer=69, block=6, lines=228, all_visible=false, check_serializable=false) at heapam.c:480:12

page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. The tuple visibility check might need WAL replay to proceed: we have to wait until some next MX is filled in.
Which might need a buffer lock or have a snapshot conflict with the caller of page_collect_tuples().

Please find attached a dirty test; it reproduces the problem on my machine (a startup deadlock: when it reproduces, the test takes 180 s, versus normally passing in 10 s).
Also, there is a fix: checking for recovery conflicts when falling back to the case 2 MX read.

I do not feel comfortable using interrupts while InterruptHoldoffCount > 0, so I need help from someone more knowledgeable about our interrupt machinery to tell me whether what I'm proposing is OK. (Álvaro?)

Also, I've modified the code to make the race condition more reproducible:

    multi = GetNewMultiXactId(nmembers, &offset);
    /* random sleep to make the WAL order differ from the order of usage on pages */
    if (rand() % 2 == 0)
        pg_usleep(1000);
    (void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);

Perhaps I can build a fast injection-points test if we want it.

Best regards, Andrey Borodin.

Attachments:

v5-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch (application/octet-stream)
#11 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#10)
Re: IPC/MultixactCreation on the Standby server

On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. The tuple visibility check might need WAL replay to proceed: we have to wait until some next MX is filled in.
Which might need a buffer lock or have a snapshot conflict with the caller of page_collect_tuples().

Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. We check for recovery conflicts even in the presence of InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Personally, I see option 2 as very invasive, in code that I'm not too familiar with. Option 1 is clumsy. And option 3 is a giant system-wide change.
Yet I see 3 as the correct solution. Can't we just abstain from HOLD_INTERRUPTS() if the LWLock taken is not exclusive?

Best regards, Andrey Borodin.

#12 Álvaro Herrera <alvherre@kurilemu.de>
In reply to: Andrey Borodin (#11)
Re: IPC/MultixactCreation on the Standby server

On 2025-Jul-17, Andrey Borodin wrote:

Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. We check for recovery conflicts even in the presence of InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Hmm, as you say, doing (3) is a very invasive system-wide change, but
can we do it in a more localized way? I mean, what if we do RESUME_INTERRUPTS()
just before going to sleep on the CV, and restore with HOLD_INTERRUPTS()
once the sleep is done? That would only affect this one place rather
than the whole system, and should also (AFAICS) solve the issue.
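
Something like this, right at the CV sleep (a sketch only; whether the CHECK_FOR_INTERRUPTS() belongs there as well is part of the question):

    RESUME_INTERRUPTS();        /* undo the HOLD taken by LWLockAcquire() */
    CHECK_FOR_INTERRUPTS();     /* react to a pending recovery conflict */
    ConditionVariableSleep(&MultiXactState->nextoff_cv,
                           WAIT_EVENT_MULTIXACT_CREATION);
    HOLD_INTERRUPTS();          /* restore the invariant before continuing */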

Yet I see 3 as the correct solution. Can't we just abstain from
HOLD_INTERRUPTS() if the LWLock taken is not exclusive?

Hmm, the code in LWLockAcquire says

    /*
     * Lock out cancel/die interrupts until we exit the code section protected
     * by the LWLock.  This ensures that interrupts will not interfere with
     * manipulations of data structures in shared memory.
     */
    HOLD_INTERRUPTS();

which means if we want to change this, we would have to inspect every
single use of LWLocks in shared mode in order to be certain that such a
change isn't problematic. This is a discussion I'm not prepared for.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Si quieres ser creativo, aprende el arte de perder el tiempo"

#13 Álvaro Herrera <alvherre@kurilemu.de>
In reply to: Álvaro Herrera (#12)
Re: IPC/MultixactCreation on the Standby server

Hello,

Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.

One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.

Anyway, he's going to try and implement this.

Andrey, please let me know if I misunderstood the idea.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#14 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Álvaro Herrera (#13)
2 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 18 Jul 2025, at 16:53, Álvaro Herrera <alvherre@kurilemu.de> wrote:

Hello,

Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.

One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.

We always redo multixact creation before the multixact is visible anywhere on the heap.
The problem was that to read a multi we might need another multi's offset, and that multi might not have been WAL-logged yet.
However, I think we never need to read a multi before it is redone.

Anyway, he's going to try and implement this.

Andrey, please let me know if I misunderstood the idea.

Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Best regards, Andrey Borodin.

Attachments:

v6-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch (application/octet-stream)
v6-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream)
#15 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#14)
2 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 18 Jul 2025, at 18:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Here's v7, with improved comments and a cross-check for correctness.
Also, MultiXact wraparound is handled.
I'm planning to prepare tests and fixes for all supported branches, if there are no objections to this approach.

Best regards, Andrey Borodin.

Attachments:

v7-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch (application/octet-stream)
v7-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream)
#16 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#15)
Re: IPC/MultixactCreation on the Standby server

On 21 Jul 2025, at 19:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I'm planning to prepare tests and fixes for all supported branches

This is a status update message. I've reproduced the problem on REL_13_STABLE and verified that the proposed fix works there.

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple that uses this multi will become unreadable. Any attempt to read it will hang forever.

I've reproduced the problem and now I'm working on scripting this scenario. Basically, I modify code to hang forever after assigning multi number 2. Then execute in first psql:

create table x as select i,0 v from generate_series(1,10) i;
create unique index on x(i);

\set id 1
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 1
commit;

\set id 2
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 2 -- will hang
commit;

Then in second psql:

create table y as select i,0 v from generate_series(1,10) i;
create unique index on y(i);

\set id 1
begin;
select * from y where i = :id for no key update;
savepoint s1;
update y set v = v+1 where i = :id;
commit;

After this I pkill -9 postgres. The recovered installation cannot execute "select * from x;" because multi 1 cannot be read without recovering multi 2, which was never logged.

Luckily, the fix is the same: just restore the offset of multi 2 when multi 1 is recovered.

Best regards, Andrey Borodin.

#17 Álvaro Herrera <alvherre@kurilemu.de>
In reply to: Andrey Borodin (#16)
Re: IPC/MultixactCreation on the Standby server

On 2025-Jul-25, Andrey Borodin wrote:

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script). I agree it's a pretty ugly bug! I think it's
essentially the same bug as the other problem, so the proposed fix
should solve both.

Thanks for working on this!

Looking at this,

    /*
     * We want to avoid edge case 2 in redo, because we cannot wait for
     * startup process in GetMultiXactIdMembers() without risk of a
     * deadlock.
     */
    MultiXactId next = multi + 1;
    int         next_pageno;

    /* Handle wraparound as GetMultiXactIdMembers() does it. */
    if (multi < FirstMultiXactId)
        multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?

In this bit,

* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind. Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why WAL-logging
isn't necessary ... until I remembered that we're in a standby. I think a
simple comment here like "no WAL-logging because we're a standby" should
suffice.)

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#18 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Álvaro Herrera (#17)
Re: IPC/MultixactCreation on the Standby server

On 26 Jul 2025, at 22:44, Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Jul-25, Andrey Borodin wrote:

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script).

Yes, I explained a bit incorrectly, but you got the problem correctly.

Looking at this,

    /*
     * We want to avoid edge case 2 in redo, because we cannot wait for
     * startup process in GetMultiXactIdMembers() without risk of a
     * deadlock.
     */
    MultiXactId next = multi + 1;
    int         next_pageno;

    /* Handle wraparound as GetMultiXactIdMembers() does it. */
    if (multi < FirstMultiXactId)
        multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?

Yup, that was a typo.

In this bit,

* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind.

Ugh, that uncovered a 17+ problem: now we hold a couple of locks simultaneously. I'll post a version with this a bit later.

Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why WAL-logging
isn't necessary ... until I remembered that we're in a standby. I think a
simple comment here like "no WAL-logging because we're a standby" should
suffice.)

Agree.

I've made a test [0] and discovered another problem. Adding this checkpoint breaks the test [1] even after a fix [2].
I suspect that excluding "edge case 2" on the standby is simply not enough... we have to do this "next offset" dance on the primary too. I'll think more about other options.

Best regards, Andrey Borodin.
[0]: https://github.com/x4m/postgres_g/commit/eafcaec7aafde064b0da5d2ba4041ed2fb134f07
[1]: https://github.com/x4m/postgres_g/commit/da762c7cac56eff1988ea9126171ca0a6d2665e9
[2]: https://github.com/x4m/postgres_g/commit/d64c17d697d082856e5fe8bd52abafc0585973af

The timeline of these commits can be seen here: https://github.com/x4m/postgres_g/commits/mx19/

#19 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#18)
2 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 27 Jul 2025, at 16:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

we have to do this "next offset" dance on the primary too.

PFA draft of this.
I also attach a version for PG17, maybe Dmitry could try to reproduce the problem with this patch. I think the problem should be fixed by the patch.

Thanks!

Best regards, Andrey Borodin.

Attachments:

v8-PG17-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
v8-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
#20 Dmitry <dsy.075@yandex.ru>
In reply to: Andrey Borodin (#19)
Re: IPC/MultixactCreation on the Standby server

On 28.07.2025 15:49, Andrey Borodin wrote:

I also attach a version for PG17, maybe Dmitry could try to reproduce the problem with this patch.

Andrey, thank you very much for your work, and also thanks to Álvaro for
joining the discussion on the problem. I ran tests on PG17 with patch
v8, there are no more sessions hanging on the replica, great! Replica
requests are canceled with recovery conflicts. ERROR: canceling
statement due to conflict with recovery DETAIL: User was holding shared
buffer pin for too long. STATEMENT: select sum(val) from tbl2; or ERROR:
canceling statement due to conflict with recovery DETAIL: User query
might have needed to see row versions that must be removed. STATEMENT:
select sum(val) from tbl2;

But on the master, some of the requests then fail with an error,
apparently invalid multixact's remain in the pages. ERROR: MultiXact
81926 has invalid next offset STATEMENT: select * from tbl2 where id =
$1 for no key update; ERROR: MultiXact 81941 has invalid next offset
CONTEXT: while scanning block 3 offset 244 of relation "public.tbl2"
automatic vacuum of table "postgres.public.tbl2" Best regards, Dmitry.

#21 Dmitry <dsy.075@yandex.ru>
In reply to: Andrey Borodin (#19)
Re: IPC/MultixactCreation on the Standby server

I'll duplicate the message, the previous one turned out to have poor
formatting, sorry.

On 28.07.2025 15:49, Andrey Borodin wrote:

I also attach a version for PG17, maybe Dmitry could try to reproduce
the problem with this patch.

Andrey, thank you very much for your work, and also thanks to Álvaro for
joining the discussion on the problem.
I ran tests on PG17 with patch v8; there are no more sessions hanging on
the replica, great!

Replica requests are canceled with recovery conflicts.

ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
STATEMENT: select sum(val) from tbl2;

or

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
STATEMENT: select sum(val) from tbl2;

But on the master, some of the requests then fail with an error;
apparently invalid multixacts remain in the pages.

ERROR: MultiXact 81926 has invalid next offset
STATEMENT: select * from tbl2 where id = $1 for no key update;

ERROR: MultiXact 81941 has invalid next offset
CONTEXT: while scanning block 3 offset 244 of relation "public.tbl2"
automatic vacuum of table "postgres.public.tbl2"

Best regards,
Dmitry.

#22 Yura Sokolov <y.sokolov@postgrespro.ru>
In reply to: Andrey Borodin (#11)
1 attachment(s)
page_collect_tuples without long lock on page (Was Re: IPC/MultixactCreation on the Standby server)

On 17.07.2025 21:34, Andrey Borodin wrote:

On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. The tuple visibility check might need WAL replay to proceed: we have to wait until some next MX is filled in.
Which might need a buffer lock or have a snapshot conflict with the caller of page_collect_tuples().

Thinking more about the problem, I see 3 ways to deal with this deadlock:
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.

Personally, I see option 2 as very invasive, in code that I'm not too familiar with.

If there were no SetHintBits() inside HeapTupleSatisfies*, then it could
be just "copy the line pointers and tuple headers under the lock, release
the lock, check tuple visibility using the copied arrays".

But hint bits make it much more difficult.

Probably, the tuple headers could be copied twice and compared afterwards.
If there is a change in the hint bits, the page should be relocked.

And the call to MarkBufferDirtyHint should be delayed.
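
In outline (just a sketch of the double-copy idea, not the actual prototype; header_changed_during_check() is a made-up helper):

    static bool
    header_changed_during_check(Buffer buffer, HeapTupleHeader tup_on_page)
    {
        HeapTupleHeaderData copy;
        bool        changed;

        /* Copy the header under the shared content lock. */
        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        memcpy(&copy, tup_on_page, SizeofHeapTupleHeader);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

        /*
         * The visibility check runs here on the copy, with no buffer lock
         * held, so a blocked multixact lookup cannot deadlock with redo.
         */

        /* Afterwards, detect concurrent hint-bit changes. */
        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        changed = (memcmp(&copy, tup_on_page, SizeofHeapTupleHeader) != 0);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

        return changed;     /* caller relocks the page and retries if true */
    }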

A very dirty variant is attached; I made it just for fun. It passes
'regress', 'isolation' and 'recovery', but I didn't benchmark it.

--
regards
Yura Sokolov aka funny-falcon

Attachments:

page_collect_tuples.diff (text/x-patch)
#23 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Dmitry (#21)
Re: IPC/MultixactCreation on the Standby server

On 29 Jul 2025, at 12:17, Dmitry <dsy.075@yandex.ru> wrote:

But on the master, some of the requests then fail with an error; apparently invalid multixacts remain in the pages.

Thanks!

That's a bug in my patch, and I do not understand it yet. I've reproduced it with your original workload.
Most of the errors I see are shallow (offset == 0 or nextOffset == 0), but this one is interesting:

TRAP: failed Assert("shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS"), File: "slru.c", Line: 729, PID: 91085
0 postgres 0x00000001032ea5ac ExceptionalCondition + 216
1 postgres 0x0000000102af2784 SlruInternalWritePage + 700
2 postgres 0x0000000102af14dc SimpleLruWritePage + 96
3 postgres 0x0000000102ae89d4 RecordNewMultiXact + 576

So it makes me think that it's some kind of IO concurrency issue.
As expected, the error only persists if the "extend SLRU" branch is active in RecordNewMultiXact().

Thanks for testing!

Best regards, Andrey Borodin.

#24 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Andrey Borodin (#23)
2 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I do not understand it yet.

OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with a real extension by ExtendMultiXactOffset().
So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loophole for wraparound...

Here are two updated patches, one for Postgres 17 and one for master (with a test).

Best regards, Andrey Borodin.

Attachments:

v9-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
v9-PG17-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
#25 Dmitry <dsy.075@yandex.ru>
In reply to: Andrey Borodin (#24)
Re: IPC/MultixactCreation on the Standby server

On 31.07.2025 09:29, Andrey Borodin wrote:

Here are two updated patches, one for Postgres 17 and one for master
(with a test).

I ran tests on PG17 with patch v9.

I tried to reproduce it for three cases: the first when we explicitly
use FOR KEY SHARE, the second through subtransactions, and the third
through implicit use of FOR KEY SHARE via a foreign key.

There are no more sessions hanging on the replica, that's great!

Thank you all very much!

Best regards,
Dmitry.

#26 Kirill Reshke <reshkekirill@gmail.com>
In reply to: Andrey Borodin (#24)
Re: IPC/MultixactCreation on the Standby server

On Thu, 31 Jul 2025 at 11:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I do not understand it yet.

OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with a real extension by ExtendMultiXactOffset().
So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loophole for wraparound...

Here are two updated patches, one for Postgres 17 and one for master (with a test).

Hi!

+    /*
+     * We might have filled this offset previously.
+     * Cross-check for correctness.
+     */
+    Assert((*offptr == 0) || (*offptr == offset));

Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr !=
0 and *offptr != offset?

+    /* Read and adjust next page */
+    next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+    next_offptr = (MultiXactOffset *)
+        MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+    next_offptr[next_entryno] = offset + nmembers;

Should we check that the value of next_offptr[next_entryno] is equal to
zero or to offset + nmembers? An Assert or
errcode(ERRCODE_DATA_CORRUPTED) as well?

--
Best regards,
Kirill Reshke

#27 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Kirill Reshke (#26)
Re: IPC/MultixactCreation on the Standby server

Hi Kirill, thanks for looking into this!

On 20 Aug 2025, at 12:19, Kirill Reshke <reshkekirill@gmail.com> wrote:

+    /*
+     * We might have filled this offset previously.
+     * Cross-check for correctness.
+     */
+    Assert((*offptr == 0) || (*offptr == offset));

Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr !=
0 and *offptr != offset?

No, we should not exit. We encountered inconsistencies that we are fully prepared to fix. But you are right: we had better emit a WARNING with XX001.
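
Something like this (a sketch; the exact message wording is to be decided):

    if (*offptr != 0 && *offptr != offset)
        ereport(WARNING,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("offset of multixact %u is inconsistent", multi)));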

+    /* Read and adjust next page */
+    next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+    next_offptr = (MultiXactOffset *)
+        MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+    next_offptr[next_entryno] = offset + nmembers;

Should we check that the value of next_offptr[next_entryno] is equal to
zero or to offset + nmembers? An Assert or
errcode(ERRCODE_DATA_CORRUPTED) as well?

Yes, we'd better WARN the user here.

Thanks for your valuable suggestions. I'm not sending a new version of the patch because I'm waiting for input on the overall design from Álvaro or any committer willing to fix this. We need to figure out whether this radical approach is acceptable to backpatch. I do not see other options, but someone might have more clever ideas.

Best regards, Andrey Borodin.

#28 Ivan Bykov <i.bykov@modernsys.ru>
In reply to: Andrey Borodin (#27)
Re: IPC/MultixactCreation on the Standby server

Hi, Andrey!

I started reviewing the TAP tests, and for the current master (14ee8e6403001c3788f2622cdcf81a8451502dc2),
src/test/modules/test_slru/t/001_multixact.pl reproduces the problem, but we can do it in a more transparent way.

The test should fail on timeout; otherwise, it would be hard to find the source of the problem.
The current state of 001_multixact.pl just leaves the 'mike' node in a deadlock state, and
it looks like a very long test that doesn't finish for vague reasons.

I believe that we should provide more comprehensible feedback to developers
by interrupting the deadlock state with a psql timeout. 15 seconds seems safe
enough to distinguish between slow node operation and a deadlock issue.

Here is the modified 001_multixact.pl:
----
# Verify that the recorded multi is readable; this call must not hang.
# Also note that all injection points disappeared after the server restart.
my $timed_out = 0;
$node->safe_psql(
    'postgres',
    qq{SELECT test_read_multixact('$multi'::xid);},
    timeout => 15,
    timed_out => \$timed_out);
ok($timed_out == 0, 'recorded multi is readable');

$node->stop;

done_testing();
----

#29 Andrey Borodin <x4mmm@yandex-team.ru>
In reply to: Ivan Bykov (#28)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

Hi Ivan!

Thanks for the review.

On 24 Oct 2025, at 19:33, Ivan Bykov <I.Bykov@modernsys.ru> wrote:

I believe that we should provide more comprehensible feedback to developers
by interrupting the deadlock state with a psql timeout. 15 seconds seems safe
enough to distinguish between slow node operation and a deadlock issue.

Makes sense, but I used $PostgreSQL::Test::Utils::timeout_default seconds.

On 24 Oct 2025, at 22:16, Bykov Ivan <i.bykov@ftdata.ru> wrote:

In GetNewMultiXactId() this code may lead to an error
---
ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
---
If MultiXactState->nextMXact = MaxMultiXactId (0xFFFFFFFF),
we will not extend the MultiXactOffset SLRU as we should.

It will extend the SLRU; these calculations are handled in ExtendMultiXactOffset(). Moreover, we should eliminate the "multi != FirstMultiXactId" bailout condition too. I've extended the comments and added an Assert(). I've also added a test for this, but it's whacky-hacky with dd, resetwal and such. Still, it uncovered a wrong assertion in this code:
    /* This is an overflow-only branch */
    Assert(next_entryno == 0 || next == FirstMultiXactId);
The "next == FirstMultiXactId" part was missing.

I also included Kirill's suggestions.

GPT is also worried by the initialization of page 0, but I don't fully understand its concerns:
"current MXID’s page is never extended: GetNewMultiXactId() now calls ExtendMultiXactOffset(MultiXactState->nextMXact + 1) instead of extending the page that will hold result. This skips zeroing the page for the MXID we are about to assign, so the first allocation on a fresh cluster (or after wrap) tries to write into an uninitialized SLRU page and will fail/crash once the buffer manager attempts I/O."

I think we now have good coverage of what happens on a fresh cluster and after a wraparound.

"wraparound can’t recreate page 0: ExtendMultiXactOffset() now asserts multi != FirstMultiXactId, yet after wrap the first ID on page 0 is exactly FirstMultiXactId. Because callers no longer pass that value, page 0 is never re-created and we keep reusing stale offsets."

Page 0 is actually created via a different path at cluster initialization. Though everything works fine on wraparound.

Thanks!

Best regards, Andrey Borodin.

Attachments:

v10-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patchapplication/octet-stream; name=v10-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch; x-unix-mode=0644
#30Ivan Bykov
Ivan Bykov
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#29)
Re: IPC/MultixactCreation on the Standby server

Hi!

It seems my previous email was sent only to Andrey directly and didn't pass moderation
because it had a patch attached. I've now resent it, without the patch, from another email address.

----
In GetNewMultiXactId() this code may lead to error
---
ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
---
If MultiXactState->nextMXact = MaxMultiXactId (0xFFFFFFFF)
we will not extend MultiXactOffset as we should
---
ExtendMultiXactOffset(0);
MultiXactIdToOffsetEntry(0)
multi % MULTIXACT_OFFSETS_PER_PAGE = 0
return; /* skip SLRU extension */
---

Perhaps we should introduce a simple function to handle the next-MultiXact
calculation:
---
static inline MultiXactId
NextMultiXactId(MultiXactId multi)
{
return multi == MaxMultiXactId ? FirstMultiXactId : multi + 1;
}
---
I've attached a patch that fixes this issue, although it seems I've discovered
another overflow bug in multixact_redo().
We might call:
---
multixact_redo()
MultiXactAdvanceNextMXact(0xFFFFFFFF + 1, ...);
---
And if MultiXactState->nextMXact != InvalidMultiXactId (0), we will have
MultiXactState->nextMXact = 0.
This appears to cause problems in code that assumes MultiXactState->nextMXact
holds a valid MultiXactId.
For instance, in GetMultiXactId(), we would return an incorrect number
of MultiXacts.

That said, spreading MultiXact overflow handling throughout the multixact.c code
seems error-prone.
Maybe we should use a macro instead (which would also allow us to modify this
check and add compiler hints):

---
#define MultiXactAdjustOverflow(mxact) \
if (unlikely((mxact) < FirstMultiXactId)) \
mxact = FirstMultiXactId;
---

#31Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Ivan Bykov (#30)
Re: IPC/MultixactCreation on the Standby server

On 24 Nov 2025, at 11:20, Ivan Bykov <I.Bykov@modernsys.ru> wrote:

It seems my previous email was sent only to Andrey directly and didn't pass moderation
because it had a patch attached. I've now resent it, without the patch, from another email address.

Hi Ivan!

Nope, the message hit the list, but only as a separate thread. To reply into the same thread, please use the "Reply all" function of your mail client. Also, it's a good idea to quote the relevant part of the message that you are replying to.

I've included a quotation of your message in my answer in this thread. It's important to stick to one thread, to help folks coming from the commitfest connect the dots and find your messages.

If you do not have the message in your mailbox, you can send it to yourself from the archives; there's a "Resend email" button for this purpose.
Your message from "i.bykov@modernsys.ru" landed in the correct thread.

Thanks for your review!

Best regards, Andrey Borodin.

#32Ivan Bykov
Ivan Bykov
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#31)
Re: IPC/MultixactCreation on the Standby server

Hi, Andrey!

Thanks for your review!

The review is still in progress, sorry for the delay. I didn't have enough time to fully understand the changes you suggest, but it seems there is
only a small gap in my understanding of what the patch does. Here is my explanation of the problem.

The main problem
================

The main problem is that we may lose the session context before writing the offset to the SLRU (while we may
already have written a WAL record). The writer process may get stuck in XLogInsert, or even fail, between the GetNewMultiXactId
and RecordNewMultiXact calls. In this case, readers will wait for a condition variable signal (from new multixacts)
but will never find a valid offset for the "failed" multixid (which may nevertheless be in the WAL).

I illustrate this with the following diagram.

Writer                                        Reader             
--------------------------------------------------------------------------------
 MultiXactIdCreateFromMembers
 -> GetNewMultiXactId (101)
                                              GetMultiXactIdMembers(100)
                                              -> LWLockAcquire(MultiXactOffset)
                                              -> read offset 100
                                              -> read offset 101
                                              -> LWLockRelease(MultiXactOffset)
                                                 offset 101 == 0
                                              -> ConditionVariableSleep()
+--------------------------------------------------------------------------------------+
|-> XLogInsert                                                                         |
+--------------------------------------------------------------------------------------+
 -> RecordNewMultiXact
  -> LWLockAcquire(MultiXactOffset)
  -> write offset 101
  -> LWLockRelease(MultiXactOffset)
  -> ConditionVariableBroadcast(nextoff_cv);  
                                              -> retry:
                                              -> LWLockAcquire(MultiXactOffset)
                                              -> read offset 100
                                              -> read offset 101
                                              -> LWLockRelease(MultiXactOffset)
                                                 offset 101 != 0
                                              -> length = offset 101 - read offset 100
  -> LWLockAcquire(MultiXactMember)
  -> write members 101
  -> LWLockRelease(MultiXactMember)

--------------------------------------------------------------------------------------

As far as I can see, your proposal addresses exactly that problem.
The main difference from the former solution is that all the information the reader requires
is written to the MultiXactOffset SLRU atomically, at multixact insertion.
Before this change, we effectively extended the multixact insertion time window until the next multixact
insertion, which seems a risky design.

I illustrate the new solution with the following diagram.

Writer                                        Reader
--------------------------------------------------------------------------------
 MultiXactIdCreateFromMembers
 -> GetNewMultiXactId (100)
 -> XLogInsert
 -> RecordNewMultiXact
  -> LWLockAcquire(MultiXactOffset)
  -> write offset 100
  -> write offset 101   *** the new step ***
  -> LWLockRelease(MultiXactOffset)
                                              GetMultiXactIdMembers(100)
                                              -> LWLockAcquire(MultiXactOffset)
                                              -> read offset 100
                                              -> read offset 101
                                              -> LWLockRelease(MultiXactOffset)
                                                 offset 101 != 0, so the
                                                 ConditionVariableSleep() branch
                                                 is never taken
                                              -> length = offset 101 - offset 100
--------------------------------------------------------------------------------

So if I understand the core idea of your solution correctly, I think the code in the latest patch
(v10-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch) is correct and does what it should.

#33Ivan Bykov
Ivan Bykov
i.bykov@modernsys.ru
In reply to: Ivan Bykov (#32)
Re: IPC/MultixactCreation on the Standby server

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

Hi, Andrey!

The patch applies cleanly to d4c0f91f (master).

Here are some minor review comments.

***

It's worth adding a comment explaining why we don't use the lock-swap protocol for MultiXactOffsetCtl in RecordNewMultiXact,
unlike for MultiXactMemberCtl (where we check whether rotation is required before swapping the lock).

Note that the lock will be the same only at wraparound (when MultiXactOffsetCtl->nbanks = 1).
Since this is a rare case, it's not worth handling in the code, but it should be documented with a comment.

It seems even more confusing if we inspect GetMultiXactIdMembers, where MultiXactOffsetCtl
checks for the wraparound case before rotating the lock (rotation is only required at wraparound).
So it would be better to simplify the MultiXactOffsetCtl lock swapping in GetMultiXactIdMembers
in the same manner as in RecordNewMultiXact (dropping the extra checks, because they report that a swap is required almost every time).

***

I believe we should use int64 instead of int for next_pageno in RecordNewMultiXact().
There's a specific reason for using int64; see below:

4ed8f09 Index SLRUs by 64-bit integers rather than by 32-bit integers
/messages/by-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyjVWA@mail.gmail.com
/messages/by-id/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com

***

We should delete the MULTIXACT_CREATION wait event from src/backend/utils/activity/wait_event_names.txt,
because we are deleting the corresponding condition variable.

***

I also noticed some minor typos; here are the corrected versions.

----
/* Also we record next offset here */

Kill9 works unpredictably on Windows / there was an extra 'a' in 'unpredictably'

# Verify that recorded multi is readable, this call must not hang.

recorded multi is readable

# Test mxid wraparound

The new status of this patch is: Waiting on Author

#34Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Ivan Bykov (#33)
Re: IPC/MultixactCreation on the Standby server

This approach makes a lot of sense to me.

I think there's one more case that this fixes: If someone uses
"pg_resetwal -m ..." to advance nextMulti, that's yet another case where
the last multixid becomes unreadable because we haven't set the next
mxid's offset in the SLRU yet. People shouldn't be running pg_resetwal
unnecessarily, but the docs for pg_resetwal include instructions on how
to determine a safe value for nextMulti. You will in fact lose the last
multixid if you follow the instructions, so it's not as safe as it
sounds. This patch fixes that issue too.

Álvaro, are you planning to commit this? I've been looking at this code
lately for the 64-bit mxoffsets patch, so I could also pick it up if
you'd like.

@@ -1383,7 +1417,10 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at.  In this case the nextOffset value we just
* saved is the correct endpoint.
+	 * TODO: how does it work on Standby?  MultiXactState->nextMXact does not seem to be up-to date.
+	 * nextMXact and nextOffset are in sync, so nothing bad can happen, but nextMXact seems mostly random.

That's scary. AFAICS MultiXactState->nextMXact should be up-to-date in
hot standby mode. Are we missing MultiXactAdvanceNextMXact() calls
somewhere, or what's going on?

The case 1. is still valid, you can indeed just look at the saved
nextOffset. But now the next offset should also be set on the SLRU,
except right after upgrading to a version that has this patch. So I
guess we still need to have this special case to deal with upgrades.

*
+ * THIS IS NOT POSSIBLE ANYMORE, KEEP IT FOR HISTORIC REASONS.
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that

This comment needs to be cleaned up to explain how all this works now...

@@ -1138,8 +1168,13 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}

-	/* Make sure there is room for the MXID in the file.  */
-	ExtendMultiXactOffset(result);
+	/*
+	 * Make sure there is room for the MXID and next offset in the file.
+	 * We might overflow to the next segment, but we don't need to handle
+	 * FirstMultiXactId specifically, because ExtendMultiXactOffset handles
+	 * both cases well: 0 offset and FirstMultiXactId would create segment.
+	 */
+	ExtendMultiXactOffset(MultiXactState->nextMXact + 1);

/*
* Reserve the members space, similarly to above. Also, be careful not to

Does this create the file correctly, if you upgrade the binary to a new
version that contains this patch, and nextMXact was at a segment
boundary before the upgrade?

@@ -2487,10 +2509,11 @@ ExtendMultiXactOffset(MultiXactId multi)

/*
* No work except at first MultiXactId of a page.  But beware: just after
-	 * wraparound, the first MultiXactId of page zero is FirstMultiXactId.
+	 * wraparound, the first MultiXactId of page zero is FirstMultiXactId,
+	 * make sure we are not in that case.
*/
-	if (MultiXactIdToOffsetEntry(multi) != 0 &&
-		multi != FirstMultiXactId)
+	Assert(multi != FirstMultiXactId);
+	if (MultiXactIdToOffsetEntry(multi) != 0)
return;

pageno = MultiXactIdToOffsetPage(multi);

I don't quite understand this change. I guess the point is that the
caller now never calls this with FirstMultiXactId, but it feels a bit
weird to assert and rely on that here.

I didn't look at the test changes yet.

- Heikki

#35Álvaro Herrera
Álvaro Herrera
alvherre@kurilemu.de
In reply to: Heikki Linnakangas (#34)
Re: IPC/MultixactCreation on the Standby server

On 2025-Nov-25, Heikki Linnakangas wrote:

Álvaro, are you planning to commit this? I've been looking at this code
lately for the 64-bit mxoffsets patch, so I could also pick it up if you'd
like.

I would be glad if you can get it over the finish line. I haven't found
time/energy to review it in depth.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Computing is too important to be left to men." (Karen Spärck Jones)

#36Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Álvaro Herrera (#35)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

Here's a new version of this. Notable changes:

- I reverted the changes to ExtendMultiXactOffset(), so that it deals
with wraparound and FirstMultiXactId the same way as before. The caller
never passes FirstMultiXactId, but the changed comments and the
assertion were confusing, so I felt it's best to just leave it alone.

- a bunch of comment changes & other cosmetic changes

- I modified TrimMultiXact() to initialize the page corresponding to
'nextMulti', because if you just swapped the binary to the new one, and
nextMulti was at a page boundary, it would not be initialized yet (see
the sketch below).
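
For illustration, the TrimMultiXact() change amounts to something like the
following (a hedged sketch only, using names from multixact.c and slru.c;
the attached patch is the authoritative version):

```c
/*
 * Sketch, not the committed code: at the end of recovery, make sure the
 * offsets page that holds nextMXact physically exists, in case the WAL
 * we just replayed came from a binary without this patch.
 */
static void
MaybeZeroNextOffsetsPage(MultiXactId nextMXact)
{
    int64       pageno = MultiXactIdToOffsetPage(nextMXact);
    LWLock     *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);

    LWLockAcquire(lock, LW_EXCLUSIVE);
    if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
    {
        int         slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);

        /* flush immediately, like the ZERO_OFF_PAGE redo path */
        SimpleLruWritePage(MultiXactOffsetCtl, slotno);
    }
    LWLockRelease(lock);
}
```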

If we want to backpatch this, and I think we need to because this fixes
real bugs, we need to think through all the upgrade scenarios. I made
the above-mentioned changes to TrimMultiXact(), but it doesn't fix all
the problems.

What happens if you replay WAL generated with the old binary, without
this patch, using the new binary? It's not good:

LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/01766A68
FATAL: could not access status of transaction 2048
DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset
8192: read too few bytes.
CONTEXT: WAL redo at 0/05561030 for MultiXact/CREATE_ID: 2047 offset
4093 nmembers 2: 2830 (keysh) 2831 (keysh)
LOG: startup process (PID 3130184) exited with exit code 1

This is because the WAL, created with the old version, contains records like
this:

lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093
nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095
nmembers 2: 2831 (keysh) 2832 (keysh)

When replaying that with the new version, replay of the CREATE_ID 2047
record tries to set the next multixid's offset, but the page hasn't been
initialized yet. With the new version, the ZERO_OFF_PAGE 1 record would
appear before the CREATE_ID 2047 record, but we can't change the WAL
that already exists.

- Heikki

Attachments:

v11-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patchtext/x-patch; charset=UTF-8; name=v11-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch
#37Álvaro Herrera
Álvaro Herrera
alvherre@kurilemu.de
In reply to: Heikki Linnakangas (#36)
Re: IPC/MultixactCreation on the Standby server

On 2025-11-26, Heikki Linnakangas wrote:

What happens if you replay WAL generated with the old binary, without
this patch, using the new binary? It's not good:

Maybe this needs a new record identifier, separating old WAL from WAL generated by the new code?

--
Álvaro Herrera

#38Chao Li
Chao Li
li.evan.chao@gmail.com
In reply to: Heikki Linnakangas (#36)
Re: IPC/MultixactCreation on the Standby server

Hi Heikki,

I just reviewed this patch. Since offset[x+1] equals offset[x] + nmembers anyway, pre-writing offset[x+1] seems like a very clever solution.

I have a few questions/comments below:

On Nov 27, 2025, at 04:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new version of this. Notable changes:

- I reverted the changes to ExtendMultiXactOffset(), so that it deals with wraparound and FirstMultiXactId the same way as before. The caller never passes FirstMultiXactId, but the changed comments and the assertion were confusing, so I felt it's best to just leave it alone

- bunch of comment changes & other cosmetic changes

- I modified TrimMultiXact() to initialize the page corresponding to 'nextMulti', because if you just swapped the binary to the new one, and nextMulti was at a page boundary, it would not be initialized yet.

If we want to backpatch this, and I think we need to because this fixes real bugs, we need to think through all the upgrade scenarios. I made the above-mentioned changes to TrimMultiXact(), but it doesn't fix all the problems.

What happens if you replay WAL generated with the old binary, without this patch, using the new binary? It's not good:

LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/01766A68
FATAL: could not access status of transaction 2048
DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset 8192: read too few bytes.
CONTEXT: WAL redo at 0/05561030 for MultiXact/CREATE_ID: 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
LOG: startup process (PID 3130184) exited with exit code 1

This is because the WAL, created with old version, contains records like this:

lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)

When replaying that with the new version, replay of the CREATE_ID 2047 record tries to set the next multixid's offset, but the page hasn't been initialized yet. With the new version, the ZERO_OFF_PAGE 1 record would appear before the CREATE_ID 2047 record, but we can't change the WAL that already exists.

- Heikki
<v11-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch>

1
```
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
```

This is more like a question. Since the pre-write should always happen, in theory *offptr != offset should never be true, so why do we still need to handle that case instead of just Assert(false)?

2
```
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
```

next < FirstMultiXactId will only be true when next wraps around to 0; maybe it deserves a one-line comment to explain that.

3
```
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+	}
```

Should MultiXactMemberCtl be MultiXactOffsetCtl? As we are writing to the offset SLRU.

4
```
+# Another multixact test: loosing some multixact must not affect reading near
```

I guess “loosing” should be “losing”

5
```
+# multitransaction (to avaoid corener case 1).
```

Typo: avoid corner case

6
```
+# Verify thet recorded multi is readble, this call must not hang.
```

Typo: readable

7
```
+# One more multitransaction to effectivelt emit WAL record about next
```

Typo: effectivelt -> effectively

8
```
+ plan skip_all => 'Kill9 works unpredicatably on Windows';
```

Typo: unpredicatably, no "a" after "c"

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#39Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Álvaro Herrera (#37)
Re: IPC/MultixactCreation on the Standby server

On 26/11/2025 23:15, Álvaro Herrera wrote:

On 2025-11-26, Heikki Linnakangas wrote:

What happens if you replay WAL generated with the old binary, without
this patch, using the new binary? It's not good:

Maybe this needs a new record identifier, separating old wal from that generated by the new code?

One downside of that is that the new WAL record type would be unreadable
by older versions. We recommend upgrading standbys before the primary, but
it'd still be nicer if we could avoid that.

Maybe we can make RecordNewMultiXact() tolerate the missing page, in
this special case of replaying WAL and the multixid being at the page
boundary.

- Heikki

#40Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Chao Li (#38)
Re: IPC/MultixactCreation on the Standby server

On 27/11/2025 05:39, Chao Li wrote:

<v11-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch>

1
```
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
```

This is more like a question. Since the pre-write should always happen, in theory *offptr != offset should never be true, so why do we still need to handle that case instead of just Assert(false)?

No, *offptr == 0 can happen if multiple backends are generating
multixids concurrently. It's possible that we get here before the
RecordNewMultiXact() call for the previous multixid.

Similarly, the *next_offptr != 0 case can happen if another backend
calls RecordNewMultiXact() concurrently for the next multixid.

2
```
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
```

next < FirstMultiXactId will only be true when next wraps around to 0, maybe deserve one-line comment to explain that.

This is a common pattern used in many places in the file.

3
```
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+	}
```

Should MultiXactMemberCtl be MultiXactOffsetCtl? As we are writing to the offset SLRU.

Good catch! Yes, it should be MultiXactOffsetCtl.

- Heikki

#41Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#36)
Re: IPC/MultixactCreation on the Standby server

On 27 Nov 2025, at 01:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is because the WAL, created with old version, contains records like this:

lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)

That's an interesting case. I don't see how the SLRU interface can be used to test whether an SLRU page is initialized correctly. We need a version of SimpleLruReadPage() that avoids failing if the page does not exist yet, and it must not be more expensive than the current SimpleLruReadPage(). Alternatively, we need a new XLOG_MULTIXACT_CREATE_ID_2 in the back branches.

BTW, my concern about MultiXactState->nextMXact was wrong, now I see it. I was almost certain that something was wrong and only worked by accident back in the summer, but now everything looks 100% correct...

Also, when working on v10 I asked an LLM for help with the commit message, and it hallucinated Álvaro's e-mail <alvherre@postgresql.org>. IDK, maybe it's real, but it was not used in this thread.

- I reverted the changes to ExtendMultiXactOffset(), so that it deals with wraparound and FirstMultiXactId the same way as before. The caller never passes FirstMultiXactId, but the changed comments and the assertion were confusing, so I felt it's best to just leave it alone

Maybe move a decision (to extend or not to extend) out of this function? Now its purpose is "MaybeExtendMultiXactOffset". And there's just one caller.

Thanks for picking this up! (And reading the other thread about multixacts more attentively, I see I misinformed you that this bug is connected to that one... sorry! I'll review that one too!)

Best regards, Andrey Borodin.

#42Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#41)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 27/11/2025 20:25, Andrey Borodin wrote:

On 27 Nov 2025, at 01:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is because the WAL, created with old version, contains records like this:

lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)

That's an interesting case. I don't see how the SLRU interface can be
used to test whether an SLRU page is initialized correctly. We need a
version of SimpleLruReadPage() that avoids failing if the page does
not exist yet, and it must not be more expensive than the current
SimpleLruReadPage(). Alternatively, we need a new
XLOG_MULTIXACT_CREATE_ID_2 in the back branches.

There is SimpleLruDoesPhysicalPageExist() to check whether a page has been
initialized. There's also the 'latest_page_number' field, which tracks
the latest page that has been initialized.

Here's a new version that uses 'latest_page_number' to detect whether we're
in this situation (= replaying WAL generated with an older minor version).
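
Conceptually, the detection is something like the fragment below (a sketch
against the pre-v17 SLRU code, where latest_page_number is a plain field
rather than an atomic; the attached patch is the real thing):

```c
/*
 * Sketch only: while replaying WAL generated by an older minor version,
 * the page that would hold the next multixid's offset may not have been
 * zeroed yet.  In that case, skip pre-writing the next offset; the
 * pre-existing code paths still cope with a missing next offset.
 */
if (InRecovery &&
    next_pageno > MultiXactOffsetCtl->shared->latest_page_number)
    return;
```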

(I still haven't looked at the tests)

BTW, my concern about MultiXactState->nextMXact was wrong, now I see
it. I was almost certain that something was wrong and only worked by
accident back in the summer, but now everything looks 100% correct...

Ok, thanks for double-checking.

Also, when working on v10 I asked an LLM for help with the commit
message, and it hallucinated Álvaro's e-mail
<alvherre@postgresql.org>. IDK, maybe it's real, but it was not used
in this thread.

Heh, ok, fixed that. You also came up with a last name for Dmitry that I
haven't seen mentioned anywhere in this thread, either.

- I reverted the changes to ExtendMultiXactOffset(), so that it
deals with wraparound and FirstMultiXactId the same way as before.
The caller never passes FirstMultiXactId, but the changed comments
and the assertion were confusing, so I felt it's best to just
leave it alone

Maybe move a decision (to extend or not to extend) out of this
function? Now its purpose is "MaybeExtendMultiXactOffset". And
there's just one caller.

Perhaps. Let's leave that for a separate non-backported patch though.

- Heikki

Attachments:

v12-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patchtext/x-patch; charset=UTF-8; name=v12-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patch
#43Álvaro Herrera
Álvaro Herrera
alvherre@kurilemu.de
In reply to: Heikki Linnakangas (#42)
Re: IPC/MultixactCreation on the Standby server

On 2025-Nov-28, Heikki Linnakangas wrote:

Heh, ok, fixed that. You also came up with a last name for Dmitry that I
haven't seen that mentioned anywhere in this thread, either.

/messages/by-id/29bf3d71-bf78-4c74-8525-13d48e0142ec@yandex.ru

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#44Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#42)
Re: IPC/MultixactCreation on the Standby server

On 28 Nov 2025, at 21:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 27/11/2025 20:25, Andrey Borodin wrote:

On 27 Nov 2025, at 01:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This is because the WAL, created with old version, contains records like this:

lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)

That's an interesting case. I don't see how the SLRU interface can be
used to test whether an SLRU page is initialized correctly. We need a
version of SimpleLruReadPage() that avoids failing if the page does
not exist yet, and it must not be more expensive than the current
SimpleLruReadPage(). Alternatively, we need a new
XLOG_MULTIXACT_CREATE_ID_2 in the back branches.

There is SimpleLruDoesPhysicalPageExist() to check if a page has been initialized. There's also the 'latest_page_number' field which tracks what is the latest page that has been initialized.

While working on v8 of this patch, I observed races between SimpleLruDoesPhysicalPageExist() and ExtendMultiXactOffset(), because ExtendMultiXactOffset() may initialize the page only in a buffer, which a file system call will not observe.

latest_page_number looks great! Though I'm slightly worried by this comment:
/*
* During XLOG replay, latest_page_number isn't necessarily set up
* yet; insert a suitable value to bypass the sanity test in
* SimpleLruTruncate.
*/

and by

/*
* latest_page_number is the page number of the current end of the log;
* this is __not critical data__, since we use it only to avoid swapping out
* the latest page.
*/

If we use it in the startup process to prevent failure, then it is now critical data.

But I think the comments are off; latest_page_number is updated precisely.

The page initialization dance is only needed in the back branches. And we will inevitably have conflicts with the SLRU refactoring in 18 and the banking in 17. Conceptually v12 looks good to me; I can prepare backport versions.

You definitely beat the LLM in commit message clarity. Though, s/coudl/could/g.

On 28 Nov 2025, at 21:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

- I reverted the changes to ExtendMultiXactOffset(), so that it
deals with wraparound and FirstMultiXactId the same way as before.
The caller never passes FirstMultiXactId, but the changed comments
and the assertion were confusing, so I felt it's best to just
leave it alone

Maybe move a decision (to extend or not to extend) out of this
function? Now its purpose is "MaybeExtendMultiXactOffset". And
there's just one caller.

Perhaps. Let's leave that for a separate non-backported patch though.

OK, we just have to be sure that we do not emit 2 ZERO_OFF_PAGE records during wraparound.

The problem seems to have been introduced by 1986ca5, as a fix for a problem that may have been there even longer, perhaps since multitransactions existed. Interestingly, the only report known to me (besides this thread) was by Thom Brown in 2024.
/messages/by-id/CAA-aLv7AcTh+O9rnoPVbck=Fw8atMOYrUttk5G9ctFhXOsqQbw@mail.gmail.com

Thanks!

Best regards, Andrey Borodin.

#45Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#44)
2 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 28/11/2025 20:17, Andrey Borodin wrote:

latest_page_number looks great! though I'm slightly worried by this commend:
/*
* During XLOG replay, latest_page_number isn't necessarily set up
* yet; insert a suitable value to bypass the sanity test in
* SimpleLruTruncate.
*/

and by

/*
* latest_page_number is the page number of the current end of the log;
* this is __not critical data__, since we use it only to avoid swapping out
* the latest page.
*/

If we use it in startup process to prevent failure, now it is critical data.

But I think the comments are off, latest_page_number is updated precisely.

Agreed. I looked at the commit history of how the first comment got
introduced, to see if it used to be true at some point, but AFAICS we've
always kept latest_page_number up-to-date. It is very clearly
initialized in StartupMultiXact(), before WAL replay. Furthermore, we
only set the "suitable value to bypass the sanity test" for the offsets
SLRU, but everywhere else where we truncate, extend or set
latest_page_number, we treat the members SLRU the same way. So I don't see
how the offsets and members SLRUs could be different in that regard.

I think the second comment became outdated in commit bc7d37a525c0, which
introduced the safety check in (what later became) SimpleLruTruncate().
After that, it's been important that latest_page_number is set
correctly, although for the sanity check I guess you could be a little
sloppy with it.

The page initialization dance is only needed in the back branches. And
we will inevitably have conflicts with the SLRU refactoring in 18 and
the banking in 17. Conceptually v12 looks good to me; I can prepare
backport versions.

Thanks!

Here's a new patch version. I went through the test changes now:

I didn't understand why the 'kill9' and 'poll_start' stuff is needed. We
have plenty of tests that kill the server with regular
"$node->stop('immediate')", and restart the server normally. The
checkpoint in the middle of the tests seems unnecessary too. I removed
all that, and the test still seems to work. Was there a particular
reason for them?

I moved the wraparound test to a separate test file and commit. More
test coverage is good, but it's quite separate from the bugfix and the
wraparound related test shares very little with the other test. The
wraparound test needs a little more cleanup: use plain perl instead of
'dd' and 'rm' for the file operations, for example. (I did that with the
tests in the 64-bit mxoff patches, so we could copy from there.)

- Heikki

Attachments:

v13-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patchtext/x-patch; charset=UTF-8; name=v13-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patch
v13-0002-Add-test-for-multixid-wraparound.patchtext/x-patch; charset=UTF-8; name=v13-0002-Add-test-for-multixid-wraparound.patch
#46Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#45)
4 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 29 Nov 2025, at 00:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I think the second comment became outdated in commit bc7d37a525c0, which introduced the safety check in (what became later) SimpleLruTruncate(). After that, it's been important that latest_page_number is set correctly, although for the sanity check I guess you could be a little sloppy with it.

Cool, so we can safely backpatch to 7.2! (I was 15 when this version rocked)

The page initialization dance is only needed in the back branches. And we will inevitably have conflicts with the SLRU refactoring in 18 and the banking in 17. Conceptually v12 looks good to me; I can prepare
backport versions.

Thanks!

Here's a new patch version. I went through the test changes now:

I didn't understand why the 'kill9' and 'poll_start' stuff is needed. We have plenty of tests that kill the server with regular "$node->stop('immediate')", and restart the server normally. The checkpoint in the middle of the tests seems unnecessary too. I removed all that, and the test still seems to work. Was there a particular reason for them?

With the current shutdown sequence, the test seems to reproduce the corruption without checkpointing. I recall that in July the standby deadlock was reachable without a checkpoint, but the corruption was not. But now the test seems to work.

I moved the wraparound test to a separate test file and commit. More test coverage is good, but it's quite separate from the bugfix and the wraparound related test shares very little with the other test. The wraparound test needs a little more cleanup: use plain perl instead of 'dd' and 'rm' for the file operations, for example. (I did that with the tests in the 64-bit mxoff patches, so we could copy from there.)

PFA a test version without dd and rm. Did I get you right that we do not backport the wraparound test, and backport the fixes for the 001_multixact.pl test down to 17, where it appeared?

The first two patches are v13 intact; the second pair is my suggestions.

Best regards, Andrey Borodin.

Attachments:

v14-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patchapplication/octet-stream; name=v14-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patch; x-unix-mode=0644
v14-0004-Improve-multixact-wraparound-test.patchapplication/octet-stream; name=v14-0004-Improve-multixact-wraparound-test.patch; x-unix-mode=0644
v14-0003-Fix-debug-warning.patchapplication/octet-stream; name=v14-0003-Fix-debug-warning.patch; x-unix-mode=0644
v14-0002-Add-test-for-multixid-wraparound.patchapplication/octet-stream; name=v14-0002-Add-test-for-multixid-wraparound.patch; x-unix-mode=0644
#47Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#46)
6 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 30/11/2025 14:15, Andrey Borodin wrote:

On 29 Nov 2025, at 00:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I didn't understand why the 'kill9' and 'poll_start' stuff is
needed. We have plenty of tests that kill the server with regular
"$node->stop('immediate')", and restart the server normally. The
checkpoint in the middle of the tests seems unnecessary too. I
removed all that, and the test still seems to work. Was there a
particular reason for them?

With the current shutdown sequence, the test seems to reproduce the
corruption without checkpointing. I recall that in July the standby
deadlock was reachable without a checkpoint, but the corruption was
not. But now the test seems to work.

Ok.

I moved the wraparound test to a separate test file and commit.
More test coverage is good, but it's quite separate from the
bugfix and the wraparound related test shares very little with the
other test. The wraparound test needs a little more cleanup: use
plain perl instead of 'dd' and 'rm' for the file operations, for
example. (I did that with the tests in the 64-bit mxoff patches,
so we could copy from there.)

PFA a test version without dd and rm.

Thanks! I will focus on the main patch and TAP test now, but will commit
the wraparound test separately afterwards. At a quick glance, it looks
good now.

Did I get you right that we do not backport the wraparound test, and
backport the fixes for the 001_multixact.pl test down to 17, where it
appeared?

Yes, that's my plan. Except that 001_multixact.pl appeared in v18, not v17.

The first two patches are v13 intact; the second pair is my suggestions.

Thanks, here's a new set of patches, now with backpatched versions for
all the branches. As you said, there were a number of differences
between branches:

- On master, don't include the compatibility hacks for reading WAL
generated with older minor versions. Because WAL is not compatible
across major versions anyway.

- REL_18_STABLE didn't have the SimpleLruZeroAndWritePage() function
(introduced in commit c616785516).

- REL_17_STABLE didn't have the 001_multixact.pl TAP test. So I didn't
backport the new TAP test to v17 and below either.

- REL_16_STABLE used 32-bit SLRU page numbers, didn't have bank locks,
and used a simple sleep-loop instead of the condition variable.

- REL_15_STABLE and REL_14_STABLE: no conflicts from REL_16_STABLE

All of those conflicts were pretty straightforward to handle, but it's
enough code churn for silly mistakes to slip in, especially when the TAP
test didn't apply. So if you have a chance, please help to review and
test each of these backpatched versions too.

In addition to the backpatching, I did some more cosmetic cleanups to
the TAP test.

- Heikki

Attachments:

v15-master-0001-Set-next-multixid-s-offset-when-creating-.patchtext/x-patch; charset=UTF-8; name=v15-master-0001-Set-next-multixid-s-offset-when-creating-.patch
v15-pg14-0001-Set-next-multixid-s-offset-when-creating-a-.patchtext/x-patch; charset=UTF-8; name=v15-pg14-0001-Set-next-multixid-s-offset-when-creating-a-.patch
v15-pg15-0001-Set-next-multixid-s-offset-when-creating-a-.patchtext/x-patch; charset=UTF-8; name=v15-pg15-0001-Set-next-multixid-s-offset-when-creating-a-.patch
v15-pg16-0001-Set-next-multixid-s-offset-when-creating-a-.patchtext/x-patch; charset=UTF-8; name=v15-pg16-0001-Set-next-multixid-s-offset-when-creating-a-.patch
v15-pg17-0001-Set-next-multixid-s-offset-when-creating-a-.patchtext/x-patch; charset=UTF-8; name=v15-pg17-0001-Set-next-multixid-s-offset-when-creating-a-.patch
v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patchtext/x-patch; charset=UTF-8; name=v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patch
#48Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#47)
Re: IPC/MultixactCreation on the Standby server

On 1 Dec 2025, at 17:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

All of those conflicts were pretty straightforward to handle, but it's
enough code churn for silly mistakes to slip in, especially when the TAP
test didn't apply. So if you have a chance, please help to review and
test each of these backpatched versions too.

I'm looking through the patchsets; I'll take another look in the morning with fresh eyes.
So far I see two CI warning failures, for the pg18 and pg17 versions:
https://github.com/x4m/postgres_g/runs/56796102941
https://github.com/x4m/postgres_g/runs/56795182559

Relevant logs:
[14:06:58.425] multixact.c: In function ‘RecordNewMultiXact’:
[14:06:58.425] multixact.c:944:41: error: declaration of ‘slotno’ shadows a previous local [-Werror=shadow=compatible-local]
[14:06:58.425] 944 | int slotno;
[14:06:58.425] | ^~~~~~
[14:06:58.425] multixact.c:913:33: note: shadowed declaration is here
[14:06:58.425] 913 | int slotno;
[14:06:58.425] | ^~~~~~
[14:06:58.425] multixact.c:945:29: error: declaration of ‘lock’ shadows a previous local [-Werror=shadow=compatible-local]
[14:06:58.425] 945 | LWLock *lock;
[14:06:58.425] | ^~~~
[14:06:58.425] multixact.c:919:21: note: shadowed declaration is here
[14:06:58.425] 919 | LWLock *lock;
[14:06:58.425] | ^~~~

If only we had injection points before 17, I could run all the tests there too, just to be sure...

Best regards, Andrey Borodin.

#49Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#48)
Re: IPC/MultixactCreation on the Standby server

On 1 Dec 2025, at 20:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I'm looking through patchsets. I'll look in the morning with fresh eyes.

So far I have no findings.
I also tried to stress-test v14. I assumed that if a regression slipped in, it would most probably be inherited by 14 from the higher versions.
I used slightly modified scripts from Dmitry who started this thread.

DB initialization:
create table tbl2 (id int primary key,val int);
insert into tbl2 select i, 0 from generate_series(1,100000) i;

Load with multi:
\set id random(1, 10000)
begin;
select * from tbl2 where id = :id for no key update;
savepoint s1;
update tbl2 set val = val+1 where id = :id;
commit;

Consistency test:
select sum(val) from tbl2;

Stress-test script:
while true; do pkill -9 postgres; ./pg_ctl -D testdb restart; (./pgbench --no-vacuum -M prepared -c 50 -j 4 -T 300 -P 1 postgres --file=load.sql &); sleep 5; done

To promote multixact races, I tried this with the patched version: I added a random sleep of up to 50 ms between GetNewMultiXactId() and RecordNewMultiXact().
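
For reference, the hack was along these lines (debug-only, not part of any
patch; pg_usleep and the pg_prng helpers are the stock backend utilities on
recent branches):

```c
/*
 * Race amplifier for stress testing only: widen the window between
 * GetNewMultiXactId() and RecordNewMultiXact() in
 * MultiXactIdCreateFromMembers().
 */
pg_usleep(pg_prng_uint64_range(&pg_global_prng_state, 0, 50000)); /* 0..50 ms */
```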

So far I have not managed to corrupt the database.

What other kinds of load are worth exercising?

Best regards, Andrey Borodin.

#50Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#49)
Re: IPC/MultixactCreation on the Standby server

On 02/12/2025 15:44, Andrey Borodin wrote:

On 1 Dec 2025, at 20:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I'm looking through the patchsets; I'll take another look in the morning with fresh eyes.

So far I have no findings.
I also tried to stress-test v14. I assumed that if a regression slipped in, it would most probably be inherited by 14 from the higher versions.

Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth testing
separately, to make sure I didn't e.g. screw up the locking differences.

I used slightly modified scripts from Dmitry who started this thread.

DB initialization:
create table tbl2 (id int primary key,val int);
insert into tbl2 select i, 0 from generate_series(1,100000) i;

Load with multi:
\set id random(1, 10000)
begin;
select * from tbl2 where id = :id for no key update;
savepoint s1;
update tbl2 set val = val+1 where id = :id;
commit;

Consistency test:
select sum(val) from tbl2;

Stress-test script:
while true; do pkill -9 postgres; ./pg_ctl -D testdb restart; (./pgbench --no-vacuum -M prepared -c 50 -j 4 -T 300 -P 1 postgres --file=load.sql &); sleep 5; done

To promote multixact races, I tried this with the patched version: I added a random sleep of up to 50 ms between GetNewMultiXactId() and RecordNewMultiXact().

So far I have not managed to corrupt the database.

What other kinds of load are worth exercising?

Combinations of patched and unpatched binaries, as primary/replica and
crash recovery. I did some testing of crash recovery, but more would be
good.

- Heikki

#51Dmitry Yurichev
Dmitry Yurichev
dsy.075@yandex.ru
In reply to: Heikki Linnakangas (#50)
Re: IPC/MultixactCreation on the Standby server

On 12/2/25 16:48, Heikki Linnakangas wrote:

Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth testing
separately, to make sure I didn't e.g. screw up the locking differences.

I tested on the REL_18_1 tag (with
v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patch applied).
I didn't notice any deadlocks or other errors. Thanks!

--
Best regards,
Dmitry Yurichev.

#52Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Dmitry Yurichev (#51)
Re: IPC/MultixactCreation on the Standby server

On 03/12/2025 16:19, Dmitry Yurichev wrote:

On 12/2/25 16:48, Heikki Linnakangas wrote:

Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth
testing separately, to make sure I didn't e.g. screw up the locking
differences.

I tested on the REL_18_1 tag (with
v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patch applied).
I didn't notice any deadlocks or other errors. Thanks!

Okay. I fixed a few more little things:

- fixed the warnings about shadowed variables that Andrey pointed out
- in older branches before we switched to 64-bit SLRU page numbers, use
'int' rather than 'int64' in page number variables
- improve wording and fix a few trivial typos in comments

Committed with those last-minute changes. Thanks for the testing!

- Heikki

#53Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#47)
Re: IPC/MultixactCreation on the Standby server

On 01/12/2025 14:40, Heikki Linnakangas wrote:

On 30/11/2025 14:15, Andrey Borodin wrote:

On 29 Nov 2025, at 00:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I moved the wraparound test to a separate test file and commit.
More test coverage is good, but it's quite separate from the
bugfix and the wraparound related test shares very little with the
other test. The wraparound test needs a little more cleanup: use
plain perl instead of 'dd' and 'rm' for the file operations, for
example. (I did that with the tests in the 64-bit mxoff patches,
so we could copy from there.)

PFA test version without dd and rm.

Thanks! I will focus on the main patch and TAP test now, but will commit
the wraparound test separately afterwards. At quick glance, it looks
good now.

And now committed the multixid wraparound test, too. Thanks!

- Heikki

#54Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#52)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On 03/12/2025 19:19, Heikki Linnakangas wrote:

On 03/12/2025 16:19, Dmitry Yurichev wrote:

On 12/2/25 16:48, Heikki Linnakangas wrote:

Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth
testing separately, to make sure I didn't e.g. screw up the locking
differences.

I tested on the REL_18_1 tag (with applying v15-pg18-0001-Set-next-
multixid-s-offset-when-creating-a-.patch).
I didn't notice any deadlocks or other errors. Thanks!

Okay. I fixed a few more little things:

- fixed the warnings about shadowed variables that Andrey pointed out
- in older branches before we switched to 64-bit SLRU page numbers, use
'int' rather than 'int64' in page number variables
- improve wording and fix few trivial typos in comments

Committed with those last-minute changes. Thanks for the testing!

While working on the 64-bit multixid offsets patch, I noticed one more
bug with this. At offset wraparound, when we set the next multixid's
offset in RecordNewMultiXact, we incorrectly set it to 0 instead of 1.
We're supposed to skip over offset 0, because 0 is reserved to mean
invalid. We do that correctly when setting the "current" multixid's
offset, because the caller of RecordNewMultiXact has already skipped
over offset 0, but I missed it for the next offset.
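
In other words, an offset computation that wraps around must never land on
the reserved value. A minimal illustration of the rule (illustration only,
assuming a uint32 MultiXactOffset; the actual fix is in the attached patch
to RecordNewMultiXact):

```c
/* Offset 0 means "invalid", so when the next multixid's offset wraps
 * around to exactly 0, bump it to 1, matching what GetNewMultiXactId()
 * already does for the current offset. */
MultiXactOffset next_offset = offset + nmembers;  /* may wrap modulo 2^32 */

if (next_offset == 0)
    next_offset = 1;
```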

I was able to reproduce that with these steps:

pg_resetwal -O 0xffffff00 -D data
pg_ctl -D data start
pgbench -s5 -i postgres
pgbench -c3 -t100 -f a.sql postgres

a.sql:
select * from pgbench_branches FOR KEY SHARE;

You get an error like this:

pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR:
MultiXact 372013 has invalid next offset

I tried to modify the new wraparound TAP test to reproduce that, but it
turned out to be difficult because you need to have multiple backends
assigning multixids concurrently to hit that.

The fix is pretty straightforward, see attached. Barring objections,
I'll commit and backport this.

- Heikki

Attachments:

0001-Fix-setting-next-multixid-s-offset-at-offset-wraparo.patchtext/x-patch; charset=UTF-8; name=0001-Fix-setting-next-multixid-s-offset-at-offset-wraparo.patch
#55Álvaro Herrera
Álvaro Herrera
alvherre@kurilemu.de
In reply to: Heikki Linnakangas (#54)
Re: IPC/MultixactCreation on the Standby server

On 2025-Dec-04, Heikki Linnakangas wrote:

While working on the 64-bit multixid offsets patch, I noticed one more bug
with this. At offset wraparound, when we set the next multixid's offset in
RecordNewMultiXact, we incorrectly set it to 0 instead of 1. We're supposed
to skip over offset 0, because 0 is reserved to mean invalid. We do that
correctly when setting the "current" multixid's offset, because the caller
of RecordNewMultiXact has already skipped over offset 0, but I missed it for
the next offset.

Ouch.

I tried to modify the new wraparound TAP test to reproduce that, but it
turned out to be difficult because you need to have multiple backends
assigning multixids concurrently to hit that.

Hmm, would it make sense to add a pgbench-based test on
src/test/modules/xid_wraparound? That module is already known to be
expensive, and it doesn't run unless explicitly enabled, so I think it's
not a bad fit.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/

#56Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Álvaro Herrera (#55)
Re: IPC/MultixactCreation on the Standby server

On 04/12/2025 16:36, Álvaro Herrera wrote:

On 2025-Dec-04, Heikki Linnakangas wrote:

While working on the 64-bit multixid offsets patch, I noticed one more bug
with this. At offset wraparound, when we set the next multixid's offset in
RecordNewMultiXact, we incorrectly set it to 0 instead of 1. We're supposed
to skip over offset 0, because 0 is reserved to mean invalid. We do that
correctly when setting the "current" multixid's offset, because the caller
of RecordNewMultiXact has already skipped over offset 0, but I missed it for
the next offset.

Ouch.

Committed the fix.

I tried to modify the new wraparound TAP test to reproduce that, but it
turned out to be difficult because you need to have multiple backends
assigning multixids concurrently to hit that.

Hmm, would it make sense to add a pgbench-based test on
src/test/modules/xid_wraparound? That module is already known to be
expensive, and it doesn't run unless explicitly enabled, so I think it's
not a bad fit.

Doesn't seem worth it. The repro with pgbench is a bit unreliable too.
It actually only readily reproduces on 'master'. In the back branches, I also
had to add a random sleep before RecordNewMultiXact() to trigger it. I
think that's because on 'master' I removed the special case in
GetMultiXactIdMembers() that used the in-memory nextOffset value instead of
reading the next offset from the SLRU page when reading the last
assigned multixid.

Furthermore, we will hopefully get rid of offset wraparounds soon with
the 64-bit offsets.

- Heikki

#57Maxim Orlov
Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#45)
1 attachment(s)
Re: IPC/MultixactCreation on the Standby server

On Fri, 28 Nov 2025 at 22:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I moved the wraparound test to a separate test file and commit. More
test coverage is good, but it's quite separate from the bugfix and the
wraparound related test shares very little with the other test. The
wraparound test needs a little more cleanup: use plain perl instead of
'dd' and 'rm' for the file operations, for example. (I did that with the
tests in the 64-bit mxoff patches, so we could copy from there.)

It's good that the test was added. But it seems like it could be
improved a bit. The problem is, it only runs successfully with the
standard block size. Plus, the comment about the number of bytes was a
bit unclear, for my taste. PFA a patch; it should make this test pass
with different block sizes.

--
Best regards,
Maxim Orlov.

Attachments:

0001-Improve-7b81be9b42-Add-test-for-multixid-wraparound.patchapplication/octet-stream; name=0001-Improve-7b81be9b42-Add-test-for-multixid-wraparound.patch
#58Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Maxim Orlov (#57)
Re: IPC/MultixactCreation on the Standby server

On 5 Dec 2025, at 21:36, Maxim Orlov <orlovmg@gmail.com> wrote:

It's good that the test was added. But it seems like it could be
improved a bit. The problem is, it only runs successfully with the
standard block size. Plus, the comment about the number of bytes was a
bit unclear, for my taste. PFA a patch; it should make this test pass
with different block sizes.

Oh, great catch!

Other tests seem to extract the block size using a database query, like

$primary->safe_psql('postgres',
"SELECT setting::int FROM pg_settings WHERE name = 'block_size';");
or
$blksize = int($node->safe_psql('postgres', 'SHOW block_size;'));

But here we do not have a running cluster, so resorting to parsing the pg_resetwal output seems reasonable.

Thanks!

Best regards, Andrey Borodin.

#59Andrey Borodin
Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#54)
Re: IPC/MultixactCreation on the Standby server

On 4 Dec 2025, at 19:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I tried to modify the new wraparound TAP test to reproduce that, but it turned out to be difficult because you need to have multiple backends assigning multixids concurrently to hit that.

IIUC, the new TAP test verifies multixact ID wraparound (the -m option of pg_resetwal),
while the bug was in multixact offset wraparound.

I'm on-call for the rest of this week, but a bit later I can produce fast tests for all kinds of wraparounds :)

But as you said, hopefully soon there won't be wraparounds and, more importantly, no offsets/members space exhaustion.

Best regards, Andrey Borodin.

#60Heikki Linnakangas
Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#58)
Re: IPC/MultixactCreation on the Standby server

On 05/12/2025 20:36, Andrey Borodin wrote:

On 5 Dec 2025, at 21:36, Maxim Orlov <orlovmg@gmail.com> wrote:

It's good that the test was added. But it seems like it could be
improved a bit. The problem is, it only runs successfully with the
standard block size. Plus, the comment about the number of bytes was a
bit unclear, for my taste. PFA a patch; it should make this test pass
with different block sizes.

Oh, great catch!

Other tests seem to extract block size using database query like

$primary->safe_psql('postgres',
"SELECT setting::int FROM pg_settings WHERE name = 'block_size';");
or
$blksize = int($node->safe_psql('postgres', 'SHOW block_size;'));

But here we do not have a running cluster, so resorting to parsing the pg_resetwal output seems reasonable.

+1, pg_resetwal makes sense here. Calculating the filename of the last
SLRU segment also needs to be adjusted: it was hardcoded to FFFF, but
it's different with other block sizes.
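
For anyone wondering where FFFF came from: the name of the last offsets
segment falls out of simple arithmetic. A standalone sketch, assuming the
standard layout of 32 pages per SLRU segment and 4-byte MultiXactOffset
entries:

```c
#include <stdio.h>

int
main(void)
{
    unsigned long long blcksz = 8192;   /* try 4096, 16384, ... */
    unsigned long long entries_per_page = blcksz / 4;   /* sizeof(MultiXactOffset) */
    unsigned long long npages = (1ULL << 32) / entries_per_page;    /* 2^32 multixids */
    unsigned long long last_segment = (npages - 1) / 32;    /* SLRU_PAGES_PER_SEGMENT */

    printf("%04llX\n", last_segment);   /* 8192 -> FFFF, 4096 -> 1FFFF */
    return 0;
}
```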

Fixed that and committed. Thanks!

P.S. I'm surprised we don't have any buildfarm members with non-default
block sizes.

- Heikki

#61Álvaro Herrera
Álvaro Herrera
alvherre@kurilemu.de
In reply to: Andrey Borodin (#59)
Re: IPC/MultixactCreation on the Standby server

On 2025-Dec-05, Andrey Borodin wrote:

I'm on-call for the rest of this week, but a bit later I can produce
fast tests for all kinds of wraparounds :)

I suspect that would be valuable.

But as you said, hopefully soon there won't be wraparounds, and,
what's more important, offsets\members space exhaustion.

Hmm. We can easily enlarge the offset size to 64 bits, which eliminates
the problem of member wraparound and space exhaustion. However,
enlarging the multixact size itself would require enlarging the size of
the t_xmax field in heap tuples, and as far as I know, that's not in the
works. I mean, even with the patches to increase xids to 64 bits [1], I
understand that t_xmax continues to be 32 bits. This means that
multixact wraparound is still an issue that we need to pay
attention to.

[1]: /messages/by-id/CACG=ezYeSygeb68tJVej9Qgji6k78V2reSqzh1_Y4P5GxCAGsw@mail.gmail.com

As far as I understand from that patch, XIDs as stored in t_xmax are
still 32 bits, and are made 64 bits by the addition of the pd_xid_base
value. This technique can be used for standard XIDs; but multixacts,
having their own cycle separate from XIDs, cannot be offset by the same
counter. It would require a separate pd_multixact_base value to be
maintained for each page.
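
As a toy sketch of that base-offset technique (names taken from the
referenced proposal, not from committed PostgreSQL code):

```c
/* The tuple keeps a 32-bit xmax; the page header supplies a per-page
 * 64-bit base.  Multixacts advance on their own counter, so decoding
 * them this way would need a separate pd_multixact_base. */
static inline unsigned long long
DecodeXmax64(unsigned long long pd_xid_base, unsigned int t_xmax)
{
    return pd_xid_base + t_xmax;
}
```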

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)