IPC/MultixactCreation on the Standby server

Started by Dmitry · 9 months ago · 61 messages
#1 Dmitry
dsy.075@yandex.ru

Hi, hackers

The problem is as follows.
A replication cluster includes a primary server and one hot-standby replica.
The workload on the primary consists of many queries that generate
multixact IDs, while the hot-standby replica serves read-only queries.

After some time, all queries on the hot standby get stuck and never
finish.

The `pg_stat_activity` view on the replica reports that the processes are
stuck waiting on IPC/MultixactCreation; `pg_cancel_backend` and
`pg_terminate_backend` cannot cancel the queries, and SIGQUIT is the only
way to stop them.

We tried:
- changing the `autovacuum_multixact_freeze_max_age` parameters,
- increasing `multixact_member_buffers` and `multixact_offset_buffers`,
- disabling `hot_standby_feedback`,
- switching the replica to synchronous and asynchronous mode,
- and much more.
But nothing helped.

We also ran the replica in recovery mode from a WAL archive, i.e. as a
warm standby; the result was the same.

We also tried building from source, based on the REL_17_5 branch, with the
default configure settings:

    ./configure
    make
    make install

but had no luck.

Here is an example with a synthetic workload reproducing the problem.

Test system
===========

- Architecture: x86_64
- OS: Ubuntu 24.04.2 LTS (Noble Numbat)
- Tested postgres version(s):
    - latest 17 (17.5)
    - latest 18 (18-beta1)

The problem is not reproducible on PostgreSQL 16.9.

Steps to reproduce
==================

    postgres=# create table tbl (
        id int primary key,
        val int
    );
    postgres=# insert into tbl select i, 0 from generate_series(1,5) i;

The first and second scripts execute queries on the primary server
------------------------------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 200 -j 200 \
        -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    select * from tbl where id = :id for key share;
    commit;
    EOF

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 \
        -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    update tbl set val = val+1 where id = :id;
    \sleep 10 ms
    commit;
    EOF

The following script is executed on the replica
-----------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 \
        -T 300 -P 1 --file=/dev/stdin <<'EOF'
    begin;
    select sum(val) from tbl;
    \sleep 10 ms
    select sum(val) from tbl;
    \sleep 10 ms
    commit;
    EOF

    pgbench (17.5 (Ubuntu 17.5-1.pgdg24.04+1))
    progress: 1.0 s, 2606.8 tps, lat 33.588 ms stddev 13.316, 0 failed
    progress: 2.0 s, 3315.0 tps, lat 30.174 ms stddev 5.933, 0 failed
    progress: 3.0 s, 3357.0 tps, lat 29.699 ms stddev 5.541, 0 failed
    progress: 4.0 s, 3350.0 tps, lat 29.911 ms stddev 5.311, 0 failed
    progress: 5.0 s, 3206.0 tps, lat 30.999 ms stddev 6.343, 0 failed
    progress: 6.0 s, 3264.0 tps, lat 30.828 ms stddev 6.389, 0 failed
    progress: 7.0 s, 3224.0 tps, lat 31.099 ms stddev 6.197, 0 failed
    progress: 8.0 s, 3168.0 tps, lat 31.486 ms stddev 6.940, 0 failed
    progress: 9.0 s, 3118.0 tps, lat 32.004 ms stddev 6.546, 0 failed
    progress: 10.0 s, 3017.0 tps, lat 33.183 ms stddev 7.971, 0 failed
    progress: 11.0 s, 3157.0 tps, lat 31.697 ms stddev 6.624, 0 failed
    progress: 12.0 s, 3180.0 tps, lat 31.415 ms stddev 6.310, 0 failed
    progress: 13.0 s, 3150.9 tps, lat 31.591 ms stddev 6.280, 0 failed
    progress: 14.0 s, 3329.0 tps, lat 30.189 ms stddev 5.792, 0 failed
    progress: 15.0 s, 3233.6 tps, lat 30.852 ms stddev 5.723, 0 failed
    progress: 16.0 s, 3185.4 tps, lat 31.378 ms stddev 6.383, 0 failed
    progress: 17.0 s, 3035.0 tps, lat 32.920 ms stddev 7.390, 0 failed
    progress: 18.0 s, 3173.0 tps, lat 31.547 ms stddev 6.390, 0 failed
    progress: 19.0 s, 3077.0 tps, lat 32.427 ms stddev 6.634, 0 failed
    progress: 20.0 s, 3266.1 tps, lat 30.740 ms stddev 5.842, 0 failed
    progress: 21.0 s, 2990.9 tps, lat 33.353 ms stddev 7.019, 0 failed
    progress: 22.0 s, 3048.1 tps, lat 32.933 ms stddev 6.951, 0 failed
    progress: 23.0 s, 3148.0 tps, lat 31.769 ms stddev 6.077, 0 failed
    progress: 24.0 s, 1523.2 tps, lat 30.029 ms stddev 5.093, 0 failed
    progress: 25.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 26.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 27.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 28.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 29.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 30.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 31.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 32.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 33.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 34.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 35.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed

After some time, all requests on the replica hang waiting for
IPC/MultixactCreation.

Output from `pg_stat_activity`
------------------------------

    backend_type      | state  | wait_event_type |    wait_event     | query
    ------------------+--------+-----------------+-------------------+---------------------------
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    ...
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    startup           |        | LWLock          | BufferContent     |
    checkpointer      |        | Activity        | CheckpointerMain  |
    background writer |        | Activity        | BgwriterHibernate |
    walreceiver       |        | Activity        | WalReceiverMain   |

gdb session for `client backend` process
----------------------------------------

    (gdb) bt
    #0  0x00007f0e9872a007 in epoll_wait (epfd=5, events=0x57c4747fc458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
    #1  0x000057c440685033 in WaitEventSetWaitBlock (nevents=<optimized out>, occurred_events=0x7ffdaedc8360, cur_timeout=-1, set=0x57c4747fc3f0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1577
    #2  WaitEventSetWait (set=0x57c4747fc3f0, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdaedc8360, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1525
    #3  0x000057c44068541c in WaitLatch (latch=<optimized out>, wakeEvents=<optimized out>, timeout=<optimized out>, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:538
    #4  0x000057c44068d8c0 in ConditionVariableTimedSleep (cv=0x7f0cefc50ab0, timeout=-1, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:163
    #5  0x000057c440365a0c in ConditionVariableSleep (wait_event_info=134217765, cv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:98
    #6  GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
    #7  0x000057c4408adc6b in MultiXactIdGetUpdateXid.isra.0 (xmax=xmax@entry=45559845, t_infomask=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7478
    #8  0x000057c44031ecfa in HeapTupleGetUpdateXid (tuple=<error reading variable: Cannot access memory at address 0x0>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7519
    #9  HeapTupleSatisfiesMVCC (htup=<optimized out>, buffer=404, snapshot=0x57c474892ff0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1090
    #10 HeapTupleSatisfiesVisibility (htup=<optimized out>, snapshot=0x57c474892ff0, buffer=404) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1772
    #11 0x000057c44030c1cb in page_collect_tuples (check_serializable=<optimized out>, all_visible=<optimized out>, lines=<optimized out>, block=<optimized out>, buffer=<optimized out>, page=<optimized out>, snapshot=<optimized out>, scan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:480
    #12 heap_prepare_pagescan (sscan=0x57c47495b970) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:579
    #13 0x000057c44030cb59 in heapgettup_pagemode (scan=scan@entry=0x57c47495b970, dir=<optimized out>, nkeys=<optimized out>, key=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:999
    #14 0x000057c44030d1bd in heap_getnextslot (sscan=0x57c47495b970, direction=<optimized out>, slot=0x57c47494b278) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:1319
    #15 0x000057c4404f090a in table_scan_getnextslot (slot=0x57c47494b278, direction=ForwardScanDirection, sscan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/tableam.h:1072
    #16 SeqNext (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeSeqscan.c:80
    #17 0x000057c4404d5cfc in ExecProcNode (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #18 fetch_input_tuple (aggstate=aggstate@entry=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:561
    #19 0x000057c4404d848a in agg_retrieve_direct (aggstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2459
    #20 ExecAgg (pstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2179
    #21 0x000057c4404c2003 in ExecProcNode (node=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #22 ExecutePlan (dest=0x57c47483d548, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, queryDesc=0x57c474895010) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:1649
    #23 standard_ExecutorRun (queryDesc=0x57c474895010, direction=<optimized out>, count=0, execute_once=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:361
    ...

gdb session for `startup` process
---------------------------------

    (gdb) bt
    #0  0x00007f0e98698ce3 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:57
    #1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:87
    #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0ceb34e6b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
    #3  0x00007f0e986a4f1f in do_futex_wait (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
    #4  0x00007f0e986a4fb8 in __new_sem_wait_slow64 (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
    #5  0x00007f0e986a503d in __new_sem_wait (sem=sem@entry=0x7f0ceb34e6b8) at ./nptl/sem_wait.c:42
    #6  0x000057c440696166 in PGSemaphoreLock (sema=0x7f0ceb34e6b8) at port/pg_sema.c:327
    #7  LWLockAcquire (lock=0x7f0cefc58064, mode=LW_EXCLUSIVE) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/lwlock.c:1289
    #8  0x000057c44038f96a in LockBuffer (mode=2, buffer=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/buffer/bufmgr.c:5147
    #9  XLogReadBufferForRedoExtended (record=<optimized out>, block_id=<optimized out>, mode=RBM_NORMAL, get_cleanup_lock=false, buf=0x7ffdaedc8b4c) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:429
    #10 0x000057c440319969 in XLogReadBufferForRedo (buf=0x7ffdaedc8b4c, block_id=0 '\000', record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:317
    #11 heap_xlog_lock_updated (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10230
    #12 heap2_redo (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10362
    #13 0x000057c44038e1d2 in ApplyWalRecord (replayTLI=<synthetic pointer>, record=0x7f0e983908e0, xlogreader=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/xlog_internal.h:380
    #14 PerformWalRecovery () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogrecovery.c:1822
    #15 0x000057c44037bbf6 in StartupXLOG () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlog.c:5821
    #16 0x000057c4406155ed in StartupProcessMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/startup.c:258
    #17 0x000057c44060b376 in postmaster_child_launch (child_type=B_STARTUP, startup_data=0x0, startup_data_len=0, client_sock=0x0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/launch_backend.c:277
    #18 0x000057c440614509 in postmaster_child_launch (client_sock=0x0, startup_data_len=0, startup_data=0x0, child_type=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3934
    #19 StartChildProcess (type=type@entry=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3930
    #20 0x000057c44061480d in PostmasterStateMachine () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3392
    #21 0x000057c4408a3455 in process_pm_child_exit () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:2683
    #22 ServerLoop.isra.0 () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1667
    #23 0x000057c440616965 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1374
    #24 0x000057c4402bcd2d in main (argc=17, argv=0x57c4747fb140) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/main/main.c:199

Could you please help me fix the problem of stuck 'client backend'
processes?

Any ideas and recommendations are welcome!

Best regards,
Dmitry

#2 Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#1)
Re: IPC/MultixactCreation on the Standby server

On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:

#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483

Hi Dmitry!

This looks to be related to the work in my thread about multixacts [0]. It seems the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by the ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.

If so, any subsequent multixact redo from WAL should unstick the read of the last multixact.

Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.

Can you please check your reproduction with the patch attached to this message? The patch simply adds a timeout on the CV sleep, so in the worst case we fall back to the behavior of PG 16.

Best regards, Andrey Borodin.

Attachments:

0001-Make-next-multixact-sleep-timed.patch (application/octet-stream, +1/-2)
#3 Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#2)
Re: IPC/MultixactCreation on the Standby server

On 25.06.2025 12:34, Andrey Borodin wrote:

On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:

#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483

Hi Dmitry!

This looks to be related to the work in my thread about multixacts [0]. It seems the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by the ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.

If so, any subsequent multixact redo from WAL should unstick the read of the last multixact.

Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.

Can you please check your reproduction with the patch attached to this message? The patch simply adds a timeout on the CV sleep, so in the worst case we fall back to the behavior of PG 16.

Best regards, Andrey Borodin.

Hi Andrey!

Thanks so much for your response.

A small comment on /* Corner case 2: ... */
At this point in the code I tried to set trace points by emitting
messages through `elog()`, and I can say that the process does not always
get stuck in this part of the code; it happens from time to time, in an
unpredictable way.

Maybe this will help you a little.

To be honest, PostgreSQL performance is much better with this feature;
it would be a shame if we had to roll back to the behavior of version 16.

I will definitely try to reproduce the problem with your patch.

Best regards,
Dmitry.

#4 Dmitry
dsy.075@yandex.ru
In reply to: Dmitry (#3)
Re: IPC/MultixactCreation on the Standby server

On 25.06.2025 16:44, Dmitry wrote:

I will definitely try to reproduce the problem with your patch.

Hi Andrey!

I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try again to get information about the members of the multixact, in an endless loop.
At the same time, WAL is not being replayed; the 'startup' process still hangs on 'LWLock/BufferContent'.

Best regards,
Dmitry.

#5 Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#4)
Re: IPC/MultixactCreation on the Standby server

On 26 Jun 2025, at 14:33, Dmitry <dsy.075@yandex.ru> wrote:

On 25.06.2025 16:44, Dmitry wrote:

I will definitely try to reproduce the problem with your patch.

Hi Andrey!

I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try again to get information about the members of the multixact, in an endless loop.
At the same time, WAL is not being replayed; the 'startup' process still hangs on 'LWLock/BufferContent'.

My hypothesis is that MultiXactState->nextMXact is not updated often enough from the redo paths. So if you are unlucky enough, a corner case 2 read can deadlock with the startup process.
I need to verify this further, but if so, it's an ancient bug that just happens to be a few orders of magnitude more reproducible on 17 due to the performance improvements. Still a hypothesis, though.

Best regards, Andrey Borodin.

#6 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#5)
Re: IPC/MultixactCreation on the Standby server

On 26 Jun 2025, at 17:59, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

hypothesis

Dmitry, can you please retry your reproduction with the attached patch?

It should print nextMXact and tmpMXact. If my hypothesis is correct, nextMXact will precede tmpMXact.

Best regards, Andrey Borodin.

Attachments:

v2-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch (application/octet-stream, +5/-3)
#7 Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#6)
Re: IPC/MultixactCreation on the Standby server

On 26.06.2025 19:24, Andrey Borodin wrote:

If my hypothesis is correct nextMXact will precede tmpMXact.

It seems that the hypothesis has not been confirmed.

Attempt #1
2025-06-26 23:47:24.821 MSK [220458] WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
2025-06-26 23:47:24.822 MSK [220540] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
2025-06-26 23:47:24.823 MSK [220548] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...

pgbench (17.5)
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 4.0 s, 482.2 tps, lat 820.293 ms stddev 1370.729, 0 failed
progress: 5.0 s, 886.0 tps, lat 112.463 ms stddev 8.506, 0 failed
progress: 6.0 s, 348.9 tps, lat 111.324 ms stddev 5.871, 0 failed
WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379

Attempt #2
2025-06-27 09:18:01.312 MSK [236187] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236225] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236178] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
...

pgbench (17.5)
progress: 1.0 s, 830.9 tps, lat 108.556 ms stddev 10.078, 0 failed
progress: 2.0 s, 839.0 tps, lat 118.358 ms stddev 19.708, 0 failed
progress: 3.0 s, 623.4 tps, lat 134.186 ms stddev 15.565, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
...
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744

Best regards,
Dmitry.

#8 Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#7)
Re: IPC/MultixactCreation on the Standby server

On 27 Jun 2025, at 11:41, Dmitry <dsy.075@yandex.ru> wrote:

It seems that the hypothesis has not been confirmed.

Indeed.

For some reason your reproduction does not work for me.
I tried to create a test from your workload description. PFA a patch with a very dirty prototype.

To run the test:

    cd contrib/amcheck
    PROVE_TESTS=t/006_MultiXact_standby.pl make check

To check whether the reproduction worked, read tmp_check/log/006_MultiXact_standby_standby_1.log and see if there are messages "Timed out: nextMXact %u tmpMXact %u".

If you could codify your reproduction into this TAP test, we could make it portable, so I can debug the problem on my machine...

Either way we can proceed with remote debugging via mailing list :)

Thank you!

Best regards, Andrey Borodin.

Attachments:

v3-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch (application/octet-stream, +5/-3)
v3-0002-Test-concurrent-Multixact-reading-on-stadnby.patch (application/octet-stream, +164/-1)
#9 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#8)
Re: IPC/MultixactCreation on the Standby server

On 28 Jun 2025, at 00:37, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Indeed.

After some experiments I could get an unstable repro on my machine.
I've added some logging, and here is what I found:

2025-06-28 23:03:40.598 +05 [40887] 006_MultiXact_standby.pl WARNING: Timed out: nextMXact 415832 tmpMXact 415827 pageno 203 prev_pageno 203 entryno 83 offptr[1] 831655 offptr[0] 0 offptr[-1] 831651

We are reading multi 415827-1 while 415827 is not filled in yet, but we are holding a buffer that prevents the next multi from being filled in.
This looks like a recovery conflict.
I'm somewhat surprised that 415827+1 is already filled in...

Can you please try your reproduction with applied patch? This seems to be fixing issue for me.

Best regards, Andrey Borodin.

Attachments:

v4-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch (application/octet-stream, +176/-3)
#10 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#9)
Re: IPC/MultixactCreation on the Standby server

On 28 Jun 2025, at 21:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

This seems to be fixing issue for me.

ISTM I was wrong: there is a possible recovery conflict with snapshot.

REDO:
frame #2: 0x000000010179a0c8 postgres`pg_usleep(microsec=1000000) at pgsleep.c:50:10
frame #3: 0x000000010144c108 postgres`WaitExceedsMaxStandbyDelay(wait_event_info=134217772) at standby.c:248:2
frame #4: 0x000000010144a63c postgres`ResolveRecoveryConflictWithVirtualXIDs(waitlist=0x0000000126008200, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=134217772, report_waiting=true) at standby.c:384:8
frame #5: 0x000000010144a4f4 postgres`ResolveRecoveryConflictWithSnapshot(snapshotConflictHorizon=1214, isCatalogRel=false, locator=(spcOid = 1663, dbOid = 5, relNumber = 16384)) at standby.c:490:2
frame #6: 0x0000000100e4d3f8 postgres`heap_xlog_prune_freeze(record=0x0000000135808e60) at heapam.c:9208:4
frame #7: 0x0000000100e4d204 postgres`heap2_redo(record=0x0000000135808e60) at heapam.c:10353:4
frame #8: 0x0000000100f1548c postgres`ApplyWalRecord(xlogreader=0x0000000135808e60, record=0x0000000138058060, replayTLI=0x000000016f0425b0) at xlogrecovery.c:1991:2
frame #9: 0x0000000100f13ff0 postgres`PerformWalRecovery at xlogrecovery.c:1822:4
frame #10: 0x0000000100ef7940 postgres`StartupXLOG at xlog.c:5821:3
frame #11: 0x0000000101364334 postgres`StartupProcessMain(startup_data=0x0000000000000000, startup_data_len=0) at startup.c:258:2

SELECT:
frame #10: 0x0000000102a14684 postgres`GetMultiXactIdMembers(multi=278, members=0x000000016d4f9498, from_pgupgrade=false, isLockOnly=false) at multixact.c:1493:6
frame #11: 0x0000000102991814 postgres`MultiXactIdGetUpdateXid(xmax=278, t_infomask=4416) at heapam.c:7478:13
frame #12: 0x0000000102985450 postgres`HeapTupleGetUpdateXid(tuple=0x00000001043e5c60) at heapam.c:7519:9
frame #13: 0x00000001029a0360 postgres`HeapTupleSatisfiesMVCC(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1090:10
frame #14: 0x000000010299fbc8 postgres`HeapTupleSatisfiesVisibility(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1772:11
frame #15: 0x0000000102982954 postgres`page_collect_tuples(scan=0x000000014b009648, snapshot=0x000000015b07b930, page="", buffer=69, block=6, lines=228, all_visible=false, check_serializable=false) at heapam.c:480:12

page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. But the tuple visibility check might need WAL replay to make progress: we have to wait until some later multixact is filled in.
And that replay might in turn need the buffer lock, or have a snapshot conflict with the caller of page_collect_tuples().

Please find attached a dirty test; it reproduces the problem on my machine (a startup deadlock, so a reproducing run takes 180 s, while a normal run passes in 10 s).
Also, there is a fix: checking for recovery conflicts when falling back to the case 2 multixact read.

I do not feel comfortable servicing interrupts while InterruptHoldoffCount > 0, so I need help from someone more knowledgeable about our interrupt machinery to tell me whether what I'm proposing is OK. (Álvaro?)

Also, I've modified the code to make the race condition more reproducible:

    multi = GetNewMultiXactId(nmembers, &offset);
    /* random sleep to make the WAL order differ from the order of usage on pages */
    if (rand() % 2 == 0)
        pg_usleep(1000);
    (void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);

Perhaps I can build a fast injection-points test if we want one.

Best regards, Andrey Borodin.

Attachments:

v5-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch (application/octet-stream, +191/-5)
#11 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#10)
Re: IPC/MultixactCreation on the Standby server

On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. But the tuple visibility check might need WAL replay to make progress: we have to wait until some later multixact is filled in.
And that replay might in turn need the buffer lock, or have a snapshot conflict with the caller of page_collect_tuples().

Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. We check for recovery conflicts even in the presence of InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Personally, I see point 2 as very invasive, in code that I'm not too familiar with. Option 1 is clumsy. But option 3 is a giant system-wide change.
Yet I see 3 as the correct solution. Can't we just abstain from HOLD_INTERRUPTS() if the LWLock taken is not exclusive?

Best regards, Andrey Borodin.

#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#11)
Re: IPC/MultixactCreation on the Standby server

On 2025-Jul-17, Andrey Borodin wrote:

Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. We check for recovery conflicts even in the presence of
InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility()
without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Hmm, as you say, doing (3) is a very invasive system-wide change, but
can we do it more localized? I mean, what if we do RESUME_INTERRUPTS()
just before going to sleep on the CV, and restore with HOLD_INTERRUPTS()
once the sleep is done? That would only affect this one place rather
than the whole system, and should also (AFAICS) solve the issue.

Yet I see 3 as the correct solution. Can't we just abstain from
HOLD_INTERRUPTS() if the LWLock taken is not exclusive?

Hmm, the code in LWLockAcquire says

/*
* Lock out cancel/die interrupts until we exit the code section protected
* by the LWLock. This ensures that interrupts will not interfere with
* manipulations of data structures in shared memory.
*/
HOLD_INTERRUPTS();

which means that if we want to change this, we would have to inspect every
single use of LWLocks in shared mode in order to be certain that such a
change isn't problematic. This is a discussion I'm not prepared for.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Si quieres ser creativo, aprende el arte de perder el tiempo"

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#12)
Re: IPC/MultixactCreation on the Standby server

Hello,

Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.
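The scheme can be sketched with a toy model. Assumptions: a flat array stands in for the offsets SLRU, offset 0 means "not yet replayed" (as on a standby), bounds and wraparound are ignored, and all names are illustrative rather than actual PostgreSQL symbols.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the multixact offsets SLRU. */
enum { NMULTI = 16 };
uint32_t offsets[NMULTI];

/* Replay of a creation record for 'multi' with 'nmembers' members starting
 * at 'offset'. With the proposed WAL change, the record also lets us fill
 * in the offset of the *next* multixact (offset + nmembers), so readers
 * of 'multi' never have to wait for the next record to be replayed. */
void replay_multixact_create(uint32_t multi, uint32_t offset, uint32_t nmembers)
{
    /* Cross-check: if the previous record already filled in our offset,
     * it must agree with what this record says. */
    if (offsets[multi] != 0)
        assert(offsets[multi] == offset);
    offsets[multi] = offset;
    offsets[multi + 1] = offset + nmembers;   /* the new part of the fix */
}

/* Reading the members of 'multi' requires knowing where they end, i.e.
 * the offset of multi + 1. Returns the member count, or -1 if the reader
 * would have had to wait (the old deadlock-prone case). */
int read_multixact_nmembers(uint32_t multi)
{
    if (offsets[multi] == 0 || offsets[multi + 1] == 0)
        return -1;
    return (int)(offsets[multi + 1] - offsets[multi]);
}
```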

One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.

Anyway, he's going to try and implement this.

Andrey, please let me know if I misunderstood the idea.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#14Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#13)
Re: IPC/MultixactCreation on the Standby server

On 18 Jul 2025, at 16:53, Álvaro Herrera <alvherre@kurilemu.de> wrote:

Hello,

Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.

One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.

We always redo multixact creation before it becomes visible anywhere in the heap.
The problem was that to read a multi we might need the next multi's offset, and that multi might not have been WAL-logged yet.
However, I think we do not need to read a multi before it is redone.

Anyway, he's going to try and implement this.

Andrey, please let me know if I misunderstood the idea.

Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Best regards, Andrey Borodin.

Attachments:

v6-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch (application/octet-stream, +172/-1)
v6-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream, +43/-5)
#15Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#14)
Re: IPC/MultixactCreation on the Standby server

On 18 Jul 2025, at 18:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Here's v7 with improved comments and a cross-check for correctness.
MultiXact wraparound is also handled.
I'm planning to prepare tests and fixes for all supported branches, if there are no objections to this approach.

Best regards, Andrey Borodin.

Attachments:

v7-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch (application/octet-stream, +172/-1)
v7-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream, +64/-5)
#16Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#15)
Re: IPC/MultixactCreation on the Standby server

On 21 Jul 2025, at 19:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I'm planning to prepare tests and fixes for all supported branches

This is a status update message. I've reproduced the problem on REL_13_STABLE and verified that the proposed fix works there.

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging a multi, any heap tuple that uses this multi will become unreadable: any attempt to read it will hang forever.

I've reproduced the problem and am now working on scripting this scenario. Basically, I modify the code to hang forever after assigning multi number 2. Then I execute in the first psql:

create table x as select i,0 v from generate_series(1,10) i;
create unique index on x(i);

\set id 1
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 1
commit;

\set id 2
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 2 -- will hang
commit;

Then in second psql:

create table y as select i,0 v from generate_series(1,10) i;
create unique index on y(i);

\set id 1
begin;
select * from y where i = :id for no key update;
savepoint s1;
update y set v = v+1 where i = :id;
commit;

After this I pkill -9 postgres. The recovered installation cannot execute select * from x; because multi 1 cannot be read without recovery of multi 2, which was never logged.

Luckily the fix is the same: just restore the offset of multi 2 when multi 1 is recovered.

Best regards, Andrey Borodin.

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#16)
Re: IPC/MultixactCreation on the Standby server

On 2025-Jul-25, Andrey Borodin wrote:

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script). I agree it's a pretty ugly bug! I think it's
essentially the same bug as the other problem, so the proposed fix
should solve both.

Thanks for working on this!

Looking at this,

/*
 * We want to avoid edge case 2 in redo, because we cannot wait for
 * startup process in GetMultiXactIdMembers() without risk of a
 * deadlock.
 */
MultiXactId next = multi + 1;
int next_pageno;

/* Handle wraparound as GetMultiXactIdMembers() does it. */
if (multi < FirstMultiXactId)
    multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?

In this bit,

* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind. Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why WAL-logging
isn't necessary ... until I remembered that we're in a standby. I
think a simple comment here like "no WAL-logging because we're a
standby" should suffice.)

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#18Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#17)
Re: IPC/MultixactCreation on the Standby server

On 26 Jul 2025, at 22:44, Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Jul-25, Andrey Borodin wrote:

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script).

Yes, I explained a bit incorrectly, but you got the problem correctly.

Looking at this,

/*
 * We want to avoid edge case 2 in redo, because we cannot wait for
 * startup process in GetMultiXactIdMembers() without risk of a
 * deadlock.
 */
MultiXactId next = multi + 1;
int next_pageno;

/* Handle wraparound as GetMultiXactIdMembers() does it. */
if (multi < FirstMultiXactId)
    multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?

Yup, that was a typo.
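For reference, the corrected hunk would presumably test and advance 'next' like this. This is a standalone, compilable rendering of just the wraparound logic; the constants match their definitions in access/multixact.h, and the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

/* Constants as defined in access/multixact.h. */
#define InvalidMultiXactId ((MultiXactId) 0)
#define FirstMultiXactId   ((MultiXactId) 1)

/* Corrected version of the quoted hunk: the wraparound test must apply
 * to 'next', not 'multi'. MultiXactId is a 32-bit counter, so multi + 1
 * can wrap to InvalidMultiXactId (0), which must be skipped over. */
MultiXactId next_multixact(MultiXactId multi)
{
    MultiXactId next = multi + 1;

    /* Handle wraparound as GetMultiXactIdMembers() does it. */
    if (next < FirstMultiXactId)
        next = FirstMultiXactId;
    return next;
}
```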

In this bit,

* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind.

Ugh, that uncovered a 17+ problem: there we hold a couple of locks simultaneously. I'll post a version with this a bit later.

Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why isn't
WAL-logging necessary ... until I remember that we're in a standby. I
think a simple comment here like "no WAL-logging because we're a
standby" should suffice.)

Agreed.

I've made a test [0] and discovered another problem. Adding this checkpoint breaks the test [1] even after the fix [2].
I suspect that excluding "edge case 2" on the standby is simply not enough; we have to do this "next offset" dance on the primary too. I'll think more about other options.

Best regards, Andrey Borodin.
[0]: https://github.com/x4m/postgres_g/commit/eafcaec7aafde064b0da5d2ba4041ed2fb134f07
[1]: https://github.com/x4m/postgres_g/commit/da762c7cac56eff1988ea9126171ca0a6d2665e9
[2]: https://github.com/x4m/postgres_g/commit/d64c17d697d082856e5fe8bd52abafc0585973af

A timeline of these commits can be seen here: https://github.com/x4m/postgres_g/commits/mx19/

#19Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#18)
Re: IPC/MultixactCreation on the Standby server

On 27 Jul 2025, at 16:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

we have to do this "next offset" dance on Primary too.

PFA a draft of this.
I also attach a version for PG17; maybe Dmitry could try to reproduce the problem with this patch applied. I believe the patch should fix the problem.

Thanks!

Best regards, Andrey Borodin.

Attachments:

v8-PG17-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream, +62/-33)
v8-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream, +131/-143)
#20Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#19)
Re: IPC/MultixactCreation on the Standby server

On 28.07.2025 15:49, Andrey Borodin wrote:

I also attach a version for PG17, maybe Dmitry could try to reproduce the problem with this patch.

Andrey, thank you very much for your work, and thanks also to Álvaro for
joining the discussion. I ran tests on PG17 with patch v8; there are no
more sessions hanging on the replica, great! Replica queries are now
canceled with recovery conflicts:

    ERROR: canceling statement due to conflict with recovery
    DETAIL: User was holding shared buffer pin for too long.
    STATEMENT: select sum(val) from tbl2;

or

    ERROR: canceling statement due to conflict with recovery
    DETAIL: User query might have needed to see row versions that must be removed.
    STATEMENT: select sum(val) from tbl2;

But on the master, some requests then fail with an error; apparently
invalid multixacts remain in the pages:

    ERROR: MultiXact 81926 has invalid next offset
    STATEMENT: select * from tbl2 where id = $1 for no key update;

    ERROR: MultiXact 81941 has invalid next offset
    CONTEXT: while scanning block 3 offset 244 of relation "public.tbl2"
    automatic vacuum of table "postgres.public.tbl2"

Best regards, Dmitry.

#21Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#19)
#22Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andrey Borodin (#11)
#23Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#21)
#24Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#23)
#25Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#24)
#26Kirill Reshke
reshkekirill@gmail.com
In reply to: Andrey Borodin (#24)
#27Andrey Borodin
amborodin@acm.org
In reply to: Kirill Reshke (#26)
#28Bykov Ivan
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#27)
#29Andrey Borodin
amborodin@acm.org
In reply to: Bykov Ivan (#28)
#30Bykov Ivan
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#29)
#31Andrey Borodin
amborodin@acm.org
In reply to: Bykov Ivan (#30)
#32Bykov Ivan
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#31)
#33Bykov Ivan
i.bykov@modernsys.ru
In reply to: Bykov Ivan (#32)
#34Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bykov Ivan (#33)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#34)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#35)
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#36)
#38Chao Li
li.evan.chao@gmail.com
In reply to: Heikki Linnakangas (#36)
#39Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#37)
#40Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Chao Li (#38)
#41Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#36)
#42Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#41)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#42)
#44Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#42)
#45Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#44)
#46Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#45)
#47Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#46)
#48Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#47)
#49Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#48)
#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#49)
#51Dmitry
dsy.075@yandex.ru
In reply to: Heikki Linnakangas (#50)
#52Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dmitry (#51)
#53Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#47)
#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#52)
#55Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#54)
#56Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#55)
#57Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#45)
#58Andrey Borodin
amborodin@acm.org
In reply to: Maxim Orlov (#57)
#59Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#54)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#58)
#61Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#59)