IPC/MultixactCreation on the Standby server

Started by Dmitry · 9 months ago · 61 messages
#1 Dmitry
dsy.075@yandex.ru

Hi, hackers

The problem is as follows.
A replication cluster includes a primary server and one hot-standby replica.
The workload on the primary consists of many queries that generate
multixact IDs, while the hot-standby replica serves read-only queries.

After some time, all queries on the hot standby get stuck and never
finish.

The `pg_stat_activity` view on the replica reports that the processes are
stuck waiting on IPC/MultixactCreation; `pg_cancel_backend` and
`pg_terminate_backend` cannot cancel the queries, and SIGQUIT is the only
way to stop them.

We tried:
- changing the `autovacuum_multixact_freeze_max_age` parameters,
- increasing `multixact_member_buffers` and `multixact_offset_buffers`,
- disabling `hot_standby_feedback`,
- switching the replica to synchronous and asynchronous mode,
- and much more.
But nothing helped.

We also ran the replica in recovery mode from a WAL archive, i.e. as a
warm standby; the result was the same.

We also tried building from source, based on the REL_17_5 branch, with the
default configure settings:

    ./configure
    make
    make install

but had no luck.

Here is an example with a synthetic workload reproducing the problem.

Test system
===========

- Architecture: x86_64
- OS: Ubuntu 24.04.2 LTS (Noble Numbat)
- Tested postgres version(s):
    - latest 17 (17.5)
    - latest 18 (18-beta1)

The problem is not reproducible on PostgreSQL 16.9.

Steps to reproduce
==================

    postgres=# create table tbl (
        id int primary key,
        val int
    );
    postgres=# insert into tbl select i, 0 from generate_series(1,5) i;

The first and second scripts execute queries on the primary server
------------------------------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 200 -j 200 \
        -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    select * from tbl where id = :id for key share;
    commit;
    EOF

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 \
        -T 300 -P 1 --file=/dev/stdin <<'EOF'
    \set id random(1, 5)
    begin;
    update tbl set val = val+1 where id = :id;
    \sleep 10 ms
    commit;
    EOF

The following script is executed on the replica
-----------------------------------------------

    pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 \
        -T 300 -P 1 --file=/dev/stdin <<'EOF'
    begin;
    select sum(val) from tbl;
    \sleep 10 ms
    select sum(val) from tbl;
    \sleep 10 ms
    commit;
    EOF

    pgbench (17.5 (Ubuntu 17.5-1.pgdg24.04+1))
    progress: 1.0 s, 2606.8 tps, lat 33.588 ms stddev 13.316, 0 failed
    progress: 2.0 s, 3315.0 tps, lat 30.174 ms stddev 5.933, 0 failed
    progress: 3.0 s, 3357.0 tps, lat 29.699 ms stddev 5.541, 0 failed
    progress: 4.0 s, 3350.0 tps, lat 29.911 ms stddev 5.311, 0 failed
    progress: 5.0 s, 3206.0 tps, lat 30.999 ms stddev 6.343, 0 failed
    progress: 6.0 s, 3264.0 tps, lat 30.828 ms stddev 6.389, 0 failed
    progress: 7.0 s, 3224.0 tps, lat 31.099 ms stddev 6.197, 0 failed
    progress: 8.0 s, 3168.0 tps, lat 31.486 ms stddev 6.940, 0 failed
    progress: 9.0 s, 3118.0 tps, lat 32.004 ms stddev 6.546, 0 failed
    progress: 10.0 s, 3017.0 tps, lat 33.183 ms stddev 7.971, 0 failed
    progress: 11.0 s, 3157.0 tps, lat 31.697 ms stddev 6.624, 0 failed
    progress: 12.0 s, 3180.0 tps, lat 31.415 ms stddev 6.310, 0 failed
    progress: 13.0 s, 3150.9 tps, lat 31.591 ms stddev 6.280, 0 failed
    progress: 14.0 s, 3329.0 tps, lat 30.189 ms stddev 5.792, 0 failed
    progress: 15.0 s, 3233.6 tps, lat 30.852 ms stddev 5.723, 0 failed
    progress: 16.0 s, 3185.4 tps, lat 31.378 ms stddev 6.383, 0 failed
    progress: 17.0 s, 3035.0 tps, lat 32.920 ms stddev 7.390, 0 failed
    progress: 18.0 s, 3173.0 tps, lat 31.547 ms stddev 6.390, 0 failed
    progress: 19.0 s, 3077.0 tps, lat 32.427 ms stddev 6.634, 0 failed
    progress: 20.0 s, 3266.1 tps, lat 30.740 ms stddev 5.842, 0 failed
    progress: 21.0 s, 2990.9 tps, lat 33.353 ms stddev 7.019, 0 failed
    progress: 22.0 s, 3048.1 tps, lat 32.933 ms stddev 6.951, 0 failed
    progress: 23.0 s, 3148.0 tps, lat 31.769 ms stddev 6.077, 0 failed
    progress: 24.0 s, 1523.2 tps, lat 30.029 ms stddev 5.093, 0 failed
    progress: 25.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 26.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 27.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 28.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 29.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 30.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 31.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 32.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 33.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 34.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
    progress: 35.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed

After some time, all requests on the replica hang waiting for
IPC/MultixactCreation.

Output from `pg_stat_activity`
------------------------------

    backend_type      | state  | wait_event_type |    wait_event     | query
    ------------------+--------+-----------------+-------------------+---------------------------
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    ...
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    client backend    | active | IPC             | MultixactCreation | select sum(val) from tbl;
    startup           |        | LWLock          | BufferContent     |
    checkpointer      |        | Activity        | CheckpointerMain  |
    background writer |        | Activity        | BgwriterHibernate |
    walreceiver       |        | Activity        | WalReceiverMain   |

gdb session for `client backend` process
----------------------------------------

    (gdb) bt
    #0  0x00007f0e9872a007 in epoll_wait (epfd=5, events=0x57c4747fc458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
    #1  0x000057c440685033 in WaitEventSetWaitBlock (nevents=<optimized out>, occurred_events=0x7ffdaedc8360, cur_timeout=-1, set=0x57c4747fc3f0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1577
    #2  WaitEventSetWait (set=0x57c4747fc3f0, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdaedc8360, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1525
    #3  0x000057c44068541c in WaitLatch (latch=<optimized out>, wakeEvents=<optimized out>, timeout=<optimized out>, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:538
    #4  0x000057c44068d8c0 in ConditionVariableTimedSleep (cv=0x7f0cefc50ab0, timeout=-1, wait_event_info=134217765) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:163
    #5  0x000057c440365a0c in ConditionVariableSleep (wait_event_info=134217765, cv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:98
    #6  GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
    #7  0x000057c4408adc6b in MultiXactIdGetUpdateXid.isra.0 (xmax=xmax@entry=45559845, t_infomask=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7478
    #8  0x000057c44031ecfa in HeapTupleGetUpdateXid (tuple=<error reading variable: Cannot access memory at address 0x0>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7519
    #9  HeapTupleSatisfiesMVCC (htup=<optimized out>, buffer=404, snapshot=0x57c474892ff0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1090
    #10 HeapTupleSatisfiesVisibility (htup=<optimized out>, snapshot=0x57c474892ff0, buffer=404) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1772
    #11 0x000057c44030c1cb in page_collect_tuples (check_serializable=<optimized out>, all_visible=<optimized out>, lines=<optimized out>, block=<optimized out>, buffer=<optimized out>, page=<optimized out>, snapshot=<optimized out>, scan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:480
    #12 heap_prepare_pagescan (sscan=0x57c47495b970) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:579
    #13 0x000057c44030cb59 in heapgettup_pagemode (scan=scan@entry=0x57c47495b970, dir=<optimized out>, nkeys=<optimized out>, key=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:999
    #14 0x000057c44030d1bd in heap_getnextslot (sscan=0x57c47495b970, direction=<optimized out>, slot=0x57c47494b278) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:1319
    #15 0x000057c4404f090a in table_scan_getnextslot (slot=0x57c47494b278, direction=ForwardScanDirection, sscan=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/tableam.h:1072
    #16 SeqNext (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeSeqscan.c:80
    #17 0x000057c4404d5cfc in ExecProcNode (node=0x57c47494b0e8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #18 fetch_input_tuple (aggstate=aggstate@entry=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:561
    #19 0x000057c4404d848a in agg_retrieve_direct (aggstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2459
    #20 ExecAgg (pstate=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2179
    #21 0x000057c4404c2003 in ExecProcNode (node=0x57c47494aaf0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
    #22 ExecutePlan (dest=0x57c47483d548, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, queryDesc=0x57c474895010) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:1649
    #23 standard_ExecutorRun (queryDesc=0x57c474895010, direction=<optimized out>, count=0, execute_once=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:361
    ...

gdb session for `startup` process
---------------------------------

    (gdb) bt
    #0  0x00007f0e98698ce3 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:57
    #1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:87
    #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f0ceb34e6b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
    #3  0x00007f0e986a4f1f in do_futex_wait (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
    #4  0x00007f0e986a4fb8 in __new_sem_wait_slow64 (sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
    #5  0x00007f0e986a503d in __new_sem_wait (sem=sem@entry=0x7f0ceb34e6b8) at ./nptl/sem_wait.c:42
    #6  0x000057c440696166 in PGSemaphoreLock (sema=0x7f0ceb34e6b8) at port/pg_sema.c:327
    #7  LWLockAcquire (lock=0x7f0cefc58064, mode=LW_EXCLUSIVE) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/lwlock.c:1289
    #8  0x000057c44038f96a in LockBuffer (mode=2, buffer=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/buffer/bufmgr.c:5147
    #9  XLogReadBufferForRedoExtended (record=<optimized out>, block_id=<optimized out>, mode=RBM_NORMAL, get_cleanup_lock=false, buf=0x7ffdaedc8b4c) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:429
    #10 0x000057c440319969 in XLogReadBufferForRedo (buf=0x7ffdaedc8b4c, block_id=0 '\000', record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:317
    #11 heap_xlog_lock_updated (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10230
    #12 heap2_redo (record=0x57c4748994d8) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10362
    #13 0x000057c44038e1d2 in ApplyWalRecord (replayTLI=<synthetic pointer>, record=0x7f0e983908e0, xlogreader=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/xlog_internal.h:380
    #14 PerformWalRecovery () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogrecovery.c:1822
    #15 0x000057c44037bbf6 in StartupXLOG () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlog.c:5821
    #16 0x000057c4406155ed in StartupProcessMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/startup.c:258
    #17 0x000057c44060b376 in postmaster_child_launch (child_type=B_STARTUP, startup_data=0x0, startup_data_len=0, client_sock=0x0) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/launch_backend.c:277
    #18 0x000057c440614509 in postmaster_child_launch (client_sock=0x0, startup_data_len=0, startup_data=0x0, child_type=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3934
    #19 StartChildProcess (type=type@entry=B_STARTUP) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3930
    #20 0x000057c44061480d in PostmasterStateMachine () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3392
    #21 0x000057c4408a3455 in process_pm_child_exit () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:2683
    #22 ServerLoop.isra.0 () at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1667
    #23 0x000057c440616965 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1374
    #24 0x000057c4402bcd2d in main (argc=17, argv=0x57c4747fb140) at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/main/main.c:199

Could you please help me fix the problem of stuck 'client backend'
processes?

Any ideas and recommendations are welcome!

Best regards,
Dmitry

#2 Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#1)
Re: IPC/MultixactCreation on the Standby server

On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:

#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483

Hi Dmitry!

This looks to be related to the work in my thread about multixacts [0]. It seems the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by the ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.

If so, any subsequent multixact redo from WAL should unstick the read of the last multixact.

Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.

Can you please check your reproduction with the patch attached to this message? The patch simply adds a timeout on the CV sleep, so in the worst case we fall back to the behavior of PG 16.

Best regards, Andrey Borodin.

Attachments:

0001-Make-next-multixact-sleep-timed.patch (application/octet-stream, +1/-2)
#3 Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#2)
Re: IPC/MultixactCreation on the Standby server

On 25.06.2025 12:34, Andrey Borodin wrote:

On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:

#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483

Hi Dmitry!

This looks to be related to the work in my thread about multixacts [0]. It seems the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by the ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.

If so, any subsequent multixact redo from WAL should unstick the read of the last multixact.

Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.

Can you please check your reproduction with the patch attached to this message? The patch simply adds a timeout on the CV sleep, so in the worst case we fall back to the behavior of PG 16.

Best regards, Andrey Borodin.

Hi Andrey!

Thanks so much for your response.

A small comment on /* Corner case 2: ... */
At this point in the code I tried to set trace points by emitting
messages through `elog()`, and I can say that the process does not always
get stuck in this part of the code; it happens from time to time, in an
unpredictable way.

Maybe this will help you a little.

To be honest, PostgreSQL performance is much better with this feature;
it would be a shame if we had to roll back to the behavior of version 16.

I will definitely try to reproduce the problem with your patch.

Best regards,
Dmitry.

#4 Dmitry
dsy.075@yandex.ru
In reply to: Dmitry (#3)
Re: IPC/MultixactCreation on the Standby server

On 25.06.2025 16:44, Dmitry wrote:

I will definitely try to reproduce the problem with your patch.

Hi Andrey!

I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try again to get information about the members of the multixact, in an endless loop.
At the same time, WAL is not being replayed; the 'startup' process still hangs on 'LWLock/BufferContent'.

Best regards,
Dmitry.

#5 Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#4)
Re: IPC/MultixactCreation on the Standby server

On 26 Jun 2025, at 14:33, Dmitry <dsy.075@yandex.ru> wrote:

On 25.06.2025 16:44, Dmitry wrote:

I will definitely try to reproduce the problem with your patch.

Hi Andrey!

I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try again to get information about the members of the multixact, in an endless loop.
At the same time, WAL is not being replayed; the 'startup' process still hangs on 'LWLock/BufferContent'.

My hypothesis is that MultiXactState->nextMXact is not updated often enough from the redo paths. So if you are unlucky enough, a corner case 2 read can deadlock with the startup process.
I need to verify this further, but if so, it's an ancient bug that just happens to be a few orders of magnitude more reproducible on 17 due to the performance improvements. Still a hypothesis, though.

Best regards, Andrey Borodin.

#6 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#5)
Re: IPC/MultixactCreation on the Standby server

On 26 Jun 2025, at 17:59, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

hypothesis

Dmitry, can you please retry your reproduction with the attached patch?

It should print nextMXact and tmpMXact. If my hypothesis is correct, nextMXact will precede tmpMXact.

Best regards, Andrey Borodin.

Attachments:

v2-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch (application/octet-stream, +5/-3)
#7 Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#6)
Re: IPC/MultixactCreation on the Standby server

On 26.06.2025 19:24, Andrey Borodin wrote:

If my hypothesis is correct nextMXact will precede tmpMXact.

It seems that the hypothesis has not been confirmed.

Attempt #1
2025-06-26 23:47:24.821 MSK [220458] WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
2025-06-26 23:47:24.822 MSK [220540] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
2025-06-26 23:47:24.823 MSK [220548] WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...

pgbench (17.5)
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 4.0 s, 482.2 tps, lat 820.293 ms stddev 1370.729, 0 failed
progress: 5.0 s, 886.0 tps, lat 112.463 ms stddev 8.506, 0 failed
progress: 6.0 s, 348.9 tps, lat 111.324 ms stddev 5.871, 0 failed
WARNING:  Timed out: nextMXact 24138381 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379
...
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24138382 tmpMXact 24138379

Attempt #2
2025-06-27 09:18:01.312 MSK [236187] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236225] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236178] WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
...

pgbench (17.5)
progress: 1.0 s, 830.9 tps, lat 108.556 ms stddev 10.078, 0 failed
progress: 2.0 s, 839.0 tps, lat 118.358 ms stddev 19.708, 0 failed
progress: 3.0 s, 623.4 tps, lat 134.186 ms stddev 15.565, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
WARNING:  Timed out: nextMXact 24497747 tmpMXact 24497744
...
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING:  Timed out: nextMXact 24497746 tmpMXact 24497744

Best regards,
Dmitry.

#8 Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#7)
Re: IPC/MultixactCreation on the Standby server

On 27 Jun 2025, at 11:41, Dmitry <dsy.075@yandex.ru> wrote:

It seems that the hypothesis has not been confirmed.

Indeed.

For some reason your reproduction does not work for me.
I tried to create a test from your workload description. PFA a patch with a very dirty prototype.

To run the test:

    cd contrib/amcheck
    PROVE_TESTS=t/006_MultiXact_standby.pl make check

To check whether the reproduction worked, read tmp_check/log/006_MultiXact_standby_standby_1.log and see if there are messages "Timed out: nextMXact %u tmpMXact %u".

If you could codify your reproduction into this TAP test, we could make it portable, so I can debug the problem on my machine...

Either way we can proceed with remote debugging via mailing list :)

Thank you!

Best regards, Andrey Borodin.

Attachments:

v3-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch (application/octet-stream, +5/-3)
v3-0002-Test-concurrent-Multixact-reading-on-stadnby.patch (application/octet-stream, +164/-1)
#9 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#8)
Re: IPC/MultixactCreation on the Standby server

On 28 Jun 2025, at 00:37, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Indeed.

After some experiments I could get an unstable repro on my machine.
I've added some logging, and here is what I found:

2025-06-28 23:03:40.598 +05 [40887] 006_MultiXact_standby.pl WARNING: Timed out: nextMXact 415832 tmpMXact 415827 pageno 203 prev_pageno 203 entryno 83 offptr[1] 831655 offptr[0] 0 offptr[-1] 831651

We are reading multi 415827-1 while 415827 is not filled in yet, but we are holding a buffer that prevents the next multi from being filled in.
This looks like a recovery conflict.
I'm somewhat surprised that 415827+1 is already filled in...

Can you please try your reproduction with applied patch? This seems to be fixing issue for me.

Best regards, Andrey Borodin.

Attachments:

v4-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch (application/octet-stream, +176/-3)
#10 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#9)
Re: IPC/MultixactCreation on the Standby server

On 28 Jun 2025, at 21:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

This seems to be fixing issue for me.

ISTM I was wrong: there is a possible recovery conflict with snapshot.

REDO:
frame #2: 0x000000010179a0c8 postgres`pg_usleep(microsec=1000000) at pgsleep.c:50:10
frame #3: 0x000000010144c108 postgres`WaitExceedsMaxStandbyDelay(wait_event_info=134217772) at standby.c:248:2
frame #4: 0x000000010144a63c postgres`ResolveRecoveryConflictWithVirtualXIDs(waitlist=0x0000000126008200, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=134217772, report_waiting=true) at standby.c:384:8
frame #5: 0x000000010144a4f4 postgres`ResolveRecoveryConflictWithSnapshot(snapshotConflictHorizon=1214, isCatalogRel=false, locator=(spcOid = 1663, dbOid = 5, relNumber = 16384)) at standby.c:490:2
frame #6: 0x0000000100e4d3f8 postgres`heap_xlog_prune_freeze(record=0x0000000135808e60) at heapam.c:9208:4
frame #7: 0x0000000100e4d204 postgres`heap2_redo(record=0x0000000135808e60) at heapam.c:10353:4
frame #8: 0x0000000100f1548c postgres`ApplyWalRecord(xlogreader=0x0000000135808e60, record=0x0000000138058060, replayTLI=0x000000016f0425b0) at xlogrecovery.c:1991:2
frame #9: 0x0000000100f13ff0 postgres`PerformWalRecovery at xlogrecovery.c:1822:4
frame #10: 0x0000000100ef7940 postgres`StartupXLOG at xlog.c:5821:3
frame #11: 0x0000000101364334 postgres`StartupProcessMain(startup_data=0x0000000000000000, startup_data_len=0) at startup.c:258:2

SELECT:
frame #10: 0x0000000102a14684 postgres`GetMultiXactIdMembers(multi=278, members=0x000000016d4f9498, from_pgupgrade=false, isLockOnly=false) at multixact.c:1493:6
frame #11: 0x0000000102991814 postgres`MultiXactIdGetUpdateXid(xmax=278, t_infomask=4416) at heapam.c:7478:13
frame #12: 0x0000000102985450 postgres`HeapTupleGetUpdateXid(tuple=0x00000001043e5c60) at heapam.c:7519:9
frame #13: 0x00000001029a0360 postgres`HeapTupleSatisfiesMVCC(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1090:10
frame #14: 0x000000010299fbc8 postgres`HeapTupleSatisfiesVisibility(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1772:11
frame #15: 0x0000000102982954 postgres`page_collect_tuples(scan=0x000000014b009648, snapshot=0x000000015b07b930, page="", buffer=69, block=6, lines=228, all_visible=false, check_serializable=false) at heapam.c:480:12

page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. But the tuple visibility check might need WAL replay to make progress: we have to wait until some later multixact is filled in.
And that replay might in turn need the buffer lock, or have a snapshot conflict with the caller of page_collect_tuples().

Please find attached a dirty test; it reproduces the problem on my machine (a startup deadlock, so a reproducing run takes 180 s, while a normal run passes in 10 s).
Also, there is a fix: checking for recovery conflicts when falling back to the case 2 multixact read.

I do not feel comfortable servicing interrupts while InterruptHoldoffCount > 0, so I need help from someone more knowledgeable about our interrupt machinery to tell me whether what I'm proposing is OK. (Álvaro?)

Also, I've modified the code to make the race condition more reproducible:

    multi = GetNewMultiXactId(nmembers, &offset);
    /* random sleep to make the WAL order differ from the order of usage on pages */
    if (rand() % 2 == 0)
        pg_usleep(1000);
    (void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);

Perhaps I can build a fast injection-points test if we want one.

Best regards, Andrey Borodin.

Attachments:

v5-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch (application/octet-stream, +191/-5)
#11 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#10)
Re: IPC/MultixactCreation on the Standby server

On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. But the tuple visibility check might need WAL replay to make progress: we have to wait until some later multixact is filled in.
And that replay might in turn need the buffer lock, or have a snapshot conflict with the caller of page_collect_tuples().

Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. We check for recovery conflicts even in the presence of InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Personally, I see point 2 as very invasive, in code that I'm not too familiar with. Option 1 is clumsy. But option 3 is a giant system-wide change.
Yet I see 3 as the correct solution. Can't we just abstain from HOLD_INTERRUPTS() if the LWLock taken is not exclusive?

Best regards, Andrey Borodin.

#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#11)
Re: IPC/MultixactCreation on the Standby server

On 2025-Jul-17, Andrey Borodin wrote:

Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. We check for recovery conflicts even in the presence of
InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility()
without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??

Hmm, as you say, doing (3) is a very invasive system-wide change, but
can we do it more localized? I mean, what if we do RESUME_INTERRUPTS()
just before going to sleep on the CV, and restore with HOLD_INTERRUPTS()
once the sleep is done? That would only affect this one place rather
than the whole system, and should also (AFAICS) solve the issue.

Yet I see 3 as the correct solution. Can't we just abstain from
HOLD_INTERRUPTS() if the LWLock taken is not exclusive?

Hmm, the code in LWLockAcquire says

/*
* Lock out cancel/die interrupts until we exit the code section protected
* by the LWLock. This ensures that interrupts will not interfere with
* manipulations of data structures in shared memory.
*/
HOLD_INTERRUPTS();

which means that if we want to change this, we would have to inspect every
single use of LWLocks in shared mode in order to be certain that such a
change isn't problematic. This is a discussion I'm not prepared for.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Si quieres ser creativo, aprende el arte de perder el tiempo"

#13Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#12)
Re: IPC/MultixactCreation on the Standby server

Hello,

Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.
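The scheme can be sketched with a toy model. Assumptions: a flat array stands in for the offsets SLRU, offset 0 means "not yet replayed" (as on a standby), bounds and wraparound are ignored, and all names are illustrative rather than actual PostgreSQL symbols.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the multixact offsets SLRU. */
enum { NMULTI = 16 };
uint32_t offsets[NMULTI];

/* Replay of a creation record for 'multi' with 'nmembers' members starting
 * at 'offset'. With the proposed WAL change, the record also lets us fill
 * in the offset of the *next* multixact (offset + nmembers), so readers
 * of 'multi' never have to wait for the next record to be replayed. */
void replay_multixact_create(uint32_t multi, uint32_t offset, uint32_t nmembers)
{
    /* Cross-check: if the previous record already filled in our offset,
     * it must agree with what this record says. */
    if (offsets[multi] != 0)
        assert(offsets[multi] == offset);
    offsets[multi] = offset;
    offsets[multi + 1] = offset + nmembers;   /* the new part of the fix */
}

/* Reading the members of 'multi' requires knowing where they end, i.e.
 * the offset of multi + 1. Returns the member count, or -1 if the reader
 * would have had to wait (the old deadlock-prone case). */
int read_multixact_nmembers(uint32_t multi)
{
    if (offsets[multi] == 0 || offsets[multi + 1] == 0)
        return -1;
    return (int)(offsets[multi + 1] - offsets[multi]);
}
```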

One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.

Anyway, he's going to try and implement this.

Andrey, please let me know if I misunderstood the idea.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#14Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#13)
Re: IPC/MultixactCreation on the Standby server

On 18 Jul 2025, at 16:53, Álvaro Herrera <alvherre@kurilemu.de> wrote:

Hello,

Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.

One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.

We always redo multixact creation before it becomes visible anywhere in the heap.
The problem was that to read a multi we might need the next multi's offset, and that multi might not have been WAL-logged yet.
However, I think we do not need to read a multi before it is redone.

Anyway, he's going to try and implement this.

Andrey, please let me know if I misunderstood the idea.

Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Best regards, Andrey Borodin.

Attachments:

v6-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch (application/octet-stream, +172/-1)
v6-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream, +43/-5)
#15Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#14)
Re: IPC/MultixactCreation on the Standby server

On 18 Jul 2025, at 18:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.

Here's v7 with improved comments and a cross-check for correctness.
MultiXact wraparound is also handled.
I'm planning to prepare tests and fixes for all supported branches, if there are no objections to this approach.

Best regards, Andrey Borodin.

Attachments:

v7-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch (application/octet-stream, +172/-1)
v7-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream, +64/-5)
#16Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#15)
Re: IPC/MultixactCreation on the Standby server

On 21 Jul 2025, at 19:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I'm planning to prepare tests and fixes for all supported branches

This is a status update message. I've reproduced the problem on REL_13_STABLE and verified that the proposed fix works there.

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging a multi, any heap tuple that uses this multi will become unreadable: any attempt to read it will hang forever.

I've reproduced the problem and am now working on scripting this scenario. Basically, I modify the code to hang forever after assigning multi number 2. Then I execute in the first psql:

create table x as select i,0 v from generate_series(1,10) i;
create unique index on x(i);

\set id 1
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 1
commit;

\set id 2
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 2 -- will hang
commit;

Then in second psql:

create table y as select i,0 v from generate_series(1,10) i;
create unique index on y(i);

\set id 1
begin;
select * from y where i = :id for no key update;
savepoint s1;
update y set v = v+1 where i = :id;
commit;

After this I pkill -9 postgres. The recovered installation cannot execute select * from x; because multi 1 cannot be read without recovery of multi 2, which was never logged.

Luckily the fix is the same: just restore the offset of multi 2 when multi 1 is recovered.

Best regards, Andrey Borodin.

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#16)
Re: IPC/MultixactCreation on the Standby server

On 2025-Jul-25, Andrey Borodin wrote:

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script). I agree it's a pretty ugly bug! I think it's
essentially the same bug as the other problem, so the proposed fix
should solve both.

Thanks for working on this!

Looking at this,

/*
 * We want to avoid edge case 2 in redo, because we cannot wait for
 * startup process in GetMultiXactIdMembers() without risk of a
 * deadlock.
 */
MultiXactId next = multi + 1;
int next_pageno;

/* Handle wraparound as GetMultiXactIdMembers() does it. */
if (multi < FirstMultiXactId)
    multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?

In this bit,

* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind. Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why WAL-logging
isn't necessary ... until I remembered that we're in a standby. I
think a simple comment here like "no WAL-logging because we're a
standby" should suffice.)

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#18Andrey Borodin
amborodin@acm.org
In reply to: Alvaro Herrera (#17)
Re: IPC/MultixactCreation on the Standby server

On 26 Jul 2025, at 22:44, Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Jul-25, Andrey Borodin wrote:

Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script).

Yes, I explained a bit incorrectly, but you got the problem correctly.

Looking at this,

/*
 * We want to avoid edge case 2 in redo, because we cannot wait for
 * startup process in GetMultiXactIdMembers() without risk of a
 * deadlock.
 */
MultiXactId next = multi + 1;
int next_pageno;

/* Handle wraparound as GetMultiXactIdMembers() does it. */
if (multi < FirstMultiXactId)
    multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?

Yup, that was a typo.
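For reference, the corrected hunk would presumably test and advance 'next' like this. This is a standalone, compilable rendering of just the wraparound logic; the constants match their definitions in access/multixact.h, and the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

/* Constants as defined in access/multixact.h. */
#define InvalidMultiXactId ((MultiXactId) 0)
#define FirstMultiXactId   ((MultiXactId) 1)

/* Corrected version of the quoted hunk: the wraparound test must apply
 * to 'next', not 'multi'. MultiXactId is a 32-bit counter, so multi + 1
 * can wrap to InvalidMultiXactId (0), which must be skipped over. */
MultiXactId next_multixact(MultiXactId multi)
{
    MultiXactId next = multi + 1;

    /* Handle wraparound as GetMultiXactIdMembers() does it. */
    if (next < FirstMultiXactId)
        next = FirstMultiXactId;
    return next;
}
```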

In this bit,

* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind.

Ugh, that uncovered a 17+ problem: there we hold a couple of locks simultaneously. I'll post a version with this a bit later.

Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why isn't
WAL-logging necessary ... until I remember that we're in a standby. I
think a simple comment here like "no WAL-logging because we're a
standby" should suffice.)

Agreed.

I've made a test [0] and discovered another problem. Adding this checkpoint breaks the test [1] even after the fix [2].
I suspect that excluding "edge case 2" on the standby is simply not enough; we have to do this "next offset" dance on the primary too. I'll think more about other options.

Best regards, Andrey Borodin.
[0]: https://github.com/x4m/postgres_g/commit/eafcaec7aafde064b0da5d2ba4041ed2fb134f07
[1]: https://github.com/x4m/postgres_g/commit/da762c7cac56eff1988ea9126171ca0a6d2665e9
[2]: https://github.com/x4m/postgres_g/commit/d64c17d697d082856e5fe8bd52abafc0585973af

A timeline of these commits can be seen here: https://github.com/x4m/postgres_g/commits/mx19/

#19Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#18)
Re: IPC/MultixactCreation on the Standby server

On 27 Jul 2025, at 16:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

we have to do this "next offset" dance on Primary too.

PFA a draft of this.
I also attach a version for PG17; maybe Dmitry could try to reproduce the problem with this patch applied. I believe the patch should fix the problem.

Thanks!

Best regards, Andrey Borodin.

Attachments:

v8-PG17-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream, +62/-33)
v8-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream, +131/-143)
#20Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#19)
Re: IPC/MultixactCreation on the Standby server

On 28.07.2025 15:49, Andrey Borodin wrote:

I also attach a version for PG17, maybe Dmitry could try to reproduce the problem with this patch.

Andrey, thank you very much for your work, and thanks also to Álvaro for
joining the discussion. I ran tests on PG17 with patch v8; there are no
more sessions hanging on the replica, great! Replica queries are now
canceled with recovery conflicts:

    ERROR: canceling statement due to conflict with recovery
    DETAIL: User was holding shared buffer pin for too long.
    STATEMENT: select sum(val) from tbl2;

or

    ERROR: canceling statement due to conflict with recovery
    DETAIL: User query might have needed to see row versions that must be removed.
    STATEMENT: select sum(val) from tbl2;

But on the master, some requests then fail with an error; apparently
invalid multixacts remain in the pages:

    ERROR: MultiXact 81926 has invalid next offset
    STATEMENT: select * from tbl2 where id = $1 for no key update;

    ERROR: MultiXact 81941 has invalid next offset
    CONTEXT: while scanning block 3 offset 244 of relation "public.tbl2"
    automatic vacuum of table "postgres.public.tbl2"

Best regards, Dmitry.

#21Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#19)
#22Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andrey Borodin (#11)
#23Andrey Borodin
amborodin@acm.org
In reply to: Dmitry (#21)
#24Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#23)
#25Dmitry
dsy.075@yandex.ru
In reply to: Andrey Borodin (#24)
#26Kirill Reshke
reshkekirill@gmail.com
In reply to: Andrey Borodin (#24)
#27Andrey Borodin
amborodin@acm.org
In reply to: Kirill Reshke (#26)
#28Bykov Ivan
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#27)
#29Andrey Borodin
amborodin@acm.org
In reply to: Bykov Ivan (#28)
#30Bykov Ivan
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#29)
#31Andrey Borodin
amborodin@acm.org
In reply to: Bykov Ivan (#30)
#32Bykov Ivan
i.bykov@modernsys.ru
In reply to: Andrey Borodin (#31)
#33Bykov Ivan
i.bykov@modernsys.ru
In reply to: Bykov Ivan (#32)
#34Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bykov Ivan (#33)
#35Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#34)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#35)
#37Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#36)
#38Chao Li
li.evan.chao@gmail.com
In reply to: Heikki Linnakangas (#36)
#39Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#37)
#40Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Chao Li (#38)
#41Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#36)
#42Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#41)
#43Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#42)
#44Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#42)
#45Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#44)
#46Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#45)
#47Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#46)
#48Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#47)
#49Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#48)
#50Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#49)
#51Dmitry
dsy.075@yandex.ru
In reply to: Heikki Linnakangas (#50)
#52Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dmitry (#51)
#53Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#47)
#54Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#52)
#55Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#54)
#56Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#55)
#57Maxim Orlov
orlovmg@gmail.com
In reply to: Heikki Linnakangas (#45)
#58Andrey Borodin
amborodin@acm.org
In reply to: Maxim Orlov (#57)
#59Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#54)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andrey Borodin (#58)
#61Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#59)