IPC/MultixactCreation on the Standby server
Hi, hackers
The problem is as follows.
A replication cluster includes a primary server and one hot-standby replica.
The workload on the primary server consists of multiple queries that
generate multixact IDs, while the hot-standby replica serves read-only
queries.
After some time, all requests on the hot-standby are stuck and never get
finished.
The `pg_stat_activity` view on the replica shows the processes stuck
waiting on IPC/MultixactCreation.
pg_cancel_backend and pg_terminate_backend cannot cancel the queries;
SIGQUIT is the only way to stop them.
We tried:
- changing the `autovacuum_multixact_freeze_max_age` parameter,
- increasing `multixact_member_buffers` and `multixact_offset_buffers`,
- disabling `hot_standby_feedback`,
- switching the replica to synchronous and asynchronous mode,
- and much more.
But nothing helped.
We also ran the replica in recovery mode from a WAL archive, i.e. as a
warm standby; the result was the same.
We also tried building from sources based on the REL_17_5 branch with the
default configure settings:
./configure
make
make install
but had no luck.
Here is an example with a synthetic workload reproducing the problem.
Test system
===========
- Architecture: x86_64
- OS: Ubuntu 24.04.2 LTS (Noble Numbat)
- Tested postgres version(s):
- latest 17 (17.5)
- latest 18 (18-beta1)
The problem is not reproducible on PostgreSQL 16.9.
Steps to reproduce
==================
postgres=# create table tbl (
id int primary key,
val int
);
postgres=# insert into tbl select i, 0 from generate_series(1,5) i;
The first two scripts execute queries on the primary server
------------------------------------------------------------
pgbench --no-vacuum --report-per-command -M prepared -c 200 -j 200 \
  -T 300 -P 1 --file=/dev/stdin <<'EOF'
\set id random(1, 5)
begin;
select * from tbl where id = :id for key share;
commit;
EOF
pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 \
  -T 300 -P 1 --file=/dev/stdin <<'EOF'
\set id random(1, 5)
begin;
update tbl set val = val+1 where id = :id;
\sleep 10 ms
commit;
EOF
The following script is executed on the replica
-----------------------------------------------
pgbench --no-vacuum --report-per-command -M prepared -c 100 -j 100 \
  -T 300 -P 1 --file=/dev/stdin <<'EOF'
begin;
select sum(val) from tbl;
\sleep 10 ms
select sum(val) from tbl;
\sleep 10 ms
commit;
EOF
pgbench (17.5 (Ubuntu 17.5-1.pgdg24.04+1))
progress: 1.0 s, 2606.8 tps, lat 33.588 ms stddev 13.316, 0 failed
progress: 2.0 s, 3315.0 tps, lat 30.174 ms stddev 5.933, 0 failed
progress: 3.0 s, 3357.0 tps, lat 29.699 ms stddev 5.541, 0 failed
progress: 4.0 s, 3350.0 tps, lat 29.911 ms stddev 5.311, 0 failed
progress: 5.0 s, 3206.0 tps, lat 30.999 ms stddev 6.343, 0 failed
progress: 6.0 s, 3264.0 tps, lat 30.828 ms stddev 6.389, 0 failed
progress: 7.0 s, 3224.0 tps, lat 31.099 ms stddev 6.197, 0 failed
progress: 8.0 s, 3168.0 tps, lat 31.486 ms stddev 6.940, 0 failed
progress: 9.0 s, 3118.0 tps, lat 32.004 ms stddev 6.546, 0 failed
progress: 10.0 s, 3017.0 tps, lat 33.183 ms stddev 7.971, 0 failed
progress: 11.0 s, 3157.0 tps, lat 31.697 ms stddev 6.624, 0 failed
progress: 12.0 s, 3180.0 tps, lat 31.415 ms stddev 6.310, 0 failed
progress: 13.0 s, 3150.9 tps, lat 31.591 ms stddev 6.280, 0 failed
progress: 14.0 s, 3329.0 tps, lat 30.189 ms stddev 5.792, 0 failed
progress: 15.0 s, 3233.6 tps, lat 30.852 ms stddev 5.723, 0 failed
progress: 16.0 s, 3185.4 tps, lat 31.378 ms stddev 6.383, 0 failed
progress: 17.0 s, 3035.0 tps, lat 32.920 ms stddev 7.390, 0 failed
progress: 18.0 s, 3173.0 tps, lat 31.547 ms stddev 6.390, 0 failed
progress: 19.0 s, 3077.0 tps, lat 32.427 ms stddev 6.634, 0 failed
progress: 20.0 s, 3266.1 tps, lat 30.740 ms stddev 5.842, 0 failed
progress: 21.0 s, 2990.9 tps, lat 33.353 ms stddev 7.019, 0 failed
progress: 22.0 s, 3048.1 tps, lat 32.933 ms stddev 6.951, 0 failed
progress: 23.0 s, 3148.0 tps, lat 31.769 ms stddev 6.077, 0 failed
progress: 24.0 s, 1523.2 tps, lat 30.029 ms stddev 5.093, 0 failed
progress: 25.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 26.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 27.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 28.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 29.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 30.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 31.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 32.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 33.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 34.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 35.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
After some time, all requests on the replica hang waiting for
IPC/MultixactCreation.
Output from `pg_stat_activity`
------------------------------
    backend_type    | state  | wait_event_type |    wait_event     |           query
--------------------+--------+-----------------+-------------------+---------------------------
 client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
 ...
 client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
 client backend     | active | IPC             | MultixactCreation | select sum(val) from tbl;
 startup            |        | LWLock          | BufferContent     |
 checkpointer       |        | Activity        | CheckpointerMain  |
 background writer  |        | Activity        | BgwriterHibernate |
 walreceiver        |        | Activity        | WalReceiverMain   |
gdb session for `client backend` process
----------------------------------------
(gdb) bt
#0 0x00007f0e9872a007 in epoll_wait (epfd=5,
events=0x57c4747fc458, maxevents=1, timeout=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x000057c440685033 in WaitEventSetWaitBlock (nevents=<optimized
out>, occurred_events=0x7ffdaedc8360, cur_timeout=-1, set=0x57c4747fc3f0)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1577
#2 WaitEventSetWait (set=0x57c4747fc3f0, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ffdaedc8360,
nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217765)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:1525
#3 0x000057c44068541c in WaitLatch (latch=<optimized out>,
wakeEvents=<optimized out>, timeout=<optimized out>,
wait_event_info=134217765)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/ipc/latch.c:538
#4 0x000057c44068d8c0 in ConditionVariableTimedSleep
(cv=0x7f0cefc50ab0, timeout=-1, wait_event_info=134217765)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:163
#5 0x000057c440365a0c in ConditionVariableSleep
(wait_event_info=134217765, cv=<optimized out>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/condition_variable.c:98
#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0,
from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
#7 0x000057c4408adc6b in MultiXactIdGetUpdateXid.isra.0
(xmax=xmax@entry=45559845, t_infomask=<optimized out>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7478
#8 0x000057c44031ecfa in HeapTupleGetUpdateXid (tuple=<error
reading variable: Cannot access memory at address 0x0>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:7519
#9 HeapTupleSatisfiesMVCC (htup=<optimized out>, buffer=404,
snapshot=0x57c474892ff0) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1090
#10 HeapTupleSatisfiesVisibility (htup=<optimized out>,
snapshot=0x57c474892ff0, buffer=404) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam_visibility.c:1772
#11 0x000057c44030c1cb in page_collect_tuples
(check_serializable=<optimized out>, all_visible=<optimized out>,
lines=<optimized out>, block=<optimized out>, buffer=<optimized out>,
page=<optimized out>,
snapshot=<optimized out>, scan=<optimized out>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:480
#12 heap_prepare_pagescan (sscan=0x57c47495b970) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:579
#13 0x000057c44030cb59 in heapgettup_pagemode
(scan=scan@entry=0x57c47495b970, dir=<optimized out>, nkeys=<optimized
out>, key=<optimized out>)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:999
#14 0x000057c44030d1bd in heap_getnextslot (sscan=0x57c47495b970,
direction=<optimized out>, slot=0x57c47494b278) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:1319
#15 0x000057c4404f090a in table_scan_getnextslot
(slot=0x57c47494b278, direction=ForwardScanDirection, sscan=<optimized out>)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/tableam.h:1072
#16 SeqNext (node=0x57c47494b0e8) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeSeqscan.c:80
#17 0x000057c4404d5cfc in ExecProcNode (node=0x57c47494b0e8) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
#18 fetch_input_tuple (aggstate=aggstate@entry=0x57c47494aaf0) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:561
#19 0x000057c4404d848a in agg_retrieve_direct
(aggstate=0x57c47494aaf0) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2459
#20 ExecAgg (pstate=0x57c47494aaf0) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/nodeAgg.c:2179
#21 0x000057c4404c2003 in ExecProcNode (node=0x57c47494aaf0) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/executor/executor.h:274
#22 ExecutePlan (dest=0x57c47483d548, direction=<optimized out>,
numberTuples=0, sendTuples=true, operation=CMD_SELECT,
queryDesc=0x57c474895010)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:1649
#23 standard_ExecutorRun (queryDesc=0x57c474895010,
direction=<optimized out>, count=0, execute_once=<optimized out>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/executor/execMain.c:361
...
gdb session for `startup` process
---------------------------------
(gdb) bt
#0 0x00007f0e98698ce3 in __futex_abstimed_wait_common64
(private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0,
futex_word=0x7f0ceb34e6b8) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized
out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7f0ceb34e6b8) at
./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64
(futex_word=futex_word@entry=0x7f0ceb34e6b8, expected=expected@entry=0,
clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>)
at ./nptl/futex-internal.c:139
#3 0x00007f0e986a4f1f in do_futex_wait
(sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at
./nptl/sem_waitcommon.c:111
#4 0x00007f0e986a4fb8 in __new_sem_wait_slow64
(sem=sem@entry=0x7f0ceb34e6b8, abstime=0x0, clockid=0) at
./nptl/sem_waitcommon.c:183
#5 0x00007f0e986a503d in __new_sem_wait
(sem=sem@entry=0x7f0ceb34e6b8) at ./nptl/sem_wait.c:42
#6 0x000057c440696166 in PGSemaphoreLock (sema=0x7f0ceb34e6b8) at
port/pg_sema.c:327
#7 LWLockAcquire (lock=0x7f0cefc58064, mode=LW_EXCLUSIVE) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/lmgr/lwlock.c:1289
#8 0x000057c44038f96a in LockBuffer (mode=2, buffer=<optimized
out>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/storage/buffer/bufmgr.c:5147
#9 XLogReadBufferForRedoExtended (record=<optimized out>,
block_id=<optimized out>, mode=RBM_NORMAL, get_cleanup_lock=false,
buf=0x7ffdaedc8b4c)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:429
#10 0x000057c440319969 in XLogReadBufferForRedo
(buf=0x7ffdaedc8b4c, block_id=0 '\000', record=0x57c4748994d8) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogutils.c:317
#11 heap_xlog_lock_updated (record=0x57c4748994d8) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10230
#12 heap2_redo (record=0x57c4748994d8) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/heap/heapam.c:10362
#13 0x000057c44038e1d2 in ApplyWalRecord (replayTLI=<synthetic
pointer>, record=0x7f0e983908e0, xlogreader=<optimized out>)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/include/access/xlog_internal.h:380
#14 PerformWalRecovery () at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlogrecovery.c:1822
#15 0x000057c44037bbf6 in StartupXLOG () at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/xlog.c:5821
#16 0x000057c4406155ed in StartupProcessMain
(startup_data=<optimized out>, startup_data_len=<optimized out>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/startup.c:258
#17 0x000057c44060b376 in postmaster_child_launch
(child_type=B_STARTUP, startup_data=0x0, startup_data_len=0,
client_sock=0x0)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/launch_backend.c:277
#18 0x000057c440614509 in postmaster_child_launch (client_sock=0x0,
startup_data_len=0, startup_data=0x0, child_type=B_STARTUP)
at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3934
#19 StartChildProcess (type=type@entry=B_STARTUP) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3930
#20 0x000057c44061480d in PostmasterStateMachine () at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:3392
#21 0x000057c4408a3455 in process_pm_child_exit () at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:2683
#22 ServerLoop.isra.0 () at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1667
#23 0x000057c440616965 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/postmaster/postmaster.c:1374
#24 0x000057c4402bcd2d in main (argc=17, argv=0x57c4747fb140) at
/usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/main/main.c:199
Could you please help me fix the problem of the stuck 'client backend'
processes?
Any ideas and recommendations would be much appreciated!
Best regards,
Dmitry
On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:
#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
Hi Dmitry!
This looks to be related to work in my thread about multixacts [0]. It seems that the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.
If so, any subsequent multixact redo from WAL should unblock reading of the last MultiXact.
Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.
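For context, the wait in question looks roughly like this (a simplified sketch of the corner-case-2 path in GetMultiXactIdMembers(); the guard variable is paraphrased, the rest follows the code quoted in the attached patch):

retry:
	/* ... read this multixact's offset, then the next multixact's offset ... */
	if (nextOffset == 0)
	{
		/* Corner case 2: next multixact is still being filled in */
		LWLockRelease(lock);
		CHECK_FOR_INTERRUPTS();

		/*
		 * Sleep until whoever is filling in the next multixact -- the
		 * creating backend on the primary, or WAL redo on a standby --
		 * calls ConditionVariableBroadcast(&MultiXactState->nextoff_cv).
		 */
		ConditionVariableSleep(&MultiXactState->nextoff_cv,
							   WAIT_EVENT_MULTIXACT_CREATION);
		slept = true;
		goto retry;
	}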
Can you please check your reproduction with the patch attached to this message? It simply adds a timeout to the CV sleep, so in the worst case we fall back to the behavior of PG 16.
Best regards, Andrey Borodin.
Attachments:
0001-Make-next-multixact-sleep-timed.patch
From 42da5b61d38a3fd6578ef72d1a27b74e445c2912 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 25 Jun 2025 14:34:06 +0500
Subject: [PATCH] Make next multixact sleep timed
---
src/backend/access/transam/multixact.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..66d98ae0587 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1480,7 +1480,7 @@ retry:
LWLockRelease(lock);
CHECK_FOR_INTERRUPTS();
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
+ ConditionVariableTimedSleep(&MultiXactState->nextoff_cv, 1000,
WAIT_EVENT_MULTIXACT_CREATION);
slept = true;
goto retry;
--
2.39.5 (Apple Git-154)
On 25.06.2025 12:34, Andrey Borodin wrote:
On 25 Jun 2025, at 11:11, Dmitry <dsy.075@yandex.ru> wrote:
#6 GetMultiXactIdMembers (multi=45559845, members=0x7ffdaedc84b0, from_pgupgrade=<optimized out>, isLockOnly=<optimized out>)
at /usr/src/postgresql-17-17.5-1.pgdg24.04+1/build/../src/backend/access/transam/multixact.c:1483
Hi Dmitry!
This looks to be related to work in my thread about multixacts [0]. It seems that the CV sleep in /* Corner case 2: next multixact is still being filled in */ is not woken up by ConditionVariableBroadcast(&MultiXactState->nextoff_cv) from WAL redo.
If so, any subsequent multixact redo from WAL should unblock reading of the last MultiXact.
Either way, the redo path might not be going through ConditionVariableBroadcast(). I will investigate this further.
Can you please check your reproduction with the patch attached to this message? It simply adds a timeout to the CV sleep, so in the worst case we fall back to the behavior of PG 16.
Best regards, Andrey Borodin.
Hi Andrey!
Thanks so much for your response.
A small comment on /* Corner case 2: ... */
At this point in the code, I tried to set trace points by printing
messages with `elog()`,
and I can say that the process does not always get stuck in this part of
the code; it happens from time to time, in an unpredictable way.
Maybe this will help you a little.
To be honest, PostgreSQL performance is much better with this feature;
it would be a shame if we had to roll back to the behavior of version 16.
I will definitely try to reproduce the problem with your patch.
Best regards,
Dmitry.
On 25.06.2025 16:44, Dmitry wrote:
I will definitely try to reproduce the problem with your patch.
Hi Andrey!
I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try to get information about the multixact's members again, in an endless loop.
At the same time, WAL is not being replayed; the 'startup' process still hangs on LWLock/BufferContent.
Best regards,
Dmitry.
On 26 Jun 2025, at 14:33, Dmitry <dsy.075@yandex.ru> wrote:
On 25.06.2025 16:44, Dmitry wrote:
I will definitely try to reproduce the problem with your patch.
Hi Andrey!
I checked with the patch; unfortunately, the problem is still reproducible.
Client processes wake up after a second and try to get information about the multixact's members again, in an endless loop.
At the same time, WAL is not being replayed; the 'startup' process still hangs on LWLock/BufferContent.
My hypothesis is that MultiXactState->nextMXact is not filled in often enough from the redo paths. So if you are unlucky enough, a corner case 2 read can deadlock with the startup process.
I need to verify it further, but if so, it's an ancient bug that just happens to be a few orders of magnitude more reproducible on 17 due to the performance improvements. Still a hypothesis, though.
Best regards, Andrey Borodin.
On 26 Jun 2025, at 17:59, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
hypothesis
Dmitry, can you please retry your reproduction with the attached patch?
It should print nextMXact and tmpMXact. If my hypothesis is correct, nextMXact will precede tmpMXact.
Best regards, Andrey Borodin.
Attachments:
v2-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch
From c7b9a8f3904840a111e57fb6066c7169df0f40b3 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 25 Jun 2025 14:34:06 +0500
Subject: [PATCH v2] Make next multixact sleep timed with debug logging
---
src/backend/access/transam/multixact.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..b06324ea27a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1480,8 +1480,11 @@ retry:
LWLockRelease(lock);
CHECK_FOR_INTERRUPTS();
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
+ if (ConditionVariableTimedSleep(&MultiXactState->nextoff_cv, 1000,
+ WAIT_EVENT_MULTIXACT_CREATION))
+ {
+ elog(WARNING, "Timed out: nextMXact %u tmpMXact %u", nextMXact, tmpMXact);
+ }
slept = true;
goto retry;
}
--
2.39.5 (Apple Git-154)
On 26.06.2025 19:24, Andrey Borodin wrote:
If my hypothesis is correct nextMXact will precede tmpMXact.
It seems that the hypothesis has not been confirmed.
Attempt #1
2025-06-26 23:47:24.821 MSK [220458] WARNING: Timed out: nextMXact
24138381 tmpMXact 24138379
2025-06-26 23:47:24.822 MSK [220540] WARNING: Timed out: nextMXact
24138382 tmpMXact 24138379
2025-06-26 23:47:24.823 MSK [220548] WARNING: Timed out: nextMXact
24138382 tmpMXact 24138379
...
pgbench (17.5)
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 3.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 4.0 s, 482.2 tps, lat 820.293 ms stddev 1370.729, 0 failed
progress: 5.0 s, 886.0 tps, lat 112.463 ms stddev 8.506, 0 failed
progress: 6.0 s, 348.9 tps, lat 111.324 ms stddev 5.871, 0 failed
WARNING: Timed out: nextMXact 24138381 tmpMXact 24138379
WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379
WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379
...
progress: 7.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING: Timed out: nextMXact 24138382 tmpMXact 24138379
Attempt #2
2025-06-27 09:18:01.312 MSK [236187] WARNING: Timed out: nextMXact
24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236225] WARNING: Timed out: nextMXact
24497746 tmpMXact 24497744
2025-06-27 09:18:01.312 MSK [236178] WARNING: Timed out: nextMXact
24497746 tmpMXact 24497744
...
pgbench (17.5)
progress: 1.0 s, 830.9 tps, lat 108.556 ms stddev 10.078, 0 failed
progress: 2.0 s, 839.0 tps, lat 118.358 ms stddev 19.708, 0 failed
progress: 3.0 s, 623.4 tps, lat 134.186 ms stddev 15.565, 0 failed
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497747 tmpMXact 24497744
WARNING: Timed out: nextMXact 24497747 tmpMXact 24497744
...
progress: 4.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
WARNING: Timed out: nextMXact 24497746 tmpMXact 24497744
Best regards,
Dmitry.
On 27 Jun 2025, at 11:41, Dmitry <dsy.075@yandex.ru> wrote:
It seems that the hypothesis has not been confirmed.
Indeed.
For some reason your reproduction does not work for me.
I tried to create a test from your workload description. PFA patch with a very dirty prototype.
To run the test:
cd contrib/amcheck
PROVE_TESTS=t/006_MultiXact_standby.pl make check
To check whether the reproduction worked, look at tmp_check/log/006_MultiXact_standby_standby_1.log for messages like "Timed out: nextMXact %u tmpMXact %u".
If you could codify your reproduction into this TAP test, we could make it portable, so that I can debug the problem on my machine...
Either way, we can proceed with remote debugging via the mailing list :)
Thank you!
Best regards, Andrey Borodin.
Attachments:
v3-0001-Make-next-multixact-sleep-timed-with-debug-loggin.patch
From c7b9a8f3904840a111e57fb6066c7169df0f40b3 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 25 Jun 2025 14:34:06 +0500
Subject: [PATCH v3 1/2] Make next multixact sleep timed with debug logging
---
src/backend/access/transam/multixact.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..b06324ea27a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1480,8 +1480,11 @@ retry:
LWLockRelease(lock);
CHECK_FOR_INTERRUPTS();
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
+ if (ConditionVariableTimedSleep(&MultiXactState->nextoff_cv, 1000,
+ WAIT_EVENT_MULTIXACT_CREATION))
+ {
+ elog(WARNING, "Timed out: nextMXact %u tmpMXact %u", nextMXact, tmpMXact);
+ }
slept = true;
goto retry;
}
--
2.39.5 (Apple Git-154)
v3-0002-Test-concurrent-Multixact-reading-on-stadnby.patch
From b4bd33eb14c7220a23bfc4a497bcc3c7aa18d815 Mon Sep 17 00:00:00 2001
From: "Andrey M. Borodin" <x4mmm@flight.local>
Date: Sat, 12 Feb 2022 13:51:06 +0500
Subject: [PATCH v3 2/2] Test concurrent Multixact reading on stadnby
---
contrib/amcheck/t/006_MultiXact_standby.pl | 116 +++++++++++++++++++++
src/test/perl/PostgreSQL/Test/Cluster.pm | 48 +++++++++
2 files changed, 164 insertions(+)
create mode 100644 contrib/amcheck/t/006_MultiXact_standby.pl
diff --git a/contrib/amcheck/t/006_MultiXact_standby.pl b/contrib/amcheck/t/006_MultiXact_standby.pl
new file mode 100644
index 00000000000..4df77a3ea0f
--- /dev/null
+++ b/contrib/amcheck/t/006_MultiXact_standby.pl
@@ -0,0 +1,116 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Minimal test testing multixacts with streaming replication
+use strict;
+use warnings;
+use Config;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 4;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+# A specific role is created for replication purposes
+$node_primary->init(
+ allows_streaming => 1,
+ auth_extra => [ '--create-role', 'repl_role' ]);
+$node_primary->append_conf('postgresql.conf', 'lock_timeout = 180000');
+$node_primary->append_conf('postgresql.conf', 'max_connections = 500');
+$node_primary->start;
+my $backup_name = 'my_backup';
+
+# Take backup
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby_1->start;
+
+# Create some content on primary and check its presence in standby nodes
+$node_primary->safe_psql('postgres', q(create table tbl2 (
+ id int primary key,
+ val int
+);
+insert into tbl2 select i, 0 from generate_series(1,100000) i;
+));
+
+# Wait for standbys to catch up
+my $primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+#
+# Stress CIC with pgbench
+#
+
+# Run background pgbench with bt_index_check on standby
+my $pgbench_out = '';
+my $pgbench_timer = IPC::Run::timeout(180);
+my $pgbench_h = $node_standby_1->background_pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 100 -j 2 -T 300 -P 1',
+ {
+ '006_pgbench_standby_check_1' => q(
+ begin;
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ commit;
+ )
+ },
+ \$pgbench_out,
+ $pgbench_timer);
+
+# Run pgbench with data data manipulations and REINDEX on primary.
+# pgbench might try to launch more than one instance of the RIC
+# transaction concurrently. That would deadlock, so use an advisory
+# lock to ensure only one CIC runs at a time.
+$node_primary->pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 100 -j 2 -T 300 -P 1',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent updates',
+ {
+ '004_pgbench_updates' => q(
+ \set id random(1, 10000)
+ begin;
+ select * from tbl2 where id = :id for no key update;
+ \sleep 10 ms
+ savepoint s1;
+ update tbl2 set val = val+1 where id = :id;
+ \sleep 10 ms
+ commit;
+ )
+ });
+
+$pgbench_h->pump_nb;
+$pgbench_h->finish();
+my $result =
+ ($Config{osname} eq "MSWin32")
+ ? ($pgbench_h->full_results)[0]
+ : $pgbench_h->result(0);
+is($result, 0, "pgbench with bt_index_check() on standby works");
+
+# done
+$node_primary->stop;
+$node_standby_1->stop;
+done_testing();
\ No newline at end of file
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index f2d9afd398f..0cbf3b69628 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2387,6 +2387,54 @@ sub pgbench
=pod
+=item $node->background_pgbench($opts, $files, \$stdout, $timer) => harness
+
+Invoke B<pgbench> and return an IPC::Run harness object. The process's stdin
+is empty, and its stdout and stderr go to the $stdout scalar reference. This
+allows the caller to act on other parts of the system while B<pgbench> is
+running. Errors from B<pgbench> are the caller's problem.
+
+The specified timer object is attached to the harness, as well. It's caller's
+responsibility to select the timeout length, and to restart the timer after
+each command if the timeout is per-command.
+
+Be sure to "finish" the harness when done with it.
+
+=over
+
+=item $opts
+
+Options as a string to be split on spaces.
+
+=item $files
+
+Reference to filename/contents dictionary.
+
+=back
+
+=cut
+
+sub background_pgbench
+{
+ my ($self, $opts, $files, $stdout, $timer) = @_;
+
+ my @cmd =
+ ('pgbench', split(/\s+/, $opts), $self->_pgbench_make_files($files));
+
+ local %ENV = $self->_get_env();
+
+ my $stdin = "";
+ # IPC::Run would otherwise append to existing contents:
+ $$stdout = "" if ref($stdout);
+
+ my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+ $timer;
+
+ return $harness;
+}
+
+=pod
+
=item $node->connect_ok($connstr, $test_name, %params)
Attempt a connection with a custom connection string. This is expected
--
2.39.5 (Apple Git-154)
On 28 Jun 2025, at 00:37, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Indeed.
After some experiments I could get an unstable repro on my machine.
I've added some logging, and here is what I've found:
2025-06-28 23:03:40.598 +05 [40887] 006_MultiXact_standby.pl WARNING: Timed out: nextMXact 415832 tmpMXact 415827 pageno 203 prev_pageno 203 entryno 83 offptr[1] 831655 offptr[0] 0 offptr[-1] 831651
We are reading multi 415827-1, while 415827 is not filled in yet. But we are holding a buffer lock that prevents the next multi from being filled in.
This looks like a recovery conflict.
I'm somewhat surprised that 415827+1 is already filled in...
Can you please try your reproduction with the attached patch? It seems to fix the issue for me.
Best regards, Andrey Borodin.
Attachments:
v4-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch
From b4886faded7cc6b6629f2810cb0a8e5156f31635 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 25 Jun 2025 14:34:06 +0500
Subject: [PATCH v4] Make next multixact sleep timed with a recovery conflict
check
---
contrib/amcheck/t/006_MultiXact_standby.pl | 116 +++++++++++++++++++++
src/backend/access/transam/multixact.c | 14 ++-
src/test/perl/PostgreSQL/Test/Cluster.pm | 48 +++++++++
3 files changed, 176 insertions(+), 2 deletions(-)
create mode 100644 contrib/amcheck/t/006_MultiXact_standby.pl
diff --git a/contrib/amcheck/t/006_MultiXact_standby.pl b/contrib/amcheck/t/006_MultiXact_standby.pl
new file mode 100644
index 00000000000..4df77a3ea0f
--- /dev/null
+++ b/contrib/amcheck/t/006_MultiXact_standby.pl
@@ -0,0 +1,116 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Minimal test testing multixacts with streaming replication
+use strict;
+use warnings;
+use Config;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 4;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+# A specific role is created for replication purposes
+$node_primary->init(
+ allows_streaming => 1,
+ auth_extra => [ '--create-role', 'repl_role' ]);
+$node_primary->append_conf('postgresql.conf', 'lock_timeout = 180000');
+$node_primary->append_conf('postgresql.conf', 'max_connections = 500');
+$node_primary->start;
+my $backup_name = 'my_backup';
+
+# Take backup
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby_1->start;
+
+# Create some content on primary and check its presence in standby nodes
+$node_primary->safe_psql('postgres', q(create table tbl2 (
+ id int primary key,
+ val int
+);
+insert into tbl2 select i, 0 from generate_series(1,100000) i;
+));
+
+# Wait for standbys to catch up
+my $primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+#
+# Stress CIC with pgbench
+#
+
+# Run background pgbench with bt_index_check on standby
+my $pgbench_out = '';
+my $pgbench_timer = IPC::Run::timeout(180);
+my $pgbench_h = $node_standby_1->background_pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 100 -j 2 -T 300 -P 1',
+ {
+ '006_pgbench_standby_check_1' => q(
+ begin;
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ commit;
+ )
+ },
+ \$pgbench_out,
+ $pgbench_timer);
+
+# Run pgbench with data data manipulations and REINDEX on primary.
+# pgbench might try to launch more than one instance of the RIC
+# transaction concurrently. That would deadlock, so use an advisory
+# lock to ensure only one CIC runs at a time.
+$node_primary->pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 100 -j 2 -T 300 -P 1',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent updates',
+ {
+ '004_pgbench_updates' => q(
+ \set id random(1, 10000)
+ begin;
+ select * from tbl2 where id = :id for no key update;
+ \sleep 10 ms
+ savepoint s1;
+ update tbl2 set val = val+1 where id = :id;
+ \sleep 10 ms
+ commit;
+ )
+ });
+
+$pgbench_h->pump_nb;
+$pgbench_h->finish();
+my $result =
+ ($Config{osname} eq "MSWin32")
+ ? ($pgbench_h->full_results)[0]
+ : $pgbench_h->result(0);
+is($result, 0, "pgbench with bt_index_check() on standby works");
+
+# done
+$node_primary->stop;
+$node_standby_1->stop;
+done_testing();
\ No newline at end of file
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..e52605e61bb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1480,8 +1480,18 @@ retry:
LWLockRelease(lock);
CHECK_FOR_INTERRUPTS();
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
+ if (ConditionVariableTimedSleep(&MultiXactState->nextoff_cv, 1,
+ WAIT_EVENT_MULTIXACT_CREATION))
+ {
+ if (RecoveryInProgress() && !InRecovery)
+ {
+ CheckRecoveryConflictDeadlock();
+ }
+ else
+ {
+ Assert(false);
+ }
+ }
slept = true;
goto retry;
}
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index f2d9afd398f..0cbf3b69628 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2387,6 +2387,54 @@ sub pgbench
=pod
+=item $node->background_pgbench($opts, $files, \$stdout, $timer) => harness
+
+Invoke B<pgbench> and return an IPC::Run harness object. The process's stdin
+is empty, and its stdout and stderr go to the $stdout scalar reference. This
+allows the caller to act on other parts of the system while B<pgbench> is
+running. Errors from B<pgbench> are the caller's problem.
+
+The specified timer object is attached to the harness, as well. It's caller's
+responsibility to select the timeout length, and to restart the timer after
+each command if the timeout is per-command.
+
+Be sure to "finish" the harness when done with it.
+
+=over
+
+=item $opts
+
+Options as a string to be split on spaces.
+
+=item $files
+
+Reference to filename/contents dictionary.
+
+=back
+
+=cut
+
+sub background_pgbench
+{
+ my ($self, $opts, $files, $stdout, $timer) = @_;
+
+ my @cmd =
+ ('pgbench', split(/\s+/, $opts), $self->_pgbench_make_files($files));
+
+ local %ENV = $self->_get_env();
+
+ my $stdin = "";
+ # IPC::Run would otherwise append to existing contents:
+ $$stdout = "" if ref($stdout);
+
+ my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+ $timer;
+
+ return $harness;
+}
+
+=pod
+
=item $node->connect_ok($connstr, $test_name, %params)
Attempt a connection with a custom connection string. This is expected
--
2.39.5 (Apple Git-154)
On 28 Jun 2025, at 21:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
This seems to be fixing issue for me.
ISTM I was wrong: there is a possible recovery conflict with a snapshot.
REDO:
frame #2: 0x000000010179a0c8 postgres`pg_usleep(microsec=1000000) at pgsleep.c:50:10
frame #3: 0x000000010144c108 postgres`WaitExceedsMaxStandbyDelay(wait_event_info=134217772) at standby.c:248:2
frame #4: 0x000000010144a63c postgres`ResolveRecoveryConflictWithVirtualXIDs(waitlist=0x0000000126008200, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, wait_event_info=134217772, report_waiting=true) at standby.c:384:8
frame #5: 0x000000010144a4f4 postgres`ResolveRecoveryConflictWithSnapshot(snapshotConflictHorizon=1214, isCatalogRel=false, locator=(spcOid = 1663, dbOid = 5, relNumber = 16384)) at standby.c:490:2
frame #6: 0x0000000100e4d3f8 postgres`heap_xlog_prune_freeze(record=0x0000000135808e60) at heapam.c:9208:4
frame #7: 0x0000000100e4d204 postgres`heap2_redo(record=0x0000000135808e60) at heapam.c:10353:4
frame #8: 0x0000000100f1548c postgres`ApplyWalRecord(xlogreader=0x0000000135808e60, record=0x0000000138058060, replayTLI=0x000000016f0425b0) at xlogrecovery.c:1991:2
frame #9: 0x0000000100f13ff0 postgres`PerformWalRecovery at xlogrecovery.c:1822:4
frame #10: 0x0000000100ef7940 postgres`StartupXLOG at xlog.c:5821:3
frame #11: 0x0000000101364334 postgres`StartupProcessMain(startup_data=0x0000000000000000, startup_data_len=0) at startup.c:258:2
SELECT:
frame #10: 0x0000000102a14684 postgres`GetMultiXactIdMembers(multi=278, members=0x000000016d4f9498, from_pgupgrade=false, isLockOnly=false) at multixact.c:1493:6
frame #11: 0x0000000102991814 postgres`MultiXactIdGetUpdateXid(xmax=278, t_infomask=4416) at heapam.c:7478:13
frame #12: 0x0000000102985450 postgres`HeapTupleGetUpdateXid(tuple=0x00000001043e5c60) at heapam.c:7519:9
frame #13: 0x00000001029a0360 postgres`HeapTupleSatisfiesMVCC(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1090:10
frame #14: 0x000000010299fbc8 postgres`HeapTupleSatisfiesVisibility(htup=0x000000016d4f9590, snapshot=0x000000015b07b930, buffer=69) at heapam_visibility.c:1772:11
frame #15: 0x0000000102982954 postgres`page_collect_tuples(scan=0x000000014b009648, snapshot=0x000000015b07b930, page="", buffer=69, block=6, lines=228, all_visible=false, check_serializable=false) at heapam.c:480:12
page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. The tuple visibility check might need WAL replay to proceed: we have to wait until some next MX is filled in.
That replay might in turn need a buffer lock held by, or have a snapshot conflict with, the caller of page_collect_tuples().
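Schematically, this is my reading of the two backtraces above (an illustration only, not actual source; LockBuffer(SHARE/EXCLUSIVE) is shorthand):

/*
 * Reader backend (SELECT):                    Startup process (redo):
 *
 *  heap_prepare_pagescan()                     heap_xlog_lock_updated()
 *    LockBuffer(SHARE)                           XLogReadBufferForRedoExtended()
 *      LWLockAcquire() -> HOLD_INTERRUPTS()        LockBuffer(EXCLUSIVE)
 *    page_collect_tuples()                           -> blocks on the content lock,
 *      HeapTupleSatisfiesMVCC()                         plausibly the very buffer the
 *        MultiXactIdGetUpdateXid()                      reader has share-locked
 *          GetMultiXactIdMembers()
 *            ConditionVariableSleep(nextoff_cv)
 *              -> waits for the *next* multixact to be replayed,
 *                 which the blocked startup process never reaches
 *
 * The snapshot-conflict variant is analogous: the startup process waits for
 * the reader to be cancelled, but the reader cannot service the cancel while
 * interrupts are held off.
 */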
Please find attached a dirty test; it reproduces the problem on my machine (a startup deadlock, so when it reproduces the test takes 180s, normally passing in 10s).
Also, there is a fix: checking for recovery conflicts when falling back in the corner case 2 MX read.
I do not feel comfortable with processing interrupts while InterruptHoldoffCount > 0, so I need help from someone more knowledgeable about our interrupts machinery to tell me whether what I'm proposing is OK. (Álvaro?)
Also, I've modified the code to make the race condition more reproducible:
multi = GetNewMultiXactId(nmembers, &offset);
/* random sleep, to make the WAL record order differ from the order of multixact usage on pages */
if (rand() % 2 == 0)
    pg_usleep(1000);
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
Perhaps I can build a quick injection-points test if we want one.
Best regards, Andrey Borodin.
Attachments:
v5-0001-Make-next-multixact-sleep-timed-with-a-recovery-c.patch
From 7fc3097442f7bfcb3c4baa890b02bed433f7320d Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 25 Jun 2025 14:34:06 +0500
Subject: [PATCH v5] Make next multixact sleep timed with a recovery conflict
check
---
contrib/amcheck/t/006_MultiXact_standby.pl | 121 +++++++++++++++++++++
src/backend/access/transam/multixact.c | 17 ++-
src/backend/storage/ipc/standby.c | 1 +
src/backend/tcop/postgres.c | 8 +-
src/test/perl/PostgreSQL/Test/Cluster.pm | 48 ++++++++
5 files changed, 191 insertions(+), 4 deletions(-)
create mode 100644 contrib/amcheck/t/006_MultiXact_standby.pl
diff --git a/contrib/amcheck/t/006_MultiXact_standby.pl b/contrib/amcheck/t/006_MultiXact_standby.pl
new file mode 100644
index 00000000000..2cdd0d93be3
--- /dev/null
+++ b/contrib/amcheck/t/006_MultiXact_standby.pl
@@ -0,0 +1,121 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Minimal test testing multixacts with streaming replication
+use strict;
+use warnings;
+use Config;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 4;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+# A specific role is created for replication purposes
+$node_primary->init(
+ allows_streaming => 1,
+ auth_extra => [ '--create-role', 'repl_role' ]);
+$node_primary->append_conf('postgresql.conf', 'lock_timeout = 180000');
+$node_primary->append_conf('postgresql.conf', 'max_connections = 500');
+$node_primary->start;
+my $backup_name = 'my_backup';
+
+# Take backup
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby_1->start;
+
+# Create some content on primary and check its presence in standby nodes
+$node_primary->safe_psql('postgres', q(create table tbl2 (
+ id int primary key,
+ val int
+);
+insert into tbl2 select i, 0 from generate_series(1,100000) i;
+));
+
+# Wait for standbys to catch up
+my $primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+#
+# Stress CIC with pgbench
+#
+
+# Run background pgbench with bt_index_check on standby
+my $pgbench_out = '';
+my $pgbench_timer = IPC::Run::timeout(180);
+my $pgbench_h = $node_standby_1->background_pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 10 -j 2 -T 10',
+ {
+ '006_pgbench_standby_check_1' => q(
+ begin;
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ commit;
+ )
+ },
+ \$pgbench_out,
+ $pgbench_timer);
+
+# Run pgbench with data data manipulations and REINDEX on primary.
+# pgbench might try to launch more than one instance of the RIC
+# transaction concurrently. That would deadlock, so use an advisory
+# lock to ensure only one CIC runs at a time.
+$node_primary->pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 10 -j 2 -T 10',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent updates',
+ {
+ '004_pgbench_updates' => q(
+ \set id random(1, 10000)
+ begin;
+ select * from tbl2 where id = :id for no key update;
+ \sleep 10 ms
+ savepoint s1;
+ update tbl2 set val = val+1 where id = :id;
+ \sleep 10 ms
+ commit;
+ )
+ });
+
+$pgbench_h->pump_nb;
+$pgbench_h->finish();
+my $result =
+ ($Config{osname} eq "MSWin32")
+ ? ($pgbench_h->full_results)[0]
+ : $pgbench_h->result(0);
+is($result, 0, "pgbench with bt_index_check() on standby works");
+
+
+# Check that no deadlock occured
+$primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+# done
+$node_primary->stop;
+$node_standby_1->stop;
+done_testing();
\ No newline at end of file
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..8b4707a60bf 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -883,6 +883,9 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
XLogRegisterData((char *) (&xlrec), SizeOfMultiXactCreate);
XLogRegisterData((char *) members, nmembers * sizeof(MultiXactMember));
+ if (rand()%2 == 0)
+ pg_usleep(1000);
+
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
/* Now enter the information into the OFFSETs and MEMBERs logs */
@@ -1262,6 +1265,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
return result;
}
+void CheckReoveryInterrupts(void);
+
/*
* GetMultiXactIdMembers
* Return the set of MultiXactMembers that make up a MultiXactId
@@ -1478,10 +1483,16 @@ retry:
{
/* Corner case 2: next multixact is still being filled in */
LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
+ if (ConditionVariableTimedSleep(&MultiXactState->nextoff_cv, 1,
+ WAIT_EVENT_MULTIXACT_CREATION))
+ {
+ if (RecoveryInProgress() && !InRecovery)
+ {
+ CheckReoveryInterrupts();
+ CheckRecoveryConflictDeadlock();
+ }
+ }
slept = true;
goto retry;
}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 872679ca447..8da38429996 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -390,6 +390,7 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
*/
Assert(VirtualTransactionIdIsValid(*waitlist));
pid = CancelVirtualTransaction(*waitlist, reason);
+ elog(WARNING, "Cancelling pid %d", pid);
/*
* Wait a little bit for it to die so that we avoid flooding
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 9cd1d0abe35..cccf4fec033 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3237,7 +3237,7 @@ ProcessRecoveryConflictInterrupts(void)
* us.
*/
Assert(!proc_exit_inprogress);
- Assert(InterruptHoldoffCount == 0);
+ //Assert(InterruptHoldoffCount == 0);
Assert(RecoveryConflictPending);
RecoveryConflictPending = false;
@@ -3254,6 +3254,12 @@ ProcessRecoveryConflictInterrupts(void)
}
}
+void CheckReoveryInterrupts(void)
+{
+ if (RecoveryConflictPending)
+ ProcessRecoveryConflictInterrupts();
+}
+
/*
* ProcessInterrupts: out-of-line portion of CHECK_FOR_INTERRUPTS() macro
*
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index f2d9afd398f..0cbf3b69628 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2387,6 +2387,54 @@ sub pgbench
=pod
+=item $node->background_pgbench($opts, $files, \$stdout, $timer) => harness
+
+Invoke B<pgbench> and return an IPC::Run harness object. The process's stdin
+is empty, and its stdout and stderr go to the $stdout scalar reference. This
+allows the caller to act on other parts of the system while B<pgbench> is
+running. Errors from B<pgbench> are the caller's problem.
+
+The specified timer object is attached to the harness, as well. It's caller's
+responsibility to select the timeout length, and to restart the timer after
+each command if the timeout is per-command.
+
+Be sure to "finish" the harness when done with it.
+
+=over
+
+=item $opts
+
+Options as a string to be split on spaces.
+
+=item $files
+
+Reference to filename/contents dictionary.
+
+=back
+
+=cut
+
+sub background_pgbench
+{
+ my ($self, $opts, $files, $stdout, $timer) = @_;
+
+ my @cmd =
+ ('pgbench', split(/\s+/, $opts), $self->_pgbench_make_files($files));
+
+ local %ENV = $self->_get_env();
+
+ my $stdin = "";
+ # IPC::Run would otherwise append to existing contents:
+ $$stdout = "" if ref($stdout);
+
+ my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+ $timer;
+
+ return $harness;
+}
+
+=pod
+
=item $node->connect_ok($connstr, $test_name, %params)
Attempt a connection with a custom connection string. This is expected
--
2.39.5 (Apple Git-154)
On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
page_collect_tuples() holds a lock on the buffer while examining tuple visibility, with InterruptHoldoffCount > 0. The tuple visibility check might need WAL replay to proceed: we have to wait until some next MX is filled in.
That replay might in turn need a buffer lock held by, or have a snapshot conflict with, the caller of page_collect_tuples().
Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. Check for recovery conflicts even in the presence of InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??
Personally, I see option 2 as very invasive in code that I'm not too familiar with. Option 1 is clumsy. But option 3 is a giant system-wide change.
Yet, I see 3 as the correct solution. Can't we just abstain from HOLD_INTERRUPTS() when the LWLock being taken is not exclusive?
Best regards, Andrey Borodin.
On 2025-Jul-17, Andrey Borodin wrote:
Thinking more about the problem, I see 3 ways to deal with this deadlock:
1. Check for recovery conflicts even in the presence of
InterruptHoldoffCount. That's what patch v4 does.
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility()
without holding the buffer lock.
3. Why do we even HOLD_INTERRUPTS() when acquiring a shared lock??
Hmm, as you say, doing (3) is a very invasive system-wide change, but
can we do it more localized? I mean, what if we do RESUME_INTERRUPTS()
just before going to sleep on the CV, and restore with HOLD_INTERRUPTS()
once the sleep is done? That would only affect this one place rather
than the whole system, and should also (AFAICS) solve the issue.
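Concretely, something along these lines in the corner-case-2 branch (an
untested sketch of that localized change, reusing the names from the code
quoted upthread):

		/* Corner case 2: next multixact is still being filled in */
		LWLockRelease(lock);

		/*
		 * Re-allow interrupts for the duration of the sleep (they were
		 * held off by the LWLockAcquire() in LockBuffer() further up the
		 * stack), so that a recovery-conflict cancel can break the wait,
		 * then put the holdoff back before retrying.
		 */
		RESUME_INTERRUPTS();
		CHECK_FOR_INTERRUPTS();
		ConditionVariableSleep(&MultiXactState->nextoff_cv,
							   WAIT_EVENT_MULTIXACT_CREATION);
		HOLD_INTERRUPTS();

		slept = true;
		goto retry;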
Yet, I see 3 as the correct solution. Can't we just abstain from
HOLD_INTERRUPTS() when the LWLock being taken is not exclusive?
Hmm, the code in LWLockAcquire says
/*
* Lock out cancel/die interrupts until we exit the code section protected
* by the LWLock. This ensures that interrupts will not interfere with
* manipulations of data structures in shared memory.
*/
HOLD_INTERRUPTS();
which means if we want to change this, we would have to inspect every
single use of LWLocks in shared mode in order to be certain that such a
change isn't problematic. This is a discussion I'm not prepared for.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Si quieres ser creativo, aprende el arte de perder el tiempo"
Hello,
Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.
One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.
Anyway, he's going to try and implement this.
Andrey, please let me know if I misunderstood the idea.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 18 Jul 2025, at 16:53, Álvaro Herrera <alvherre@kurilemu.de> wrote:
Hello,
Andrey and I discussed this on IM, and after some back and forth, he
came up with a brilliant idea: modify the WAL record for multixact
creation, so that the offset of the next multixact is transmitted and
can be replayed. (We know it when we create each multixact, because the
number of members is known). So the replica can store the offset of the
next multixact right away, even though it doesn't know the members for
that multixact. On replay of the next multixact we can cross-check that
the offset matches what we had written previously. This allows reading
the first multixact, without having to wait for the replay of creation
of the second multixact.
One concern is: if we write the offset for the second mxact, but haven't
written its members, what happens if another process looks up the
members for that multixact? We'll have to make it wait (retry) somehow.
Given what was described upthread, it's possible for the multixact
beyond that one to be written already, so we won't have the zero offset
that would make us wait.
We always redo multixact creation before the multixact is visible anywhere on the heap.
The problem was that, to read a multi, we might need another multi's offset, and that other multi did not happen to be WAL-logged yet.
However, I think we never need to read a multi before it has itself been redone.
Anyway, he's going to try and implement this.
Andrey, please let me know if I misunderstood the idea.
Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.
Best regards, Andrey Borodin.
Attachments:
v6-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch
From 92f21be3b4917e9dcd646d524bd70ebb074684bd Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 18 Jul 2025 14:58:05 +0500
Subject: [PATCH v6 1/2] Test that reproduces multixat deadlock with recovery
---
contrib/amcheck/t/006_MultiXact_standby.pl | 121 +++++++++++++++++++++
src/backend/access/transam/multixact.c | 3 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 48 ++++++++
3 files changed, 172 insertions(+)
create mode 100644 contrib/amcheck/t/006_MultiXact_standby.pl
diff --git a/contrib/amcheck/t/006_MultiXact_standby.pl b/contrib/amcheck/t/006_MultiXact_standby.pl
new file mode 100644
index 00000000000..2cdd0d93be3
--- /dev/null
+++ b/contrib/amcheck/t/006_MultiXact_standby.pl
@@ -0,0 +1,121 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Minimal test testing multixacts with streaming replication
+use strict;
+use warnings;
+use Config;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 4;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+# A specific role is created for replication purposes
+$node_primary->init(
+ allows_streaming => 1,
+ auth_extra => [ '--create-role', 'repl_role' ]);
+$node_primary->append_conf('postgresql.conf', 'lock_timeout = 180000');
+$node_primary->append_conf('postgresql.conf', 'max_connections = 500');
+$node_primary->start;
+my $backup_name = 'my_backup';
+
+# Take backup
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby_1->start;
+
+# Create some content on primary and check its presence in standby nodes
+$node_primary->safe_psql('postgres', q(create table tbl2 (
+ id int primary key,
+ val int
+);
+insert into tbl2 select i, 0 from generate_series(1,100000) i;
+));
+
+# Wait for standbys to catch up
+my $primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+#
+# Stress CIC with pgbench
+#
+
+# Run background pgbench with bt_index_check on standby
+my $pgbench_out = '';
+my $pgbench_timer = IPC::Run::timeout(180);
+my $pgbench_h = $node_standby_1->background_pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 10 -j 2 -T 10',
+ {
+ '006_pgbench_standby_check_1' => q(
+ begin;
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ commit;
+ )
+ },
+ \$pgbench_out,
+ $pgbench_timer);
+
+# Run pgbench with data data manipulations and REINDEX on primary.
+# pgbench might try to launch more than one instance of the RIC
+# transaction concurrently. That would deadlock, so use an advisory
+# lock to ensure only one CIC runs at a time.
+$node_primary->pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 10 -j 2 -T 10',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent updates',
+ {
+ '004_pgbench_updates' => q(
+ \set id random(1, 10000)
+ begin;
+ select * from tbl2 where id = :id for no key update;
+ \sleep 10 ms
+ savepoint s1;
+ update tbl2 set val = val+1 where id = :id;
+ \sleep 10 ms
+ commit;
+ )
+ });
+
+$pgbench_h->pump_nb;
+$pgbench_h->finish();
+my $result =
+ ($Config{osname} eq "MSWin32")
+ ? ($pgbench_h->full_results)[0]
+ : $pgbench_h->result(0);
+is($result, 0, "pgbench with bt_index_check() on standby works");
+
+
+# Check that no deadlock occured
+$primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+# done
+$node_primary->stop;
+$node_standby_1->stop;
+done_testing();
\ No newline at end of file
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3a2d7055c42..6375d0b4762 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -837,6 +837,9 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
XLogRegisterData((char *) (&xlrec), SizeOfMultiXactCreate);
XLogRegisterData((char *) members, nmembers * sizeof(MultiXactMember));
+ if (rand()%2 == 0)
+ pg_usleep(1000);
+
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
/* Now enter the information into the OFFSETs and MEMBERs logs */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b908f17adf6..aa1ca676672 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2210,6 +2210,54 @@ sub pgbench
=pod
+=item $node->background_pgbench($opts, $files, \$stdout, $timer) => harness
+
+Invoke B<pgbench> and return an IPC::Run harness object. The process's stdin
+is empty, and its stdout and stderr go to the $stdout scalar reference. This
+allows the caller to act on other parts of the system while B<pgbench> is
+running. Errors from B<pgbench> are the caller's problem.
+
+The specified timer object is attached to the harness, as well. It's caller's
+responsibility to select the timeout length, and to restart the timer after
+each command if the timeout is per-command.
+
+Be sure to "finish" the harness when done with it.
+
+=over
+
+=item $opts
+
+Options as a string to be split on spaces.
+
+=item $files
+
+Reference to filename/contents dictionary.
+
+=back
+
+=cut
+
+sub background_pgbench
+{
+ my ($self, $opts, $files, $stdout, $timer) = @_;
+
+ my @cmd =
+ ('pgbench', split(/\s+/, $opts), $self->_pgbench_make_files($files));
+
+ local %ENV = $self->_get_env();
+
+ my $stdin = "";
+ # IPC::Run would otherwise append to existing contents:
+ $$stdout = "" if ref($stdout);
+
+ my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+ $timer;
+
+ return $harness;
+}
+
+=pod
+
=item $node->connect_ok($connstr, $test_name, %params)
Attempt a connection with a custom connection string. This is expected
--
2.39.5 (Apple Git-154)
v6-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream)
From d5057fc9b6a8c04204bf6ba468cacbafdea5ff9f Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 18 Jul 2025 18:48:54 +0500
Subject: [PATCH v6 2/2] Fill next multitransaction in REDO to avoid corner
case 2
---
src/backend/access/transam/multixact.c | 47 +++++++++++++++++++++++---
1 file changed, 43 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6375d0b4762..6ea0b90c6fd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -341,7 +341,7 @@ static MemoryContext MXactContext = NULL;
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
- int nmembers, MultiXactMember *members);
+ int nmembers, MultiXactMember *members, bool redo);
static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
/* MultiXact cache management */
@@ -843,7 +843,7 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
/* Now enter the information into the OFFSETs and MEMBERs logs */
- RecordNewMultiXact(multi, offset, nmembers, members);
+ RecordNewMultiXact(multi, offset, nmembers, members, false);
/* Done with critical section */
END_CRIT_SECTION();
@@ -865,7 +865,7 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
*/
static void
RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
- int nmembers, MultiXactMember *members)
+ int nmembers, MultiXactMember *members, bool redo)
{
int pageno;
int prev_pageno;
@@ -892,6 +892,43 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
*offptr = offset;
+ if (redo) // TODO: AB: I do not want to call RecoveryInProgress() here
+ {
+ MultiXactId next = multi + 1; /* we do not care about wrapraound here */
+ int next_pageno = MultiXactIdToOffsetPage(next);
+ if (next_pageno == pageno)
+ {
+ offptr[1] = offset + nmembers;
+ }
+ else
+ {
+ int next_slotno;
+ MultiXactOffset *next_offptr;
+ int next_entryno = MultiXactIdToOffsetEntry(next);
+
+ if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, next_pageno))
+ {
+ int slotno;
+
+ /* Copypasted comment from MaybeExtendOffsetSlru */
+ /*
+ * Fortunately for us, SimpleLruWritePage is already prepared to deal
+ * with creating a new segment file even if the page we're writing is
+ * not the first in it, so this is enough.
+ */
+ next_slotno = ZeroMultiXactOffsetPage(next_pageno, false);
+ SimpleLruWritePage(MultiXactOffsetCtl, next_slotno);
+ }
+ else
+ {
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ }
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_slotno] = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[next_slotno] = true;
+ }
+ }
+
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
/* Exchange our lock */
@@ -1393,6 +1430,8 @@ retry:
/* Corner case 2: next multixact is still being filled in */
LWLockRelease(MultiXactOffsetSLRULock);
CHECK_FOR_INTERRUPTS();
+ /* CHECK_FOR_INTERRUPTS above would be critical for avoiding conflicts with recovery, yet caller might hold LWLock rendegin CHECK_FOR_INTERRUPTS disfunctional */
+ Assert(!RecoveryInProgress());
pg_usleep(1000L);
goto retry;
}
@@ -3286,7 +3325,7 @@ multixact_redo(XLogReaderState *record)
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
- xlrec->members);
+ xlrec->members, true);
/* Make sure nextMXact/nextOffset are beyond what this record has */
MultiXactAdvanceNextMXact(xlrec->mid + 1,
--
2.39.5 (Apple Git-154)
On 18 Jul 2025, at 18:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Please find attached a dirty test and a sketch of the fix. It is done against PG 16; I wanted to ensure that the problem is reproducible before 17.
Here's v7 with improved comments and a cross-check for correctness.
Also, MultiXact wraparound is handled.
I'm planning to prepare tests and fixes for all supported branches, if there are no objections to this approach.
Best regards, Andrey Borodin.
Attachments:
v7-0001-Test-that-reproduces-multixat-deadlock-with-recov.patch (application/octet-stream)
From 92f21be3b4917e9dcd646d524bd70ebb074684bd Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 18 Jul 2025 14:58:05 +0500
Subject: [PATCH v7 1/2] Test that reproduces multixat deadlock with recovery
---
contrib/amcheck/t/006_MultiXact_standby.pl | 121 +++++++++++++++++++++
src/backend/access/transam/multixact.c | 3 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 48 ++++++++
3 files changed, 172 insertions(+)
create mode 100644 contrib/amcheck/t/006_MultiXact_standby.pl
diff --git a/contrib/amcheck/t/006_MultiXact_standby.pl b/contrib/amcheck/t/006_MultiXact_standby.pl
new file mode 100644
index 00000000000..2cdd0d93be3
--- /dev/null
+++ b/contrib/amcheck/t/006_MultiXact_standby.pl
@@ -0,0 +1,121 @@
+
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Minimal test testing multixacts with streaming replication
+use strict;
+use warnings;
+use Config;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 4;
+
+# Initialize primary node
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+# A specific role is created for replication purposes
+$node_primary->init(
+ allows_streaming => 1,
+ auth_extra => [ '--create-role', 'repl_role' ]);
+$node_primary->append_conf('postgresql.conf', 'lock_timeout = 180000');
+$node_primary->append_conf('postgresql.conf', 'max_connections = 500');
+$node_primary->start;
+my $backup_name = 'my_backup';
+
+# Take backup
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
+$node_standby_1->init_from_backup($node_primary, $backup_name,
+ has_streaming => 1);
+$node_standby_1->start;
+
+# Create some content on primary and check its presence in standby nodes
+$node_primary->safe_psql('postgres', q(create table tbl2 (
+ id int primary key,
+ val int
+);
+insert into tbl2 select i, 0 from generate_series(1,100000) i;
+));
+
+# Wait for standbys to catch up
+my $primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+#
+# Stress CIC with pgbench
+#
+
+# Run background pgbench with bt_index_check on standby
+my $pgbench_out = '';
+my $pgbench_timer = IPC::Run::timeout(180);
+my $pgbench_h = $node_standby_1->background_pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 10 -j 2 -T 10',
+ {
+ '006_pgbench_standby_check_1' => q(
+ begin;
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ select sum(val) from tbl2;
+ \sleep 10 ms
+ commit;
+ )
+ },
+ \$pgbench_out,
+ $pgbench_timer);
+
+# Run pgbench with data data manipulations and REINDEX on primary.
+# pgbench might try to launch more than one instance of the RIC
+# transaction concurrently. That would deadlock, so use an advisory
+# lock to ensure only one CIC runs at a time.
+$node_primary->pgbench(
+ '--no-vacuum --report-per-command -M prepared -c 10 -j 2 -T 10',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent updates',
+ {
+ '004_pgbench_updates' => q(
+ \set id random(1, 10000)
+ begin;
+ select * from tbl2 where id = :id for no key update;
+ \sleep 10 ms
+ savepoint s1;
+ update tbl2 set val = val+1 where id = :id;
+ \sleep 10 ms
+ commit;
+ )
+ });
+
+$pgbench_h->pump_nb;
+$pgbench_h->finish();
+my $result =
+ ($Config{osname} eq "MSWin32")
+ ? ($pgbench_h->full_results)[0]
+ : $pgbench_h->result(0);
+is($result, 0, "pgbench with bt_index_check() on standby works");
+
+
+# Check that no deadlock occured
+$primary_lsn = $node_primary->lsn('write');
+$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
+
+# done
+$node_primary->stop;
+$node_standby_1->stop;
+done_testing();
\ No newline at end of file
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3a2d7055c42..6375d0b4762 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -837,6 +837,9 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
XLogRegisterData((char *) (&xlrec), SizeOfMultiXactCreate);
XLogRegisterData((char *) members, nmembers * sizeof(MultiXactMember));
+ if (rand()%2 == 0)
+ pg_usleep(1000);
+
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
/* Now enter the information into the OFFSETs and MEMBERs logs */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index b908f17adf6..aa1ca676672 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2210,6 +2210,54 @@ sub pgbench
=pod
+=item $node->background_pgbench($opts, $files, \$stdout, $timer) => harness
+
+Invoke B<pgbench> and return an IPC::Run harness object. The process's stdin
+is empty, and its stdout and stderr go to the $stdout scalar reference. This
+allows the caller to act on other parts of the system while B<pgbench> is
+running. Errors from B<pgbench> are the caller's problem.
+
+The specified timer object is attached to the harness, as well. It's caller's
+responsibility to select the timeout length, and to restart the timer after
+each command if the timeout is per-command.
+
+Be sure to "finish" the harness when done with it.
+
+=over
+
+=item $opts
+
+Options as a string to be split on spaces.
+
+=item $files
+
+Reference to filename/contents dictionary.
+
+=back
+
+=cut
+
+sub background_pgbench
+{
+ my ($self, $opts, $files, $stdout, $timer) = @_;
+
+ my @cmd =
+ ('pgbench', split(/\s+/, $opts), $self->_pgbench_make_files($files));
+
+ local %ENV = $self->_get_env();
+
+ my $stdin = "";
+ # IPC::Run would otherwise append to existing contents:
+ $$stdout = "" if ref($stdout);
+
+ my $harness = IPC::Run::start \@cmd, '<', \$stdin, '>', $stdout, '2>&1',
+ $timer;
+
+ return $harness;
+}
+
+=pod
+
=item $node->connect_ok($connstr, $test_name, %params)
Attempt a connection with a custom connection string. This is expected
--
2.39.5 (Apple Git-154)
v7-0002-Fill-next-multitransaction-in-REDO-to-avoid-corne.patch (application/octet-stream)
From 34a19eb83c57ee5e2ab50ee7ab6dce6f754fefef Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 18 Jul 2025 18:48:54 +0500
Subject: [PATCH v7 2/2] Fill next multitransaction in REDO to avoid corner
case 2
---
src/backend/access/transam/multixact.c | 68 ++++++++++++++++++++++++--
1 file changed, 64 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 6375d0b4762..ec15838c561 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -341,7 +341,7 @@ static MemoryContext MXactContext = NULL;
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
- int nmembers, MultiXactMember *members);
+ int nmembers, MultiXactMember *members, bool redo);
static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
/* MultiXact cache management */
@@ -843,7 +843,7 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
/* Now enter the information into the OFFSETs and MEMBERs logs */
- RecordNewMultiXact(multi, offset, nmembers, members);
+ RecordNewMultiXact(multi, offset, nmembers, members, false);
/* Done with critical section */
END_CRIT_SECTION();
@@ -865,7 +865,7 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
*/
static void
RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
- int nmembers, MultiXactMember *members)
+ int nmembers, MultiXactMember *members, bool redo)
{
int pageno;
int prev_pageno;
@@ -890,8 +890,62 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
+ if (redo)
+ {
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
+ }
+
*offptr = offset;
+ if (redo)
+ {
+ /*
+ * We want to avoid edge case 2 in redo, because we cannot wait for
+ * startup process in GetMultiXactIdMembers() without risk of a deadlock.
+ */
+ MultiXactId next = multi + 1;
+ int next_pageno;
+ /* Handle wraparound as GetMultiXactIdMembers() does it. */
+ if (multi < FirstMultiXactId)
+ multi = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ if (next_pageno == pageno)
+ {
+ offptr[1] = offset + nmembers;
+ }
+ else
+ {
+ int next_slotno;
+ MultiXactOffset *next_offptr;
+ int next_entryno = MultiXactIdToOffsetEntry(next);
+
+ if (SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, next_pageno))
+ {
+ /* Just read a next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ }
+ else
+ {
+ /*
+ * We have to create a new page.
+ * SimpleLruWritePage is already prepared to deal
+ * with creating a new segment file. We do not need to handle
+ * race conditions, because this code is only executed in redo
+ * and we hold MultiXactOffsetSLRULock.
+ */
+ next_slotno = ZeroMultiXactOffsetPage(next_pageno, false);
+ SimpleLruWritePage(MultiXactOffsetCtl, next_slotno);
+ }
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_slotno] = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[next_slotno] = true;
+ }
+ }
+
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
/* Exchange our lock */
@@ -1393,6 +1447,12 @@ retry:
/* Corner case 2: next multixact is still being filled in */
LWLockRelease(MultiXactOffsetSLRULock);
CHECK_FOR_INTERRUPTS();
+ /*
+ * CHECK_FOR_INTERRUPTS above would be critical for avoiding
+ * conflicts with recovery, yet caller might hold LWLock rendering
+ * CHECK_FOR_INTERRUPTS disfunctional
+ */
+ Assert(!RecoveryInProgress());
pg_usleep(1000L);
goto retry;
}
@@ -3286,7 +3346,7 @@ multixact_redo(XLogReaderState *record)
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
- xlrec->members);
+ xlrec->members, true);
/* Make sure nextMXact/nextOffset are beyond what this record has */
MultiXactAdvanceNextMXact(xlrec->mid + 1,
--
2.39.5 (Apple Git-154)
On 21 Jul 2025, at 19:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I'm planning to prepare tests and fixes for all supported branches
This is a status update. I've reproduced the problem on REL_13_STABLE and verified that the proposed fix works there.
Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple that uses this multi will become unreadable. Any attempt to read it will hang forever.
I've reproduced the problem and now I'm working on scripting this scenario. Basically, I modify the code to hang forever after assigning multi number 2. Then I execute in the first psql session:
create table x as select i,0 v from generate_series(1,10) i;
create unique index on x(i);
\set id 1
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 1
commit;
\set id 2
begin;
select * from x where i = :id for no key update;
savepoint s1;
update x set v = v+1 where i = :id; -- multi 2 -- will hang
commit;
Then in a second psql session:
create table y as select i,0 v from generate_series(1,10) i;
create unique index on y(i);
\set id 1
begin;
select * from y where i = :id for no key update;
savepoint s1;
update y set v = v+1 where i = :id;
commit;
After this I pkill -9 postgres. The recovered installation cannot execute select * from x, because multi 1 cannot be read without the offset of multi 2, which was never logged.
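For concreteness (the numbers are made up): multi 1 is WAL-logged with offset 1 and three members, the hung backend has already consumed multi 2 but never logs it, and the session working on table y then creates and logs multi 3. After recovery nextMXact is 4, so reading multi 1 is not corner case 1; its length must be computed as offset(multi 2) - offset(multi 1), but the offsets slot for multi 2 still holds 0, so GetMultiXactIdMembers() sleeps in corner case 2 with nobody left to ever fill it in.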
Luckily the fix is the same: just restore the offset of multi 2 when multi 1 is recovered.
Best regards, Andrey Borodin.
On 2025-Jul-25, Andrey Borodin wrote:
Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.
It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script). I agree it's a pretty ugly bug! I think it's
essentially the same bug as the other problem, so the proposed fix
should solve both.
Thanks for working on this!
Looking at this,
/*
* We want to avoid edge case 2 in redo, because we cannot wait for
* startup process in GetMultiXactIdMembers() without risk of a
* deadlock.
*/
MultiXactId next = multi + 1;
int next_pageno;
/* Handle wraparound as GetMultiXactIdMembers() does it. */
if (multi < FirstMultiXactId)
multi = FirstMultiXactId;
Don't you mean to test and change the value 'next' rather than 'multi'
here?
In this bit,
* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.
I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind. Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why isn't
WAL-logging necessary ... until I remember that we're in a standby. I
think a simple comment here like "no WAL-logging because we're a
standby" should suffice.)
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 26 Jul 2025, at 22:44, Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Jul-25, Andrey Borodin wrote:
Also I've discovered one more serious problem.
If a backend crashes just before WAL-logging multi, any heap tuple
that uses this multi will become unreadable. Any attempt to read it
will hang forever.

I've reproduced the problem and now I'm working on scripting this
scenario. Basically, I modify code to hang forever after assigning
multi number 2.

It took me a minute to understand this, and I think your description is
slightly incorrect: you mean that the heap tuple that uses the PREVIOUS
multixact cannot be read (at least, that's what I understand from your
reproducer script).
Yes, I explained it a bit incorrectly, but you understood the problem correctly.
Looking at this,
/*
* We want to avoid edge case 2 in redo, because we cannot wait for
* startup process in GetMultiXactIdMembers() without risk of a
* deadlock.
*/
MultiXactId next = multi + 1;
int next_pageno;

/* Handle wraparound as GetMultiXactIdMembers() does it. */
if (multi < FirstMultiXactId)
multi = FirstMultiXactId;

Don't you mean to test and change the value 'next' rather than 'multi'
here?
Yup, that was a typo.
In this bit,
* We do not need to handle race conditions, because this code
* is only executed in redo and we hold
* MultiXactOffsetSLRULock.

I think it'd be good to have an
Assert(LWLockHeldByMeInMode(MultiXactOffsetSLRULock, LW_EXCLUSIVE));
just for peace of mind.
Ugh, that uncovered a problem on 17+: there we hold a couple of locks simultaneously. I'll post a version with this a bit later.
Also, commit c61678551699 removed
ZeroMultiXactOffsetPage(), but since you have 'false' as the second
argument, then SimpleLruZeroPage() is enough. (I wondered why isn't
WAL-logging necessary ... until I remember that we're in a standby. I
think a simple comment here like "no WAL-logging because we're a
standby" should suffice.)
Agree.
I've made a test [0] and discovered another problem. Adding this checkpoint breaks the test [1] even after a fix [2].
I suspect that excluding "edge case 2" on the standby is simply not enough... we have to do this "next offset" dance on the primary too. I'll think more about other options.
Best regards, Andrey Borodin.
[0]: https://github.com/x4m/postgres_g/commit/eafcaec7aafde064b0da5d2ba4041ed2fb134f07
[1]: https://github.com/x4m/postgres_g/commit/da762c7cac56eff1988ea9126171ca0a6d2665e9
[2]: https://github.com/x4m/postgres_g/commit/d64c17d697d082856e5fe8bd52abafc0585973af
A timeline of these commits can be seen here: https://github.com/x4m/postgres_g/commits/mx19/
On 27 Jul 2025, at 16:53, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
we have to do this "next offset" dance on Primary too.
PFA draft of this.
I also attach a version for PG17; maybe Dmitry could try to reproduce the problem with it. I think the patch should fix the problem.
Thanks!
Best regards, Andrey Borodin.
Attachments:
v8-PG17-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
From c4986fd4978206c870ed04846a5b37e71e9ff668 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v8-PG17] Avoid edge case 2 in multixacts
---
src/backend/access/transam/multixact.c | 94 +++++++++++++++++---------
1 file changed, 62 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..7eaefd55903 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -274,12 +274,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -918,10 +912,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int i;
LWLock *lock;
LWLock *prevlock = NULL;
+ MultiXactId next = multi + 1;
+ int next_pageno;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * We must also fill next offset to keep current multi readable
+ * Handle wraparound as GetMultiXactIdMembers() does it.
+ */
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -936,19 +940,56 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
+ *offptr = offset;
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (next_pageno == pageno)
+ {
+ offptr[1] = offset + nmembers;
+ }
+ else
+ {
+ int next_slotno;
+ MultiXactOffset *next_offptr;
+ int next_entryno = MultiXactIdToOffsetEntry(next);
+ Assert(next_entryno == 0); /* This is an overflow-only branch */
+
+ /* Swap the lock for a lock of next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ if (SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, next_pageno))
+ {
+ /* Just read a next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ }
+ else
+ {
+ /*
+ * We have to create a new page.
+ * SimpleLruWritePage is already prepared to deal
+ * with creating a new segment file. We do not need to handle
+ * race conditions, because this code is only executed in redo
+ * and we hold appropriate lock of MultiXactOffsetCtl.
+ */
+ next_slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+ SimpleLruWritePage(MultiXactOffsetCtl, next_slotno);
+ }
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_entryno] = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[next_slotno] = true;
+ }
+
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
for (i = 0; i < nmembers; i++, offset++)
@@ -1307,7 +1348,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1386,7 +1426,10 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at. In this case the nextOffset value we just
* saved is the correct endpoint.
+ * TODO: how does it work on Standby? MultiXactState->nextMXact does not seem to be up-to date.
+ * nextMXact and nextOffset are in sync, so nothing bad can happen, but nextMXact seems mostly random.
*
+ * THIS IS NOT POSSIBLE ANYMORE, KEEP IT FOR HISTORIC REASONS.
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
@@ -1412,7 +1455,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1476,14 +1518,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1492,12 +1530,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1987,7 +2019,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2198,8 +2229,7 @@ TrimMultiXact(void)
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
if (entryno != 0)
--
2.39.5 (Apple Git-154)
v8-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
From 48d916e391d5aec6b03614519988dababcc1fbf8 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v8] Avoid edge case 2 in multixacts
---
src/backend/access/transam/multixact.c | 96 +++++++++++-------
src/test/modules/test_slru/t/001_multixact.pl | 98 +++++--------------
src/test/modules/test_slru/test_multixact.c | 1 -
src/test/perl/PostgreSQL/Test/Cluster.pm | 40 ++++++++
src/test/recovery/t/017_shm.pl | 38 +------
5 files changed, 131 insertions(+), 142 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3cb09c3d598..e0d68e9f0eb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -275,12 +275,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -921,10 +915,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int i;
LWLock *lock;
LWLock *prevlock = NULL;
+ MultiXactId next = multi + 1;
+ int next_pageno;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * We must also fill next offset to keep current multi readable
+ * Handle wraparound as GetMultiXactIdMembers() does it.
+ */
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -939,19 +943,56 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
+ *offptr = offset;
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (next_pageno == pageno)
+ {
+ offptr[1] = offset + nmembers;
+ }
+ else
+ {
+ int next_slotno;
+ MultiXactOffset *next_offptr;
+ int next_entryno = MultiXactIdToOffsetEntry(next);
+ Assert(next_entryno == 0); /* This is an overflow-only branch */
+
+ /* Swap the lock for a lock of next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ if (SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, next_pageno))
+ {
+ /* Just read a next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ }
+ else
+ {
+ /*
+ * We have to create a new page.
+ * SimpleLruWritePage is already prepared to deal
+ * with creating a new segment file. We do not need to handle
+ * race conditions, because this code is only executed in redo
+ * and we hold appropriate lock of MultiXactOffsetCtl.
+ */
+ next_slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+ SimpleLruWritePage(MultiXactOffsetCtl, next_slotno);
+ }
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_entryno] = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[next_slotno] = true;
+ }
+
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
for (i = 0; i < nmembers; i++, offset++)
@@ -1310,7 +1351,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1389,7 +1429,10 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at. In this case the nextOffset value we just
* saved is the correct endpoint.
+ * TODO: how does it work on Standby? MultiXactState->nextMXact does not seem to be up-to date.
+ * nextMXact and nextOffset are in sync, so nothing bad can happen, but nextMXact seems mostly random.
*
+ * THIS IS NOT POSSIBLE ANYMORE, KEEP IT FOR HISTORIC REASONS.
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
@@ -1415,7 +1458,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1479,16 +1521,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1497,12 +1533,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1992,7 +2022,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2142,8 +2171,7 @@ TrimMultiXact(void)
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
if (entryno != 0)
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..814ab00353e 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -29,92 +29,46 @@ $node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# Another multixact test: loosing some multixact must not affect reading near
+# multixacts, even after a crash.
+my $bg_psql = $node->background_psql('postgres');
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+my $multi = $bg_psql->query_safe(
+ q(SELECT test_create_multixact();));
+
+# The space for next multi will be allocated, but it will never be actually
+# recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/deploying lost multi/, q(
+\echo deploying lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
+ q{checkpoint;});
+
+# One more multitransaction to effectivelt emit WAL record about next
+# multitransaction (to avaoid corener case 1).
$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+ q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->poll_start;
+$bg_psql->{run}->finish;
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# Verify thet recorded multi is readble, this call must not hang.
+# Also note that all injection points disappeared after server restart.
+$node->safe_psql('postgres',
+ qq{SELECT test_read_multixact('$multi'::xid);});
$node->stop;
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..2fd67273ee5 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -46,7 +46,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 61f68e0cc2e..a162baa55c6 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1190,6 +1190,46 @@ sub start
=pod
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster. Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+ my ($self) = @_;
+
+ my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+ my $attempts = 0;
+
+ while ($attempts < $max_attempts)
+ {
+ $self->start(fail_ok => 1) && return 1;
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+
+ # Clean up in case the start attempt just timed out or some such.
+ $self->stop('fast', fail_ok => 1);
+
+ $attempts++;
+ }
+
+ # Try one last time without fail_ok, which will BAIL_OUT unless it
+ # succeeds.
+ $self->start && return 1;
+ return 0;
+}
+
+=pod
+
=item $node->kill9()
Send SIGKILL (signal 9) to the postmaster.
diff --git a/src/test/recovery/t/017_shm.pl b/src/test/recovery/t/017_shm.pl
index c73aa3f0c2c..ac238544217 100644
--- a/src/test/recovery/t/017_shm.pl
+++ b/src/test/recovery/t/017_shm.pl
@@ -67,7 +67,7 @@ log_ipcs();
# Upon postmaster death, postmaster children exit automatically.
$gnat->kill9;
log_ipcs();
-poll_start($gnat); # gnat recycles its former shm key.
+$gnat->poll_start; # gnat recycles its former shm key.
log_ipcs();
note "removing the conflicting shmem ...";
@@ -82,7 +82,7 @@ log_ipcs();
# the higher-keyed segment that the previous postmaster was using.
# That's not great, but key collisions should be rare enough to not
# make this a big problem.
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
$gnat->stop;
log_ipcs();
@@ -174,43 +174,11 @@ $slow_client->finish; # client has detected backend termination
log_ipcs();
# now startup should work
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
# finish testing
$gnat->stop;
log_ipcs();
-
-# We may need retries to start a new postmaster. Causes:
-# - kernel is slow to deliver SIGKILL
-# - postmaster parent is slow to waitpid()
-# - postmaster child is slow to exit in response to SIGQUIT
-# - postmaster child is slow to exit after postmaster death
-sub poll_start
-{
- my ($node) = @_;
-
- my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
- my $attempts = 0;
-
- while ($attempts < $max_attempts)
- {
- $node->start(fail_ok => 1) && return 1;
-
- # Wait 0.1 second before retrying.
- usleep(100_000);
-
- # Clean up in case the start attempt just timed out or some such.
- $node->stop('fast', fail_ok => 1);
-
- $attempts++;
- }
-
- # Try one last time without fail_ok, which will BAIL_OUT unless it
- # succeeds.
- $node->start && return 1;
- return 0;
-}
-
done_testing();
--
2.39.5 (Apple Git-154)
On 28.07.2025 15:49, Andrey Borodin wrote:
I also attach a version for PG17, maybe Dmitry could try to reproduce
the problem with this patch.
Andrey, thank you very much for your work, and also thanks to Álvaro for
joining the discussion on the problem.
I ran tests on PG17 with patch v8, there are no more sessions hanging on
the replica, great!
Replica requests are canceled with recovery conflicts.
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
STATEMENT: select sum(val) from tbl2;
or
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
STATEMENT: select sum(val) from tbl2;
But on the master, some of the requests then fail with an error,
apparently invalid multixacts remain in the pages.
ERROR: MultiXact 81926 has invalid next offset
STATEMENT: select * from tbl2 where id = $1 for no key update;
ERROR: MultiXact 81941 has invalid next offset
CONTEXT: while scanning block 3 offset 244 of relation "public.tbl2"
automatic vacuum of table "postgres.public.tbl2"
Best regards,
Dmitry.
On 17.07.2025 21:34, Andrey Borodin wrote:
On 30 Jun 2025, at 15:58, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
page_collect_tuples() holds a lock on the buffer while examining tuples visibility, having InterruptHoldoffCount > 0. Tuple visibility check might need WAL to go on, we have to wait until some next MX be filled in.
Which might need a buffer lock or have a snapshot conflict with caller of page_collect_tuples().

Thinking more about the problem I see 3 ways to deal with this deadlock:
2. Teach page_collect_tuples() to do HeapTupleSatisfiesVisibility() without holding buffer lock.

Personally, I see point 2 as very invasive in a code that I'm not too familiar with.
If there were no SetHintBits inside of HeapTupleSatisfies*, then it could
be just "copy line pointers and tuple headers under lock, release lock,
check tuple visibility using the copied arrays".
But hint bits make it much more difficult.
Probably, tuple headers could be copied twice and compared afterwards. If
there is a change in hint bits, the page should be relocked.
And the call to MarkBufferDirtyHint should be delayed.
A very dirty variant is attached. I made it just for fun. It passes
'regress', 'isolation' and 'recovery', but I didn't benchmark it.
--
regards
Yura Sokolov aka funny-falcon
Attachments:
page_collect_tuples.diff (text/x-patch; charset=UTF-8)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..e0796dd06fe 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -500,7 +500,8 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
pg_attribute_always_inline
static int
page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
- Page page, Buffer buffer,
+ ItemId items, HeapTupleHeader headers,
+ Buffer buffer,
BlockNumber block, int lines,
bool all_visible, bool check_serializable)
{
@@ -509,14 +510,14 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
{
- ItemId lpp = PageGetItemId(page, lineoff);
+ ItemId lpp = &items[lineoff - 1];
HeapTupleData loctup;
bool valid;
if (!ItemIdIsNormal(lpp))
continue;
- loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ loctup.t_data = &headers[lineoff - 1];
loctup.t_len = ItemIdGetLength(lpp);
loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
ItemPointerSet(&(loctup.t_self), block, lineoff);
@@ -542,6 +543,8 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
return ntup;
}
+extern bool SetHintBitsSkipMark;
+
/*
* heap_prepare_pagescan - Prepare current scan page to be scanned in pagemode
*
@@ -557,8 +560,13 @@ heap_prepare_pagescan(TableScanDesc sscan)
Snapshot snapshot;
Page page;
int lines;
+ OffsetNumber lineoff;
bool all_visible;
bool check_serializable;
+ bool set_hints;
+ ItemIdData line_data[MaxHeapTuplesPerPage];
+ HeapTupleHeaderData tup_headers[MaxHeapTuplesPerPage] = {0};
+ uint16 infomasks[MaxHeapTuplesPerPage] = {0};
Assert(BufferGetBlockNumber(buffer) == block);
@@ -580,6 +588,13 @@ heap_prepare_pagescan(TableScanDesc sscan)
page = BufferGetPage(buffer);
lines = PageGetMaxOffsetNumber(page);
+ for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+ {
+ line_data[lineoff - 1] = *PageGetItemId(page, lineoff);
+ if (!ItemIdIsNormal(&line_data[lineoff - 1]))
+ continue;
+ tup_headers[lineoff - 1] = *(HeapTupleHeader) PageGetItem(page, &line_data[lineoff - 1]);
+ }
/*
* If the all-visible flag indicates that all tuples on the page are
@@ -605,6 +620,13 @@ heap_prepare_pagescan(TableScanDesc sscan)
check_serializable =
CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+ for (lineoff = 0; lineoff < lines; lineoff++)
+ infomasks[lineoff] = tup_headers[lineoff].t_infomask;
+
+ SetHintBitsSkipMark = true;
+
/*
* We call page_collect_tuples() with constant arguments, to get the
* compiler to constant fold the constant arguments. Separate calls with
@@ -614,23 +636,45 @@ heap_prepare_pagescan(TableScanDesc sscan)
if (likely(all_visible))
{
if (likely(!check_serializable))
- scan->rs_ntuples = page_collect_tuples(scan, snapshot, page, buffer,
+ scan->rs_ntuples = page_collect_tuples(scan, snapshot, line_data, tup_headers, buffer,
block, lines, true, false);
else
- scan->rs_ntuples = page_collect_tuples(scan, snapshot, page, buffer,
+ scan->rs_ntuples = page_collect_tuples(scan, snapshot, line_data, tup_headers, buffer,
block, lines, true, true);
}
else
{
if (likely(!check_serializable))
- scan->rs_ntuples = page_collect_tuples(scan, snapshot, page, buffer,
+ scan->rs_ntuples = page_collect_tuples(scan, snapshot, line_data, tup_headers, buffer,
block, lines, false, false);
else
- scan->rs_ntuples = page_collect_tuples(scan, snapshot, page, buffer,
+ scan->rs_ntuples = page_collect_tuples(scan, snapshot, line_data, tup_headers, buffer,
block, lines, false, true);
}
+ SetHintBitsSkipMark = false;
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ set_hints = false;
+ for (lineoff = 0; lineoff < lines; lineoff++)
+ {
+ infomasks[lineoff] ^= tup_headers[lineoff].t_infomask;
+ if (infomasks[lineoff])
+ set_hints = true;
+ }
+
+ if (set_hints)
+ {
+ LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+ for (lineoff = 0; lineoff < lines; lineoff++)
+ {
+ if (infomasks[lineoff] == 0)
+ continue;
+ ((HeapTupleHeader) PageGetItem(page, &line_data[lineoff]))->t_infomask |= infomasks[lineoff];
+ }
+ MarkBufferDirtyHint(buffer, true);
+
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ }
}
/*
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..34a251056a7 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -78,6 +78,7 @@
#include "utils/builtins.h"
#include "utils/snapmgr.h"
+bool SetHintBitsSkipMark = false;
/*
* SetHintBits()
@@ -128,7 +129,8 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
}
tuple->t_infomask |= infomask;
- MarkBufferDirtyHint(buffer, true);
+ if (!SetHintBitsSkipMark)
+ MarkBufferDirtyHint(buffer, true);
}
/*
On 29 Jul 2025, at 12:17, Dmitry <dsy.075@yandex.ru> wrote:
But on the master, some of the requests then fail with an error; apparently invalid multixacts remain in the pages.
Thanks!
That's a bug in my patch. I do not understand it yet. I've reproduced it with your original workload.
Most of the errors I see are shallow (offset == 0 or nextOffset == 0), but this one is interesting:
TRAP: failed Assert("shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS"), File: "slru.c", Line: 729, PID: 91085
0 postgres 0x00000001032ea5ac ExceptionalCondition + 216
1 postgres 0x0000000102af2784 SlruInternalWritePage + 700
2 postgres 0x0000000102af14dc SimpleLruWritePage + 96
3 postgres 0x0000000102ae89d4 RecordNewMultiXact + 576
So it makes me think that it's some sort of I/O concurrency issue.
As expected, the error only persists if the "extend SLRU" branch is active in RecordNewMultiXact().
Thanks for testing!
Best regards, Andrey Borodin.
On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I do not understand it yet.
OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with real extension by ExtendMultiXactOffset().
So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loop for wraparound...
Here are two updated patches, one for Postgres 17 and one for master (with a test).
Best regards, Andrey Borodin.
Attachments:
v9-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
From f295b9b0cd6f5597b88b5c302b2ab434262f5e0e Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v9] Avoid edge case 2 in multixacts
---
src/backend/access/transam/multixact.c | 85 +++++++++------
src/test/modules/test_slru/t/001_multixact.pl | 103 ++++++------------
src/test/modules/test_slru/test_multixact.c | 1 -
src/test/perl/PostgreSQL/Test/Cluster.pm | 40 +++++++
src/test/recovery/t/017_shm.pl | 38 +------
5 files changed, 123 insertions(+), 144 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3cb09c3d598..8e530558c2f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -275,12 +275,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -921,10 +915,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int i;
LWLock *lock;
LWLock *prevlock = NULL;
+ MultiXactId next = multi + 1;
+ int next_pageno;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * We must also fill next offset to keep current multi readable
+ * Handle wraparound as GetMultiXactIdMembers() does it.
+ */
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -939,19 +943,41 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
+ *offptr = offset;
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (next_pageno == pageno)
+ {
+ offptr[1] = offset + nmembers;
+ }
+ else
+ {
+ int next_slotno;
+ MultiXactOffset *next_offptr;
+ int next_entryno = MultiXactIdToOffsetEntry(next);
+ Assert(next_entryno == 0); /* This is an overflow-only branch */
+
+ /* Swap the lock for a lock of next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /* Read and adjust next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_entryno] = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[next_slotno] = true;
+ }
+
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
for (i = 0; i < nmembers; i++, offset++)
@@ -1144,8 +1170,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /* Make sure there is room for the MXID and next offset in the file. */
+ ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1310,7 +1336,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1389,7 +1414,10 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at. In this case the nextOffset value we just
* saved is the correct endpoint.
+ * TODO: how does it work on Standby? MultiXactState->nextMXact does not seem to be up-to date.
+ * nextMXact and nextOffset are in sync, so nothing bad can happen, but nextMXact seems mostly random.
*
+ * THIS IS NOT POSSIBLE ANYMORE, KEEP IT FOR HISTORIC REASONS.
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
@@ -1415,7 +1443,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1479,16 +1506,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1497,12 +1518,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1992,7 +2007,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2142,8 +2156,7 @@ TrimMultiXact(void)
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
if (entryno != 0)
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..752a72dcc23 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -18,6 +18,10 @@ if ($ENV{enable_injection_points} ne 'yes')
{
plan skip_all => 'Injection points not supported by this build';
}
+if ($windows_os)
+{
+ plan skip_all => 'Kill9 works unpredicatably on Windows';
+}
my ($node, $result);
@@ -29,92 +33,47 @@ $node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# Another multixact test: loosing some multixact must not affect reading near
+# multixacts, even after a crash.
+my $bg_psql = $node->background_psql('postgres');
+
+my $multi = $bg_psql->query_safe(
+ q(SELECT test_create_multixact();));
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# The space for next multi will be allocated, but it will never be actually
+# recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/deploying lost multi/, q(
+\echo deploying lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
+ q{checkpoint;});
+
+# One more multitransaction to effectivelt emit WAL record about next
+# multitransaction (to avaoid corener case 1).
$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+ q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->stop('immediate', fail_ok => 1);
+$node->poll_start;
+$bg_psql->{run}->finish;
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# Verify thet recorded multi is readble, this call must not hang.
+# Also note that all injection points disappeared after server restart.
+$node->safe_psql('postgres',
+ qq{SELECT test_read_multixact('$multi'::xid);});
$node->stop;
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..2fd67273ee5 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -46,7 +46,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 61f68e0cc2e..a162baa55c6 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1190,6 +1190,46 @@ sub start
=pod
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster. Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+ my ($self) = @_;
+
+ my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+ my $attempts = 0;
+
+ while ($attempts < $max_attempts)
+ {
+ $self->start(fail_ok => 1) && return 1;
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+
+ # Clean up in case the start attempt just timed out or some such.
+ $self->stop('fast', fail_ok => 1);
+
+ $attempts++;
+ }
+
+ # Try one last time without fail_ok, which will BAIL_OUT unless it
+ # succeeds.
+ $self->start && return 1;
+ return 0;
+}
+
+=pod
+
=item $node->kill9()
Send SIGKILL (signal 9) to the postmaster.
diff --git a/src/test/recovery/t/017_shm.pl b/src/test/recovery/t/017_shm.pl
index c73aa3f0c2c..ac238544217 100644
--- a/src/test/recovery/t/017_shm.pl
+++ b/src/test/recovery/t/017_shm.pl
@@ -67,7 +67,7 @@ log_ipcs();
# Upon postmaster death, postmaster children exit automatically.
$gnat->kill9;
log_ipcs();
-poll_start($gnat); # gnat recycles its former shm key.
+$gnat->poll_start; # gnat recycles its former shm key.
log_ipcs();
note "removing the conflicting shmem ...";
@@ -82,7 +82,7 @@ log_ipcs();
# the higher-keyed segment that the previous postmaster was using.
# That's not great, but key collisions should be rare enough to not
# make this a big problem.
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
$gnat->stop;
log_ipcs();
@@ -174,43 +174,11 @@ $slow_client->finish; # client has detected backend termination
log_ipcs();
# now startup should work
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
# finish testing
$gnat->stop;
log_ipcs();
-
-# We may need retries to start a new postmaster. Causes:
-# - kernel is slow to deliver SIGKILL
-# - postmaster parent is slow to waitpid()
-# - postmaster child is slow to exit in response to SIGQUIT
-# - postmaster child is slow to exit after postmaster death
-sub poll_start
-{
- my ($node) = @_;
-
- my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
- my $attempts = 0;
-
- while ($attempts < $max_attempts)
- {
- $node->start(fail_ok => 1) && return 1;
-
- # Wait 0.1 second before retrying.
- usleep(100_000);
-
- # Clean up in case the start attempt just timed out or some such.
- $node->stop('fast', fail_ok => 1);
-
- $attempts++;
- }
-
- # Try one last time without fail_ok, which will BAIL_OUT unless it
- # succeeds.
- $node->start && return 1;
- return 0;
-}
-
done_testing();
--
2.39.5 (Apple Git-154)
v9-PG17-0001-Avoid-edge-case-2-in-multixacts.patch (application/octet-stream)
From 49db2a1f56525ce8ab4f9ceb19c32060d8811299 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v9-PG17] Avoid edge case 2 in multixacts
---
src/backend/access/transam/multixact.c | 83 +++++++++++++++-----------
1 file changed, 49 insertions(+), 34 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..8430d83d66c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -274,12 +274,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -918,10 +912,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int i;
LWLock *lock;
LWLock *prevlock = NULL;
+ MultiXactId next = multi + 1;
+ int next_pageno;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * We must also fill next offset to keep current multi readable
+ * Handle wraparound as GetMultiXactIdMembers() does it.
+ */
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -936,19 +940,41 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
+ *offptr = offset;
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (next_pageno == pageno)
+ {
+ offptr[1] = offset + nmembers;
+ }
+ else
+ {
+ int next_slotno;
+ MultiXactOffset *next_offptr;
+ int next_entryno = MultiXactIdToOffsetEntry(next);
+ Assert(next_entryno == 0); /* This is an overflow-only branch */
+
+ /* Swap the lock for a lock of next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /* Read and adjust next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_entryno] = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[next_slotno] = true;
+ }
+
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
for (i = 0; i < nmembers; i++, offset++)
@@ -1141,8 +1167,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /* Make sure there is room for the MXID and next offset in the file. */
+ ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1307,7 +1333,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1386,7 +1411,10 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at. In this case the nextOffset value we just
* saved is the correct endpoint.
+ * TODO: how does it work on Standby? MultiXactState->nextMXact does not seem to be up-to date.
+ * nextMXact and nextOffset are in sync, so nothing bad can happen, but nextMXact seems mostly random.
*
+ * THIS IS NOT POSSIBLE ANYMORE, KEEP IT FOR HISTORIC REASONS.
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
@@ -1412,7 +1440,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1476,14 +1503,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1492,12 +1515,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1987,7 +2004,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2198,8 +2214,7 @@ TrimMultiXact(void)
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
if (entryno != 0)
--
2.39.5 (Apple Git-154)
On 31.07.2025 09:29, Andrey Borodin wrote:
Here are two updated patches, one for Postgres 17 and one for master (with a test).
I ran tests on PG17 with patch v9.
I tried to reproduce it for three cases: the first with an explicit FOR KEY SHARE, the second through subtransactions,
and the third through an implicit FOR KEY SHARE via a foreign key.
There are no more sessions hanging on the replica, that's great!
Thank you all very much!
Best regards,
Dmitry.
On Thu, 31 Jul 2025 at 11:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 29 Jul 2025, at 23:15, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I do not understand it yet.
OK, I figured it out. SimpleLruDoesPhysicalPageExist() was reading a physical file and could race with real extension by ExtendMultiXactOffset().
So I used ExtendMultiXactOffset(actual + 1). I hope this does not open a loop for wraparound... Here are two updated patches, one for Postgres 17 and one for master (with a test).
Hi!
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr !=
0 and *offptr != offset?
+ /* Read and adjust next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *)
MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_entryno] = offset + nmembers;
Should we check that the value of next_offptr[next_entryno] is equal to
zero or offset + nmembers? Assert or
errcode(ERRCODE_DATA_CORRUPTED) here as well.
--
Best regards,
Kirill Reshke
Hi Kirill, thanks for looking into this!
On 20 Aug 2025, at 12:19, Kirill Reshke <reshkekirill@gmail.com> wrote:
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
Should we exit here with errcode(ERRCODE_DATA_CORRUPTED) if *offptr != 0 and *offptr != offset?
No, we should not exit. We encountered inconsistencies that we are fully prepared to fix. But you are right: we'd better emit a WARNING with XX001.
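For illustration only, here is a sketch of what such a warning could look like in RecordNewMultiXact(); the exact wording is hypothetical and not part of any posted patch:
---
	/*
	 * Hypothetical sketch: report the inconsistency with SQLSTATE XX001
	 * (ERRCODE_DATA_CORRUPTED) instead of only asserting, then overwrite.
	 */
	if (*offptr != 0 && *offptr != offset)
		ereport(WARNING,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("multixact %u already had offset %u, overwriting with %u",
						multi, *offptr, offset)));
	*offptr = offset;
---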
+ /* Read and adjust next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ next_offptr[next_entryno] = offset + nmembers;
Should we check that the value of next_offptr[next_entryno] is equal to zero or offset + nmembers? Assert or
errcode(ERRCODE_DATA_CORRUPTED) here as well.
Yes, we'd better WARN the user here.
Thanks for your valuable suggestions. I'm not sending a new version of the patch yet, because I'm waiting for input on the overall design from Alvaro or any committer willing to fix this. We need to figure out whether this radical approach is acceptable to backpatch. I do not see other options, but someone might have more clever ideas.
Best regards, Andrey Borodin.
Hi, Andrey!
I started reviewing the TAP tests, and on the current master (14ee8e6403001c3788f2622cdcf81a8451502dc2)
src/test/modules/test_slru/t/001_multixact.pl reproduces the problem, but we can do it in a more transparent way.
The test should fail on timeout; otherwise, it would be hard to find the source of the problem.
The current state of 001_multixact.pl just leaves the 'mike' node in a deadlocked state, and
it looks like a very long test that never finishes, for no obvious reason.
I believe that we should provide more comprehensible feedback to developers
by interrupting the deadlocked state with a psql timeout. 15 seconds seems safe
enough to distinguish between slow node operation and a deadlock.
Here is the modified 001_multixact.pl
----
# Verify thet recorded multi is readble, this call must not hang.
# Also note that all injection points disappeared after server restart.
my $timed_out = 0;
$node->safe_psql(
'postgres',
qq{SELECT test_read_multixact('$multi'::xid);},
timeout => 15,
timed_out => \$timed_out);
ok($timed_out == 0, 'recorded multi is readble');
$node->stop;
done_testing();
----
Hi Ivan!
Thanks for the review.
On 24 Oct 2025, at 19:33, Ivan Bykov <I.Bykov@modernsys.ru> wrote:
I believe that we should provide more comprehensible feedback to developers
by interrupting the deadlock state with a psql timeout. 15 seconds seems safe
enough to distinguish between slow node operation and deadlock issue.
Makes sense, but I used $PostgreSQL::Test::Utils::timeout_default seconds.
On 24 Oct 2025, at 22:16, Bykov Ivan <i.bykov@ftdata.ru> wrote:
In GetNewMultiXactId() this code may lead to error
---
ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
---
If MultiXactState->nextMXact = MaxMultiXactId (0xFFFFFFFF)
we will not extend MultiXactOffset as we should
It will extend the SLRU; these calculations are handled in ExtendMultiXactOffset(). Moreover, we should eliminate the "multi != FirstMultiXactId" bailout condition too. I've extended the comments and added an Assert(). I've added a test for this, but it's whacky-hacky with dd, resetwal and such. But it uncovered a wrong assertion in this code:
/* This is an overflow-only branch */
Assert(next_entryno == 0 || next == FirstMultiXactId);
This "next == FirstMultiXactId" was missing.
I also included Kirill's suggestions.
GPT is also worried about the initialization of page 0, but I don't fully understand its concerns:
"current MXID’s page is never extended: GetNewMultiXactId() now calls ExtendMultiXactOffset(MultiXactState->nextMXact + 1) instead of extending the page that will hold result. This skips zeroing the page for the MXID we are about to assign, so the first allocation on a fresh cluster (or after wrap) tries to write into an uninitialized SLRU page and will fail/crash once the buffer manager attempts I/O."
I think we now have good coverage of what happens on a fresh cluster and after a wraparound.
"wraparound can’t recreate page 0: ExtendMultiXactOffset() now asserts multi != FirstMultiXactId, yet after wrap the first ID on page 0 is exactly FirstMultiXactId. Because callers no longer pass that value, page 0 is never re-created and we keep reusing stale offsets."
Page 0 is actually created via a different path at cluster initialization. Everything works fine on wraparound, though.
Thanks!
Best regards, Andrey Borodin.
Attachments:
v10-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch (application/octet-stream)
From d21ba6f71280e03b0e402558626d00ec4b79a156 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v10] Avoid multixact edge case 2 by writing the next offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
RecordNewMultiXact now writes the offset of the
following multixact so readers never block waiting for “case 2”.
The new TAP test reproduces IPC/MultixactCreation hangs and verifies
that previously recorded multis stay readable across crash recovery.
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@postgresql.org>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
---
src/backend/access/transam/multixact.c | 101 ++++++++-----
src/test/modules/test_slru/t/001_multixact.pl | 141 +++++++++---------
src/test/modules/test_slru/test_multixact.c | 1 -
src/test/perl/PostgreSQL/Test/Cluster.pm | 40 +++++
src/test/recovery/t/017_shm.pl | 38 +----
5 files changed, 173 insertions(+), 148 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..ddca8433cc1 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -271,12 +271,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -915,10 +909,20 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int i;
LWLock *lock;
LWLock *prevlock = NULL;
+ MultiXactId next = multi + 1;
+ int next_pageno;
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /*
+ * We must also fill next offset to keep current multi readable
+ * Handle wraparound as GetMultiXactIdMembers() does it.
+ */
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -933,19 +937,45 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ /*
+ * We might have filled this offset previosuly.
+ * Cross-check for correctness.
+ */
+ Assert((*offptr == 0) || (*offptr == offset));
+ *offptr = offset;
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ /* Also we records next offset here */
+ if (next_pageno == pageno)
+ {
+ offptr[1] = offset + nmembers;
+ }
+ else
+ {
+ int next_slotno;
+ MultiXactOffset *next_offptr;
+ int next_entryno = MultiXactIdToOffsetEntry(next);
+ /* This is an overflow-only branch */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+ /* Swap the lock for a lock of next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /* Read and adjust next page */
+ next_slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[next_slotno];
+ Assert((next_offptr[next_entryno] == 0)
+ || (next_offptr[next_entryno] == offset + nmembers));
+ next_offptr[next_entryno] = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[next_slotno] = true;
+ }
+
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
for (i = 0; i < nmembers; i++, offset++)
@@ -1138,8 +1168,13 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the MXID and next offset in the file.
+ * We might overflow to the next segment, but we don't need to handle
+ * FirstMultiXactId specifically, because ExtendMultiXactOffset handles
+ * both cases well: 0 offset and FirstMultiXactId would create segment.
+ */
+ ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1304,7 +1339,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1383,7 +1417,10 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* 1. This multixact may be the latest one created, in which case there is
* no next one to look at. In this case the nextOffset value we just
* saved is the correct endpoint.
+ * TODO: how does it work on Standby? MultiXactState->nextMXact does not seem to be up-to date.
+ * nextMXact and nextOffset are in sync, so nothing bad can happen, but nextMXact seems mostly random.
*
+ * THIS IS NOT POSSIBLE ANYMORE, KEEP IT FOR HISTORIC REASONS.
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
@@ -1409,7 +1446,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1473,16 +1509,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1491,12 +1521,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1986,7 +2010,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2136,8 +2159,7 @@ TrimMultiXact(void)
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
if (entryno != 0)
@@ -2487,10 +2509,11 @@ ExtendMultiXactOffset(MultiXactId multi)
/*
* No work except at first MultiXactId of a page. But beware: just after
- * wraparound, the first MultiXactId of page zero is FirstMultiXactId.
+ * wraparound, the first MultiXactId of page zero is FirstMultiXactId,
+ * make sure we are not in that case.
*/
- if (MultiXactIdToOffsetEntry(multi) != 0 &&
- multi != FirstMultiXactId)
+ Assert(multi != FirstMultiXactId);
+ if (MultiXactIdToOffsetEntry(multi) != 0)
return;
pageno = MultiXactIdToOffsetPage(multi);
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..2f85802f920 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -18,6 +18,10 @@ if ($ENV{enable_injection_points} ne 'yes')
{
plan skip_all => 'Injection points not supported by this build';
}
+if ($windows_os)
+{
+ plan skip_all => 'Kill9 works unpredicatably on Windows';
+}
my ($node, $result);
@@ -25,96 +29,87 @@ $node = PostgreSQL::Test::Cluster->new('mike');
$node->init;
$node->append_conf('postgresql.conf',
"shared_preload_libraries = 'test_slru,injection_points'");
+# Set the cluster's next multitransaction to 0xFFFFFFF0.
+my $node_pgdata = $node->data_dir;
+command_ok(
+ [
+ 'pg_resetwal',
+ '--multixact-ids' => '0xFFFFFFF0,0xFFFFFFF0',
+ $node_pgdata
+ ],
+ "set the cluster's next multitransaction to 0xFFFFFFF0");
+command_ok(
+ [
+ 'dd',
+ 'if=/dev/zero',
+ "of=$node_pgdata/pg_multixact/offsets/FFFF",
+ 'bs=4',
+ 'count=65536'
+ ],
+ "init SLRU file");
+
+command_ok(
+ [
+ 'rm',
+ "$node_pgdata/pg_multixact/offsets/0000",
+ ],
+ "drop old SLRU file");
+
$node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# Another multixact test: loosing some multixact must not affect reading near
+# multixacts, even after a crash.
+my $bg_psql = $node->background_psql('postgres');
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+my $multi = $bg_psql->query_safe(
+ q(SELECT test_create_multixact();));
+
+# The space for next multi will be allocated, but it will never be actually
+# recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/deploying lost multi/, q(
+\echo deploying lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+ q{checkpoint;});
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# One more multitransaction to effectivelt emit WAL record about next
+# multitransaction (to avaoid corener case 1).
+$node->safe_psql('postgres',
+ q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->stop('immediate', fail_ok => 1);
+$node->poll_start;
+$bg_psql->{run}->finish;
+
+# Verify thet recorded multi is readble, this call must not hang.
+# Also note that all injection points disappeared after server restart.
+my $timed_out = 0;
+$node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi'::xid);},
+ timeout => $PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ok($timed_out == 0, 'recorded multi is readble');
+
+# Test mxidwraparound
+foreach my $i (1 .. 32) {
+$node->safe_psql('postgres',q{SELECT test_create_multixact();});
+}
$node->stop;
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..2fd67273ee5 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -46,7 +46,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..e810f123f93 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1191,6 +1191,46 @@ sub start
=pod
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster. Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+ my ($self) = @_;
+
+ my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+ my $attempts = 0;
+
+ while ($attempts < $max_attempts)
+ {
+ $self->start(fail_ok => 1) && return 1;
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+
+ # Clean up in case the start attempt just timed out or some such.
+ $self->stop('fast', fail_ok => 1);
+
+ $attempts++;
+ }
+
+ # Try one last time without fail_ok, which will BAIL_OUT unless it
+ # succeeds.
+ $self->start && return 1;
+ return 0;
+}
+
+=pod
+
=item $node->kill9()
Send SIGKILL (signal 9) to the postmaster.
diff --git a/src/test/recovery/t/017_shm.pl b/src/test/recovery/t/017_shm.pl
index c73aa3f0c2c..ac238544217 100644
--- a/src/test/recovery/t/017_shm.pl
+++ b/src/test/recovery/t/017_shm.pl
@@ -67,7 +67,7 @@ log_ipcs();
# Upon postmaster death, postmaster children exit automatically.
$gnat->kill9;
log_ipcs();
-poll_start($gnat); # gnat recycles its former shm key.
+$gnat->poll_start; # gnat recycles its former shm key.
log_ipcs();
note "removing the conflicting shmem ...";
@@ -82,7 +82,7 @@ log_ipcs();
# the higher-keyed segment that the previous postmaster was using.
# That's not great, but key collisions should be rare enough to not
# make this a big problem.
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
$gnat->stop;
log_ipcs();
@@ -174,43 +174,11 @@ $slow_client->finish; # client has detected backend termination
log_ipcs();
# now startup should work
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
# finish testing
$gnat->stop;
log_ipcs();
-
-# We may need retries to start a new postmaster. Causes:
-# - kernel is slow to deliver SIGKILL
-# - postmaster parent is slow to waitpid()
-# - postmaster child is slow to exit in response to SIGQUIT
-# - postmaster child is slow to exit after postmaster death
-sub poll_start
-{
- my ($node) = @_;
-
- my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
- my $attempts = 0;
-
- while ($attempts < $max_attempts)
- {
- $node->start(fail_ok => 1) && return 1;
-
- # Wait 0.1 second before retrying.
- usleep(100_000);
-
- # Clean up in case the start attempt just timed out or some such.
- $node->stop('fast', fail_ok => 1);
-
- $attempts++;
- }
-
- # Try one last time without fail_ok, which will BAIL_OUT unless it
- # succeeds.
- $node->start && return 1;
- return 0;
-}
-
done_testing();
--
2.51.2
Hi!
It seems my previous email was sent only to Andrey directly and didn't pass moderation
because it had a patch attached. I've now resent it, without the patch, from another email address.
----
In GetNewMultiXactId() this code may lead to error
---
ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
---
If MultiXactState->nextMXact = MaxMultiXactId (0xFFFFFFFF)
we will not extend MultiXactOffset as we should
---
ExtendMultiXactOffset(0);
MultiXactIdToOffsetEntry(0)
multi % MULTIXACT_OFFSETS_PER_PAGE = 0
return; /* skip SLRU extension */
---
Perhaps we should introduce a simple function to handle the next-MultiXact
calculation:
---
static inline MultiXactId
NextMultiXactId(MultiXactId multi)
{
return multi == MaxMultiXactId ? FirstMultiXactId : multi + 1;
}
---
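For example, here is a sketch of how the helper might be used at the call site discussed above (illustration only):
---
/*
 * Sketch: extend the offsets SLRU for the multixact that follows nextMXact,
 * letting the helper handle wraparound to FirstMultiXactId.
 */
ExtendMultiXactOffset(NextMultiXactId(MultiXactState->nextMXact));
---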
I've attached a patch that fixes this issue, although it seems I've discovered
another overflow bug in multixact_redo().
We might call:
---
multixact_redo()
MultiXactAdvanceNextMXact(0xFFFFFFFF + 1, ...);
---
And if MultiXactState->nextMXact != InvalidMultiXactId (0), we will have
MultiXactState->nextMXact = 0.
This appears to cause problems in code that assumes MultiXactState->nextMXact
holds a valid MultiXactId.
For instance, in GetMultiXactId(), we would return an incorrect number
of MultiXacts.
That said, spreading MultiXact overflow handling throughout the multixact.c code
seems error-prone.
Maybe we should use a macro instead (which would also allow us to modify this
check and add compiler hints):
---
#define MultiXactAdjustOverflow(mxact) \
if (unlikely((mxact) < FirstMultiXactId)) \
mxact = FirstMultiXactId;
---
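As an illustration, the macro could guard the advanced value in multixact_redo() before it is stored; this is a hypothetical placement that assumes the xl_multixact_create fields used there:
---
	/*
	 * Hypothetical use in multixact_redo(): clamp the wrapped-around value
	 * before advancing the shared state.
	 */
	MultiXactId nextMXact = xlrec->mid + 1;

	MultiXactAdjustOverflow(nextMXact);
	MultiXactAdvanceNextMXact(nextMXact, xlrec->moff + xlrec->nmembers);
---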
On 24 Nov 2025, at 11:20, Ivan Bykov <I.Bykov@modernsys.ru> wrote:
It seems my previous email was sent only to Andrey directly and didn't pass moderation
because it had a patch attached. I've now resent it, without the patch, from another email address.
Hi Ivan!
Nope, the message hit the list, but only as a separate thread. To reply in the same thread, please use the "Reply all" function of your mail agent. Also, it's a good idea to quote the relevant part of the message that you are replying to.
I've included a quotation of your message in my answer in this thread. It's important to stick to one thread, to help folks coming from the commitfest connect the dots and find your messages.
If you do not have the message in your mailbox, you can send it to yourself from the archives; there's a "Resend email" button for this purpose.
Your message from "i.bykov@modernsys.ru" landed in the correct thread.
Thanks for your review!
Best regards, Andrey Borodin.
Hi, Andrey!
Thanks for your review!
The review is still in progress; sorry for the delay. I didn't have enough time to fully understand the changes you suggest, but it seems there is
only a small gap in my understanding of what the patch does. Here is my explanation of the problem.
The main problem
=============
The main problem is that we may lose the session context before writing the offset to SLRU (while a WAL record may already have
been written). It seems the writer process got stuck in the XLogInsert procedure, or even failed, between the GetNewMultiXactId
and RecordNewMultiXact calls. In this case, readers will wait for a condition variable signal (from new multixacts)
but cannot find a valid offset for the "failed" multixid (which may nevertheless be in WAL).
I illustrate this with the following diagram.
Writer Reader
--------------------------------------------------------------------------------
MultiXactIdCreateFromMembers
-> GetNewMultiXactId (101)
GetMultiXactIdMembers(100)
-> LWLockAcquire(MultiXactOffset)
-> read offset 100
-> read offset 101
-> LWLockRelease(MultiXactOffset)
offset 101 == 0
-> ConditionVariableSleep()
+--------------------------------------------------------------------------------------+
|-> XLogInsert |
+--------------------------------------------------------------------------------------+
-> RecordNewMultiXact
-> LWLockAcquire(MultiXactOffset)
-> write offset 101
-> LWLockRelease(MultiXactOffset)
-> ConditionVariableBroadcast(nextoff_cv);
-> retry:
-> LWLockAcquire(MultiXactOffset)
-> read offset 100
-> read offset 101
-> LWLockRelease(MultiXactOffset)
offset 101 != 0
-> length = offset 101 - read offset 100
-> LWLockAcquire(MultiXactMember)
-> write members 101
-> LWLockRelease(MultiXactOffset)
--------------------------------------------------------------------------------------
As far as I can see, your proposal seems to address exactly that problem.
The main difference from the former solution is that all the information the reader needs is written to the
MultiXactOffset SLRU atomically at multixact insertion.
Before this change, we effectively extended the multixact insertion time window up to the next multixact's
insertion, which seems a risky design.
I illustrate the new solution with the following diagram.
Writer Reader
--------------------------------------------------------------------------------
MultiXactIdCreateFromMembers
-> GetNewMultiXactId (100)
-> XLogInsert
-> RecordNewMultiXact
-> LWLockAcquire(MultiXactOffset)
-> write offset 100
-> write offset 101 *****************************************************************
-> LWLockRelease(MultiXactOffset)
GetMultiXactIdMembers(100)
-> LWLockAcquire(MultiXactOffset)
-> read offset 100
-> read offset 101
-> LWLockRelease(MultiXactOffset)
Assert(offset 101 == 0)
-> ConditionVariableSleep()
-> length = offset 101 - read offset 100
--------------------------------------------------------------------------------------
So if I understand the core idea of your solution right, I think that the code in the last patch
(v10-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch) is correct and does what it should.
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
Hi, Andrey!
The patch applies correctly to d4c0f91f (master)
Here are some minor review comments.
***
It's worth adding a comment explaining why we don't use the lock swap protocol for MultiXactOffsetCtl in RecordNewMultiXact,
unlike for MultiXactMemberCtl (where we check whether rotation is required before swapping the lock).
The lock will be the same only at wraparound (when MultiXactOffsetCtl->nbanks = 1).
Since this is a rare case, it's not worth handling in the code, but it should be documented with a comment.
It seems even more confusing if we inspect GetMultiXactIdMembers, where the MultiXactOffsetCtl code
checks for the wraparound case before rotating the lock (rotation is only required at wraparound).
So it would be better to simplify the MultiXactOffsetCtl lock swapping in GetMultiXactIdMembers
in the same manner as it is done in RecordNewMultiXact (drop the extra checks, since they indicate a swap is required almost every time).
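For reference, here is a sketch of the check-before-swap pattern being discussed, mirroring how the members SLRU is handled; the local variable names are hypothetical:
---
	/*
	 * Sketch: release and reacquire the SLRU bank lock only when the next
	 * page actually lives in a different bank.
	 */
	LWLock	   *next_lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);

	if (next_lock != lock)
	{
		LWLockRelease(lock);
		lock = next_lock;
		LWLockAcquire(lock, LW_EXCLUSIVE);
	}
---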
***
I believe we should use int64 instead of int for next_pageno in RecordNewMultiXact().
There's a specific reason for using int64 - see below:
4ed8f09 Index SLRUs by 64-bit integers rather than by 32-bit integers
/messages/by-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyjVWA@mail.gmail.com
/messages/by-id/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq+vfkmTF5Q@mail.gmail.com
***
We should delete the MULTIXACT_CREATION wait event type from src/backend/utils/activity/wait_event_names.txt,
because we delete the corresponding condition variable.
***
I also noticed some minor typos; here is the corrected version.
----
/* Also we record next offset here */
Kill9 works unpredictably on Windows (an extra 'a' was in 'unpredicatably')
# Verify that recorded multi is readable, this call must not hang.
recorded multi is readable
# Test mxid wraparound
The new status of this patch is: Waiting on Author
This approach makes a lot of sense to me.
I think there's one more case that this fixes: If someone uses
"pg_resetwal -m ..." to advance nextMulti, that's yet another case where
the last multixid becomes unreadable because we haven't set the next
mxid's offset in the SLRU yet. People shouldn't be running pg_resetwal
unnecessarily, but the docs for pg_resetwal include instructions on how
to determine a safe value for nextMulti. You will in fact lose the last
multixid if you follow the instructions, so it's not as safe as it
sounds. This patch fixes that issue too.
Álvaro, are you planning to commit this? I've been looking at this code
lately for the 64-bit mxoffsets patch, so I could also pick it up if
you'd like.
@@ -1383,7 +1417,10 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 * 1. This multixact may be the latest one created, in which case there is
 * no next one to look at. In this case the nextOffset value we just
 * saved is the correct endpoint.
+ * TODO: how does it work on Standby? MultiXactState->nextMXact does not seem to be up-to date.
+ * nextMXact and nextOffset are in sync, so nothing bad can happen, but nextMXact seems mostly random.
That's scary. AFAICS MultiXactState->nextMXact should be up-to-date in
hot standby mode. Are we missing MultiXactAdvanceNextMXact() calls
somewhere, or what's going on?
Case 1 is still valid: you can indeed just look at the saved
nextOffset. But now the next offset should also be set in the SLRU,
except right after upgrading to a version that has this patch. So I
guess we still need to have this special case to deal with upgrades.
*
+ * THIS IS NOT POSSIBLE ANYMORE, KEEP IT FOR HISTORIC REASONS.
* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
This comment needs to be cleaned up to explain how all this works now...
@@ -1138,8 +1168,13 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
 }
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the MXID and next offset in the file.
+ * We might overflow to the next segment, but we don't need to handle
+ * FirstMultiXactId specifically, because ExtendMultiXactOffset handles
+ * both cases well: 0 offset and FirstMultiXactId would create segment.
+ */
+ ExtendMultiXactOffset(MultiXactState->nextMXact + 1);
 /*
* Reserve the members space, similarly to above. Also, be careful not to
Does this create the file correctly, if you upgrade the binary to a new
version that contains this patch, and nextMXact was at a segment
boundary before the upgrade?
@@ -2487,10 +2509,11 @@ ExtendMultiXactOffset(MultiXactId multi)
/*
* No work except at first MultiXactId of a page. But beware: just after
- * wraparound, the first MultiXactId of page zero is FirstMultiXactId.
+ * wraparound, the first MultiXactId of page zero is FirstMultiXactId,
+ * make sure we are not in that case.
*/
- if (MultiXactIdToOffsetEntry(multi) != 0 &&
- multi != FirstMultiXactId)
+ Assert(multi != FirstMultiXactId);
+ if (MultiXactIdToOffsetEntry(multi) != 0)
return;

pageno = MultiXactIdToOffsetPage(multi);
I don't quite understand this change. I guess the point is that the
caller now never calls this with FirstMultiXactId, but it feels a bit
weird to assert and rely on that here.
I didn't look at the test changes yet.
- Heikki
On 2025-Nov-25, Heikki Linnakangas wrote:
Álvaro, are you planning to commit this? I've been looking at this code
lately for the 64-bit mxoffsets patch, so I could also pick it up if you'd
like.
I would be glad if you can get it over the finish line. I haven't found
time/energy to review it in depth.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Computing is too important to be left to men." (Karen Spärck Jones)
Here's a new version of this. Notable changes:
- I reverted the changes to ExtendMultiXactOffset(), so that it deals
with wraparound and FirstMultiXactId the same way as before. The caller
never passes FirstMultiXactId, but the changed comments and the
assertion were confusing, so I felt it's best to just leave it alone
- bunch of comment changes & other cosmetic changes
- I modified TrimMultiXact() to initialize the page corresponding to
'nextMulti', because if you just swapped the binary to the new one, and
nextMulti was at a page boundary, it would not be initialized yet.
If we want to backpatch this, and I think we need to because this fixes
real bugs, we need to think through all the upgrade scenarios. I made
the above-mentioned changes to TrimMultiXact(), but it doesn't fix all
the problems.
What happens if you replay the WAL generated with old binary, without
this patch, with new binary? It's not good:
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/01766A68
FATAL: could not access status of transaction 2048
DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset
8192: read too few bytes.
CONTEXT: WAL redo at 0/05561030 for MultiXact/CREATE_ID: 2047 offset
4093 nmembers 2: 2830 (keysh) 2831 (keysh)
LOG: startup process (PID 3130184) exited with exit code 1
This is because the WAL, created with old version, contains records like
this:
lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093
nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095
nmembers 2: 2831 (keysh) 2832 (keysh)
When replaying that with the new version, replay of the CREATE_ID 2047
record tries to set the next multixid's offset, but the page hasn't been
initialized yet. With the new version, the ZERO_OFF_PAGE 1 record would
appear before the CREATE_ID 2047 record, but we can't change the WAL
that already exists.
- Heikki
Attachments:
v11-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch (text/x-patch)
From e84d56b24d034694822710bca971eb7ac3a5a930 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 26 Nov 2025 18:07:01 +0200
Subject: [PATCH v11 1/1] Avoid multixact edge case 2 by writing the next
offset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
RecordNewMultiXact now writes the offset of the
following multixact so readers never block waiting for “case 2”.
The new TAP test reproduces IPC/MultixactCreation hangs and verifies
that previously recorded multis stay readable across crash recovery.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@postgresql.org>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
---
src/backend/access/transam/multixact.c | 150 +++++++++++-------
src/test/modules/test_slru/t/001_multixact.pl | 141 ++++++++--------
src/test/modules/test_slru/test_multixact.c | 1 -
src/test/perl/PostgreSQL/Test/Cluster.pm | 40 +++++
src/test/recovery/t/017_shm.pl | 38 +----
5 files changed, 201 insertions(+), 169 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..d1e15ff5624 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -79,7 +79,6 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -271,12 +270,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -912,13 +905,33 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLock *lock;
LWLock *prevlock = NULL;
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it has already been set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -933,22 +946,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+ /* Swap the lock for a lock on the next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
+
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ }
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1138,8 +1179,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1304,7 +1348,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1381,23 +1424,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We handle this by sleeping on the condition variable
- * we have just for this; the process in charge will signal the CV as soon
- * as it has finished writing the multixact offset.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1409,7 +1443,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1473,16 +1506,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1491,12 +1518,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1539,7 +1560,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -1986,7 +2007,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2132,26 +2152,36 @@ TrimMultiXact(void)
pageno);
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
LWLock *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if ((entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
LWLockRelease(lock);
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..2f85802f920 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -18,6 +18,10 @@ if ($ENV{enable_injection_points} ne 'yes')
{
plan skip_all => 'Injection points not supported by this build';
}
+if ($windows_os)
+{
+ plan skip_all => 'Kill9 works unpredicatably on Windows';
+}
my ($node, $result);
@@ -25,96 +29,87 @@ $node = PostgreSQL::Test::Cluster->new('mike');
$node->init;
$node->append_conf('postgresql.conf',
"shared_preload_libraries = 'test_slru,injection_points'");
+# Set the cluster's next multitransaction to 0xFFFFFFF0.
+my $node_pgdata = $node->data_dir;
+command_ok(
+ [
+ 'pg_resetwal',
+ '--multixact-ids' => '0xFFFFFFF0,0xFFFFFFF0',
+ $node_pgdata
+ ],
+ "set the cluster's next multitransaction to 0xFFFFFFF0");
+command_ok(
+ [
+ 'dd',
+ 'if=/dev/zero',
+ "of=$node_pgdata/pg_multixact/offsets/FFFF",
+ 'bs=4',
+ 'count=65536'
+ ],
+ "init SLRU file");
+
+command_ok(
+ [
+ 'rm',
+ "$node_pgdata/pg_multixact/offsets/0000",
+ ],
+ "drop old SLRU file");
+
$node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# Another multixact test: loosing some multixact must not affect reading near
+# multixacts, even after a crash.
+my $bg_psql = $node->background_psql('postgres');
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+my $multi = $bg_psql->query_safe(
+ q(SELECT test_create_multixact();));
+
+# The space for next multi will be allocated, but it will never be actually
+# recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/deploying lost multi/, q(
+\echo deploying lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+ q{checkpoint;});
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# One more multitransaction to effectivelt emit WAL record about next
+# multitransaction (to avaoid corener case 1).
+$node->safe_psql('postgres',
+ q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->stop('immediate', fail_ok => 1);
+$node->poll_start;
+$bg_psql->{run}->finish;
+
+# Verify thet recorded multi is readble, this call must not hang.
+# Also note that all injection points disappeared after server restart.
+my $timed_out = 0;
+$node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi'::xid);},
+ timeout => $PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ok($timed_out == 0, 'recorded multi is readble');
+
+# Test mxidwraparound
+foreach my $i (1 .. 32) {
+$node->safe_psql('postgres',q{SELECT test_create_multixact();});
+}
$node->stop;
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..2fd67273ee5 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -46,7 +46,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..e810f123f93 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1191,6 +1191,46 @@ sub start
=pod
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster. Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+ my ($self) = @_;
+
+ my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+ my $attempts = 0;
+
+ while ($attempts < $max_attempts)
+ {
+ $self->start(fail_ok => 1) && return 1;
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+
+ # Clean up in case the start attempt just timed out or some such.
+ $self->stop('fast', fail_ok => 1);
+
+ $attempts++;
+ }
+
+ # Try one last time without fail_ok, which will BAIL_OUT unless it
+ # succeeds.
+ $self->start && return 1;
+ return 0;
+}
+
+=pod
+
=item $node->kill9()
Send SIGKILL (signal 9) to the postmaster.
diff --git a/src/test/recovery/t/017_shm.pl b/src/test/recovery/t/017_shm.pl
index c73aa3f0c2c..ac238544217 100644
--- a/src/test/recovery/t/017_shm.pl
+++ b/src/test/recovery/t/017_shm.pl
@@ -67,7 +67,7 @@ log_ipcs();
# Upon postmaster death, postmaster children exit automatically.
$gnat->kill9;
log_ipcs();
-poll_start($gnat); # gnat recycles its former shm key.
+$gnat->poll_start; # gnat recycles its former shm key.
log_ipcs();
note "removing the conflicting shmem ...";
@@ -82,7 +82,7 @@ log_ipcs();
# the higher-keyed segment that the previous postmaster was using.
# That's not great, but key collisions should be rare enough to not
# make this a big problem.
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
$gnat->stop;
log_ipcs();
@@ -174,43 +174,11 @@ $slow_client->finish; # client has detected backend termination
log_ipcs();
# now startup should work
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
# finish testing
$gnat->stop;
log_ipcs();
-
-# We may need retries to start a new postmaster. Causes:
-# - kernel is slow to deliver SIGKILL
-# - postmaster parent is slow to waitpid()
-# - postmaster child is slow to exit in response to SIGQUIT
-# - postmaster child is slow to exit after postmaster death
-sub poll_start
-{
- my ($node) = @_;
-
- my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
- my $attempts = 0;
-
- while ($attempts < $max_attempts)
- {
- $node->start(fail_ok => 1) && return 1;
-
- # Wait 0.1 second before retrying.
- usleep(100_000);
-
- # Clean up in case the start attempt just timed out or some such.
- $node->stop('fast', fail_ok => 1);
-
- $attempts++;
- }
-
- # Try one last time without fail_ok, which will BAIL_OUT unless it
- # succeeds.
- $node->start && return 1;
- return 0;
-}
-
done_testing();
--
2.47.3
On 2025-11-26, Heikki Linnakangas wrote:
What happens if you replay the WAL generated with old binary, without
this patch, with new binary? It's not good:
Maybe this needs a new record identifier, separating old wal from that generated by the new code?
--
Álvaro Herrera
Hi Heikki,
I just reviewed this patch. As offset[x+1] anyway equals offset[x]+nmembers, pre-write offset[x+1] seems a very clever solution.
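For intuition, here is a minimal standalone toy (my own illustration, not PostgreSQL code; it assumes the offsets SLRU can be modelled as a plain array with 0 meaning "unset"): each creator fills its own slot and pre-writes the next one, so a reader can always compute the member count as offsets[x+1] - offsets[x], even when the records are applied out of order.
```
#include <stdio.h>

#define NSLOTS 8

/* toy stand-in for RecordNewMultiXact(): fill slot x and pre-write slot x + 1 */
static void
record(unsigned *offsets, unsigned x, unsigned offset, unsigned nmembers)
{
	offsets[x] = offset;                /* may already hold this same value */
	offsets[x + 1] = offset + nmembers; /* pre-write the next multi's start */
}

int
main(void)
{
	unsigned	offsets[NSLOTS] = {0};  /* 0 means "unset", like the offsets SLRU */

	/* multis 1..3 with 2, 3 and 1 members; starting offsets 1, 3, 6 */
	record(offsets, 2, 3, 3);           /* applied out of order on purpose */
	record(offsets, 1, 1, 2);
	record(offsets, 3, 6, 1);

	/* a reader never has to wait: the next slot is always already filled */
	for (unsigned x = 1; x <= 3; x++)
		printf("multi %u: %u members\n", x, offsets[x + 1] - offsets[x]);

	return 0;
}
```
Whatever order the records are applied in, every slot ends up with the same final value, which is why the concurrent case in the patch only needs the "already set to the same value" assertions.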
I got a few questions/comments as below:
On Nov 27, 2025, at 04:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Here's a new version of this. Notable changes:
- I reverted the changes to ExtendMultiXactOffset(), so that it deals with wraparound and FirstMultiXactId the same way as before. The caller never passes FirstMultiXactId, but the changed comments and the assertion were confusing, so I felt it's best to just leave it alone
- bunch of comment changes & other cosmetic changes
- I modified TrimMultiXact() to initialize the page corresponding to 'nextMulti', because if you just swapped the binary to the new one, and nextMulti was at a page boundary, it would not be initialized yet.
If we want to backpatch this, and I think we need to because this fixes real bugs, we need to think through all the upgrade scenarios. I made the above-mentioned changes to TrimMultiXact(), but it doesn't fix all the problems.
What happens if you replay the WAL generated with old binary, without this patch, with new binary? It's not good:
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/01766A68
FATAL: could not access status of transaction 2048
DETAIL: Could not read from file "pg_multixact/offsets/0000" at offset 8192: read too few bytes.
CONTEXT: WAL redo at 0/05561030 for MultiXact/CREATE_ID: 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
LOG: startup process (PID 3130184) exited with exit code 1
This is because the WAL, created with old version, contains records like this:
lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)
When replaying that with the new version, replay of the CREATE_ID 2047 record tries to set the next multixid's offset, but the page hasn't been initialized yet. With the new version, the ZERO_OFF_PAGE 1 record would appear before the CREATE_ID 2047 record, but we can't change the WAL that already exists.
- Heikki
<v11-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch>
1
```
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
```
This is more like a question. Since the pre-write should always happen, in theory *offptr != offset should never be true, so why do we still need to handle the case instead of just assert(false)?
2
```
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
```
next < FirstMultiXactId will only be true when next wraps around to 0; maybe that deserves a one-line comment to explain it.
3
```
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ }
```
Should MultiXactMemberCtl be MultiXactOffsetCtl? As we are writing to the offset SLRU.
4
```
+# Another multixact test: loosing some multixact must not affect reading near
```
I guess “loosing” should be “losing”
5
```
+# multitransaction (to avaoid corener case 1).
```
Typo: avoid corner case
6
```
+# Verify thet recorded multi is readble, this call must not hang.
```
Typo: readable
7
```
+# One more multitransaction to effectivelt emit WAL record about next
```
Typo: effectivelt -> effectively
8
```
+ plan skip_all => 'Kill9 works unpredicatably on Windows';
```
Typo: unpredicatably, no "a" after "c"
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 26/11/2025 23:15, Álvaro Herrera wrote:
On 2025-11-26, Heikki Linnakangas wrote:
What happens if you replay the WAL generated with old binary, without
this patch, with new binary? It's not good:
Maybe this needs a new record identifier, separating old wal from that generated by the new code?
One downside of that is that the new WAL record type would be unreadable
by older versions. We recommend upgrading standbys before primary, but
it'd still be nicer if we could avoid that.
Maybe we can make RecordNewMultiXact() tolerate the missing page, in
this special case of replaying WAL and the multixid being at the page
boundary.
- Heikki
On 27/11/2025 05:39, Chao Li wrote:
<v11-0001-Avoid-multixact-edge-case-2-by-writing-the-next-.patch>
1
```
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
```
This is more like a question. Since the pre-write should always happen, in theory *offptr != offset should never be true, so why do we still need to handle the case instead of just assert(false)?
No, *offptr == 0 can happen, if multiple backends are generating
multixids concurrently. It's possible that we get here before the
RecordNewMultiXact() call for the previous multixid.
Similarly, the *next_offptr != 0 case can happen if another backend
calls RecordNewMultiXact() concurrently for the next multixid.
2
```
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
```
next < FirstMultiXactId will only be true when next wraps around to 0; maybe that deserves a one-line comment to explain it.
This is a common pattern used in many places in the file.
3
```
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactMemberCtl->shared->page_dirty[slotno] = true;
+ }
```
Should MultiXactMemberCtl be MultiXactOffsetCtl? As we are writing to the offset SLRU.
Good catch! Yes, it should be MultiXactOffsetCtl.
- Heikki
On 27 Nov 2025, at 01:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is because the WAL, created with old version, contains records like this:
lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)
That's an interesting case. I don't see how SLRU interface can be used to test if SLRU page is initialized correctly. We need a version of SimpleLruReadPage() that can avoid failure if page does not exist yet, and this must not be more expensive than current SimpleLruReadPage(). Alternatively we need new XLOG_MULTIXACT_CREATE_ID_2 in back branches.
BTW, my concern about MultiXactState->nextMXact was wrong, now I see it. During the summer I was almost certain that something was wrong and only worked by accident, but now everything looks 100% correct...
Also, when working on v10 I asked an LLM for help with the commit message, and it hallucinated Álvaro's e-mail <alvherre@postgresql.org>. IDK, maybe it's real, but it was not used in this thread.
- I reverted the changes to ExtendMultiXactOffset(), so that it deals with wraparound and FirstMultiXactId the same way as before. The caller never passes FirstMultiXactId, but the changed comments and the assertion were confusing, so I felt it's best to just leave it alone
Maybe move a decision (to extend or not to extend) out of this function? Now its purpose is "MaybeExtendMultiXactOffset". And there's just one caller.
Thanks for picking this up! (And reading the other thread about multixacts more attentively, I see I misinformed you that this bug is connected to that one... sorry! I'll review that one too!)
Best regards, Andrey Borodin.
On 27/11/2025 20:25, Andrey Borodin wrote:
On 27 Nov 2025, at 01:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is because the WAL, created with old version, contains records like this:
lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)
That's an interesting case. I don't see how SLRU interface can be
used to test if SLRU page is initialized correctly. We need a
version of SimpleLruReadPage() that can avoid failure if page does
not exist yet, and this must not be more expensive than current
SimpleLruReadPage(). Alternatively we need new
XLOG_MULTIXACT_CREATE_ID_2 in back branches.
There is SimpleLruDoesPhysicalPageExist() to check if a page has been
initialized. There's also the 'latest_page_number' field which tracks
what is the latest page that has been initialized.
Here's a new version that uses 'latest_page_number' to detect if we're
in this situation (= replaying WAL generated with older minor version).
(I still haven't looked at the tests)
BTW, my concern about MultiXactState->nextMXact was wrong, now I see
it. I was almost certain that something is wrong and works by
accident during summer, but now everything looks 100% correct...
Ok, thanks for double-checking.
Also, when working on v10 I've asked LLM for help with commit
message, and it hallucinated Álvaro's e-mail
<alvherre@postgresql.org> IDK, maybe it's real, but it was not used
in this thread.
Heh, ok, fixed that. You also came up with a last name for Dmitry that I
haven't seen mentioned anywhere in this thread, either.
- I reverted the changes to ExtendMultiXactOffset(), so that it
deals with wraparound and FirstMultiXactId the same way as before.
The caller never passes FirstMultiXactId, but the changed comments
and the assertion were confusing, so I felt it's best to just
leave it alone
Maybe move a decision (to extend or not to extend) out of this
function? Now its purpose is "MaybeExtendMultiXactOffset". And
there's just one caller.
Perhaps. Let's leave that for a separate non-backported patch though.
- Heikki
Attachments:
v12-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patch (text/x-patch)
From 53817d9f158c11c91b661f786eb16c3a78f2cb70 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 28 Nov 2025 18:17:31 +0200
Subject: [PATCH v12 1/1] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing coudl happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen, where one backend waits
for the next multixid assignment record, but WAL replay is not advancing
because of a recovery conflict with the backend
We still need to be able to read WAL that was generated before this
fix. For that, the backpatched version of this commit includes a hack
to initialize the next offsets page when replaying
XLOG_MULTIXACT_CREATE_ID for the last multixid on a page.
The new TAP test reproduces IPC/MultixactCreation hangs and verifies
that previously recorded multis stay readable across crash recovery.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
---
src/backend/access/transam/multixact.c | 200 ++++++++++++------
src/test/modules/test_slru/t/001_multixact.pl | 141 ++++++------
src/test/modules/test_slru/test_multixact.c | 1 -
src/test/perl/PostgreSQL/Test/Cluster.pm | 40 ++++
src/test/recovery/t/017_shm.pl | 38 +---
5 files changed, 250 insertions(+), 170 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..4212a0366fb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -79,7 +79,6 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -271,12 +270,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -381,6 +374,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -912,13 +908,54 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLock *lock;
LWLock *prevlock = NULL;
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+ {
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+ SimpleLruZeroAndWritePage(MultiXactOffsetCtl, next_pageno);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it has already been set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -933,22 +970,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+ /* Swap the lock for a lock on the next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
+
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1138,8 +1203,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1304,7 +1372,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1381,23 +1448,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We handle this by sleeping on the condition variable
- * we have just for this; the process in charge will signal the CV as soon
- * as it has finished writing the multixact offset.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1409,7 +1467,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1473,16 +1530,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1491,12 +1542,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1539,7 +1584,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -1986,7 +2031,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2132,26 +2176,36 @@ TrimMultiXact(void)
pageno);
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
LWLock *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
LWLockRelease(lock);
@@ -3339,7 +3393,16 @@ multixact_redo(XLogReaderState *record)
int64 pageno;
memcpy(&pageno, XLogRecGetData(record), sizeof(pageno));
- SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+ else
+ elog(DEBUG1, "skipping initialization of page %ld", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3355,6 +3418,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..2f85802f920 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -18,6 +18,10 @@ if ($ENV{enable_injection_points} ne 'yes')
{
plan skip_all => 'Injection points not supported by this build';
}
+if ($windows_os)
+{
+ plan skip_all => 'Kill9 works unpredicatably on Windows';
+}
my ($node, $result);
@@ -25,96 +29,87 @@ $node = PostgreSQL::Test::Cluster->new('mike');
$node->init;
$node->append_conf('postgresql.conf',
"shared_preload_libraries = 'test_slru,injection_points'");
+# Set the cluster's next multitransaction to 0xFFFFFFF0.
+my $node_pgdata = $node->data_dir;
+command_ok(
+ [
+ 'pg_resetwal',
+ '--multixact-ids' => '0xFFFFFFF0,0xFFFFFFF0',
+ $node_pgdata
+ ],
+ "set the cluster's next multitransaction to 0xFFFFFFF0");
+command_ok(
+ [
+ 'dd',
+ 'if=/dev/zero',
+ "of=$node_pgdata/pg_multixact/offsets/FFFF",
+ 'bs=4',
+ 'count=65536'
+ ],
+ "init SLRU file");
+
+command_ok(
+ [
+ 'rm',
+ "$node_pgdata/pg_multixact/offsets/0000",
+ ],
+ "drop old SLRU file");
+
$node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# Another multixact test: loosing some multixact must not affect reading near
+# multixacts, even after a crash.
+my $bg_psql = $node->background_psql('postgres');
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+my $multi = $bg_psql->query_safe(
+ q(SELECT test_create_multixact();));
+
+# The space for next multi will be allocated, but it will never be actually
+# recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/deploying lost multi/, q(
+\echo deploying lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+ q{checkpoint;});
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# One more multitransaction to effectivelt emit WAL record about next
+# multitransaction (to avaoid corener case 1).
+$node->safe_psql('postgres',
+ q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->stop('immediate', fail_ok => 1);
+$node->poll_start;
+$bg_psql->{run}->finish;
+
+# Verify thet recorded multi is readble, this call must not hang.
+# Also note that all injection points disappeared after server restart.
+my $timed_out = 0;
+$node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi'::xid);},
+ timeout => $PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ok($timed_out == 0, 'recorded multi is readble');
+
+# Test mxidwraparound
+foreach my $i (1 .. 32) {
+$node->safe_psql('postgres',q{SELECT test_create_multixact();});
+}
$node->stop;
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..2fd67273ee5 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -46,7 +46,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..e810f123f93 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1191,6 +1191,46 @@ sub start
=pod
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster. Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+ my ($self) = @_;
+
+ my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+ my $attempts = 0;
+
+ while ($attempts < $max_attempts)
+ {
+ $self->start(fail_ok => 1) && return 1;
+
+ # Wait 0.1 second before retrying.
+ usleep(100_000);
+
+ # Clean up in case the start attempt just timed out or some such.
+ $self->stop('fast', fail_ok => 1);
+
+ $attempts++;
+ }
+
+ # Try one last time without fail_ok, which will BAIL_OUT unless it
+ # succeeds.
+ $self->start && return 1;
+ return 0;
+}
+
+=pod
+
=item $node->kill9()
Send SIGKILL (signal 9) to the postmaster.
diff --git a/src/test/recovery/t/017_shm.pl b/src/test/recovery/t/017_shm.pl
index c73aa3f0c2c..ac238544217 100644
--- a/src/test/recovery/t/017_shm.pl
+++ b/src/test/recovery/t/017_shm.pl
@@ -67,7 +67,7 @@ log_ipcs();
# Upon postmaster death, postmaster children exit automatically.
$gnat->kill9;
log_ipcs();
-poll_start($gnat); # gnat recycles its former shm key.
+$gnat->poll_start; # gnat recycles its former shm key.
log_ipcs();
note "removing the conflicting shmem ...";
@@ -82,7 +82,7 @@ log_ipcs();
# the higher-keyed segment that the previous postmaster was using.
# That's not great, but key collisions should be rare enough to not
# make this a big problem.
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
$gnat->stop;
log_ipcs();
@@ -174,43 +174,11 @@ $slow_client->finish; # client has detected backend termination
log_ipcs();
# now startup should work
-poll_start($gnat);
+$gnat->poll_start;
log_ipcs();
# finish testing
$gnat->stop;
log_ipcs();
-
-# We may need retries to start a new postmaster. Causes:
-# - kernel is slow to deliver SIGKILL
-# - postmaster parent is slow to waitpid()
-# - postmaster child is slow to exit in response to SIGQUIT
-# - postmaster child is slow to exit after postmaster death
-sub poll_start
-{
- my ($node) = @_;
-
- my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
- my $attempts = 0;
-
- while ($attempts < $max_attempts)
- {
- $node->start(fail_ok => 1) && return 1;
-
- # Wait 0.1 second before retrying.
- usleep(100_000);
-
- # Clean up in case the start attempt just timed out or some such.
- $node->stop('fast', fail_ok => 1);
-
- $attempts++;
- }
-
- # Try one last time without fail_ok, which will BAIL_OUT unless it
- # succeeds.
- $node->start && return 1;
- return 0;
-}
-
done_testing();
--
2.47.3
On 2025-Nov-28, Heikki Linnakangas wrote:
Heh, ok, fixed that. You also came up with a last name for Dmitry that I
haven't seen mentioned anywhere in this thread, either.
/messages/by-id/29bf3d71-bf78-4c74-8525-13d48e0142ec@yandex.ru
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On 28 Nov 2025, at 21:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 27/11/2025 20:25, Andrey Borodin wrote:
On 27 Nov 2025, at 01:59, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This is because the WAL, created with the old version, contains records like this:
lsn: 0/05561030, prev 0/05561008, desc: CREATE_ID 2047 offset 4093 nmembers 2: 2830 (keysh) 2831 (keysh)
lsn: 0/055611A8, prev 0/05561180, desc: ZERO_OFF_PAGE 1
lsn: 0/055611D0, prev 0/055611A8, desc: CREATE_ID 2048 offset 4095 nmembers 2: 2831 (keysh) 2832 (keysh)

That's an interesting case. I don't see how the SLRU interface can be
used to test whether an SLRU page is initialized correctly. We need a
version of SimpleLruReadPage() that can avoid failure if the page does
not exist yet, and this must not be more expensive than the current
SimpleLruReadPage(). Alternatively, we need a new
XLOG_MULTIXACT_CREATE_ID_2 in back branches.

There is SimpleLruDoesPhysicalPageExist() to check if a page has been initialized. There's also the 'latest_page_number' field which tracks what is the latest page that has been initialized.
While working on v8 of this patch I observed races between SimpleLruDoesPhysicalPageExist() and ExtendMultiXactOffset(): ExtendMultiXactOffset() may initialize the page only in a buffer, so a file system call will not observe it.
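For illustration, the interleaving behind that race might look roughly like this (a sketch only; the scaffolding is hypothetical, and only the multixact.c/slru.c call names are real):

    /*
     * Backend A, inside ExtendMultiXactOffset(): the new offsets page is
     * created in an SLRU buffer only; nothing has been written to
     * pg_multixact/offsets on disk yet.
     */
    LWLockAcquire(lock, LW_EXCLUSIVE);
    slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
    LWLockRelease(lock);

    /*
     * Backend B, concurrently: a filesystem-level probe does not see the
     * page that so far exists only in the buffer, and wrongly concludes
     * that it was never initialized.
     */
    if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
    {
        /* the page would be treated as uninitialized here */
    }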
latest_page_number looks great! Though I'm slightly worried by this comment:
/*
* During XLOG replay, latest_page_number isn't necessarily set up
* yet; insert a suitable value to bypass the sanity test in
* SimpleLruTruncate.
*/
and by
/*
* latest_page_number is the page number of the current end of the log;
* this is __not critical data__, since we use it only to avoid swapping out
* the latest page.
*/
If we use it in the startup process to prevent failure, it becomes critical data.
But I think the comments are off; latest_page_number is kept up to date precisely.
The page initialization dance is only needed in back branches. And we will inevitably have conflicts with the SLRU refactoring in 18 and banking in 17. Conceptually v12 looks good to me, I can prepare backport versions.
You definitely beat LLM in commit message clarity. Though, s/coudl/could/g.
On 28 Nov 2025, at 21:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
- I reverted the changes to ExtendMultiXactOffset(), so that it
deals with wraparound and FirstMultiXactId the same way as before.
The caller never passes FirstMultiXactId, but the changed comments
and the assertion were confusing, so I felt it's best to just
leave it alone

Maybe move a decision (to extend or not to extend) out of this
function? Now its purpose is "MaybeExtendMultiXactOffset". And
there's just one caller.

Perhaps. Let's leave that for a separate non-backported patch though.
OK, we just have to be sure that we do not emit 2 ZERO_OFF_PAGE records during wraparound.
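For illustration only, the caller-side check being suggested might look something like this in GetNewMultiXactId() (a sketch, not a patch; 'next' is the multixid following the one just assigned, as in the patch):

    if (MultiXactIdToOffsetEntry(next) == 0 || next == FirstMultiXactId)
        ExtendMultiXactOffset(next);	/* would then always zero the new page */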
The problem seems to have been introduced by 1986ca5 as a fix for a problem that might have been there even longer, perhaps since multitransactions existed. Interestingly, the only report known to me was by Thom Brown in 2024 (besides this thread).
/messages/by-id/CAA-aLv7AcTh+O9rnoPVbck=Fw8atMOYrUttk5G9ctFhXOsqQbw@mail.gmail.com
Thanks!
Best regards, Andrey Borodin.
On 28/11/2025 20:17, Andrey Borodin wrote:
latest_page_number looks great! Though I'm slightly worried by this comment:
/*
* During XLOG replay, latest_page_number isn't necessarily set up
* yet; insert a suitable value to bypass the sanity test in
* SimpleLruTruncate.
*/

and by
/*
* latest_page_number is the page number of the current end of the log;
* this is __not critical data__, since we use it only to avoid swapping out
* the latest page.
*/If we use it in startup process to prevent failure, now it is critical data.
But I think the comments are off, latest_page_number is updated precisely.
Agreed. I looked at the commit history of how the first comment got
introduced, to see if it used to be true at some point, but AFAICS we've
always kept latest_page_number up-to-date. It is very clearly
initialized in StartupMultiXact(), before WAL replay. Furthermore, we
only set the "suitable value to bypass the sanity test" for the offsets
SLRU, but everywhere else where we truncate, extend or set
latest_page_number, we treat the offsets and members SLRUs the same
way. So I don't see how they could be different in that regard.
I think the second comment became outdated in commit bc7d37a525c0, which
introduced the safety check in (what became later) SimpleLruTruncate().
After that, it's been important that latest_page_number is set
correctly, although for the sanity check I guess you could be a little
sloppy with it.
The page initialization dance is only needed in back branches. And
we will inevitably have conflicts with the SLRU refactoring in 18 and
banking in 17. Conceptually v12 looks good to me, I can prepare
backport versions.
Thanks!
Here's a new patch version. I went through the test changes now:
I didn't understand why the 'kill9' and 'poll_start' stuff is needed. We
have plenty of tests that kill the server with regular
"$node->stop('immediate')", and restart the server normally. The
checkpoint in the middle of the tests seems unnecessary too. I removed
all that, and the test still seems to work. Was there a particular
reason for them?
I moved the wraparound test to a separate test file and commit. More
test coverage is good, but it's quite separate from the bugfix and the
wraparound related test shares very little with the other test. The
wraparound test needs a little more cleanup: use plain perl instead of
'dd' and 'rm' for the file operations, for example. (I did that with the
tests in the 64-bit mxoff patches, so we could copy from there.)
- Heikki
Attachments:
v13-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patch
From 769612923b5e2314da4e1422731a4d8413fa3265 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 28 Nov 2025 18:37:21 +0200
Subject: [PATCH v13 1/2] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen, where one backend waits
for the next multixid assignment record, but WAL replay is not advancing
because of a recovery conflict with the backend
We still need to be able to read WAL that was generated before this
fix. For that, the backpatched version of this commit includes a hack
to initialize the next offsets page when replaying
XLOG_MULTIXACT_CREATE_ID for the last multixid on a page.
The new TAP test reproduces IPC/MultixactCreation hangs and verifies
that previously recorded multis stay readable across crash recovery.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
---
src/backend/access/transam/multixact.c | 200 ++++++++++++------
src/test/modules/test_slru/t/001_multixact.pl | 120 ++++-------
src/test/modules/test_slru/test_multixact.c | 5 +-
3 files changed, 179 insertions(+), 146 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..4212a0366fb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -79,7 +79,6 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -271,12 +270,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -381,6 +374,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -912,13 +908,54 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLock *lock;
LWLock *prevlock = NULL;
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+ {
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+ SimpleLruZeroAndWritePage(MultiXactOffsetCtl, next_pageno);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -933,22 +970,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+ /* Swap the lock for a lock on the next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
+
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1138,8 +1203,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1304,7 +1372,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1381,23 +1448,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We handle this by sleeping on the condition variable
- * we have just for this; the process in charge will signal the CV as soon
- * as it has finished writing the multixact offset.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1409,7 +1467,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1473,16 +1530,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1491,12 +1542,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1539,7 +1584,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -1986,7 +2031,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2132,26 +2176,36 @@ TrimMultiXact(void)
pageno);
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
LWLock *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
LWLockRelease(lock);
@@ -3339,7 +3393,16 @@ multixact_redo(XLogReaderState *record)
int64 pageno;
memcpy(&pageno, XLogRecGetData(record), sizeof(pageno));
- SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+ else
+ elog(DEBUG1, "skipping initialization of page %ld", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3355,6 +3418,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..5d8a2183fe5 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,10 +1,6 @@
# Copyright (c) 2024-2025, PostgreSQL Global Development Group
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
+# Test multixid corner cases.
use strict;
use warnings FATAL => 'all';
@@ -29,95 +25,57 @@ $node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# This test creates three multixacts. The middle one is never
+# WAL-logged or recorded on the offsets page, because we pause the
+# backend and crash the server before that. After restart, verify that
+# the other multixacts are readable, despite the middle one being
+# lost.
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# Create the first multixact
+my $bg_psql = $node->background_psql('postgres');
+my $multi1 = $bg_psql->query_safe(q(SELECT test_create_multixact();));
+
+# Assign the middle multixact. Use an injection point to prevent it
+# from being fully recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/assigning lost multi/, q(
+\echo assigning lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+# Create the third multixid
+my $multi2 = $node->safe_psql('postgres', q{SELECT test_create_multixact();});
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# All set and done, it's time for hard restart
+$node->stop('immediate');
+$node->start;
+$bg_psql->{run}->finish;
+
+# Verify that the recorded multixids are readable
+my $timed_out = 0;
+$node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi1'::xid);},
+ timeout => 2, #$PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ok($timed_out == 0, 'first recorded multi is readable');
+
+$timed_out = 0;
+$node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi2'::xid);},
+ timeout => 2, #$PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ok($timed_out == 0, 'second recorded multi is readable');
$node->stop;
-# If we reached this point - everything is OK.
-ok(1);
done_testing();
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..8fb6c19d70f 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -17,7 +17,6 @@
#include "access/multixact.h"
#include "access/xact.h"
#include "fmgr.h"
-#include "utils/injection_point.h"
PG_FUNCTION_INFO_V1(test_create_multixact);
PG_FUNCTION_INFO_V1(test_read_multixact);
@@ -37,8 +36,7 @@ test_create_multixact(PG_FUNCTION_ARGS)
}
/*
- * Reads given multixact after running an injection point. Discards local cache
- * to make a real read. Tailored for multixact testing.
+ * Reads given multixact. Discards local cache to make a real read.
*/
Datum
test_read_multixact(PG_FUNCTION_ARGS)
@@ -46,7 +44,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
--
2.47.3
v13-0002-Add-test-for-multixid-wraparound.patch
From cad589df660a97353aeab247d10c83164d92bf28 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 28 Nov 2025 21:05:18 +0200
Subject: [PATCH v13 2/2] Add test for multixid wraparound
Author: Andrey Borodin <amborodin@acm.org>
---
src/test/modules/test_slru/meson.build | 3 +-
.../test_slru/t/002_multixact_wraparound.pl | 50 +++++++++++++++++++
2 files changed, 52 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/test_slru/t/002_multixact_wraparound.pl
diff --git a/src/test/modules/test_slru/meson.build b/src/test/modules/test_slru/meson.build
index e58bbdf75ac..691b387d13f 100644
--- a/src/test/modules/test_slru/meson.build
+++ b/src/test/modules/test_slru/meson.build
@@ -38,7 +38,8 @@ tests += {
'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
},
'tests': [
- 't/001_multixact.pl'
+ 't/001_multixact.pl',
+ 't/002_multixact_wraparound.pl'
],
},
}
diff --git a/src/test/modules/test_slru/t/002_multixact_wraparound.pl b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
new file mode 100644
index 00000000000..efaf902e5e3
--- /dev/null
+++ b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
@@ -0,0 +1,50 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+# Test multixact wraparound
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf',
+ "shared_preload_libraries = 'test_slru'");
+
+# Set the cluster's next multitransaction to 0xFFFFFFF0.
+my $node_pgdata = $node->data_dir;
+command_ok(
+ [
+ 'pg_resetwal',
+ '--multixact-ids' => '0xFFFFFFF0,0xFFFFFFF0',
+ $node_pgdata
+ ],
+ "set the cluster's next multitransaction to 0xFFFFFFF0");
+command_ok(
+ [
+ 'dd', 'if=/dev/zero', "of=$node_pgdata/pg_multixact/offsets/FFFF",
+ 'bs=4', 'count=65536'
+ ],
+ "init SLRU file");
+
+command_ok([ 'rm', "$node_pgdata/pg_multixact/offsets/0000", ],
+ "drop old SLRU file");
+
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
+
+# Consume multixids to wrap around
+foreach my $i (1 .. 32)
+{
+ $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+}
+
+$node->stop;
+
+done_testing();
--
2.47.3
On 29 Nov 2025, at 00:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I think the second comment became outdated in commit bc7d37a525c0, which introduced the safety check in (what became later) SimpleLruTruncate(). After that, it's been important that latest_page_number is set correctly, although for the sanity check I guess you could be a little sloppy with it.
Cool, so we can safely backpatch to 7.2! (I was 15 when this version rocked)
The page initialization dance is only needed in back branches. And we will inevitably have conflicts with the SLRU refactoring in 18 and banking in 17. Conceptually v12 looks good to me, I can prepare backport versions.

Thanks!
Here's a new patch version. I went through the test changes now:
I didn't understand why the 'kill9' and 'poll_start' stuff is needed. We have plenty of tests that kill the server with regular "$node->stop('immediate')", and restart the server normally. The checkpoint in the middle of the tests seems unnecessary too. I removed all that, and the test still seems to work. Was there a particular reason for them?
With the current shutdown sequence the test seems to reproduce the corruption without checkpointing. I recollect that in July the standby deadlock was reachable without a checkpoint, but the corruption was not. But now the test seems to work.
I moved the wraparound test to a separate test file and commit. More test coverage is good, but it's quite separate from the bugfix and the wraparound related test shares very little with the other test. The wraparound test needs a little more cleanup: use plain perl instead of 'dd' and 'rm' for the file operations, for example. (I did that with the tests in the 64-bit mxoff patches, so we could copy from there.)
PFA a test version without dd and rm. Did I get you right that we do not backport the wraparound test, but do backport the fixes for the 001_multixact.pl test down to 17, where it appeared?
The first two patches are v13 intact; the second pair is my suggestions.
Best regards, Andrey Borodin.
Attachments:
v14-0001-Set-next-multixid-s-offset-when-creating-a-new-m.patch
From 057e356fd8b38b11afafef0a6bb3ab91cbcfac6c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 28 Nov 2025 18:37:21 +0200
Subject: [PATCH v14 1/4] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen, where one backend waits
for the next multixid assignment record, but WAL replay is not advancing
because of a recovery conflict with the backend
We still need to be able to read WAL that was generated before this
fix. For that, the backpatched version of this commit includes a hack
to initialize the next offsets page when replaying
XLOG_MULTIXACT_CREATE_ID for the last multixid on a page.
The new TAP test reproduces IPC/MultixactCreation hangs and verifies
that previously recorded multis stay readable across crash recovery.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
---
src/backend/access/transam/multixact.c | 200 ++++++++++++------
src/test/modules/test_slru/t/001_multixact.pl | 120 ++++-------
src/test/modules/test_slru/test_multixact.c | 5 +-
3 files changed, 179 insertions(+), 146 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..4212a0366fb 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -79,7 +79,6 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -271,12 +270,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -381,6 +374,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -912,13 +908,54 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLock *lock;
LWLock *prevlock = NULL;
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+ {
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+ SimpleLruZeroAndWritePage(MultiXactOffsetCtl, next_pageno);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -933,22 +970,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+ /* Swap the lock for a lock on the next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
+
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1138,8 +1203,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1304,7 +1372,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1381,23 +1448,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We handle this by sleeping on the condition variable
- * we have just for this; the process in charge will signal the CV as soon
- * as it has finished writing the multixact offset.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1409,7 +1467,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1473,16 +1530,10 @@ retry:
if (nextMXOffset == 0)
{
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %d has invalid next offset",
+ multi)));
}
length = nextMXOffset - offset;
@@ -1491,12 +1542,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1539,7 +1584,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -1986,7 +2031,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2132,26 +2176,36 @@ TrimMultiXact(void)
pageno);
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
LWLock *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
LWLockRelease(lock);
@@ -3339,7 +3393,16 @@ multixact_redo(XLogReaderState *record)
int64 pageno;
memcpy(&pageno, XLogRecGetData(record), sizeof(pageno));
- SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
+ else
+ elog(DEBUG1, "skipping initialization of page %ld", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3355,6 +3418,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..5d8a2183fe5 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,10 +1,6 @@
# Copyright (c) 2024-2025, PostgreSQL Global Development Group
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
+# Test multixid corner cases.
use strict;
use warnings FATAL => 'all';
@@ -29,95 +25,57 @@ $node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# This test creates three multixacts. The middle one is never
+# WAL-logged or recorded on the offsets page, because we pause the
+# backend and crash the server before that. After restart, verify that
+# the other multixacts are readable, despite the middle one being
+# lost.
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# Create the first multixact
+my $bg_psql = $node->background_psql('postgres');
+my $multi1 = $bg_psql->query_safe(q(SELECT test_create_multixact();));
+
+# Assign the middle multixact. Use an injection point to prevent it
+# from being fully recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/assigning lost multi/, q(
+\echo assigning lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+# Create the third multixid
+my $multi2 = $node->safe_psql('postgres', q{SELECT test_create_multixact();});
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# All set and done, it's time for hard restart
+$node->stop('immediate');
+$node->start;
+$bg_psql->{run}->finish;
+
+# Verify that the recorded multixids are readable
+my $timed_out = 0;
+$node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi1'::xid);},
+ timeout => 2, #$PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ok($timed_out == 0, 'first recorded multi is readable');
+
+$timed_out = 0;
+$node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi2'::xid);},
+ timeout => 2, #$PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ok($timed_out == 0, 'second recorded multi is readable');
$node->stop;
-# If we reached this point - everything is OK.
-ok(1);
done_testing();
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..8fb6c19d70f 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -17,7 +17,6 @@
#include "access/multixact.h"
#include "access/xact.h"
#include "fmgr.h"
-#include "utils/injection_point.h"
PG_FUNCTION_INFO_V1(test_create_multixact);
PG_FUNCTION_INFO_V1(test_read_multixact);
@@ -37,8 +36,7 @@ test_create_multixact(PG_FUNCTION_ARGS)
}
/*
- * Reads given multixact after running an injection point. Discards local cache
- * to make a real read. Tailored for multixact testing.
+ * Reads given multixact. Discards local cache to make a real read.
*/
Datum
test_read_multixact(PG_FUNCTION_ARGS)
@@ -46,7 +44,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
--
2.51.2
v14-0004-Improve-multixact-wraparound-test.patch
From 360e45485475100c12f27eb617e0ff802cb2bf40 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 30 Nov 2025 16:34:41 +0500
Subject: [PATCH v14 4/4] Improve multixact wraparound test
1. Replace dd and rm
2. Sprinkle comments
3. Check that all multis are readable after wraparound
---
.../test_slru/t/002_multixact_wraparound.pl | 54 +++++++++++++++----
1 file changed, 43 insertions(+), 11 deletions(-)
diff --git a/src/test/modules/test_slru/t/002_multixact_wraparound.pl b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
index efaf902e5e3..c05d6280bfb 100644
--- a/src/test/modules/test_slru/t/002_multixact_wraparound.pl
+++ b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
@@ -10,7 +10,7 @@ use PostgreSQL::Test::Utils;
use Test::More;
-my ($node, $result);
+my $node;
$node = PostgreSQL::Test::Cluster->new('mike');
$node->init;
@@ -26,23 +26,55 @@ command_ok(
$node_pgdata
],
"set the cluster's next multitransaction to 0xFFFFFFF0");
-command_ok(
- [
- 'dd', 'if=/dev/zero', "of=$node_pgdata/pg_multixact/offsets/FFFF",
- 'bs=4', 'count=65536'
- ],
- "init SLRU file");
-command_ok([ 'rm', "$node_pgdata/pg_multixact/offsets/0000", ],
- "drop old SLRU file");
+# Initialize SLRU file with zeros (65536 entries * 4 bytes = 262144 bytes)
+my $slru_file = "$node_pgdata/pg_multixact/offsets/FFFF";
+open my $fh, ">", $slru_file
+ or die "could not open \"$slru_file\": $!";
+binmode $fh;
+# Write 65536 entries of 4 bytes each (all zeros)
+syswrite($fh, "\0" x 262144) == 262144
+ or die "could not write to \"$slru_file\": $!";
+close $fh;
+
+# Remove old SLRU file if it exists
+if (-f "$node_pgdata/pg_multixact/offsets/0000")
+{
+ unlink("$node_pgdata/pg_multixact/offsets/0000")
+ or die "could not unlink \"$node_pgdata/pg_multixact/offsets/0000\": $!";
+}
$node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Consume multixids to wrap around
+# Consume multixids to wrap around. We start at 0xFFFFFFF0, so after
+# creating 32 multixacts we should have wrapped around past FirstMultiXactId.
+# Capture all multixact IDs to verify they're all readable after wraparound.
+my @multixact_ids;
foreach my $i (1 .. 32)
{
- $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+ my $multi = $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+ push @multixact_ids, $multi;
+}
+
+# Verify that wraparound occurred (last_multi should be less than first_multi
+# or very close to FirstMultiXactId)
+my $first_multi = $multixact_ids[0];
+my $last_multi = $multixact_ids[-1];
+ok($last_multi < $first_multi || $last_multi < 0x10000,
+ "multixact wraparound occurred (first: $first_multi, last: $last_multi)");
+
+# Verify that all multixacts created during wraparound are still readable
+foreach my $i (0 .. $#multixact_ids)
+{
+ my $multi = $multixact_ids[$i];
+ my $timed_out = 0;
+ $node->safe_psql(
+ 'postgres',
+ qq{SELECT test_read_multixact('$multi'::xid);},
+ timeout => $PostgreSQL::Test::Utils::timeout_default,
+ timed_out => \$timed_out);
+ ok($timed_out == 0, "multixact $i (ID: $multi) is readable after wraparound");
}
$node->stop;
--
2.51.2
v14-0003-Fix-debug-warning.patch
From cd8e1c7bf6e4da3da2154b9a106b1d6659e3a4bd Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 30 Nov 2025 16:22:28 +0500
Subject: [PATCH v14 3/4] Fix debug warning
---
src/backend/access/transam/multixact.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 4212a0366fb..aa06ef0b164 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3401,7 +3401,7 @@ multixact_redo(XLogReaderState *record)
if (pre_initialized_offsets_page != pageno)
SimpleLruZeroAndWritePage(MultiXactOffsetCtl, pageno);
else
- elog(DEBUG1, "skipping initialization of page %ld", pageno);
+ elog(DEBUG1, "skipping initialization of page " INT64_FORMAT, pageno);
pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
--
2.51.2
v14-0002-Add-test-for-multixid-wraparound.patch
From 7873003b75394015d694a53f4bd48c94aa95e96a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 28 Nov 2025 21:05:18 +0200
Subject: [PATCH v14 2/4] Add test for multixid wraparound
Author: Andrey Borodin <amborodin@acm.org>
---
src/test/modules/test_slru/meson.build | 3 +-
.../test_slru/t/002_multixact_wraparound.pl | 50 +++++++++++++++++++
2 files changed, 52 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/test_slru/t/002_multixact_wraparound.pl
diff --git a/src/test/modules/test_slru/meson.build b/src/test/modules/test_slru/meson.build
index e58bbdf75ac..691b387d13f 100644
--- a/src/test/modules/test_slru/meson.build
+++ b/src/test/modules/test_slru/meson.build
@@ -38,7 +38,8 @@ tests += {
'enable_injection_points': get_option('injection_points') ? 'yes' : 'no',
},
'tests': [
- 't/001_multixact.pl'
+ 't/001_multixact.pl',
+ 't/002_multixact_wraparound.pl'
],
},
}
diff --git a/src/test/modules/test_slru/t/002_multixact_wraparound.pl b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
new file mode 100644
index 00000000000..efaf902e5e3
--- /dev/null
+++ b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
@@ -0,0 +1,50 @@
+# Copyright (c) 2024-2025, PostgreSQL Global Development Group
+
+# Test multixact wraparound
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf',
+ "shared_preload_libraries = 'test_slru'");
+
+# Set the cluster's next multitransaction to 0xFFFFFFF0.
+my $node_pgdata = $node->data_dir;
+command_ok(
+ [
+ 'pg_resetwal',
+ '--multixact-ids' => '0xFFFFFFF0,0xFFFFFFF0',
+ $node_pgdata
+ ],
+ "set the cluster's next multitransaction to 0xFFFFFFF0");
+command_ok(
+ [
+ 'dd', 'if=/dev/zero', "of=$node_pgdata/pg_multixact/offsets/FFFF",
+ 'bs=4', 'count=65536'
+ ],
+ "init SLRU file");
+
+command_ok([ 'rm', "$node_pgdata/pg_multixact/offsets/0000", ],
+ "drop old SLRU file");
+
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
+
+# Consume multixids to wrap around
+foreach my $i (1 .. 32)
+{
+ $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+}
+
+$node->stop;
+
+done_testing();
--
2.51.2
On 30/11/2025 14:15, Andrey Borodin wrote:
On 29 Nov 2025, at 00:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I didn't understand why the 'kill9' and 'poll_start' stuff is
needed. We have plenty of tests that kill the server with regular
"$node->stop('immediate')", and restart the server normally. The
checkpoint in the middle of the tests seems unnecessary too. I
removed all that, and the test still seems to work. Was there a
particular reason for them?
With the current shutdown sequence the test seems to reproduce the corruption
without checkpointing. I recollect that in July the standby deadlock was
reachable without a checkpoint, but the corruption was not. But now the test
seems to work.
Ok.
I moved the wraparound test to a separate test file and commit.
More test coverage is good, but it's quite separate from the
bugfix and the wraparound related test shares very little with the
other test. The wraparound test needs a little more cleanup: use
plain perl instead of 'dd' and 'rm' for the file operations, for
example. (I did that with the tests in the 64-bit mxoff patches,
so we could copy from there.)
PFA test version without dd and rm.
Thanks! I will focus on the main patch and TAP test now, but will commit
the wraparound test separately afterwards. At a quick glance, it looks
good now.
Did I get you right that we do not backport the wraparound test, but do
backport the fixes for the 001_multixact.pl test down to 17, where it
appeared?
Yes, that's my plan. Except that 001_multixact.pl appeared in v18, not v17.
The first two patches are v13 intact; the second pair contains my suggestions.
Thanks, here's a new set of patches, now with backpatched versions for
all the branches. As you said, there were a number of differences
between branches:
- On master, don't include the compatibility hacks for reading WAL
generated with older minor versions. Because WAL is not compatible
across major versions anyway.
- REL_18_STABLE didn't have the SimpleLruZeroAndWritePage() function
(introduced in commit c616785516).
- REL_17_STABLE didn't have the 001_multixact.pl TAP test. So I didn't
backport the new TAP test to v17 and below either.
- REL_16_STABLE used 32-bit SLRU page numbers, didn't have bank locks,
and used a simple sleep-loop instead of the condition variable.
- REL_15_STABLE and REL_14_STABLE: no conflicts from REL_16_STABLE
All of those conflicts were pretty straightforward to handle, but it's
enough code churn for silly mistakes to slip in, especially when the TAP
test didn't apply. So if you have a chance, please help to review and
test each of these backpatched versions too.
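If it helps, here's a minimal sketch of how one of these could be
exercised; the patch file name is the one attached below, while the
configure options and make targets are just my assumption about the
usual in-tree workflow:

git checkout master
git am v15-master-0001-Set-next-multixid-s-offset-when-creating-.patch
./configure --enable-tap-tests --enable-injection-points
make -j4
make -C src/test/modules/test_slru check

For the older branches that don't carry the TAP test, the same git am
step followed by a plain make check is about all that can be scripted.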
In addition to the backpatching, I did some more cosmetic cleanups to
the TAP test.
- Heikki
Attachments:
v15-master-0001-Set-next-multixid-s-offset-when-creating-.patch
From b67d9d4f87395a1eda7d6544c223f9f930026e7c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 1 Dec 2025 14:11:10 +0200
Subject: [PATCH v15-master 1/1] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
---
src/backend/access/transam/multixact.c | 177 +++++++++---------
src/include/access/xlog_internal.h | 2 +-
src/test/modules/test_slru/t/001_multixact.pl | 116 +++---------
src/test/modules/test_slru/test_multixact.c | 5 +-
4 files changed, 124 insertions(+), 176 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..0f01390c153 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -79,7 +79,6 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -271,12 +270,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -912,13 +905,33 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLock *lock;
LWLock *prevlock = NULL;
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -933,22 +946,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+ /* Swap the lock for a lock on the next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
+
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1138,8 +1179,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1300,11 +1344,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
int truelength;
MultiXactId oldestMXact;
MultiXactId nextMXact;
- MultiXactId tmpMXact;
- MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1351,14 +1392,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* error.
*
* Shared lock is enough here since we aren't modifying any global state.
- * Acquire it just long enough to grab the current counter values. We may
- * need both nextMXact and nextOffset; see below.
+ * Acquire it just long enough to grab the current counter values.
*/
LWLockAcquire(MultiXactGenLock, LW_SHARED);
oldestMXact = MultiXactState->oldestMultiXactId;
nextMXact = MultiXactState->nextMXact;
- nextOffset = MultiXactState->nextOffset;
LWLockRelease(MultiXactGenLock);
@@ -1378,38 +1417,17 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* Find out the offset at which we need to start reading MultiXactMembers
* and the number of members in the multixact. We determine the latter as
* the difference between this multixact's starting offset and the next
- * one's. However, there are some corner cases to worry about:
- *
- * 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * one's. However, there is one corner case to worry about:
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We handle this by sleeping on the condition variable
- * we have just for this; the process in charge will signal the CV as soon
- * as it has finished writing the multixact offset.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
* member slot wasn't filled, it'll contain zero, and zero isn't a valid
* transaction ID so it can't be a multixact member. Therefore, if we
* read a zero from the members array, just ignore it.
- *
- * This is all pretty messy, but the mess occurs only in infrequent corner
- * cases, so it seems better than holding the MultiXactGenLock for a long
- * time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1417,6 +1435,7 @@ retry:
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
+ /* read this multi's offset */
slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
@@ -1424,22 +1443,13 @@ retry:
Assert(offset != 0);
- /*
- * Use the same increment rule as GetNewMultiXactId(), that is, don't
- * handle wraparound explicitly until needed.
- */
- tmpMXact = multi + 1;
-
- if (nextMXact == tmpMXact)
- {
- /* Corner case 1: there is no next multixact */
- length = nextOffset - offset;
- }
- else
+ /* read next multi's offset */
{
+ MultiXactId tmpMXact;
MultiXactOffset nextMXOffset;
/* handle wraparound if needed */
+ tmpMXact = multi + 1;
if (tmpMXact < FirstMultiXactId)
tmpMXact = FirstMultiXactId;
@@ -1472,18 +1482,10 @@ retry:
nextMXOffset = *offptr;
if (nextMXOffset == 0)
- {
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %u has invalid next offset",
+ multi)));
length = nextMXOffset - offset;
}
@@ -1491,12 +1493,7 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
+ /* read the members */
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1539,7 +1536,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -1986,7 +1983,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2132,26 +2128,35 @@ TrimMultiXact(void)
pageno);
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but let's be sure the next page exists, if the nextMulti was
+ * reset with pg_resetwal, for example.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
LWLock *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
LWLockRelease(lock);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 34deb2fe5f0..3be5ecf336f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD119 /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD11A /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..7837eb810f0 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,10 +1,6 @@
# Copyright (c) 2024-2025, PostgreSQL Global Development Group
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
+# Test multixid corner cases.
use strict;
use warnings FATAL => 'all';
@@ -19,9 +15,7 @@ if ($ENV{enable_injection_points} ne 'yes')
plan skip_all => 'Injection points not supported by this build';
}
-my ($node, $result);
-
-$node = PostgreSQL::Test::Cluster->new('mike');
+my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf('postgresql.conf',
"shared_preload_libraries = 'test_slru,injection_points'");
@@ -29,95 +23,47 @@ $node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# This test creates three multixacts. The middle one is never
+# WAL-logged or recorded on the offsets page, because we pause the
+# backend and crash the server before that. After restart, verify that
+# the other multixacts are readable, despite the middle one being
+# lost.
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# Create the first multixact
+my $bg_psql = $node->background_psql('postgres');
+my $multi1 = $bg_psql->query_safe(q(SELECT test_create_multixact();));
+
+# Assign the middle multixact. Use an injection point to prevent it
+# from being fully recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/assigning lost multi/, q(
+\echo assigning lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+# Create the third multixid
+my $multi2 = $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->stop('immediate');
+$node->start;
+$bg_psql->{run}->finish;
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# Verify that the recorded multixids are readable
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi1');}),
+ '',
+ 'first recorded multi is readable');
-$node->stop;
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi2');}),
+ '',
+ 'second recorded multi is readable');
-# If we reached this point - everything is OK.
-ok(1);
done_testing();
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..8fb6c19d70f 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -17,7 +17,6 @@
#include "access/multixact.h"
#include "access/xact.h"
#include "fmgr.h"
-#include "utils/injection_point.h"
PG_FUNCTION_INFO_V1(test_create_multixact);
PG_FUNCTION_INFO_V1(test_read_multixact);
@@ -37,8 +36,7 @@ test_create_multixact(PG_FUNCTION_ARGS)
}
/*
- * Reads given multixact after running an injection point. Discards local cache
- * to make a real read. Tailored for multixact testing.
+ * Reads given multixact. Discards local cache to make a real read.
*/
Datum
test_read_multixact(PG_FUNCTION_ARGS)
@@ -46,7 +44,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
--
2.47.3
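Not part of the patch set, just a convenience while re-running the
standby workload against a patched build: a one-liner (my own sketch)
to watch the replica for the wait event this thread is about, using the
name as it appears in pg_stat_activity:

psql -c "SELECT pid, state, wait_event_type, wait_event
           FROM pg_stat_activity
          WHERE wait_event = 'MultixactCreation';"

With the condition variable wait gone, this should stay empty even
while the pgbench scripts are running.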
v15-pg14-0001-Set-next-multixid-s-offset-when-creating-a-.patch
From a4ed9e8a360de05cddc167f16dd7652242eaef5a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 1 Dec 2025 14:14:15 +0200
Subject: [PATCH v15-pg14 1/1] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
---
src/backend/access/transam/multixact.c | 186 +++++++++++++++++++------
1 file changed, 145 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d56269d941c..7013e1b188c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -337,6 +337,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -868,13 +871,61 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ MultiXactOffsetCtl->shared->latest_page_number == pageno)
+ {
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+ /* Create and zero the page */
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -886,9 +937,37 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
+
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
@@ -897,7 +976,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1072,8 +1151,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1314,21 +1396,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We sleep for a bit and try again.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1340,7 +1415,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
@@ -1385,13 +1459,10 @@ retry:
nextMXOffset = *offptr;
if (nextMXOffset == 0)
- {
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
- CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
- goto retry;
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %u has invalid next offset",
+ multi)));
length = nextMXOffset - offset;
}
@@ -1427,7 +1498,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -2056,24 +2127,34 @@ TrimMultiXact(void)
MultiXactOffsetCtl->shared->latest_page_number = pageno;
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
}
@@ -3255,13 +3336,21 @@ multixact_redo(XLogReaderState *record)
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
- slotno = ZeroMultiXactOffsetPage(pageno, false);
- SimpleLruWritePage(MultiXactOffsetCtl, slotno);
- Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
-
- LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ {
+ LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ slotno = ZeroMultiXactOffsetPage(pageno, false);
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+ LWLockRelease(MultiXactOffsetSLRULock);
+ }
+ else
+ elog(DEBUG1, "skipping initialization of offsets page %d because it was already initialized on multixid creation", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3285,6 +3374,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
--
2.47.3
v15-pg15-0001-Set-next-multixid-s-offset-when-creating-a-.patch
From 2255bed329be004b14f27af5084e7732ab3dd09b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 1 Dec 2025 14:14:05 +0200
Subject: [PATCH v15-pg15 1/1] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
---
src/backend/access/transam/multixact.c | 186 +++++++++++++++++++------
1 file changed, 145 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 136065125ea..5fc53420655 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -337,6 +337,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -868,13 +871,61 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ MultiXactOffsetCtl->shared->latest_page_number == pageno)
+ {
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+ /* Create and zero the page */
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -886,9 +937,37 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
+
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
@@ -897,7 +976,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1072,8 +1151,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1314,21 +1396,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We sleep for a bit and try again.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1340,7 +1415,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
@@ -1385,13 +1459,10 @@ retry:
nextMXOffset = *offptr;
if (nextMXOffset == 0)
- {
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
- CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
- goto retry;
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %u has invalid next offset",
+ multi)));
length = nextMXOffset - offset;
}
@@ -1427,7 +1498,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -2056,24 +2127,34 @@ TrimMultiXact(void)
MultiXactOffsetCtl->shared->latest_page_number = pageno;
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
}
@@ -3255,13 +3336,21 @@ multixact_redo(XLogReaderState *record)
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
- slotno = ZeroMultiXactOffsetPage(pageno, false);
- SimpleLruWritePage(MultiXactOffsetCtl, slotno);
- Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
-
- LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ {
+ LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ slotno = ZeroMultiXactOffsetPage(pageno, false);
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+ LWLockRelease(MultiXactOffsetSLRULock);
+ }
+ else
+ elog(DEBUG1, "skipping initialization of offsets page %d because it was already initialized on multixid creation", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3285,6 +3374,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
--
2.47.3
v15-pg16-0001-Set-next-multixid-s-offset-when-creating-a-.patch
From b88bc9227cb34607e5d66140cffb3a0f7a214379 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 1 Dec 2025 14:13:54 +0200
Subject: [PATCH v15-pg16 1/1] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
---
src/backend/access/transam/multixact.c | 186 +++++++++++++++++++------
1 file changed, 145 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3a2d7055c42..43fec0afcd0 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -338,6 +338,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -869,13 +872,61 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ MultiXactOffsetCtl->shared->latest_page_number == pageno)
+ {
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+ /* Create and zero the page */
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
+
/*
* Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
* to complain about if there's any I/O error. This is kinda bogus, but
@@ -887,9 +938,37 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
+
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Exchange our lock */
LWLockRelease(MultiXactOffsetSLRULock);
@@ -898,7 +977,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1073,8 +1152,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1315,21 +1397,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We sleep for a bit and try again.
- *
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1341,7 +1416,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
pageno = MultiXactIdToOffsetPage(multi);
@@ -1386,13 +1460,10 @@ retry:
nextMXOffset = *offptr;
if (nextMXOffset == 0)
- {
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(MultiXactOffsetSLRULock);
- CHECK_FOR_INTERRUPTS();
- pg_usleep(1000L);
- goto retry;
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %u has invalid next offset",
+ multi)));
length = nextMXOffset - offset;
}
@@ -1428,7 +1499,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -2055,24 +2126,34 @@ TrimMultiXact(void)
MultiXactOffsetCtl->shared->latest_page_number = pageno;
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
}
@@ -3251,13 +3332,21 @@ multixact_redo(XLogReaderState *record)
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
- LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
- slotno = ZeroMultiXactOffsetPage(pageno, false);
- SimpleLruWritePage(MultiXactOffsetCtl, slotno);
- Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
-
- LWLockRelease(MultiXactOffsetSLRULock);
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ {
+ LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+ slotno = ZeroMultiXactOffsetPage(pageno, false);
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+ LWLockRelease(MultiXactOffsetSLRULock);
+ }
+ else
+ elog(DEBUG1, "skipping initialization of offsets page %d because it was already initialized on multixid creation", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3281,6 +3370,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
--
2.47.3
Attachment: v15-pg17-0001-Set-next-multixid-s-offset-when-creating-a-.patch (text/x-patch)
From b12c8cba2d0b5071bdb25dc3353ff6fd7b55f8c5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 1 Dec 2025 14:13:40 +0200
Subject: [PATCH v15-pg17 1/1] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
---
src/backend/access/transam/multixact.c | 224 ++++++++++++++++++-------
1 file changed, 159 insertions(+), 65 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..2f37ad88190 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -274,12 +274,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -384,6 +378,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -915,13 +912,68 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLock *lock;
LWLock *prevlock = NULL;
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+ {
+ int slotno;
+ LWLock *lock;
+
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /* Create and zero the page */
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+ LWLockRelease(lock);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -936,22 +988,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
+
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ /* Swap the lock for a lock on the next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
+
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1141,8 +1221,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1307,7 +1390,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1384,23 +1466,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
- *
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We handle this by sleeping on the condition variable
- * we have just for this; the process in charge will signal the CV as soon
- * as it has finished writing the multixact offset.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1412,7 +1485,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1475,16 +1547,10 @@ retry:
nextMXOffset = *offptr;
if (nextMXOffset == 0)
- {
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %u has invalid next offset",
+ multi)));
length = nextMXOffset - offset;
}
@@ -1492,12 +1558,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1540,7 +1600,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -1987,7 +2047,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2194,26 +2253,36 @@ TrimMultiXact(void)
pageno);
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
LWLock *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
LWLockRelease(lock);
@@ -3398,14 +3467,24 @@ multixact_redo(XLogReaderState *record)
memcpy(&pageno, XLogRecGetData(record), sizeof(pageno));
- lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(lock, LW_EXCLUSIVE);
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ {
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = ZeroMultiXactOffsetPage(pageno, false);
- SimpleLruWritePage(MultiXactOffsetCtl, slotno);
- Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+ slotno = ZeroMultiXactOffsetPage(pageno, false);
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(lock);
+ LWLockRelease(lock);
+ }
+ else
+ elog(DEBUG1, "skipping initialization of offsets page " INT64_FORMAT " because it was already initialized on multixid creation", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3431,6 +3510,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
--
2.47.3
Attachment: v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patch (text/x-patch)
From 17efc1fef539672e064c30ea2c809b4f06942354 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 1 Dec 2025 14:13:09 +0200
Subject: [PATCH v15-pg18 1/1] Set next multixid's offset when creating a new
multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.
The waiting mechanism was broken in a few scenarios:
- When nextMulti was advanced without WAL-logging the next
multixid. For example, if a later multixid was already assigned and
WAL-logged before the previous one was WAL-logged, and then the
server crashed. In that case the next offset would never be set in
the offsets SLRU, and a query trying to read it would get stuck
waiting for it. Same thing could happen if pg_resetwal was used to
forcibly advance nextMulti.
- In hot standby mode, a deadlock could happen where one backend waits
for the next multixid assignment record, but WAL replay is not
advancing because of a recovery conflict with the waiting backend.
The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.
Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-by: Dmitry Yurichev <dsy.075@yandex.ru>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Ivan Bykov <i.bykov@modernsys.ru>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/172e5723-d65f-4eec-b512-14beacb326ce@yandex.ru
Backpatch-through: 14
---
src/backend/access/transam/multixact.c | 227 ++++++++++++------
src/test/modules/test_slru/t/001_multixact.pl | 116 +++------
src/test/modules/test_slru/test_multixact.c | 5 +-
3 files changed, 191 insertions(+), 157 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index f94445bdd07..e69ce955bec 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -84,7 +84,6 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
@@ -276,12 +275,6 @@ typedef struct MultiXactStateData
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
- /*
- * This is used to sleep until a multixact offset is written when we want
- * to create the next one.
- */
- ConditionVariable nextoff_cv;
-
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
@@ -386,6 +379,9 @@ static MemoryContext MXactContext = NULL;
#define debug_elog6(a,b,c,d,e,f)
#endif
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -922,13 +918,68 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int entryno;
int slotno;
MultiXactOffset *offptr;
- int i;
+ MultiXactId next;
+ int64 next_pageno;
+ int next_entryno;
+ MultiXactOffset *next_offptr;
LWLock *lock;
LWLock *prevlock = NULL;
+ /* position of this multixid in the offsets SLRU area */
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
+ /* position of the next multixid */
+ next = multi + 1;
+ if (next < FirstMultiXactId)
+ next = FirstMultiXactId;
+ next_pageno = MultiXactIdToOffsetPage(next);
+ next_entryno = MultiXactIdToOffsetEntry(next);
+
+ /*
+ * Older minor versions didn't set the next multixid's offset in this
+ * function, and therefore didn't initialize the next page until the next
+ * multixid was assigned. If we're replaying WAL that was generated by
+ * such a version, the next page might not be initialized yet. Initialize
+ * it now.
+ */
+ if (InRecovery &&
+ next_pageno != pageno &&
+ pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+ {
+ int slotno;
+ LWLock *lock;
+
+ elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ /* Create and zero the page */
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+ /* Make sure it's written out */
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+ LWLockRelease(lock);
+
+ /*
+ * Remember that we initialized the page, so that we don't zero it
+ * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+ */
+ pre_initialized_offsets_page = next_pageno;
+ }
+
+ /*
+ * Set the starting offset of this multixid's members.
+ *
+ * In the common case, it was already set by the previous
+ * RecordNewMultiXact call, as this was the next multixid of the previous
+ * multixid. But if multiple backends are generating multixids
+ * concurrently, we might race ahead and get called before the previous
+ * multixid.
+ */
lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
@@ -943,22 +994,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- *offptr = offset;
+ if (*offptr != offset)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*offptr == 0);
+ *offptr = offset;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
+
+ /*
+ * Set the next multixid's offset to the end of this multixid's members.
+ */
+ if (next_pageno == pageno)
+ {
+ next_offptr = offptr + 1;
+ }
+ else
+ {
+ /* must be the first entry on the page */
+ Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+ /* Swap the lock for a lock on the next page */
+ LWLockRelease(lock);
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
+
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+ next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+ next_offptr += next_entryno;
+ }
- MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ if (*next_offptr != offset + nmembers)
+ {
+ /* should already be set to the correct value, or not at all */
+ Assert(*next_offptr == 0);
+ *next_offptr = offset + nmembers;
+ MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+ }
/* Release MultiXactOffset SLRU lock. */
LWLockRelease(lock);
- /*
- * If anybody was waiting to know the offset of this multixact ID we just
- * wrote, they can read it now, so wake them up.
- */
- ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
prev_pageno = -1;
- for (i = 0; i < nmembers; i++, offset++)
+ for (int i = 0; i < nmembers; i++, offset++)
{
TransactionId *memberptr;
uint32 *flagsptr;
@@ -1148,8 +1227,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
result = FirstMultiXactId;
}
- /* Make sure there is room for the MXID in the file. */
- ExtendMultiXactOffset(result);
+ /*
+ * Make sure there is room for the next MXID in the file. Assigning this
+ * MXID sets the next MXID's offset already.
+ */
+ ExtendMultiXactOffset(result + 1);
/*
* Reserve the members space, similarly to above. Also, be careful not to
@@ -1314,7 +1396,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
MultiXactOffset nextOffset;
MultiXactMember *ptr;
LWLock *lock;
- bool slept = false;
debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
@@ -1391,23 +1472,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* one's. However, there are some corner cases to worry about:
*
* 1. This multixact may be the latest one created, in which case there is
- * no next one to look at. In this case the nextOffset value we just
- * saved is the correct endpoint.
- *
- * 2. The next multixact may still be in process of being filled in: that
- * is, another process may have done GetNewMultiXactId but not yet written
- * the offset entry for that ID. In that scenario, it is guaranteed that
- * the offset entry for that multixact exists (because GetNewMultiXactId
- * won't release MultiXactGenLock until it does) but contains zero
- * (because we are careful to pre-zero offset pages). Because
- * GetNewMultiXactId will never return zero as the starting offset for a
- * multixact, when we read zero as the next multixact's offset, we know we
- * have this case. We handle this by sleeping on the condition variable
- * we have just for this; the process in charge will signal the CV as soon
- * as it has finished writing the multixact offset.
+ * no next one to look at. The next multixact's offset should be set
+ * already, as we set it in RecordNewMultiXact(), but we used to not do
+ * that in older minor versions. To cope with that case, if this
+ * multixact is the latest one created, use the nextOffset value we read
+ * above as the endpoint.
*
- * 3. Because GetNewMultiXactId increments offset zero to offset one to
- * handle case #2, there is an ambiguity near the point of offset
+ * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+ * to mean "unset", there is an ambiguity near the point of offset
* wraparound. If we see next multixact's offset is one, is that our
* multixact's actual endpoint, or did it end at zero with a subsequent
* increment? We handle this using the knowledge that if the zero'th
@@ -1419,7 +1491,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
* cases, so it seems better than holding the MultiXactGenLock for a long
* time on every multixact creation.
*/
-retry:
pageno = MultiXactIdToOffsetPage(multi);
entryno = MultiXactIdToOffsetEntry(multi);
@@ -1482,18 +1553,10 @@ retry:
nextMXOffset = *offptr;
if (nextMXOffset == 0)
- {
- /* Corner case 2: next multixact is still being filled in */
- LWLockRelease(lock);
- CHECK_FOR_INTERRUPTS();
-
- INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
- ConditionVariableSleep(&MultiXactState->nextoff_cv,
- WAIT_EVENT_MULTIXACT_CREATION);
- slept = true;
- goto retry;
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("MultiXact %u has invalid next offset",
+ multi)));
length = nextMXOffset - offset;
}
@@ -1501,12 +1564,6 @@ retry:
LWLockRelease(lock);
lock = NULL;
- /*
- * If we slept above, clean up state; it's no longer needed.
- */
- if (slept)
- ConditionVariableCancelSleep();
-
ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
truelength = 0;
@@ -1549,7 +1606,7 @@ retry:
if (!TransactionIdIsValid(*xactptr))
{
- /* Corner case 3: we must be looking at unused slot zero */
+ /* Corner case 2: we must be looking at unused slot zero */
Assert(offset == 0);
continue;
}
@@ -1996,7 +2053,6 @@ MultiXactShmemInit(void)
/* Make sure we zero out the per-backend state */
MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
- ConditionVariableInit(&MultiXactState->nextoff_cv);
}
else
Assert(found);
@@ -2203,26 +2259,36 @@ TrimMultiXact(void)
pageno);
/*
+ * Set the offset of the last multixact on the offsets page.
+ *
+ * This is normally done in RecordNewMultiXact() of the previous
+ * multixact, but we used to not do that in older minor versions. To
+ * ensure that the next offset is set if the binary was just upgraded from
+ * an older minor version, do it now.
+ *
* Zero out the remainder of the current offsets page. See notes in
* TrimCLOG() for background. Unlike CLOG, some WAL record covers every
* pg_multixact SLRU mutation. Since, also unlike CLOG, we ignore the WAL
* rule "write xlog before data," nextMXact successors may carry obsolete,
- * nonzero offset values. Zero those so case 2 of GetMultiXactIdMembers()
- * operates normally.
+ * nonzero offset values.
*/
entryno = MultiXactIdToOffsetEntry(nextMXact);
- if (entryno != 0)
{
int slotno;
MultiXactOffset *offptr;
LWLock *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+ if (entryno == 0)
+ slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+ else
+ slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
offptr += entryno;
- MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+ *offptr = offset;
+ if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+ MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
LWLockRelease(lock);
@@ -3407,14 +3473,24 @@ multixact_redo(XLogReaderState *record)
memcpy(&pageno, XLogRecGetData(record), sizeof(pageno));
- lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
- LWLockAcquire(lock, LW_EXCLUSIVE);
+ /*
+ * Skip the record if we already initialized the page at the previous
+ * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+ */
+ if (pre_initialized_offsets_page != pageno)
+ {
+ lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+ LWLockAcquire(lock, LW_EXCLUSIVE);
- slotno = ZeroMultiXactOffsetPage(pageno, false);
- SimpleLruWritePage(MultiXactOffsetCtl, slotno);
- Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+ slotno = ZeroMultiXactOffsetPage(pageno, false);
+ SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+ Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
- LWLockRelease(lock);
+ LWLockRelease(lock);
+ }
+ else
+ elog(DEBUG1, "skipping initialization of offsets page " INT64_FORMAT " because it was already initialized on multixid creation", pageno);
+ pre_initialized_offsets_page = -1;
}
else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
{
@@ -3440,6 +3516,21 @@ multixact_redo(XLogReaderState *record)
TransactionId max_xid;
int i;
+ if (pre_initialized_offsets_page != -1)
+ {
+ /*
+ * If we implicitly initialized the next offsets page while
+ * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+ * with an older minor version, we still expect to see a
+ * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+ * XLOG_MULTIXACT_CREATE_ID records. Therefore this case should
+ * not happen. If it does, we'll continue with the replay, but
+ * log a message to note that something's funny.
+ */
+ elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+ pre_initialized_offsets_page = -1;
+ }
+
/* Store the data back into the SLRU files */
RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
xlrec->members);
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..7837eb810f0 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,10 +1,6 @@
# Copyright (c) 2024-2025, PostgreSQL Global Development Group
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
+# Test multixid corner cases.
use strict;
use warnings FATAL => 'all';
@@ -19,9 +15,7 @@ if ($ENV{enable_injection_points} ne 'yes')
plan skip_all => 'Injection points not supported by this build';
}
-my ($node, $result);
-
-$node = PostgreSQL::Test::Cluster->new('mike');
+my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
$node->append_conf('postgresql.conf',
"shared_preload_libraries = 'test_slru,injection_points'");
@@ -29,95 +23,47 @@ $node->start;
$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
- q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
- q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# This test creates three multixacts. The middle one is never
+# WAL-logged or recorded on the offsets page, because we pause the
+# backend and crash the server before that. After restart, verify that
+# the other multixacts are readable, despite the middle one being
+# lost.
-# This session must observe sleep on the condition variable while generating a
-# multixact. To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
- qr/start/,
- q{
- \echo start
- SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# Create the first multixact
+my $bg_psql = $node->background_psql('postgres');
+my $multi1 = $bg_psql->query_safe(q(SELECT test_create_multixact();));
+
+# Assign the middle multixact. Use an injection point to prevent it
+# from being fully recorded.
$node->safe_psql('postgres',
q{SELECT injection_points_attach('multixact-create-from-members','wait');}
);
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
- qr/start/, q{
- \echo start
+$bg_psql->query_until(
+ qr/assigning lost multi/, q(
+\echo assigning lost multi
SELECT test_create_multixact();
-});
+));
$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, test-multixact-read',
- "matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact. A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
- 'postgres',
- q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
- FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
- ),
- 'multixact-create-from-members, multixact-get-members-cv-sleep',
- "matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep. Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('test-multixact-read')});
$node->safe_psql('postgres',
q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
- q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+# Create the third multixid
+my $multi2 = $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->stop('immediate');
+$node->start;
+$bg_psql->{run}->finish;
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# Verify that the recorded multixids are readable
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi1');}),
+ '',
+ 'first recorded multi is readable');
-$node->stop;
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi2');}),
+ '',
+ 'second recorded multi is readable');
-# If we reached this point - everything is OK.
-ok(1);
done_testing();
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..8fb6c19d70f 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -17,7 +17,6 @@
#include "access/multixact.h"
#include "access/xact.h"
#include "fmgr.h"
-#include "utils/injection_point.h"
PG_FUNCTION_INFO_V1(test_create_multixact);
PG_FUNCTION_INFO_V1(test_read_multixact);
@@ -37,8 +36,7 @@ test_create_multixact(PG_FUNCTION_ARGS)
}
/*
- * Reads given multixact after running an injection point. Discards local cache
- * to make a real read. Tailored for multixact testing.
+ * Reads given multixact. Discards local cache to make a real read.
*/
Datum
test_read_multixact(PG_FUNCTION_ARGS)
@@ -46,7 +44,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
MultiXactId id = PG_GETARG_TRANSACTIONID(0);
MultiXactMember *members;
- INJECTION_POINT("test-multixact-read", NULL);
/* discard caches */
AtEOXact_MultiXact();
--
2.47.3
On 1 Dec 2025, at 17:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
All of those conflicts were pretty straightforward to handle, but it's
enough code churn for silly mistakes to slip in, especially when the TAP
test didn't apply. So if you have a chance, please help to review and
test each of these backpatched versions too.
I'm looking through patchsets. I'll look in the morning with fresh eyes.
So far I see two CI warning failures for pg18 and pg17 versions:
https://github.com/x4m/postgres_g/runs/56796102941
https://github.com/x4m/postgres_g/runs/56795182559
Relevant logs:
[14:06:58.425] multixact.c: In function ‘RecordNewMultiXact’:
[14:06:58.425] multixact.c:944:41: error: declaration of ‘slotno’ shadows a previous local [-Werror=shadow=compatible-local]
[14:06:58.425] 944 | int slotno;
[14:06:58.425] | ^~~~~~
[14:06:58.425] multixact.c:913:33: note: shadowed declaration is here
[14:06:58.425] 913 | int slotno;
[14:06:58.425] | ^~~~~~
[14:06:58.425] multixact.c:945:29: error: declaration of ‘lock’ shadows a previous local [-Werror=shadow=compatible-local]
[14:06:58.425] 945 | LWLock *lock;
[14:06:58.425] | ^~~~
[14:06:58.425] multixact.c:919:21: note: shadowed declaration is here
[14:06:58.425] 919 | LWLock *lock;
[14:06:58.425] | ^~~~
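For context, here is a minimal standalone sketch (not code from the patch) of the pattern -Wshadow=compatible-local is complaining about: an inner block re-declares a variable that already exists in the enclosing function, and the warning goes away once the inner declarations are dropped and the outer variables are reused.

#include <stdio.h>

static void
shadow_example(void)
{
	int		slotno = 0;		/* outer declaration, like multixact.c:913 in the log above */

	if (slotno == 0)
	{
		int		slotno = 1;	/* shadows the outer slotno; -Wshadow=compatible-local fires here */

		printf("inner slotno = %d\n", slotno);
	}

	printf("outer slotno = %d\n", slotno);	/* still 0 */
}

int
main(void)
{
	shadow_example();
	return 0;
}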
If only we had injection points before 17, I could run all the tests there too, just to be sure...
Best regards, Andrey Borodin.
On 1 Dec 2025, at 20:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I'm looking through patchsets. I'll look in the morning with fresh eyes.
So far I have no findings.
I also tried to stress-test v14. I assumed that if a regression slipped in, it was most probably inherited by 14 from the higher versions.
I used slightly modified scripts from Dmitry who started this thread.
DB initialization:
create table tbl2 (id int primary key,val int);
insert into tbl2 select i, 0 from generate_series(1,100000) i;
Load with multi:
\set id random(1, 10000)
begin;
select * from tbl2 where id = :id for no key update;
savepoint s1;
update tbl2 set val = val+1 where id = :id;
commit;
Consistency test:
select sum(val) from tbl2;
Stress-test script:
while true; do pkill -9 postgres; ./pg_ctl -D testdb restart; (./pgbench --no-vacuum -M prepared -c 50 -j 4 -T 300 -P 1 postgres --file=load.sql &); sleep 5; done
To promote multixact races, I tried this with the patched version: I added a random sleep of up to 50 ms between GetNewMultiXactId() and RecordNewMultiXact().
So far I did not manage to corrupt database.
What else kind of loads worth exercising?
Best regards, Andrey Borodin.
On 02/12/2025 15:44, Andrey Borodin wrote:
On 1 Dec 2025, at 20:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I'm looking through patchsets. I'll look in the morning with fresh eyes.
So far I have no findings.
I also tried to stress-test v14. I assumed that if a regression slipped in, it was most probably inherited by 14 from the higher versions.
Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth testing
separately, to make sure I didn't e.g. screw up the locking differences.
I used slightly modified scripts from Dmitry who started this thread.
DB initialization:
create table tbl2 (id int primary key,val int);
insert into tbl2 select i, 0 from generate_series(1,100000) i;
Load with multi:
\set id random(1, 10000)
begin;
select * from tbl2 where id = :id for no key update;
savepoint s1;
update tbl2 set val = val+1 where id = :id;
commit;
Consistency test:
select sum(val) from tbl2;
Stress-test script:
while true; do pkill -9 postgres; ./pg_ctl -D testdb restart; (./pgbench --no-vacuum -M prepared -c 50 -j 4 -T 300 -P 1 postgres --file=load.sql &); sleep 5; done
To promote multixact races, I tried this with the patched version: I added a random sleep of up to 50 ms between GetNewMultiXactId() and RecordNewMultiXact().
So far I did not manage to corrupt database.
What else kind of loads worth exercising?
Combinations of patched and unpatched binaries, as primary/replica and
crash recovery. I did some testing of crash recovery, but more would be
good.
- Heikki
On 12/2/25 16:48, Heikki Linnakangas wrote:
Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth testing
separately, to make sure I didn't e.g. screw up the locking differences.
I tested on the REL_18_1 tag (after applying v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patch).
I didn't notice any deadlocks or other errors. Thanks!
--
Best regards,
Dmitry Yurichev.
On 03/12/2025 16:19, Dmitry Yurichev wrote:
On 12/2/25 16:48, Heikki Linnakangas wrote:
Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth
testing separately, to make sure I didn't e.g. screw up the locking
differences.
I tested on the REL_18_1 tag (after applying v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patch).
I didn't notice any deadlocks or other errors. Thanks!
Okay. I fixed a few more little things:
- fixed the warnings about shadowed variables that Andrey pointed out
- in older branches before we switched to 64-bit SLRU page numbers, use
'int' rather than 'int64' in page number variables
- improve wording and fix a few trivial typos in comments
Committed with those last-minute changes. Thanks for the testing!
- Heikki
On 01/12/2025 14:40, Heikki Linnakangas wrote:
On 30/11/2025 14:15, Andrey Borodin wrote:
On 29 Nov 2025, at 00:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I moved the wraparound test to a separate test file and commit.
More test coverage is good, but it's quite separate from the
bugfix and the wraparound related test shares very little with the
other test. The wraparound test needs a little more cleanup: use
plain perl instead of 'dd' and 'rm' for the file operations, for
example. (I did that with the tests in the 64-bit mxoff patches,
so we could copy from there.)
PFA test version without dd and rm.
Thanks! I will focus on the main patch and TAP test now, but will commit
the wraparound test separately afterwards. At quick glance, it looks
good now.
And now committed the multixid wraparound test, too. Thanks!
- Heikki
On 03/12/2025 19:19, Heikki Linnakangas wrote:
On 03/12/2025 16:19, Dmitry Yurichev wrote:
On 12/2/25 16:48, Heikki Linnakangas wrote:
Thanks! Agreed, v14-16 were the same. v17 and v18 might be worth
testing separately, to make sure I didn't e.g. screw up the locking
differences.
I tested on the REL_18_1 tag (after applying v15-pg18-0001-Set-next-multixid-s-offset-when-creating-a-.patch).
I didn't notice any deadlocks or other errors. Thanks!
Okay. I fixed a few more little things:
- fixed the warnings about shadowed variables that Andrey pointed out
- in older branches before we switched to 64-bit SLRU page numbers, use
'int' rather than 'int64' in page number variables
- improve wording and fix a few trivial typos in comments
Committed with those last-minute changes. Thanks for the testing!
While working on the 64-bit multixid offsets patch, I noticed one more
bug with this. At offset wraparound, when we set the next multixid's
offset in RecordNewMultiXact, we incorrectly set it to 0 instead of 1.
We're supposed to skip over offset 0, because 0 is reserved to mean
invalid. We do that correctly when setting the "current" multixid's
offset, because the caller of RecordNewMultiXact has already skipped
over offset 0, but I missed it for the next offset.
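To make the arithmetic concrete, here is a small standalone sketch (using hypothetical values, not code from the patch) of how the 32-bit member offset wraps around to 0 and why the value stored for the next multixid has to be bumped to 1, mirroring what GetNewMultiXactId already does for the current one:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactOffset;	/* member offsets are 32-bit */

int
main(void)
{
	/* hypothetical multixid whose members end exactly at the wraparound point */
	MultiXactOffset offset = UINT32_MAX - 2;	/* 0xFFFFFFFD */
	int			nmembers = 3;

	/* end of this multixid's members; unsigned arithmetic wraps to 0 */
	MultiXactOffset next_offset = offset + nmembers;

	/* 0 is reserved to mean "invalid/unset", so skip over it */
	if (next_offset == 0)
		next_offset = 1;

	printf("next multixid's offset = %u\n", (unsigned) next_offset);	/* prints 1 */
	return 0;
}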
I was able to reproduce that with these steps:
pg_resetwal -O 0xffffff00 -D data
pg_ctl -D data start
pgbench -s5 -i postgres
pgbench -c3 -t100 -f a.sql postgres
a.sql:
select * from pgbench_branches FOR KEY SHARE;
You get an error like this:
pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR:
MultiXact 372013 has invalid next offset
I tried to modify the new wraparound TAP test to reproduce that, but it
turned out to be difficult because you need to have multiple backends
assigning multixids concurrently to hit that.
The fix is pretty straightforward, see attached. Barring objections,
I'll commit and backport this.
- Heikki
Attachments:
0001-Fix-setting-next-multixid-s-offset-at-offset-wraparo.patch
From f9b2cc8daea1cd0462e9455736237da0f7f00a3d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Dec 2025 16:05:13 +0200
Subject: [PATCH 1/1] Fix setting next multixid's offset at offset wraparound
---
src/backend/access/transam/multixact.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 27f02faec80..8ed3fd9d071 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -909,6 +909,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int64 next_pageno;
int next_entryno;
MultiXactOffset *next_offptr;
+ MultiXactOffset next_offset;
LWLock *lock;
LWLock *prevlock = NULL;
@@ -976,11 +977,15 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
next_offptr += next_entryno;
}
- if (*next_offptr != offset + nmembers)
+ /* Like in GetNewMultiXactId(), skip over offset 0 */
+ next_offset = offset + nmembers;
+ if (next_offset == 0)
+ next_offset = 1;
+ if (*next_offptr != next_offset)
{
/* should already be set to the correct value, or not at all */
Assert(*next_offptr == 0);
- *next_offptr = offset + nmembers;
+ *next_offptr = next_offset;
MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
}
--
2.47.3
On 2025-Dec-04, Heikki Linnakangas wrote:
While working on the 64-bit multixid offsets patch, I noticed one more bug
with this. At offset wraparound, when we set the next multixid's offset in
RecordNewMultiXact, we incorrectly set it to 0 instead of 1. We're supposed
to skip over offset 0, because 0 is reserved to mean invalid. We do that
correctly when setting the "current" multixid's offset, because the caller
of RecordNewMultiXact has already skipped over offset 0, but I missed it for
the next offset.
Ouch.
I tried to modify the new wraparound TAP test to reproduce that, but it
turned out to be difficult because you need to have multiple backends
assigning multixids concurrently to hit that.
Hmm, would it make sense to add a pgbench-based test on
src/test/modules/xid_wraparound? That module is already known to be
expensive, and it doesn't run unless explicitly enabled, so I think it's
not a bad fit.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On 04/12/2025 16:36, Álvaro Herrera wrote:
On 2025-Dec-04, Heikki Linnakangas wrote:
While working on the 64-bit multixid offsets patch, I noticed one more bug
with this. At offset wraparound, when we set the next multixid's offset in
RecordNewMultiXact, we incorrectly set it to 0 instead of 1. We're supposed
to skip over offset 0, because 0 is reserved to mean invalid. We do that
correctly when setting the "current" multixid's offset, because the caller
of RecordNewMultiXact has already skipped over offset 0, but I missed it for
the next offset.
Ouch.
Committed the fix.
I tried to modify the new wraparound TAP test to reproduce that, but it
turned out to be difficult because you need to have multiple backends
assigning multixids concurrently to hit that.
Hmm, would it make sense to add a pgbench-based test on
src/test/modules/xid_wraparound? That module is already known to be
expensive, and it doesn't run unless explicitly enabled, so I think it's
not a bad fit.
Doesn't seem worth it. The repro with pgbench is a bit unreliable too.
It actually only readily reproduces on 'master'. In backbranches, I also
had to add a random sleep before RecordNewMultiXact() to trigger it. I
think that's because on 'master', I removed the special case in
GetMultiXactIdMembers() to use the in-memory nextOffset value instead of
reading the next offset from the SLRU page, when we're reading the last
assigned multixid.
Furthermore, we will hopefully get rid of offset wraparounds soon with
the 64-bit offsets.
- Heikki
On Fri, 28 Nov 2025 at 22:51, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I moved the wraparound test to a separate test file and commit. More
test coverage is good, but it's quite separate from the bugfix and the
wraparound related test shares very little with the other test. The
wraparound test needs a little more cleanup: use plain perl instead of
'dd' and 'rm' for the file operations, for example. (I did that with the
tests in the 64-bit mxoff patches, so we could copy from there.)
It's good that the test was added. But it seems like it could be
improved a bit. The problem is, it only runs successfully with a
standard block size. Plus, the comment about the number of bytes was a
bit unclear, for my taste. PFA patch, it should make this test pass
with different block sizes.
--
Best regards,
Maxim Orlov.
Attachments:
0001-Improve-7b81be9b42-Add-test-for-multixid-wraparound.patch
From 13795e380eecf21cad817026c2ad42127c2ee2fc Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Fri, 5 Dec 2025 19:24:05 +0300
Subject: [PATCH] Improve 7b81be9b42 Add test for multixid wraparound
Add support for different block sizes
---
.../modules/test_slru/t/002_multixact_wraparound.pl | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/src/test/modules/test_slru/t/002_multixact_wraparound.pl b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
index de37d845b1..073454d9c3 100644
--- a/src/test/modules/test_slru/t/002_multixact_wraparound.pl
+++ b/src/test/modules/test_slru/t/002_multixact_wraparound.pl
@@ -24,16 +24,20 @@ command_ok(
$node_pgdata
],
"set the cluster's next multitransaction to 0xFFFFFFF8");
+my $out = (run_command([ 'pg_resetwal', '--dry-run', $node->data_dir ]))[0];
+$out =~ /^Database block size: *(\d+)$/m or die;
+my $blcksz = $1;
+my $bytes_per_seg = 32 * $blcksz; # SLRU pages per segment
# Fixup the SLRU files to match the state we reset to.
-# initialize SLRU file with zeros (65536 entries * 4 bytes = 262144 bytes)
+# initialize SLRU file with zeros
my $slru_file = "$node_pgdata/pg_multixact/offsets/FFFF";
open my $fh, ">", $slru_file
or die "could not open \"$slru_file\": $!";
binmode $fh;
-# Write 65536 entries of 4 bytes each (all zeros)
-syswrite($fh, "\0" x 262144) == 262144
+# Write 65536 entries of 4 bytes each (all zeros) for default 8192 block size
+syswrite($fh, "\0" x $bytes_per_seg) == $bytes_per_seg
or die "could not write to \"$slru_file\": $!";
close $fh;
--
2.50.1 (Apple Git-155)
On 5 Dec 2025, at 21:36, Maxim Orlov <orlovmg@gmail.com> wrote:
It's good that the test was added. But it seems like it could be
improved a bit. The problem is, it only runs successfully with a
standard block size. Plus, the comment about the number of bytes was a
bit unclear, for my taste. PFA patch, it should make this test pass
with different block sizes.
Oh, great catch!
Other tests seem to extract block size using database query like
$primary->safe_psql('postgres',
"SELECT setting::int FROM pg_settings WHERE name = 'block_size';");
or
$blksize = int($node->safe_psql('postgres', 'SHOW block_size;'));
But here we do not have a running cluster, so resorting to parsing
pg_resetwal output seems reasonable.
Thanks!
Best regards, Andrey Borodin.
On 4 Dec 2025, at 19:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I tried to modify the new wraparound TAP test to reproduce that, but it turned out to be difficult because you need to have multiple backends assigning multixids concurrently to hit that.
IIUC, the new TAP test verifies multixact ID wraparound (the -m option in
pg_resetwal), while the bug was in multixact offset wraparound.
I'm on-call for the rest of this week, but a bit later I can produce fast tests for all kinds of wraparounds :)
But as you said, hopefully soon there won't be any wraparounds or, more
importantly, offsets/members space exhaustion.
Best regards, Andrey Borodin.
On 05/12/2025 20:36, Andrey Borodin wrote:
On 5 Dec 2025, at 21:36, Maxim Orlov <orlovmg@gmail.com> wrote:
It's good that the test was added. But it seems like it could be
improved a bit. The problem is, it only runs successfully with a
standard block size. Plus, the comment about the number of bytes was a
bit unclear, for my taste. PFA patch, it should make this test pass
with different block sizes.
Oh, great catch!
Other tests seem to extract block size using database query like
$primary->safe_psql('postgres',
"SELECT setting::int FROM pg_settings WHERE name = 'block_size';");
or
$blksize = int($node->safe_psql('postgres', 'SHOW block_size;'));
But here we do not have a running cluster, so resorting to parsing
pg_resetwal output seems reasonable.
+1, pg_resetwal makes sense here. Calculating the filename of the last
SLRU segment also needs to be adjusted. It was hardcoded to FFFF, but
it's different with other block sizes.
Fixed that and committed. Thanks!
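For reference, a rough sketch of that arithmetic, using the constants from
the test above (4-byte offset entries, 32 SLRU pages per segment, next
multixid reset to 0xFFFFFFF8); the short hex segment-name format is an
assumption here:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t next_multi = 0xFFFFFFF8u;       /* value the test resets to */
    uint32_t blcksz = 8192;                  /* varies with --with-blocksize */
    uint32_t entries_per_page = blcksz / 4;  /* 4-byte MultiXactOffset entries */
    uint32_t pages_per_segment = 32;

    uint64_t pageno = next_multi / entries_per_page;
    uint64_t segno = pageno / pages_per_segment;

    /* prints FFFF for an 8 kB block size, 7FFF for 16 kB, and so on */
    printf("pg_multixact/offsets/%04llX\n", (unsigned long long) segno);
    return 0;
}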
P.S. I'm surprised we don't have any buildfarm members with non-default
block sizes.
- Heikki
On 2025-Dec-05, Andrey Borodin wrote:
I'm on-call for the rest of this week, but a bit later I can produce
fast tests for all kinds of wraparounds :)
I suspect that would be valuable.
But as you said, hopefully soon there won't be wraparounds, and,
what's more important, offsets\members space exhaustion.
Hmm. We can easily enlarge the offset size to 64 bits, which eliminates
the problem of member wraparound and space exhaustion. However,
enlarging multixact size itself would require enlarging the size of the
t_xmax field in heap tuples, and as far as I know, that's not in the
works. I mean, even with the patches to increase xid to 64 bits[1], I
understand that t_xmax continues to be 32 bits. This means that
multixact offset wraparound is still an issue that we need to pay
attention to.
[1]: /messages/by-id/CACG=ezYeSygeb68tJVej9Qgji6k78V2reSqzh1_Y4P5GxCAGsw@mail.gmail.com
As far as I understand from that patch, XIDs as stored in t_xmax are
still 32 bits, and are made 64 bits by addition of the pd_xid_base
value. This technique can be used for standard XIDs; but multixacts,
having their own numbering cycle, separate from XIDs, cannot be offset by
the same base value.
It would require a separate pd_multixact_base value to be maintained for
each page.
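A minimal sketch of that idea, as described above (the struct and field
names here, including pd_multixact_base, are purely illustrative and not
taken from any actual patch):

#include <stdint.h>
#include <stdio.h>

/*
 * A per-page base turns the 32-bit values stored in tuple headers into
 * 64-bit ones.  Plain XIDs could share pd_xid_base, but multixact IDs
 * advance on their own counter, so they would need a separate base.
 */
typedef struct PageXidBases
{
    uint64_t    pd_xid_base;        /* base added to 32-bit xmin/xmax */
    uint64_t    pd_multixact_base;  /* hypothetical base for multixact IDs */
} PageXidBases;

static uint64_t
xid_to_64(const PageXidBases *bases, uint32_t xid32)
{
    return bases->pd_xid_base + xid32;
}

static uint64_t
multixid_to_64(const PageXidBases *bases, uint32_t multi32)
{
    return bases->pd_multixact_base + multi32;
}

int main(void)
{
    PageXidBases bases = { 8000000000ULL, 5000000000ULL };  /* arbitrary example bases */

    printf("xid: %llu, multixid: %llu\n",
           (unsigned long long) xid_to_64(&bases, 12345),
           (unsigned long long) multixid_to_64(&bases, 678));
    return 0;
}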
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)