VM corruption on standby

Started by Andrey Borodin · 5 months ago · 79 messages
#1 Andrey Borodin
x4mmm@yandex-team.ru
1 attachment(s)

Hi hackers!

I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
At Yandex we had several incidents with corrupted VM and on pgconf.dev colleagues from AWS confirmed that they saw something similar too.
So I toyed around and accidentally wrote a test that reproduces $subj.

I think the corruption happens as follows:
0. we create a table with one frozen tuple
1. the next heap_insert() clears the VM bit and hangs immediately; nothing was logged yet
2. the VM buffer is flushed to disk by the checkpointer or bgwriter
3. the primary is killed with -9
now we have a page that is ALL_VISIBLE\ALL_FROZEN on the standby, but cleared VM bits on the primary
4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
5. pg_visibility detects corruption

Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the case:

/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
	visibilitymap_clear(relation, block, vmbuffer,
						VISIBILITYMAP_ALL_FROZEN))
	cleared_all_frozen = true; // this won't happen due to the VM buffer having been flushed before the crash
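
To spell out that asymmetry, here is a simplified sketch (not the actual code; paths condensed and names borrowed from heapam.c) of how the two flags end up in the WAL record:

```c
/*
 * ALL_VISIBLE: the flag logged in the WAL record is derived from the heap
 * page header bit, which is re-checked under the buffer content lock on
 * every modification (heap_insert() shown):
 */
if (PageIsAllVisible(BufferGetPage(buffer)))
{
	all_visible_cleared = true;		/* always makes it into the WAL record */
	PageClearAllVisible(BufferGetPage(buffer));
	visibilitymap_clear(relation, block, vmbuffer, VISIBILITYMAP_VALID_BITS);
}

/*
 * ALL_FROZEN: the flag depends on whether visibilitymap_clear() actually
 * flipped a bit in the VM.  If the cleared VM page already reached disk
 * before the crash, the call reports no change after restart, nothing is
 * logged, and the standby keeps its stale ALL_FROZEN bit:
 */
if (PageIsAllVisible(page) &&
	visibilitymap_clear(relation, block, vmbuffer, VISIBILITYMAP_ALL_FROZEN))
	cleared_all_frozen = true;		/* skipped in the crash scenario above */
```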

Anyway, the test reproduces corruption of both bits, and also reproduces selecting deleted data on the standby.

The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and the test is placed in modules/test_slru, where it was easier to write. But if we ever want something like this, I can design a less hacky and, probably, more generic version.

Thanks!

Best regards, Andrey Borodin.

Attachments:

v1-0001-Corrupt-VM-on-standby.patch (application/octet-stream)
From 422cda54c9551384fd2b0f8fff4f5821989c3edc Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v1] Corrupt VM on standby

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/access/heap/heapam_xlog.c         |   1 +
 src/test/modules/test_slru/Makefile           |   2 +-
 src/test/modules/test_slru/t/001_multixact.pl | 172 +++++++++---------
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  40 ++++
 5 files changed, 128 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..e5da051aaf0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2123,6 +2123,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	INJECTION_POINT_LOAD("heap-insert:visibilitymap-clear");
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2136,6 +2138,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		visibilitymap_clear(relation,
 							ItemPointerGetBlockNumber(&(heaptup->t_self)),
 							vmbuffer, VISIBILITYMAP_VALID_BITS);
+		INJECTION_POINT_CACHED("heap-insert:visibilitymap-clear", NULL);
 	}
 
 	/*
diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index eb4bd3d6ae3..978030048b6 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -21,6 +21,7 @@
 #include "access/xlogutils.h"
 #include "storage/freespace.h"
 #include "storage/standby.h"
+#include "utils/injection_point.h"
 
 
 /*
diff --git a/src/test/modules/test_slru/Makefile b/src/test/modules/test_slru/Makefile
index fe4e7390437..b6ad41ec804 100644
--- a/src/test/modules/test_slru/Makefile
+++ b/src/test/modules/test_slru/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 	test_multixact.o
 PGFILEDESC = "test_slru - test module for SLRUs"
 
-EXTRA_INSTALL=src/test/modules/injection_points
+EXTRA_INSTALL=src/test/modules/injection_points contrib/pg_visibility
 export enable_injection_points
 TAP_TESTS = 1
 
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..0c347a5157d 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,11 +1,6 @@
 # Copyright (c) 2024-2025, PostgreSQL Global Development Group
 
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
-
+# This test verifies edge case of visibilitymap_clear
 use strict;
 use warnings FATAL => 'all';
 
@@ -18,106 +13,105 @@ if ($ENV{enable_injection_points} ne 'yes')
 {
 	plan skip_all => 'Injection points not supported by this build';
 }
+if ($windows_os)
+{
+	plan skip_all => 'Kill9 works unpredictably on Windows';
+}
 
 my ($node, $result);
 
 $node = PostgreSQL::Test::Cluster->new('mike');
-$node->init;
+$node->init(allows_streaming => 1);
 $node->append_conf('postgresql.conf',
 	"shared_preload_libraries = 'test_slru,injection_points'");
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+$node->safe_psql('postgres', q(CREATE EXTENSION pg_visibility));
 $node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
 
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
-	q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
-	q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
-
-# This session must observe sleep on the condition variable while generating a
-# multixact.  To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
-	qr/start/,
-	q{
-	\echo start
-	SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$standby->start;
+
+my $bg_psql = $node->background_psql('postgres');
+
+# prepare the table, index and freeze some rows.
+# index is here only to demonstrate selecting deleted data on standby
+my $multi = $bg_psql->query_safe(
+	q(
+	CREATE TABLE x(i int);
+	CREATE INDEX on x(i);
+	INSERT INTO x VALUES (1);
+	VACUUM FREEZE x;
+	));
+
+# next backend clearing VM in heap_insert() will hang after releasing buffer lock
 $node->safe_psql('postgres',
-	q{SELECT injection_points_attach('multixact-create-from-members','wait');}
+	q{SELECT injection_points_attach('heap-insert:visibilitymap-clear','wait');}
 );
 
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
-	qr/start/, q{
-	\echo start
-	SELECT test_create_multixact();
-});
-
-$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, test-multixact-read',
-	"matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact.  A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, multixact-get-members-cv-sleep',
-	"matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep.  Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('test-multixact-read')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
+# we expect this backend to hang forever
+$bg_psql->query_until(
+	qr/deploying lost VM bit/, q(
+\echo deploying lost VM bit
+	INSERT INTO x VALUES (1);
+));
 
+$node->wait_for_event('client backend', 'heap-insert:visibilitymap-clear');
 $node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-create-from-members')});
+	q{SELECT injection_points_detach('heap-insert:visibilitymap-clear')});
+
+# now we want VM on-disk. Checkpoint will hang too, hence another background session.
+my $bg_psql1 = $node->background_psql('postgres');
+$bg_psql1->query_until(
+	qr/flush data to disk/, q(
+\echo flush data to disk
+	checkpoint;
+));
+
+sleep(1);
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->stop('immediate', fail_ok => 1);
+$node->poll_start;
+$bg_psql->{run}->finish;
+$bg_psql1->{run}->finish;
+
+# Now VMs are different on primary and standby. Insert some data to demonstrate problems.
 $node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
-
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+	q{
+	INSERT INTO x VALUES (1);
+	INSERT INTO x VALUES (2);
+	DELETE FROM x WHERE i = 2;
+	});
+
+$node->wait_for_catchup($standby);
+
+# corruption tests
+my $corrupted_all_frozen = $standby->safe_psql('postgres',
+	qq{SELECT pg_check_frozen('x')});
+my $corrupted_all_visible = $standby->safe_psql('postgres',
+	qq{SELECT pg_check_visible('x')});
+my $observed_deleted_data = $standby->safe_psql('postgres',
+	qq{set enable_seqscan to false; SELECT * FROM x WHERE i = 2;});
+
+# debug printing
+print("\n");
+print($corrupted_all_frozen);
+print("\n\n");
+print($corrupted_all_visible);
+print("\n\n");
+print($observed_deleted_data);
+print("\n\n");
+
+# fail the test if any corruption is reported
+is($corrupted_all_frozen, '', 'pg_check_frozen() observes corruption');
+is($corrupted_all_visible, '', 'pg_check_visible() observes corruption');
+is($observed_deleted_data, '', 'deleted data returned by select');
 
 $node->stop;
+$standby->stop;
 
-# If we reached this point - everything is OK.
-ok(1);
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 61f68e0cc2e..a162baa55c6 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1190,6 +1190,46 @@ sub start
 
 =pod
 
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster.  Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+	my ($self) = @_;
+
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+	my $attempts = 0;
+
+	while ($attempts < $max_attempts)
+	{
+		$self->start(fail_ok => 1) && return 1;
+
+		# Wait 0.1 second before retrying.
+		usleep(100_000);
+
+		# Clean up in case the start attempt just timed out or some such.
+		$self->stop('fast', fail_ok => 1);
+
+		$attempts++;
+	}
+
+	# Try one last time without fail_ok, which will BAIL_OUT unless it
+	# succeeds.
+	$self->start && return 1;
+	return 0;
+}
+
+=pod
+
 =item $node->kill9()
 
 Send SIGKILL (signal 9) to the postmaster.
-- 
2.39.5 (Apple Git-154)

#2 Aleksander Alekseev
aleksander@tigerdata.com
In reply to: Andrey Borodin (#1)
Re: VM corruption on standby

Hi Andrey,

I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
At Yandex we had several incidents with corrupted VM and on pgconf.dev colleagues from AWS confirmed that they saw something similar too.
So I toyed around and accidentally wrote a test that reproduces $subj.

I think the corruption happens as follows:
0. we create a table with one frozen tuple
1. the next heap_insert() clears the VM bit and hangs immediately; nothing was logged yet
2. the VM buffer is flushed to disk by the checkpointer or bgwriter
3. the primary is killed with -9
now we have a page that is ALL_VISIBLE\ALL_FROZEN on the standby, but cleared VM bits on the primary
4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
5. pg_visibility detects corruption

Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the case:

/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
	visibilitymap_clear(relation, block, vmbuffer,
						VISIBILITYMAP_ALL_FROZEN))
	cleared_all_frozen = true; // this won't happen due to the VM buffer having been flushed before the crash

Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.

Great find. I executed your test on a pretty much regular Linux x64
machine and indeed it failed:

```
not ok 1 - pg_check_frozen() observes corruption
not ok 2 - pg_check_visible() observes corruption
not ok 3 - deleted data returned by select
1..3
# test failed
----------------------------------- stderr -----------------------------------
# Failed test 'pg_check_frozen() observes corruption'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl
line 110.
# got: '(0,2)
# (0,3)
# (0,4)'
# expected: ''
# Failed test 'pg_check_visible() observes corruption'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl
line 111.
# got: '(0,2)
# (0,4)'
# expected: ''
# Failed test 'deleted data returned by select'
# at /home/eax/projects/c/postgresql/src/test/modules/test_slru/t/001_multixact.pl
line 112.
# got: '2'
# expected: ''
# Looks like you failed 3 tests of 3.
```

This is a tricky bug. Do you also have a proposal of a particular fix?

The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.

IMO - yes, we do need this regression test.

#3 Aleksander Alekseev
aleksander@tigerdata.com
In reply to: Aleksander Alekseev (#2)
2 attachment(s)
Re: VM corruption on standby

Hi,

This is a tricky bug. Do you also have a proposal of a particular fix?

If my understanding is correct, we should make a WAL record with the
XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
the same critical section (in order to avoid race conditions within
the same backend).

A draft patch is attached. It makes the test pass and doesn't seem to
break any other tests.

Thoughts?

Attachments:

v2-0001-Corrupt-VM-on-standby.patch (text/x-patch)
From 2909bfd3483b4796cb671c722520853d3b6ce61b Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v2 1/2] Corrupt VM on standby

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/access/heap/heapam_xlog.c         |   1 +
 src/test/modules/test_slru/Makefile           |   2 +-
 src/test/modules/test_slru/t/001_multixact.pl | 172 +++++++++---------
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  40 ++++
 5 files changed, 128 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..e5da051aaf0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2123,6 +2123,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	INJECTION_POINT_LOAD("heap-insert:visibilitymap-clear");
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2136,6 +2138,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		visibilitymap_clear(relation,
 							ItemPointerGetBlockNumber(&(heaptup->t_self)),
 							vmbuffer, VISIBILITYMAP_VALID_BITS);
+		INJECTION_POINT_CACHED("heap-insert:visibilitymap-clear", NULL);
 	}
 
 	/*
diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index eb4bd3d6ae3..978030048b6 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -21,6 +21,7 @@
 #include "access/xlogutils.h"
 #include "storage/freespace.h"
 #include "storage/standby.h"
+#include "utils/injection_point.h"
 
 
 /*
diff --git a/src/test/modules/test_slru/Makefile b/src/test/modules/test_slru/Makefile
index fe4e7390437..b6ad41ec804 100644
--- a/src/test/modules/test_slru/Makefile
+++ b/src/test/modules/test_slru/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 	test_multixact.o
 PGFILEDESC = "test_slru - test module for SLRUs"
 
-EXTRA_INSTALL=src/test/modules/injection_points
+EXTRA_INSTALL=src/test/modules/injection_points contrib/pg_visibility
 export enable_injection_points
 TAP_TESTS = 1
 
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..0c347a5157d 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,11 +1,6 @@
 # Copyright (c) 2024-2025, PostgreSQL Global Development Group
 
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
-
+# This test verifies edge case of visibilitymap_clear
 use strict;
 use warnings FATAL => 'all';
 
@@ -18,106 +13,105 @@ if ($ENV{enable_injection_points} ne 'yes')
 {
 	plan skip_all => 'Injection points not supported by this build';
 }
+if ($windows_os)
+{
+	plan skip_all => 'Kill9 works unpredictably on Windows';
+}
 
 my ($node, $result);
 
 $node = PostgreSQL::Test::Cluster->new('mike');
-$node->init;
+$node->init(allows_streaming => 1);
 $node->append_conf('postgresql.conf',
 	"shared_preload_libraries = 'test_slru,injection_points'");
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+$node->safe_psql('postgres', q(CREATE EXTENSION pg_visibility));
 $node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
 
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
-	q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
-	q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
-
-# This session must observe sleep on the condition variable while generating a
-# multixact.  To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
-	qr/start/,
-	q{
-	\echo start
-	SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+my $backup_name = 'my_backup';
+$node->backup($backup_name);
+my $standby = PostgreSQL::Test::Cluster->new('standby');
+$standby->init_from_backup($node, $backup_name, has_streaming => 1);
+$standby->start;
+
+my $bg_psql = $node->background_psql('postgres');
+
+# prepare the table, index and freeze some rows.
+# index is here only to demonstrate selecting deleted data on standby
+my $multi = $bg_psql->query_safe(
+	q(
+	CREATE TABLE x(i int);
+	CREATE INDEX on x(i);
+	INSERT INTO x VALUES (1);
+	VACUUM FREEZE x;
+	));
+
+# next backend clearing VM in heap_insert() will hang after releasing buffer lock
 $node->safe_psql('postgres',
-	q{SELECT injection_points_attach('multixact-create-from-members','wait');}
+	q{SELECT injection_points_attach('heap-insert:visibilitymap-clear','wait');}
 );
 
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
-	qr/start/, q{
-	\echo start
-	SELECT test_create_multixact();
-});
-
-$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, test-multixact-read',
-	"matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact.  A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, multixact-get-members-cv-sleep',
-	"matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep.  Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('test-multixact-read')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
+# we expect this backend to hang forever
+$bg_psql->query_until(
+	qr/deploying lost VM bit/, q(
+\echo deploying lost VM bit
+	INSERT INTO x VALUES (1);
+));
 
+$node->wait_for_event('client backend', 'heap-insert:visibilitymap-clear');
 $node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-create-from-members')});
+	q{SELECT injection_points_detach('heap-insert:visibilitymap-clear')});
+
+# now we want VM on-disk. Checkpoint will hang too, hence another background session.
+my $bg_psql1 = $node->background_psql('postgres');
+$bg_psql1->query_until(
+	qr/flush data to disk/, q(
+\echo flush data to disk
+	checkpoint;
+));
+
+sleep(1);
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->stop('immediate', fail_ok => 1);
+$node->poll_start;
+$bg_psql->{run}->finish;
+$bg_psql1->{run}->finish;
+
+# Now VMs are different on primary and standby. Insert some data to demonstrate problems.
 $node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
-
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+	q{
+	INSERT INTO x VALUES (1);
+	INSERT INTO x VALUES (2);
+	DELETE FROM x WHERE i = 2;
+	});
+
+$node->wait_for_catchup($standby);
+
+# corruption tests
+my $corrupted_all_frozen = $standby->safe_psql('postgres',
+	qq{SELECT pg_check_frozen('x')});
+my $corrupted_all_visible = $standby->safe_psql('postgres',
+	qq{SELECT pg_check_visible('x')});
+my $observed_deleted_data = $standby->safe_psql('postgres',
+	qq{set enable_seqscan to false; SELECT * FROM x WHERE i = 2;});
+
+# debug printing
+print("\n");
+print($corrupted_all_frozen);
+print("\n\n");
+print($corrupted_all_visible);
+print("\n\n");
+print($observed_deleted_data);
+print("\n\n");
+
+# fail the test if any corruption is reported
+is($corrupted_all_frozen, '', 'pg_check_frozen() observes corruption');
+is($corrupted_all_visible, '', 'pg_check_visible() observes corruption');
+is($observed_deleted_data, '', 'deleted data returned by select');
 
 $node->stop;
+$standby->stop;
 
-# If we reached this point - everything is OK.
-ok(1);
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..e810f123f93 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1191,6 +1191,46 @@ sub start
 
 =pod
 
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster.  Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+	my ($self) = @_;
+
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+	my $attempts = 0;
+
+	while ($attempts < $max_attempts)
+	{
+		$self->start(fail_ok => 1) && return 1;
+
+		# Wait 0.1 second before retrying.
+		usleep(100_000);
+
+		# Clean up in case the start attempt just timed out or some such.
+		$self->stop('fast', fail_ok => 1);
+
+		$attempts++;
+	}
+
+	# Try one last time without fail_ok, which will BAIL_OUT unless it
+	# succeeds.
+	$self->start && return 1;
+	return 0;
+}
+
+=pod
+
 =item $node->kill9()
 
 Send SIGKILL (signal 9) to the postmaster.
-- 
2.43.0

v2-0002-Bugfix-TODO-FIXME-write-a-better-message.patch (text/x-patch)
From 4617d97e2953fcb5b20bb14750bcc971ed504286 Mon Sep 17 00:00:00 2001
From: Aleksander Alekseev <aleksander@timescale.com>
Date: Thu, 7 Aug 2025 17:04:30 +0300
Subject: [PATCH v2 2/2] Bugfix TODO FIXME write a better message

---
 src/backend/access/heap/heapam.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e5da051aaf0..6e38c9b4d37 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2135,10 +2135,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	{
 		all_visible_cleared = true;
 		PageClearAllVisible(BufferGetPage(buffer));
-		visibilitymap_clear(relation,
-							ItemPointerGetBlockNumber(&(heaptup->t_self)),
-							vmbuffer, VISIBILITYMAP_VALID_BITS);
-		INJECTION_POINT_CACHED("heap-insert:visibilitymap-clear", NULL);
+
+		/*
+		 * Visibility map should be updated *after* logging the change
+		 * within the same critical section.
+		 *
+		 * AALEKSEEV TODO clarify the reason
+		 */
 	}
 
 	/*
@@ -2233,6 +2236,14 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		PageSetLSN(page, recptr);
 	}
 
+	if (all_visible_cleared)
+	{
+		visibilitymap_clear(relation,
+							ItemPointerGetBlockNumber(&(heaptup->t_self)),
+							vmbuffer, VISIBILITYMAP_VALID_BITS);
+		INJECTION_POINT_CACHED("heap-insert:visibilitymap-clear", NULL);
+	}
+
 	END_CRIT_SECTION();
 
 	UnlockReleaseBuffer(buffer);
-- 
2.43.0

#4 Aleksander Alekseev
aleksander@tigerdata.com
In reply to: Aleksander Alekseev (#3)
Re: VM corruption on standby

This is a tricky bug. Do you also have a proposal of a particular fix?

If my understanding is correct, we should make a WAL record with the
XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
the same critical section (in order to avoid race conditions within
the same backend).

I meant instance, not backend. Sorry for the confusion.


A draft patch is attached. It makes the test pass and doesn't seem to
break any other tests.

#5 Aleksander Alekseev
aleksander@tigerdata.com
In reply to: Aleksander Alekseev (#4)
Re: VM corruption on standby

Hi again,

I meant instance, not backend. Sorry for confusion.

It looks like I completely misunderstood what START_CRIT_SECTION() /
END_CRIT_SECTION() are for here. Simply ignore this part :) Apologies
for the noise.

#6 Aleksander Alekseev
aleksander@tigerdata.com
In reply to: Aleksander Alekseev (#3)
Re: VM corruption on standby

Hi,

If my understanding is correct, we should make a WAL record with the
XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
the same critical section [...]

A draft patch is attached. It makes the test pass and doesn't seem to
break any other tests.

Thoughts?

In order not to forget - assuming I'm not wrong about the cause of the
issue, we might want to recheck the order of visibilitymap_* and XLog*
calls in the following functions too:

- heap_multi_insert
- heap_delete
- heap_update
- heap_lock_tuple
- heap_lock_updated_tuple_rec

At a quick look, all of the named functions modify the VM before making a
corresponding WAL record. This can cause a similar issue (see the
reordering sketch after the list):

1. VM modified
2. evicted asynchronously before logging
3. kill 9
4. different state of VM on primary and standby
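
To make the intended shape concrete, here is a rough sketch of the reordering from the draft 0002 patch, using heap_insert() as the example (the other call sites would presumably follow the same pattern, details to be verified):

```c
START_CRIT_SECTION();

if (PageIsAllVisible(BufferGetPage(buffer)))
{
	all_visible_cleared = true;
	PageClearAllVisible(BufferGetPage(buffer));	/* page bit first; the flag still gets logged */
}

/* ... build the record and write WAL before touching the VM ... */
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);

/* only now clear the VM bits, still inside the critical section */
if (all_visible_cleared)
	visibilitymap_clear(relation,
						ItemPointerGetBlockNumber(&(heaptup->t_self)),
						vmbuffer, VISIBILITYMAP_VALID_BITS);

END_CRIT_SECTION();
```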

#7 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Aleksander Alekseev (#3)
Re: VM corruption on standby

On 7 Aug 2025, at 17:09, Aleksander Alekseev <aleksander@tigerdata.com> wrote:

If my understanding is correct, we should make a WAL record with the
XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within
the same critical section (in order to avoid race conditions within
the same backend).

Well, the test passes because you moved the injection point to a very safe position. I can't comment on other aspects of moving visibilitymap_clear() around.
The approach seems viable to me, but before fixing anything I'd like to understand why PD_ALL_VISIBLE in the heap page header did not save the day.

Best regards, Andrey Borodin.

#8 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#7)
Re: VM corruption on standby

On 7 Aug 2025, at 18:54, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

moved injection point to a very safe position.

BTW, your fix also fixes the ALL_FROZEN stuff, simply because the WAL for the heap insert has already been emitted by the time of the kill -9.

I want to emphasize that the position of the injection point seems to me not a hint, but rather a coincidence.

I concur that all other users of visibilitymap_clear() will likely need to be fixed, but only once we have a good picture of what exactly is broken.

Best regards, Andrey Borodin.

#9 Aleksander Alekseev
aleksander@tigerdata.com
In reply to: Andrey Borodin (#8)
Re: VM corruption on standby

Hi Andrey,

the test passes because you moved injection point to a very safe position
[...]
I want to emphasize that it seems to me that position of injection point is not a hint, but rather coincidental.

Well, I wouldn't say that the test passes merely because the location
of the injection point was moved.

For sure it was moved, because the visibilitymap_clear() call was
moved. Maybe I misunderstood the intent of the test. Wasn't it to call
the injection point right after updating the VM? I tried to place it
between updating the WAL and updating the VM and the effect was the
same - the test still passes.

In any case we can place it anywhere we want to if we agree to include
the test into the final version of the patch.

I concur that all other users of visibilitymap_clear() likely will need to be fixed.

Right, I realized there are a few places besides heapam.c that might
need a change.

The approach seems viable to me, but I'd like to have understanding why PD_ALL_VISIBLE in a heap page header did not save the day before fixing anything
... But only when we have a good picture what exactly is broken.

Agreed. I would especially like to hear the opinion of somebody who has
been hacking Postgres longer than I have. Perhaps there was a good
reason, which I'm unaware of, to update the VM *before* creating WAL records.

#10 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Aleksander Alekseev (#9)
Re: VM corruption on standby

On 7 Aug 2025, at 19:36, Aleksander Alekseev <aleksander@tigerdata.com> wrote:

Maybe I misunderstood the intent of the test.

You understood my intent in writing the test precisely. But it fails due to a different bug than the one I anticipated!

So far I have noticed that if I move the injection point before

PageClearAllVisible(BufferGetPage(buffer));

or after writing WAL, the test passes.

Also, I found that at the moment of the kill -9 the checkpointer flushes the heap page to disk despite the content lock. I haven't found who released the content lock, though.

Best regards, Andrey Borodin.

#11 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#10)
Re: VM corruption on standby

On 9 Aug 2025, at 18:28, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Also I investigated that in a moment of kill -9 checkpointer flushes heap page to disk despite content lock. I haven't found who released content lock though.

I wrote that message and then understood: it's LWLockReleaseAll().

0. the checkpointer is going to flush a heap buffer but waits on the content lock
1. the client is resetting PD_ALL_VISIBLE on the page
2. the postmaster is killed, which makes the client go down
3. the client calls LWLockReleaseAll() at ProcKill() (?) - see the sketch below
4. the checkpointer flushes the buffer with the reset PD_ALL_VISIBLE, which is not WAL-logged to the standby
5. subsequent deletes do not log resetting this bit
6. the deleted data is observable on the standby with an IndexOnlyScan
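
For reference, a condensed sketch of where the release in step 3 presumably happens (paraphrasing ProcKill() in src/backend/storage/lmgr/proc.c from memory, so take the details as an assumption):

```c
/* on_shmem_exit callback reached via proc_exit() */
static void
ProcKill(int code, Datum arg)
{
	/* ... */

	/*
	 * Release any LW locks I am holding -- including the buffer content
	 * lock, which is exactly what lets the checkpointer proceed in step 4.
	 */
	LWLockReleaseAll();

	/* ... */
}
```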

Any idea how to fix this?

Best regards, Andrey Borodin.

#12 Kirill Reshke
reshkekirill@gmail.com
In reply to: Aleksander Alekseev (#9)
Re: VM corruption on standby

On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev
<aleksander@tigerdata.com> wrote:

Perhaps there was a good
reason to update the VM *before* creating WAL records I'm unaware of.

Looks like 503c730 intentionally does it this way; however, I have not
yet fully understood the reasoning behind it.

--
Best regards,
Kirill Reshke

#13 Aleksander Alekseev
aleksander@tigerdata.com
In reply to: Andrey Borodin (#11)
Re: VM corruption on standby

Hi Andrey,

0. checkpointer is going to flush a heap buffer but waits on content lock
1. client is resetting PD_ALL_VISIBLE from page
2. postmaster is killed and command client to go down
3. client calls LWLockReleaseAll() at ProcKill() (?)
4. checkpointer flushes buffer with reset PG_ALL_VISIBLE that is not WAL-logged to standby
5. subsequent deletes do not log resetting this bit
6. deleted data is observable on standby with IndexOnlyScan

Thanks for investigating this in more detail. If this is indeed what
happens, it is a violation of the "log before changing" approach. This
is, for instance, why we have PageHeaderData.pd_lsn: to make sure
pages are evicted only *after* the WAL record that changed them has
been written to disk (because WAL records can't be applied to pages
from the future).
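
For illustration, a condensed sketch of how the buffer manager enforces that rule when writing a page out (paraphrasing bufmgr.c's FlushBuffer() from memory, so the exact calls are an assumption):

```c
/* inside FlushBuffer(): never let a data page reach disk ahead of its WAL */
XLogRecPtr	recptr = BufferGetLSNAtomic(buf);

XLogFlush(recptr);		/* force WAL out up to the page's LSN ... */
smgrwrite(reln, forkNum, blockNum, bufToWrite, false);	/* ... only then write the page */
```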

I guess the intent here was an optimization of some sort, but the facts
that 1. the instance can be killed at any time and 2. there might be
replicas were not considered.

Any idea how to fix this?

IMHO: log the changes first, then allow the page to be evicted.

#14 Kirill Reshke
reshkekirill@gmail.com
In reply to: Aleksander Alekseev (#13)
Re: VM corruption on standby

On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev
<aleksander@tigerdata.com> wrote:

For this reason we have PageHeaderData.pd_lsn for instance - to make sure
pages are evicted only *after* the record that changed it is written
to disk (because WAL records can't be applied to pages from the
future).

We don't bump the LSN of the heap page when setting the visibility
map bit.

I guess the intent here could be to do an optimization of some sort
but the facts that 1. the instance can be killed at any time and 2.
there might be replicas - were not considered.

IMHO: logging the changes first, then allowing to evict the page.

Clearing the VM before logging the changes was intentional [0].
So I assume we should not change the approach, but rather just tweak
things a bit to make the whole thing work.

[0]: /messages/by-id/BANLkTimuLk4RHXSQHEEiYGbxiXp2mh5KCA@mail.gmail.com

--
Best regards,
Kirill Reshke

#15 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Aleksander Alekseev (#13)
Re: VM corruption on standby

On 9 Aug 2025, at 23:54, Aleksander Alekseev <aleksander@tigerdata.com> wrote:

IMHO: logging the changes first, then allowing to evict the page.

The VM and buffer manager code does not allow a buffer to be flushed until the changes are logged.
The problem is that our crash-exit path releases the locks that protect the buffer from being flushed.

Best regards, Andrey Borodin.

#16 Kirill Reshke
reshkekirill@gmail.com
In reply to: Andrey Borodin (#1)
1 attachment(s)
Re: VM corruption on standby

On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi hackers!

I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
At Yandex we had several incidents with corrupted VM and on pgconf.dev colleagues from AWS confirmed that they saw something similar too.
So I toyed around and accidentally wrote a test that reproduces $subj.

I think the corruption happens as follows:
0. we create a table with one frozen tuple
1. the next heap_insert() clears the VM bit and hangs immediately; nothing was logged yet
2. the VM buffer is flushed to disk by the checkpointer or bgwriter
3. the primary is killed with -9
now we have a page that is ALL_VISIBLE\ALL_FROZEN on the standby, but cleared VM bits on the primary
4. a subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in its WAL record
5. pg_visibility detects corruption

Interestingly, in an off-list conversation Melanie explained to me how ALL_VISIBLE is protected from this: WAL-logging depends on the PD_ALL_VISIBLE heap page bit, not on the state of the VM. But for ALL_FROZEN this is not the case:

/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
	visibilitymap_clear(relation, block, vmbuffer,
						VISIBILITYMAP_ALL_FROZEN))
	cleared_all_frozen = true; // this won't happen due to the VM buffer having been flushed before the crash

Anyway, the test reproduces corruption of both bits. And also reproduces selecting deleted data on standby.

The test is not intended to be committed when we fix the problem, so some waits are simulated with sleep(1) and test is placed at modules/test_slru where it was easier to write. But if we ever want something like this - I can design a less hacky version. And, probably, more generic.

Thanks!

Best regards, Andrey Borodin.

The attached patch reproduces the same thing but without any standby node.
CHECKPOINT somehow manages to flush the heap page when the instance is kill-9-ed.
As a result, we have an inconsistency between the heap and VM pages:

```
reshke=# select * from pg_visibility('x');
blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
0 | t | t | f
(1 row)
```

Notice that I moved the injection point one line above visibilitymap_clear.
Without this change, the behaviour is also reproduced, but much less
frequently.

--
Best regards,
Kirill Reshke

Attachments:

v2-0001-Corrupt-VM-on-standby.patch (application/octet-stream)
From 786a8333872f32e0f450ba89802791629d20d97f Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 27 Jul 2025 11:37:55 +0500
Subject: [PATCH v2] Corrupt VM on standby

---
 src/backend/access/heap/heapam.c              |   3 +
 src/backend/access/heap/heapam_xlog.c         |   1 +
 src/test/modules/test_slru/Makefile           |   2 +-
 src/test/modules/test_slru/t/001_multixact.pl | 165 ++++++++----------
 src/test/perl/PostgreSQL/Test/Cluster.pm      |  40 +++++
 5 files changed, 120 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..e4f703e46f3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2123,6 +2123,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBlockNumber);
 
+	INJECTION_POINT_LOAD("heap-insert:visibilitymap-clear");
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -2133,6 +2135,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	{
 		all_visible_cleared = true;
 		PageClearAllVisible(BufferGetPage(buffer));
+		INJECTION_POINT_CACHED("heap-insert:visibilitymap-clear", NULL);
 		visibilitymap_clear(relation,
 							ItemPointerGetBlockNumber(&(heaptup->t_self)),
 							vmbuffer, VISIBILITYMAP_VALID_BITS);
diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index eb4bd3d6ae3..978030048b6 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -21,6 +21,7 @@
 #include "access/xlogutils.h"
 #include "storage/freespace.h"
 #include "storage/standby.h"
+#include "utils/injection_point.h"
 
 
 /*
diff --git a/src/test/modules/test_slru/Makefile b/src/test/modules/test_slru/Makefile
index fe4e7390437..b6ad41ec804 100644
--- a/src/test/modules/test_slru/Makefile
+++ b/src/test/modules/test_slru/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 	test_multixact.o
 PGFILEDESC = "test_slru - test module for SLRUs"
 
-EXTRA_INSTALL=src/test/modules/injection_points
+EXTRA_INSTALL=src/test/modules/injection_points contrib/pg_visibility
 export enable_injection_points
 TAP_TESTS = 1
 
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..6f97e1508a8 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,11 +1,6 @@
 # Copyright (c) 2024-2025, PostgreSQL Global Development Group
 
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
-
+# This test verifies edge case of visibilitymap_clear
 use strict;
 use warnings FATAL => 'all';
 
@@ -18,106 +13,96 @@ if ($ENV{enable_injection_points} ne 'yes')
 {
 	plan skip_all => 'Injection points not supported by this build';
 }
+if ($windows_os)
+{
+	plan skip_all => 'Kill9 works unpredictably on Windows';
+}
 
 my ($node, $result);
 
 $node = PostgreSQL::Test::Cluster->new('mike');
-$node->init;
+$node->init(allows_streaming => 1);
 $node->append_conf('postgresql.conf',
 	"shared_preload_libraries = 'test_slru,injection_points'");
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
-$node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
+$node->safe_psql('postgres', q(CREATE EXTENSION pg_visibility));
 
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
-	q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
-	q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+my $bg_psql = $node->background_psql('postgres');
 
-# This session must observe sleep on the condition variable while generating a
-# multixact.  To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
+# prepare the table, index and freeze some rows.
+# index is here only to demonstrate selecting deleted data on standby
+my $multi = $bg_psql->query_safe(
+	q(
+	CREATE TABLE x(i int);
+	CREATE INDEX on x(i);
+	INSERT INTO x VALUES (1);
+	VACUUM FREEZE x;
+	));
 
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
-	qr/start/,
-	q{
-	\echo start
-	SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# next backend clearing VM in heap_insert() will hang after releasing buffer lock
 $node->safe_psql('postgres',
-	q{SELECT injection_points_attach('multixact-create-from-members','wait');}
+	q{SELECT injection_points_attach('heap-insert:visibilitymap-clear','wait');}
 );
 
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
-	qr/start/, q{
-	\echo start
-	SELECT test_create_multixact();
-});
-
-$node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, test-multixact-read',
-	"matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact.  A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, multixact-get-members-cv-sleep',
-	"matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep.  Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('test-multixact-read')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
-
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-create-from-members')});
+# we expect this backend to hang forever
+$bg_psql->query_until(
+	qr/deploying lost VM bit/, q(
+\echo deploying lost VM bit
+	INSERT INTO x VALUES (1);
+));
+
+#$node->wait_for_event('client backend', 'heap-insert:visibilitymap-clear');
+#$node->safe_psql('postgres',
+#	q{SELECT injection_points_detach('heap-insert:visibilitymap-clear')});
+
+# now we want VM on-disk. Checkpoint will hang too, hence another background session.
+my $bg_psql1 = $node->background_psql('postgres');
+$bg_psql1->query_until(
+	qr/flush data to disk/, q(
+\echo flush data to disk
+	checkpoint;
+));
+
+sleep(1);
+# All set and done, it's time for hard restart
+$node->kill9;
+$node->stop('immediate', fail_ok => 1);
+$node->poll_start;
+$bg_psql->{run}->finish;
+$bg_psql1->{run}->finish;
+
+# Now VMs are different on primary and standby. Insert some data to demonstrate problems.
 $node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
-
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+	q{
+	INSERT INTO x VALUES (1);
+	INSERT INTO x VALUES (2);
+	DELETE FROM x WHERE i = 2;
+	});
+
+
+# corruption tests
+my $corrupted_all_frozen = $node->safe_psql('postgres',
+	qq{SELECT pg_check_frozen('x')});
+my $corrupted_all_visible = $node->safe_psql('postgres',
+	qq{SELECT pg_check_visible('x')});
+my $observed_deleted_data = $node->safe_psql('postgres',
+	qq{set enable_seqscan to false; SELECT * FROM x WHERE i = 2;});
+
+# debug printing
+print("\n");
+print($corrupted_all_frozen);
+print("\n\n");
+print($corrupted_all_visible);
+print("\n\n");
+print($observed_deleted_data);
+print("\n\n");
+
+# fail the test if any corruption is reported
+is($corrupted_all_frozen, '', 'pg_check_frozen() observes corruption');
+is($corrupted_all_visible, '', 'pg_check_visible() observes corruption');
+is($observed_deleted_data, '', 'deleted data returned by select');
 
 $node->stop;
 
-# If we reached this point - everything is OK.
-ok(1);
 done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 35413f14019..e810f123f93 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1191,6 +1191,46 @@ sub start
 
 =pod
 
+=item $node->poll_start() => success_or_failure
+
+Polls start after kill9
+
+We may need retries to start a new postmaster.  Causes:
+ - kernel is slow to deliver SIGKILL
+ - postmaster parent is slow to waitpid()
+ - postmaster child is slow to exit in response to SIGQUIT
+ - postmaster child is slow to exit after postmaster death
+
+=cut
+
+sub poll_start
+{
+	my ($self) = @_;
+
+	my $max_attempts = 10 * $PostgreSQL::Test::Utils::timeout_default;
+	my $attempts = 0;
+
+	while ($attempts < $max_attempts)
+	{
+		$self->start(fail_ok => 1) && return 1;
+
+		# Wait 0.1 second before retrying.
+		usleep(100_000);
+
+		# Clean up in case the start attempt just timed out or some such.
+		$self->stop('fast', fail_ok => 1);
+
+		$attempts++;
+	}
+
+	# Try one last time without fail_ok, which will BAIL_OUT unless it
+	# succeeds.
+	$self->start && return 1;
+	return 0;
+}
+
+=pod
+
 =item $node->kill9()
 
 Send SIGKILL (signal 9) to the postmaster.
-- 
2.43.0

#17 Kirill Reshke
reshkekirill@gmail.com
In reply to: Kirill Reshke (#16)
1 attachment(s)
Re: VM corruption on standby

On Tue, 12 Aug 2025 at 10:38, I wrote:

CHECKPOINT
somehow manages to flush the heap page when instance kill-9-ed.

This corruption does not reproduce without the CHECKPOINT call; however, I
do not see any suspicious syscall made by the CHECKPOINT process.
It does not write anything to disk here, does it? PFA strace.

--
Best regards,
Kirill Reshke

Attachments:

checkpointer.strace.txt (text/plain)
#18 Kirill Reshke
reshkekirill@gmail.com
In reply to: Andrey Borodin (#1)
Re: VM corruption on standby

On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi hackers!

I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting.
At Yandex we had several incidents with corrupted VM and on pgconf.dev colleagues from AWS confirmed that they saw something similar too.

While this aims to find existing VM corruption (I mean, in PG <= 17),
this reproducer does not seem to work on pg17. At least, I did not
manage to reproduce this scenario on pg17.

This makes me think this exact corruption may be pg18-only. Is it
possible that AIO is somehow involved here?

--
Best regards,
Kirill Reshke

#19 Kirill Reshke
reshkekirill@gmail.com
In reply to: Kirill Reshke (#18)
Re: VM corruption on standby

On Tue, 12 Aug 2025 at 13:00, I wrote:

While this aims to find existing VM corruption (i mean, in PG <= 17),
this reproducer does not seem to work on pg17. At least, I did not
manage to reproduce this scenario on pg17.

This makes me think this exact corruption may be pg18-only. Is it
possible that AIO is somehow involved here?

First of all, the "corruption" is reproducible with io_method = sync,
so AIO is not under suspicion.
Then, I did many gdb sessions and ended up with the conclusion that
this test is NOT a valid corruption reproducer.
So, the thing is, when you involve injection point logic, due to how
injection points are implemented, you allow postgres to enter the
WaitLatch function, which has its own logic for postmaster death
handling [0].
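
Condensed, the branch that [0] points at looks roughly like this (paraphrased, not an exact quote of waiteventset.c):

```c
/* inside WaitEventSetWaitBlock(), when the postmaster-death event fires */
if (!PostmasterIsAliveInternal())
{
	if (set->exit_on_postmaster_death)
		proc_exit(1);	/* the backend exits right here, even if it is inside
						 * a critical section, and its LWLocks are released
						 * on the way out */
	occurred_events->fd = PGINVALID_SOCKET;
	occurred_events->events = WL_POSTMASTER_DEATH;
	/* ... */
}
```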

So, when we add an injection point here, we allow this sequence of events
to happen:

1) INSERT enters `heap_insert`, modifies the HEAP page, reaches the injection point and hangs.
2) The CHECKPOINT process tries to FLUSH this page and waits.
3) kill -9 to postmaster
4) INSERT wakes up on postmaster death, goes to [0] and releases all locks.
5) The checkpointer flushes the HEAP page to disk, causing corruption.

The thing is, this execution will NOT happen without injection points.

So, overall, injection points are not suitable for this critical
section testing (I think).

==
Off-list, Andrey sent me this patch:

```
diff --git a/src/backend/storage/ipc/waiteventset.c b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	long		cur_timeout = -1;
 
 	Assert(nevents > 0);
+	Assert(CritSectionCount == 0);
 
 	/*
 	 * Initialize timeout if requested. We must record the current time so
```

The objective is to confirm our assumption that a WaitEventSetWait
call ought not to occur inside critical sections. This patch causes
`make check` to fail, indicating that this assumption is incorrect:
the assertion fires due to an AdvanceXLInsertBuffer call (which uses
condition-variable logic) inside XLogInsertRecord.

I did not find any documentation or other piece of information indicating
whether WaitEventSetWait is allowed inside critical sections. But I do
think this is bad, because we do not process interrupts during
critical sections, so it is unclear to me why we should handle
postmaster death any differently.

[0]: https://github.com/postgres/postgres/blob/393e0d2314050576c9c039853fdabe7f685a4f47/src/backend/storage/ipc/waiteventset.c#L1260-L1261

--
Best regards,
Kirill Reshke

#20 Kirill Reshke
reshkekirill@gmail.com
In reply to: Kirill Reshke (#19)
Re: VM corruption on standby

On Wed, 13 Aug 2025 at 16:15, I wrote:

I did not find any doc or other piece of information indicating
whether WaitEventSetWait and critical sections are allowed. But I do
thing this is bad, because we do not process interruptions during
critical sections, so it is unclear to me why we should handle
postmaster death any differently.

Maybe I'm very wrong about this, but I currently suspect there is
corruption involving CHECKPOINT, a process in a CRIT section, and kill -9.

The scenario I am trying to reproduce is the following:

1) Some process p1 locks some buffer (name it buf1), enters a CRIT
section, calls MarkBufferDirty and hangs inside XLogInsert on a CondVar
in (GetXLogBuffer -> AdvanceXLInsertBuffer).
2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, waiting for the lock on buf1
3) Postmaster is kill-9-ed
4) the signal of postmaster death is delivered to p1; it wakes up in the
WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
aliveness, and exits, releasing all locks.
5) p2 acquires the lock on buf1 and flushes it to disk.
6) the signal of postmaster death is delivered to p2; p2 exits.

And we now have a case where the buffer is flushed to disk while the
xlog record that describes this change never makes it to disk. This is
very bad.
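
To make the hazard concrete, here is a tiny stand-alone model of the
WAL-before-data rule this sequence breaks (plain C, not PostgreSQL code;
all names below are made up). A page may be written out only once WAL up
to the page's LSN is durable, but the hung INSERT changed the page in
memory without ever advancing its LSN, so the check lets the write through:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;

static Lsn wal_flushed_upto = 100;      /* WAL known durable up to here */

struct Page
{
    Lsn  lsn;                   /* LSN of the last *logged* change */
    bool has_unlogged_change;   /* changed in memory, no WAL yet */
};

/* Hypothetical flush: "make WAL durable up to the page LSN, then write". */
static void
flush_page(const struct Page *p)
{
    if (p->lsn > wal_flushed_upto)
        wal_flushed_upto = p->lsn;  /* stands in for an XLogFlush() call */

    printf("page written at lsn=%llu; unlogged change reaches disk: %s\n",
           (unsigned long long) p->lsn,
           p->has_unlogged_change ? "yes (bad!)" : "no");
}

int
main(void)
{
    /* p1 modified the page but hung before logging it and bumping its LSN,
     * so the page still carries the old LSN 90, which is already durable. */
    struct Page buf1 = {.lsn = 90, .has_unlogged_change = true};

    /* p1 exits and drops its lock; p2 (the checkpointer) flushes the
     * buffer: the WAL-before-data check passes, yet the change has no WAL. */
    flush_page(&buf1);
    return 0;
}
```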

To be clear, I am trying to avoid using injection points to reproduce the
corruption. I have not yet succeeded in this, though.

--
Best regards,
Kirill Reshke

#21Kirill Reshke
reshkekirill@gmail.com
In reply to: Kirill Reshke (#20)
1 attachment(s)
Re: VM corruption on standby

On Thu, 14 Aug 2025 at 10:41, Kirill Reshke <reshkekirill@gmail.com> wrote:

The scenario I am trying to reproduce is the following:

1) Some process p1 locks some buffer (name it buf1), enters a CRIT
section, calls MarkBufferDirty and hangs inside XLogInsert on a CondVar
(in GetXLogBuffer -> AdvanceXLInsertBuffer).
2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting the lock on buf1.
3) The postmaster is kill -9-ed.
4) The signal of postmaster death is delivered to p1; it wakes up in the
WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
aliveness, and exits, releasing all locks.
5) p2 acquires the lock on buf1 and flushes it to disk.
6) The signal of postmaster death is delivered to p2; p2 exits.

Andrey told me to create a CF entry and attach the fix, so I am doing that [0].

[0]: https://commitfest.postgresql.org/patch/5964/

--
Best regards,
Kirill Reshke

Attachments:

v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patchapplication/octet-stream; name=v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patchDownload
From 409f9728681851fef38c78d3a6eb9378bba3d12c Mon Sep 17 00:00:00 2001
From: reshke <reshke@double.cloud>
Date: Thu, 14 Aug 2025 06:24:55 +0000
Subject: [PATCH v1] Do not exit on postmaster death ever inside CRIT sections.

One of the essential invariants of critical sections is not to allow
any other process to observe half-way made changes.
This includes the requirement of holding locks on the modified buffer
until the xlog records describing the changes are properly created.
PostgreSQL already disallows processing any signal cancellation
inside a CRIT section, which guarantees the process does not release locks
on its buffers too early. This patch does the same inside the WaitLatch
facilities, disallowing exit on the postmaster death event inside a CRIT
section.
---
 src/backend/storage/ipc/waiteventset.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/ipc/waiteventset.c b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..26b4a8d0f2b 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1257,7 +1257,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			 */
 			if (!PostmasterIsAliveInternal())
 			{
-				if (set->exit_on_postmaster_death)
+				if (set->exit_on_postmaster_death && CritSectionCount == 0)
 					proc_exit(1);
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_POSTMASTER_DEATH;
@@ -1339,7 +1339,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 */
 	if (unlikely(set->report_postmaster_not_running))
 	{
-		if (set->exit_on_postmaster_death)
+		if (set->exit_on_postmaster_death && CritSectionCount == 0)
 			proc_exit(1);
 		occurred_events->fd = PGINVALID_SOCKET;
 		occurred_events->events = WL_POSTMASTER_DEATH;
@@ -1411,7 +1411,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			 */
 			set->report_postmaster_not_running = true;
 
-			if (set->exit_on_postmaster_death)
+			if (set->exit_on_postmaster_death && CritSectionCount == 0)
 				proc_exit(1);
 			occurred_events->fd = PGINVALID_SOCKET;
 			occurred_events->events = WL_POSTMASTER_DEATH;
@@ -1541,7 +1541,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			 */
 			if (!PostmasterIsAliveInternal())
 			{
-				if (set->exit_on_postmaster_death)
+				if (set->exit_on_postmaster_death && CritSectionCount == 0)
 					proc_exit(1);
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_POSTMASTER_DEATH;
@@ -1752,7 +1752,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			 */
 			if (!PostmasterIsAliveInternal())
 			{
-				if (set->exit_on_postmaster_death)
+				if (set->exit_on_postmaster_death && CritSectionCount == 0)
 					proc_exit(1);
 				occurred_events->fd = PGINVALID_SOCKET;
 				occurred_events->events = WL_POSTMASTER_DEATH;
-- 
2.43.0

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kirill Reshke (#21)
Re: VM corruption on standby

Kirill Reshke <reshkekirill@gmail.com> writes:

[ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ]

I do not like this patch one bit: it will replace one set of problems
with another set, namely systems that fail to shut down.

I think the actual bug here is the use of proc_exit(1) after
observing postmaster death. That is what creates the hazard,
because it releases the locks that are preventing other processes
from observing the inconsistent state in shared memory.
Compare this to what we do, for example, on receipt of SIGQUIT:

/*
* We DO NOT want to run proc_exit() or atexit() callbacks -- we're here
* because shared memory may be corrupted, so we don't want to try to
* clean up our transaction. Just nail the windows shut and get out of
* town. The callbacks wouldn't be safe to run from a signal handler,
* anyway.
*
* Note we do _exit(2) not _exit(0). This is to force the postmaster into
* a system reset cycle if someone sends a manual SIGQUIT to a random
* backend. This is necessary precisely because we don't clean up our
* shared memory state. (The "dead man switch" mechanism in pmsignal.c
* should ensure the postmaster sees this as a crash, too, but no harm in
* being doubly sure.)
*/
_exit(2);

So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
places that are responding to postmaster death. There might be
more than just WaitEventSetWaitBlock; I didn't look.
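
For illustration only (a sketch, not a reviewed or tested change), applied to
the postmaster-death branch quoted in Kirill's v1 patch above, that would look
roughly like:

```c
if (!PostmasterIsAliveInternal())
{
    if (set->exit_on_postmaster_death)
        _exit(2);       /* instead of proc_exit(1): no cleanup, mirror SIGQUIT */
    occurred_events->fd = PGINVALID_SOCKET;
    occurred_events->events = WL_POSTMASTER_DEATH;
    /* ... report the event to the caller as before ... */
}
```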

regards, tom lane

#23Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Tom Lane (#22)
Re: VM corruption on standby

On 17 Aug 2025, at 17:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
places that are responding to postmaster death.

+1. But should we _exit(2) only in critical sections, or always in case of postmaster death?

Another question that was bothering Kirill: do we want a test that naturally reproduces WaitEventSetWaitBlock() under a critical section, so that we can observe the corruption? Or is it obvious enough from the code that such things might happen?
The existing test adds WaitEventSetWaitBlock() via an injection point to a place where it was not present before. Still, with the existing test at hand we can check that the fix cures WaitEventSetWaitBlock() against critical sections.

Best regards, Andrey Borodin.

#24Kirill Reshke
reshkekirill@gmail.com
In reply to: Tom Lane (#22)
Re: VM corruption on standby

Hi! Thank you for paying attention to this.

On Sun, 17 Aug 2025 at 19:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kirill Reshke <reshkekirill@gmail.com> writes:

[ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ]

I do not like this patch one bit: it will replace one set of problems
with another set, namely systems that fail to shut down.

I did not observe this during my by-hand testing. I am under the
impression that CRIT sections are something that backend (or other)
postgres processes try to pass through quickly. So, what this patch does
is defer the process's reaction to postmaster death until the
end of the CRIT section.
So, the typical scenario here (as I understand it) is this:

(1) The process does its work and enters a CRIT section.
(2) The postmaster dies.
(3a) The signal of postmaster death (SIGPWR on my VM) is delivered to the process.
(3b) The process exits the CRIT section, then does its CFI logic, observes
the postmaster death and quits.

This is why I did my patch the way I did it. I mean, race conditions
resulting in late signal delivery are always possible anyway, so why bother?

I think the actual bug here is the use of proc_exit(1) after
observing postmaster death.

Agreed.

So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the
places that are responding to postmaster death. There might be
more than just WaitEventSetWaitBlock; I didn't look.

Well, I see that patching it this way would be a much safer way to fix the
issue. I can see that a more conservative change can be more
beneficial (more future-proof and less bug-prone).
I will take a detailed look and try to send a patch soon.

--
Best regards,
Kirill Reshke

#25Kirill Reshke
reshkekirill@gmail.com
In reply to: Kirill Reshke (#24)
1 attachment(s)
Re: VM corruption on standby

On Mon, 18 Aug 2025 at 13:15, I wrote:

I do not like this patch one bit: it will replace one set of problems
with another set, namely systems that fail to shut down.

I did not observe this during my by-hand testing.

I am sorry: I was wrong. This is exactly what happens in this test
(modified 001_multixact.pl).
To be precise, after the INSERT process exits, the CHECKPOINT process
waits indefinitely in LWLockAcquire.

It looks like the reason proc_exit(1) releases all held lwlocks
is that we use it to notify all lwlock contenders, through shared
memory, about the state change; they will not be notified otherwise, since
we do not check for signals inside LWLockAcquire.

Looks like we need to do something like ConditionVariableBroadcast(),
but without lwlock release, to notify all lwlock contenders and then
exit(2).

===
As for the fix, I am now trying to make the attached work. The idea to
"escalate" proc_exit to an immediate exit via syscall comes
from how elog(ERROR) behaves in CRIT sections (every elog(ERROR)
effectively becomes elog(PANIC)).

--
Best regards,
Kirill Reshke

Attachments:

v2-0001-Do-not-exit-on-postmaster-death-even-inside-CRIT-.patchapplication/octet-stream; name=v2-0001-Do-not-exit-on-postmaster-death-even-inside-CRIT-.patchDownload
From 553e021a7916fa15f4cd67036ba5d73c5172f54a Mon Sep 17 00:00:00 2001
From: reshke <reshke@double.cloud>
Date: Mon, 18 Aug 2025 09:35:55 +0000
Subject: [PATCH v2] Do not exit on postmaster death even inside CRIT sections.

One of the essential invariants of critical sections is not to allow
any other process to observe half-way made changes.
This includes the requirement of holding locks on the modified buffer
until the xlog records describing the changes are properly created.
PostgreSQL already disallows processing any signal cancellation
inside a CRIT section, which guarantees the process does not release locks
on its buffers too early. This patch does the same inside proc_exit(),
disallowing exit on the postmaster death event inside a CRIT
section.
---
 src/backend/storage/ipc/ipc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 2704e80b3a7..796d5df7e81 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -107,6 +107,9 @@ proc_exit(int code)
 	if (MyProcPid != (int) getpid())
 		elog(PANIC, "proc_exit() called in child process");
 
+	if (CritSectionCount != 0)
+		_exit(2);
+
 	/* Clean up everything that must be cleaned up */
 	proc_exit_prepare(code);
 
-- 
2.43.0

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kirill Reshke (#24)
Re: VM corruption on standby

Kirill Reshke <reshkekirill@gmail.com> writes:

On Sun, 17 Aug 2025 at 19:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I do not like this patch one bit: it will replace one set of problems
with another set, namely systems that fail to shut down.

I did not observe this during my by-hand testing. I am under the
impression that CRIT sections are something that backend (or other)
postgres processes try to pass through quickly. So, what this patch does
is defer the process's reaction to postmaster death until the
end of the CRIT section.

Well, if you're inside WaitEventSetWaitBlock, you have little control
over how long you're going to sit. There is a separate discussion
to be had over whether we should prohibit calling that function
inside a critical section. But I'm of the opinion that proc_exit
is the wrong thing to use after seeing postmaster death, critical
section or no. We should assume that system integrity is already
compromised, and get out as fast as we can with as few side-effects
as possible. It'll be up to the next generation of postmaster to
try to clean up.

regards, tom lane

#27Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#26)
Re: VM corruption on standby

On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

But I'm of the opinion that proc_exit
is the wrong thing to use after seeing postmaster death, critical
section or no. We should assume that system integrity is already
compromised, and get out as fast as we can with as few side-effects
as possible. It'll be up to the next generation of postmaster to
try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
someone who holds x calls _exit()?

I don't know if there are other ways that LWLockReleaseAll() can lead
to persistent corruption that won't be corrected by crash recovery,
but this one is probably new since the following commit, explaining
the failure to reproduce on v17:

commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed Apr 2 12:44:24 2025 +0300

Get rid of WALBufMappingLock

Any idea involving deferring the handling of PM death from here
doesn't seem right: you'd keep waiting for the CV, but the backend
that would wake you might have exited.

Hmm, I wonder if there could be a solution in between where we don't
release the locks on PM exit, but we still wake the waiters so they
can observe a new dead state in the lock word (or perhaps a shared
postmaster_is_dead flag), and exit themselves.
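
Very roughly, and with entirely made-up names (a sketch of the idea only, not
existing PostgreSQL code; Kirill sketches an LWLock-specific variant of this
later in the thread):

```c
/* dying lock holder, after noticing postmaster death in a critical section */
pg_atomic_write_u32(&Shmem->postmaster_is_dead, 1);   /* hypothetical flag */
WakeAllWaitersWithoutRelease(lock);                   /* hypothetical helper */
_exit(2);

/* a waiter, once woken in its LWLockAcquire()-style wait loop */
if (pg_atomic_read_u32(&Shmem->postmaster_is_dead))
    _exit(2);   /* don't acquire; the protected state may be half-updated */
```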

Nice detective work Andrey and others! That's a complicated and rare
interaction.

#28Kirill Reshke
reshkekirill@gmail.com
In reply to: Thomas Munro (#27)
Re: VM corruption on standby

Hi! Thank you for paying attention to this.

On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

But I'm of the opinion that proc_exit
is the wrong thing to use after seeing postmaster death, critical
section or no. We should assume that system integrity is already
compromised, and get out as fast as we can with as few side-effects
as possible. It'll be up to the next generation of postmaster to
try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
someone who holds x calls _exit()?

I don't know if there are other ways that LWLockReleaseAll() can lead
to persistent corruption that won't be corrected by crash recovery,
but this one is probably new since the following commit, explaining
the failure to reproduce on v17:

commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed Apr 2 12:44:24 2025 +0300

Get rid of WALBufMappingLock

Any idea involving deferring the handling of PM death from here
doesn't seem right: you'd keep waiting for the CV, but the backend
that would wake you might have exited.

OK.

Hmm, I wonder if there could be a solution in between where we don't
release the locks on PM exit, but we still wake the waiters so they
can observe a new dead state in the lock word (or perhaps a shared
postmaster_is_dead flag), and exit themselves.

Since yesterday I was thinking about adding a new state bit for
LWLockWaitState. Something like LW_WS_OWNER_DEAD, which would be set by
the lwlock owner after observing PM death and then checked by contenders
in LWLockAcquire.

so something like:

*lock holder in proc_exit(1)*
```
for each waiter of each of my lwlocks:

    waiter->lwWaiting = LW_WS_OWNER_DEAD;
    PGSemaphoreUnlock(waiter->sem);
```

*lock contender in LWLockAttemptLock*

```
old_state = pg_atomic_read_u32(&lock->state);

/* loop until we've determined whether we could acquire the lock or not */
while (true)
{
    if (old_state & (1 << LW_WS_OWNER_DEAD))
        _exit(2);    /* or maybe proc_exit(1) */
    ....

    if (pg_atomic_compare_exchange_u32(&lock->state, &old_state, desired_state))
        ...
    /* retry */
}
```

I am not sure this idea is workable though.

--
Best regards,
Kirill Reshke

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#27)
Re: VM corruption on standby

Thomas Munro <thomas.munro@gmail.com> writes:

On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

But I'm of the opinion that proc_exit
is the wrong thing to use after seeing postmaster death, critical
section or no. We should assume that system integrity is already
compromised, and get out as fast as we can with as few side-effects
as possible. It'll be up to the next generation of postmaster to
try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
someone who holds x calls _exit()?

If someone who holds x is killed by (say) the OOM killer, how do
we get out of that?

regards, tom lane

#30Kirill Reshke
reshkekirill@gmail.com
In reply to: Tom Lane (#29)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 11:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

But I'm of the opinion that proc_exit
is the wrong thing to use after seeing postmaster death, critical
section or no. We should assume that system integrity is already
compromised, and get out as fast as we can with as few side-effects
as possible. It'll be up to the next generation of postmaster to
try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
someone who holds x calls _exit()?

If someone who holds x is killed by (say) the OOM killer, how do
we get out of that?

+1. If we kill -9 the postmaster and then immediately kill -9 an lwlock holder, there
is no way for the system to shut down (on both HEAD and back branches).
So we can ignore this case.

--
Best regards,
Kirill Reshke

#31Kirill Reshke
reshkekirill@gmail.com
In reply to: Thomas Munro (#27)
2 attachment(s)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote:

I don't know if there are other ways that LWLockReleaseAll() can lead
to persistent corruption that won't be corrected by crash recovery,
but this one is probably new since the following commit, explaining
the failure to reproduce on v17:

commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683
Author: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed Apr 2 12:44:24 2025 +0300

Get rid of WALBufMappingLock

Any idea involving deferring the handling of PM death from here
doesn't seem right: you'd keep waiting for the CV, but the backend
that would wake you might have exited.

I reverted this commit (there were conflicts, but I resolved them) and
added an assert for crit sections in WaitEventSetWait.

make check passes (without v2-0001 it fails).

--
Best regards,
Kirill Reshke

Attachments:

v2-0001-Revert-Get-rid-of-WALBufMappingLock.patchapplication/octet-stream; name=v2-0001-Revert-Get-rid-of-WALBufMappingLock.patchDownload
From 4eb59878726542cbea099f5ebb5e0d692258371f Mon Sep 17 00:00:00 2001
From: reshke <reshke@double.cloud>
Date: Tue, 19 Aug 2025 06:09:45 +0000
Subject: [PATCH v2 1/2] Revert "Get rid of WALBufMappingLock"

This reverts commit bc22dc0e0ddc2dcb6043a732415019cc6b6bf683.
---
 src/backend/access/transam/xlog.c             | 230 ++++--------------
 .../utils/activity/wait_event_names.txt       |   2 +-
 src/include/storage/lwlocklist.h              |   2 +-
 3 files changed, 52 insertions(+), 182 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e8909406686..81e20459a15 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -303,6 +303,11 @@ static bool doPageWrites;
  * so it's a plain spinlock.  The other locks are held longer (potentially
  * over I/O operations), so we use LWLocks for them.  These locks are:
  *
+ * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
+ * It is only held while initializing and changing the mapping.  If the
+ * contents of the buffer being replaced haven't been written yet, the mapping
+ * lock is released while the write is done, and reacquired afterwards.
+ *
  * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
  * XLogFlush).
  *
@@ -468,6 +473,7 @@ typedef struct XLogCtlData
 	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */
 
 	/*
+<<<<<<< HEAD
 	 * First initialized page in the cache (first byte position).
 	 */
 	XLogRecPtr	InitializedFrom;
@@ -475,30 +481,23 @@ typedef struct XLogCtlData
 	/*
 	 * Latest reserved for initialization page in the cache (last byte
 	 * position + 1).
+=======
+	 * Latest initialized page in the cache (last byte position + 1).
+>>>>>>> parent of bc22dc0e0dd (Get rid of WALBufMappingLock)
 	 *
-	 * To change the identity of a buffer, you need to advance
-	 * InitializeReserved first.  To change the identity of a buffer that's
+	 * To change the identity of a buffer (and InitializedUpTo), you need to
+	 * hold WALBufMappingLock.  To change the identity of a buffer that's
 	 * still dirty, the old page needs to be written out first, and for that
 	 * you need WALWriteLock, and you need to ensure that there are no
 	 * in-progress insertions to the page by calling
 	 * WaitXLogInsertionsToFinish().
 	 */
-	pg_atomic_uint64 InitializeReserved;
-
-	/*
-	 * Latest initialized page in the cache (last byte position + 1).
-	 *
-	 * InitializedUpTo is updated after the buffer initialization.  After
-	 * update, waiters got notification using InitializedUpToCondVar.
-	 */
-	pg_atomic_uint64 InitializedUpTo;
-	ConditionVariable InitializedUpToCondVar;
+	XLogRecPtr	InitializedUpTo;
 
 	/*
 	 * These values do not change after startup, although the pointed-to pages
-	 * and xlblocks values certainly do.  xlblocks values are changed
-	 * lock-free according to the check for the xlog write position and are
-	 * accompanied by changes of InitializeReserved and InitializedUpTo.
+	 * and xlblocks values certainly do.  xlblocks values are protected by
+	 * WALBufMappingLock.
 	 */
 	char	   *pages;			/* buffers for unwritten XLOG pages */
 	pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
@@ -821,9 +820,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * fullPageWrites from changing until the insertion is finished.
 	 *
 	 * Step 2 can usually be done completely in parallel. If the required WAL
-	 * page is not initialized yet, you have to go through AdvanceXLInsertBuffer,
-	 * which will ensure it is initialized. But the WAL writer tries to do that
-	 * ahead of insertions to avoid that from happening in the critical path.
+	 * page is not initialized yet, you have to grab WALBufMappingLock to
+	 * initialize it, but the WAL writer tries to do that ahead of insertions
+	 * to avoid that from happening in the critical path.
 	 *
 	 *----------
 	 */
@@ -2005,79 +2004,32 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
-	XLogRecPtr	ReservedPtr;
 	int			npages pg_attribute_unused() = 0;
 
-	/*
-	 * We must run the loop below inside the critical section as we expect
-	 * XLogCtl->InitializedUpTo to eventually keep up.  The most of callers
-	 * already run inside the critical section. Except for WAL writer, which
-	 * passed 'opportunistic == true', and therefore we don't perform
-	 * operations that could error out.
-	 *
-	 * Start an explicit critical section anyway though.
-	 */
-	Assert(CritSectionCount > 0 || opportunistic);
-	START_CRIT_SECTION();
+	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
-	/*--
-	 * Loop till we get all the pages in WAL buffer before 'upto' reserved for
-	 * initialization.  Multiple process can initialize different buffers with
-	 * this loop in parallel as following.
-	 *
-	 * 1. Reserve page for initialization using XLogCtl->InitializeReserved.
-	 * 2. Initialize the reserved page.
-	 * 3. Attempt to advance XLogCtl->InitializedUpTo,
+	/*
+	 * Now that we have the lock, check if someone initialized the page
+	 * already.
 	 */
-	ReservedPtr = pg_atomic_read_u64(&XLogCtl->InitializeReserved);
-	while (upto >= ReservedPtr || opportunistic)
+	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
-		Assert(ReservedPtr % XLOG_BLCKSZ == 0);
+		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
 
 		/*
-		 * Get ending-offset of the buffer page we need to replace.
-		 *
-		 * We don't lookup into xlblocks, but rather calculate position we
-		 * must wait to be written. If it was written, xlblocks will have this
-		 * position (or uninitialized)
+		 * Get ending-offset of the buffer page we need to replace (this may
+		 * be zero if the buffer hasn't been used yet).  Fall through if it's
+		 * already written out.
 		 */
-		if (ReservedPtr + XLOG_BLCKSZ > XLogCtl->InitializedFrom + XLOG_BLCKSZ * XLOGbuffers)
-			OldPageRqstPtr = ReservedPtr + XLOG_BLCKSZ - (XLogRecPtr) XLOG_BLCKSZ * XLOGbuffers;
-		else
-			OldPageRqstPtr = InvalidXLogRecPtr;
-
-		if (LogwrtResult.Write < OldPageRqstPtr && opportunistic)
+		OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
+		if (LogwrtResult.Write < OldPageRqstPtr)
 		{
 			/*
-			 * If we just want to pre-initialize as much as we can without
-			 * flushing, give up now.
+			 * Nope, got work to do. If we just want to pre-initialize as much
+			 * as we can without flushing, give up now.
 			 */
-			upto = ReservedPtr - 1;
-			break;
-		}
-
-		/*
-		 * Attempt to reserve the page for initialization.  Failure means that
-		 * this page got reserved by another process.
-		 */
-		if (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
-											&ReservedPtr,
-											ReservedPtr + XLOG_BLCKSZ))
-			continue;
-
-		/*
-		 * Wait till page gets correctly initialized up to OldPageRqstPtr.
-		 */
-		nextidx = XLogRecPtrToBufIdx(ReservedPtr);
-		while (pg_atomic_read_u64(&XLogCtl->InitializedUpTo) < OldPageRqstPtr)
-			ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT);
-		ConditionVariableCancelSleep();
-		Assert(pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) == OldPageRqstPtr);
-
-		/* Fall through if it's already written out. */
-		if (LogwrtResult.Write < OldPageRqstPtr)
-		{
-			/* Nope, got work to do. */
+			if (opportunistic)
+				break;
 
 			/* Advance shared memory write request position */
 			SpinLockAcquire(&XLogCtl->info_lck);
@@ -2092,6 +2044,14 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 			RefreshXLogWriteResult(LogwrtResult);
 			if (LogwrtResult.Write < OldPageRqstPtr)
 			{
+				/*
+				 * Must acquire write lock. Release WALBufMappingLock first,
+				 * to make sure that all insertions that we need to wait for
+				 * can finish (up to this same position). Otherwise we risk
+				 * deadlock.
+				 */
+				LWLockRelease(WALBufMappingLock);
+
 				WaitXLogInsertionsToFinish(OldPageRqstPtr);
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
@@ -2119,6 +2079,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 					 */
 					pgstat_report_fixed = true;
 				}
+				/* Re-acquire WALBufMappingLock and retry */
+				LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+				continue;
 			}
 		}
 
@@ -2126,9 +2089,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 * Now the next buffer slot is free and we can set it up to be the
 		 * next output page.
 		 */
-		NewPageBeginPtr = ReservedPtr;
+		NewPageBeginPtr = XLogCtl->InitializedUpTo;
 		NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
 
+		Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
+
 		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
 		/*
@@ -2192,100 +2157,12 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 */
 		pg_write_barrier();
 
-		/*-----
-		 * Update the value of XLogCtl->xlblocks[nextidx] and try to advance
-		 * XLogCtl->InitializedUpTo in a lock-less manner.
-		 *
-		 * First, let's provide a formal proof of the algorithm.  Let it be 'n'
-		 * process with the following variables in shared memory:
-		 *	f - an array of 'n' boolean flags,
-		 *	v - atomic integer variable.
-		 *
-		 * Also, let
-		 *	i - a number of a process,
-		 *	j - local integer variable,
-		 * CAS(var, oldval, newval) - compare-and-swap atomic operation
-		 *							  returning true on success,
-		 * write_barrier()/read_barrier() - memory barriers.
-		 *
-		 * The pseudocode for each process is the following.
-		 *
-		 *	j := i
-		 *	f[i] := true
-		 *	write_barrier()
-		 *	while CAS(v, j, j + 1):
-		 *		j := j + 1
-		 *		read_barrier()
-		 *		if not f[j]:
-		 *			break
-		 *
-		 * Let's prove that v eventually reaches the value of n.
-		 * 1. Prove by contradiction.  Assume v doesn't reach n and stucks
-		 *	  on k, where k < n.
-		 * 2. Process k attempts CAS(v, k, k + 1).  1). If, as we assumed, v
-		 *	  gets stuck at k, then this CAS operation must fail.  Therefore,
-		 *    v < k when process k attempts CAS(v, k, k + 1).
-		 * 3. If, as we assumed, v gets stuck at k, then the value k of v
-		 *	  must be achieved by some process m, where m < k.  The process
-		 *	  m must observe f[k] == false.  Otherwise, it will later attempt
-		 *	  CAS(v, k, k + 1) with success.
-		 * 4. Therefore, corresponding read_barrier() (while j == k) on
-		 *	  process m reached before write_barrier() of process k.  But then
-		 *	  process k attempts CAS(v, k, k + 1) after process m successfully
-		 *	  incremented v to k, and that CAS operation must succeed.
-		 *	  That leads to a contradiction.  So, there is no such k (k < n)
-		 *    where v gets stuck.  Q.E.D.
-		 *
-		 * To apply this proof to the code below, we assume
-		 * XLogCtl->InitializedUpTo will play the role of v with XLOG_BLCKSZ
-		 * granularity.  We also assume setting XLogCtl->xlblocks[nextidx] to
-		 * NewPageEndPtr to play the role of setting f[i] to true.  Also, note
-		 * that processes can't concurrently map different xlog locations to
-		 * the same nextidx because we previously requested that
-		 * XLogCtl->InitializedUpTo >= OldPageRqstPtr.  So, a xlog buffer can
-		 * be taken for initialization only once the previous initialization
-		 * takes effect on XLogCtl->InitializedUpTo.
-		 */
-
 		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
-
-		pg_write_barrier();
-
-		while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr))
-		{
-			NewPageBeginPtr = NewPageEndPtr;
-			NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
-			nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr);
-
-			pg_read_barrier();
-
-			if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr)
-			{
-				/*
-				 * Page at nextidx wasn't initialized yet, so we can't move
-				 * InitializedUpto further. It will be moved by backend which
-				 * will initialize nextidx.
-				 */
-				ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
-				break;
-			}
-		}
+		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
 		npages++;
 	}
-
-	END_CRIT_SECTION();
-
-	/*
-	 * All the pages in WAL buffer before 'upto' were reserved for
-	 * initialization.  However, some pages might be reserved by concurrent
-	 * processes.  Wait till they finish initialization.
-	 */
-	while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
-		ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT);
-	ConditionVariableCancelSleep();
-
-	pg_read_barrier();
+	LWLockRelease(WALBufMappingLock);
 
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
@@ -5178,10 +5055,6 @@ XLOGShmemInit(void)
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
-
-	pg_atomic_init_u64(&XLogCtl->InitializeReserved, InvalidXLogRecPtr);
-	pg_atomic_init_u64(&XLogCtl->InitializedUpTo, InvalidXLogRecPtr);
-	ConditionVariableInit(&XLogCtl->InitializedUpToCondVar);
 }
 
 /*
@@ -6205,8 +6078,7 @@ StartupXLOG(void)
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
 		pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
-		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
-		XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr;
+		XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
 	}
 	else
 	{
@@ -6215,10 +6087,8 @@ StartupXLOG(void)
 		 * let the first attempt to insert a log record to initialize the next
 		 * buffer.
 		 */
-		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
-		XLogCtl->InitializedFrom = EndOfLog;
+		XLogCtl->InitializedUpTo = EndOfLog;
 	}
-	pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
 
 	/*
 	 * Update local and shared status.  This is OK to do without any locks
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..5427da5bc1b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,7 +156,6 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
-WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to be generated."
@@ -316,6 +315,7 @@ XidGen	"Waiting to allocate a new transaction ID."
 ProcArray	"Waiting to access the shared per-process data structures (typically, to get a snapshot or report a session's transaction ID)."
 SInvalRead	"Waiting to retrieve messages from the shared catalog invalidation queue."
 SInvalWrite	"Waiting to add a message to the shared catalog invalidation queue."
+WALBufMapping	"Waiting to replace a page in WAL buffers."
 WALWrite	"Waiting for WAL buffers to be written to disk."
 ControlFile	"Waiting to read or update the <filename>pg_control</filename> file or create a new WAL file."
 MultiXactGen	"Waiting to read or update shared multixact state."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 208d2e3a8ed..06a1ffd4b08 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -38,7 +38,7 @@ PG_LWLOCK(3, XidGen)
 PG_LWLOCK(4, ProcArray)
 PG_LWLOCK(5, SInvalRead)
 PG_LWLOCK(6, SInvalWrite)
-/* 7 was WALBufMapping */
+PG_LWLOCK(7, WALBufMapping)
 PG_LWLOCK(8, WALWrite)
 PG_LWLOCK(9, ControlFile)
 /* 10 was CheckpointLock */
-- 
2.43.0

v2-0002-Add-assertion-for-WaitEventSetWait-in-crit-sectio.patchapplication/octet-stream; name=v2-0002-Add-assertion-for-WaitEventSetWait-in-crit-sectio.patchDownload
From 566dfd994283822926a5a3231c92ee512d541a3c Mon Sep 17 00:00:00 2001
From: reshke <reshke@double.cloud>
Date: Tue, 19 Aug 2025 06:21:11 +0000
Subject: [PATCH v2 2/2] Add assertion for WaitEventSetWait in crit section

---
 src/backend/storage/ipc/waiteventset.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/storage/ipc/waiteventset.c b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	long		cur_timeout = -1;
 
 	Assert(nevents > 0);
+	Assert(CritSectionCount == 0);
 
 	/*
 	 * Initialize timeout if requested.  We must record the current time so
-- 
2.43.0

#32Kirill Reshke
reshkekirill@gmail.com
In reply to: Kirill Reshke (#31)
Re: VM corruption on standby
#33Tomas Vondra
tomas@vondra.me
In reply to: Kirill Reshke (#32)
Re: VM corruption on standby

On 8/19/25 11:14, Kirill Reshke wrote:

This thread is a candidate for [0]

[0] https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items

Added, with Alexander as an owner (assuming it really is caused by
commit bc22dc0e0d).

regards

--
Tomas Vondra

#34Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kirill Reshke (#14)
Re: VM corruption on standby

10.08.2025 08:45, Kirill Reshke wrote:

On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev
<aleksander@tigerdata.com> wrote:

For this reason we have PageHeaderData.pd_lsn for instance - to make sure
pages are evicted only *after* the record that changed it is written
to disk (because WAL records can't be applied to pages from the
future).

We don't bump the LSN of the heap page when setting the visibility
map bit.

And that is very bad!!!
In fact, I believe "incremental backup" is not safe because of that.

I guess the intent here could be to do an optimization of some sort
but the facts that 1. the instance can be killed at any time and 2.
there might be replicas - were not considered.

IMHO: logging the changes first, then allowing the page to be evicted.

Clearing the VM before logging the changes was intentional [0].
So I assume we should not change the approach, but rather just tweak
things a bit to make the whole thing work.

[0] /messages/by-id/BANLkTimuLk4RHXSQHEEiYGbxiXp2mh5KCA@mail.gmail.com

Let's cite that message:

On Fri, May 6, 2011 at 5:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:

On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

Another question:
To address the problem in
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
, should we just clear the vm before the log of insert/update/delete?
This may reduce the performance, is there another solution?

Yeah, that's a straightforward way to fix it. I don't think the performance
hit will be too bad. But we need to be careful not to hold locks while doing
I/O, which might require some rearrangement of the code. We might want to do
a similar dance that we do in vacuum, and call visibilitymap_pin first, then
lock and update the heap page, and then set the VM bit while holding the
lock on the heap page.

Here's an attempt at implementing the necessary gymnastics.

So we see: Heikki said "lock and update the heap page, and then set the VM
bit while holding the lock on the heap page". He didn't say "set the VM bit
before logging the heap update". It is not clear why Robert put visibilitymap_clear
before logging the update, so your point that it "was intentional" is not valid.

Calling visibilitymap_clear after logging, but before exiting the critical
section (as Aleksander Alekseev did in his suggested patch), is the correct way
to do it.
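
To spell out that ordering (a rough sketch of the idea only, not Aleksander's
actual patch; the function names are the usual heapam/xlog ones):

```c
START_CRIT_SECTION();

/* ... apply the change to the heap page ... */
MarkBufferDirty(buffer);

/* log the change first and stamp the heap page with the record's LSN */
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);

/* only now clear the VM bits, still inside the critical section */
if (PageIsAllVisible(page))
    visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                        vmbuffer, VISIBILITYMAP_VALID_BITS);

END_CRIT_SECTION();
```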

--
regards
Yura Sokolov aka funny-falcon

#35Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kirill Reshke (#12)
Re: VM corruption on standby

09.08.2025 22:54, Kirill Reshke wrote:

On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev
<aleksander@tigerdata.com> wrote:

Perhaps there was a good
reason to update the VM *before* creating WAL records I'm unaware of.

Looks like 503c730 intentionally does it this way; however, I have not
yet fully understood the reasoning behind it.

I repeat: there was no intention, neither in the commit message nor in
the discussion around it.

There was an intention to move visibilitymap_clear under the heap page lock and
into the critical section, but there was not a word about logging.

I believe it was just an unfortunate oversight that the change is made
before logging.

--
regards
Yura Sokolov aka funny-falcon

#36Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#35)
Re: VM corruption on standby

Hi,

On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote:

09.08.2025 22:54, Kirill Reshke wrote:

On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev
<aleksander@tigerdata.com> wrote:

Perhaps there was a good
reason to update the VM *before* creating WAL records I'm unaware of.

Looks like 503c730 intentionally does it this way; however, I have not
yet fully understood the reasoning behind it.

I repeat: there was no intention, neither in the commit message nor in
the discussion around it.

There was an intention to move visibilitymap_clear under the heap page lock and
into the critical section, but there was not a word about logging.

I believe it was just an unfortunate oversight that the change is made
before logging.

The normal pattern *is* to modify the buffer while holding an exclusive lock,
in a critical section, before WAL logging. Look at
src/backend/access/transam/README:

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION() (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
Note that marking a buffer dirty with MarkBufferDirty() should only
happen iff you write a WAL record; see Writing Hints below.

5. If the relation requires WAL-logging, build a WAL record using
XLogBeginInsert and XLogRegister* functions, and insert it. (See
"Constructing a WAL record" below). Then update the page's LSN using the
returned XLOG location. For instance,

XLogBeginInsert();
XLogRegisterBuffer(...)
XLogRegisterData(...)
recptr = XLogInsert(rmgr_id, info);

PageSetLSN(dp, recptr);

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

Greetings,

Andres Freund

#37Kirill Reshke
reshkekirill@gmail.com
In reply to: Kirill Reshke (#32)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill@gmail.com> wrote:

This thread is a candidate for [0]

[0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items

Let me summarize this thread to make it easier to understand what's going on:

Timeline:
1) Andrey Borodin sends a patch (on 6 Aug) claiming there is
corruption in VM bits.
2) We investigated the problem and found it is not with how PostgreSQL modifies buffers
or logs changes, but with LWLockReleaseAll in proc_exit(1) after
kill -9 of the postmaster.
3) We have reached the conclusion that there is no corruption, and
that injection points are not a valid way to reproduce it, because
of WaitLatch and friends.

4) But we now suspect there is another corruption with ANY critical
section in the following scenario:

I wrote:

Maybe I'm very wrong about this, but I'm currently suspecting there is
corruption involving CHECKPOINT, a process in a CRIT section, and kill -9.
1) Some process p1 locks some buffer (name it buf1), enters a CRIT
section, calls MarkBufferDirty and hangs inside XLogInsert on a CondVar
(in GetXLogBuffer -> AdvanceXLInsertBuffer).
2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting the lock on buf1.
3) The postmaster is kill -9-ed.
4) The signal of postmaster death is delivered to p1; it wakes up in the
WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
aliveness, and exits, releasing all locks.
5) p2 acquires the lock on buf1 and flushes it to disk.
6) The signal of postmaster death is delivered to p2; p2 exits.

5) We created an open item for pg18 and propose reverting
bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly.

Please note that the patches in this thread are NOT a reproducer of the
corruption; as of today we have NO valid repro of the corruption.

--
Best regards,
Kirill Reshke

#38Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#36)
Re: VM corruption on standby

19.08.2025 16:09, Andres Freund wrote:

Hi,

On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote:

09.08.2025 22:54, Kirill Reshke wrote:

On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev
<aleksander@tigerdata.com> wrote:

Perhaps there was a good
reason to update the VM *before* creating WAL records I'm unaware of.

Looks like 503c730 intentionally does it this way; however, I have not
yet fully understood the reasoning behind it.

I repeat: there was no intention, neither in the commit message nor in
the discussion around it.

There was an intention to move visibilitymap_clear under the heap page lock and
into the critical section, but there was not a word about logging.

I believe it was just an unfortunate oversight that the change is made
before logging.

The normal pattern *is* to modify the buffer while holding an exclusive lock,
in a critical section, before WAL logging. Look at
src/backend/access/transam/README:

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION() (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)
Note that marking a buffer dirty with MarkBufferDirty() should only
happen iff you write a WAL record; see Writing Hints below.

5. If the relation requires WAL-logging, build a WAL record using
XLogBeginInsert and XLogRegister* functions, and insert it. (See
"Constructing a WAL record" below). Then update the page's LSN using the
returned XLOG location. For instance,

XLogBeginInsert();
XLogRegisterBuffer(...)
XLogRegisterData(...)
recptr = XLogInsert(rmgr_id, info);

PageSetLSN(dp, recptr);

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

There is quite an important step in this instruction:

Then update the page's LSN using the returned XLOG location.

This step is violated for the call of visibilitymap_clear.

Though I'm probably mistaken that this is the source of the bug. But it really
is a source of other kinds of issues.

--
regards
Yura Sokolov aka funny-falcon

#39Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kirill Reshke (#37)
Re: VM corruption on standby

19.08.2025 16:17, Kirill Reshke wrote:

On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill@gmail.com> wrote:

This thread is a candidate for [0]

[0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items

Let me summarize this thread to make it easier to understand what's going on:

Timeline:
1) Andrey Borodin sends a patch (on 6 Aug) claiming there is
corruption in VM bits.
2) We investigated the problem and found it is not with how PostgreSQL modifies buffers
or logs changes, but with LWLockReleaseAll in proc_exit(1) after
kill -9 of the postmaster.
3) We have reached the conclusion that there is no corruption, and
that injection points are not a valid way to reproduce it, because
of WaitLatch and friends.

4) But we now suspect there is another corruption with ANY critical
section in the following scenario:

I wrote:

Maybe I'm very wrong about this, but I'm currently suspecting there is
corruption involving CHECKPOINT, a process in a CRIT section, and kill -9.
1) Some process p1 locks some buffer (name it buf1), enters a CRIT
section, calls MarkBufferDirty and hangs inside XLogInsert on a CondVar
(in GetXLogBuffer -> AdvanceXLInsertBuffer).
2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting the lock on buf1.
3) The postmaster is kill -9-ed.
4) The signal of postmaster death is delivered to p1; it wakes up in the
WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
aliveness, and exits, releasing all locks.
5) p2 acquires the lock on buf1 and flushes it to disk.
6) The signal of postmaster death is delivered to p2; p2 exits.

5) We created an open item for pg18 and propose reverting
bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fixing it quickly.

Latch and ConditionVariable (that uses Latch) are among basic
synchronization primitives in PostgreSQL.
Therefore they have to work correctly in any place: in critical section, in
wal logging, etc.
The current behavior of WaitEventSetWaitBlock is certainly a bug and it
ought to be fixed.
So +1 for _exit(2) as Tom suggested.

--
regards
Yura Sokolov aka funny-falcon

#40Kirill Reshke
reshkekirill@gmail.com
In reply to: Yura Sokolov (#39)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 18:29, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Latch and ConditionVariable (that uses Latch) are among basic
synchronization primitives in PostgreSQL.

Sure

Therefore they have to work correctly in any place: in critical section, in
wal logging, etc.

No. Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 ConditionVariable
code path was never exercised in critical sections. After
bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 it is exercised in almost
every one (if the system is highly loaded). This is a crucial change
with corruption as a drawback (until we fix this).

Replacing proc_exit(1) with _exit(2) is not a cure either: if we exit
inside a CRIT section without any message to the LWLock contenders, they
will never do the same (never exit), because they wait on the
semaphore and do not respond to signals (so the only way to stop them is
kill -9). Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683, LWLock
holders did not exit inside crit sections (unless killed with -9).

I had one suggestion about what could be done [0]. However, there is
little to no time until the pg18 release for a change that scary and big
(my own understanding), so the safest option is to revert.

[0]: /messages/by-id/CALdSSPgDAyqt=ORyLMWMpotb9V4Jk1Am+he39mNtBA8+a8TQDw@mail.gmail.com
--
Best regards,
Kirill Reshke

#41Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#29)
Re: VM corruption on standby

Hi,

On 2025-08-19 02:13:43 -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Tue, Aug 19, 2025 at 4:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

But I'm of the opinion that proc_exit
is the wrong thing to use after seeing postmaster death, critical
section or no. We should assume that system integrity is already
compromised, and get out as fast as we can with as few side-effects
as possible. It'll be up to the next generation of postmaster to
try to clean up.

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
someone who holds x calls _exit()?

If someone who holds x is killed by (say) the OOM killer, how do
we get out of that?

On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
waiters would get killed due to the postmaster death signal we've configured
(c.f. PostmasterDeathSignalInit()).

A long while back I had experimented with replacing waiting on semaphores
(within lwlocks) with a latch wait. IIRC it was a bit slower under heavy
contention, but that vanished when adding some adaptive spinning to lwlocks -
which is also what we need to make it more feasible to replace some of the
remaining spinlocks...

Greetings,

Andres Freund

#42Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#41)
Re: VM corruption on standby

On Wed, Aug 20, 2025 at 1:56 AM Andres Freund <andres@anarazel.de> wrote:

On 2025-08-19 02:13:43 -0400, Tom Lane wrote:

Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after
someone who holds x calls _exit()?

If someone who holds x is killed by (say) the OOM killer, how do
we get out of that?

If a backend is killed by the OOM killer, the postmaster will of
course send SIGQUIT/SIGKILL to all backends. If the postmaster itself
is killed, then surviving backends will notice at their next
WaitEventSetWait() and exit, but if any are blocked in sem_wait(),
they will only make progress because other exiting backends release
their LWLocks on their way out. So if we change that to _exit(), I
assume such backends would linger forever in sem_wait() after the
postmaster dies. I do agree that it seems quite weird to release all
locks as if this is a "normal" exit though, which is why Kirill and I
both wondered about other ways to boot them out of sem_wait()...

On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
waiters would get killed due to the postmaster death signal we've configured
(c.f. PostmasterDeathSignalInit()).

No, that has a handler that just sets a global variable. That was
done because recovery used to try to read() from the postmaster pipe
after replaying every record. Also we currently have some places that
don't want to be summarily killed (off the top of my head, syncrep
wants to send a special error message, and the logger wants to survive
longer than everyone else to catch as much output as possible, things
I've been thinking about in the context of threads).

#43Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#42)
Re: VM corruption on standby

Hi,

On 2025-08-20 02:54:09 +1200, Thomas Munro wrote:

On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
waiters would get killed due to the postmaster death signal we've configured
(c.f. PostmasterDeathSignalInit()).

No, that has a handler that just sets a global variable. That was
done because recovery used to try to read() from the postmaster pipe
after replaying every record. Also we currently have some places that
don't want to be summarily killed (off the top of my head, syncrep
wants to send a special error message, and the logger wants to survive
longer than everyone else to catch as much output as possible, things
I've been thinking about in the context of threads).

That makes no sense. We should just _exit(). If postmaster has been killed,
trying to stay up longer just makes everything more fragile. Waiting for the
logger is *exactly* what we should *not* do - what if the logger also crashed?
There's no postmaster around to start it.

Greetings,

Andres Freund

#44Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#43)
Re: VM corruption on standby

On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:

On 2025-08-20 02:54:09 +1200, Thomas Munro wrote:

On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
waiters would get killed due to the postmaster death signal we've configured
(c.f. PostmasterDeathSignalInit()).

No, that has a handler that just sets a global variable. That was
done because recovery used to try to read() from the postmaster pipe
after replaying every record. Also we currently have some places that
don't want to be summarily killed (off the top of my head, syncrep
wants to send a special error message, and the logger wants to survive
longer than everyone else to catch as much output as possible, things
I've been thinking about in the context of threads).

That makes no sense. We should just _exit(). If postmaster has been killed,
trying to stay up longer just makes everything more fragile. Waiting for the
logger is *exactly* what we should *not* do - what if the logger also crashed?
There's no postmaster around to start it.

Nobody is waiting for the logger. The logger waits for everyone else
to exit first to collect forensics:

* Unlike all other postmaster child processes, we'll ignore postmaster
* death because we want to collect final log output from all backends and
* then exit last. We'll do that by running until we see EOF on the
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).

The syncrep case is a bit weirder: it wants to tell the user that
syncrep is broken, so its own WaitEventSetWait() has
WL_POSTMASTER_DEATH, but that's basically bogus because the backend
can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code
paths. I've proposed nuking that before.

#45Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#44)
Re: VM corruption on standby

Hi,

On 2025-08-20 03:19:38 +1200, Thomas Munro wrote:

On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:

On 2025-08-20 02:54:09 +1200, Thomas Munro wrote:

On linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
waiters would get killed due to the postmaster death signal we've configured
(c.f. PostmasterDeathSignalInit()).

No, that has a handler that just sets a global variable. That was
done because recovery used to try to read() from the postmaster pipe
after replaying every record. Also we currently have some places that
don't want to be summarily killed (off the top of my head, syncrep
wants to send a special error message, and the logger wants to survive
longer than everyone else to catch as much output as possible, things
I've been thinking about in the context of threads).

That makes no sense. We should just _exit(). If postmaster has been killed,
trying to stay up longer just makes everything more fragile. Waiting for the
logger is *exactly* what we should *not* do - what if the logger also crashed?
There's no postmaster around to start it.

Nobody is waiting for the logger.

Error messages that we might be printing will wait for logger if the pipe is
full, no?

The logger waits for everyone else to exit first to collect forensics:

* Unlike all other postmaster child processes, we'll ignore postmaster
* death because we want to collect final log output from all backends and
* then exit last. We'll do that by running until we see EOF on the
* syslog pipe, which implies that all other backends have exited
* (including the postmaster).

The syncrep case is a bit weirder: it wants to tell the user that
syncrep is broken, so its own WaitEventSetWait() has
WL_POSTMASTER_DEATH, but that's basically bogus because the backend
can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code
paths. I've proposed nuking that before.

Yea, that's just bogus.

I think this is one more instance of "let's try hard to continue limping
along" making things way more fragile than the simpler "let's just do
crash-restart in the most normal way possible".

Greetings,

Andres Freund

#46Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Kirill Reshke (#40)
Re: VM corruption on standby

19.08.2025 16:43, Kirill Reshke wrote:

On Tue, 19 Aug 2025 at 18:29, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

Latch and ConditionVariable (that uses Latch) are among basic
synchronization primitives in PostgreSQL.

Sure

Therefore they have to work correctly in any place: in critical section, in
wal logging, etc.

No. Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 ConditionVariable
code path was never exercised in critical sections. After
bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 it is exercised in almost
every one (if the system is highly loaded). This is a crucial change
with corruption as a drawback (until we fix this).

To replace proc_exit(1) with _exit(2) is not a cure either: if we exit
inside a CRIT section without any message to LWLock contenders, they
will never do the same (never exit), because they are waiting on the
semaphore and do not respond to signals (so the only way to stop them is
kill -9). Before bc22dc0e0ddc2dcb6043a732415019cc6b6bf683, LWLock
holders did not exit inside crit sections (unless killed with -9).

That is not true.
elog(PANIC) doesn't clear LWLocks. And XLogWrite, which could be called
from AdvanceXLInsertBuffer, may call elog(PANIC) from several places.

It doesn't lead to any error, because usually the postmaster is alive and it
will kill -9 all its children if any one of them dies in a critical section.

So the problem is that the postmaster is already killed with SIGKILL by
definition of the issue.

Documentation says [0]:

If at all possible, do not use SIGKILL to kill the main postgres server.
Doing so will prevent postgres from freeing the system resources (e.g.,
shared memory and semaphores) that it holds before terminating.

Therefore, if the postmaster is SIGKILL-ed, the administrator already has
to take some action.

So the issue Andrey Borodin raised is that, without any fix, PG18 allows
inconsistency between the data and the WAL: data could be written to the main
fork and the VM fork although the WAL is not written yet IF THE POSTMASTER IS
SIGKILL-ed.

`if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in
WaitEventSetWaitBlock() solves the inconsistency issue IF THE POSTMASTER IS
SIGKILL-ed, and doesn't lead to any problem if the postmaster is not
SIGKILL-ed (since the postmaster will SIGKILL its children).
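
A minimal sketch of that, assuming the WL_EXIT_ON_PM_DEATH handling in
WaitEventSetWaitBlock() currently boils down to a plain proc_exit(1); the
exit_on_postmaster_death flag below is how I read that code, so this is an
illustration only, not a reviewed patch:

```
	if (set->exit_on_postmaster_death)
	{
		if (CritSectionCount != 0)
			_exit(2);		/* skip shmem cleanup inside a critical section */
		else
			proc_exit(1);	/* existing behaviour */
	}
```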

I had one suggestion about what can be done [0]. However there is little to
no time until the pg18 release for a change that scary and big (my own
understanding), so the safest option is to revert.

[0] /messages/by-id/CALdSSPgDAyqt=ORyLMWMpotb9V4Jk1Am+he39mNtBA8+a8TQDw@mail.gmail.com

[0]: https://www.postgresql.org/docs/17/app-postgres.html

--
regards
Yura Sokolov aka funny-falcon

#47Kirill Reshke
reshkekirill@gmail.com
In reply to: Yura Sokolov (#46)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

That is not true.
elog(PANIC) doesn't clear LWLocks. And XLogWrite, which is could be called
from AdvanceXLInsertBuffer, may call elog(PANIC) from several places.

It doesn't lead to any error, because usually postmaster is alive and it
will kill -9 all its children if any one is died in critical section.

So the problem is postmaster is already killed with SIGKILL by definition
of the issue.

Documentation says [0]:

If at all possible, do not use SIGKILL to kill the main postgres server.
Doing so will prevent postgres from freeing the system resources (e.g.,
shared memory and semaphores) that it holds before terminating.

Therefore, if the postmaster is SIGKILL-ed, the administrator already has
to take some action.

There are surely many cases when a system reaches a state which can
only be fixed by admin action.
An elog(PANIC) in a CRIT section is very rare (and very probably means
corruption already).
A simpler example is to kill -9 the postmaster and then immediately
kill -9 someone who holds an LWLock.
The problem in pgv18 is that the probability of this state is much higher
due to the aforementioned commit. It can happen with almost
any OOM on a highly loaded system.

--
Best regards,
Kirill Reshke

#48Kirill Reshke
reshkekirill@gmail.com
In reply to: Yura Sokolov (#46)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

`if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in
WaitEventSetWaitBlock() solves the inconsistency issue IF THE POSTMASTER IS
SIGKILL-ed, and doesn't lead to any problem if the postmaster is not
SIGKILL-ed (since the postmaster will SIGKILL its children).

This fix was proposed in this thread. It fixes inconsistency but it
replaces one set of problems
with another set, namely systems that fail to shut down.

--
Best regards,
Kirill Reshke

#49Kirill Reshke
reshkekirill@gmail.com
In reply to: Andres Freund (#45)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 20:24, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-08-20 03:19:38 +1200, Thomas Munro wrote:

On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:

On 2025-08-20 02:54:09 +1200, Thomas Munro wrote:

On Linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
waiters would get killed due to the postmaster death signal we've configured
(c.f. PostmasterDeathSignalInit()).

No, that has a handler that just sets a global variable. That was
done because recovery used to try to read() from the postmaster pipe
after replaying every record. Also we currently have some places that
don't want to be summarily killed (off the top of my head, syncrep
wants to send a special error message, and the logger wants to survive
longer than everyone else to catch as much output as possible, things
I've been thinking about in the context of threads).

That makes no sense. We should just _exit(). If postmaster has been killed,
trying to stay up longer just makes everything more fragile. Waiting for the
logger is *exactly* what we should *not* do - what if the logger also crashed?
There's no postmaster around to start it.

Nobody is waiting for the logger.

Error messages that we might be printing will wait for logger if the pipe is
full, no?

I did some checking of crit sections for elog usage, and I did not really
find any crit section that uses elog(elevel) with elevel < ERROR. But
surely there are cases when we might be printing messages inside crit
sections, or there could be such sections in back branches or future
branches.
Anyway, this case, where we hang indefinitely while waiting for an
(already dead) logger, might be a rare one.
The problem is, even without a logger, on current HEAD we will fail
to stop the system when the PM dies, and there is no simple fix.
It would be very helpful if the LWLock implementation were done using
latch waits; there would be no problem then.
But we are where we are, so I can see that it makes sense to try to
notify other processes that we are going to die soon and that they need
to do the same (through shared memory), and then _exit(1).

--
Best regards,
Kirill Reshke

#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kirill Reshke (#48)
Re: VM corruption on standby

Kirill Reshke <reshkekirill@gmail.com> writes:

On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

`if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in
WaitEventSetWaitBlock() solves the inconsistency issue IF THE POSTMASTER IS
SIGKILL-ed, and doesn't lead to any problem if the postmaster is not
SIGKILL-ed (since the postmaster will SIGKILL its children).

This fix was proposed in this thread. It fixes inconsistency but it
replaces one set of problems with another set, namely systems that
fail to shut down.

I think a bigger objection is that it'd result in two separate
shutdown behaviors in what's already an extremely under-tested
(and hard to test) scenario. I don't want to have to deal with
the ensuing state-space explosion.

I still think that proc_exit(1) is fundamentally the wrong thing
to do if the postmaster is gone: that code path assumes that
the cluster is still functional, which is at best shaky.
I concur though that we'd have to do some more engineering work
before _exit(2) would be a practical solution.

In the meantime, it seems like this discussion point arises
only because the presented test case is doing something that
seems pretty unsafe, namely invoking WaitEventSet inside a
critical section.

We'd probably be best off to get back to the actual bug the
thread started with, namely whether we aren't doing the wrong
thing with VM-update order of operations.

regards, tom lane

#51Kirill Reshke
reshkekirill@gmail.com
In reply to: Tom Lane (#50)
Re: VM corruption on standby

On Tue, 19 Aug 2025 at 23:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kirill Reshke <reshkekirill@gmail.com> writes:

On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

`if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in
WaitEventSetWaitBlock() solves the inconsistency issue IF THE POSTMASTER IS
SIGKILL-ed, and doesn't lead to any problem if the postmaster is not
SIGKILL-ed (since the postmaster will SIGKILL its children).

This fix was proposed in this thread. It fixes inconsistency but it
replaces one set of problems with another set, namely systems that
fail to shut down.

I think a bigger objection is that it'd result in two separate
shutdown behaviors in what's already an extremely under-tested
(and hard to test) scenario. I don't want to have to deal with
the ensuing state-space explosion.

Agreed.

I still think that proc_exit(1) is fundamentally the wrong thing
to do if the postmaster is gone: that code path assumes that
the cluster is still functional, which is at best shaky.
I concur though that we'd have to do some more engineering work
before _exit(2) would be a practical solution.

Agreed.

In the meantime, it seems like this discussion point arises
only because the presented test case is doing something that
seems pretty unsafe, namely invoking WaitEventSet inside a
critical section.

Agreed.

We'd probably be best off to get back to the actual bug the
thread started with, namely whether we aren't doing the wrong
thing with VM-update order of operations.

regards, tom lane

My understanding is that there is no bug in the VM. At least not in
the [0] test, because it uses an injection point in the CRIT section,
making the server exit too early.
So, behaviour with an injection point and without is very different.
The corruption we are looking for has yet to be reproduced (see [1]).

[0]: /messages/by-id/B3C69B86-7F82-4111-B97F-0005497BB745@yandex-team.ru
[1]: /messages/by-id/CALdSSPhGQ1xx10c2NaZgce8qmi+SuKFp6T1uWG_aZvPpvoJRkQ@mail.gmail.com

--
Best regards,
Kirill Reshke

#52Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Kirill Reshke (#51)
Re: VM corruption on standby

On 19 Aug 2025, at 23:23, Kirill Reshke <reshkekirill@gmail.com> wrote:

We'd probably be best off to get back to the actual bug the
thread started with, namely whether we aren't doing the wrong
thing with VM-update order of operations.

regards, tom lane

My understanding is that there is no bug in the VM. At least not in
the [0] test, because it uses an injection point in the CRIT section,
making the server exit too early.
So, behaviour with an injection point and without is very different.
The corruption we are looking for has yet to be reproduced (see [1]).

I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test, because injection points rely on a CondVar, which per se creates corruption in a critical section. So I'm reading this discussion and wondering whether the CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.

Best regards, Andrey Borodin.

#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kirill Reshke (#31)
1 attachment(s)
Re: VM corruption on standby

Kirill Reshke <reshkekirill@gmail.com> writes:

On Tue, 19 Aug 2025 at 10:32, Thomas Munro <thomas.munro@gmail.com> wrote:

Any idea involving deferring the handling of PM death from here
doesn't seem right: you'd keep waiting for the CV, but the backend
that would wake you might have exited.

Yeah. Taking the check for PM death out of here seems just about
as hazardous as leaving it in :-(. Not something I want to mess
with so late in the v18 cycle.

I reverted this commit (there were conflicts but I resolved them) and
added an assert for crit sections in WaitEventSetWait.

Your patch still contains some conflict markers :-(. Attached is
a corrected version, just to save other people the effort of fixing
the diffs themselves.

make check passes (without v2-0001 it fails)

While 'make check' is okay with this assertion, 'make check-world'
still falls over if you have injection points enabled, because
src/test/modules/test_slru/t/001_multixact.pl also had the
cute idea of putting an injection-point wait inside a critical
section. I did not find any other failures though.

I'm inclined to think that we do want to prohibit WaitEventSetWait
inside a critical section --- it just seems like a bad idea all
around, even without considering this specific failure mode.
Therefore, I vote for reverting bc22dc0e0. Hopefully only
temporarily, but it's too late to figure out another way for v18,
and I don't think that bc22dc0e0 is such an essential improvement
that we can't afford to give it up for v18.

However, we can't install the proposed assertion until we do
something about that test_slru test. It seems like in general
it would be sad if we can't put injection points inside
critical sections, so I'm wondering if there's a way to
re-implement the injection point "wait" functionality without
depending on WaitEventSetWait. I would be willing to accept
weaker safety guarantees in this context, since we only anticipate
such cases being used in test scaffolding.

regards, tom lane

Attachments:

v3-0001-Revert-Get-rid-of-WALBufMappingLock.patchtext/x-diff; charset=us-ascii; name=v3-0001-Revert-Get-rid-of-WALBufMappingLock.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e8909406686..7ffb2179151 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -303,6 +303,11 @@ static bool doPageWrites;
  * so it's a plain spinlock.  The other locks are held longer (potentially
  * over I/O operations), so we use LWLocks for them.  These locks are:
  *
+ * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
+ * It is only held while initializing and changing the mapping.  If the
+ * contents of the buffer being replaced haven't been written yet, the mapping
+ * lock is released while the write is done, and reacquired afterwards.
+ *
  * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
  * XLogFlush).
  *
@@ -468,37 +473,21 @@ typedef struct XLogCtlData
 	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */
 
 	/*
-	 * First initialized page in the cache (first byte position).
-	 */
-	XLogRecPtr	InitializedFrom;
-
-	/*
-	 * Latest reserved for initialization page in the cache (last byte
-	 * position + 1).
+	 * Latest initialized page in the cache (last byte position + 1).
 	 *
-	 * To change the identity of a buffer, you need to advance
-	 * InitializeReserved first.  To change the identity of a buffer that's
+	 * To change the identity of a buffer (and InitializedUpTo), you need to
+	 * hold WALBufMappingLock.  To change the identity of a buffer that's
 	 * still dirty, the old page needs to be written out first, and for that
 	 * you need WALWriteLock, and you need to ensure that there are no
 	 * in-progress insertions to the page by calling
 	 * WaitXLogInsertionsToFinish().
 	 */
-	pg_atomic_uint64 InitializeReserved;
-
-	/*
-	 * Latest initialized page in the cache (last byte position + 1).
-	 *
-	 * InitializedUpTo is updated after the buffer initialization.  After
-	 * update, waiters got notification using InitializedUpToCondVar.
-	 */
-	pg_atomic_uint64 InitializedUpTo;
-	ConditionVariable InitializedUpToCondVar;
+	XLogRecPtr	InitializedUpTo;
 
 	/*
 	 * These values do not change after startup, although the pointed-to pages
-	 * and xlblocks values certainly do.  xlblocks values are changed
-	 * lock-free according to the check for the xlog write position and are
-	 * accompanied by changes of InitializeReserved and InitializedUpTo.
+	 * and xlblocks values certainly do.  xlblocks values are protected by
+	 * WALBufMappingLock.
 	 */
 	char	   *pages;			/* buffers for unwritten XLOG pages */
 	pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
@@ -821,9 +810,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * fullPageWrites from changing until the insertion is finished.
 	 *
 	 * Step 2 can usually be done completely in parallel. If the required WAL
-	 * page is not initialized yet, you have to go through AdvanceXLInsertBuffer,
-	 * which will ensure it is initialized. But the WAL writer tries to do that
-	 * ahead of insertions to avoid that from happening in the critical path.
+	 * page is not initialized yet, you have to grab WALBufMappingLock to
+	 * initialize it, but the WAL writer tries to do that ahead of insertions
+	 * to avoid that from happening in the critical path.
 	 *
 	 *----------
 	 */
@@ -2005,79 +1994,32 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
-	XLogRecPtr	ReservedPtr;
 	int			npages pg_attribute_unused() = 0;
 
-	/*
-	 * We must run the loop below inside the critical section as we expect
-	 * XLogCtl->InitializedUpTo to eventually keep up.  The most of callers
-	 * already run inside the critical section. Except for WAL writer, which
-	 * passed 'opportunistic == true', and therefore we don't perform
-	 * operations that could error out.
-	 *
-	 * Start an explicit critical section anyway though.
-	 */
-	Assert(CritSectionCount > 0 || opportunistic);
-	START_CRIT_SECTION();
+	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
-	/*--
-	 * Loop till we get all the pages in WAL buffer before 'upto' reserved for
-	 * initialization.  Multiple process can initialize different buffers with
-	 * this loop in parallel as following.
-	 *
-	 * 1. Reserve page for initialization using XLogCtl->InitializeReserved.
-	 * 2. Initialize the reserved page.
-	 * 3. Attempt to advance XLogCtl->InitializedUpTo,
+	/*
+	 * Now that we have the lock, check if someone initialized the page
+	 * already.
 	 */
-	ReservedPtr = pg_atomic_read_u64(&XLogCtl->InitializeReserved);
-	while (upto >= ReservedPtr || opportunistic)
+	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
 	{
-		Assert(ReservedPtr % XLOG_BLCKSZ == 0);
+		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
 
 		/*
-		 * Get ending-offset of the buffer page we need to replace.
-		 *
-		 * We don't lookup into xlblocks, but rather calculate position we
-		 * must wait to be written. If it was written, xlblocks will have this
-		 * position (or uninitialized)
+		 * Get ending-offset of the buffer page we need to replace (this may
+		 * be zero if the buffer hasn't been used yet).  Fall through if it's
+		 * already written out.
 		 */
-		if (ReservedPtr + XLOG_BLCKSZ > XLogCtl->InitializedFrom + XLOG_BLCKSZ * XLOGbuffers)
-			OldPageRqstPtr = ReservedPtr + XLOG_BLCKSZ - (XLogRecPtr) XLOG_BLCKSZ * XLOGbuffers;
-		else
-			OldPageRqstPtr = InvalidXLogRecPtr;
-
-		if (LogwrtResult.Write < OldPageRqstPtr && opportunistic)
+		OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
+		if (LogwrtResult.Write < OldPageRqstPtr)
 		{
 			/*
-			 * If we just want to pre-initialize as much as we can without
-			 * flushing, give up now.
+			 * Nope, got work to do. If we just want to pre-initialize as much
+			 * as we can without flushing, give up now.
 			 */
-			upto = ReservedPtr - 1;
-			break;
-		}
-
-		/*
-		 * Attempt to reserve the page for initialization.  Failure means that
-		 * this page got reserved by another process.
-		 */
-		if (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
-											&ReservedPtr,
-											ReservedPtr + XLOG_BLCKSZ))
-			continue;
-
-		/*
-		 * Wait till page gets correctly initialized up to OldPageRqstPtr.
-		 */
-		nextidx = XLogRecPtrToBufIdx(ReservedPtr);
-		while (pg_atomic_read_u64(&XLogCtl->InitializedUpTo) < OldPageRqstPtr)
-			ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT);
-		ConditionVariableCancelSleep();
-		Assert(pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) == OldPageRqstPtr);
-
-		/* Fall through if it's already written out. */
-		if (LogwrtResult.Write < OldPageRqstPtr)
-		{
-			/* Nope, got work to do. */
+			if (opportunistic)
+				break;
 
 			/* Advance shared memory write request position */
 			SpinLockAcquire(&XLogCtl->info_lck);
@@ -2092,6 +2034,14 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 			RefreshXLogWriteResult(LogwrtResult);
 			if (LogwrtResult.Write < OldPageRqstPtr)
 			{
+				/*
+				 * Must acquire write lock. Release WALBufMappingLock first,
+				 * to make sure that all insertions that we need to wait for
+				 * can finish (up to this same position). Otherwise we risk
+				 * deadlock.
+				 */
+				LWLockRelease(WALBufMappingLock);
+
 				WaitXLogInsertionsToFinish(OldPageRqstPtr);
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
@@ -2119,6 +2069,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 					 */
 					pgstat_report_fixed = true;
 				}
+				/* Re-acquire WALBufMappingLock and retry */
+				LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+				continue;
 			}
 		}
 
@@ -2126,9 +2079,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 * Now the next buffer slot is free and we can set it up to be the
 		 * next output page.
 		 */
-		NewPageBeginPtr = ReservedPtr;
+		NewPageBeginPtr = XLogCtl->InitializedUpTo;
 		NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
 
+		Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
+
 		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
 		/*
@@ -2192,100 +2147,12 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 */
 		pg_write_barrier();
 
-		/*-----
-		 * Update the value of XLogCtl->xlblocks[nextidx] and try to advance
-		 * XLogCtl->InitializedUpTo in a lock-less manner.
-		 *
-		 * First, let's provide a formal proof of the algorithm.  Let it be 'n'
-		 * process with the following variables in shared memory:
-		 *	f - an array of 'n' boolean flags,
-		 *	v - atomic integer variable.
-		 *
-		 * Also, let
-		 *	i - a number of a process,
-		 *	j - local integer variable,
-		 * CAS(var, oldval, newval) - compare-and-swap atomic operation
-		 *							  returning true on success,
-		 * write_barrier()/read_barrier() - memory barriers.
-		 *
-		 * The pseudocode for each process is the following.
-		 *
-		 *	j := i
-		 *	f[i] := true
-		 *	write_barrier()
-		 *	while CAS(v, j, j + 1):
-		 *		j := j + 1
-		 *		read_barrier()
-		 *		if not f[j]:
-		 *			break
-		 *
-		 * Let's prove that v eventually reaches the value of n.
-		 * 1. Prove by contradiction.  Assume v doesn't reach n and stucks
-		 *	  on k, where k < n.
-		 * 2. Process k attempts CAS(v, k, k + 1).  1). If, as we assumed, v
-		 *	  gets stuck at k, then this CAS operation must fail.  Therefore,
-		 *    v < k when process k attempts CAS(v, k, k + 1).
-		 * 3. If, as we assumed, v gets stuck at k, then the value k of v
-		 *	  must be achieved by some process m, where m < k.  The process
-		 *	  m must observe f[k] == false.  Otherwise, it will later attempt
-		 *	  CAS(v, k, k + 1) with success.
-		 * 4. Therefore, corresponding read_barrier() (while j == k) on
-		 *	  process m reached before write_barrier() of process k.  But then
-		 *	  process k attempts CAS(v, k, k + 1) after process m successfully
-		 *	  incremented v to k, and that CAS operation must succeed.
-		 *	  That leads to a contradiction.  So, there is no such k (k < n)
-		 *    where v gets stuck.  Q.E.D.
-		 *
-		 * To apply this proof to the code below, we assume
-		 * XLogCtl->InitializedUpTo will play the role of v with XLOG_BLCKSZ
-		 * granularity.  We also assume setting XLogCtl->xlblocks[nextidx] to
-		 * NewPageEndPtr to play the role of setting f[i] to true.  Also, note
-		 * that processes can't concurrently map different xlog locations to
-		 * the same nextidx because we previously requested that
-		 * XLogCtl->InitializedUpTo >= OldPageRqstPtr.  So, a xlog buffer can
-		 * be taken for initialization only once the previous initialization
-		 * takes effect on XLogCtl->InitializedUpTo.
-		 */
-
 		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
-
-		pg_write_barrier();
-
-		while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr))
-		{
-			NewPageBeginPtr = NewPageEndPtr;
-			NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
-			nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr);
-
-			pg_read_barrier();
-
-			if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr)
-			{
-				/*
-				 * Page at nextidx wasn't initialized yet, so we can't move
-				 * InitializedUpto further. It will be moved by backend which
-				 * will initialize nextidx.
-				 */
-				ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
-				break;
-			}
-		}
+		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
 		npages++;
 	}
-
-	END_CRIT_SECTION();
-
-	/*
-	 * All the pages in WAL buffer before 'upto' were reserved for
-	 * initialization.  However, some pages might be reserved by concurrent
-	 * processes.  Wait till they finish initialization.
-	 */
-	while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
-		ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT);
-	ConditionVariableCancelSleep();
-
-	pg_read_barrier();
+	LWLockRelease(WALBufMappingLock);
 
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
@@ -5178,10 +5045,6 @@ XLOGShmemInit(void)
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
-
-	pg_atomic_init_u64(&XLogCtl->InitializeReserved, InvalidXLogRecPtr);
-	pg_atomic_init_u64(&XLogCtl->InitializedUpTo, InvalidXLogRecPtr);
-	ConditionVariableInit(&XLogCtl->InitializedUpToCondVar);
 }
 
 /*
@@ -6205,8 +6068,7 @@ StartupXLOG(void)
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
 		pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
-		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
-		XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr;
+		XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
 	}
 	else
 	{
@@ -6215,10 +6077,8 @@ StartupXLOG(void)
 		 * let the first attempt to insert a log record to initialize the next
 		 * buffer.
 		 */
-		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
-		XLogCtl->InitializedFrom = EndOfLog;
+		XLogCtl->InitializedUpTo = EndOfLog;
 	}
-	pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
 
 	/*
 	 * Update local and shared status.  This is OK to do without any locks
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..5427da5bc1b 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,7 +156,6 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
-WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to be generated."
@@ -316,6 +315,7 @@ XidGen	"Waiting to allocate a new transaction ID."
 ProcArray	"Waiting to access the shared per-process data structures (typically, to get a snapshot or report a session's transaction ID)."
 SInvalRead	"Waiting to retrieve messages from the shared catalog invalidation queue."
 SInvalWrite	"Waiting to add a message to the shared catalog invalidation queue."
+WALBufMapping	"Waiting to replace a page in WAL buffers."
 WALWrite	"Waiting for WAL buffers to be written to disk."
 ControlFile	"Waiting to read or update the <filename>pg_control</filename> file or create a new WAL file."
 MultiXactGen	"Waiting to read or update shared multixact state."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 208d2e3a8ed..06a1ffd4b08 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -38,7 +38,7 @@ PG_LWLOCK(3, XidGen)
 PG_LWLOCK(4, ProcArray)
 PG_LWLOCK(5, SInvalRead)
 PG_LWLOCK(6, SInvalWrite)
-/* 7 was WALBufMapping */
+PG_LWLOCK(7, WALBufMapping)
 PG_LWLOCK(8, WALWrite)
 PG_LWLOCK(9, ControlFile)
 /* 10 was CheckpointLock */
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrey Borodin (#52)
Re: VM corruption on standby

Andrey Borodin <x4mmm@yandex-team.ru> writes:

I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test, because injection points rely on a CondVar, which per se creates corruption in a critical section. So I'm reading this discussion and wondering whether the CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.

Yeah, I was coming to similar conclusions in the reply I just sent:
we don't really want a policy that we can't put injection-point-based
delays inside critical sections. So that infrastructure is leaving
something to be desired.

Having said that, the test script is also doing something we tell
people not to do, namely SIGKILL the postmaster. Could we use
SIGQUIT (immediate shutdown) instead?

regards, tom lane

#55Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#54)
Re: VM corruption on standby

On Tue, Aug 19, 2025 at 03:55:41PM -0400, Tom Lane wrote:

Yeah, I was coming to similar conclusions in the reply I just sent:
we don't really want a policy that we can't put injection-point-based
delays inside critical sections. So that infrastructure is leaving
something to be desired.

Yes, it doesn't make sense to restrict the use of injection point
waits within critical sections. A simple switch that we could do is
to rely on a clock-based check in the wait() routine, removing the
condition variable part. It costs in responsiveness because the
wakeup() routine would not be able to ping the wait() part to recheck
the shmem counter immediately. But we could reduce the delay with a
variable recheck timing, say double the time to recheck the counter
after each loop, capped at a maximum of a couple of hundred ms so that
it's still good enough on fast machines and does not stress slow
machines too much. That costs a bit more in terms of clock calls and
delay checks, but it would be low-level enough that the internal
interrupts would not matter if we rely on syscalls, I guess? And we
don't care about efficiency in this case.
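
As a minimal sketch of that loop, assuming a hypothetical shmem counter
that the wakeup() routine increments (names are illustrative, not the
module's actual API):

```
	int			sleep_ms = 1;
	uint32		seen = pg_atomic_read_u32(&condition->wakeup_count);

	while (pg_atomic_read_u32(&condition->wakeup_count) == seen)
	{
		pg_usleep(sleep_ms * 1000L);		/* no latch, CV or WaitEventSet */
		sleep_ms = Min(sleep_ms * 2, 200);	/* double each loop, cap at 200ms */
	}
```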
--
Michael

#56Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#53)
Re: VM corruption on standby

On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm inclined to think that we do want to prohibit WaitEventSetWait
inside a critical section --- it just seems like a bad idea all
around, even without considering this specific failure mode.

FWIW aio/README.md describes a case where we'd need to wait for an IO,
which might involve a CV to wait for an IO worker to do something, in
order to start writing WAL, which is in a CS. Starting IO involves
calling pgaio_io_acquire(), and if it can't find a handle it calls
pgaio_io_wait_for_free(). That's all hypothetical for now as v18 is
only doing reads, but it's an important architectural principle. That
makes me suspect this new edict can't be the final policy, even if v18
uses it to solve the immediate problem.

For v19 I think we should probably attack the original sin and make
this work. Several mechanisms for unwedging LWLockAcquire() have been
mentioned: (1) the existing LWLockReleaseAll(), which clearly makes
bogus assumptions about system state and cannot stay, (2) some new
thing that would sem_post() all the waiters having set flags that
cause LWLockAcquire() to exit (ie a sort of multiplexing, making our
semaphore-based locks inch towards latch-nature), (3) moving LWLock
over to latches, so the wait would already be multiplexed with PM
death detection, (4) having the parent death signal handler exit
directly (unfortunately Linux and FreeBSD only*), (5) in
multi-threaded prototype work, the whole process exits anyway taking
all backend threads with it** which is a strong motivation to make
multi-process mode act as much like that as possible, e.g. something that
exits a lot more eagerly and hopefully preemptively than today.

* That's an IRIX invention picked up by Linux and FreeBSD, a sort of
reverse SIGCHLD, and I've tried to recreate it for pure POSIX systems
before. (1) Maybe it's enough for any backend that learns of
postmaster death to signal everyone else since they can't all be
blocked in sig_wait() unless there is already a deadlock. (2) I once
tried making the postmaster deathpipe O_ASYNC so that the "owner" gets
a signal when it becomes readable, but it turned out to require a
separate postmaster pipe for every backend (not just a dup'd
descriptor); perhaps this would be plausible if we already had a
bidirectional postmaster control socket protocol and choose to give
every backend process its own socket pair in MP mode, something I've
been looking into for various other reasons.

** I've been studying the unusual logger case in this context and
contemplated running it as a separate process even in MT mode, as its
stated aim didn't sound crazy to me and I was humbly attempting to
preserve that characteristic in MT mode. Another way to achieve MP/MT
consistency is to decide that the MP design already isn't robust
enough on full pipe and just nuke the logger like everything else.
Reading Andres's earlier comments, I briefly wondered about a
compromise where log senders would make a best effort to send
nonblockingly when they know the postmaster is gone, but that's
neither as reliable as whoever wrote that had in mind (and in their
defence, the logger is basically independent of shared memory state so
whether it should be exiting ASAP or draining final log statements is
at least debatable; barring bugs, it's only going to block progress if
your system is really hosed), nor free entirely of "limping along"
syndrome as Andres argues quite compellingly, so I cancelled that
thought.
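
For what it's worth, option (4) above would amount to roughly the
following, assuming the handler installed by PostmasterDeathSignalInit()
is the thing to change (a sketch only, and obviously limited to platforms
with a parent-death signal):

```
static void
postmaster_death_handler(SIGNAL_ARGS)
{
	/* Hypothetical: exit immediately instead of just setting a flag. */
	_exit(2);
}
```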

#57Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#56)
Re: VM corruption on standby

On Wed, Aug 20, 2025 at 11:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:

they can't all be
blocked in sig_wait() unless there is already a deadlock.

s/sig_wait()/sem_wait()/

#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#56)
Re: VM corruption on standby

Thomas Munro <thomas.munro@gmail.com> writes:

On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm inclined to think that we do want to prohibit WaitEventSetWait
inside a critical section --- it just seems like a bad idea all
around, even without considering this specific failure mode.

FWIW aio/README.md describes a case where we'd need to wait for an IO,
which might involve a CV to wait for an IO worker to do something, in
order to start writing WAL, which is in a CS.

Hm. It still makes me mighty uncomfortable, because the point of a
critical section is "crash the database if anything goes wrong during
this bit". Waiting for another process --- or thread --- greatly
increases the scope of ways for things to go wrong. So I'm not
exactly convinced that this aspect of the AIO architecture is
well-thought-out.

Having said that, we should in any case have a better story on
what WaitEventSetWait should do after detecting postmaster death.
So I'm all for trying to avoid the proc_exit path if we can
design a better answer.

regards, tom lane

#59Kirill Reshke
reshkekirill@gmail.com
In reply to: Tom Lane (#53)
Re: VM corruption on standby

On Wed, 20 Aug 2025 at 00:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kirill Reshke <reshkekirill@gmail.com> writes:

I reverted this commit (there were conflicts but I resolved them) and
added an assert for crit sections in WaitEventSetWait.

Your patch still contains some conflict markers :-(. Attached is
a corrected version, just to save other people the effort of fixing
the diffs themselves.

Oh, yes, it indeed contains some conflict markers, but surprisingly it
compiles without errors; that's why I missed it.
The attached looks good to me.

While 'make check' is okay with this assertion, 'make check-world'
still falls over if you have injection points enabled, because
src/test/modules/test_slru/t/001_multixact.pl also had the
cute idea of putting an injection-point wait inside a critical
section. I did not find any other failures though.

Okay, we probably need to reimplement the wait function in injection
points without a latch, got it.

I'm inclined to think that we do want to prohibit WaitEventSetWait
inside a critical section --- it just seems like a bad idea all
around, even without considering this specific failure mode.
Therefore, I vote for reverting bc22dc0e0. Hopefully only
temporarily, but it's too late to figure out another way for v18,
and I don't think that bc22dc0e0 is such an essential improvement
that we can't afford to give it up for v18.

+1

--
Best regards,
Kirill Reshke

#60Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#58)
Re: VM corruption on standby

Hi,

On 2025-08-19 23:47:21 -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Wed, Aug 20, 2025 at 7:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm inclined to think that we do want to prohibit WaitEventSetWait
inside a critical section --- it just seems like a bad idea all
around, even without considering this specific failure mode.

FWIW aio/README.md describes a case where we'd need to wait for an IO,
which might involve a CV to wait for an IO worker to do something, in
order to start writing WAL, which is in a CS.

Hm. It still makes me mighty uncomfortable, because the point of a
critical section is "crash the database if anything goes wrong during
this bit". Waiting for another process --- or thread --- greatly
increases the scope of ways for things to go wrong. So I'm not
exactly convinced that this aspect of the AIO architecture is
well-thought-out.

I don't see the alternative:

1) Some IO is done in critical sections (e.g. WAL writes / flushes)

2) Sometimes we need to wait for already started IO in critical sections
(also WAL)

3) With some ways of doing AIO the IO is offloaded to other processes, and
thus waiting for the IO to complete always requires waiting for another
process

How could we avoid the need to wait for another process in critical sections
given these points?

Greetings,

Andres Freund

#61Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#60)
Re: VM corruption on standby

On Wed, Aug 20, 2025 at 09:14:04AM -0400, Andres Freund wrote:

On 2025-08-19 23:47:21 -0400, Tom Lane wrote:

Hm. It still makes me mighty uncomfortable, because the point of a
critical section is "crash the database if anything goes wrong during
this bit". Waiting for another process --- or thread --- greatly
increases the scope of ways for things to go wrong. So I'm not
exactly convinced that this aspect of the AIO architecture is
well-thought-out.

I don't see the alternative:

1) Some IO is done in critical sections (e.g. WAL writes / flushes)

2) Sometimes we need to wait for already started IO in critical sections
(also WAL)

3) With some ways of doing AIO the IO is offloaded to other processes, and
thus waiting for the IO to complete always requires waiting for another
process

How could we avoid the need to wait for another process in critical sections
given these points?

Yes, it comes down to the point that for some code paths we just cannot
accept a soft failure: some IOs are critical enough that if they fail,
the only thing we should and can do is to recover and replay based
on the past IOs that we know did succeed and made it durably to disk.

Having what can be qualified as safe and efficient to use in a
critical section for event broadcasting and waits would be really,
really nice.
--
Michael

#62Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Tom Lane (#54)
Re: VM corruption on standby

On 20 Aug 2025, at 00:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrey Borodin <x4mmm@yandex-team.ru> writes:

I believe there is a bug with PageIsAllVisible(page) && visibilitymap_clear(). But I cannot prove it with an injection point test, because injection points rely on a CondVar, which per se creates corruption in a critical section. So I'm reading this discussion and wondering whether the CondVar will be fixed in some clever way or I'd better invent a new injection point wait mechanism.

Yeah, I was coming to similar conclusions in the reply I just sent:
we don't really want a policy that we can't put injection-point-based
delays inside critical sections. So that infrastructure is leaving
something to be desired.

Having said that, the test script is also doing something we tell
people not to do, namely SIGKILL the postmaster. Could we use
SIGQUIT (immediate shutdown) instead?

I'm working backwards from corruptions I see in our production.
And almost always I see stormbringers like OOM, power outages or Debian scripts that (I think) do kill -9 when `service postgresql stop` takes too long.

Best regards, Andrey Borodin.

#63Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#58)
Re: VM corruption on standby

On Wed, Aug 20, 2025 at 3:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Having said that, we should in any case have a better story on
what WaitEventSetWait should do after detecting postmaster death.
So I'm all for trying to avoid the proc_exit path if we can
design a better answer.

Yeah. I've posted a concept patch in a new thread:

/messages/by-id/CA+hUKGKp0kTpummCPa97+WFJTm+uYzQ9Ex8UMdH8ZXkLwO0QgA@mail.gmail.com

#64Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#53)
Re: VM corruption on standby

Hi, Tom!

On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm inclined to think that we do want to prohibit WaitEventSetWait
inside a critical section --- it just seems like a bad idea all
around, even without considering this specific failure mode.
Therefore, I vote for reverting bc22dc0e0. Hopefully only
temporarily, but it's too late to figure out another way for v18,
and I don't think that bc22dc0e0 is such an essential improvement
that we can't afford to give it up for v18.

I'm OK about this. Do you mind if I revert bc22dc0e0 myself?
And let's retry it for v19.

------
Regards,
Alexander Korotkov
Supabase

#65Michael Paquier
michael@paquier.xyz
In reply to: Alexander Korotkov (#64)
Re: VM corruption on standby

On Fri, Aug 22, 2025 at 01:27:17AM +0300, Alexander Korotkov wrote:

I'm OK about this. Do you mind if I revert bc22dc0e0 myself?
And let's retry it for v19.

Yes, agreed that it may be the best thing to do for v18 based on
the information we have gathered until now.
--
Michael

#66Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#64)
Re: VM corruption on standby

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Therefore, I vote for reverting bc22dc0e0. Hopefully only
temporarily, but it's too late to figure out another way for v18,
and I don't think that bc22dc0e0 is such an essential improvement
that we can't afford to give it up for v18.

I'm OK about this. Do you mind if I revert bc22dc0e0 myself?

Not at all, go for it.

And let's retry it for v19.

+1

regards, tom lane

#67Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Korotkov (#64)
Re: VM corruption on standby

On Fri, Aug 22, 2025 at 10:27 AM Alexander Korotkov
<aekorotkov@gmail.com> wrote:

And let's retry it for v19.

+1

I'm hoping we can fix PM death handling soon, and then I assume this
can go straight back in without modification. CVs are an essential
low level synchronisation component that really should work in lots of
environments, but we need to straighten out some historical mistakes
and rough edges. Commit cfdf4dc4 removed open-coded exits to fix a
lot of bugs of omission, but it failed to fully consider all the
consequences of "composition", ie hiding that behaviour out of sight.
We really need a sort of postmasterless PANIC here, and I am happy to
work on that (see new thread), not least because it aligns better with
the behaviour of a multithreaded server. There, the answer is "what
other backends?" so I'd already been looking sideways at system states
including the lingering logger, lingering query execution and the
special but inherently flaky error reporting in a few spots.

#68Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#66)
Re: VM corruption on standby

[RMT hat]

On Thu, Aug 21, 2025 at 06:42:48PM -0400, Tom Lane wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Tue, Aug 19, 2025 at 10:50 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Therefore, I vote for reverting bc22dc0e0. Hopefully only
temporarily, but it's too late to figure out another way for v18,
and I don't think that bc22dc0e0 is such an essential improvement
that we can't afford to give it up for v18.

I'm OK about this. Do you mind if I revert bc22dc0e0 myself?

Not at all, go for it.

Now that this is reverted, can the related open item be marked as resolved?

--
nathan

#69Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#68)
Re: VM corruption on standby

On Mon, Aug 25, 2025 at 10:07:26AM -0500, Nathan Bossart wrote:

Now that this is reverted, can the related open item be marked as resolved?

Since there has been no further discussion, I will go ahead and resolve the
open item.

--
nathan

#70Alexander Korotkov
aekorotkov@gmail.com
In reply to: Kirill Reshke (#16)
Re: VM corruption on standby

On Tue, Aug 12, 2025 at 8:38 AM Kirill Reshke <reshkekirill@gmail.com>
wrote:

On Wed, 6 Aug 2025 at 20:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi hackers!

I was reviewing the patch about removing xl_heap_visible and found the

VM\WAL machinery very interesting.

At Yandex we had several incidents with corrupted VM and on pgconf.dev

colleagues from AWS confirmed that they saw something similar too.

So I toyed around and accidentally wrote a test that reproduces $subj.

I think the corruption happens as follows:
0. we create a table with one frozen tuple
1. next heap_insert() clears VM bit and hangs immediately, nothing was

logged yet

2. VM buffer is flushed on disk with checkpointer or bgwriter
3. primary is killed with -9
now we have a page that is ALL_VISIBLE\ALL_FORZEN on standby, but clear

VM bits on primary

4. subsequent insert does not set XLH_LOCK_ALL_FROZEN_CLEARED in it's

WAL record

5. pg_visibility detects corruption

Interestingly, in an off-list conversation Melanie explained me how

ALL_VISIBLE is protected from this: WAL-logging depends on PD_ALL_VISIBLE
heap page bit, not a state of the VM. But for ALL_FROZEN this is not a case:

/* Clear only the all-frozen bit on visibility map if needed */
if (PageIsAllVisible(page) &&
visibilitymap_clear(relation, block, vmbuffer,
VISIBILITYMAP_ALL_FROZEN))
cleared_all_frozen = true; // this won't happen due to flushed

VM buffer before a crash

Anyway, the test reproduces corruption of both bits. And also

reproduces selecting deleted data on standby.

The test is not intended to be committed when we fix the problem, so

some waits are simulated with sleep(1) and test is placed at
modules/test_slru where it was easier to write. But if we ever want
something like this - I can design a less hacky version. And, probably,
more generic.

Thanks!

Best regards, Andrey Borodin.

Attached reproduces the same but without any standby node. CHECKPOINT
somehow manages to flush the heap page when the instance is kill-9-ed.
As a result, we have inconsistency between heap and VM pages:

```
reshke=# select * from pg_visibility('x');
blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
0 | t | t | f
(1 row)
```

Notice I moved the INJECTION point one line above visibilitymap_clear.
Without this change, such behaviour is also reproduced, but with much
less frequency.

BTW, I've tried this patch on the current master, where bc22dc0e0d was
reverted. And it fails for me.

t/001_multixact.pl .. 1/?
# Failed test 'pg_check_frozen() observes corruption'
# at t/001_multixact.pl line 102.
# got: '(0,2)
# (0,3)
# (0,4)'
# expected: ''

# Failed test 'pg_check_visible() observes corruption'
# at t/001_multixact.pl line 103.
# got: '(0,2)
# (0,4)'
# expected: ''

# Failed test 'deleted data returned by select'
# at t/001_multixact.pl line 104.
# got: '2'
# expected: ''
# Looks like you failed 3 tests of 3.
t/001_multixact.pl .. Dubious, test returned 3 (wstat 768, 0x300)
Failed 3/3 subtests

Test Summary Report
-------------------
t/001_multixact.pl (Wstat: 768 Tests: 3 Failed: 3)
Failed tests: 1-3
Non-zero exit status: 3
Files=1, Tests=3, 2 wallclock secs ( 0.01 usr 0.00 sys + 0.09 cusr 0.27
csys = 0.37 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:452: check] Error 1

Could you please recheck?

------
Regards,
Alexander Korotkov
Supabase

#71Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Alexander Korotkov (#70)
Re: VM corruption on standby

On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Could you, please, recheck?

That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.

The actual patch to detect the problem is much simpler:
```
diff --git a/src/backend/storage/ipc/waiteventset.c
b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
       long            cur_timeout = -1;

Assert(nevents > 0);
+ Assert(CritSectionCount == 0);

/*
* Initialize timeout if requested. We must record the current time so
```

Though it will fail in multixact test.

Best regards, Andrey Borodin.

#72Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrey Borodin (#71)
Re: VM corruption on standby

On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Could you, please, recheck?

That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.

The actual patch to detect the problem is much simpler:
```
diff --git a/src/backend/storage/ipc/waiteventset.c
b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
long            cur_timeout = -1;

Assert(nevents > 0);
+ Assert(CritSectionCount == 0);

/*
* Initialize timeout if requested. We must record the current time so
```

Though it will fail in multixact test.

I thought that patch allowed reproducing the problem of bc22dc0e0d.
Sorry for the noise.

------
Regards,
Alexander Korotkov
Supabase

#73Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#72)
1 attachment(s)
Re: VM corruption on standby

On Wed, Sep 3, 2025 at 11:28 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 3 Sep 2025, at 11:37, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Could you, please, recheck?

That patch also adds a CondVar sleep in a critical section. That patch is how we understood that such a sleep is dangerous.

The actual patch to detect the problem is much simpler:
```
diff --git a/src/backend/storage/ipc/waiteventset.c
b/src/backend/storage/ipc/waiteventset.c
index 7c0e66900f9..e89e1d115cb 100644
--- a/src/backend/storage/ipc/waiteventset.c
+++ b/src/backend/storage/ipc/waiteventset.c
@@ -1044,6 +1044,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
long            cur_timeout = -1;

Assert(nevents > 0);
+ Assert(CritSectionCount == 0);

/*
* Initialize timeout if requested. We must record the current time so
```

Though it will fail in multixact test.

I thought that patch allowed reproducing the problem of bc22dc0e0d.
Sorry for the noise.

The attached is my attempt to rework the bc22dc0e0d. The problem
discussed in this thread occurs because condition variable ends the
process with proc_exit(1) in the case of postmaster death even in the
critical section. However, the shared memory cleanup happening after
proc_exit(1) being called in the critical section is leaving the
shared memory in the inconsistent state. Generally I see two possible
behaviors in this situation.

1) Exit without the shared memory cleanup (e.g. proc_exit(2) as
proposed upthread). That would save us from data corruption caused by
the inconsistent shared memory state. However, if our process has
occupied shared resources (e.g. locks), they wouldn't be released.
That might cause other processes hand waiting for those resources.
2) Continue to wait after postmaster death. While bc22dc0e0d was the
first case when we waited on conditional variable in the critical
section, there are a lot of cases where we're taking LWLock's in the
critical section. Unlike condition variable, lwlock will continue to
wait in the case of postmaster death. The critical section will
continue its operation until completion, postmaster death will be
handled afterwards. If we would apply the same logic to the arbitrary
condition variable, it might however cause our process to stuck if
we're waiting for something from another process, which would exit
seeing postmaster death.

I think approach #2 is more appropriate for bc22dc0e0d, because in
the critical section we only wait for other processes that are also in
a critical section (so there is no risk of them exiting immediately
after postmaster death and leaving us stuck). I've implemented a patch
where waiting on the condition variable is replaced with LWLock-style
waiting on a semaphore. However, custom waiting code just for
AdvanceXLInsertBuffer() doesn't look good. I believe we need a more
general solution. We might have a special kind of condition variable,
a critical section condition variable, where both waiting and
signaling must be invoked only in a critical section. However, I dug
into our Latch and WaitEventSet, and it seems there are too many
assumptions about postmaster death there. So, a critical section
condition variable should probably be implemented on top of a
semaphore. Any thoughts?
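
To make that idea more concrete, here is a minimal sketch of what such a
critical section condition variable could look like on top of the process
semaphore. It is loosely modeled on the AdvanceXLWaitForInitialization()
code in the attached patch; the CritConditionVariable, CritCVInit,
CritCVSleep and CritCVBroadcast names are invented for illustration, and
this is a sketch rather than a proposal:
```
#include "postgres.h"

#include "miscadmin.h"
#include "port/atomics.h"
#include "storage/lwlock.h"
#include "storage/pg_sema.h"
#include "storage/proc.h"
#include "storage/proclist.h"
#include "storage/spin.h"

/*
 * A condition variable usable inside critical sections: waiters sleep on
 * their PGPROC semaphore instead of their latch, so postmaster death is
 * simply not observed until the critical section ends.
 */
typedef struct CritConditionVariable
{
	slock_t		mutex;			/* protects the wait list */
	proclist_head wakeup;		/* waiting PGPROCs, linked via lwWaitLink */
} CritConditionVariable;

static void
CritCVInit(CritConditionVariable *cv)
{
	SpinLockInit(&cv->mutex);
	proclist_init(&cv->wakeup);
}

/*
 * Park on the wait list and sleep.  As with a regular ConditionVariable,
 * the caller must re-test its predicate in a loop around this call.  A
 * real implementation would also need a re-check after enqueueing (as
 * AdvanceXLWaitForInitialization() does) to avoid missing a wakeup; that
 * is omitted here for brevity.
 */
static void
CritCVSleep(CritConditionVariable *cv)
{
	int			extraWaits = 0;

	Assert(CritSectionCount > 0);

	SpinLockAcquire(&cv->mutex);
	MyProc->lwWaiting = LW_WS_WAITING;
	proclist_push_tail(&cv->wakeup, MyProcNumber, lwWaitLink);
	SpinLockRelease(&cv->mutex);

	/* Sleep until a broadcaster clears lwWaiting and unlocks our sema. */
	for (;;)
	{
		PGSemaphoreLock(MyProc->sem);
		if (MyProc->lwWaiting == LW_WS_NOT_WAITING)
			break;
		extraWaits++;
	}

	/* Fix the semaphore count for any absorbed extra wakeups. */
	while (extraWaits-- > 0)
		PGSemaphoreUnlock(MyProc->sem);
}

/*
 * Wake all waiters.  Move them to a local list first so that the
 * semaphore operations happen outside the spinlock.
 */
static void
CritCVBroadcast(CritConditionVariable *cv)
{
	proclist_head wakeup;
	proclist_mutable_iter iter;

	Assert(CritSectionCount > 0);

	proclist_init(&wakeup);

	SpinLockAcquire(&cv->mutex);
	proclist_foreach_modify(iter, &cv->wakeup, lwWaitLink)
	{
		proclist_delete(&cv->wakeup, iter.cur, lwWaitLink);
		proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
	}
	SpinLockRelease(&cv->mutex);

	proclist_foreach_modify(iter, &wakeup, lwWaitLink)
	{
		PGPROC	   *waiter = GetPGProcByNumber(iter.cur);

		proclist_delete(&wakeup, iter.cur, lwWaitLink);
		pg_write_barrier();
		waiter->lwWaiting = LW_WS_NOT_WAITING;
		PGSemaphoreUnlock(waiter->sem);
	}
}
```
Usage would mirror a regular ConditionVariable: the waiter loops on its
predicate around CritCVSleep(), and whoever changes the predicate (also
inside a critical section) calls CritCVBroadcast().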

------
Regards,
Alexander Korotkov
Supabase

Attachments:

v11-0001-Get-rid-of-WALBufMappingLock.patchapplication/octet-stream; name=v11-0001-Get-rid-of-WALBufMappingLock.patchDownload
From 7b2797d067a3fe3ee81fa27dd50ccc5233f67f0c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 2 Sep 2025 00:58:59 +0300
Subject: [PATCH v11] Get rid of WALBufMappingLock

Allow multiple backends to initialize WAL buffers concurrently.  This way
`MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without
taking a single LWLock in exclusive mode.

The new algorithm works as follows:
 * reserve a page for initialization using XLogCtl->InitializeReserved,
 * ensure the page is written out,
 * once the page is initialized, try to advance XLogCtl->InitializedUpTo and
   signal to waiters using XLogCtl->InitializdToToWaiterList LWLock-style
   list,
 * repeat previous steps until we reserve initialization up to the target
   WAL position,
 * wait until concurrent initialization finishes using a
   XLogCtl->InitializedUpToCondVar.

Now, multiple backends can, in parallel, concurrently reserve pages,
initialize them, and advance XLogCtl->InitializedUpTo to point to the latest
initialized page.

Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Tested-by: Michael Paquier <michael@paquier.xyz>
---
 src/backend/access/transam/xlog.c             | 312 +++++++++++++++---
 .../utils/activity/wait_event_names.txt       |   2 +-
 src/include/storage/lwlocklist.h              |   2 +-
 3 files changed, 267 insertions(+), 49 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..cf6e01386a9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -90,6 +90,7 @@
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/proclist.h"
 #include "storage/reinit.h"
 #include "storage/spin.h"
 #include "storage/sync.h"
@@ -303,11 +304,6 @@ static bool doPageWrites;
  * so it's a plain spinlock.  The other locks are held longer (potentially
  * over I/O operations), so we use LWLocks for them.  These locks are:
  *
- * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
- * It is only held while initializing and changing the mapping.  If the
- * contents of the buffer being replaced haven't been written yet, the mapping
- * lock is released while the write is done, and reacquired afterwards.
- *
  * WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
  * XLogFlush).
  *
@@ -473,21 +469,38 @@ typedef struct XLogCtlData
 	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */
 
 	/*
-	 * Latest initialized page in the cache (last byte position + 1).
+	 * First initialized page in the cache (first byte position).
+	 */
+	XLogRecPtr	InitializedFrom;
+
+	/*
+	 * Latest reserved for initialization page in the cache (last byte
+	 * position + 1).
 	 *
-	 * To change the identity of a buffer (and InitializedUpTo), you need to
-	 * hold WALBufMappingLock.  To change the identity of a buffer that's
+	 * To change the identity of a buffer, you need to advance
+	 * InitializeReserved first.  To change the identity of a buffer that's
 	 * still dirty, the old page needs to be written out first, and for that
 	 * you need WALWriteLock, and you need to ensure that there are no
 	 * in-progress insertions to the page by calling
 	 * WaitXLogInsertionsToFinish().
 	 */
-	XLogRecPtr	InitializedUpTo;
+	pg_atomic_uint64 InitializeReserved;
+
+	/*
+	 * Latest initialized page in the cache (last byte position + 1).
+	 *
+	 * InitializedUpTo is updated after the buffer initialization.  After
+	 * update, waiters got notification using InitializedUpToCondVar.
+	 */
+	pg_atomic_uint64 InitializedUpTo;
+	slock_t			InitializdToToWaiterListLock;
+	proclist_head	InitializdToToWaiterList;
 
 	/*
 	 * These values do not change after startup, although the pointed-to pages
-	 * and xlblocks values certainly do.  xlblocks values are protected by
-	 * WALBufMappingLock.
+	 * and xlblocks values certainly do.  xlblocks values are changed
+	 * lock-free according to the check for the xlog write position and are
+	 * accompanied by changes of InitializeReserved and InitializedUpTo.
 	 */
 	char	   *pages;			/* buffers for unwritten XLOG pages */
 	pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
@@ -810,9 +823,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * fullPageWrites from changing until the insertion is finished.
 	 *
 	 * Step 2 can usually be done completely in parallel. If the required WAL
-	 * page is not initialized yet, you have to grab WALBufMappingLock to
-	 * initialize it, but the WAL writer tries to do that ahead of insertions
-	 * to avoid that from happening in the critical path.
+	 * page is not initialized yet, you have to go through AdvanceXLInsertBuffer,
+	 * which will ensure it is initialized. But the WAL writer tries to do that
+	 * ahead of insertions to avoid that from happening in the critical path.
 	 *
 	 *----------
 	 */
@@ -1977,6 +1990,55 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
 	return result;
 }
 
+static void
+AdvanceXLWaitForInitialization(XLogRecPtr upto)
+{
+	/*
+	 * All the pages in WAL buffer before 'upto' were reserved for
+	 * initialization.  However, some pages might be reserved by concurrent
+	 * processes.  Wait till they finish initialization.
+	 */
+	while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
+	{
+		PGPROC	   *proc = MyProc;
+		int			extraWaits = 0;
+
+		SpinLockAcquire(&XLogCtl->InitializdToToWaiterListLock);
+		MyProc->lwWaiting = LW_WS_WAITING;
+		MyProc->lwWaitMode = LW_WAIT_UNTIL_FREE;
+		proclist_push_tail(&XLogCtl->InitializdToToWaiterList, MyProcNumber, lwWaitLink);
+		SpinLockRelease(&XLogCtl->InitializdToToWaiterListLock);
+
+		if (upto < pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
+		{
+			SpinLockAcquire(&XLogCtl->InitializdToToWaiterListLock);
+			if (MyProc->lwWaiting == LW_WS_WAITING)
+			{
+				MyProc->lwWaiting = LW_WS_NOT_WAITING;
+				MyProc->lwWaitMode = LW_WAIT_UNTIL_FREE;
+				proclist_delete(&XLogCtl->InitializdToToWaiterList, MyProcNumber, lwWaitLink);
+			}
+			SpinLockRelease(&XLogCtl->InitializdToToWaiterListLock);
+			break;
+		}
+
+		for (;;)
+		{
+			PGSemaphoreLock(proc->sem);
+			if (proc->lwWaiting == LW_WS_NOT_WAITING)
+				break;
+			extraWaits++;
+		}
+
+		/*
+		 * Fix the process wait semaphore's count for any absorbed wakeups.
+		 */
+		while (extraWaits-- > 0)
+			PGSemaphoreUnlock(proc->sem);
+
+	}
+}
+
 /*
  * Initialize XLOG buffers, writing out old buffers if they still contain
  * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
@@ -1994,32 +2056,78 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
+	XLogRecPtr	ReservedPtr;
 	int			npages pg_attribute_unused() = 0;
 
-	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-
 	/*
-	 * Now that we have the lock, check if someone initialized the page
-	 * already.
+	 * We must run the loop below inside the critical section as we expect
+	 * XLogCtl->InitializedUpTo to eventually keep up.  The most of callers
+	 * already run inside the critical section. Except for WAL writer, which
+	 * passed 'opportunistic == true', and therefore we don't perform
+	 * operations that could error out.
+	 *
+	 * Start an explicit critical section anyway though.
+	 */
+	Assert(CritSectionCount > 0 || opportunistic);
+	START_CRIT_SECTION();
+
+	/*--
+	 * Loop till we get all the pages in WAL buffer before 'upto' reserved for
+	 * initialization.  Multiple process can initialize different buffers with
+	 * this loop in parallel as following.
+	 *
+	 * 1. Reserve page for initialization using XLogCtl->InitializeReserved.
+	 * 2. Initialize the reserved page.
+	 * 3. Attempt to advance XLogCtl->InitializedUpTo,
 	 */
-	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
+	ReservedPtr = pg_atomic_read_u64(&XLogCtl->InitializeReserved);
+	while (upto >= ReservedPtr || opportunistic)
 	{
-		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
+		Assert(ReservedPtr % XLOG_BLCKSZ == 0);
 
 		/*
-		 * Get ending-offset of the buffer page we need to replace (this may
-		 * be zero if the buffer hasn't been used yet).  Fall through if it's
-		 * already written out.
+		 * Get ending-offset of the buffer page we need to replace.
+		 *
+		 * We don't lookup into xlblocks, but rather calculate position we
+		 * must wait to be written. If it was written, xlblocks will have this
+		 * position (or uninitialized)
 		 */
-		OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
-		if (LogwrtResult.Write < OldPageRqstPtr)
+		if (ReservedPtr + XLOG_BLCKSZ > XLogCtl->InitializedFrom + XLOG_BLCKSZ * XLOGbuffers)
+			OldPageRqstPtr = ReservedPtr + XLOG_BLCKSZ - (XLogRecPtr) XLOG_BLCKSZ * XLOGbuffers;
+		else
+			OldPageRqstPtr = InvalidXLogRecPtr;
+
+		if (LogwrtResult.Write < OldPageRqstPtr && opportunistic)
 		{
 			/*
-			 * Nope, got work to do. If we just want to pre-initialize as much
-			 * as we can without flushing, give up now.
+			 * If we just want to pre-initialize as much as we can without
+			 * flushing, give up now.
 			 */
-			if (opportunistic)
-				break;
+			upto = ReservedPtr - 1;
+			break;
+		}
+
+		/*
+		 * Attempt to reserve the page for initialization.  Failure means that
+		 * this page got reserved by another process.
+		 */
+		if (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
+											&ReservedPtr,
+											ReservedPtr + XLOG_BLCKSZ))
+			continue;
+
+		/*
+		 * Wait till page gets correctly initialized up to OldPageRqstPtr.
+		 */
+		nextidx = XLogRecPtrToBufIdx(ReservedPtr);
+		if (OldPageRqstPtr > 0)
+			AdvanceXLWaitForInitialization(OldPageRqstPtr - 1);
+		Assert(pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) == OldPageRqstPtr);
+
+		/* Fall through if it's already written out. */
+		if (LogwrtResult.Write < OldPageRqstPtr)
+		{
+			/* Nope, got work to do. */
 
 			/* Advance shared memory write request position */
 			SpinLockAcquire(&XLogCtl->info_lck);
@@ -2034,14 +2142,6 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 			RefreshXLogWriteResult(LogwrtResult);
 			if (LogwrtResult.Write < OldPageRqstPtr)
 			{
-				/*
-				 * Must acquire write lock. Release WALBufMappingLock first,
-				 * to make sure that all insertions that we need to wait for
-				 * can finish (up to this same position). Otherwise we risk
-				 * deadlock.
-				 */
-				LWLockRelease(WALBufMappingLock);
-
 				WaitXLogInsertionsToFinish(OldPageRqstPtr);
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
@@ -2069,9 +2169,6 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 					 */
 					pgstat_report_fixed = true;
 				}
-				/* Re-acquire WALBufMappingLock and retry */
-				LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-				continue;
 			}
 		}
 
@@ -2079,11 +2176,9 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 * Now the next buffer slot is free and we can set it up to be the
 		 * next output page.
 		 */
-		NewPageBeginPtr = XLogCtl->InitializedUpTo;
+		NewPageBeginPtr = ReservedPtr;
 		NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
 
-		Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
-
 		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
 		/*
@@ -2147,12 +2242,127 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 */
 		pg_write_barrier();
 
+		/*-----
+		 * Update the value of XLogCtl->xlblocks[nextidx] and try to advance
+		 * XLogCtl->InitializedUpTo in a lock-less manner.
+		 *
+		 * First, let's provide a formal proof of the algorithm.  Let it be 'n'
+		 * process with the following variables in shared memory:
+		 *	f - an array of 'n' boolean flags,
+		 *	v - atomic integer variable.
+		 *
+		 * Also, let
+		 *	i - a number of a process,
+		 *	j - local integer variable,
+		 * CAS(var, oldval, newval) - compare-and-swap atomic operation
+		 *							  returning true on success,
+		 * write_barrier()/read_barrier() - memory barriers.
+		 *
+		 * The pseudocode for each process is the following.
+		 *
+		 *	j := i
+		 *	f[i] := true
+		 *	write_barrier()
+		 *	while CAS(v, j, j + 1):
+		 *		j := j + 1
+		 *		read_barrier()
+		 *		if not f[j]:
+		 *			break
+		 *
+		 * Let's prove that v eventually reaches the value of n.
+		 * 1. Prove by contradiction.  Assume v doesn't reach n and stucks
+		 *	  on k, where k < n.
+		 * 2. Process k attempts CAS(v, k, k + 1).  1). If, as we assumed, v
+		 *	  gets stuck at k, then this CAS operation must fail.  Therefore,
+		 *    v < k when process k attempts CAS(v, k, k + 1).
+		 * 3. If, as we assumed, v gets stuck at k, then the value k of v
+		 *	  must be achieved by some process m, where m < k.  The process
+		 *	  m must observe f[k] == false.  Otherwise, it will later attempt
+		 *	  CAS(v, k, k + 1) with success.
+		 * 4. Therefore, corresponding read_barrier() (while j == k) on
+		 *	  process m reached before write_barrier() of process k.  But then
+		 *	  process k attempts CAS(v, k, k + 1) after process m successfully
+		 *	  incremented v to k, and that CAS operation must succeed.
+		 *	  That leads to a contradiction.  So, there is no such k (k < n)
+		 *    where v gets stuck.  Q.E.D.
+		 *
+		 * To apply this proof to the code below, we assume
+		 * XLogCtl->InitializedUpTo will play the role of v with XLOG_BLCKSZ
+		 * granularity.  We also assume setting XLogCtl->xlblocks[nextidx] to
+		 * NewPageEndPtr to play the role of setting f[i] to true.  Also, note
+		 * that processes can't concurrently map different xlog locations to
+		 * the same nextidx because we previously requested that
+		 * XLogCtl->InitializedUpTo >= OldPageRqstPtr.  So, a xlog buffer can
+		 * be taken for initialization only once the previous initialization
+		 * takes effect on XLogCtl->InitializedUpTo.
+		 */
+
 		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
-		XLogCtl->InitializedUpTo = NewPageEndPtr;
+
+		pg_write_barrier();
+
+		while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr))
+		{
+			NewPageBeginPtr = NewPageEndPtr;
+			NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
+			nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr);
+
+			pg_read_barrier();
+
+			if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr)
+			{
+				proclist_head wakeup;
+				proclist_mutable_iter iter;
+
+				/*
+				 * Page at nextidx wasn't initialized yet, so we can't move
+				 * InitializedUpto further. It will be moved by backend which
+				 * will initialize nextidx.
+				 */
+				proclist_init(&wakeup);
+				SpinLockAcquire(&XLogCtl->InitializdToToWaiterListLock);
+				proclist_foreach_modify(iter, &XLogCtl->InitializdToToWaiterList, lwWaitLink)
+				{
+					PGPROC	   *waiter = GetPGProcByNumber(iter.cur);
+
+					proclist_delete(&XLogCtl->InitializdToToWaiterList, iter.cur, lwWaitLink);
+					proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+					Assert(waiter->lwWaiting == LW_WS_WAITING);
+					waiter->lwWaiting = LW_WS_PENDING_WAKEUP;
+				}
+				SpinLockRelease(&XLogCtl->InitializdToToWaiterListLock);
+
+				proclist_foreach_modify(iter, &wakeup, lwWaitLink)
+				{
+					PGPROC	   *waiter = GetPGProcByNumber(iter.cur);
+
+					proclist_delete(&wakeup, iter.cur, lwWaitLink);
+
+					/*
+					 * Guarantee that lwWaiting being unset only becomes visible once the
+					 * unlink from the link has completed. Otherwise the target backend
+					 * could be woken up for other reason and enqueue for a new lock - if
+					 * that happens before the list unlink happens, the list would end up
+					 * being corrupted.
+					 *
+					 * The barrier pairs with the LWLockWaitListLock() when enqueuing for
+					 * another lock.
+					 */
+					pg_write_barrier();
+					waiter->lwWaiting = LW_WS_NOT_WAITING;
+					PGSemaphoreUnlock(waiter->sem);
+				}
+				break;
+			}
+		}
 
 		npages++;
 	}
-	LWLockRelease(WALBufMappingLock);
+
+	AdvanceXLWaitForInitialization(upto);
+
+	END_CRIT_SECTION();
+	pg_read_barrier();
 
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
@@ -5045,6 +5255,11 @@ XLOGShmemInit(void)
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->InitializeReserved, InvalidXLogRecPtr);
+	pg_atomic_init_u64(&XLogCtl->InitializedUpTo, InvalidXLogRecPtr);
+	SpinLockInit(&XLogCtl->InitializdToToWaiterListLock);
+	proclist_init(&XLogCtl->InitializdToToWaiterList);
 }
 
 /*
@@ -6068,7 +6283,8 @@ StartupXLOG(void)
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
 		pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
-		XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
+		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
+		XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr;
 	}
 	else
 	{
@@ -6077,8 +6293,10 @@ StartupXLOG(void)
 		 * let the first attempt to insert a log record to initialize the next
 		 * buffer.
 		 */
-		XLogCtl->InitializedUpTo = EndOfLog;
+		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
+		XLogCtl->InitializedFrom = EndOfLog;
 	}
+	pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
 
 	/*
 	 * Update local and shared status.  This is OK to do without any locks
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..6632b808018 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -156,6 +156,7 @@ REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it c
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT	"Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
 SYNC_REP	"Waiting for confirmation from a remote server during synchronous replication."
+WAL_BUFFER_INIT	"Waiting on WAL buffer to be initialized."
 WAL_RECEIVER_EXIT	"Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for streaming replication."
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to be generated."
@@ -318,7 +319,6 @@ XidGen	"Waiting to allocate a new transaction ID."
 ProcArray	"Waiting to access the shared per-process data structures (typically, to get a snapshot or report a session's transaction ID)."
 SInvalRead	"Waiting to retrieve messages from the shared catalog invalidation queue."
 SInvalWrite	"Waiting to add a message to the shared catalog invalidation queue."
-WALBufMapping	"Waiting to replace a page in WAL buffers."
 WALWrite	"Waiting for WAL buffers to be written to disk."
 ControlFile	"Waiting to read or update the <filename>pg_control</filename> file or create a new WAL file."
 MultiXactGen	"Waiting to read or update shared multixact state."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..208d2e3a8ed 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -38,7 +38,7 @@ PG_LWLOCK(3, XidGen)
 PG_LWLOCK(4, ProcArray)
 PG_LWLOCK(5, SInvalRead)
 PG_LWLOCK(6, SInvalWrite)
-PG_LWLOCK(7, WALBufMapping)
+/* 7 was WALBufMapping */
 PG_LWLOCK(8, WALWrite)
 PG_LWLOCK(9, ControlFile)
 /* 10 was CheckpointLock */
-- 
2.39.5 (Apple Git-154)

#74Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Alexander Korotkov (#73)
Re: VM corruption on standby

On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:

I think approach #2 is more appropriate for bc22dc0e0d, because in
the critical section we only wait for other processes that are also in
a critical section (so there is no risk of them exiting immediately
after postmaster death and leaving us stuck). I've implemented a patch
where waiting on the condition variable is replaced with LWLock-style
waiting on a semaphore. However, custom waiting code just for
AdvanceXLInsertBuffer() doesn't look good.

Well, at least I'd like to see a corruption-free solution for the injection point wait too.

I believe we need a more
general solution. We might have a special kind of condition variable,
a critical section condition variable, where both waiting and
signaling must be invoked only in a critical section. However, I dug
into our Latch and WaitEventSet, and it seems there are too many
assumptions about postmaster death there. So, a critical section
condition variable should probably be implemented on top of a
semaphore. Any thoughts?

We want Latch\WaitEventSet, but for critical sections. Is it easier to implement that from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

Best regards, Andrey Borodin.

#75Thomas Munro
thomas.munro@gmail.com
In reply to: Andrey Borodin (#74)
Re: VM corruption on standby

On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I believe we need a more
general solution. We might have a special kind of condition variable,
a critical section condition variable, where both waiting and
signaling must be invoked only in a critical section. However, I dug
into our Latch and WaitEventSet, and it seems there are too many
assumptions about postmaster death there. So, a critical section
condition variable should probably be implemented on top of a
semaphore. Any thoughts?

We want Latch\WaitEventSet, but for critical sections. Is it easier to implement that from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

FWIW I'm working on a patch set that kills all backends without
releasing any locks when the postmaster exits. Then CVs and other
latch-based stuff should be safe in this context. Work was
interrupted by a vacation but I hope to post something in the next
couple of days, over on that other thread I started...

#76Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#75)
Re: VM corruption on standby

On Thu, Sep 11, 2025 at 10:59:19AM +1200, Thomas Munro wrote:

FWIW I'm working on a patch set that kills all backends without
releasing any locks when the postmaster exits. Then CVs and other
latch-based stuff should be safe in this context. Work was
interrupted by a vacation but I hope to post something in the next
couple of days, over on that other thread I started...

Wow. Thanks!
--
Michael

#77Alexander Korotkov
aekorotkov@gmail.com
In reply to: Thomas Munro (#75)
Re: VM corruption on standby

On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I believe we need a more
general solution. We might have a special kind of condition variable,
a critical section condition variable, where both waiting and
signaling must be invoked only in a critical section. However, I dug
into our Latch and WaitEventSet, and it seems there are too many
assumptions about postmaster death there. So, a critical section
condition variable should probably be implemented on top of a
semaphore. Any thoughts?

We want Latch\WaitEventSet, but for critical sections. Is it easier to implement that from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

FWIW I'm working on a patch set that kills all backends without
releasing any locks when the postmaster exits. Then CVs and other
latch-based stuff should be safe in this context. Work was
interrupted by a vacation but I hope to post something in the next
couple of days, over on that other thread I started...

Thank you!
I'm looking forward to seeing it!

------
Regards,
Alexander Korotkov
Supabase

#78Alexander Korotkov
aekorotkov@gmail.com
In reply to: Thomas Munro (#75)
Re: VM corruption on standby

Hi, Thomas!

On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Sep 11, 2025 at 12:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 10 Sep 2025, at 15:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I believe we need a more
general solution. We might have a special kind of condition variable,
a critical section condition variable, where both waiting and
signaling must be invoked only in a critical section. However, I dug
into our Latch and WaitEventSet, and it seems there are too many
assumptions about postmaster death there. So, a critical section
condition variable should probably be implemented on top of a
semaphore. Any thoughts?

We want Latch\WaitEventSet, but for critical sections. Is it easier to implement that from scratch (from semaphores), or is it easier to fix and maintain the existing Latch\WaitEventSet?

FWIW I'm working on a patch set that kills all backends without
releasing any locks when the postmaster exits. Then CVs and other
latch-based stuff should be safe in this context. Work was
interrupted by a vacation but I hope to post something in the next
couple of days, over on that other thread I started...

How is it going?

------
Regards,
Alexander Korotkov
Supabase

#79Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Korotkov (#78)
Re: VM corruption on standby

On Fri, Oct 3, 2025 at 7:31 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Thu, Sep 11, 2025 at 1:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:

FWIW I'm working on a patch set that kills all backends without
releasing any locks when the postmaster exits. Then CVs and other
latch-based stuff should be safe in this context. Work was
interrupted by a vacation but I hope to post something in the next
couple of days, over on that other thread I started...

How is it going?

I have something, but I ran into a cluster of related subproblems
along the way (namely: how we manage subprocesses for COPY, archiving,
etc., which is all a bit of a mess with known problems relating to
interrupts, signals, postmaster exit and [in development] threads,
which I'll write about soon with references to earlier discussions).
I realised that I needed to step back a bit and tackle all aspects of
our process tree and event management in a more architecturally
coherent way, and already had some prototypes for bits and pieces of
that puzzle from earlier attempts, but needed to make them work on
Windows, which had some fun subproblems. I think I've mostly figured
it out now and am testing... Unfortunately a planned family vacation
fell in the middle of all that, hence the delay. I'm back and actively
working on this now. More very soon.