Small fixes needed by high-availability tools

Started by Andrey Borodin, 8 months ago, 23 messages
#1 Andrey Borodin
x4mmm@yandex-team.ru
3 attachment(s)

Hi hackers!

I want to revive attempts to fix some old edge cases of physical quorum replication.

Please find attached draft patches that demonstrate ideas. These patches are not actually proposed code changes, I rather want to have a design consensus first.

1. Allow checking standby sync before making data visible after crash recovery

Problem: a Postgres instance must not allow reading data that is not yet known to be replicated.
Immediately after a crash we do not know whether we are still the cluster primary. We can disallow new
connections until the standby quorum is established. Of course, walsenders and superusers must be exempt from this restriction.

Key change is following:
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
if (PostAuthDelay > 0)
pg_usleep(PostAuthDelay * 1000000L);

+	/* Check if we need to wait for startup synchronous replication */
+	if (!am_walsender &&
+		!superuser() &&
+		!StartupSyncRepEstablished())
+	{
+		ereport(FATAL,
+				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+	}

We might also want to have some kind of cache recording that the quorum was already established. Also, the place where the check is done might not be the most appropriate one.
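
To illustrate, here is a rough SQL sketch of what an HA agent could run over a superuser connection (which is exempt from the check) to see whether the quorum has caught up; it uses only existing views and functions and only approximates what StartupSyncRepEstablished() checks internally:

-- Run on the just-recovered primary; sync/quorum standbys report their progress here:
SELECT application_name, sync_state, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication
WHERE sync_state IN ('sync', 'quorum');
-- Once the column matching the configured level reaches pg_current_wal_flush_lsn()
-- (an upper bound for the end-of-recovery LSN), the patch would start letting
-- ordinary connections in.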

2. Do not allow to cancel locally written transaction

The problem was discussed many times [0,1,2,3] with some agreement on the taken approach. But there were concerns that the solution is incomplete without the first patch in the current thread.

Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as committed. This leads to losing data with UPSERTs.

The key change is how we process cancels in SyncRepWaitForLSN().

3. Allow reading LSN written by walreciever, but not flushed yet

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with synchronous_commit=remote_write.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

Currently we just use a separate program, lwaldump [4], which simply reads WAL up to the last valid record. In case of failover pg_consul uses LSNs from lwaldump. This approach works well, but is cumbersome.
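
As a sketch of how a failover tool could use this, assuming the function name from the attached patch (it does not exist yet and may be renamed):

-- Run on every candidate standby during failover; promote the standby reporting
-- the highest value. pg_last_wal_receive_unflushed_lsn() is only proposed here.
SELECT pg_last_wal_receive_unflushed_lsn();
-- Today roughly the same information is visible only while the walreceiver is running:
SELECT written_lsn, flushed_lsn FROM pg_stat_wal_receiver;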

There are other caveats of replication, but IMO these 3 problems are most annoying in terms of data durability.

I'd greatly appreciate any thoughts on this.

Best regards, Andrey Borodin.

[0]: /messages/by-id/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C@yandex-team.ru
[1]: /messages/by-id/CAEET0ZHG5oFF7iEcbY6TZadh1mosLmfz1HLm311P9VOt7Z+jeg@mail.gmail.com
[2]: /messages/by-id/6a052e81060824a8286148b1165bafedbd7c86cd.camel@j-davis.com
[3]: /messages/by-id/CALj2ACUrOB59QaE6=jF2cFAyv1MR7fzD8tr4YM5+OwEYG1SNzA@mail.gmail.com
[4]: https://github.com/g0djan/lwaldump

Attachments:

0001-Allow-checking-standby-sync-before-making-data-visib.patch
From f4e09b4b07e528c4b0159767b9a8348e825dd77f Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 2 May 2025 16:45:31 +0500
Subject: [PATCH 1/3] Allow checking standby sync before making data visible
 after crash recovery

The crash might happen when Primary and Standby synchronisation
is lost. In this case we should not allow queries to see data
which was not acknowledged by the quorum. To do so, do not let in
non-superuser connections until the standby quorum is restored.

This behaviour is only needed for HA clusters, and thus is disabled
by the GUC by default.
---
 src/backend/access/transam/xact.c   |  1 +
 src/backend/access/transam/xlog.c   |  4 +++-
 src/backend/replication/syncrep.c   | 34 +++++++++++++++++++++++++++++
 src/backend/utils/init/postinit.c   | 11 ++++++++++
 src/backend/utils/misc/guc_tables.c | 10 +++++++++
 src/include/access/xact.h           |  1 +
 src/include/access/xlogrecovery.h   |  1 +
 src/include/replication/syncrep.h   |  3 +++
 8 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..8962b311361 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -85,6 +85,7 @@ bool		DefaultXactDeferrable = false;
 bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
+int			startup_synchronous_commit = SYNCHRONOUS_COMMIT_OFF;
 
 /*
  * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2d4c346473b..1f79b95b473 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -142,6 +142,8 @@ bool		XLOG_DEBUG = false;
 
 int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 
+XLogRecPtr RecoveryEndOfLog;
+
 /*
  * Number of WAL insertion locks to use. A higher value allows more insertions
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
@@ -6040,7 +6042,7 @@ StartupXLOG(void)
 	 * Finish WAL recovery.
 	 */
 	endOfRecoveryInfo = FinishWalRecovery();
-	EndOfLog = endOfRecoveryInfo->endOfLog;
+	RecoveryEndOfLog = EndOfLog = endOfRecoveryInfo->endOfLog;
 	EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI;
 	abortedRecPtr = endOfRecoveryInfo->abortedRecPtr;
 	missingContrecPtr = endOfRecoveryInfo->missingContrecPtr;
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index cc35984ad00..0c09f8060a3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,6 +75,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlogrecovery.h"
 #include "common/int.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1138,3 +1139,36 @@ assign_synchronous_commit(int newval, void *extra)
 			break;
 	}
 }
+
+/*
+ * Check if startup synchronous replication is established with the required standbys
+ * based on the synchronous_commit level.
+ */
+bool
+StartupSyncRepEstablished(void)
+{
+	XLogRecPtr replication_lsn = 0;
+	int mode;
+
+	switch (startup_synchronous_commit)
+	{
+		case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
+			mode = SYNC_REP_WAIT_WRITE;
+			break;
+		case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
+			mode = SYNC_REP_WAIT_FLUSH;
+			break;
+		case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
+			mode = SYNC_REP_WAIT_APPLY;
+			break;
+		default:
+		/* If startup_synchronous_commit is not set to a synchronous level, no need to check */
+			return true;
+	}
+
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	replication_lsn = WalSndCtl->lsn[mode];
+	LWLockRelease(SyncRepLock);
+
+	return replication_lsn >= RecoveryEndOfLog;
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 01309ef3f86..10b354fe069 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -42,6 +42,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/slotsync.h"
+#include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "storage/aio_subsys.h"
 #include "storage/bufmgr.h"
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	if (PostAuthDelay > 0)
 		pg_usleep(PostAuthDelay * 1000000L);
 
+	/* Check if we need to wait for startup synchronous replication */
+	if (!am_walsender &&
+		!superuser() &&
+		!StartupSyncRepEstablished())
+	{
+		ereport(FATAL,
+				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+	}
+
 	/*
 	 * Initialize various default states that can't be set up until we've
 	 * selected the active user and gotten the right GUC settings.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 60b12446a1c..23e20261013 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -5173,6 +5173,16 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, assign_synchronous_commit, NULL
 	},
 
+	{
+		{"startup_synchronous_standby_level", PGC_SUSET, REPLICATION_PRIMARY,
+			gettext_noop("Sets the synchronization level neccesary to start node."),
+			NULL
+		},
+		&startup_synchronous_commit,
+		SYNCHRONOUS_COMMIT_OFF, synchronous_commit_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
 			gettext_noop("Allows archiving of WAL files using \"archive_command\"."),
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..feb47c80dfe 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,7 @@ typedef enum
 
 /* Synchronous commit level */
 extern PGDLLIMPORT int synchronous_commit;
+extern PGDLLIMPORT int startup_synchronous_commit;
 
 /* used during logical streaming of a transaction */
 extern PGDLLIMPORT TransactionId CheckXidAlive;
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 91446303024..8630f2403f8 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -131,6 +131,7 @@ typedef struct
 	bool		recovery_signal_file_found;
 } EndOfWalRecoveryInfo;
 
+extern XLogRecPtr RecoveryEndOfLog;
 extern EndOfWalRecoveryInfo *FinishWalRecovery(void);
 extern void ShutdownWalRecovery(void);
 extern void RemovePromoteSignalFiles(void);
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 675669a79f7..bc0205d5fa3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -107,4 +107,7 @@ extern void syncrep_yyerror(SyncRepConfigData **syncrep_parse_result_p, char **s
 extern void syncrep_scanner_init(const char *str, yyscan_t *yyscannerp);
 extern void syncrep_scanner_finish(yyscan_t yyscanner);
 
+extern void assign_synchronous_commit(int newval, void *extra);
+extern bool StartupSyncRepEstablished(void);
+
 #endif							/* _SYNCREP_H */
-- 
2.39.5 (Apple Git-154)

0002-Do-not-allow-to-cancel-locally-written-transaction.patch
From aa9037f71f87bea43e4fd402c4de62c463212cd8 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 2 May 2025 17:02:11 +0500
Subject: [PATCH 2/3] Do not allow to cancel locally written transaction

This extends startup_synchronous_commit to locally logged transactions.
If a transaction is not replicated and startup synchronisation is
configured, we do not allow transaction cancellation.

Since the GUC scope is extended, we also rename it to
failover_synchronous_standby_level.
---
 src/backend/access/transam/xact.c   |  2 +-
 src/backend/replication/syncrep.c   | 16 +++++++++++-----
 src/backend/utils/init/postinit.c   |  2 +-
 src/backend/utils/misc/guc_tables.c |  4 ++--
 src/include/access/xact.h           |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8962b311361..a2011e80b40 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -85,7 +85,7 @@ bool		DefaultXactDeferrable = false;
 bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
-int			startup_synchronous_commit = SYNCHRONOUS_COMMIT_OFF;
+int			failover_synchronous_standby_level = SYNCHRONOUS_COMMIT_OFF;
 
 /*
  * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 0c09f8060a3..c3077966917 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -318,11 +318,17 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (QueryCancelPending)
 		{
 			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+			if (failover_synchronous_standby_level < SYNCHRONOUS_COMMIT_REMOTE_WRITE)
+			{
+				ereport(WARNING,
+						(errmsg("canceling wait for synchronous replication due to user request"),
+						 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
 			SyncRepCancelWait();
 			break;
+			}
+			ereport(WARNING,
+				(errmsg("user requested cancel of waiting for synchronous replication"),
+				 errdetail("The transaction has already committed locally and cannot be rolled back, but have not been replicated according to synchronous_standby_names.")));
 		}
 
 		/*
@@ -1150,7 +1156,7 @@ StartupSyncRepEstablished(void)
 	XLogRecPtr replication_lsn = 0;
 	int mode;
 
-	switch (startup_synchronous_commit)
+	switch (failover_synchronous_standby_level)
 	{
 		case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
 			mode = SYNC_REP_WAIT_WRITE;
@@ -1162,7 +1168,7 @@ StartupSyncRepEstablished(void)
 			mode = SYNC_REP_WAIT_APPLY;
 			break;
 		default:
-		/* If startup_synchronous_commit is not set to a synchronous level, no need to check */
+		/* If failover_synchronous_standby_level is not set to a synchronous level, no need to check */
 			return true;
 	}
 
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 10b354fe069..cd84ddab0ae 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1222,7 +1222,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	{
 		ereport(FATAL,
 				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
-				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+				 errmsg("cannot connect until synchronous replication is established with standbys according to failover_synchronous_standby_level")));
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 23e20261013..a95fbb1efe6 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -5174,11 +5174,11 @@ struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
-		{"startup_synchronous_standby_level", PGC_SUSET, REPLICATION_PRIMARY,
+		{"failover_synchronous_standby_level", PGC_SUSET, REPLICATION_PRIMARY,
 			gettext_noop("Sets the synchronization level neccesary to start node."),
 			NULL
 		},
-		&startup_synchronous_commit,
+		&failover_synchronous_standby_level,
 		SYNCHRONOUS_COMMIT_OFF, synchronous_commit_options,
 		NULL, NULL, NULL
 	},
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index feb47c80dfe..b29a8a94156 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,7 +81,7 @@ typedef enum
 
 /* Synchronous commit level */
 extern PGDLLIMPORT int synchronous_commit;
-extern PGDLLIMPORT int startup_synchronous_commit;
+extern PGDLLIMPORT int failover_synchronous_standby_level;
 
 /* used during logical streaming of a transaction */
 extern PGDLLIMPORT TransactionId CheckXidAlive;
-- 
2.39.5 (Apple Git-154)

0003-Allow-reading-LSN-written-by-walreciever-but-not-flu.patch
From 2ae24da62261d170c43157e90f70ea8c88b9808f Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 2 May 2025 17:13:07 +0500
Subject: [PATCH 3/3] Allow reading LSN written by walreciever, but not flushed
 yet

This LSN is needed for HA tools that operate at the synchronous_commit
remote_write level. When these tools perform a failover, a node that
has done a recent flush might win instead of the node that actually
acknowledged the transaction. Thus such tools lose recently committed
transactions.
---
 src/backend/access/transam/xlogfuncs.c | 19 +++++++++++++++++++
 src/include/catalog/pg_proc.dat        |  4 ++++
 2 files changed, 23 insertions(+)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 8c3090165f0..ecc81e799a9 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,25 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 	PG_RETURN_LSN(recptr);
 }
 
+/*
+ * Report the last WAL receive location
+ *
+ * This is useful for determining how much of WAL is guaranteed to be received
+ * and written to disk by walreceiver, but not flushed yet.
+ */
+Datum
+pg_last_wal_receive_unflushed_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	recptr;
+
+	recptr = GetWalRcvWriteRecPtr();
+
+	if (recptr == 0)
+		PG_RETURN_NULL();
+
+	PG_RETURN_LSN(recptr);
+}
+
 /*
  * Report the last WAL replay location (same format as pg_backup_start etc)
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..e6755813b90 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6761,6 +6761,10 @@
   proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
   prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '8035', descr => 'current wal receive location',
+  proname => 'pg_last_wal_receive_unflushed_lsn', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => '',
+  prosrc => 'pg_last_wal_receive_unflushed_lsn' },
 { oid => '3821', descr => 'last wal replay location',
   proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
-- 
2.39.5 (Apple Git-154)

#2 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andrey Borodin (#1)
Re: Small fixes needed by high-availability tools

On Fri, 2 May 2025 at 15:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi hackers!

I want to revive attempts to fix some old edge cases of physical quorum replication.

Please find attached draft patches that demonstrate ideas. These patches are not actually proposed code changes, I rather want to have a design consensus first.

[...]

2. Do not allow to cancel locally written transaction

The problem was discussed many times [0,1,2,3] with some agreement on the taken approach. But there were concerns that the solution is incomplete without the first patch in the current thread.

I'm trying to figure out where in the thread you find this "some
agreement". Could you reference the posts you're referring to?

Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as committed. This leads to losing data with UPSERTs.

Could you explain why specifically UPSERTs would lose data (vs any
other user workload) in cancellations during SyncRepWaitForLSN?

The key change is how we process cancels in SyncRepWaitForLSN().

I personally think we should rather move to CSN-based snapshots on
both primary and replica (with LSN as CSN), and make visibility of
other transactions depend on how much persistence your session wants
(SQL spec permitting, of course).

I.e., if you have synchronous_commit=remote_apply, you wait with
sending the commit success message until you have confirmation that
your commit LSN has been applied on the configured amount of replicas,
and snapshots are taken based on the latest LSN that is known to be
applied everywhere, but if you have synchronous_commit=off, you could
read the commits (even those committed in s_c=remote_apply sessions)
immediately after they've been included in the logs (potentially with
some added slack to account for system state updates).
Similarly, all snapshots you create in a backend with
synchronous_commit=remote_apply would use the highest LSN which is
remotely applied according to the applicable rules, while
synchronous_commit=off implies "all transactions which have been
logged as committed".
Changing synchronous_commit to a value that requires higher
persistence level would cause the backend to wait for its newest
snapshot LSN to reach that persistence level; IMO an acceptable
trade-off for switching s_c at runtime.

This is based on the assumption that if you don't want your commit to
be very durable, you probably also don't care as much about the
durability of the data you can see, and if you want your commits to be
very durable, you probably want to see only very durable data.

This would also unify the commit visibility order between primary and
secondary nodes, and would allow users to have session-level 'wait for
LSN x to be persistent' with much reduced lock times.
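
As a rough illustration of the proposed semantics (the GUC exists, but the snapshot rules sketched here are only a proposal, not current behaviour):

SET synchronous_commit = remote_apply;  -- proposal: snapshots would only see commits already
                                        -- applied on the configured quorum of replicas
SET synchronous_commit = off;           -- proposal: snapshots would see everything logged as
                                        -- committed, even commits from remote_apply sessions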

(CC-ed to Ants, given his interest in this topic)

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andrey Borodin (#1)
Re: Small fixes needed by high-availability tools

On Fri, May 2, 2025 at 6:30 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

3. Allow reading LSN written by walreciever, but not flushed yet

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with synchronous_commit=remote_write.

In which case, can we rely on written WAL that is not yet flushed?
Because say you decide based on written WAL and choose node-1 in above
case for failover, what if it restarts without flushing the written
WAL?

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

It seems to me that this is less controversial than your other two
proposals. So, we can discuss this in a separate thread as well.

--
With Regards,
Amit Kapila.

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andrey Borodin (#1)
Re: Small fixes needed by high-availability tools

On Fri, May 2, 2025 at 6:30 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I want to revive attempts to fix some old edge cases of physical quorum replication.

Please find attached draft patches that demonstrate ideas. These patches are not actually proposed code changes, I rather want to have a design consensus first.

1. Allow checking standby sync before making data visible after crash recovery

Problem: a Postgres instance must not allow reading data that is not yet known to be replicated.
Immediately after a crash we do not know whether we are still the cluster primary. We can disallow new
connections until the standby quorum is established. Of course, walsenders and superusers must be exempt from this restriction.

Key change is following:
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
if (PostAuthDelay > 0)
pg_usleep(PostAuthDelay * 1000000L);

+       /* Check if we need to wait for startup synchronous replication */
+       if (!am_walsender &&
+               !superuser() &&
+               !StartupSyncRepEstablished())
+       {
+               ereport(FATAL,
+                               (errcode(ERRCODE_CANNOT_CONNECT_NOW),
+                                errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+       }

We might also want to have some kind of cache that quorum was already established. Also the place where the check is done might be not most appropriate.

2. Do not allow to cancel locally written transaction

The problem was discussed many times [0,1,2,3] with some agreement on the taken approach. But there were concerns that the solution is incomplete without the first patch in the current thread.

Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as committed. This leads to losing data with UPSERTs.

The key change is how we process cancels in SyncRepWaitForLSN().

One idea to solve this problem could be that whenever we cancel
sync_rep_wait, we set some system-wide flag that indicates that any
new transaction must ensure that all the current data is replicated to
the synchronous standby. Once we ensure that we have waited for
pending transactions to replicate, we can toggle back that system-wide
flag. Now, if the system restarts for any reason during such a wait,
we can use your idea to disallow new connections until the standby
quorum is established.

--
With Regards,
Amit Kapila.

#5 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Matthias van de Meent (#2)
2 attachment(s)
Re: Small fixes needed by high-availability tools

On 6 May 2025, at 12:00, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Fri, 2 May 2025 at 15:00, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi hackers!

I want to revive attempts to fix some old edge cases of physical quorum replication.

Please find attached draft patches that demonstrate ideas. These patches are not actually proposed code changes, I rather want to have a design consensus first.

[...]

2. Do not allow to cancel locally written transaction

The problem was discussed many times [0,1,2,3] with some agreement on the taken approach. But there were concerns that the solution is incomplete without the first patch in the current thread.

I'm trying to figure out where in the thread you find this "some
agreement". Could you reference the posts you're referring to?

/messages/by-id/CAAhFRxgYTEqfsqj5MJ0XXGjTUvVfLFxQmD317Bw5vYBHe_sBvQ@mail.gmail.com

Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as committed. This leads to losing data with UPSERTs.

Could you explain why specifically UPSERTs would lose data (vs any
other user workload) in cancellations during SyncRepWaitForLSN?

Upserts change data conditionally. That's where the observed effect affects written data. But the root problem is observing non-replicated data; it only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and retrying it, as sketched below:
1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
2. JDBC cancels the query after the default timeout
3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written
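
To make this concrete, a minimal SQL sketch; the orders table and values are purely illustrative:

CREATE TABLE orders (id bigint PRIMARY KEY, status text);

INSERT INTO orders (id, status) VALUES (42, 'PAID')
ON CONFLICT (id) DO NOTHING;   -- commits locally, then hangs in SyncRepWaitForLSN()

-- The client (e.g. JDBC) times out and cancels the statement. On retry:
INSERT INTO orders (id, status) VALUES (42, 'PAID')
ON CONFLICT (id) DO NOTHING;   -- reports INSERT 0 0: the row already exists locally, so no WAL
                               -- is written and there is no sync-rep wait; the client wrongly
                               -- concludes the data is safely replicated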

The key change is how we process cancels in SyncRepWaitForLSN().

I personally think we should rather move to CSN-based snapshots on
both primary and replica (with LSN as CSN), and make visibility of
other transactions depend on how much persistence your session wants
(SQL spec permitting, of course).

CSN is a snapshot technique and does not affect the durability aspect of sync rep. You still WAL-log the xid commit.

I.e., if you have synchronous_commit=remote_apply, you wait with
sending the commit success message until you have confirmation that
your commit LSN has been applied on the configured amount of replicas,
and snapshots are taken based on the latest LSN that is known to be
applied everywhere, but if you have synchronous_commit=off, you could
read the commits (even those committed in s_c=remote_apply sessions)
immediately after they've been included in the logs (potentially with
some added slack to account for system state updates).

Introducing a dependency of snapshots on the synchronous_commit level is an interesting idea, but it still relies on a cancel not being able to make the effect of a transaction visible. It does not contradict the ideas I propose here; rather, it supports them.

CSN has been discussed for a couple of decades already; does anything make you believe it will arrive soon and that we do not need to fix the existing problems?

Similarly, all snapshots you create in a backend with
synchronous_commit=remote_apply would use the highest LSN which is
remotely applied according to the applicable rules, while
synchronous_commit=off implies "all transactions which have been
logged as committed".
Changing synchronous_commit to a value that requires higher
persistence level would cause the backend to wait for its newest
snapshot LSN to reach that persistence level; IMO an acceptable
trade-off for switching s_c at runtime.

This is based on the assumption that if you don't want your commit to
be very durable, you probably also don't care as much about the
durability of the data you can see, and if you want your commits to be
very durable, you probably want to see only very durable data.

This would also unify the commit visibility order between primary and
secondary nodes, and would allow users to have session-level 'wait for
LSN x to be persistent' with much reduced lock times.

That's an interesting problem, but far less urgent. Reads from standbys were never consistent, but durability is promised to users.

(CC-ed to Ants, given his interest in this topic)

Thanks!

On 12 May 2025, at 05:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

One idea to solve this problem could be that whenever we cancel
sync_rep_wait, we set some system-wide flag that indicates that any
new transaction must ensure that all the current data is replicated to
the synchronous standby. Once we ensure that we have waited for
pending transactions to replicate, we can toggle back that system-wide
flag. Now, if the system restarts for any reason during such a wait,
we can use your idea to disallow new connections until the standby
quorum is established.

This flag would be very contentious. We would need to store it durably, which is costly. And on a busy system we almost always have some transactions waiting for SyncRep, so even without a flag you know something was in progress.

I'm attaching a slightly modified patch set. Now the recovery end point is correctly stored in shared memory.

Best regards, Andrey Borodin.

Attachments:

v2-0001-Allow-checking-standby-sync-before-making-data-vi.patch
From 43a2c0f21bf77e30084aaeaf06a99dc8cda09b86 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 2 May 2025 16:45:31 +0500
Subject: [PATCH v2 1/3] Allow checking standby sync before making data visible
 after crash recovery

The crash might happen when Primary and Standby synchronisation
is lost. In this case we should not allow queries to see data
which was not acknowledged by the quorum. To do so, do not let in
non-superuser connections until the standby quorum is restored.

This behaviour is only needed for HA clusters, and thus is disabled
by the GUC by default.
---
 src/backend/access/transam/xact.c   |  1 +
 src/backend/access/transam/xlog.c   | 22 +++++++++++++++-
 src/backend/replication/syncrep.c   | 40 +++++++++++++++++++++++++++++
 src/backend/utils/init/postinit.c   | 11 ++++++++
 src/backend/utils/misc/guc_tables.c | 10 ++++++++
 src/include/access/xact.h           |  1 +
 src/include/access/xlogrecovery.h   |  4 +++
 src/include/replication/syncrep.h   |  3 +++
 8 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b885513f765..8962b311361 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -85,6 +85,7 @@ bool		DefaultXactDeferrable = false;
 bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
+int			startup_synchronous_commit = SYNCHRONOUS_COMMIT_OFF;
 
 /*
  * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2d4c346473b..99cc519184d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -562,6 +562,10 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;
 
 	slock_t		info_lck;		/* locks shared variables shown above */
+
+	/* TODO */
+	XLogRecPtr	recoveryEndOfLog;
+	bool		syncRepEstablished;
 } XLogCtlData;
 
 /*
@@ -6040,7 +6044,8 @@ StartupXLOG(void)
 	 * Finish WAL recovery.
 	 */
 	endOfRecoveryInfo = FinishWalRecovery();
-	EndOfLog = endOfRecoveryInfo->endOfLog;
+	Assert(!XLogCtl->syncRepEstablished);
+	XLogCtl->recoveryEndOfLog = EndOfLog = endOfRecoveryInfo->endOfLog;
 	EndOfLogTLI = endOfRecoveryInfo->endOfLogTLI;
 	abortedRecPtr = endOfRecoveryInfo->abortedRecPtr;
 	missingContrecPtr = endOfRecoveryInfo->missingContrecPtr;
@@ -9687,3 +9692,18 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+XLogRecPtr
+GetEndOfRecoveryPtr()
+{
+	return XLogCtl->recoveryEndOfLog;
+}
+
+void SetSyncRepEstablished()
+{
+	XLogCtl->syncRepEstablished = true;
+}
+bool IsSyncRepEstablished()
+{
+	return XLogCtl->syncRepEstablished;
+}
\ No newline at end of file
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index cc35984ad00..4ed1d582a11 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,6 +75,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlogrecovery.h"
 #include "common/int.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1138,3 +1139,42 @@ assign_synchronous_commit(int newval, void *extra)
 			break;
 	}
 }
+
+/*
+ * Check if startup synchronous replication is established with the required standbys
+ * based on the synchronous_commit level.
+ */
+bool
+StartupSyncRepEstablished(void)
+{
+	XLogRecPtr replication_lsn = 0;
+	int mode;
+	bool result;
+
+	switch (startup_synchronous_commit)
+	{
+		case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
+			mode = SYNC_REP_WAIT_WRITE;
+			break;
+		case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
+			mode = SYNC_REP_WAIT_FLUSH;
+			break;
+		case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
+			mode = SYNC_REP_WAIT_APPLY;
+			break;
+		default:
+		/* If startup_synchronous_commit is not set to a synchronous level, no need to check */
+			return true;
+	}
+
+	LWLockAcquire(SyncRepLock, LW_SHARED);
+	replication_lsn = WalSndCtl->lsn[mode];
+	LWLockRelease(SyncRepLock);
+
+	Assert(GetEndOfRecoveryPtr() != InvalidXLogRecPtr);
+	result = (!(WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED)) ||
+			replication_lsn >= GetEndOfRecoveryPtr();
+	if (result)
+		SetSyncRepEstablished();
+	return result;
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 01309ef3f86..10b354fe069 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -42,6 +42,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/slotsync.h"
+#include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "storage/aio_subsys.h"
 #include "storage/bufmgr.h"
@@ -1214,6 +1215,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	if (PostAuthDelay > 0)
 		pg_usleep(PostAuthDelay * 1000000L);
 
+	/* Check if we need to wait for startup synchronous replication */
+	if (!am_walsender &&
+		!superuser() &&
+		!StartupSyncRepEstablished())
+	{
+		ereport(FATAL,
+				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+	}
+
 	/*
 	 * Initialize various default states that can't be set up until we've
 	 * selected the active user and gotten the right GUC settings.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 60b12446a1c..23e20261013 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -5173,6 +5173,16 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, assign_synchronous_commit, NULL
 	},
 
+	{
+		{"startup_synchronous_standby_level", PGC_SUSET, REPLICATION_PRIMARY,
+			gettext_noop("Sets the synchronization level neccesary to start node."),
+			NULL
+		},
+		&startup_synchronous_commit,
+		SYNCHRONOUS_COMMIT_OFF, synchronous_commit_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
 			gettext_noop("Allows archiving of WAL files using \"archive_command\"."),
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..feb47c80dfe 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,6 +81,7 @@ typedef enum
 
 /* Synchronous commit level */
 extern PGDLLIMPORT int synchronous_commit;
+extern PGDLLIMPORT int startup_synchronous_commit;
 
 /* used during logical streaming of a transaction */
 extern PGDLLIMPORT TransactionId CheckXidAlive;
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 91446303024..0539c7dd707 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -155,4 +155,8 @@ extern void RecoveryRequiresIntParameter(const char *param_name, int currValue,
 
 extern void xlog_outdesc(StringInfo buf, XLogReaderState *record);
 
+extern XLogRecPtr GetEndOfRecoveryPtr(void);
+extern void SetSyncRepEstablished(void);
+extern bool IsSyncRepEstablished(void);
+
 #endif							/* XLOGRECOVERY_H */
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 675669a79f7..bc0205d5fa3 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -107,4 +107,7 @@ extern void syncrep_yyerror(SyncRepConfigData **syncrep_parse_result_p, char **s
 extern void syncrep_scanner_init(const char *str, yyscan_t *yyscannerp);
 extern void syncrep_scanner_finish(yyscan_t yyscanner);
 
+extern void assign_synchronous_commit(int newval, void *extra);
+extern bool StartupSyncRepEstablished(void);
+
 #endif							/* _SYNCREP_H */
-- 
2.39.5 (Apple Git-154)

v2-0002-Do-not-allow-to-cancel-locally-written-transactio.patch
From d8bd18f4f6c551449f7fcf993d646a0b3cce28ef Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 2 May 2025 17:02:11 +0500
Subject: [PATCH v2 2/3] Do not allow to cancel locally written transaction

This extends startup_synchronous_commit to locally logged transactions.
If a transaction is not replicated and startup synchronisation is
configured, we do not allow transaction cancellation.

Since the GUC scope is extended, we also rename it to
failover_synchronous_standby_level.
---
 src/backend/access/transam/xact.c             |  2 +-
 src/backend/replication/syncrep.c             | 16 +++++++++++-----
 src/backend/utils/init/postinit.c             |  2 +-
 src/backend/utils/misc/guc_tables.c           |  4 ++--
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xact.h                     |  2 +-
 6 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8962b311361..a2011e80b40 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -85,7 +85,7 @@ bool		DefaultXactDeferrable = false;
 bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
-int			startup_synchronous_commit = SYNCHRONOUS_COMMIT_OFF;
+int			failover_synchronous_standby_level = SYNCHRONOUS_COMMIT_OFF;
 
 /*
  * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 4ed1d582a11..5c35fc419e3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -318,11 +318,17 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (QueryCancelPending)
 		{
 			QueryCancelPending = false;
-			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+			if (failover_synchronous_standby_level < SYNCHRONOUS_COMMIT_REMOTE_WRITE)
+			{
+				ereport(WARNING,
+						(errmsg("canceling wait for synchronous replication due to user request"),
+						 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
 			SyncRepCancelWait();
 			break;
+			}
+			ereport(WARNING,
+				(errmsg("user requested cancel of waiting for synchronous replication"),
+				 errdetail("The transaction has already committed locally and cannot be rolled back, but have not been replicated according to synchronous_standby_names.")));
 		}
 
 		/*
@@ -1151,7 +1157,7 @@ StartupSyncRepEstablished(void)
 	int mode;
 	bool result;
 
-	switch (startup_synchronous_commit)
+	switch (failover_synchronous_standby_level)
 	{
 		case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
 			mode = SYNC_REP_WAIT_WRITE;
@@ -1163,7 +1169,7 @@ StartupSyncRepEstablished(void)
 			mode = SYNC_REP_WAIT_APPLY;
 			break;
 		default:
-		/* If startup_synchronous_commit is not set to a synchronous level, no need to check */
+		/* If failover_synchronous_standby_level is not set to a synchronous level, no need to check */
 			return true;
 	}
 
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 10b354fe069..cd84ddab0ae 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -1222,7 +1222,7 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	{
 		ereport(FATAL,
 				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
-				 errmsg("cannot connect until synchronous replication is established with standbys according to startup_synchronous_standby_level")));
+				 errmsg("cannot connect until synchronous replication is established with standbys according to failover_synchronous_standby_level")));
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 23e20261013..a95fbb1efe6 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -5174,11 +5174,11 @@ struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
-		{"startup_synchronous_standby_level", PGC_SUSET, REPLICATION_PRIMARY,
+		{"failover_synchronous_standby_level", PGC_SUSET, REPLICATION_PRIMARY,
 			gettext_noop("Sets the synchronization level neccesary to start node."),
 			NULL
 		},
-		&startup_synchronous_commit,
+		&failover_synchronous_standby_level,
 		SYNCHRONOUS_COMMIT_OFF, synchronous_commit_options,
 		NULL, NULL, NULL
 	},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 34826d01380..608d517f198 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -239,6 +239,7 @@
 					# unrecoverable data corruption)
 #synchronous_commit = on		# synchronization level;
 					# off, local, remote_write, remote_apply, or on
+#failover_synchronous_standby_level = off	# same levels as for synchronous_commit
 #wal_sync_method = fsync		# the default is the first option
 					# supported by the operating system:
 					#   open_datasync
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index feb47c80dfe..b29a8a94156 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -81,7 +81,7 @@ typedef enum
 
 /* Synchronous commit level */
 extern PGDLLIMPORT int synchronous_commit;
-extern PGDLLIMPORT int startup_synchronous_commit;
+extern PGDLLIMPORT int failover_synchronous_standby_level;
 
 /* used during logical streaming of a transaction */
 extern PGDLLIMPORT TransactionId CheckXidAlive;
-- 
2.39.5 (Apple Git-154)

#6 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Amit Kapila (#3)
1 attachment(s)
Allow reading LSN written by walreciever, but not flushed yet

Moved off from "Small fixes needed by high-availability tools"

On 12 May 2025, at 01:33, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 2, 2025 at 6:30 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

3. Allow reading LSN written by walreciever, but not flushed yet

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with synchronous_commit=remote_write.

In which case, can we rely on written WAL that is not yet flushed?
Because say you decide based on written WAL and choose node-1 in above
case for failover, what if it restarts without flushing the written
WAL?

When users operate with "synchronous_commit=remote_write" they understand that a simultaneous reboot of the primary and standbys will incur data loss.
And if the node is not rebooted, we need the write LSN, not the flush LSN. Or we might want an operation like "flush everything you have written, and return that LSN". That would also do the trick, but is not necessary.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

It seems to me that this is less controversial than your other two
proposals. So, we can discuss this in a separate thread as well.

Done so. Thanks!

Best regards, Andrey Borodin.

Attachments:

v1-0001-Allow-reading-LSN-written-by-walreciever-but-not-.patch
From 9c3482a72f3b12d23db33f97ab112780b649d603 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 2 May 2025 17:13:07 +0500
Subject: [PATCH v1] Allow reading LSN written by walreciever, but not flushed
 yet

This LSN is needed for HA tools that operate at the synchronous_commit
remote_write level. When these tools perform a failover, a node that
has done a recent flush might win instead of the node that actually
acknowledged the transaction. Thus such tools lose recently committed
transactions.
---
 src/backend/access/transam/xlogfuncs.c | 19 +++++++++++++++++++
 src/include/catalog/pg_proc.dat        |  4 ++++
 2 files changed, 23 insertions(+)

diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 8c3090165f0..ecc81e799a9 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -347,6 +347,25 @@ pg_last_wal_receive_lsn(PG_FUNCTION_ARGS)
 	PG_RETURN_LSN(recptr);
 }
 
+/*
+ * Report the last WAL receive location
+ *
+ * This is useful for determining how much of WAL is guaranteed to be received
+ * and written to disk by walreceiver, but not flushed yet.
+ */
+Datum
+pg_last_wal_receive_unflushed_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	recptr;
+
+	recptr = GetWalRcvWriteRecPtr();
+
+	if (recptr == 0)
+		PG_RETURN_NULL();
+
+	PG_RETURN_LSN(recptr);
+}
+
 /*
  * Report the last WAL replay location (same format as pg_backup_start etc)
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..e6755813b90 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6761,6 +6761,10 @@
   proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
   prosrc => 'pg_last_wal_receive_lsn' },
+{ oid => '8035', descr => 'current wal receive location',
+  proname => 'pg_last_wal_receive_unflushed_lsn', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => '',
+  prosrc => 'pg_last_wal_receive_unflushed_lsn' },
 { oid => '3821', descr => 'last wal replay location',
   proname => 'pg_last_wal_replay_lsn', provolatile => 'v',
   prorettype => 'pg_lsn', proargtypes => '',
-- 
2.39.5 (Apple Git-154)

#7 Ants Aasma
ants.aasma@cybertec.at
In reply to: Andrey Borodin (#5)
Re: Small fixes needed by high-availability tools

Hi, dropping in my 2 cents here.

On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as committed. This leads to losing data with UPSERTs.

Could you explain why specifically UPSERTs would lose data (vs any
other user workload) in cancellations during SyncRepWaitForLSN?

Upserts change data conditionally. That's where the observed effect affects written data. But the root problem is observing non-replicated data; it only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and retrying it:
1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
2. JDBC cancels the query after the default timeout
3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

Right. I think upsert is a red herring here. Any system trying to
implement idempotency/exactly once delivery will be built around a
similar pattern. Check if a transaction has already been executed, if
not run the transaction, commit, on failure retry. This is
particularly vulnerable to the visibility issue because the retry is
likely to land on the partitioned off leader.

The key change is how we process cancels in SyncRepWaitForLSN().

I personally think we should rather move to CSN-based snapshots on
both primary and replica (with LSN as CSN), and make visibility of
other transactions depend on how much persistence your session wants
(SQL spec permitting, of course).

CSN is a snapshot technique and does not affect sync rep in durability aspect. You still WAL-log xid commit.

CSN-based snapshots enable delaying visibility without blocking on
cancelling a commit, and relatedly allow async commits to remain
invisible. List-of-concurrent-xids snapshots require shared memory to
keep track of which transactions are running and are therefore limited
in size; running a transaction, committing and then cancelling allows
for a potentially unlimited number of concurrent transactions.

I.e., if you have synchronous_commit=remote_apply, you wait with
sending the commit success message until you have confirmation that
your commit LSN has been applied on the configured amount of replicas,
and snapshots are taken based on the latest LSN that is known to be
applied everywhere, but if you have synchronous_commit=off, you could
read the commits (even those committed in s_c=remote_apply sessions)
immediately after they've been included in the logs (potentially with
some added slack to account for system state updates).

Introducing dependency of snapshot on synchronous_commit level is the interesting idea, but it still depends on that cancel cannot make effect of transaction visible. It does not contradict ideas that I propose here, but support it.

CSN is discussed for a couple of decades already, anything makes you believe it will arrive soon and we do not to fix existing problems?

A couple of things give me hope. One is Heikki's approach of adding an
xid visibility cache to the snapshot [1], which proved to be
surprisingly effective.

The other is having a resolution in sight on how to handle async
transaction visibility. My recollection is that this was the major
issue that derailed the feature last time it was attempted. Allowing
different users to see different states based on their durability
requirements looks like a satisfactory answer to this problem. Whether
it's by overloading synchronous_commit, or a new guc, or a transaction
isolation parameter is a small matter of bikeshedding.

There is even an interesting paper on how this type of approach can be
used to reduce lock durations in contended workloads by moving the
wait to readers [2].

Third, the recent Jepsen report seems to have renewed wider interest
in this problem [3].

On a related topic, I'm also looking at tail latencies with
synchronous replication. Some storage devices have occasional hiccups,
and because WAL necessarily causes head-of-line blocking, these can
magnify the problem by multiple orders of magnitude. Right now we
don't even try to replicate before the WAL is flushed locally. Ideally I
would like to support quorum commit returning when a transaction is
not yet persistent on the local disk. This will require some
additional awareness on the PostgreSQL side that it is running as part of
a cluster, similarly to the no-visibility-before-replicated-durability
problem we are discussing here.

Regards,
Ants Aasma

[1]: /messages/by-id/80f254d3-8ee9-4cde-a7e3-ee99998154da@iki.fi
[2]: https://cs.uwaterloo.ca/~kdaudjee/ED.pdf
[3]: https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4

#8 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Andrey Borodin (#6)
Re: Allow reading LSN written by walreciever, but not flushed yet

On 2025/05/13 0:47, Andrey Borodin wrote:

Moved off from "Small fixes needed by high-availability tools"

On 12 May 2025, at 01:33, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 2, 2025 at 6:30 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

3. Allow reading LSN written by walreciever, but not flushed yet

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with synchronous_commit=remote_write.

In this case, doesn't the flush LSN typically catch up to the write LSN on node2
after a few seconds? Even if the walreceiver exits while there's still written
but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by
calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when selecting
the most advanced node? No?

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
when the walreceiver restarts. This behavior is OK for your purpose?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#9 Mihail Nikalayeu
mihailnikalayeu@gmail.com
In reply to: Fujii Masao (#8)
Re: Allow reading LSN written by walreciever, but not flushed yet

Hello, everyone!

Or we might want an operation like "flush everything you have written, and return that LSN". That would also do the trick, but is not necessary.

I think it is a better option. Fewer things may go wrong in such a case.

Best regards,
Mikhail.

#10 Mihail Nikalayeu
mihailnikalayeu@gmail.com
In reply to: Ants Aasma (#7)
Re: Small fixes needed by high-availability tools

Hello, everyone!

On Mon, 12 May 2025 at 18:42, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Problem: a user might try to cancel a locally committed transaction, and if we do so we will show non-replicated data as committed. This leads to losing data with UPSERTs.

Could you explain why specifically UPSERTs would lose data (vs any
other user workload) in cancellations during SyncRepWaitForLSN?

Upserts change data conditionally. That's where the observed effect affects written data. But the root problem is observing non-replicated data; it only becomes obvious when issuing "INSERT ON CONFLICT DO NOTHING" and retrying it:
1. INSERT ON CONFLICT DO NOTHING hangs on waiting for replication
2. JDBC cancels the query after the default timeout
3. INSERT ON CONFLICT DO NOTHING succeeds, because there's no WAL written

Right. I think upsert is a red herring here. Any system trying to
implement idempotency/exactly once delivery will be built around a
similar pattern. Check if a transaction has already been executed, if
not run the transaction, commit, on failure retry. This is
particularly vulnerable to the visibility issue because the retry is
likely to land on the partitioned off leader.

I think UPSERT is just one specific case here. Any data that becomes
visible and then disappears can cause a variety of issues.

For example, the system receives a callback from a payment system,
marks an order as "PAID," commits the transaction, and returns a 200
response to the payment system (so it won't retry the callback).
However, if the transaction is lost due to a new primary, we end up
with an order that is paid in the real world, but the system is
unaware of it.

And yes, that patch has actually been applied on top of HEAD by most
PG cloud providers for over four years now... [0].

One idea to solve this problem could be that whenever we cancel
sync_rep_wait, we set some system-wide flag that indicates that any
new transaction must ensure that all the current data is replicated to
the synchronous standby. Once we ensure that we have waited for
pending transactions to replicate, we can toggle back that system-wide
flag. Now, if the system restarts for any reason during such a wait,
we can use your idea to disallow new connections until the standby
quorum is established.

It might not necessarily be a flag—it could be some LSN value instead.
Also, it's not just about a "new transaction," but about any new
snapshot that could see data not yet replicated to the synchronous
standby.
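
A very rough sketch of that shape (every name below is hypothetical, only to make the idea concrete): remember in shared memory the highest LSN whose sync-rep wait was cancelled, and let new snapshots through only once the synchronous quorum has confirmed at least that LSN.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "port/atomics.h"

/* All names below are made up -- illustration only. */
typedef struct SyncRepVisibilityBarrier
{
	pg_atomic_uint64	cancelled_upto;	/* highest LSN whose sync-rep wait was cancelled */
} SyncRepVisibilityBarrier;

static SyncRepVisibilityBarrier *SyncRepBarrier;	/* would live in shared memory */

/* Would be called from SyncRepWaitForLSN() when a waiting backend is cancelled. */
static void
SyncRepBarrierRaise(XLogRecPtr lsn)
{
	uint64		cur = pg_atomic_read_u64(&SyncRepBarrier->cancelled_upto);

	/* Only ever move the barrier forward; on CAS failure 'cur' is refreshed. */
	while (cur < lsn &&
		   !pg_atomic_compare_exchange_u64(&SyncRepBarrier->cancelled_upto,
										   &cur, lsn))
		;
}

/*
 * Would be checked before handing out a new snapshot; quorum_confirmed_lsn
 * stands in for whatever the sync-rep machinery already knows has been
 * acknowledged by the required standbys.
 */
static bool
SyncRepBarrierPassed(XLogRecPtr quorum_confirmed_lsn)
{
	return quorum_confirmed_lsn >= pg_atomic_read_u64(&SyncRepBarrier->cancelled_upto);
}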

Best regards,
Mikhail.

[0]: /messages/by-id/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD=fh5wZgdkw@mail.gmail.com

#11Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Fujii Masao (#8)
Re: Allow reading LSN written by walreciever, but not flushed yet

Hi Fujii,

On Tue, 13 May 2025 at 13:13, Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:

In this case, doesn't the flush LSN typically catch up to the write LSN on node2
after a few seconds? Even if the walreceiver exits while there's still written
but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by
calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when selecting
the most advanced node? No?

I think it is a bit more complex than that. There are also cases when we
want to ensure that there are "healthy" standby nodes when switchover is
requested.
Meaning of "healthy" could be something like: "According to the write LSN
it is not lagging more than 16MB" or similar.
Now it is possible to extract this value using
pg_stat_get_wal_receiver()/pg_stat_wal_receiver, but it works only when the
walreceiver process is alive.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
when the walreceiver restarts. This behavior is OK for your purpose?

IMO, most HA tools are prepared for it. They can't rely only on the
write/flush LSN, because a standby may be replaying WAL from the archive
using restore_command, and as a result only the replay LSN is progressing.
That is, they are supposed to be doing something like max(write_lsn,
replay_lsn).
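
For illustration only, the same idea expressed with existing backend internals (the helper name is made up; a real HA tool would of course do this over SQL, e.g. via pg_stat_wal_receiver and pg_last_wal_replay_lsn()):

#include "postgres.h"
#include "access/xlogrecovery.h"		/* GetXLogReplayRecPtr() */
#include "replication/walreceiver.h"	/* GetWalRcvWriteRecPtr() */

/* Hypothetical helper: the "most advanced" position an HA tool is after,
 * taking the larger of the written (possibly unflushed) and replayed LSNs. */
static XLogRecPtr
GetStandbyProgressLSN(void)
{
	return Max(GetWalRcvWriteRecPtr(), GetXLogReplayRecPtr(NULL));
}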

--
Regards,
--
Alexander Kukushkin

#12Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Andrey Borodin (#6)
Re: Allow reading LSN written by walreciever, but not flushed yet

On Mon, 12 May 2025 at 17:48, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Done so. Thanks!

TBH, the current function name pg_last_wal_receive_lsn() is confusing, and
introducing a new one pg_last_wal_receive_unflushed_lsn() doesn't make it
better :(
What about actually adding TWO new functions, pg_last_wal_write_lsn() and
pg_last_wal_flush_lsn()?
These names are more aligned with column names in pg_stat_replication view
and speak for themselves.

And, we can keep pg_last_wal_receive_lsn() as an alias of
pg_last_wal_flush_lsn() for backward compatibility.
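
Just to make the shape concrete, a minimal sketch of pg_last_wal_write_lsn(), modeled on the existing pg_last_wal_receive_lsn() and reusing GetWalRcvWriteRecPtr() (the pg_proc entry, documentation, and the writtenUpto-moves-backwards caveat discussed elsewhere in the thread are left out):

#include "postgres.h"
#include "fmgr.h"
#include "replication/walreceiver.h"	/* GetWalRcvWriteRecPtr() */
#include "utils/pg_lsn.h"				/* PG_RETURN_LSN() */

/* Sketch only: report the last WAL position written (not necessarily flushed)
 * by the walreceiver, or NULL if nothing has been received yet. */
Datum
pg_last_wal_write_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	recptr = GetWalRcvWriteRecPtr();

	if (recptr == 0)
		PG_RETURN_NULL();

	PG_RETURN_LSN(recptr);
}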

--
Regards,
--
Alexander Kukushkin

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Mihail Nikalayeu (#10)
Re: Small fixes needed by high-availability tools

On Wed, May 14, 2025 at 2:15 AM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:

One idea to solve this problem could be that whenever we cancel
sync_rep_wait, we set some system-wide flag that indicates that any
new transaction must ensure that all the current data is replicated to
the synchronous standby. Once we ensure that we have waited for
pending transactions to replicate, we can toggle back that system-wide
flag. Now, if the system restarts for any reason during such a wait,
we can use your idea to disallow new connections until the standby
quorum is established.

It might not necessarily be a flag—it could be some LSN value instead.
Also, it's not just about a "new transaction," but about any new
snapshot that could see data not yet replicated to the synchronous
standby.

Sounds reasonable to me. Let us see what others think about it.

--
With Regards,
Amit Kapila.

#14Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Amit Kapila (#13)
Re: Small fixes needed by high-availability tools

On 14 May 2025, at 05:33, Amit Kapila <amit.kapila16@gmail.com> wrote:

It might not necessarily be a flag—it could be some LSN value instead.
Also, it's not just about a "new transaction," but about any new
snapshot that could see data not yet replicated to the synchronous
standby.

Sounds reasonable to me. Let us see what others think about it.

I think this LSN is simply the LSN where crash recovery ends...

Best regards, Andrey Borodin.

#15Mihail Nikalayeu
mihailnikalayeu@gmail.com
In reply to: Andrey Borodin (#14)
Re: Small fixes needed by high-availability tools

I think this LSN is simply the LSN where crash recovery ends...

Yes, you are right and we come back to :

1. Allow checking standby sync before making data visible after crash recovery

Best regards,
Mikhail.

#16Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Alexander Kukushkin (#11)
Re: Allow reading LSN written by walreciever, but not flushed yet

On 2025/05/14 15:54, Alexander Kukushkin wrote:

Hi Fujii,

On Tue, 13 May 2025 at 13:13, Fujii Masao <masao.fujii@oss.nttdata.com <mailto:masao.fujii@oss.nttdata.com>> wrote:

In this case, doesn't the flush LSN typically catch up to the write LSN on node2
after a few seconds? Even if the walreceiver exits while there's still written
but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by
calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when selecting
the most advanced node? No?

I think it is a bit more complex than that. There are also cases when we want to ensure that there are "healthy" standby nodes when switchover is requested.
Meaning of "healthy" could be something like: "According to the write LSN it is not lagging more than 16MB" or similar.

Could we use the flush LSN instead of the write LSN for this?

By the way, in a switchover scenario where the primary is
shut down cleanly, it tries to send all remaining WAL records
to the standby before exiting. The walreceiver on the standby
then writes and flushes all those records before it exits.
So in this case, using the flush LSN to check for a "healthy"
state should work. But were you considering a different scenario?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#17Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Fujii Masao (#8)
Re: Allow reading LSN written by walreciever, but not flushed yet

On 13 May 2025, at 14:13, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2025/05/13 0:47, Andrey Borodin wrote:

Moved off from "Small fixes needed by high-availability tools"

On 12 May 2025, at 01:33, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, May 2, 2025 at 6:30 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

3. Allow reading LSN written by walreciever, but not flushed yet

Problem: if we have synchronous_standby_names = ANY(node1,node2), node2 might be ahead of node1 by flush LSN, but behind by written LSN. If we do a failover we choose node2 instead of node1 and lose data recently committed with synchronous_commit=remote_write.

In this case, doesn't the flush LSN typically catch up to the write LSN on node2
after a few seconds? Even if the walreceiver exits while there's still written
but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by
calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when selecting
the most advanced node? No?

Well, we implemented this and wrote tests that do a lot of failovers. These tests observed data loss in some infrequent cases due to selecting the wrong new primary, because "a few seconds" is actually an unknown, random amount of time.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
when the walreceiver restarts. This behavior is OK for your purpose?

It is OK, because:
1. It's strictly no worse than flushed LSN
2. synchronous_commit = remote_write assumes that you can lose data when the primary fails and the standby is restarted simultaneously. The user is warned.

Best regards, Andrey Borodin.

#18Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Andrey Borodin (#17)
Re: Allow reading LSN written by walreciever, but not flushed yet

On 2025/05/21 17:35, Andrey Borodin wrote:

Well, we implemented this and wrote tests that do a lot of failovers. These tests observed data loss in some infrequent cases due to selecting the wrong new primary, because "a few seconds" is actually an unknown, random amount of time.

I see your point. But doesn't a similar issue exist even with the write LSN?
For example, even if node1's write LSN is ahead of node2's at one moment,
node2 might catch up or surpass it a few seconds later.

If the walreceiver is no longer running, we can assume the write LSN has
reached its final value. So by waiting for the walreceiver to exit on both nodes,
we can "safely" compare their write LSNs to decide which one is ahead.
Also, in this situation, since XLogWalRcvFlush() is called during WalRcvDie(),
the flush LSN seems effectively guaranteed to match the write LSN.
So it seems also safe to use the flush LSN.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
when the walreceiver restarts. This behavior is OK for your purpose?

It is OK, because:
1. It's strictly no worse than flushed LSN

Could you clarify this?

XLogWalRcvFlush() only updates flushedUpto if LogstreamResult.Flush has advanced,
while XLogWalRcvWrite() updates writtenUpto unconditionally. That means the flush
LSN (as reported by pg_last_wal_receive_lsn()) never moves backward, whereas
the write LSN might. Because of this difference in behavior, I was thinking that
we might need to track the maximum write LSN seen so far and have the function
return that value.
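
If we go that way, a minimal sketch of such monotonic tracking (purely illustrative; where the value would live in shared memory and who updates it are open questions) could be a simple compare-and-swap loop:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "port/atomics.h"

/* Illustration only: advance a shared "maximum written LSN" monotonically,
 * so readers never see it move backwards within the life of shared memory. */
static void
AdvanceMaxWrittenUpto(pg_atomic_uint64 *maxWrittenUpto, XLogRecPtr newval)
{
	uint64		cur = pg_atomic_read_u64(maxWrittenUpto);

	/* Only ever move forward; on CAS failure 'cur' is refreshed and we retry. */
	while (cur < newval &&
		   !pg_atomic_compare_exchange_u64(maxWrittenUpto, &cur, newval))
		;
}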

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#19Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Fujii Masao (#18)
Re: Allow reading LSN written by walreciever, but not flushed yet

On 21 May 2025, at 15:03, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2025/05/21 17:35, Andrey Borodin wrote:

Well, we implemented this and wrote tests that do a lot of failovers. These tests observed data loss in some infrequent cases due to selecting the wrong new primary, because "a few seconds" is actually an unknown, random amount of time.

I see your point. But doesn't a similar issue exist even with the write LSN?
For example, even if node1's write LSN is ahead of node2's at one moment,
node2 might catch up or surpass it a few seconds later.

If the walreceiver is no longer running, we can assume the write LSN has
reached its final value. So by waiting for the walreceiver to exit on both nodes,
we can "safely" compare their write LSNs to decide which one is ahead.
Also, in this situation, since XLogWalRcvFlush() is called during WalRcvDie(),
the flush LSN seems effectively guaranteed to match the write LSN.
So it seems also safe to use the flush LSN.

You are right. Receive LSN is meaningless when receive is in progress. So the only way to know receive LSN is to stop receiving...
I need to think more about it.

Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.

GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward
when the walreceiver restarts. This behavior is OK for your purpose?

It is OK, because:
1. It's strictly no worse than flushed LSN

Could you clarify this?

XLogWalRcvFlush() only updates flushedUpto if LogstreamResult.Flush has advanced,
while XLogWalRcvWrite() updates writtenUpto unconditionally. That means the flush
LSN (as reported by pg_last_wal_receive_lsn()) never moves backward, whereas
the write LSN might.

The write LSN cannot move backwards beyond the flush LSN: the written LSN is always >= the flushed LSN.

Because of this difference in behavior, I was thinking that
we might need to track the maximum write LSN seen so far and have the function
return that value.

That would be ideal. Or maybe just the maximum LSN that we told the primary we have received...

Best regards, Andrey Borodin.

#20Jeremy Schneider
schneider@ardentperf.com
In reply to: Andrey Borodin (#19)
Re: Allow reading LSN written by walreciever, but not flushed yet

On Wed, 21 May 2025 18:38:02 +0300
Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 21 May 2025, at 15:03, Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:

On 2025/05/21 17:35, Andrey Borodin wrote:

Well, we implemented this and wrote tests that do a lot of
failovers. These tests observed data loss in some infrequent cases
due to selecting the wrong new primary, because "a few seconds" is
actually an unknown, random amount of time.

I see your point. But doesn't a similar issue exist even with the
write LSN? For example, even if node1's write LSN is ahead of
node2's at one moment, node2 might catch up or surpass it a few
seconds later.

If the walreceiver is no longer running, we can assume the write
LSN has reached its final value. So by waiting for the walreceiver
to exit on both nodes, we can "safely" compare their write LSNs to
decide which one is ahead. Also, in this situation, since
XLogWalRcvFlush() is called during WalRcvDie(), the flush LSN seems
effectively guaranteed to match the write LSN. So it seems also
safe to use the flush LSN.

You are right. Receive LSN is meaningless when receive is in
progress. So the only way to know receive LSN is to stop receiving...
I need to think more about it.

When we're making a decision about cluster reconfiguration and
promoting a standby to be the new writer, usually the writer has
stopped sending - so I think we will stop receiving pretty quickly
(network issues notwithstanding).

Eventually the in-flight WAL will get sync'd and replayed on replicas.
This thread/request might partly be about whether postgres cluster
management software can make a promotion decision right away or whether
it needs to delay and give the system time to sync WAL, or resort to
directly decoding WAL which isn't yet sync'd.

A large and stressed system could get into a state where fsync takes
awhile. I'm thinking it simplifies our ability to ensure correctness in
cluster reconfiguration algorithms if we have direct access to the
write LSN for cases where synchronous_commit=remote_write; we can then
avoid resorting to delays or external tools like lwaldump.

With quorum replication, we need to design promotion logic carefully to
determine which replica has the latest COMMITs that were acknowledged
to the client.

-Jeremy Schneider

#21Rahila Syed
rahilasyed90@gmail.com
In reply to: Fujii Masao (#18)
Re: Allow reading LSN written by walreciever, but not flushed yet

Hi,

I've been following this discussion and have a question I'd like to ask.

XLogWalRcvFlush() only updates flushedUpto if LogstreamResult.Flush has advanced,
while XLogWalRcvWrite() updates writtenUpto unconditionally. That means the flush
LSN (as reported by pg_last_wal_receive_lsn()) never moves backward, whereas
the write LSN might. Because of this difference in behavior, I was thinking that
we might need to track the maximum write LSN seen so far and have the function
return that value.

Can you please explain the scenarios in which the record pointer of the WAL
record written by the receiver can move backwards?
For example, in physical replication, the WAL records are sent after a flush
on the primary in ascending order of XLogRecPtr. They are also received in
that order and written as they arrive by the receiver.
Are you referring to crash scenarios where unflushed WAL may be discarded and
the receiver has to move backwards and start from the last flushed record on
the standby?

Thank you,
Rahila Syed

#22Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Rahila Syed (#21)
Re: Allow reading LSN written by walreciever, but not flushed yet

On 6 Oct 2025, at 11:16, Rahila Syed <rahilasyed90@gmail.com> wrote:

Can you please explain the scenarios in which the record pointer of the WAL record written
by the receiver can move backwards.
For example, in physical replication, the WAL records are sent after a Flush on the primary in ascending
order of XLogRecPtr. They are also received in that order and written as they arrive by the receiver.
Are you referring to crash scenarios where unflushed WAL may be discarded and receiver has
to move backwards and start from the last flushed record on the standby?

I see at least two scenarios:

1. We observe writtenUpto, then the system is rebooted and we observe a lesser writtenUpto.
2. We observe writtenUpto on one timeline, the standby is kill-9-ed, and another primary sends a timeline change between the flushed position and the previously observed write pointer.

If at some point the primary starts sending unflushed WAL, this list will grow, with possible gray areas on the primary.

Best regards, Andrey Borodin.

#23Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Jeremy Schneider (#20)
Re: Allow reading LSN written by walreciever, but not flushed yet

On 6 Oct 2025, at 04:42, Jeremy Schneider <schneider@ardentperf.com> wrote:

A large and stressed system could get into a state where fsync takes
awhile.

For such cases I would like to have a function that ensures the walreceiver will not acknowledge any new LSN, and then returns the last acknowledged LSN.

Today we have to stop walreceiver, which takes some time from our 9999s.

Best regards, Andrey Borodin.