sync_standbys_defined read/write race on startup

Started by Maksim.Melnikov10 months ago12 messages

m.melnikov@postgrespro.ru

10 months ago

1 attachment(s)

Hi,
On my windows machine I've found unstable test, it fails very rarely,
but anyway, sometimes it fails.

src/test/recovery/t/009_twophase.pl

#   Failed test 'Committed prepared transaction is visible to new
snapshot in standby'
#   at t/009_twophase.pl line 360.
#          got: '0'
#     expected: '2'
# Looks like you failed 1 test of 30.
t/009_twophase.pl ....................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/30 subtests

cat src/test/recovery/t/009_twophase.pl

....
$cur_primary->psql(
    'postgres', "
    SET synchronous_commit='remote_apply'; -- To ensure the standby is
caught up
    CREATE TABLE t_009_tbl_standby_mvcc (id int, msg text);
    BEGIN;
    INSERT INTO t_009_tbl_standby_mvcc VALUES (1, 'issued to
${cur_primary_name}');
    SAVEPOINT s1;
    INSERT INTO t_009_tbl_standby_mvcc VALUES (2, 'issued to
${cur_primary_name}');
    PREPARE TRANSACTION 'xact_009_standby_mvcc';
    ");
$cur_primary->stop;
$cur_standby->restart;

# Acquire a snapshot in standby, before we commit the prepared transaction
my $standby_session =
$cur_standby->background_psql('postgres', on_error_die => 1);
$standby_session->query_safe("BEGIN ISOLATION LEVEL REPEATABLE READ");
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM
t_009_tbl_standby_mvcc");
is($psql_out, '0',
"Prepared transaction not visible in standby before commit");

# Commit the transaction in primary
$cur_primary->start;
$cur_primary->psql(
'postgres', "
SET synchronous_commit='remote_apply'; -- To ensure the standby is caught up
COMMIT PREPARED 'xact_009_standby_mvcc';
");

# Still not visible to the old snapshot
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM
t_009_tbl_standby_mvcc");
is($psql_out, '0',
"Committed prepared transaction not visible to old snapshot in
standby");

# Is visible to a new snapshot
$standby_session->query_safe("COMMIT");
$psql_out =
$standby_session->query_safe("SELECT count(*) FROM
t_009_tbl_standby_mvcc");
*is($psql_out, '2',
"Committed prepared transaction is visible to new snapshot in
standby");*
$standby_session->quit;
.....

Test check visibility of prepared transactions in standby after a
restart while primary is down.
Failed assert check that changes, commited on transaction on primary
node, were synchronously replicated to standby node.
In test flow we have primary server stop/start events.
Assert failure is hard to reproduce, but I found how to do it
synthetically, can do it by injecting in code synthetical sleep call

--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)

        if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
        {
+               sleep(2);
                LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

The root cause is WalSndCtl->sync_standbys_defined, it is located in
shared memory and can be updated by checkpointer process only,
in backends it can be read in case of sync replication. So we have race
on postgres startup between backend, processing commit request, and
checkpointer,
setting WalSndCtl->sync_standbys_defined.

Synchronous replication occurred in SyncRepWaitForLSN(syncrep.c)

void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
.....

    /*
     * Fast exit if user has not requested sync replication, or there
are no
     * sync replication standby names defined.
     *
     * Since this routine gets called every commit time, it's important to
     * exit quickly if sync replication is not requested. So we check
     * WalSndCtl->sync_standbys_defined flag without the lock and exit
     * immediately if it's false. If it's true, we need to check it again
     * later while holding the lock, to check the flag and operate the sync
     * rep queue atomically. This is necessary to avoid the race condition
     * described in SyncRepUpdateSyncStandbysDefined(). On the other
hand, if
     * it's false, the lock is not necessary because we don't touch the
queue.
     */
    if (!SyncRepRequested() ||
        !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;

.....

    if (!WalSndCtl->sync_standbys_defined ||
        lsn <= WalSndCtl->lsn[mode])
    {
        LWLockRelease(SyncRepLock);
        return;
    }

...........

Checkpointer set WalSndCtl->sync_standbys_defined parameter in
src/backend/replication/syncrep.c,
SyncRepUpdateSyncStandbysDefined method

void
SyncRepUpdateSyncStandbysDefined(void)
{
bool sync_standbys_defined = SyncStandbysDefined();

    if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
    {
        LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
..................
        WalSndCtl->sync_standbys_defined = sync_standbys_defined;

LWLockRelease(SyncRepLock);
}
}

So my question is
can we accept such behavior as bug of postgres codebase, or, on other
way, it is bug of unstable test?
P.S.
For me it is looking strange, that postgres start processing workload
before full config initializing.
Attached possible patch for codebase.

Best regards
Melnikov Maksim

Attachments:

0001-Fix-sync_standbys_defined-race-on-startup.patchtext/x-patch; charset=UTF-8; name=0001-Fix-sync_standbys_defined-race-on-startup.patchDownload

From 8fa295c81a8db88568d45da3f779d3097aeb6942 Mon Sep 17 00:00:00 2001
From: Maksim Melnikov <m.melnikov@postgrespro.ru>
Date: Wed, 26 Mar 2025 16:44:09 +0300
Subject: [PATCH] Fix sync_standbys_defined race on startup.

WalSndCtl->sync_standbys_defined, can be updated
by checkpointer process only and in backends it
can be read. So, on postgres startup, we have race
between checkpointer process and backends, that check
sync_standbys_defined for sync replications.

To avoid this, now we are working with sync_standbys_defined
not as bool field, but as bitmask, where 7th bit check
standbys_defined or not, 8th bit check
WalSndCtlData->sync_standbys_defined field has been
initialized by checkpointer. So if in synchronous replication functions
was detected that sync_standbys_defined is not initialized,
check SyncStandbysDefined() value.
---
 src/backend/replication/syncrep.c           | 31 +++++++++++++++++----
 src/include/replication/walsender_private.h | 10 ++++++-
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index d75e3968035..cd29bfbdc94 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -148,6 +148,7 @@ void
 SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 {
 	int			mode;
+	bool		sync_standbys_defined;
 
 	/*
 	 * This should be called while holding interrupts during a transaction
@@ -170,7 +171,8 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * it's false, the lock is not necessary because we don't touch the queue.
 	 */
 	if (!SyncRepRequested() ||
-		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined & SYNCSTANDBYSDEFINEDINITIALIZED)
+		== SYNCSTANDBYSINITIALIZED)
 		return;
 
 	/* Cap the level for anything other than commit to remote flush only. */
@@ -184,6 +186,8 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+	sync_standbys_defined = WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED
+		? WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED : SyncStandbysDefined();
 
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
@@ -193,7 +197,7 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * condition but we'll be fetching that cache line anyway so it's likely
 	 * to be a low cost check.
 	 */
-	if (!WalSndCtl->sync_standbys_defined ||
+	if (!sync_standbys_defined ||
 		lsn <= WalSndCtl->lsn[mode])
 	{
 		LWLockRelease(SyncRepLock);
@@ -920,9 +924,10 @@ SyncRepWakeQueue(bool all, int mode)
 void
 SyncRepUpdateSyncStandbysDefined(void)
 {
-	bool		sync_standbys_defined = SyncStandbysDefined();
+	bool		new_sync_standbys_defined = SyncStandbysDefined();
+	bool		old_sync_standbys_defined = (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED);
 
-	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	if (new_sync_standbys_defined != old_sync_standbys_defined)
 	{
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
@@ -931,7 +936,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * for backends to continue waiting.  Since the user no longer wants
 		 * synchronous replication, we'd better wake them up.
 		 */
-		if (!sync_standbys_defined)
+		if (!new_sync_standbys_defined)
 		{
 			int			i;
 
@@ -946,10 +951,24 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * backend that hasn't yet reloaded its config might go to sleep on
 		 * the queue (and never wake up).  This prevents that.
 		 */
-		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
+		WalSndCtl->sync_standbys_defined |= new_sync_standbys_defined
+			? (SYNCSTANDBYSDEFINEDINITIALIZED)
+			: SYNCSTANDBYSINITIALIZED;
 
 		LWLockRelease(SyncRepLock);
 	}
+	else if (!(WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED))
+	{
+		/*
+		 * Even if new_sync_standbys_defined == old_sync_standbys_defined, we
+		 * should check WalSndCtl->sync_standbys_defined is initialized and
+		 * set its initialize bit, if it reset. Anyway this bit should be set
+		 * only on start and then it will be always setted.
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->sync_standbys_defined |= SYNCSTANDBYSINITIALIZED;
+		LWLockRelease(SyncRepLock);
+	}
 }
 
 #ifdef USE_ASSERT_CHECKING
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0fc77f1b4af..5b096a1dc90 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -100,7 +100,7 @@ typedef struct
 	 * config file safely, so checkpointer updates this value as needed.
 	 * Protected by SyncRepLock.
 	 */
-	bool		sync_standbys_defined;
+	uint8		sync_standbys_defined;
 
 	/* used as a registry of physical / logical walsenders to wake */
 	ConditionVariable wal_flush_cv;
@@ -116,6 +116,14 @@ typedef struct
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
+/*
+ * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check standbys_defined or not,
+ * 8th bit check WalSndCtlData->sync_standbys_defined field has been initialized.
+ */
+#define SYNCSTANDBYSINITIALIZED	(1 << 0)
+#define SYNCSTANDBYSDEFINED		(1 << 2)
+#define SYNCSTANDBYSDEFINEDINITIALIZED	(SYNCSTANDBYSINITIALIZED & SYNCSTANDBYSDEFINED)
+
 extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
 
 
-- 
2.43.0

Kirill Reshke

reshkekirill@gmail.com

10 months ago

In reply to: Maksim.Melnikov (#1)

Re: sync_standbys_defined read/write race on startup

On Wed, 26 Mar 2025 at 21:18, Maksim.Melnikov <m.melnikov@postgrespro.ru> wrote:

So my question is
can we accept such behavior as bug of postgres codebase, or, on other way, it is bug of unstable test?
P.S.
For me it is looking strange, that postgres start processing workload before full config initializing.
Attached possible patch for codebase.

Best regards
Melnikov Maksim

We do not tolerate TAP test flaps. Your repro is OK, can we have
deterministic test here using injection point technique?

--
Best regards,
Kirill Reshke

Michael Paquier

michael@paquier.xyz

10 months ago

In reply to: Maksim.Melnikov (#1)

Re: sync_standbys_defined read/write race on startup

On Wed, Mar 26, 2025 at 05:17:17PM +0300, Maksim.Melnikov wrote:

Test check visibility of prepared transactions in standby after a restart
while primary is down.
Failed assert check that changes, commited on transaction on primary node,
were synchronously replicated to standby node.
In test flow we have primary server stop/start events.
Assert failure is hard to reproduce, but I found how to do it synthetically,
can do it by injecting in code synthetical sleep call
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)
        if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
        {
+               sleep(2);
                LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);

                /*

Nice investigation to find out that this specific path faces a race
here. Tests on Windows are usually slower and good at catching these.
It may be possible that the buildfarm has reported that, actually.

So my question is
can we accept such behavior as bug of postgres codebase, or, on other way,
it is bug of unstable test?
P.S.
For me it is looking strange, that postgres start processing workload before
full config initializing.
Attached possible patch for codebase.

Yeah, I agree that it is a bug, and your approach to be more careful
in SyncRepWaitForLSN() so as we wait until the configuration is loaded
and set from the checkpointer sounds like the sensible thing to do,
because the backends have no idea yet what they should do as the
configuration is not loaded.

Tracking if the configuration has been properly loaded in
WalSndCtlData makes sense here, but I am confused by the patch: you
have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets
it. It may be simpler to use a separate boolean flag for this
purpose. WalSndCtlData is a private structure holding the state of
the WAL senders internally, so ABI does not stress me much here
(flexible array at the end of the structure, as well..).

Also mentioned by Kirill downthread: let's add a regression test with
an injection point that does a wait. This race condition is worth
having a test for. You could define the point of where you have added
your hardcoded sleep(), for example.
--
Michael

Maksim.Melnikov

m.melnikov@postgrespro.ru

10 months ago

In reply to: Michael Paquier (#3)

1 attachment(s)

Re: sync_standbys_defined read/write race on startup

On 27.03.2025 01:13, Michael Paquier wrote:

Tracking if the configuration has been properly loaded in
WalSndCtlData makes sense here, but I am confused by the patch: you
have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets
it. It may be simpler to use a separate boolean flag for this
purpose. WalSndCtlData is a private structure holding the state of
the WAL senders internally, so ABI does not stress me much here
(flexible array at the end of the structure, as well..).

Thanks for your responses.
Yes, I agree that it looks more complicated then addition new bool field,
but there is place in SyncRepWaitForLSN method, where we check
WalSndCtlData->sync_standbys_defined out of acquire/release block.

void
SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
{
    int            mode;
...............................................
*    if (!SyncRepRequested() ||
        !((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
        return;*
..........................

If we just add new bool field, we can't atomically check
sync_standbys_defined flag
and new bool field, only in lock, so we lose fast exit optimization. But
I agree
that in patch I lost SYNCSTANDBYSDEFINED setting, thanks for attentiveness,
so I've attached new patch with fix,also I've done bits setting more
evident.

Anyway, if you still thinks that my arguments not enough, please notify,
I will do patch with separate flag.

On 27.03.2025 01:13, Michael Paquier wrote:

Also mentioned by Kirill downthread: let's add a regression test with
an injection point that does a wait. This race condition is worth
having a test for. You could define the point of where you have added
your hardcoded sleep(), for example.

Sorry, but for me it seems that we can't use injection points toolset
here, because we are talking
about startup process, and we can't call injection_points_attach on
client backend before checkpointer.

Anyway, if you know how to do it, please share details, I will do.

Best regards,

Maksim Melnikov

Attachments:

0002-Fix-sync_standbys_defined-race-on-startup.patchtext/x-patch; charset=UTF-8; name=0002-Fix-sync_standbys_defined-race-on-startup.patchDownload

From c889fd8119642a4e9d3fb3e24fe5d6cd0e656ca5 Mon Sep 17 00:00:00 2001
From: Maksim Melnikov <m.melnikov@postgrespro.ru>
Date: Wed, 26 Mar 2025 16:44:09 +0300
Subject: [PATCH] Fix sync_standbys_defined race on startup.

WalSndCtl->sync_standbys_defined, can be updated
by checkpointer process only and in backends it
can be read. So, on postgres startup, we have race
between checkpointer process and backends, that check
sync_standbys_defined for sync replications.

To avoid this, now we are working with sync_standbys_defined
not as bool field, but as bitmask, where 7th bit check
standbys_defined or not, 8th bit check
WalSndCtlData->sync_standbys_defined field has been
initialized by checkpointer. So if in synchronous replication functions
was detected that sync_standbys_defined is not initialized,
check SyncStandbysDefined() value.
---
 src/backend/replication/syncrep.c           | 45 +++++++++++++++++----
 src/include/replication/walsender_private.h | 10 ++++-
 2 files changed, 47 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index d75e3968035..d2ba4d604ed 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -148,6 +148,7 @@ void
 SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 {
 	int			mode;
+	bool		sync_standbys_defined;
 
 	/*
 	 * This should be called while holding interrupts during a transaction
@@ -170,7 +171,8 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * it's false, the lock is not necessary because we don't touch the queue.
 	 */
 	if (!SyncRepRequested() ||
-		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined & SYNCSTANDBYSDEFINEDINITIALIZED)
+		== SYNCSTANDBYSINITIALIZED)
 		return;
 
 	/* Cap the level for anything other than commit to remote flush only. */
@@ -184,6 +186,8 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+	sync_standbys_defined = WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED
+		? WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED : SyncStandbysDefined();
 
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
@@ -193,7 +197,7 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * condition but we'll be fetching that cache line anyway so it's likely
 	 * to be a low cost check.
 	 */
-	if (!WalSndCtl->sync_standbys_defined ||
+	if (!sync_standbys_defined ||
 		lsn <= WalSndCtl->lsn[mode])
 	{
 		LWLockRelease(SyncRepLock);
@@ -920,18 +924,19 @@ SyncRepWakeQueue(bool all, int mode)
 void
 SyncRepUpdateSyncStandbysDefined(void)
 {
-	bool		sync_standbys_defined = SyncStandbysDefined();
+	bool		new_sync_standbys_defined = SyncStandbysDefined();
+	bool		old_sync_standbys_defined = (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED);
 
-	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	if (new_sync_standbys_defined != old_sync_standbys_defined)
 	{
+		uint8 nv;
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
-
 		/*
 		 * If synchronous_standby_names has been reset to empty, it's futile
 		 * for backends to continue waiting.  Since the user no longer wants
 		 * synchronous replication, we'd better wake them up.
 		 */
-		if (!sync_standbys_defined)
+		if (!new_sync_standbys_defined)
 		{
 			int			i;
 
@@ -946,8 +951,34 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * backend that hasn't yet reloaded its config might go to sleep on
 		 * the queue (and never wake up).  This prevents that.
 		 */
-		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
 
+		/* set initialized bit.*/
+		nv = WalSndCtl->sync_standbys_defined | SYNCSTANDBYSINITIALIZED;
+		if (new_sync_standbys_defined)
+		{
+  			/* set defined bit.*/
+			nv |= SYNCSTANDBYSDEFINED;
+		}
+		else
+		{
+			/* reset defined bit.*/
+			nv &= (~SYNCSTANDBYSDEFINED);
+		}
+
+		WalSndCtl->sync_standbys_defined = nv;
+
+		LWLockRelease(SyncRepLock);
+	}
+	else if (!(WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED))
+	{
+		/*
+		 * Even if new_sync_standbys_defined == old_sync_standbys_defined, we
+		 * should check WalSndCtl->sync_standbys_defined is initialized and
+		 * set its initialize bit, if it reset. Anyway this bit should be set
+		 * only on start and then it will be always setted.
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->sync_standbys_defined |= SYNCSTANDBYSINITIALIZED;
 		LWLockRelease(SyncRepLock);
 	}
 }
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0fc77f1b4af..cc6a98b7e31 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -100,7 +100,7 @@ typedef struct
 	 * config file safely, so checkpointer updates this value as needed.
 	 * Protected by SyncRepLock.
 	 */
-	bool		sync_standbys_defined;
+	uint8		sync_standbys_defined;
 
 	/* used as a registry of physical / logical walsenders to wake */
 	ConditionVariable wal_flush_cv;
@@ -116,6 +116,14 @@ typedef struct
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
+/*
+ * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check standbys_defined or not,
+ * 8th bit check WalSndCtlData->sync_standbys_defined field has been initialized.
+ */
+#define SYNCSTANDBYSINITIALIZED	(1 << 0)
+#define SYNCSTANDBYSDEFINED		(1 << 1)
+#define SYNCSTANDBYSDEFINEDINITIALIZED	(SYNCSTANDBYSINITIALIZED & SYNCSTANDBYSDEFINED)
+
 extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
 
 
-- 
2.43.0

Maksim.Melnikov

m.melnikov@postgrespro.ru

9 months ago

In reply to: Michael Paquier (#3)

1 attachment(s)

Re: sync_standbys_defined read/write race on startup

Hi everyone, thanks for your comments.
I've just wanted to share little cosmetic
fixes for previous patch.

Best Regards,
Maksim Melnikov

Attachments:

0003-Fix-sync_standbys_defined-race-on-startup.patchtext/x-patch; charset=UTF-8; name=0003-Fix-sync_standbys_defined-race-on-startup.patchDownload

From 748c5fe8efd4c6365f380f7f9bbeba5e91887ab4 Mon Sep 17 00:00:00 2001
From: Maksim Melnikov <m.melnikov@postgrespro.ru>
Date: Wed, 26 Mar 2025 16:44:09 +0300
Subject: [PATCH] Fix sync_standbys_defined race on startup.

WalSndCtl->sync_standbys_defined, can be updated
by checkpointer process only and in backends it
can be read. So, on postgres startup, we have race
between checkpointer process and backends, that check
sync_standbys_defined for sync replications.

To avoid this, now we are working with sync_standbys_defined
not as bool field, but as bitmask, where 7th bit check
standbys_defined or not, 8th bit check
WalSndCtlData->sync_standbys_defined field has been
initialized by checkpointer. So if in synchronous replication functions
was detected that sync_standbys_defined is not initialized,
check SyncStandbysDefined() value.
---
 src/backend/replication/syncrep.c           | 31 +++++++++++++++++----
 src/include/replication/walsender_private.h | 18 +++++++++++-
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index d75e3968035..60b9507aa41 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -148,6 +148,7 @@ void
 SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 {
 	int			mode;
+	bool		sync_standbys_defined;
 
 	/*
 	 * This should be called while holding interrupts during a transaction
@@ -170,7 +171,7 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * it's false, the lock is not necessary because we don't touch the queue.
 	 */
 	if (!SyncRepRequested() ||
-		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		IS_SYNCSTANDBYS_INITIALIZED_ONLY(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined))
 		return;
 
 	/* Cap the level for anything other than commit to remote flush only. */
@@ -184,6 +185,8 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+	sync_standbys_defined = WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED
+		? (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED) : SyncStandbysDefined();
 
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
@@ -193,7 +196,7 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * condition but we'll be fetching that cache line anyway so it's likely
 	 * to be a low cost check.
 	 */
-	if (!WalSndCtl->sync_standbys_defined ||
+	if (!sync_standbys_defined ||
 		lsn <= WalSndCtl->lsn[mode])
 	{
 		LWLockRelease(SyncRepLock);
@@ -920,9 +923,10 @@ SyncRepWakeQueue(bool all, int mode)
 void
 SyncRepUpdateSyncStandbysDefined(void)
 {
-	bool		sync_standbys_defined = SyncStandbysDefined();
+	bool		new_sync_standbys_defined = SyncStandbysDefined();
+	bool		old_sync_standbys_defined = (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED);
 
-	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	if (new_sync_standbys_defined != old_sync_standbys_defined)
 	{
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
@@ -931,7 +935,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * for backends to continue waiting.  Since the user no longer wants
 		 * synchronous replication, we'd better wake them up.
 		 */
-		if (!sync_standbys_defined)
+		if (!new_sync_standbys_defined)
 		{
 			int			i;
 
@@ -946,8 +950,23 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * backend that hasn't yet reloaded its config might go to sleep on
 		 * the queue (and never wake up).  This prevents that.
 		 */
-		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
 
+		WalSndCtl->sync_standbys_defined = new_sync_standbys_defined
+			? SET_SYNCSTANDBYS_DEFINED(WalSndCtl->sync_standbys_defined)
+			: RESET_SYNCSTANDBYS_DEFINED(WalSndCtl->sync_standbys_defined);
+
+		LWLockRelease(SyncRepLock);
+	}
+	else if (!(WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED))
+	{
+		/*
+		 * Even if new_sync_standbys_defined == old_sync_standbys_defined, we
+		 * should check WalSndCtl->sync_standbys_defined is initialized and
+		 * set its initialize bit, if it reset. Anyway this bit should be set
+		 * only on start and then it will be always setted.
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SET_SYNCSTANDBYS_INITIALIZED(WalSndCtl->sync_standbys_defined);
 		LWLockRelease(SyncRepLock);
 	}
 }
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0fc77f1b4af..dd931b81e4f 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -100,7 +100,7 @@ typedef struct
 	 * config file safely, so checkpointer updates this value as needed.
 	 * Protected by SyncRepLock.
 	 */
-	bool		sync_standbys_defined;
+	uint8		sync_standbys_defined;
 
 	/* used as a registry of physical / logical walsenders to wake */
 	ConditionVariable wal_flush_cv;
@@ -116,6 +116,22 @@ typedef struct
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
+/*
+ * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check standbys_defined or not,
+ * 8th bit check WalSndCtlData->sync_standbys_defined field has been initialized.
+ */
+#define SYNCSTANDBYSINITIALIZED	(1 << 0)
+#define SYNCSTANDBYSDEFINED		(1 << 1)
+
+#define SET_SYNCSTANDBYS_INITIALIZED(sync_standbys)	\
+	(sync_standbys | SYNCSTANDBYSINITIALIZED)
+#define SET_SYNCSTANDBYS_DEFINED(sync_standbys)	\
+	(sync_standbys | SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)
+#define RESET_SYNCSTANDBYS_DEFINED(sync_standbys)	\
+	(SET_SYNCSTANDBYS_INITIALIZED(sync_standbys) & (~SYNCSTANDBYSDEFINED))
+#define IS_SYNCSTANDBYS_INITIALIZED_ONLY(sync_standbys)	\
+	((sync_standbys & (SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)) == SYNCSTANDBYSINITIALIZED)
+
 extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
 
 
-- 
2.43.0

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Maksim.Melnikov (#5)

Re: sync_standbys_defined read/write race on startup

On Mon, Apr 07, 2025 at 11:39:23PM +0300, Maksim.Melnikov wrote:

Hi everyone, thanks for your comments.
I've just wanted to share little cosmetic
fixes for previous patch.

Note: I would suggest to use a version number for the patches, rather
than a 0003 which usually implies that this is third patch in a
series. So I would use a command like that when generating a v4 of
this single patch:
git format-patch -1 -v4

`git am` is OK with what you have sent, but it's just a bit confusing.
Anyway, back to the patch..

I am still playing with the patch and will most probably send an
update tomorrow, but the injection point technique should be rather
straight-forward. You have done most of the work by finding a
problematic case in 009_twophase.pl, so I would suggest the following:
- Instead of the sleep(), add a macro INJECTION_POINT() with a new
name.
- Then implement a TAP test that reproduces the failure by adding a
"wait" point, combined with a "wakeup", probably with a BackgroundPsql
object to keep the problematic SQL sequence alive while it waits.

I would suggest to look at what we do with create-restart-point in the
recovery test 041_checkpoint_at_promote.pl as a starting point. For
background sessions, also look at 007_catcache_inval.pl in test_misc.
The new test may be better in 009_twophase.pl, conditional depending
on the existence of the extension "injection_points". You can check
how it is done in the other TAP tests; these rely on check_extension()
and $ENV{enable_injection_points}.
--
Michael

Maksim.Melnikov

m.melnikov@postgrespro.ru

9 months ago

In reply to: Michael Paquier (#6)

1 attachment(s)

Re: sync_standbys_defined read/write race on startup

Hi Michael, thanks for your answer.

Test 041_checkpoint_at_promote.pl is really good example
of using injection points, but anyway, I still don't see
how injection points can help us here. In failed test case
we started postgres, immediately open psql connection and commit
prepared transaction. Postgres, on start, before accepting
connections, fork checkpointer and checkpointer initialize
shared parameter sync_standbys_defined in CheckpointerMain,
before process loop. So, in most attempts, we can't call
injection_points_attach in psql before
macro INJECTION_POINT() code will be executed in
checkpointer(I mean INJECTION_POINT() code instead sleep()),
bacause checkpointer process is forking before database
system is ready to accept connections.

The manual execution of checkpoint can't help us too.

P.S.
sorry for style errors, I am new in postgres, so can miss
some rules. Anyway, to avoid any community rules mistakes,
attached patch with correct naming.

Thanks.

Best regards,
Maksim Melnikov

Attachments:

v4-0001-Fix-sync_standbys_defined-race-on-startup.patchtext/x-patch; charset=UTF-8; name=v4-0001-Fix-sync_standbys_defined-race-on-startup.patchDownload

From 8804867a70c9c1c946683b9b9c455da2d1504aa0 Mon Sep 17 00:00:00 2001
From: Maksim Melnikov <m.melnikov@postgrespro.ru>
Date: Wed, 26 Mar 2025 16:44:09 +0300
Subject: [PATCH v4] Fix sync_standbys_defined race on startup.

WalSndCtl->sync_standbys_defined, can be updated
by checkpointer process only and in backends it
can be read. So, on postgres startup, we have race
between checkpointer process and backends, that check
sync_standbys_defined for sync replications.

To avoid this, now we are working with sync_standbys_defined
not as bool field, but as bitmask, where 7th bit check
standbys_defined or not, 8th bit check
WalSndCtlData->sync_standbys_defined field has been
initialized by checkpointer. So if in synchronous replication functions
was detected that sync_standbys_defined is not initialized,
check SyncStandbysDefined() value.
---
 src/backend/replication/syncrep.c           | 32 +++++++++++++++++----
 src/backend/replication/walsender.c         |  2 +-
 src/include/replication/walsender_private.h | 18 +++++++++++-
 3 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index d75e3968035..c049e82286a 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -85,6 +85,7 @@
 #include "tcop/tcopprot.h"
 #include "utils/guc_hooks.h"
 #include "utils/ps_status.h"
+#include "utils/injection_point.h"
 
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
@@ -148,6 +149,7 @@ void
 SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 {
 	int			mode;
+	bool		sync_standbys_defined;
 
 	/*
 	 * This should be called while holding interrupts during a transaction
@@ -170,7 +172,7 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * it's false, the lock is not necessary because we don't touch the queue.
 	 */
 	if (!SyncRepRequested() ||
-		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		IS_SYNCSTANDBYS_INITIALIZED_ONLY(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined))
 		return;
 
 	/* Cap the level for anything other than commit to remote flush only. */
@@ -184,6 +186,8 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+	sync_standbys_defined = WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED
+		? (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED) : SyncStandbysDefined();
 
 	/*
 	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
@@ -193,7 +197,7 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * condition but we'll be fetching that cache line anyway so it's likely
 	 * to be a low cost check.
 	 */
-	if (!WalSndCtl->sync_standbys_defined ||
+	if (!sync_standbys_defined ||
 		lsn <= WalSndCtl->lsn[mode])
 	{
 		LWLockRelease(SyncRepLock);
@@ -920,9 +924,10 @@ SyncRepWakeQueue(bool all, int mode)
 void
 SyncRepUpdateSyncStandbysDefined(void)
 {
-	bool		sync_standbys_defined = SyncStandbysDefined();
+	bool		new_sync_standbys_defined = SyncStandbysDefined();
+	bool		old_sync_standbys_defined = (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED);
 
-	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	if (new_sync_standbys_defined != old_sync_standbys_defined)
 	{
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
@@ -931,7 +936,7 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * for backends to continue waiting.  Since the user no longer wants
 		 * synchronous replication, we'd better wake them up.
 		 */
-		if (!sync_standbys_defined)
+		if (!new_sync_standbys_defined)
 		{
 			int			i;
 
@@ -946,8 +951,23 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * backend that hasn't yet reloaded its config might go to sleep on
 		 * the queue (and never wake up).  This prevents that.
 		 */
-		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
 
+		WalSndCtl->sync_standbys_defined = new_sync_standbys_defined
+			? SET_SYNCSTANDBYS_DEFINED(WalSndCtl->sync_standbys_defined)
+			: RESET_SYNCSTANDBYS_DEFINED(WalSndCtl->sync_standbys_defined);
+
+		LWLockRelease(SyncRepLock);
+	}
+	else if (!(WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED))
+	{
+		/*
+		 * Even if new_sync_standbys_defined == old_sync_standbys_defined, we
+		 * should check WalSndCtl->sync_standbys_defined is initialized and
+		 * set its initialize bit, if it reset. Anyway this bit should be set
+		 * only on start and then it will be always setted.
+		 */
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		WalSndCtl->sync_standbys_defined = SET_SYNCSTANDBYS_INITIALIZED(WalSndCtl->sync_standbys_defined);
 		LWLockRelease(SyncRepLock);
 	}
 }
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 1028919aecb..c181c7ee977 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1676,7 +1676,7 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	 */
 	if (skipped_xact &&
 		SyncRepRequested() &&
-		((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined & SYNCSTANDBYSDEFINED))
 	{
 		WalSndKeepalive(false, lsn);
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0fc77f1b4af..dd931b81e4f 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -100,7 +100,7 @@ typedef struct
 	 * config file safely, so checkpointer updates this value as needed.
 	 * Protected by SyncRepLock.
 	 */
-	bool		sync_standbys_defined;
+	uint8		sync_standbys_defined;
 
 	/* used as a registry of physical / logical walsenders to wake */
 	ConditionVariable wal_flush_cv;
@@ -116,6 +116,22 @@ typedef struct
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
+/*
+ * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check standbys_defined or not,
+ * 8th bit check WalSndCtlData->sync_standbys_defined field has been initialized.
+ */
+#define SYNCSTANDBYSINITIALIZED	(1 << 0)
+#define SYNCSTANDBYSDEFINED		(1 << 1)
+
+#define SET_SYNCSTANDBYS_INITIALIZED(sync_standbys)	\
+	(sync_standbys | SYNCSTANDBYSINITIALIZED)
+#define SET_SYNCSTANDBYS_DEFINED(sync_standbys)	\
+	(sync_standbys | SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)
+#define RESET_SYNCSTANDBYS_DEFINED(sync_standbys)	\
+	(SET_SYNCSTANDBYS_INITIALIZED(sync_standbys) & (~SYNCSTANDBYSDEFINED))
+#define IS_SYNCSTANDBYS_INITIALIZED_ONLY(sync_standbys)	\
+	((sync_standbys & (SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)) == SYNCSTANDBYSINITIALIZED)
+
 extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
 
 
-- 
2.43.0

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Maksim.Melnikov (#7)

1 attachment(s)

Re: sync_standbys_defined read/write race on startup

On Wed, Apr 09, 2025 at 06:21:50PM +0300, Maksim.Melnikov wrote:

Test 041_checkpoint_at_promote.pl is really good example
of using injection points, but anyway, I still don't see
how injection points can help us here. In failed test case
we started postgres, immediately open psql connection and commit
prepared transaction. Postgres, on start, before accepting
connections, fork checkpointer and checkpointer initialize
shared parameter sync_standbys_defined in CheckpointerMain,
before process loop. So, in most attempts, we can't call
injection_points_attach in psql before
macro INJECTION_POINT() code will be executed in
checkpointer(I mean INJECTION_POINT() code instead sleep()),
bacause checkpointer process is forking before database
system is ready to accept connections.

Confirmed. One thing where it would be possible to make things work
is to introduce some persistency of the injection points, say these
are flushed at shutdown. We could do that without touching at the
backend code and only in the module injection_points, but perhaps this
thread is not enough to justify this extra complexity.

The manual execution of checkpoint can't help us too.

P.S.
sorry for style errors, I am new in postgres, so can miss
some rules. Anyway, to avoid any community rules mistakes,
attached patch with correct naming.

No worries, we are all here to learn, as am I.

+    sync_standbys_defined = WalSndCtl->sync_standbys_defined & SYNCSTANDBYSINITIALIZED
+        ? (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED) : SyncStandbysDefined();

So this is the core point of the patch: if the sync standby data has
not been initialized, we try the only thing we can do with a check on
the GUC value. Then:
- If the GUC has no value, then we know for sure that there is no
point to wait, so the patch skips the wait.
- If the GUC has a value, it may be possible that we have to wait but
we are not sure yet without letting the checkpointer process the GUC
and update the information in shared memory, so we decide that it's
better to wait and let the checkpointer process the setup until it
wakes up the queue, relying on the fact that there should be no
processes that wait as long as the GUC is not set.

This is mentioned in your commit message but there is close to zero
explanation about the reason of this choice, and even worse this fact
is hidden to the reader of the code because of the overly-complex
setup around sync_standbys_defined and the fact that there is no
comment to document this fact.

+/*
+ * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check standbys_defined or not,
+ * 8th bit check WalSndCtlData->sync_standbys_defined field has been initialized.
+ */
+#define SYNCSTANDBYSINITIALIZED    (1 << 0)
+#define SYNCSTANDBYSDEFINED        (1 << 1)
+
+#define SET_SYNCSTANDBYS_INITIALIZED(sync_standbys)    \
+    (sync_standbys | SYNCSTANDBYSINITIALIZED)
+#define SET_SYNCSTANDBYS_DEFINED(sync_standbys)    \
+    (sync_standbys | SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)
+#define RESET_SYNCSTANDBYS_DEFINED(sync_standbys)    \
+    (SET_SYNCSTANDBYS_INITIALIZED(sync_standbys) & (~SYNCSTANDBYSDEFINED))
+#define IS_SYNCSTANDBYS_INITIALIZED_ONLY(sync_standbys)    \
+    ((sync_standbys & (SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)) == SYNCSTANDBYSINITIALIZED)
+

The use of the flags is a good idea to protect the atomicity of the
lookup in the fast path, but hiding that in a set of macros, with some
of them being used in only one location reduces the readability of the
code in walsender.c and syncrep.c because we would need to do a
permanent mapping with what walsender_private.h uses. The reset one
depends on the set one, which is also confusing.

A couple of comments in walsender.c and syncrep.c also need a refresh
around the manipulations of sync_standbys_defined.

Also, the name "sync_standbys_defined" is not adapted anymore, as it
stores a status of the data in shared memory associated to the sync
standbys, so I've switched that to sync_standbys_status, with bits8 as
type so as we have the same size as a boolean.

I have taken a stab at all that, finishing with the attached. The
patch has a pg_usleep() that I have used to stress the CI. That
should be removed in the final result, obviously.

Side note: I am pretty sure that this issue has been the origin of
some spurious failures in the buildfarm. I cannot point my finger to
a particular one now, but it would not go unnoticed especially on the
slow animals..

What do you think?
--
Michael

Attachments:

v5-0001-Fix-sync_standbys_defined-race-on-startup.patchtext/x-diff; charset=us-asciiDownload

From f60a004116bc21440b8769c75f11126ac4ef667f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 10 Apr 2025 11:12:41 +0900
Subject: [PATCH v5] Fix sync_standbys_defined race on startup.

WalSndCtl->sync_standbys_defined, can be updated by checkpointer process
only and in backends it can be read. So, on postgres startup, we have
race between checkpointer process and backends.

Notes: blah..
---
 src/include/replication/walsender_private.h | 23 +++++-
 src/backend/replication/syncrep.c           | 85 +++++++++++++++++----
 src/backend/replication/walsender.c         |  6 +-
 3 files changed, 92 insertions(+), 22 deletions(-)

diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0fc77f1b4af7..92243cce1d04 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -96,11 +96,11 @@ typedef struct
 	XLogRecPtr	lsn[NUM_SYNC_REP_WAIT_MODE];
 
 	/*
-	 * Are any sync standbys defined?  Waiting backends can't reload the
-	 * config file safely, so checkpointer updates this value as needed.
-	 * Protected by SyncRepLock.
+	 * Status of data related to the synchronous standbys.  Waiting backends
+	 * can't reload the config file safely, so checkpointer updates this value
+	 * as needed. Protected by SyncRepLock.
 	 */
-	bool		sync_standbys_defined;
+	bits8		sync_standbys_status;
 
 	/* used as a registry of physical / logical walsenders to wake */
 	ConditionVariable wal_flush_cv;
@@ -116,6 +116,21 @@ typedef struct
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
+/* Flags for WalSndCtlData->sync_standbys_status */
+
+/*
+ * Is the synchronous standby data initialized from the GUC?  This is flipped
+ * the first time synchronous_standby_names is processed by the checkpointer.
+ */
+#define SYNC_STANDBY_INIT			(1 << 0)
+
+/*
+ * Is the synchronous standby data defined?  This is set when
+ * synchronous_standby_names has some data, after being processed by the
+ * checkpointer.
+ */
+#define SYNC_STANDBY_DEFINED		(1 << 1)
+
 extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
 
 
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index d75e39680355..4c9f550f7dc3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -161,16 +161,23 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * sync replication standby names defined.
 	 *
 	 * Since this routine gets called every commit time, it's important to
-	 * exit quickly if sync replication is not requested. So we check
-	 * WalSndCtl->sync_standbys_defined flag without the lock and exit
-	 * immediately if it's false. If it's true, we need to check it again
-	 * later while holding the lock, to check the flag and operate the sync
-	 * rep queue atomically. This is necessary to avoid the race condition
-	 * described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
-	 * it's false, the lock is not necessary because we don't touch the queue.
+	 * exit quickly if sync replication is not requested.
+	 *
+	 * We check WalSndCtl->sync_standbys_status flag without the lock and exit
+	 * immediately if SYNC_STANDBY_INIT is set (the checkpointer has
+	 * initialized this data) but SYNC_STANDBY_DEFINED is missing (no sync
+	 * replication requested).
+	 *
+	 * If SYNC_STANDBY_DEFINED is set, we need to check the status again later
+	 * while holding the lock, to check the flag and operate the sync rep
+	 * queue atomically.  This is necessary to avoid the race condition
+	 * described in SyncRepUpdateSyncStandbysDefined().  On the other hand, if
+	 * SYNC_STANDBY_DEFINED is not set, the lock is not necessary because we
+	 * don't touch the queue.
 	 */
 	if (!SyncRepRequested() ||
-		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		((((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status) &
+		 (SYNC_STANDBY_INIT | SYNC_STANDBY_DEFINED)) == SYNC_STANDBY_INIT)
 		return;
 
 	/* Cap the level for anything other than commit to remote flush only. */
@@ -186,16 +193,38 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
 	/*
-	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
-	 * set.  See SyncRepUpdateSyncStandbysDefined.
+	 * We don't wait for sync rep if SYNC_STANDBY_DEFINED is not set.  See
+	 * SyncRepUpdateSyncStandbysDefined().
 	 *
 	 * Also check that the standby hasn't already replied. Unlikely race
 	 * condition but we'll be fetching that cache line anyway so it's likely
 	 * to be a low cost check.
+	 *
+	 * If the sync standby data has not been initialized yet
+	 * (SYNC_STANDBY_INIT is not set), then fall back to a direct GUC check.
 	 */
-	if (!WalSndCtl->sync_standbys_defined ||
-		lsn <= WalSndCtl->lsn[mode])
+	if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
 	{
+		if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 ||
+			lsn <= WalSndCtl->lsn[mode])
+		{
+			LWLockRelease(SyncRepLock);
+			return;
+		}
+	}
+	else if (!SyncStandbysDefined())
+	{
+		/*
+		 * If we are here, the sync standby data has not been initialized yet,
+		 * so fall back to the best thing we can do: a check on
+		 * SyncStandbysDefined() to see if the GUC is set or not.
+		 *
+		 * If the GUC has a value, we wait until the checkpointer updates the
+		 * status data because we cannot be sure yet if we should wait or not.
+		 * If the GUC has *no* value, we are sure that there is no point to
+		 * wait; this matters for example when initializing a cluster, where
+		 * we should never wait, and no sync standbys is the default behavior.
+		 */
 		LWLockRelease(SyncRepLock);
 		return;
 	}
@@ -912,7 +941,7 @@ SyncRepWakeQueue(bool all, int mode)
 
 /*
  * The checkpointer calls this as needed to update the shared
- * sync_standbys_defined flag, so that backends don't remain permanently wedged
+ * sync_standbys_status flag, so that backends don't remain permanently wedged
  * if synchronous_standby_names is unset.  It's safe to check the current value
  * without the lock, because it's only ever updated by one process.  But we
  * must take the lock to change it.
@@ -922,8 +951,11 @@ SyncRepUpdateSyncStandbysDefined(void)
 {
 	bool		sync_standbys_defined = SyncStandbysDefined();
 
-	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	if (sync_standbys_defined !=
+		((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) != 0))
 	{
+		pg_usleep(2 * 1000 * 1000L);	/* 2s */
+
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 		/*
@@ -946,7 +978,30 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * backend that hasn't yet reloaded its config might go to sleep on
 		 * the queue (and never wake up).  This prevents that.
 		 */
-		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
+		WalSndCtl->sync_standbys_status = SYNC_STANDBY_INIT |
+			(sync_standbys_defined ? SYNC_STANDBY_DEFINED : 0);
+
+		LWLockRelease(SyncRepLock);
+	}
+	else if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT) == 0)
+	{
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+		/*
+		 * Note that there is no need to wake up the queues here.  We would
+		 * reach this path only if SyncStandbysDefined() returns false, or it
+		 * would mean that some backends are waiting with the GUC set.  See
+		 * SyncRepWaitForLSN().
+		 */
+		Assert(!SyncStandbysDefined());
+
+		/*
+		 * Even if there is no sync standby defined, let the readers of this
+		 * information know that the sync standby data has been initialized.
+		 * This can just be done once, hence the previous check on
+		 * SYNC_STANDBY_INIT to avoid useless work.
+		 */
+		WalSndCtl->sync_standbys_status |= SYNC_STANDBY_INIT;
 
 		LWLockRelease(SyncRepLock);
 	}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 216baeda5cd2..987a7336ec84 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1674,13 +1674,13 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	 * When skipping empty transactions in synchronous replication, we send a
 	 * keepalive message to avoid delaying such transactions.
 	 *
-	 * It is okay to check sync_standbys_defined flag without lock here as in
-	 * the worst case we will just send an extra keepalive message when it is
+	 * It is okay to check sync_standbys_status without lock here as in the
+	 * worst case we will just send an extra keepalive message when it is
 	 * really not required.
 	 */
 	if (skipped_xact &&
 		SyncRepRequested() &&
-		((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status & SYNC_STANDBY_DEFINED))
 	{
 		WalSndKeepalive(false, lsn);
 
-- 
2.49.0

Maksim.Melnikov

m.melnikov@postgrespro.ru

9 months ago

In reply to: Michael Paquier (#8)

Re: sync_standbys_defined read/write race on startup

On 10.04.2025 05:25, Michael Paquier wrote:

Confirmed. One thing where it would be possible to make things work
is to introduce some persistency of the injection points, say these
are flushed at shutdown. We could do that without touching at the
backend code and only in the module injection_points, but perhaps this
thread is not enough to justify this extra complexity.

Agree, it should be discussed in another thread.

On 10.04.2025 05:25, Michael Paquier wrote:

What do you think?

Yes, patch for me looks ok, only one suggestion, I am not sure

but I am afraid we are loosing this.

         if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
         {
                 if ((WalSndCtl->sync_standbys_status & 
SYNC_STANDBY_DEFINED) == 0 ||
                         lsn <= WalSndCtl->lsn[mode])
                 {
                         LWLockRelease(SyncRepLock);
                         return;
                 }
         }
         else if (!SyncStandbysDefined())
         {
                 /*
                  * If we are here, the sync standby data has not been 
initialized yet,
                  * so fall back to the best thing we can do: a check on
                  * SyncStandbysDefined() to see if the GUC is set or not.
                  *
                  * If the GUC has a value, we wait until the 
checkpointer updates the
                  * status data because we cannot be sure yet if we 
should wait or not.
                  * If the GUC has *no* value, we are sure that there is 
no point to
                  * wait; this matters for example when initializing a 
cluster, where
                  * we should never wait, and no sync standbys is the 
default behavior.
                  */
                 LWLockRelease(SyncRepLock);
                 return;
         }
+       else if (lsn <= WalSndCtl->lsn[mode])
+       {
+               LWLockRelease(SyncRepLock);
+               return;
+       }

What do you think?

Best regards,

Maksim Melnikov

#10

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Maksim.Melnikov (#9)

1 attachment(s)

Re: sync_standbys_defined read/write race on startup

On Thu, Apr 10, 2025 at 10:12:34AM +0300, Maksim.Melnikov wrote:

but I am afraid we are loosing this.

if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
{
if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 ||
lsn <= WalSndCtl->lsn[mode])
{
LWLockRelease(SyncRepLock);
return;
}
}
[...]
+ else if (lsn <= WalSndCtl->lsn[mode])
+ {
+ LWLockRelease(SyncRepLock);
+ return;
+ }

What do you think?

Hmm, yeah. Instead of last, it would be better to put it in second
place perhaps, for clarity? That would be the same at the end, but we
would be slightly more consistent with the past logic regarding the
ordering. Does that look OK to you?
--
Michael

Attachments:

v6-0001-Fix-sync_standbys_defined-race-on-startup.patchtext/x-diff; charset=us-asciiDownload

From b3b43a7eff2fd83a118cab13b99acd6cc831312f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 10 Apr 2025 18:13:08 +0900
Subject: [PATCH v6] Fix sync_standbys_defined race on startup.

WalSndCtl->sync_standbys_defined, can be updated by checkpointer process
only and in backends it can be read. So, on postgres startup, we have
race between checkpointer process and backends.

Notes: blah..
---
 src/include/replication/walsender_private.h | 23 ++++-
 src/backend/replication/syncrep.c           | 97 +++++++++++++++++----
 src/backend/replication/walsender.c         |  6 +-
 3 files changed, 104 insertions(+), 22 deletions(-)

diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0fc77f1b4af7..92243cce1d04 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -96,11 +96,11 @@ typedef struct
 	XLogRecPtr	lsn[NUM_SYNC_REP_WAIT_MODE];
 
 	/*
-	 * Are any sync standbys defined?  Waiting backends can't reload the
-	 * config file safely, so checkpointer updates this value as needed.
-	 * Protected by SyncRepLock.
+	 * Status of data related to the synchronous standbys.  Waiting backends
+	 * can't reload the config file safely, so checkpointer updates this value
+	 * as needed. Protected by SyncRepLock.
 	 */
-	bool		sync_standbys_defined;
+	bits8		sync_standbys_status;
 
 	/* used as a registry of physical / logical walsenders to wake */
 	ConditionVariable wal_flush_cv;
@@ -116,6 +116,21 @@ typedef struct
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
+/* Flags for WalSndCtlData->sync_standbys_status */
+
+/*
+ * Is the synchronous standby data initialized from the GUC?  This is flipped
+ * the first time synchronous_standby_names is processed by the checkpointer.
+ */
+#define SYNC_STANDBY_INIT			(1 << 0)
+
+/*
+ * Is the synchronous standby data defined?  This is set when
+ * synchronous_standby_names has some data, after being processed by the
+ * checkpointer.
+ */
+#define SYNC_STANDBY_DEFINED		(1 << 1)
+
 extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
 
 
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index d75e39680355..8b9b34dfc120 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -161,16 +161,23 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * sync replication standby names defined.
 	 *
 	 * Since this routine gets called every commit time, it's important to
-	 * exit quickly if sync replication is not requested. So we check
-	 * WalSndCtl->sync_standbys_defined flag without the lock and exit
-	 * immediately if it's false. If it's true, we need to check it again
-	 * later while holding the lock, to check the flag and operate the sync
-	 * rep queue atomically. This is necessary to avoid the race condition
-	 * described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
-	 * it's false, the lock is not necessary because we don't touch the queue.
+	 * exit quickly if sync replication is not requested.
+	 *
+	 * We check WalSndCtl->sync_standbys_status flag without the lock and exit
+	 * immediately if SYNC_STANDBY_INIT is set (the checkpointer has
+	 * initialized this data) but SYNC_STANDBY_DEFINED is missing (no sync
+	 * replication requested).
+	 *
+	 * If SYNC_STANDBY_DEFINED is set, we need to check the status again later
+	 * while holding the lock, to check the flag and operate the sync rep
+	 * queue atomically.  This is necessary to avoid the race condition
+	 * described in SyncRepUpdateSyncStandbysDefined().  On the other hand, if
+	 * SYNC_STANDBY_DEFINED is not set, the lock is not necessary because we
+	 * don't touch the queue.
 	 */
 	if (!SyncRepRequested() ||
-		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		((((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status) &
+		 (SYNC_STANDBY_INIT | SYNC_STANDBY_DEFINED)) == SYNC_STANDBY_INIT)
 		return;
 
 	/* Cap the level for anything other than commit to remote flush only. */
@@ -186,16 +193,50 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
 	/*
-	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
-	 * set.  See SyncRepUpdateSyncStandbysDefined.
+	 * We don't wait for sync rep if SYNC_STANDBY_DEFINED is not set.  See
+	 * SyncRepUpdateSyncStandbysDefined().
 	 *
 	 * Also check that the standby hasn't already replied. Unlikely race
 	 * condition but we'll be fetching that cache line anyway so it's likely
 	 * to be a low cost check.
+	 *
+	 * If the sync standby data has not been initialized yet
+	 * (SYNC_STANDBY_INIT is not set), then fall back to a direct GUC check.
 	 */
-	if (!WalSndCtl->sync_standbys_defined ||
-		lsn <= WalSndCtl->lsn[mode])
+	if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
 	{
+		if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 ||
+			lsn <= WalSndCtl->lsn[mode])
+		{
+			LWLockRelease(SyncRepLock);
+			return;
+		}
+	}
+	else if (lsn <= WalSndCtl->lsn[mode])
+	{
+		/*
+		 * The LSN is older than what we are waiting for in the queue.  The
+		 * sync standby data has not been initialized yet, but we are OK to
+		 * not wait because we know that there is no point in doing so if
+		 * based on the LSN.
+		 */
+		LWLockRelease(SyncRepLock);
+		return;
+	}
+	else if (!SyncStandbysDefined())
+	{
+		/*
+		 * If we are here, the sync standby data has not been initialized yet,
+		 * and the lsn is newer than what is in the queue, so fall back to the
+		 * best thing we can do in this case: a check on SyncStandbysDefined()
+		 * to see if the GUC is set or not.
+		 *
+		 * If the GUC has a value, we wait until the checkpointer updates the
+		 * status data because we cannot be sure yet if we should wait or not.
+		 * If the GUC has *no* value, we are sure that there is no point to
+		 * wait; this matters for example when initializing a cluster, where
+		 * we should never wait, and no sync standbys is the default behavior.
+		 */
 		LWLockRelease(SyncRepLock);
 		return;
 	}
@@ -912,7 +953,7 @@ SyncRepWakeQueue(bool all, int mode)
 
 /*
  * The checkpointer calls this as needed to update the shared
- * sync_standbys_defined flag, so that backends don't remain permanently wedged
+ * sync_standbys_status flag, so that backends don't remain permanently wedged
  * if synchronous_standby_names is unset.  It's safe to check the current value
  * without the lock, because it's only ever updated by one process.  But we
  * must take the lock to change it.
@@ -922,8 +963,11 @@ SyncRepUpdateSyncStandbysDefined(void)
 {
 	bool		sync_standbys_defined = SyncStandbysDefined();
 
-	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	if (sync_standbys_defined !=
+		((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) != 0))
 	{
+		pg_usleep(2 * 1000 * 1000L);	/* 2s */
+
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 		/*
@@ -946,7 +990,30 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * backend that hasn't yet reloaded its config might go to sleep on
 		 * the queue (and never wake up).  This prevents that.
 		 */
-		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
+		WalSndCtl->sync_standbys_status = SYNC_STANDBY_INIT |
+			(sync_standbys_defined ? SYNC_STANDBY_DEFINED : 0);
+
+		LWLockRelease(SyncRepLock);
+	}
+	else if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT) == 0)
+	{
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+		/*
+		 * Note that there is no need to wake up the queues here.  We would
+		 * reach this path only if SyncStandbysDefined() returns false, or it
+		 * would mean that some backends are waiting with the GUC set.  See
+		 * SyncRepWaitForLSN().
+		 */
+		Assert(!SyncStandbysDefined());
+
+		/*
+		 * Even if there is no sync standby defined, let the readers of this
+		 * information know that the sync standby data has been initialized.
+		 * This can just be done once, hence the previous check on
+		 * SYNC_STANDBY_INIT to avoid useless work.
+		 */
+		WalSndCtl->sync_standbys_status |= SYNC_STANDBY_INIT;
 
 		LWLockRelease(SyncRepLock);
 	}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 216baeda5cd2..987a7336ec84 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1674,13 +1674,13 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	 * When skipping empty transactions in synchronous replication, we send a
 	 * keepalive message to avoid delaying such transactions.
 	 *
-	 * It is okay to check sync_standbys_defined flag without lock here as in
-	 * the worst case we will just send an extra keepalive message when it is
+	 * It is okay to check sync_standbys_status without lock here as in the
+	 * worst case we will just send an extra keepalive message when it is
 	 * really not required.
 	 */
 	if (skipped_xact &&
 		SyncRepRequested() &&
-		((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status & SYNC_STANDBY_DEFINED))
 	{
 		WalSndKeepalive(false, lsn);
 
-- 
2.49.0

#11

Maksim.Melnikov

m.melnikov@postgrespro.ru

9 months ago

In reply to: Michael Paquier (#10)

Re: sync_standbys_defined read/write race on startup

On 10.04.2025 12:15, Michael Paquier wrote:

Hmm, yeah. Instead of last, it would be better to put it in second
place perhaps, for clarity? That would be the same at the end, but we
would be slightly more consistent with the past logic regarding the
ordering. Does that look OK to you?

Yes, from my point of view it looks fine.

Best regards,
Maksim Melnikov

#12

Michael Paquier

michael@paquier.xyz

9 months ago

In reply to: Maksim.Melnikov (#11)

Re: sync_standbys_defined read/write race on startup

On Thu, Apr 10, 2025 at 12:55:54PM +0300, Maksim.Melnikov wrote:

On 10.04.2025 12:15, Michael Paquier wrote:

Hmm, yeah. Instead of last, it would be better to put it in second
place perhaps, for clarity? That would be the same at the end, but we
would be slightly more consistent with the past logic regarding the
ordering. Does that look OK to you?

Yes, from my point of view it looks fine.

Thanks for the double-check. I've played a couple more hours with the
startup case, like playing with s_s_names set but uninitialized in
shmem while stucking backends, and that seems OK, so applied down to
v13. Let's see how it goes..
--
Michael