Race condition in replication slot usage introduced by commit f41d846

Started by Nisha Moond11 months ago3 messages
#1Nisha Moond
nisha.moond412@gmail.com
1 attachment(s)

Hi,

Commit f41d846 [1]https://github.com/postgres/postgres/commit/f41d8468ddea34170fe19fdc17b5a247e7d3ac78 introduced a race condition that allows a process
to acquire and continue using an invalidated replication slot, leading
to unexpected behavior.

Example Scenario:
Suppose there is an orphaned logical slot that will be invalidated on
the next checkpoint. Consider the following sequence of events during
ReplicationSlotAcquire() -
- A process (P1) attempts to acquire the slot. It checks the
invalidated flag and sees that the slot is still valid, so it proceeds
to set the s->active_pid to its own PID.
- Before P1 can acquire the spinlock and update s->active_pid =
MyProcPid, the checkpointer process invalidates the slot.
- Now, P1 proceeds to set s->active_pid = MyProcPid and starts using
an already invalidated slot, which should not happen.

Using an invalidated slot should be prevented in the first place. In
my test, this scenario triggered an assertion failure in
CreateDecodingContext() -

```
TRAP: failed Assert("slot->data.invalidated == RS_INVAL_NONE"), File:
"logical.c", Line: 546, PID: 1042884
postgres: nisha postgres [local]
SELECT(ExceptionalCondition+0xbb)[0x649ff7696c82]
postgres: nisha postgres [local]
SELECT(CreateDecodingContext+0x23b)[0x649ff73bf9d3]
postgres: nisha postgres [local] SELECT(+0x61fb71)[0x649ff73c3b71]
postgres: nisha postgres [local]
SELECT(pg_logical_slot_get_changes+0x26)[0x649ff73c3ed6]
...
```

The expected behavior was for ReplicationSlotAcquire() to detect the
invalid slot and return an error instead of allowing its usage.

Attached the patch v1 fixing this issue.

Issue reported by: Hou Zhijie.

[1]: https://github.com/postgres/postgres/commit/f41d8468ddea34170fe19fdc17b5a247e7d3ac78

--
Thanks,
Nisha

Attachments:

v1-0001-Fix-race-condition-in-ReplicationSlotAcquire-indr.patchapplication/octet-stream; name=v1-0001-Fix-race-condition-in-ReplicationSlotAcquire-indr.patchDownload
From 0282b2b315f2a3a8685eb5ce1e48ce8b57548821 Mon Sep 17 00:00:00 2001
From: Nisha Moond <nisha.moond412@gmail.com>
Date: Tue, 25 Feb 2025 17:33:01 +0530
Subject: [PATCH v1] Fix race condition in ReplicationSlotAcquire() indroduced
 by commit f41d846
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After commit f41d846, a process could acquire and use a replication slot
that had just been invalidated, leading to unexpected failures. This
occured because the invalidation check was performed before setting the
slot’s active_pid.

The fix ensures the check happens only after acquiring slot, i.e., after
setting s->active_pid = MyProcPid, preventing a race condition with
InvalidatePossiblyObsoleteSlot() and ensuring that invalid slots are not
mistakenly used.
---
 src/backend/replication/slot.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d089085..af388ad 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -580,19 +580,6 @@ retry:
 						name)));
 	}
 
-	/* Invalid slots can't be modified or used before accessing the WAL. */
-	if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
-	{
-		LWLockRelease(ReplicationSlotControlLock);
-
-		ereport(ERROR,
-				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				errmsg("can no longer access replication slot \"%s\"",
-					   NameStr(s->data.name)),
-				errdetail("This replication slot has been invalidated due to \"%s\".",
-						  GetSlotInvalidationCauseName(s->data.invalidated)));
-	}
-
 	/*
 	 * This is the slot we want; check if it's active under some other
 	 * process.  In single user mode, we don't need this check.
@@ -650,13 +637,32 @@ retry:
 	else if (!nowait)
 		ConditionVariableCancelSleep(); /* no sleep needed after all */
 
-	/* Let everybody know we've modified this slot */
-	ConditionVariableBroadcast(&s->active_cv);
-
 	/* We made this slot active, so it's ours now. */
 	MyReplicationSlot = s;
 
 	/*
+	 * Is the slot invalid? Check the invalidated flag only after acquiring
+	 * the slot, i.e., after setting s->active_pid as MyProcPid, to prevent a
+	 * race condition with InvalidatePossiblyObsoleteSlot().
+	 *
+	 * If this check is performed before acquiring the slot or without holding
+	 * a spinlock, the checkpointer process may invalidate the slot right
+	 * after the check but before slot's active_pid is set. This could result
+	 * in the process mistakenly acquiring and using an invalidated slot,
+	 * leading to unexpected behavior or failures later.
+	 */
+	if (error_if_invalid && s->data.invalidated != RS_INVAL_NONE)
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("can no longer access replication slot \"%s\"",
+					   NameStr(s->data.name)),
+				errdetail("This replication slot has been invalidated due to \"%s\".",
+						  GetSlotInvalidationCauseName(s->data.invalidated)));
+
+	/* Let everybody know we've modified this slot */
+	ConditionVariableBroadcast(&s->active_cv);
+
+	/*
 	 * The call to pgstat_acquire_replslot() protects against stats for a
 	 * different slot, from before a restart or such, being present during
 	 * pgstat_report_replslot().
-- 
1.8.3.1

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Nisha Moond (#1)
Re: Race condition in replication slot usage introduced by commit f41d846

On Wed, Feb 26, 2025 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote:

Attached the patch v1 fixing this issue.

Issue reported by: Hou Zhijie.

Thanks for the report and patch. I'll look into it.

--
With Regards,
Amit Kapila.

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#2)
Re: Race condition in replication slot usage introduced by commit f41d846

On Wed, Feb 26, 2025 at 7:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 26, 2025 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote:

Attached the patch v1 fixing this issue.

Issue reported by: Hou Zhijie.

Thanks for the report and patch. I'll look into it.

Pushed the patch with a slightly different comment.

--
With Regards,
Amit Kapila.