Visibility bug in tuple lock

Started by Jasper Smit · 3 months ago · 11 messages
#1 Jasper Smit
jbsmit@gmail.com
2 attachment(s)

Hi,

We found a bug in heap_lock_tuple(). It is unsafe to follow the t_ctid
pointer of updated tuples in the presence of aborted tuples: VACUUM can
mark aborted tuples LP_UNUSED even if there are still visible tuples
whose t_ctid points to them.

We would like to address this issue in Postgres. Attached is a patch
that modifies VACUUM behavior. Could you advise us on the best way to
proceed? How can we get this patch included in Postgres, or would you
recommend pursuing an alternative approach?

The following scenario leads to an assertion failure in debug builds,
or to wrong results in release builds.

1. (session 1) A tuple is being UPDATEd in an active transaction.
2. (session 2) Another session locks that tuple, for example with
SELECT ... FOR UPDATE.
3. heap_lock_tuple() calls HeapTupleSatisfiesUpdate(), which returns
TM_BeingModified. It records the xmax and t_ctid of the tuple.
4. Since the tuple is actively being updated, session 2 must wait for
session 1 to complete.
5. The UPDATE of session 1 aborts, leaving the original tuple with an
aborted xmax and a t_ctid pointing to a tuple with an aborted xmin.
6. VACUUM marks aborted tuples as unused, regardless of whether they
are still referenced by a visible tuple through t_ctid. As a result,
the aborted tuple is marked unused.
7. An INSERT happens, and the new tuple is placed at the recycled
location of the aborted tuple.
8. The newly INSERTed tuple is UPDATEd.
9. heap_lock_tuple() of session 2 resumes and next calls
heap_lock_updated_tuple() with the t_ctid recorded in step 3. Note
that the code does not check the status of the transaction it was
waiting on; it proceeds regardless of whether that transaction
committed or aborted (see the sketch below).
10. heap_lock_updated_tuple() now operates on the unrelated tuple
inserted in step 7. It sees that this tuple was updated, and thus
returns TM_Updated.
11. This eventually leads to the assertion failure:
(result == TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)
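
To make the failure mode concrete, the relevant flow in heap_lock_tuple()
looks roughly like this (a heavily simplified paraphrase of heapam.c;
most branches and all error handling are omitted):

	result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
	if (result == TM_BeingModified)
	{
		/* Step 3: remember the xmax and t_ctid of the tuple. */
		xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
		t_ctid = tuple->t_data->t_ctid;

		/* Step 4: release the buffer lock and wait for the updater. */
		XactLockTableWait(xwait, relation, &tuple->t_self, XLTW_Lock);

		/*
		 * Step 9: follow the remembered t_ctid without checking whether
		 * xwait committed or aborted.  By now VACUUM may have recycled
		 * the line pointer that t_ctid points to.
		 */
		res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
									  GetCurrentTransactionId(), mode);
	}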

These concrete steps reproduce the problem:

Version used: REL_18_0
Run with configuration:
io_combine_limit = '1'

session 1:
drop table if exists t;
create table t (id int);
insert into t (select generate_series(1, 452));
begin;
update t set id = 1000 where id = 1;

session 2:
gdb: b heapam.c:5033
gdb: continue
select * from t where id = 1 for update;

session 1:
abort;
select * from t;
VACUUM (TRUNCATE off);
insert into t values (453) returning ctid; -- Should be (2,1)
update t set id = 454 where id = 453 returning ctid;

session 2: (should now be at the breakpoint)
gdb: continue

We get the assertion:
(result == TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID).
Stacktrace included as attachment.

We have also seen the error "t_xmin %u is uncommitted in tuple (%u,%u)
to be updated in table \"%s\"" in our testing. We believe that this has
a similar cause, but we have not been able to come up with reproduction
steps for that error specifically.

We think that this problem can be fixed in two different ways.
1. Check the commit status after waiting in heap_lock_updated_tuple(),
and don't proceed to examine the updated tuple when the transaction
has aborted; or check, as heap_get_latest_tid() does, whether the xmin
of the updated tuple matches the xmax of the old tuple (see the sketch
below).
2. Change VACUUM to delay marking aborted tuples unused until no
visible tuple can have a t_ctid pointing to them.
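
For reference, the validation that option 1 alludes to looks roughly
like this in heap_get_latest_tid() (paraphrased; the real code also
handles missing tuples and loops over the whole chain):

	/*
	 * After hopping from one tuple to the one its t_ctid points to: if
	 * the new tuple's xmin does not match the previous tuple's xmax,
	 * the chain is broken (e.g. the slot was recycled and reused), so
	 * stop following it.
	 */
	if (TransactionIdIsValid(priorXmax) &&
		!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(tp.t_data)))
		break;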

For option 2, a patch is included with this e-mail. The patch alters
the visibility check during VACUUM: aborted tuples are treated
similarly to RECENTLY_DEAD tuples. They are not recycled immediately;
instead, the xmin is used to decide when they can be marked unused.

Regards,

Jasper Smit
Oleksii Kozlov
Luc Vlaming
Spiros Agathos
Servicenow

Attachments:

stacktrace.txt (text/plain)
delay_vacuum_aborted.patch (application/octet-stream)
From 1fbe8648abc70d97c6773b14c833f3fcf40411cc Mon Sep 17 00:00:00 2001
From: Jasper Smit <jasper.smit@servicenow.com>
Date: Thu, 9 Oct 2025 16:23:38 +0200
Subject: [PATCH] DEF0734066 Delay marking aborted tuples unused until no
 visible tuple has a ctid link to that tuple

---
 contrib/pgstattuple/pgstatapprox.c          |  1 +
 src/backend/access/heap/heapam.c            |  1 +
 src/backend/access/heap/heapam_handler.c    |  3 ++
 src/backend/access/heap/heapam_visibility.c | 39 +++++++++++++--------
 src/backend/access/heap/pruneheap.c         | 18 +++++++++-
 src/backend/access/heap/vacuumlazy.c        |  2 ++
 src/include/access/heapam.h                 |  3 +-
 7 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index a59ff4e9d4f..50092793b99 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -156,6 +156,7 @@ statapprox_heap(Relation rel, output_type *stat)
 					break;
 				case HEAPTUPLE_DEAD:
 				case HEAPTUPLE_RECENTLY_DEAD:
+				case HEAPTUPLE_RECENTLY_ABORTED:
 				case HEAPTUPLE_INSERT_IN_PROGRESS:
 					stat->dead_tuple_len += tuple.t_len;
 					stat->dead_tuple_count++;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..43e8d343de1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9264,6 +9264,7 @@ HeapCheckForSerializableConflictOut(bool visible, Relation relation,
 				return;
 			xid = HeapTupleHeaderGetXmin(tuple->t_data);
 			break;
+		case HEAPTUPLE_RECENTLY_ABORTED:
 		case HEAPTUPLE_RECENTLY_DEAD:
 		case HEAPTUPLE_DELETE_IN_PROGRESS:
 			if (visible)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cb4bc35c93e..29df8dbc372 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -846,6 +846,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 				isdead = true;
 				break;
 			case HEAPTUPLE_RECENTLY_DEAD:
+			case HEAPTUPLE_RECENTLY_ABORTED:
 				*tups_recently_dead += 1;
 				/* fall through */
 			case HEAPTUPLE_LIVE:
@@ -1079,6 +1080,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 
 			case HEAPTUPLE_DEAD:
 			case HEAPTUPLE_RECENTLY_DEAD:
+			case HEAPTUPLE_RECENTLY_ABORTED:
 				/* Count dead and recently-dead rows */
 				*deadrows += 1;
 				break;
@@ -1427,6 +1429,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 					/* Count it as live, too */
 					reltuples += 1;
 					break;
+				case HEAPTUPLE_RECENTLY_ABORTED:
 				case HEAPTUPLE_RECENTLY_DEAD:
 
 					/*
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..ac6f7dc14a8 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1176,7 +1176,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
 
 	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
 
-	if (res == HEAPTUPLE_RECENTLY_DEAD)
+	if (res == HEAPTUPLE_RECENTLY_DEAD || res == HEAPTUPLE_RECENTLY_ABORTED)
 	{
 		Assert(TransactionIdIsValid(dead_after));
 
@@ -1195,7 +1195,7 @@ HeapTupleSatisfiesVacuum(HeapTuple htup, TransactionId OldestXmin,
  * In contrast to HeapTupleSatisfiesVacuum this routine, when encountering a
  * tuple that could still be visible to some backend, stores the xid that
  * needs to be compared with the horizon in *dead_after, and returns
- * HEAPTUPLE_RECENTLY_DEAD. The caller then can perform the comparison with
+ * HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_RECENTLY_ABORTED. The caller then can perform the comparison with
  * the horizon.  This is e.g. useful when comparing with different horizons.
  *
  * Note: HEAPTUPLE_DEAD can still be returned here, e.g. if the inserting
@@ -1215,13 +1215,20 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
 	/*
 	 * Has inserting transaction committed?
 	 *
-	 * If the inserting transaction aborted, then the tuple was never visible
-	 * to any other transaction, so we can delete it immediately.
+	 * For aborted as well as committed transactions we must set *dead_after,
+	 * because another backend holding a tuple lock may still be traversing the
+	 * update chain.  If the tuple were removed immediately, an unrelated tuple could take its place.
 	 */
 	if (!HeapTupleHeaderXminCommitted(tuple))
 	{
-		if (HeapTupleHeaderXminInvalid(tuple))
+		TransactionId xmin = HeapTupleHeaderGetRawXmin(tuple);
+		if (HeapTupleHeaderXminInvalid(tuple)) {
+			if (TransactionIdIsNormal(xmin)) {
+				*dead_after = xmin;
+				return HEAPTUPLE_RECENTLY_ABORTED;
+			}
 			return HEAPTUPLE_DEAD;
+		}
 		/* Used by pre-9.0 binary upgrades */
 		else if (tuple->t_infomask & HEAP_MOVED_OFF)
 		{
@@ -1256,10 +1263,11 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
 			{
 				SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
 							InvalidTransactionId);
+				/* TODO: HEAPTUPLE_RECENTLY_ABORTED should also be handled here */
 				return HEAPTUPLE_DEAD;
 			}
 		}
-		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
+		else if (TransactionIdIsCurrentTransactionId(xmin))
 		{
 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
 				return HEAPTUPLE_INSERT_IN_PROGRESS;
@@ -1273,7 +1281,7 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
 			/* deleting subtransaction must have aborted */
 			return HEAPTUPLE_INSERT_IN_PROGRESS;
 		}
-		else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
+		else if (TransactionIdIsInProgress(xmin))
 		{
 			/*
 			 * It'd be possible to discern between INSERT/DELETE in progress
@@ -1285,17 +1293,20 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
 			 */
 			return HEAPTUPLE_INSERT_IN_PROGRESS;
 		}
-		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						HeapTupleHeaderGetRawXmin(tuple));
+		else if (TransactionIdDidCommit(xmin))
+			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, xmin);
 		else
 		{
 			/*
 			 * Not in Progress, Not Committed, so either Aborted or crashed
 			 */
-			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-						InvalidTransactionId);
-			return HEAPTUPLE_DEAD;
+			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID, xmin);
+			if (TransactionIdIsNormal(xmin)) {
+				*dead_after = xmin;
+				return HEAPTUPLE_RECENTLY_ABORTED;
+			}
+			else
+				return HEAPTUPLE_DEAD;
 		}
 
 		/*
@@ -1443,7 +1454,7 @@ HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
 
 	res = HeapTupleSatisfiesVacuumHorizon(htup, buffer, &dead_after);
 
-	if (res == HEAPTUPLE_RECENTLY_DEAD)
+	if (res == HEAPTUPLE_RECENTLY_DEAD || res == HEAPTUPLE_RECENTLY_ABORTED)
 	{
 		Assert(TransactionIdIsValid(dead_after));
 
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index a8025889be0..ecda8daf0e3 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -921,7 +921,7 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 
 	res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);
 
-	if (res != HEAPTUPLE_RECENTLY_DEAD)
+	if (res != HEAPTUPLE_RECENTLY_DEAD && res != HEAPTUPLE_RECENTLY_ABORTED)
 		return res;
 
 	/*
@@ -1108,6 +1108,9 @@ heap_prune_chain(Page page, BlockNumber blockno, OffsetNumber maxoff,
 				 */
 				break;
 
+			case HEAPTUPLE_RECENTLY_ABORTED:
+				break;
+
 			case HEAPTUPLE_DELETE_IN_PROGRESS:
 			case HEAPTUPLE_LIVE:
 			case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1415,6 +1418,19 @@ heap_prune_record_unchanged_lp_normal(Page page, PruneState *prstate, OffsetNumb
 			}
 			break;
 
+		case HEAPTUPLE_RECENTLY_ABORTED:
+			prstate->recently_dead_tuples++;
+			prstate->all_visible = false;
+
+			/*
+			 * This tuple will soon become DEAD.  Update the hint field so
+			 * that the page is reconsidered for pruning in future.
+			 */
+			heap_prune_record_prunable(prstate,
+									   HeapTupleHeaderGetRawXmin(htup));
+			break;
+
+
 		case HEAPTUPLE_RECENTLY_DEAD:
 			prstate->recently_dead_tuples++;
 			prstate->all_visible = false;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0fef8e49e2b..def04eb758c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2352,6 +2352,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				 */
 				missed_dead_tuples++;
 				break;
+			case HEAPTUPLE_RECENTLY_ABORTED:
 			case HEAPTUPLE_RECENTLY_DEAD:
 
 				/*
@@ -3694,6 +3695,7 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 				break;
 
 			case HEAPTUPLE_DEAD:
+			case HEAPTUPLE_RECENTLY_ABORTED:
 			case HEAPTUPLE_RECENTLY_DEAD:
 			case HEAPTUPLE_INSERT_IN_PROGRESS:
 			case HEAPTUPLE_DELETE_IN_PROGRESS:
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3a9424c19c9..139f1f01037 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -123,7 +123,8 @@ typedef enum
 {
 	HEAPTUPLE_DEAD,				/* tuple is dead and deletable */
 	HEAPTUPLE_LIVE,				/* tuple is live (committed, no deleter) */
-	HEAPTUPLE_RECENTLY_DEAD,	/* tuple is dead, but not deletable yet */
+	HEAPTUPLE_RECENTLY_DEAD,	/* tuple is dead, but not deletable yet; check xmax for when to delete */
+	HEAPTUPLE_RECENTLY_ABORTED,	/* tuple is aborted, but not deletable yet; check xmin for when to delete */
 	HEAPTUPLE_INSERT_IN_PROGRESS,	/* inserting xact is still in progress */
 	HEAPTUPLE_DELETE_IN_PROGRESS,	/* deleting xact is still in progress */
 } HTSV_Result;
-- 
2.39.5

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jasper Smit (#1)
2 attachment(s)
Re: Visibility bug in tuple lock

On 13/10/2025 10:48, Jasper Smit wrote:

We found a bug in heap_lock_tuple(). It is unsafe to follow the t_ctid
pointer of updated tuples in the presence of aborted tuples: VACUUM can
mark aborted tuples LP_UNUSED even if there are still visible tuples
whose t_ctid points to them.

We would like to address this issue in Postgres. Attached is a patch
that modifies VACUUM behavior. Could you advise us on the best way to
proceed? How can we get this patch included in Postgres, or would you
recommend pursuing an alternative approach?

You're at the right place :-). Thanks for the script and the patch!

The following scenario leads to an assertion failure in debug builds,
or to wrong results in release builds.

1. (session 1) A tuple is being UPDATEd in an active transaction.
2. (session 2) Another session locks that tuple, for example with
SELECT ... FOR UPDATE.
3. heap_lock_tuple() calls HeapTupleSatisfiesUpdate(), which returns
TM_BeingModified. It records the xmax and t_ctid of the tuple.
4. Since the tuple is actively being updated, session 2 must wait for
session 1 to complete.
5. The UPDATE of session 1 aborts, leaving the original tuple with an
aborted xmax and a t_ctid pointing to a tuple with an aborted xmin.
6. VACUUM marks aborted tuples as unused, regardless of whether they
are still referenced by a visible tuple through t_ctid. As a result,
the aborted tuple is marked unused.
7. An INSERT happens, and the new tuple is placed at the recycled
location of the aborted tuple.
8. The newly INSERTed tuple is UPDATEd.
9. heap_lock_tuple() of session 2 resumes and next calls
heap_lock_updated_tuple() with the t_ctid recorded in step 3. Note
that the code does not check the status of the transaction it was
waiting on; it proceeds regardless of whether that transaction
committed or aborted.
10. heap_lock_updated_tuple() now operates on the unrelated tuple
inserted in step 7. It sees that this tuple was updated, and thus
returns TM_Updated.
11. This eventually leads to the assertion failure:
(result == TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)

These concrete steps reproduce the problem:

Version used: REL_18_0
Run with configuration:
io_combine_limit = '1'

session 1:
drop table if exists t;
create table t (id int);
insert into t (select generate_series(1, 452));
begin;
update t set id = 1000 where id = 1;

session 2:
gdb: b heapam.c:5033
gdb: continue
select * from t where id = 1 for update;

session 1:
abort;
select * from t;
VACUUM (TRUNCATE off);
insert into t values (453) returning ctid; -- Should be (2,1)
update t set id = 454 where id = 453 returning ctid;

session 2: (should now be at the breakpoint)
gdb: continue

We get the assertion:
(result == TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID).
Stacktrace included as attachment.

I can reproduce this.

We have also seen the error "t_xmin %u is uncommitted in tuple (%u,%u)
to be updated in table \"%s\"" in our testing. We believe that this has
a similar cause, but we have not been able to come up with reproduction
steps for that error specifically.

Yeah, sounds quite plausible that you could get that too.

We think that this problem can be fixed in two different ways.
1. Check the commit status after waiting in heap_lock_updated_tuple(),
and don't proceed to examine the updated tuple when the transaction
has aborted;

Hmm, I think heap_lock_updated_tuple() would still lock incorrect tuples
in that case, which could lead to unrelated transactions failing, or at
least waiting unnecessarily.

or check, as heap_get_latest_tid() does, whether the xmin of the
updated tuple matches the xmax of the old tuple.

+1 for that approach. heap_lock_updated_tuple() actually does that check
too, but not for the first tuple.
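
(For reference, that existing check in heap_lock_updated_tuple_rec()
looks roughly like this, paraphrased:)

	/*
	 * Check the tuple's xmin against the prior xmax.  If they don't
	 * match, the update chain is broken - the slot was recycled and
	 * reused - and there is nothing left to lock.
	 */
	if (TransactionIdIsValid(priorXmax) &&
		!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data),
							 priorXmax))
	{
		result = TM_Ok;
		goto out_locked;
	}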

2. Change VACUUM to delay marking aborted tuples unused until no
visible tuple can have a t_ctid pointing to them.

I guess that works too, but the downside is that we can't vacuum aborted
tuples as fast.

Attached is a fix that performs the prior-xmax check in
heap_lock_updated_tuple() for the first tuple as well. The first commit
adds your repro script as a regression test using an injection point; it
hits the assertion failure, or returns an incorrect result if assertions
are disabled. The second commit fixes the bug and makes the test pass.

The comment on heap_lock_updated_tuple() says:

* The initial tuple is assumed to be already locked.

But AFAICS that's not true. The caller holds the heavy-weight tuple
lock, to establish priority if there are multiple concurrent lockers,
but it doesn't stamp the initial tuple's xmax until after the
heap_lock_updated_tuple() call. I guess we just need to update that
comment, but I would love to get another pair of eyes on this; this
code is hairy.

- Heikki

Attachments:

v1-0001-Add-repro-for-heap-locking-visibility-bug-as-a-ta.patch (text/x-patch)
From a7d7532a672129abf56f0819d431f8b6e6b3613f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 11 Dec 2025 18:20:37 +0200
Subject: [PATCH v1 1/2] Add repro for heap locking visibility bug as a tap
 test

Discussion: https://www.postgresql.org/message-id/CAOG+RQ74x0q=kgBBQ=mezuvOeZBfSxM1qu_o0V28bwDz3dHxLw@mail.gmail.com
---
 src/backend/access/heap/heapam.c              |  2 +
 .../expected/heap_lock_update.out             | 33 +++++++++
 src/test/modules/injection_points/meson.build |  1 +
 .../specs/heap_lock_update.spec               | 72 +++++++++++++++++++
 4 files changed, 108 insertions(+)
 create mode 100644 src/test/modules/injection_points/expected/heap_lock_update.out
 create mode 100644 src/test/modules/injection_points/specs/heap_lock_update.spec

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 225f9829f22..a69df8bd431 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6069,6 +6069,8 @@ static TM_Result
 heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
 						TransactionId xid, LockTupleMode mode)
 {
+	INJECTION_POINT("heap_lock_updated_tuple", NULL);
+
 	/*
 	 * If the tuple has not been updated, or has moved into another partition
 	 * (effectively a delete) stop here.
diff --git a/src/test/modules/injection_points/expected/heap_lock_update.out b/src/test/modules/injection_points/expected/heap_lock_update.out
new file mode 100644
index 00000000000..b6308dcd316
--- /dev/null
+++ b/src/test/modules/injection_points/expected/heap_lock_update.out
@@ -0,0 +1,33 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert wake
+step s1begin: begin;
+step s1update: update t set id = 1000 where id = 1;
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: abort;
+step vacuum: VACUUM (TRUNCATE off);
+step reinsert: 
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+
+ctid 
+-----
+(2,1)
+(1 row)
+
+ctid 
+-----
+(2,2)
+(1 row)
+
+step wake: 
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8d6f662040d..e92e7348a80 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -51,6 +51,7 @@ tests += {
       'reindex-concurrently-upsert',
       'reindex-concurrently-upsert-on-constraint',
       'reindex-concurrently-upsert-partitioned',
+      'heap_lock_update',
     ],
     'runningcheck': false, # see syscache-update-pruned
     # Some tests wait for all snapshots, so avoid parallel execution
diff --git a/src/test/modules/injection_points/specs/heap_lock_update.spec b/src/test/modules/injection_points/specs/heap_lock_update.spec
new file mode 100644
index 00000000000..f85b11532e9
--- /dev/null
+++ b/src/test/modules/injection_points/specs/heap_lock_update.spec
@@ -0,0 +1,72 @@
+# XXX
+# Test race conditions involving:
+# - s1: heap_update($FROM_SYSCACHE), without a snapshot or pin
+# - s2: ALTER TABLE making $FROM_SYSCACHE a dead tuple
+# - s3: "VACUUM pg_class" making $FROM_SYSCACHE become LP_UNUSED
+
+# This is a derivative work of inplace.spec, which exercises the corresponding
+# race condition for inplace updates.
+
+# Despite local injection points, this is incompatible with runningcheck.
+# First, removable_cutoff() could move backward, per its header comment.
+# Second, other activity could trigger sinval queue overflow, negating our
+# efforts to delay inval.  Third, this deadlock emerges:
+#
+# - step at2 waits at an injection point, with interrupts held
+# - an unrelated backend waits for at2 to do PROCSIGNAL_BARRIER_SMGRRELEASE
+# - step waitprunable4 waits for the unrelated backend to release its xmin
+
+# The alternative expected output is for -DCATCACHE_FORCE_RELEASE, a setting
+# that thwarts testing the race conditions this spec seeks.
+
+
+# XXX Need s2 to make a non-HOT update.  Otherwise, "VACUUM pg_class" would leave
+# an LP_REDIRECT that persists.  To get non-HOT, make rels so the pg_class row
+# for vactest.orig50 is on a filled page (assuming BLCKSZ=8192).  Just to save
+# on filesystem syscalls, use relkind=c for every other rel.
+setup
+{
+    CREATE EXTENSION injection_points;
+
+    create table t (id int);
+    insert into t (select generate_series(1, 452));
+}
+teardown
+{
+    drop table t;
+    DROP EXTENSION injection_points;
+}
+
+# Wait during GRANT.  Disable debug_discard_caches, since we're here to
+# exercise an outcome that happens under permissible cache staleness.
+session s1
+step s1begin	{ begin; }
+step s1update   { update t set id = 1000 where id = 1; }
+step s1abort	{ abort; }
+step vacuum	{ VACUUM (TRUNCATE off); }
+step reinsert   {
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+}
+
+step wake	{
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+}
+
+session s2
+setup	{
+	SET debug_discard_caches = 0;
+	SELECT FROM injection_points_set_local();
+	SELECT FROM injection_points_attach('heap_lock_updated_tuple', 'wait');
+}
+step s2lock	{ select * from t where id = 1 for update; }
+
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert
+	wake(s2lock)
-- 
2.47.3

v1-0002-Fix-the-bug-with-priorXmax.patch (text/x-patch)
From 138b11e1a787f0d2b3fc913eec3c8d28527ee76c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 11 Dec 2025 18:03:10 +0200
Subject: [PATCH v1 2/2] Fix the bug with priorXmax

---
 src/backend/access/heap/heapam.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a69df8bd431..5fb81af8d4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -86,6 +86,7 @@ static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  TransactionId *result_xmax, uint16 *result_infomask,
 									  uint16 *result_infomask2);
 static TM_Result heap_lock_updated_tuple(Relation rel, HeapTuple tuple,
+										 TransactionId priorXmax,
 										 const ItemPointerData *ctid, TransactionId xid,
 										 LockTupleMode mode);
 static void GetMultiXactIdHintBits(MultiXactId multi, uint16 *new_infomask,
@@ -4818,9 +4819,16 @@ l3:
 				 */
 				if (follow_updates && updated)
 				{
+					TransactionId updateXid;
 					TM_Result	res;
 
-					res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+					if (infomask & HEAP_XMAX_IS_MULTI)
+						updateXid = MultiXactIdGetUpdateXid(xwait, infomask);
+					else
+						updateXid = xwait;
+
+					res = heap_lock_updated_tuple(relation, tuple, updateXid,
+												  &t_ctid,
 												  GetCurrentTransactionId(),
 												  mode);
 					if (res != TM_Ok)
@@ -5065,9 +5073,16 @@ l3:
 			/* if there are updates, follow the update chain */
 			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
 			{
+				TransactionId updateXid;
 				TM_Result	res;
 
-				res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+				if (infomask & HEAP_XMAX_IS_MULTI)
+					updateXid = MultiXactIdGetUpdateXid(xwait, infomask);
+				else
+					updateXid = xwait;
+
+				res = heap_lock_updated_tuple(relation, tuple, updateXid,
+											  &t_ctid,
 											  GetCurrentTransactionId(),
 											  mode);
 				if (res != TM_Ok)
@@ -5721,7 +5736,8 @@ test_lockmode_for_conflict(MultiXactStatus status, TransactionId xid,
  * version as well.
  */
 static TM_Result
-heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, TransactionId xid,
+heap_lock_updated_tuple_rec(Relation rel, TransactionId priorXmax,
+							const ItemPointerData *tid, TransactionId xid,
 							LockTupleMode mode)
 {
 	TM_Result	result;
@@ -5734,7 +5750,6 @@ heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, Transactio
 				old_infomask2;
 	TransactionId xmax,
 				new_xmax;
-	TransactionId priorXmax = InvalidTransactionId;
 	bool		cleared_all_frozen = false;
 	bool		pinned_desired_page;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -6066,7 +6081,8 @@ out_unlocked:
  * levels, because that would lead to a serializability failure.
  */
 static TM_Result
-heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
+heap_lock_updated_tuple(Relation rel, HeapTuple tuple, TransactionId priorXmax,
+						const ItemPointerData *ctid,
 						TransactionId xid, LockTupleMode mode)
 {
 	INJECTION_POINT("heap_lock_updated_tuple", NULL);
@@ -6089,7 +6105,7 @@ heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ct
 		 */
 		MultiXactIdSetOldestMember();
 
-		return heap_lock_updated_tuple_rec(rel, ctid, xid, mode);
+		return heap_lock_updated_tuple_rec(rel, priorXmax, ctid, xid, mode);
 	}
 
 	/* nothing to lock */
-- 
2.47.3

#3 Jasper Smit
jbsmit@gmail.com
In reply to: Heikki Linnakangas (#2)
2 attachment(s)
Re: Visibility bug in tuple lock

Thanks for looking into this and for creating the patch.

+1 for that approach. heap_lock_updated_tuple() actually does that check
too, but not for the first tuple.

I think this will for sure fix the problem; however, we are still
accessing completely unrelated tuples. It feels unsafe to access
tuples that might have been vacuumed or pruned away. Is it possible,
for example, that the page we are accessing has been truncated away in
the meantime?

I guess that works too, but the downside is that we can't vacuum aborted
tuples as fast.

I don't know why we need aborted tuples to be vacuumed so much faster
than dead tuples caused by inserts and updates; I would think that
generally you would have many more of those. In our code base we have
chosen this approach, since the fix also safeguards us against other
potentially problematic cases where the ctid pointer is followed.

* The initial tuple is assumed to be already locked.

But AFAICS that's not true. The caller holds the heavy-weight tuple

This does not look true to me either.

I simplified your patch a little by extracting the common code into the
heap_lock_updated_tuple() function. I also added the new test to the
Makefile.

Jasper Smit

Attachments:

v2-0002-Fix-the-bug-with-priorXmax.patch (application/octet-stream)
From 544ba4fbd94619f34d1a2746e9cff6ec7e24236b Mon Sep 17 00:00:00 2001
From: Jasper Smit <jasper.smit@servicenow.com>
Date: Fri, 12 Dec 2025 13:40:36 +0100
Subject: [PATCH 2/2] Fix the bug with priorXmax

---
 src/backend/access/heap/heapam.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a69df8bd431..41e963a21a8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5721,7 +5721,8 @@ test_lockmode_for_conflict(MultiXactStatus status, TransactionId xid,
  * version as well.
  */
 static TM_Result
-heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, TransactionId xid,
+heap_lock_updated_tuple_rec(Relation rel, TransactionId priorXmax,
+							const ItemPointerData *tid, TransactionId xid,
 							LockTupleMode mode)
 {
 	TM_Result	result;
@@ -5734,7 +5735,6 @@ heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, Transactio
 				old_infomask2;
 	TransactionId xmax,
 				new_xmax;
-	TransactionId priorXmax = InvalidTransactionId;
 	bool		cleared_all_frozen = false;
 	bool		pinned_desired_page;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -6078,6 +6078,7 @@ heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ct
 	if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
 		!ItemPointerEquals(&tuple->t_self, ctid))
 	{
+		TransactionId priorXmax;
 		/*
 		 * If this is the first possibly-multixact-able operation in the
 		 * current transaction, set my per-backend OldestMemberMXactId
@@ -6089,7 +6090,8 @@ heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ct
 		 */
 		MultiXactIdSetOldestMember();
 
-		return heap_lock_updated_tuple_rec(rel, ctid, xid, mode);
+		priorXmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);
+		return heap_lock_updated_tuple_rec(rel, priorXmax, ctid, xid, mode);
 	}
 
 	/* nothing to lock */
-- 
2.39.5

v2-0001-Add-repro-for-heap-locking-visibility-bug-as-a-tap.patch (application/octet-stream)
From 8c61b892587c417f092869b87268beb522e94305 Mon Sep 17 00:00:00 2001
From: Jasper Smit <jasper.smit@servicenow.com>
Date: Fri, 12 Dec 2025 13:29:45 +0100
Subject: [PATCH 1/2] Add repro for heap locking visibility bug as a tap

---
 src/backend/access/heap/heapam.c              |  2 +
 src/test/modules/injection_points/Makefile    |  3 +-
 .../expected/heap_lock_update.out             | 33 +++++++++
 src/test/modules/injection_points/meson.build |  1 +
 .../specs/heap_lock_update.spec               | 72 +++++++++++++++++++
 5 files changed, 110 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/injection_points/expected/heap_lock_update.out
 create mode 100644 src/test/modules/injection_points/specs/heap_lock_update.spec

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 225f9829f22..a69df8bd431 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6069,6 +6069,8 @@ static TM_Result
 heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
 						TransactionId xid, LockTupleMode mode)
 {
+	INJECTION_POINT("heap_lock_updated_tuple", NULL);
+
 	/*
 	 * If the tuple has not been updated, or has moved into another partition
 	 * (effectively a delete) stop here.
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index c85034eb8cc..a9851b4a097 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -19,7 +19,8 @@ ISOLATION = basic \
 	    index-concurrently-upsert-predicate \
 	    reindex-concurrently-upsert \
 	    reindex-concurrently-upsert-on-constraint \
-	    reindex-concurrently-upsert-partitioned
+	    reindex-concurrently-upsert-partitioned \
+	    heap_lock_update
 
 # The injection points are cluster-wide, so disable installcheck
 NO_INSTALLCHECK = 1
diff --git a/src/test/modules/injection_points/expected/heap_lock_update.out b/src/test/modules/injection_points/expected/heap_lock_update.out
new file mode 100644
index 00000000000..b6308dcd316
--- /dev/null
+++ b/src/test/modules/injection_points/expected/heap_lock_update.out
@@ -0,0 +1,33 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert wake
+step s1begin: begin;
+step s1update: update t set id = 1000 where id = 1;
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: abort;
+step vacuum: VACUUM (TRUNCATE off);
+step reinsert: 
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+
+ctid 
+-----
+(2,1)
+(1 row)
+
+ctid 
+-----
+(2,2)
+(1 row)
+
+step wake: 
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8d6f662040d..e92e7348a80 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -51,6 +51,7 @@ tests += {
       'reindex-concurrently-upsert',
       'reindex-concurrently-upsert-on-constraint',
       'reindex-concurrently-upsert-partitioned',
+      'heap_lock_update',
     ],
     'runningcheck': false, # see syscache-update-pruned
     # Some tests wait for all snapshots, so avoid parallel execution
diff --git a/src/test/modules/injection_points/specs/heap_lock_update.spec b/src/test/modules/injection_points/specs/heap_lock_update.spec
new file mode 100644
index 00000000000..f85b11532e9
--- /dev/null
+++ b/src/test/modules/injection_points/specs/heap_lock_update.spec
@@ -0,0 +1,72 @@
+# XXX
+# Test race conditions involving:
+# - s1: heap_update($FROM_SYSCACHE), without a snapshot or pin
+# - s2: ALTER TABLE making $FROM_SYSCACHE a dead tuple
+# - s3: "VACUUM pg_class" making $FROM_SYSCACHE become LP_UNUSED
+
+# This is a derivative work of inplace.spec, which exercises the corresponding
+# race condition for inplace updates.
+
+# Despite local injection points, this is incompatible with runningcheck.
+# First, removable_cutoff() could move backward, per its header comment.
+# Second, other activity could trigger sinval queue overflow, negating our
+# efforts to delay inval.  Third, this deadlock emerges:
+#
+# - step at2 waits at an injection point, with interrupts held
+# - an unrelated backend waits for at2 to do PROCSIGNAL_BARRIER_SMGRRELEASE
+# - step waitprunable4 waits for the unrelated backend to release its xmin
+
+# The alternative expected output is for -DCATCACHE_FORCE_RELEASE, a setting
+# that thwarts testing the race conditions this spec seeks.
+
+
+# XXX Need s2 to make a non-HOT update.  Otherwise, "VACUUM pg_class" would leave
+# an LP_REDIRECT that persists.  To get non-HOT, make rels so the pg_class row
+# for vactest.orig50 is on a filled page (assuming BLCKSZ=8192).  Just to save
+# on filesystem syscalls, use relkind=c for every other rel.
+setup
+{
+    CREATE EXTENSION injection_points;
+
+    create table t (id int);
+    insert into t (select generate_series(1, 452));
+}
+teardown
+{
+    drop table t;
+    DROP EXTENSION injection_points;
+}
+
+# Wait during GRANT.  Disable debug_discard_caches, since we're here to
+# exercise an outcome that happens under permissible cache staleness.
+session s1
+step s1begin	{ begin; }
+step s1update   { update t set id = 1000 where id = 1; }
+step s1abort	{ abort; }
+step vacuum	{ VACUUM (TRUNCATE off); }
+step reinsert   {
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+}
+
+step wake	{
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+}
+
+session s2
+setup	{
+	SET debug_discard_caches = 0;
+	SELECT FROM injection_points_set_local();
+	SELECT FROM injection_points_attach('heap_lock_updated_tuple', 'wait');
+}
+step s2lock	{ select * from t where id = 1 for update; }
+
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert
+	wake(s2lock)
-- 
2.39.5

#4 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jasper Smit (#3)
1 attachment(s)
Re: Visibility bug in tuple lock

On 12/12/2025 15:19, Jasper Smit wrote:

Thanks for looking into this and for creating the patch.

+1 for that approach. heap_lock_updated_tuple() actually does that check
too, but not for the first tuple.

I think this will for sure fix the problem; however, we are still
accessing completely unrelated tuples. It feels unsafe to access
tuples that might have been vacuumed or pruned away. Is it possible,
for example, that the page we are accessing has been truncated away in
the meantime?

It should be OK; we do it in other places too. For example,
heapam_tuple_lock() follows the ctid after the call to
heap_lock_tuple() when called with TUPLE_LOCK_FLAG_FIND_LAST_VERSION.

heap_lock_updated_tuple_rec() handles the case that the tuple is
missing altogether, and the xmin == priorXmax check ensures that it's
the correct tuple.

Vacuum acquires an AccessExclusiveLock when truncating the relation,
which ensures that no backend can be in the middle of following the
update chain. That's not great - it would be nice to not need the
AccessExclusiveLock - but we do rely on it in many places currently.

I guess that works too, but the downside is that we can't vacuum aborted
tuples as fast.

I don't know why we need aborted tuples to be vacuumed so much faster
than dead tuples caused by inserts and updates; I would think that
generally you would have many more of those. In our code base we have
chosen this approach, since the fix also safeguards us against other
potentially problematic cases where the ctid pointer is followed.

Yeah, for most workloads it's probably not important that aborted
tuples are vacuumed quickly, but I could imagine workloads with lots
of rollbacks and long-running transactions where it would matter.
Making a behavioral change like that in back branches is a bad idea
when we have a much more localized fix available.

* The initial tuple is assumed to be already locked.

But AFAICS that's not true. The caller holds the heavy-weight tuple

This does not look true to me either.

I simplified your patch a little by extracting the common code into the
heap_lock_updated_tuple() function.

Thanks! That's not quite correct, though. This is very subtle, but the
'tuple' argument to heap_lock_updated_tuple() points into the buffer
holding the original tuple. So HeapTupleHeaderGetUpdateXid(tuple->t_data)
reads the tuple's current xmax, which by now can be different from what
it was earlier, i.e. different from the xwait that we waited for. It's
important that the 'ctid' and 'priorXmax' used to follow the chain are
read together.
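
To make the difference concrete (paraphrasing the two patch versions):

	/*
	 * v2 (subtly wrong): re-reads xmax through 'tuple->t_data', i.e.
	 * from the shared buffer, which may have changed since we waited:
	 */
	priorXmax = HeapTupleHeaderGetUpdateXid(tuple->t_data);

	/*
	 * Correct: derive the updater XID from the xmax ('xwait') and the
	 * infomask that were read together with t_ctid, before waiting:
	 */
	if (infomask & HEAP_XMAX_IS_MULTI)
		updateXid = MultiXactIdGetUpdateXid(xwait, infomask);
	else
		updateXid = xwait;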

I expanded the test to demonstrate what can go wrong with that. If the
original tuple was locked or updated concurrently, we would incorrectly
try to follow the update chain again and get the same assertion failure.

Hmm, now that I look at it, this existing check looks a little
suspicious too:

	/*
	 * If the tuple has not been updated, or has moved into another partition
	 * (effectively a delete) stop here.
	 */
	if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
		!ItemPointerEquals(&tuple->t_self, ctid))
	{

It's possible that tuple->t_data->t_ctid is modified concurrently by
another backend while we're doing the
HeapTupleHeaderIndicatesMovedPartitions() check. I suppose it's
harmless: if the tuple is updated concurrently, then depending on the
timing we'll either exit here quickly with TM_Ok, or we'll proceed to
follow the ctid, fail the priorXmax check, and return TM_Ok from there.

It seems sketchy to read t_ctid without holding a lock on the buffer,
however; the read is not guaranteed to be atomic.

Attached is a new patch version:

* I kept the code to get the updating xid from a multixid in
heap_lock_updated_tuple() like you did, but it now correctly uses the
original xmax of the tuple, instead of reading it from the buffer.

* Modified that HeapTupleHeaderIndicatesMovedPartitions() call to use
the passed-in ctid instead of reading it from the buffer.

* I removed the 'tuple' pointer argument from heap_lock_updated_tuple()
altogether. Seems safer that way: the function really shouldn't be
accessing tuple->t_data, because we're not holding a buffer lock. The
"ItemPointerEquals(&tuple->t_self, ctid)" check was safe, because
tuple->t_self is just local memory, but I moved it to the callers so
that I could remove the 'tuple' argument.

* Fixed the comment to no longer claim that the initial tuple has
already been locked.

- Heikki

Attachments:

v3-0001-Fix-bug-in-following-update-chain-when-locking-a-.patch (text/x-patch)
From e522298608362461c8348ed883633668ad9ff187 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 15 Dec 2025 18:44:23 +0200
Subject: [PATCH v3 1/1] Fix bug in following update chain when locking a heap
 tuple

Author: Jasper Smit <jasper.smit@servicenow.com>
Discussion: https://www.postgresql.org/message-id/CAOG%2BRQ74x0q%3DkgBBQ%3DmezuvOeZBfSxM1qu_o0V28bwDz3dHxLw%40mail.gmail.com
---
 src/backend/access/heap/heapam.c              | 47 ++++++++----
 src/test/modules/injection_points/Makefile    |  3 +-
 .../expected/heap_lock_update.out             | 73 +++++++++++++++++++
 src/test/modules/injection_points/meson.build |  1 +
 .../specs/heap_lock_update.spec               | 71 ++++++++++++++++++
 5 files changed, 179 insertions(+), 16 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/heap_lock_update.out
 create mode 100644 src/test/modules/injection_points/specs/heap_lock_update.spec

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 225f9829f22..53b217ea50b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -85,8 +85,11 @@ static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  LockTupleMode mode, bool is_update,
 									  TransactionId *result_xmax, uint16 *result_infomask,
 									  uint16 *result_infomask2);
-static TM_Result heap_lock_updated_tuple(Relation rel, HeapTuple tuple,
-										 const ItemPointerData *ctid, TransactionId xid,
+static TM_Result heap_lock_updated_tuple(Relation rel,
+										 uint16 prior_infomask,
+										 TransactionId prior_rawxmax,
+										 const ItemPointerData *prior_ctid,
+										 TransactionId xid,
 										 LockTupleMode mode);
 static void GetMultiXactIdHintBits(MultiXactId multi, uint16 *new_infomask,
 								   uint16 *new_infomask2);
@@ -4816,11 +4819,12 @@ l3:
 				 * If there are updates, follow the update chain; bail out if
 				 * that cannot be done.
 				 */
-				if (follow_updates && updated)
+				if (follow_updates && updated && !ItemPointerEquals(&tuple->t_self, &t_ctid))
 				{
 					TM_Result	res;
 
-					res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+					res = heap_lock_updated_tuple(relation,
+												  infomask, xwait, &t_ctid,
 												  GetCurrentTransactionId(),
 												  mode);
 					if (res != TM_Ok)
@@ -5063,11 +5067,13 @@ l3:
 			}
 
 			/* if there are updates, follow the update chain */
-			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
+			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
+				!ItemPointerEquals(&tuple->t_self, &t_ctid))
 			{
 				TM_Result	res;
 
-				res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+				res = heap_lock_updated_tuple(relation,
+											  infomask, xwait, &t_ctid,
 											  GetCurrentTransactionId(),
 											  mode);
 				if (res != TM_Ok)
@@ -5721,7 +5727,8 @@ test_lockmode_for_conflict(MultiXactStatus status, TransactionId xid,
  * version as well.
  */
 static TM_Result
-heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, TransactionId xid,
+heap_lock_updated_tuple_rec(Relation rel, TransactionId priorXmax,
+							const ItemPointerData *tid, TransactionId xid,
 							LockTupleMode mode)
 {
 	TM_Result	result;
@@ -5734,7 +5741,6 @@ heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, Transactio
 				old_infomask2;
 	TransactionId xmax,
 				new_xmax;
-	TransactionId priorXmax = InvalidTransactionId;
 	bool		cleared_all_frozen = false;
 	bool		pinned_desired_page;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -6048,7 +6054,10 @@ out_unlocked:
  *		Follow update chain when locking an updated tuple, acquiring locks (row
  *		marks) on the updated versions.
  *
- * The initial tuple is assumed to be already locked.
+ * 'prior_infomask', 'prior_raw_xmax' and 'prior_ctid' are the corresponding
+ * fields from the initial tuple.  We will lock the tuple versions starting
+ * from the one that 'prior_ctid' points to.  Locking the initial tuple where
+ * those values came from is the caller's responsibility.
  *
  * This function doesn't check visibility, it just unconditionally marks the
  * tuple(s) as locked.  If any tuple in the updated chain is being deleted
@@ -6066,16 +6075,22 @@ out_unlocked:
  * levels, because that would lead to a serializability failure.
  */
 static TM_Result
-heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
+heap_lock_updated_tuple(Relation rel,
+						uint16 prior_infomask,
+						TransactionId prior_raw_xmax,
+						const ItemPointerData *prior_ctid,
 						TransactionId xid, LockTupleMode mode)
 {
+	INJECTION_POINT("heap_lock_updated_tuple", NULL);
+
 	/*
-	 * If the tuple has not been updated, or has moved into another partition
-	 * (effectively a delete) stop here.
+	 * If the tuple has moved into another partition (effectively a delete)
+	 * stop here.
 	 */
-	if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
-		!ItemPointerEquals(&tuple->t_self, ctid))
+	if (!ItemPointerIndicatesMovedPartitions(prior_ctid))
 	{
+		TransactionId prior_xmax;
+
 		/*
 		 * If this is the first possibly-multixact-able operation in the
 		 * current transaction, set my per-backend OldestMemberMXactId
@@ -6087,7 +6102,9 @@ heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ct
 		 */
 		MultiXactIdSetOldestMember();
 
-		return heap_lock_updated_tuple_rec(rel, ctid, xid, mode);
+		prior_xmax = (prior_infomask & HEAP_XMAX_IS_MULTI) ?
+			MultiXactIdGetUpdateXid(prior_raw_xmax, prior_infomask) : prior_raw_xmax;
+		return heap_lock_updated_tuple_rec(rel, prior_xmax, prior_ctid, xid, mode);
 	}
 
 	/* nothing to lock */
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index bfdb3f53377..cc9be6dcdd2 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic \
 	    inplace \
-	    syscache-update-pruned
+	    syscache-update-pruned \
+	    heap_lock_update
 
 # Temporarily disabled because of flakiness
 #ISOLATION =+
diff --git a/src/test/modules/injection_points/expected/heap_lock_update.out b/src/test/modules/injection_points/expected/heap_lock_update.out
new file mode 100644
index 00000000000..9828e2b151b
--- /dev/null
+++ b/src/test/modules/injection_points/expected/heap_lock_update.out
@@ -0,0 +1,73 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert wake
+step s1begin: begin;
+step s1update: update t set id = 1000 where id = 1;
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: abort;
+step vacuum: VACUUM (TRUNCATE off);
+step reinsert: 
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+
+ctid 
+-----
+(2,1)
+(1 row)
+
+ctid 
+-----
+(2,2)
+(1 row)
+
+step wake: 
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert_and_lock wake
+step s1begin: begin;
+step s1update: update t set id = 1000 where id = 1;
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: abort;
+step vacuum: VACUUM (TRUNCATE off);
+step reinsert_and_lock: 
+        begin;
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	select ctid, * from t where id = 1 for update;
+	commit;
+	update t set id = 454 where id = 453 returning ctid;
+
+ctid 
+-----
+(2,1)
+(1 row)
+
+ctid |id
+-----+--
+(0,1)| 1
+(1 row)
+
+ctid 
+-----
+(2,2)
+(1 row)
+
+step wake: 
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 493e11053dc..c9186ae04d1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,6 +46,7 @@ tests += {
       'basic',
       'inplace',
       'syscache-update-pruned',
+      'heap_lock_update',
       # temporarily disabled because of flakiness
       # 'index-concurrently-upsert',
       # 'index-concurrently-upsert-predicate',
diff --git a/src/test/modules/injection_points/specs/heap_lock_update.spec b/src/test/modules/injection_points/specs/heap_lock_update.spec
new file mode 100644
index 00000000000..93887312815
--- /dev/null
+++ b/src/test/modules/injection_points/specs/heap_lock_update.spec
@@ -0,0 +1,71 @@
+# Test race condition in tuple locking
+#
+# This is a reproducer for the bug reported at:
+# https://www.postgresql.org/message-id/CAOG%2BRQ74x0q%3DkgBBQ%3DmezuvOeZBfSxM1qu_o0V28bwDz3dHxLw%40mail.gmail.com
+
+setup
+{
+    CREATE EXTENSION injection_points;
+
+    create table t (id int);
+    insert into t (select generate_series(1, 452));
+}
+teardown
+{
+    drop table t;
+    DROP EXTENSION injection_points;
+}
+
+session s1
+step s1begin	{ begin; }
+step s1update   { update t set id = 1000 where id = 1; }
+step s1abort	{ abort; }
+step vacuum	{ VACUUM (TRUNCATE off); }
+step reinsert   {
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+}
+
+# Same as 'reinsert', but we also stamp the original tuple with the
+# same 'xmax' as the re-inserted one.
+step reinsert_and_lock {
+        begin;
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	select ctid, * from t where id = 1 for update;
+	commit;
+	update t set id = 454 where id = 453 returning ctid;
+}
+
+step wake	{
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+}
+
+session s2
+setup	{
+	SELECT FROM injection_points_set_local();
+	SELECT FROM injection_points_attach('heap_lock_updated_tuple', 'wait');
+}
+step s2lock	{ select * from t where id = 1 for update; }
+
+# Basic case
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert
+	wake(s2lock)
+
+# Variant where the re-inserted tuple is also locked by the inserting
+# transaction. This failed an earlier version of the fix during
+# development.
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert_and_lock
+	wake(s2lock)
-- 
2.47.3

#5Jasper Smit
jbsmit@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: Visibility bug in tuple lock

> Thanks! That's not quite correct though. This is very subtle, but the
> 'tuple' argument to heap_lock_updated_tuple() points to the buffer
> holding the original tuple. So doing
> HeapTupleHeaderGetUpdateXid(tuple->t_data) reads the current xmax of the
> tuple, which can now be different from what it was earlier, i.e.
> different from the xwait that we waited for. It's important that the
> 'ctid' and 'priorXmax' that we use to follow the chain are read together.

Oops, you are right; I was too quick. It has been quite some time since
I initially dealt with this problem.
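
To restate the pattern for my own understanding, here is a simplified
sketch with illustrative names (mytup is the next tuple version in the
chain; this is not the exact patch code):

    TransactionId   prior_xmax;
    ItemPointerData prior_ctid;

    /*
     * Snapshot xmax and ctid *together* while holding the buffer lock.
     * (If HEAP_XMAX_IS_MULTI is set, the updater would first have to be
     * resolved with MultiXactIdGetUpdateXid().)
     */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    prior_xmax = HeapTupleHeaderGetRawXmax(tuple->t_data);
    ItemPointerCopy(&tuple->t_data->t_ctid, &prior_ctid);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /* ... wait for the updater (xwait) to finish ... */

    /*
     * When stepping to the next version, its xmin must match the xmax we
     * snapshotted; otherwise the line pointer was recycled for an
     * unrelated tuple and the update chain is broken.
     */
    if (!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data),
                             prior_xmax))
        return TM_Ok;       /* chain broken; nothing further to lock */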

I see you moved the check ItemPointerEquals(&tuple->t_self, ctid) out
to heap_lock_tuple(). But isn't it redundant in the condition
`if (follow_updates && updated && !ItemPointerEquals(&tuple->t_self, &t_ctid))`,
given that `updated` is already true there?

Also, in `heap_lock_updated_tuple_rec()`, is the check for
`TransactionIdIsValid(priorXmax)` now redundant too?

    /* Locking the initial tuple where those
     * values came from is the caller's responsibility. */

I think this is still ambiguous: it can still be read as meaning that
the initial tuple needs to be locked before the function is called.
Maybe it would be better to change it to something like:
"This function will not lock the initial tuple."

Unrelated question: I was wondering why these functions take a "const
ItemPointerData *" as an argument, as opposed to just passing the ctid
by value?

Jasper

#6Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jasper Smit (#5)
Re: Visibility bug in tuple lock

On 17/12/2025 16:51, Jasper Smit wrote:

>> Thanks! That's not quite correct though. This is very subtle, but the
>> 'tuple' argument to heap_lock_updated_tuple() points to the buffer
>> holding the original tuple. So doing
>> HeapTupleHeaderGetUpdateXid(tuple->t_data) reads the current xmax of the
>> tuple, which can now be different from what it was earlier, i.e.
>> different from the xwait that we waited for. It's important that the
>> 'ctid' and 'priorXmax' that we use to follow the chain are read together.

> Oops, you are right; I was too quick. It has been quite some time since
> I initially dealt with this problem.

> I see you moved the check ItemPointerEquals(&tuple->t_self, ctid) out
> to heap_lock_tuple(). But isn't it redundant in the condition
> `if (follow_updates && updated && !ItemPointerEquals(&tuple->t_self, &t_ctid))`,
> given that `updated` is already true there?

Hmm, I don't think so; HEAP_KEYS_UPDATED is also set on deleted tuples.

I tried to add an assertion for that and ran the regression tests, but
nothing hit it. I was a little surprised. I guess we just don't have
test coverage for the deletion case?

> Also, in `heap_lock_updated_tuple_rec()`, is the check for
> `TransactionIdIsValid(priorXmax)` now redundant too?

Yep. I'll replace it with an assertion.

>     /* Locking the initial tuple where those
>      * values came from is the caller's responsibility. */

> I think this is still ambiguous: it can still be read as meaning that
> the initial tuple needs to be locked before the function is called.
> Maybe it would be better to change it to something like:
> "This function will not lock the initial tuple."

Ok.

> Unrelated question: I was wondering why these functions take a "const
> ItemPointerData *" as an argument, as opposed to just passing the ctid
> by value?

I hate it too :-). Mostly historical reasons: it's always been like
that, and it's the convention we use almost everywhere. On 32-bit
systems, passing by reference made more sense.

I chatted about that with Andres Freund just the other day, and he said
that compilers are not good at optimizing passing them by value. I'll
take his word on that.

And while we're at it: when we're passing around ItemPointerDatas, it
would make sense not to store the high and low bits of the block number
separately; they just need to be reassembled into a 32-bit BlockNumber
before every use. I think we should have a separate in-memory
representation of ItemPointerData, packed into a uint64, and use that in
all function signatures. It's a lot of code churn, and would affect
extensions too, but I feel it would probably be worth it.
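
As a rough sketch of what I mean (invented names, not a worked-out
proposal, and ignoring alignment/ABI questions):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* Pack a tid into a single uint64 for in-memory use. */
    typedef uint64 PackedItemPointer;

    static inline PackedItemPointer
    PackItemPointer(const ItemPointerData *tid)
    {
        return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
            (uint64) ItemPointerGetOffsetNumber(tid);
    }

    /* And back again, for code that still wants the on-disk layout. */
    static inline void
    UnpackItemPointer(PackedItemPointer packed, ItemPointerData *tid)
    {
        ItemPointerSet(tid, (BlockNumber) (packed >> 16),
                       (OffsetNumber) (packed & 0xffff));
    }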

- Heikki

#7Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#4)
Re: Visibility bug in tuple lock

Hi,

On 2025-12-15 21:52:29 +0200, Heikki Linnakangas wrote:

> Attached is a new patch version:

Note that the tests included in this seem to fail reliably on 32-bit:
https://cirrus-ci.com/task/5787424783073280

Greetings,

Andres Freund

#8Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#7)
Re: Visibility bug in tuple lock

On 17/12/2025 19:11, Andres Freund wrote:

> On 2025-12-15 21:52:29 +0200, Heikki Linnakangas wrote:

>> Attached is a new patch version:

> Note that the tests included in this seem to fail reliably on 32-bit:
> https://cirrus-ci.com/task/5787424783073280

Ah yeah, the test is sensitive to how exactly the tuples land on the
pages. It'll also fail with e.g. different block sizes. If we want to
include that test permanently, we'll need to make it more robust.

- Heikki

#9Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#6)
1 attachment(s)
Re: Visibility bug in tuple lock

On 17/12/2025 17:42, Heikki Linnakangas wrote:

> On 17/12/2025 16:51, Jasper Smit wrote:

>>> Thanks! That's not quite correct though. This is very subtle, but the
>>> 'tuple' argument to heap_lock_updated_tuple() points to the buffer
>>> holding the original tuple. So doing
>>> HeapTupleHeaderGetUpdateXid(tuple->t_data) reads the current xmax of the
>>> tuple, which can now be different from what it was earlier, i.e.
>>> different from the xwait that we waited for. It's important that the
>>> 'ctid' and 'priorXmax' that we use to follow the chain are read together.

>> Oops, you are right; I was too quick. It has been quite some time since
>> I initially dealt with this problem.

>> I see you moved the check ItemPointerEquals(&tuple->t_self, ctid) out
>> to heap_lock_tuple(). But isn't it redundant in the condition
>> `if (follow_updates && updated && !ItemPointerEquals(&tuple->t_self, &t_ctid))`,
>> given that `updated` is already true there?

> Hmm, I don't think so; HEAP_KEYS_UPDATED is also set on deleted tuples.

Ah, but this codepath is taken when HEAP_KEYS_UPDATED is *not* set. I
got that backwards. So I agree the ItemPointerEquals(&tuple->t_self,
ctid) check is redundant.

This is so subtle that I'm inclined to keep it nevertheless, at least in
the backbranches, just in case we're missing something. In master, we
could perhaps add an assertion for it.
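
Something like this in the branch where 'updated' is known to be true
(a sketch only; whether the invariant actually holds everywhere is
exactly what the assertion would need to verify):

    /*
     * If the tuple was truly updated (not merely deleted or locked),
     * its t_ctid should point to a different tuple version.
     */
    Assert(!ItemPointerEquals(&tuple->t_self, &t_ctid));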

Here's a new patch version. I worked some more on the test, making it
less sensitive to the tuple layout. It should now pass on 32-bit systems
and with different block sizes.

- Heikki

Attachments:

v4-0001-Fix-bug-in-following-update-chain-when-locking-a-.patch (text/x-patch)
From c8e5eb9c74254520d0d6510fc11c9c4db303911f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 17 Dec 2025 22:05:16 +0200
Subject: [PATCH v4 1/1] Fix bug in following update chain when locking a heap
 tuple

Author: Jasper Smit <jasper.smit@servicenow.com>
Discussion: https://www.postgresql.org/message-id/CAOG%2BRQ74x0q%3DkgBBQ%3DmezuvOeZBfSxM1qu_o0V28bwDz3dHxLw%40mail.gmail.com
---
 src/backend/access/heap/heapam.c              |  48 ++++---
 src/test/modules/injection_points/Makefile    |   3 +-
 .../expected/heap_lock_update.out             |  83 +++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../specs/heap_lock_update.spec               | 117 ++++++++++++++++++
 5 files changed, 236 insertions(+), 16 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/heap_lock_update.out
 create mode 100644 src/test/modules/injection_points/specs/heap_lock_update.spec

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6daf4a87dec..2df1aab6f29 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -85,8 +85,11 @@ static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  LockTupleMode mode, bool is_update,
 									  TransactionId *result_xmax, uint16 *result_infomask,
 									  uint16 *result_infomask2);
-static TM_Result heap_lock_updated_tuple(Relation rel, HeapTuple tuple,
-										 const ItemPointerData *ctid, TransactionId xid,
+static TM_Result heap_lock_updated_tuple(Relation rel,
+										 uint16 prior_infomask,
+										 TransactionId prior_raw_xmax,
+										 const ItemPointerData *prior_ctid,
+										 TransactionId xid,
 										 LockTupleMode mode);
 static void GetMultiXactIdHintBits(MultiXactId multi, uint16 *new_infomask,
 								   uint16 *new_infomask2);
@@ -4816,11 +4819,13 @@ l3:
 				 * If there are updates, follow the update chain; bail out if
 				 * that cannot be done.
 				 */
-				if (follow_updates && updated)
+				if (follow_updates && updated &&
+					!ItemPointerEquals(&tuple->t_self, &t_ctid))
 				{
 					TM_Result	res;
 
-					res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+					res = heap_lock_updated_tuple(relation,
+												  infomask, xwait, &t_ctid,
 												  GetCurrentTransactionId(),
 												  mode);
 					if (res != TM_Ok)
@@ -5063,11 +5068,13 @@ l3:
 			}
 
 			/* if there are updates, follow the update chain */
-			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
+			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
+				!ItemPointerEquals(&tuple->t_self, &t_ctid))
 			{
 				TM_Result	res;
 
-				res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+				res = heap_lock_updated_tuple(relation,
+											  infomask, xwait, &t_ctid,
 											  GetCurrentTransactionId(),
 											  mode);
 				if (res != TM_Ok)
@@ -5721,7 +5728,8 @@ test_lockmode_for_conflict(MultiXactStatus status, TransactionId xid,
  * version as well.
  */
 static TM_Result
-heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, TransactionId xid,
+heap_lock_updated_tuple_rec(Relation rel, TransactionId priorXmax,
+							const ItemPointerData *tid, TransactionId xid,
 							LockTupleMode mode)
 {
 	TM_Result	result;
@@ -5734,7 +5742,6 @@ heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, Transactio
 				old_infomask2;
 	TransactionId xmax,
 				new_xmax;
-	TransactionId priorXmax = InvalidTransactionId;
 	bool		cleared_all_frozen = false;
 	bool		pinned_desired_page;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -6048,7 +6055,10 @@ out_unlocked:
  *		Follow update chain when locking an updated tuple, acquiring locks (row
  *		marks) on the updated versions.
  *
- * The initial tuple is assumed to be already locked.
+ * 'prior_infomask', 'prior_raw_xmax' and 'prior_ctid' are the corresponding
+ * fields from the initial tuple.  We will lock the tuples starting from the
+ * one that 'prior_ctid' points to.  Note: This function does not lock the
+ * initial tuple itself.
  *
  * This function doesn't check visibility, it just unconditionally marks the
  * tuple(s) as locked.  If any tuple in the updated chain is being deleted
@@ -6066,16 +6076,22 @@ out_unlocked:
  * levels, because that would lead to a serializability failure.
  */
 static TM_Result
-heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
+heap_lock_updated_tuple(Relation rel,
+						uint16 prior_infomask,
+						TransactionId prior_raw_xmax,
+						const ItemPointerData *prior_ctid,
 						TransactionId xid, LockTupleMode mode)
 {
+	INJECTION_POINT("heap_lock_updated_tuple", NULL);
+
 	/*
-	 * If the tuple has not been updated, or has moved into another partition
-	 * (effectively a delete) stop here.
+	 * If the tuple has moved into another partition (effectively a delete)
+	 * stop here.
 	 */
-	if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
-		!ItemPointerEquals(&tuple->t_self, ctid))
+	if (!ItemPointerIndicatesMovedPartitions(prior_ctid))
 	{
+		TransactionId prior_xmax;
+
 		/*
 		 * If this is the first possibly-multixact-able operation in the
 		 * current transaction, set my per-backend OldestMemberMXactId
@@ -6087,7 +6103,9 @@ heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ct
 		 */
 		MultiXactIdSetOldestMember();
 
-		return heap_lock_updated_tuple_rec(rel, ctid, xid, mode);
+		prior_xmax = (prior_infomask & HEAP_XMAX_IS_MULTI) ?
+			MultiXactIdGetUpdateXid(prior_raw_xmax, prior_infomask) : prior_raw_xmax;
+		return heap_lock_updated_tuple_rec(rel, prior_xmax, prior_ctid, xid, mode);
 	}
 
 	/* nothing to lock */
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index bfdb3f53377..cc9be6dcdd2 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic \
 	    inplace \
-	    syscache-update-pruned
+	    syscache-update-pruned \
+	    heap_lock_update
 
 # Temporarily disabled because of flakiness
 #ISOLATION =+
diff --git a/src/test/modules/injection_points/expected/heap_lock_update.out b/src/test/modules/injection_points/expected/heap_lock_update.out
new file mode 100644
index 00000000000..1ec8d876414
--- /dev/null
+++ b/src/test/modules/injection_points/expected/heap_lock_update.out
@@ -0,0 +1,83 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert wake
+step s1begin: BEGIN;
+step s1update: UPDATE t SET id = 10000 WHERE id = 1 RETURNING ctid;
+ctid 
+-----
+(1,2)
+(1 row)
+
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: ABORT;
+step vacuum: VACUUM t;
+step reinsert: 
+	INSERT INTO t VALUES (10001) RETURNING ctid;
+	UPDATE t SET id = 10002 WHERE id = 10001 RETURNING ctid;
+
+ctid 
+-----
+(1,2)
+(1 row)
+
+ctid 
+-----
+(1,3)
+(1 row)
+
+step wake: 
+	SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+	SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert_and_lock wake
+step s1begin: BEGIN;
+step s1update: UPDATE t SET id = 10000 WHERE id = 1 RETURNING ctid;
+ctid 
+-----
+(1,2)
+(1 row)
+
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: ABORT;
+step vacuum: VACUUM t;
+step reinsert_and_lock: 
+	BEGIN;
+	INSERT INTO t VALUES (10001) RETURNING ctid;
+	SELECT ctid, * FROM t WHERE id = 1 FOR UPDATE;
+	COMMIT;
+	UPDATE t SET id = 10002 WHERE id = 10001 RETURNING ctid;
+
+ctid 
+-----
+(1,2)
+(1 row)
+
+ctid |id
+-----+--
+(0,1)| 1
+(1 row)
+
+ctid 
+-----
+(1,3)
+(1 row)
+
+step wake: 
+	SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+	SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 493e11053dc..c9186ae04d1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,6 +46,7 @@ tests += {
       'basic',
       'inplace',
       'syscache-update-pruned',
+      'heap_lock_update',
       # temporarily disabled because of flakiness
       # 'index-concurrently-upsert',
       # 'index-concurrently-upsert-predicate',
diff --git a/src/test/modules/injection_points/specs/heap_lock_update.spec b/src/test/modules/injection_points/specs/heap_lock_update.spec
new file mode 100644
index 00000000000..b3992a1eb7a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/heap_lock_update.spec
@@ -0,0 +1,117 @@
+# Test race condition in tuple locking
+#
+# This is a reproducer for the bug reported at:
+# https://www.postgresql.org/message-id/CAOG%2BRQ74x0q%3DkgBBQ%3DmezuvOeZBfSxM1qu_o0V28bwDz3dHxLw%40mail.gmail.com
+#
+# The bug was that when following an update chain when locking tuples,
+# we sometimes failed to check that the xmin on the next tuple matched
+# the prior's xmax. If the updated tuple version was vacuumed away and
+# the slot was reused for an unrelated tuple, we'd incorrectly follow
+# and lock the unrelated tuple.
+
+
+# Set up a test table with enough rows to fill a page. We need the
+# UPDATE used in the test to put the new tuple on a different page,
+# because otherwise the VACUUM cannot remove the aborted tuple because
+# we hold a pin on the first page.
+#
+# The exact number of rows inserted doesn't otherwise matter, but we
+# arrange things in a deterministic fashion so that the last inserted
+# tuple goes to (1,1), and the updated and aborted tuple goes to
+# (1,2). That way we can just memorize those ctids in the expected
+# output, to verify that the test exercises the scenario we want.
+setup
+{
+	CREATE EXTENSION injection_points;
+
+	CREATE TABLE t (id int PRIMARY KEY);
+	do $$
+		DECLARE
+			i int;
+			tid tid;
+		BEGIN
+			FOR i IN 1..5000 LOOP
+				INSERT INTO t VALUES (i) RETURNING ctid INTO tid;
+				IF tid = '(1,1)' THEN
+					RETURN;
+				END IF;
+			END LOOP;
+			RAISE 'expected to insert tuple to (1,1)';
+	   END;
+   $$;
+}
+teardown
+{
+	DROP TABLE t;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+step s1begin	{ BEGIN; }
+step s1update	{ UPDATE t SET id = 10000 WHERE id = 1 RETURNING ctid; }
+step s1abort	{ ABORT; }
+step vacuum		{ VACUUM t; }
+
+# Insert a new tuple, and update it.
+step reinsert	{
+	INSERT INTO t VALUES (10001) RETURNING ctid;
+	UPDATE t SET id = 10002 WHERE id = 10001 RETURNING ctid;
+}
+
+# Same as the 'reinsert' step, but for extra confusion, we also stamp
+# the original tuple with the same 'xmax' as the re-inserted one.
+step reinsert_and_lock {
+	BEGIN;
+	INSERT INTO t VALUES (10001) RETURNING ctid;
+	SELECT ctid, * FROM t WHERE id = 1 FOR UPDATE;
+	COMMIT;
+	UPDATE t SET id = 10002 WHERE id = 10001 RETURNING ctid;
+}
+
+step wake	{
+	SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+	SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+}
+
+session s2
+setup	{
+	SELECT FROM injection_points_set_local();
+	SELECT FROM injection_points_attach('heap_lock_updated_tuple', 'wait');
+}
+step s2lock	{ select * from t where id = 1 for update; }
+
+permutation
+	# Begin transaction, update a row. Because of how we set up the
+	# test table, the updated tuple lands at (1,2)
+	s1begin
+	s1update
+
+	# While the updating transaction is open, start a new session that
+	# tries to lock the row. This blocks on the open transaction.
+	s2lock
+
+	# Abort the updating transaction. This unblocks session 2, but it
+	# will immediately hit the injection point and block on that.
+	s1abort
+	# Vacuum away the updated, aborted tuple.
+	vacuum
+
+	# Insert a new tuple. It lands at the same location where the
+	# updated tuple was.
+	reinsert
+
+	# Let the locking transaction continue. It should lock the
+	# original tuple, ignoring the re-inserted tuple.
+	wake(s2lock)
+
+# Variant where the re-inserted tuple is also locked by the inserting
+# transaction. This failed an earlier version of the fix during
+# development.
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert_and_lock
+	wake(s2lock)
-- 
2.47.3

#10Jasper Smit
jbsmit@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Visibility bug in tuple lock

The test is really nice with the injection points and the dynamically
sized data.

> Ah, but this codepath is taken when HEAP_KEYS_UPDATED is *not* set. I
> got that backwards. So I agree the ItemPointerEquals(&tuple->t_self,
> ctid) check is redundant.

Ok, I did not think about deletes. So the boolean `updated` here can
mean both update and delete; that makes sense to me.

> I chatted about that with Andres Freund just the other day, and he said
> that compilers are not good at optimizing passing them by value. I'll
> take his word on that.

I believe that too, but in heap_lock_updated_tuple_rec() the first thing
we do is `ItemPointerCopy(tid, &tupid)`, which probably makes it just
as inefficient.
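
For reference, the line I mean, right at the top of
heap_lock_updated_tuple_rec() (modulo the surrounding declarations):

    /* Copy the caller's tid into a local that the loop advances. */
    ItemPointerCopy(tid, &tupid);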

Jasper

#11Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jasper Smit (#10)
Re: Visibility bug in tuple lock

On 18/12/2025 11:56, Jasper Smit wrote:

> The test is really nice with the injection points and the dynamically
> sized data.

>> Ah, but this codepath is taken when HEAP_KEYS_UPDATED is *not* set. I
>> got that backwards. So I agree the ItemPointerEquals(&tuple->t_self,
>> ctid) check is redundant.

> Ok, I did not think about deletes. So the boolean `updated` here can
> mean both update and delete; that makes sense to me.

Committed and backpatched this now. Thanks!

- Heikki