Table AM Interface Enhancements

Started by Alexander Korotkov, about 2 years ago · 104 messages
#1 Alexander Korotkov
aekorotkov@gmail.com
12 attachment(s)

Hello PostgreSQL Hackers,

I am pleased to submit a series of patches related to the Table Access
Method (AM) interface, which I initially announced during my talk at
PGCon 2023 [1]. These patches are primarily designed to support the
OrioleDB engine, but I believe they could be beneficial for other
table AM implementations as well.

The focus of these patches is to introduce more flexibility and
capabilities into the Table AM interface. This is particularly
relevant for advanced use cases like index-organized tables,
alternative MVCC implementations, etc.

Here's a brief overview of the patches included in this set:

0001-Allow-locking-updated-tuples-in-tuple_update-and--v1.patch

Optimizes the process of locking concurrently updated tuples during
update and delete operations. This is helpful for table AMs where
re-finding existing tuples is expensive.
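
To illustrate the resulting calling convention, here is a condensed
sketch of the executor-side code after this patch (adapted from the
nodeModifyTable.c hunks in the attached patch; variable names follow the
patch, and this is not a verbatim excerpt):

    int options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;

    /* In read committed mode, also lock the latest version on conflict. */
    if (!IsolationUsesXactSnapshot())
        options |= TABLE_MODIFY_LOCK_UPDATED;

    result = table_tuple_delete(rel, tid,
                                estate->es_output_cid,
                                estate->es_snapshot,
                                estate->es_crosscheck_snapshot,
                                options,
                                &tmfd,
                                false,     /* changingPart */
                                oldSlot);  /* receives the old row version */

    /*
     * On TM_Updated the latest row version has already been locked and
     * stored into oldSlot, so EvalPlanQual() can be re-run directly,
     * without a separate table_tuple_lock() call.
     */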

0002-Add-EvalPlanQual-delete-returning-isolation-test-v1.patch

The new isolation test is related to the previous patch. These two
patches were previously discussed in [2].

0003-Allow-table-AM-to-store-complex-data-structures-i-v1.patch

Allows a table AM to store complex data structures in rd_amcache rather
than a single chunk of memory.
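
For context, an illustrative sketch of why a single flat chunk is
limiting (the structure and helper below are hypothetical, not taken from
the patch): relcache invalidation currently releases rd_amcache with a
plain pfree(), so a cache with internal pointers cannot safely live there
unless the AM controls its allocation and cleanup.

    typedef struct MyAmCacheData        /* hypothetical table AM cache */
    {
        int        nkeys;
        Oid       *keytypes;            /* separately allocated array */
        List      *cached_pages;        /* palloc'd list nodes */
    } MyAmCacheData;

    /* Works today only if everything fits in one palloc'd chunk; with
     * this patch the AM can keep a richer structure in rd_amcache. */
    rel->rd_amcache = my_am_build_cache(rel);   /* hypothetical helper */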

0004-Add-table-AM-tuple_is_current-method-v1.patch

This allows us to abstract how (and whether) a table AM uses transaction
identifiers.
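
To make the abstraction concrete (a hypothetical sketch; see the patch
for the actual declaration), this is the kind of heap-specific xmin check
the new method hides:

    /* Heap-specific check this method abstracts away: */
    if (TransactionIdIsCurrentTransactionId(
            HeapTupleHeaderGetXmin(tuple->t_data)))
        ...

    /*
     * A table AM that stores no xmin, or uses a different notion of
     * transaction identity, would instead answer through a callback
     * shaped roughly like (hypothetical, not the patch's declaration):
     *
     *     bool (*tuple_is_current) (Relation rel, TupleTableSlot *slot);
     */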

0005-Generalize-relation-analyze-in-table-AM-interface-v1.patch

Provides a more flexible API for sampling tuples, beneficial for
non-standard table types like index-organized tables.

0006-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v1.patch

Provides a new table AM API method that encapsulates the whole INSERT ...
ON CONFLICT ... algorithm rather than just the implementation of
speculative tokens.

0007-Allow-table-AM-tuple_insert-method-to-return-the--v1.patch

This allows a table AM to return its native tuple slot, which is aware of
table AM-specific system attributes.

0008-Let-table-AM-insertion-methods-control-index-inse-v1.patch

Allows a table AM to skip index insertions in the executor and handle
those insertions itself.

0009-Custom-reloptions-for-table-AM-v1.patch

Enables table AMs to define and override reloptions for tables and indexes.

0010-Notify-table-AM-about-index-creation-v1.patch

Allows table AMs to prepare or update specific meta-information during
index creation.

0011-Introduce-RowRefType-which-describes-the-table-ro-v1.patch

Separates the row identifier type from the lock mode in RowMarkType,
providing clearer semantics and more flexibility.

0012-Introduce-RowID-bytea-tuple-identifier-v1.patch

This patch introduces 'RowID', a new bytea tuple identifier, to
overcome the limitations of the current 32-bit block number and 16-bit
offset-based tuple identifier. This is particularly useful for
index-organized tables and other advanced use cases.
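
For reference, the identifier being generalized is the fixed six-byte
TID; its current definition is shown below for context only (it is not
changed by this patch):

    typedef struct ItemPointerData
    {
        BlockIdData  ip_blkid;      /* 32-bit block number (two uint16 halves) */
        OffsetNumber ip_posid;      /* 16-bit offset within the block */
    } ItemPointerData;

    /*
     * An index-organized table whose rows are addressed by an
     * arbitrary-length primary key cannot encode that key in these
     * 48 bits, hence the variable-length bytea RowID.
     */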

Each commit message contains a detailed explanation of the changes and
their rationale. I believe these enhancements will significantly
improve the flexibility and capabilities of the PostgreSQL Table AM
interface.

I am looking forward to your feedback and suggestions on these patches.

Links

1. https://www.pgcon.org/events/pgcon_2023/schedule/session/470-future-of-table-access-methods/
2. https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg@mail.gmail.com

------
Regards,
Alexander Korotkov

Attachments:

0002-Add-EvalPlanQual-delete-returning-isolation-test-v1.patch
From 539b9cc7063861b7989a676c20ed96d4f4048f6c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Mar 2023 16:47:09 -0700
Subject: [PATCH 02/12] Add EvalPlanQual delete returning isolation test

Author: Andres Freund
Reviewed-by: Pavel Borisov
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
---
 .../isolation/expected/eval-plan-qual.out     | 30 +++++++++++++++++++
 src/test/isolation/specs/eval-plan-qual.spec  |  4 +++
 2 files changed, 34 insertions(+)

diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 73e0aeb50e7..0237271ceec 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -746,6 +746,36 @@ savings  |    600|    1200
 (2 rows)
 
 
+starting permutation: read wx2 wb1 c2 c1 read
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |    600|    1200
+savings  |    600|    1200
+(2 rows)
+
+step wx2: UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance;
+balance
+-------
+   1050
+(1 row)
+
+step wb1: DELETE FROM accounts WHERE balance = 600 RETURNING *; <waiting ...>
+step c2: COMMIT;
+step wb1: <... completed>
+accountid|balance|balance2
+---------+-------+--------
+savings  |    600|    1200
+(1 row)
+
+step c1: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |   1050|    2100
+(1 row)
+
+
 starting permutation: upsert1 upsert2 c1 c2 read
 step upsert1: 
 	WITH upsert AS
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 735c671734e..edd6d19df3a 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -76,6 +76,8 @@ setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
 step wx1	{ UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 # wy1 then wy2 checks the case where quals pass then fail
 step wy1	{ UPDATE accounts SET balance = balance + 500 WHERE accountid = 'checking' RETURNING balance; }
+# wx2 then wb1 checks the case of re-fetching up-to-date values for DELETE ... RETURNING ...
+step wb1	{ DELETE FROM accounts WHERE balance = 600 RETURNING *; }
 
 step wxext1	{ UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 step tocds1	{ UPDATE accounts SET accountid = 'cds' WHERE accountid = 'checking'; }
@@ -353,6 +355,8 @@ permutation wx1 delwcte c1 c2 read
 # test that a delete to a self-modified row throws error when
 # previously updated by a different cid
 permutation wx1 delwctefail c1 c2 read
+# test that a delete re-fetches up-to-date values for returning clause
+permutation read wx2 wb1 c2 c1 read
 
 permutation upsert1 upsert2 c1 c2 read
 permutation readp1 writep1 readp2 c1 c2
-- 
2.39.3 (Apple Git-145)

0001-Allow-locking-updated-tuples-in-tuple_update-and--v1.patch
From 7d87dc017089fa6bb9db489a87cdaff413fce3c3 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 23 Mar 2023 00:12:00 +0300
Subject: [PATCH 01/12] Allow locking updated tuples in tuple_update() and
 tuple_delete()

Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This patch eliminates step 2 by taking the lock during first
tuple_update()/tuple_delete() call.  The heap table access method saves some
effort by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
---
 src/backend/access/heap/heapam.c         | 205 ++++++++++----
 src/backend/access/heap/heapam_handler.c |  94 +++++--
 src/backend/access/table/tableam.c       |  26 +-
 src/backend/commands/trigger.c           |  55 +---
 src/backend/executor/execReplication.c   |  19 +-
 src/backend/executor/nodeModifyTable.c   | 333 +++++++++--------------
 src/include/access/heapam.h              |  19 +-
 src/include/access/tableam.h             |  69 +++--
 src/include/commands/trigger.h           |   4 +-
 9 files changed, 476 insertions(+), 348 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 14de8158d49..5baa2e632f5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2503,10 +2503,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 }
 
 /*
- *	heap_delete - delete a tuple
+ *	heap_delete - delete a tuple, optionally fetching it into a slot
  *
  * See table_tuple_delete() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, we don't
+ * place a lock on the tuple in this function, just fetch the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2515,8 +2516,9 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, ItemPointer tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart)
+			CommandId cid, Snapshot crosscheck, int options,
+			TM_FailureData *tmfd, bool changingPart,
+			TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -2594,7 +2596,7 @@ l1:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to delete invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -2730,7 +2732,30 @@ l1:
 			tmfd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = tp;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
 		if (vmbuffer != InvalidBuffer)
@@ -2904,8 +2929,24 @@ l1:
 	 */
 	CacheInvalidateHeapTuple(relation, &tp, NULL);
 
-	/* Now we can release the buffer */
-	ReleaseBuffer(buffer);
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = tp;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
 
 	/*
 	 * Release the lmgr tuple lock, if we had it.
@@ -2937,8 +2978,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, false /* changingPart */ );
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, false /* changingPart */ , NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -2965,10 +3006,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 }
 
 /*
- *	heap_update - replace a tuple
+ *	heap_update - replace a tuple, optionally fetching it into a slot
  *
  * See table_tuple_update() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, we don't
+ * place a lock on the tuple in this function, just fetch the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2977,9 +3019,9 @@ simple_heap_delete(Relation relation, ItemPointer tid)
  */
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			CommandId cid, Snapshot crosscheck, int options,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
-			TU_UpdateIndexes *update_indexes)
+			TU_UpdateIndexes *update_indexes, TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3156,7 +3198,7 @@ l2:
 	result = HeapTupleSatisfiesUpdate(&oldtup, cid, buffer);
 
 	/* see below about the "no wait" case */
-	Assert(result != TM_BeingModified || wait);
+	Assert(result != TM_BeingModified || (options & TABLE_MODIFY_WAIT));
 
 	if (result == TM_Invisible)
 	{
@@ -3165,7 +3207,7 @@ l2:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to update invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -3367,7 +3409,30 @@ l2:
 			tmfd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = oldtup;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
 		if (vmbuffer != InvalidBuffer)
@@ -3846,7 +3911,26 @@ l2:
 	/* Now we can release the buffer(s) */
 	if (newbuf != buffer)
 		ReleaseBuffer(newbuf);
-	ReleaseBuffer(buffer);
+
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = oldtup;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
+
 	if (BufferIsValid(vmbuffer_new))
 		ReleaseBuffer(vmbuffer_new);
 	if (BufferIsValid(vmbuffer))
@@ -4054,8 +4138,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
 
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, &lockmode, update_indexes);
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, &lockmode, update_indexes, NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -4118,12 +4202,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  *		tuples.
  *
  * Output parameters:
- *	*tuple: all fields filled in
- *	*buffer: set to buffer holding tuple (pinned but not locked at exit)
+ *	*slot: BufferHeapTupleTableSlot filled with tuple
  *	*tmfd: filled in failure cases (see below)
  *
  * Function results are the same as the ones for table_tuple_lock().
  *
+ * If *slot already contains the target tuple, it takes advantage of that by
+ * skipping the ReadBuffer() call.
+ *
  * In the failure cases other than TM_Invisible, the routine fills
  * *tmfd with the tuple's t_ctid, t_xmax (resolving a possible MultiXact,
  * if necessary), and t_cmax (the last only for TM_SelfModified,
@@ -4134,15 +4220,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  * See README.tuplock for a thorough explanation of this mechanism.
  */
 TM_Result
-heap_lock_tuple(Relation relation, HeapTuple tuple,
+heap_lock_tuple(Relation relation, ItemPointer tid, TupleTableSlot *slot,
 				CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-				bool follow_updates,
-				Buffer *buffer, TM_FailureData *tmfd)
+				bool follow_updates, TM_FailureData *tmfd)
 {
 	TM_Result	result;
-	ItemPointer tid = &(tuple->t_self);
 	ItemId		lp;
 	Page		page;
+	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	BlockNumber block;
 	TransactionId xid,
@@ -4154,8 +4239,24 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	bool		skip_tuple_lock = false;
 	bool		have_tuple_lock = false;
 	bool		cleared_all_frozen = false;
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	HeapTuple	tuple = &bslot->base.tupdata;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	/* Take advantage if the slot already contains the relevant tuple */
+	if (!TTS_EMPTY(slot) &&
+		slot->tts_tableOid == relation->rd_id &&
+		ItemPointerCompare(&slot->tts_tid, tid) == 0 &&
+		BufferIsValid(bslot->buffer))
+	{
+		buffer = bslot->buffer;
+		IncrBufferRefCount(buffer);
+	}
+	else
+	{
+		buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	}
 	block = ItemPointerGetBlockNumber(tid);
 
 	/*
@@ -4164,21 +4265,22 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	 * in the middle of changing this, so we'll need to recheck after we have
 	 * the lock.
 	 */
-	if (PageIsAllVisible(BufferGetPage(*buffer)))
+	if (PageIsAllVisible(BufferGetPage(buffer)))
 		visibilitymap_pin(relation, block, &vmbuffer);
 
-	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	page = BufferGetPage(*buffer);
+	page = BufferGetPage(buffer);
 	lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
 	Assert(ItemIdIsNormal(lp));
 
+	tuple->t_self = *tid;
 	tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
 	tuple->t_len = ItemIdGetLength(lp);
 	tuple->t_tableOid = RelationGetRelid(relation);
 
 l3:
-	result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
+	result = HeapTupleSatisfiesUpdate(tuple, cid, buffer);
 
 	if (result == TM_Invisible)
 	{
@@ -4207,7 +4309,7 @@ l3:
 		infomask2 = tuple->t_data->t_infomask2;
 		ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
 
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If any subtransaction of the current top transaction already holds
@@ -4359,12 +4461,12 @@ l3:
 					{
 						result = res;
 						/* recovery code expects to have buffer lock held */
-						LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+						LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 						goto failed;
 					}
 				}
 
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4399,7 +4501,7 @@ l3:
 			if (HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
 				!HEAP_XMAX_IS_EXCL_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4427,7 +4529,7 @@ l3:
 					 * No conflict, but if the xmax changed under us in the
 					 * meantime, start over.
 					 */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 						!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 											 xwait))
@@ -4439,7 +4541,7 @@ l3:
 			}
 			else if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/* if the xmax changed in the meantime, start over */
 				if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
@@ -4467,7 +4569,7 @@ l3:
 			TransactionIdIsCurrentTransactionId(xwait))
 		{
 			/* ... but if the xmax changed in the meantime, start over */
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 				!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 									 xwait))
@@ -4489,7 +4591,7 @@ l3:
 		 */
 		if (require_sleep && (result == TM_Updated || result == TM_Deleted))
 		{
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			goto failed;
 		}
 		else if (require_sleep)
@@ -4514,7 +4616,7 @@ l3:
 				 */
 				result = TM_WouldBlock;
 				/* recovery code expects to have buffer lock held */
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 				goto failed;
 			}
 
@@ -4540,7 +4642,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4580,7 +4682,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4606,12 +4708,12 @@ l3:
 				{
 					result = res;
 					/* recovery code expects to have buffer lock held */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					goto failed;
 				}
 			}
 
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 			/*
 			 * xwait is done, but if xwait had just locked the tuple then some
@@ -4633,7 +4735,7 @@ l3:
 				 * don't check for this in the multixact case, because some
 				 * locker transactions might still be running.
 				 */
-				UpdateXmaxHintBits(tuple->t_data, *buffer, xwait);
+				UpdateXmaxHintBits(tuple->t_data, buffer, xwait);
 			}
 		}
 
@@ -4692,9 +4794,9 @@ failed:
 	 */
 	if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
 	{
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 		visibilitymap_pin(relation, block, &vmbuffer);
-		LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		goto l3;
 	}
 
@@ -4757,7 +4859,7 @@ failed:
 		cleared_all_frozen = true;
 
 
-	MarkBufferDirty(*buffer);
+	MarkBufferDirty(buffer);
 
 	/*
 	 * XLOG stuff.  You might think that we don't need an XLOG record because
@@ -4777,7 +4879,7 @@ failed:
 		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
-		XLogRegisterBuffer(0, *buffer, REGBUF_STANDARD);
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
 
 		xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
 		xlrec.xmax = xid;
@@ -4798,7 +4900,7 @@ failed:
 	result = TM_Ok;
 
 out_locked:
-	LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 out_unlocked:
 	if (BufferIsValid(vmbuffer))
@@ -4816,6 +4918,9 @@ out_unlocked:
 	if (have_tuple_lock)
 		UnlockTupleTuplock(relation, tid, mode);
 
+	/* Put the target tuple to the slot */
+	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
+
 	return result;
 }
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7c28dafb728..f9bba734899 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -45,6 +45,12 @@
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
+static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+								   Snapshot snapshot, TupleTableSlot *slot,
+								   CommandId cid, LockTupleMode mode,
+								   LockWaitPolicy wait_policy, uint8 flags,
+								   TM_FailureData *tmfd);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -298,23 +304,55 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
-					Snapshot snapshot, Snapshot crosscheck, bool wait,
-					TM_FailureData *tmfd, bool changingPart)
+					Snapshot snapshot, Snapshot crosscheck, int options,
+					TM_FailureData *tmfd, bool changingPart,
+					TupleTableSlot *oldSlot)
 {
+	TM_Result	result;
+
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+	result = heap_delete(relation, tid, cid, crosscheck, options,
+						 tmfd, changingPart, oldSlot);
+
+	/*
+	 * If the tuple has been concurrently updated, then get the lock on it.
+	 * (Do only if caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option)  With the lock held retry of the
+	 * delete should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of tuple loaded into
+		 * oldSlot by heap_delete().
+		 */
+		result = heapam_tuple_lock(relation, tid, snapshot,
+								   oldSlot, cid, LockTupleExclusive,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
+	return result;
 }
 
 
 static TM_Result
 heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-					bool wait, TM_FailureData *tmfd,
-					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes)
+					int options, TM_FailureData *tmfd,
+					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
+					TupleTableSlot *oldSlot)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -324,8 +362,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode, update_indexes);
+	result = heap_update(relation, otid, tuple, cid, crosscheck, options,
+						 tmfd, lockmode, update_indexes, oldSlot);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
@@ -352,6 +390,31 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	if (shouldFree)
 		pfree(tuple);
 
+	/*
+	 * If the tuple has been concurrently updated, then get the lock on it.
+	 * (Do only if caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option)  With the lock held retry of the
+	 * update should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of tuple loaded into
+		 * oldSlot by heap_update().
+		 */
+		result = heapam_tuple_lock(relation, otid, snapshot,
+								   oldSlot, cid, *lockmode,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
 	return result;
 }
 
@@ -363,7 +426,6 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
-	Buffer		buffer;
 	HeapTuple	tuple = &bslot->base.tupdata;
 	bool		follow_updates;
 
@@ -373,9 +435,8 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 tuple_lock_retry:
-	tuple->t_self = *tid;
-	result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
-							 follow_updates, &buffer, tmfd);
+	result = heap_lock_tuple(relation, tid, slot, cid, mode, wait_policy,
+							 follow_updates, tmfd);
 
 	if (result == TM_Updated &&
 		(flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
@@ -383,8 +444,6 @@ tuple_lock_retry:
 		/* Should not encounter speculative tuple on recheck */
 		Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
 
-		ReleaseBuffer(buffer);
-
 		if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
 		{
 			SnapshotData SnapshotDirty;
@@ -406,6 +465,8 @@ tuple_lock_retry:
 			InitDirtySnapshot(SnapshotDirty);
 			for (;;)
 			{
+				Buffer		buffer = InvalidBuffer;
+
 				if (ItemPointerIndicatesMovedPartitions(tid))
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
@@ -500,7 +561,7 @@ tuple_lock_retry:
 					/*
 					 * This is a live tuple, so try to lock it again.
 					 */
-					ReleaseBuffer(buffer);
+					ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
 					goto tuple_lock_retry;
 				}
 
@@ -511,7 +572,7 @@ tuple_lock_retry:
 				 */
 				if (tuple->t_data == NULL)
 				{
-					Assert(!BufferIsValid(buffer));
+					ReleaseBuffer(buffer);
 					return TM_Deleted;
 				}
 
@@ -564,9 +625,6 @@ tuple_lock_retry:
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	/* store in slot, transferring existing pin */
-	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
-
 	return result;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c6bdb7e1c68..bb18dacfe11 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -297,16 +297,23 @@ simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
  * via ereport().
  */
 void
-simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot)
+simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch old tuple if the relevant slot is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_delete(rel, tid,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, false /* changingPart */ );
+								options,
+								&tmfd, false /* changingPart */ ,
+								oldSlot);
 
 	switch (result)
 	{
@@ -345,17 +352,24 @@ void
 simple_table_tuple_update(Relation rel, ItemPointer otid,
 						  TupleTableSlot *slot,
 						  Snapshot snapshot,
-						  TU_UpdateIndexes *update_indexes)
+						  TU_UpdateIndexes *update_indexes,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch old tuple if the relevant slot is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_update(rel, otid, slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, &lockmode, update_indexes);
+								options,
+								&tmfd, &lockmode, update_indexes,
+								oldSlot);
 
 	switch (result)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 52177759ab5..c7095751221 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2779,8 +2779,8 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 void
 ExecARDeleteTriggers(EState *estate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *slot,
 					 TransitionCaptureState *transition_capture,
 					 bool is_crosspart_update)
 {
@@ -2789,20 +2789,11 @@ ExecARDeleteTriggers(EState *estate,
 	if ((trigdesc && trigdesc->trig_delete_after_row) ||
 		(transition_capture && transition_capture->tcs_delete_old_table))
 	{
-		TupleTableSlot *slot = ExecGetTriggerOldSlot(estate, relinfo);
-
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
-			GetTupleForTrigger(estate,
-							   NULL,
-							   relinfo,
-							   tupleid,
-							   LockTupleExclusive,
-							   slot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else
+		/*
+		 * Put the FDW old tuple into the slot.  Otherwise, the caller is
+		 * expected to have the old tuple already fetched into the slot.
+		 */
+		if (fdw_trigtuple != NULL)
 			ExecForceStoreHeapTuple(fdw_trigtuple, slot, false);
 
 		AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
@@ -3075,18 +3066,17 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
  * Note: 'src_partinfo' and 'dst_partinfo', when non-NULL, refer to the source
  * and destination partitions, respectively, of a cross-partition update of
  * the root partitioned table mentioned in the query, given by 'relinfo'.
- * 'tupleid' in that case refers to the ctid of the "old" tuple in the source
- * partition, and 'newslot' contains the "new" tuple in the destination
- * partition.  This interface allows to support the requirements of
- * ExecCrossPartitionUpdateForeignKey(); is_crosspart_update must be true in
- * that case.
+ * 'oldslot' contains the "old" tuple in the source partition, and 'newslot'
+ * contains the "new" tuple in the destination partition.  This interface
+ * allows to support the requirements of ExecCrossPartitionUpdateForeignKey();
+ * is_crosspart_update must be true in that case.
  */
 void
 ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 					 ResultRelInfo *src_partinfo,
 					 ResultRelInfo *dst_partinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *oldslot,
 					 TupleTableSlot *newslot,
 					 List *recheckIndexes,
 					 TransitionCaptureState *transition_capture,
@@ -3105,29 +3095,14 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 		 * separately for DELETE and INSERT to capture transition table rows.
 		 * In such case, either old tuple or new tuple can be NULL.
 		 */
-		TupleTableSlot *oldslot;
-		ResultRelInfo *tupsrc;
-
 		Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
 			   !is_crosspart_update);
 
-		tupsrc = src_partinfo ? src_partinfo : relinfo;
-		oldslot = ExecGetTriggerOldSlot(estate, tupsrc);
-
-		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
-			GetTupleForTrigger(estate,
-							   NULL,
-							   tupsrc,
-							   tupleid,
-							   LockTupleExclusive,
-							   oldslot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else if (fdw_trigtuple != NULL)
+		if (fdw_trigtuple != NULL)
+		{
+			Assert(oldslot);
 			ExecForceStoreHeapTuple(fdw_trigtuple, oldslot, false);
-		else
-			ExecClearTuple(oldslot);
+		}
 
 		AfterTriggerSaveEvent(estate, relinfo,
 							  src_partinfo, dst_partinfo,
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 81f27042bc4..f39e1e55ea8 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -583,6 +583,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 	{
 		List	   *recheckIndexes = NIL;
 		TU_UpdateIndexes update_indexes;
+		TupleTableSlot *oldSlot = NULL;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -596,8 +597,12 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		if (rel->rd_rel->relispartition)
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		simple_table_tuple_update(rel, tid, slot, estate->es_snapshot,
-								  &update_indexes);
+								  &update_indexes, oldSlot);
 
 		if (resultRelInfo->ri_NumIndices > 0 && (update_indexes != TU_None))
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
@@ -608,7 +613,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		/* AFTER ROW UPDATE Triggers */
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tid, NULL, slot,
+							 NULL, oldSlot, slot,
 							 recheckIndexes, NULL, false);
 
 		list_free(recheckIndexes);
@@ -642,12 +647,18 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 
 	if (!skip_tuple)
 	{
+		TupleTableSlot *oldSlot = NULL;
+
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_delete_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		/* OK, delete the tuple */
-		simple_table_tuple_delete(rel, tid, estate->es_snapshot);
+		simple_table_tuple_delete(rel, tid, estate->es_snapshot, oldSlot);
 
 		/* AFTER ROW DELETE Triggers */
 		ExecARDeleteTriggers(estate, resultRelInfo,
-							 tid, NULL, NULL, false);
+							 NULL, oldSlot, NULL, false);
 	}
 }
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b16fbe9e22a..ab9530cbf94 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -133,7 +133,7 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
 											   ItemPointer tupleid,
-											   TupleTableSlot *oldslot,
+											   TupleTableSlot *oldSlot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
@@ -570,6 +570,10 @@ ExecInitInsertProjection(ModifyTableState *mtstate,
 	resultRelInfo->ri_newTupleSlot =
 		table_slot_create(resultRelInfo->ri_RelationDesc,
 						  &estate->es_tupleTable);
+	if (node->onConflictAction == ONCONFLICT_UPDATE)
+		resultRelInfo->ri_oldTupleSlot =
+			table_slot_create(resultRelInfo->ri_RelationDesc,
+							  &estate->es_tupleTable);
 
 	/* Build ProjectionInfo if needed (it probably isn't). */
 	if (need_projection)
@@ -1159,7 +1163,7 @@ ExecInsert(ModifyTableContext *context,
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
 							 NULL,
-							 NULL,
+							 resultRelInfo->ri_oldTupleSlot,
 							 slot,
 							 NULL,
 							 mtstate->mt_transition_capture,
@@ -1339,7 +1343,8 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart)
+			  ItemPointer tupleid, bool changingPart, int options,
+			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
 
@@ -1347,9 +1352,10 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 							  estate->es_output_cid,
 							  estate->es_snapshot,
 							  estate->es_crosscheck_snapshot,
-							  true /* wait for commit */ ,
+							  options /* wait for commit */ ,
 							  &context->tmfd,
-							  changingPart);
+							  changingPart,
+							  oldSlot);
 }
 
 /*
@@ -1361,7 +1367,8 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, bool changingPart)
+				   ItemPointer tupleid, HeapTuple oldtuple,
+				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	EState	   *estate = context->estate;
@@ -1379,8 +1386,8 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	{
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tupleid, oldtuple,
-							 NULL, NULL, mtstate->mt_transition_capture,
+							 oldtuple,
+							 slot, NULL, NULL, mtstate->mt_transition_capture,
 							 false);
 
 		/*
@@ -1391,10 +1398,30 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 
 	/* AFTER ROW DELETE Triggers */
-	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
+	ExecARDeleteTriggers(estate, resultRelInfo, oldtuple, slot,
 						 ar_delete_trig_tcs, changingPart);
 }
 
+/*
+ * Initializes the tuple slot in a ResultRelInfo for DELETE action.
+ *
+ * We mark 'projectNewInfoValid' even though the projections themselves
+ * are not initialized here.
+ */
+static void
+ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
+						ResultRelInfo *resultRelInfo)
+{
+	EState	   *estate = mtstate->ps.state;
+
+	Assert(!resultRelInfo->ri_projectNewInfoValid);
+
+	resultRelInfo->ri_oldTupleSlot =
+		table_slot_create(resultRelInfo->ri_RelationDesc,
+						  &estate->es_tupleTable);
+	resultRelInfo->ri_projectNewInfoValid = true;
+}
+
 /* ----------------------------------------------------------------
  *		ExecDelete
  *
@@ -1422,6 +1449,7 @@ ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
+		   TupleTableSlot *oldSlot,
 		   bool processReturning,
 		   bool changingPart,
 		   bool canSetTag,
@@ -1484,6 +1512,11 @@ ExecDelete(ModifyTableContext *context,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		if (!IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * delete the tuple
 		 *
@@ -1494,7 +1527,8 @@ ExecDelete(ModifyTableContext *context,
 		 * transaction-snapshot mode transactions.
 		 */
 ldelete:
-		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart);
+		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
+							   options, oldSlot);
 
 		switch (result)
 		{
@@ -1538,7 +1572,6 @@ ldelete:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
 
 					if (IsolationUsesXactSnapshot())
@@ -1547,87 +1580,29 @@ ldelete:
 								 errmsg("could not serialize access due to concurrent update")));
 
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ. The latest tuple is already found
+					 * and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					EvalPlanQualBegin(context->epqstate);
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  LockTupleExclusive, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldSlot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
 
-					switch (result)
+					/*
+					 * If requested, skip delete and pass back the updated
+					 * row.
+					 */
+					if (epqreturnslot)
 					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/*
-							 * If requested, skip delete and pass back the
-							 * updated row.
-							 */
-							if (epqreturnslot)
-							{
-								*epqreturnslot = epqslot;
-								return NULL;
-							}
-							else
-								goto ldelete;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously updated by this
-							 * command, ignore the delete, otherwise error
-							 * out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_delete() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be deleted was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						default:
-
-							/*
-							 * TM_Invisible should be impossible because we're
-							 * waiting for updated row versions, and would
-							 * already have errored out if the first version
-							 * is invisible.
-							 *
-							 * TM_Updated should be impossible, because we're
-							 * locking the latest version via
-							 * TUPLE_LOCK_FLAG_FIND_LAST_VERSION.
-							 */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
+						*epqreturnslot = epqslot;
+						return NULL;
 					}
-
-					Assert(false);
-					break;
+					else
+						goto ldelete;
 				}
 
 			case TM_Deleted:
@@ -1661,7 +1636,8 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple, changingPart);
+	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+					   oldSlot, changingPart);
 
 	/* Process RETURNING if present and if requested */
 	if (processReturning && resultRelInfo->ri_projectReturning)
@@ -1679,17 +1655,13 @@ ldelete:
 		}
 		else
 		{
+			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
 			if (oldtuple != NULL)
-			{
 				ExecForceStoreHeapTuple(oldtuple, slot, false);
-			}
 			else
-			{
-				if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
-												   SnapshotAny, slot))
-					elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
-			}
+				ExecCopySlot(slot, oldSlot);
+			Assert(!TupIsNull(slot));
 		}
 
 		rslot = ExecProcessReturning(resultRelInfo, slot, context->planSlot);
@@ -1788,12 +1760,16 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
+	/* Make sure ri_oldTupleSlot is initialized. */
+	if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+		ExecInitUpdateProjection(mtstate, resultRelInfo);
+
 	/*
 	 * Row movement, part 1.  Delete the tuple, but skip RETURNING processing.
 	 * We want to return rows from INSERT.
 	 */
 	ExecDelete(context, resultRelInfo,
-			   tupleid, oldtuple,
+			   tupleid, oldtuple, resultRelInfo->ri_oldTupleSlot,
 			   false,			/* processReturning */
 			   true,			/* changingPart */
 			   false,			/* canSetTag */
@@ -1834,21 +1810,13 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 			return true;
 		else
 		{
-			/* Fetch the most recent version of old tuple. */
-			TupleTableSlot *oldSlot;
-
-			/* ... but first, make sure ri_oldTupleSlot is initialized. */
-			if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-				ExecInitUpdateProjection(mtstate, resultRelInfo);
-			oldSlot = resultRelInfo->ri_oldTupleSlot;
-			if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
-											   tupleid,
-											   SnapshotAny,
-											   oldSlot))
-				elog(ERROR, "failed to fetch tuple being updated");
-			/* and project the new tuple to retry the UPDATE with */
+			/*
+			 * ExecDelete already fetches the most recent version of the old
+			 * tuple into resultRelInfo->ri_oldTupleSlot.  So, just project the new
+			 * tuple to retry the UPDATE with.
+			 */
 			*retry_slot = ExecGetUpdateNewTuple(resultRelInfo, epqslot,
-												oldSlot);
+												resultRelInfo->ri_oldTupleSlot);
 			return false;
 		}
 	}
@@ -1967,7 +1935,8 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-			  bool canSetTag, UpdateContext *updateCxt)
+			  bool canSetTag, int options, TupleTableSlot *oldSlot,
+			  UpdateContext *updateCxt)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2059,7 +2028,8 @@ lreplace:
 				ExecCrossPartitionUpdateForeignKey(context,
 												   resultRelInfo,
 												   insert_destrel,
-												   tupleid, slot,
+												   tupleid,
+												   resultRelInfo->ri_oldTupleSlot,
 												   inserted_tuple);
 
 			return TM_Ok;
@@ -2102,9 +2072,10 @@ lreplace:
 								estate->es_output_cid,
 								estate->es_snapshot,
 								estate->es_crosscheck_snapshot,
-								true /* wait for commit */ ,
+								options /* wait for commit */ ,
 								&context->tmfd, &updateCxt->lockmode,
-								&updateCxt->updateIndexes);
+								&updateCxt->updateIndexes,
+								oldSlot);
 	if (result == TM_Ok)
 		updateCxt->updated = true;
 
@@ -2120,7 +2091,8 @@ lreplace:
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
-				   HeapTuple oldtuple, TupleTableSlot *slot)
+				   HeapTuple oldtuple, TupleTableSlot *slot,
+				   TupleTableSlot *oldSlot)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	List	   *recheckIndexes = NIL;
@@ -2136,7 +2108,7 @@ ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 	/* AFTER ROW UPDATE Triggers */
 	ExecARUpdateTriggers(context->estate, resultRelInfo,
 						 NULL, NULL,
-						 tupleid, oldtuple, slot,
+						 oldtuple, oldSlot, slot,
 						 recheckIndexes,
 						 mtstate->operation == CMD_INSERT ?
 						 mtstate->mt_oc_transition_capture :
@@ -2225,7 +2197,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 	/* Perform the root table's triggers. */
 	ExecARUpdateTriggers(context->estate,
 						 rootRelInfo, sourcePartInfo, destPartInfo,
-						 tupleid, NULL, newslot, NIL, NULL, true);
+						 NULL, oldslot, newslot, NIL, NULL, true);
 }
 
 /* ----------------------------------------------------------------
@@ -2248,6 +2220,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  *		no relevant triggers.
  *
  *		slot contains the new tuple value to be stored.
+ *		oldSlot is the slot to store the old tuple.
  *		planSlot is the output of the ModifyTable's subplan; we use it
  *		to access values from other input tables (for RETURNING),
  *		row-ID junk columns, etc.
@@ -2258,7 +2231,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-		   bool canSetTag)
+		   TupleTableSlot *oldSlot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2311,6 +2284,11 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		if (!locked && !IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
 		 * must loop back here to try again.  (We don't need to redo triggers,
@@ -2320,7 +2298,7 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		 */
 redo_act:
 		result = ExecUpdateAct(context, resultRelInfo, tupleid, oldtuple, slot,
-							   canSetTag, &updateCxt);
+							   canSetTag, options, oldSlot, &updateCxt);
 
 		/*
 		 * If ExecUpdateAct reports that a cross-partition update was done,
@@ -2371,88 +2349,30 @@ redo_act:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
-					TupleTableSlot *oldSlot;
 
 					if (IsolationUsesXactSnapshot())
 						ereport(ERROR,
 								(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 								 errmsg("could not serialize access due to concurrent update")));
+					Assert(!locked);
 
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ. The latest tuple is already found
+					 * and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  updateCxt.lockmode, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
-
-					switch (result)
-					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/* Make sure ri_oldTupleSlot is initialized. */
-							if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-								ExecInitUpdateProjection(context->mtstate,
-														 resultRelInfo);
-
-							/* Fetch the most recent version of old tuple. */
-							oldSlot = resultRelInfo->ri_oldTupleSlot;
-							if (!table_tuple_fetch_row_version(resultRelationDesc,
-															   tupleid,
-															   SnapshotAny,
-															   oldSlot))
-								elog(ERROR, "failed to fetch tuple being updated");
-							slot = ExecGetUpdateNewTuple(resultRelInfo,
-														 epqslot, oldSlot);
-							goto redo_act;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously modified by
-							 * this command, ignore the redundant update,
-							 * otherwise error out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_update() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						default:
-							/* see table_tuple_lock call in ExecDelete() */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
-					}
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldSlot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
+					slot = ExecGetUpdateNewTuple(resultRelInfo,
+												 epqslot,
+												 oldSlot);
+					goto redo_act;
 				}
 
 				break;
@@ -2476,7 +2396,7 @@ redo_act:
 		(estate->es_processed)++;
 
 	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
-					   slot);
+					   slot, oldSlot);
 
 	/* Process RETURNING if present */
 	if (resultRelInfo->ri_projectReturning)
@@ -2694,7 +2614,8 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	*returning = ExecUpdate(context, resultRelInfo,
 							conflictTid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
-							canSetTag);
+							existing,
+							canSetTag, true);
 
 	/*
 	 * Clear out existing tuple, as there might not be another conflict among
@@ -2832,7 +2753,6 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-
 	if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 									   tupleid,
 									   SnapshotAny,
@@ -2898,7 +2818,8 @@ lmerge_matched:
 					break;		/* concurrent update/delete */
 				}
 				result = ExecUpdateAct(context, resultRelInfo, tupleid, NULL,
-									   newslot, false, &updateCxt);
+									   newslot, false, TABLE_MODIFY_WAIT, NULL,
+									   &updateCxt);
 
 				/*
 				 * As in ExecUpdate(), if ExecUpdateAct() reports that a
@@ -2918,7 +2839,8 @@ lmerge_matched:
 				if (result == TM_Ok && updateCxt.updated)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot);
+									   tupleid, NULL, newslot,
+									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
 				break;
@@ -2932,11 +2854,12 @@ lmerge_matched:
 						return true;	/* "do nothing" */
 					break;		/* concurrent update/delete */
 				}
-				result = ExecDeleteAct(context, resultRelInfo, tupleid, false);
+				result = ExecDeleteAct(context, resultRelInfo, tupleid,
+									   false, TABLE_MODIFY_WAIT, NULL);
 				if (result == TM_Ok)
 				{
 					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
-									   false);
+									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
 				break;
@@ -3838,12 +3761,18 @@ ExecModifyTable(PlanState *pstate)
 
 				/* Now apply the update. */
 				slot = ExecUpdate(&context, resultRelInfo, tupleid, oldtuple,
-								  slot, node->canSetTag);
+								  slot, resultRelInfo->ri_oldTupleSlot,
+								  node->canSetTag, false);
 				break;
 
 			case CMD_DELETE:
+				/* Initialize slot for DELETE to fetch the old tuple */
+				if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+					ExecInitDeleteTupleSlot(node, resultRelInfo);
+
 				slot = ExecDelete(&context, resultRelInfo, tupleid, oldtuple,
-								  true, false, node->canSetTag, NULL, NULL);
+								  resultRelInfo->ri_oldTupleSlot, true, false,
+								  node->canSetTag, NULL, NULL);
 				break;
 
 			case CMD_MERGE:
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a2d7a0ea72f..e220839d73c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -276,19 +276,22 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
-							 struct TM_FailureData *tmfd, bool changingPart);
+							 CommandId cid, Snapshot crosscheck, int options,
+							 struct TM_FailureData *tmfd, bool changingPart,
+							 TupleTableSlot *oldSlot);
 extern void heap_finish_speculative(Relation relation, ItemPointer tid);
 extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
 							 HeapTuple newtup,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes);
-extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
-								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-								 bool follow_updates,
-								 Buffer *buffer, struct TM_FailureData *tmfd);
+							 TU_UpdateIndexes *update_indexes,
+							 TupleTableSlot *oldSlot);
+extern TM_Result heap_lock_tuple(Relation relation, ItemPointer tid,
+								 TupleTableSlot *slot,
+								 CommandId cid, LockTupleMode mode,
+								 LockWaitPolicy wait_policy, bool follow_updates,
+								 struct TM_FailureData *tmfd);
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index dbb709b56ce..774e8b66d4c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -259,6 +259,11 @@ typedef struct TM_IndexDeleteOp
 /* Follow update chain and lock latest version of tuple */
 #define TUPLE_LOCK_FLAG_FIND_LAST_VERSION		(1 << 1)
 
+/* "options" flag bits for table_tuple_update and table_tuple_delete */
+#define TABLE_MODIFY_WAIT			0x0001
+#define TABLE_MODIFY_FETCH_OLD_TUPLE 0x0002
+#define TABLE_MODIFY_LOCK_UPDATED	0x0004
+
 
 /* Typedef for callback function for table_index_build_scan */
 typedef void (*IndexBuildCallback) (Relation index,
@@ -528,9 +533,10 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
-								 bool changingPart);
+								 bool changingPart,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
@@ -539,10 +545,11 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
 								 LockTupleMode *lockmode,
-								 TU_UpdateIndexes *update_indexes);
+								 TU_UpdateIndexes *update_indexes,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
@@ -1457,7 +1464,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
 }
 
 /*
- * Delete a tuple.
+ * Delete a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_delete instead.
@@ -1468,11 +1475,21 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple is
+ *		fetched into oldSlot when the deletion is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	changingPart - true iff the tuple is being moved to another partition
  *		table due to an update of the partition key. Otherwise, false.
+ *	oldSlot - slot to save the deleted or locked tuple. Can be NULL if none of
+ *		TABLE_MODIFY_FETCH_OLD_TUPLE or TABLE_MODIFY_LOCK_UPDATED options
+ *		is specified.
  *
  * Normal, successful return value is TM_Ok, which means we did actually
  * delete it.  Failure return codes are TM_SelfModified, TM_Updated, and
@@ -1484,16 +1501,18 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  */
 static inline TM_Result
 table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
-				   Snapshot snapshot, Snapshot crosscheck, bool wait,
-				   TM_FailureData *tmfd, bool changingPart)
+				   Snapshot snapshot, Snapshot crosscheck, int options,
+				   TM_FailureData *tmfd, bool changingPart,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_delete(rel, tid, cid,
 										 snapshot, crosscheck,
-										 wait, tmfd, changingPart);
+										 options, tmfd, changingPart,
+										 oldSlot);
 }
 
 /*
- * Update a tuple.
+ * Update a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless you are prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_update instead.
@@ -1505,13 +1524,23 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple is
+ *		fetched into oldSlot when the update is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	lockmode - filled with lock mode acquired on tuple
  *  update_indexes - in success cases this is set to true if new index entries
  *		are required for this tuple
- *
+ *	oldSlot - slot to save the updated or locked tuple. Can be NULL if none of
+ *		TABLE_MODIFY_FETCH_OLD_TUPLE or TABLE_MODIFY_LOCK_UPDATED options
+ *		is specified.
+ *
  * Normal, successful return value is TM_Ok, which means we did actually
  * update it.  Failure return codes are TM_SelfModified, TM_Updated, and
  * TM_BeingModified (the last only possible if wait == false).
@@ -1529,13 +1558,15 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
 static inline TM_Result
 table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-				   bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
-				   TU_UpdateIndexes *update_indexes)
+				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
+				   TU_UpdateIndexes *update_indexes,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_update(rel, otid, slot,
 										 cid, snapshot, crosscheck,
-										 wait, tmfd,
-										 lockmode, update_indexes);
+										 options, tmfd,
+										 lockmode, update_indexes,
+										 oldSlot);
 }
 
 /*
@@ -2051,10 +2082,12 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 
 extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
-									  Snapshot snapshot);
+									  Snapshot snapshot,
+									  TupleTableSlot *oldSlot);
 extern void simple_table_tuple_update(Relation rel, ItemPointer otid,
 									  TupleTableSlot *slot, Snapshot snapshot,
-									  TU_UpdateIndexes *update_indexes);
+									  TU_UpdateIndexes *update_indexes,
+									  TupleTableSlot *oldSlot);
 
 
 /* ----------------------------------------------------------------------------
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index 430e3ca7ddf..4903b4b7bc2 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -216,8 +216,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
 								 TM_FailureData *tmfd);
 extern void ExecARDeleteTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *slot,
 								 TransitionCaptureState *transition_capture,
 								 bool is_crosspart_update);
 extern bool ExecIRDeleteTriggers(EState *estate,
@@ -240,8 +240,8 @@ extern void ExecARUpdateTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
 								 ResultRelInfo *src_partinfo,
 								 ResultRelInfo *dst_partinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *oldslot,
 								 TupleTableSlot *newslot,
 								 List *recheckIndexes,
 								 TransitionCaptureState *transition_capture,
-- 
2.39.3 (Apple Git-145)
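
To make the new calling convention of patch 0002 concrete, here is a minimal,
hypothetical caller sketch (not part of the patch; delete_and_fetch_old is an
invented name) that deletes a tuple, waits for concurrent updaters, and asks
the AM to return the old row in one call, assuming the table_tuple_delete()
signature and TABLE_MODIFY_* flags exactly as defined above:

#include "postgres.h"
#include "access/tableam.h"
#include "nodes/execnodes.h"

/*
 * Sketch only: delete the tuple at 'tid', blocking on concurrent updaters,
 * and have the table AM fetch the deleted row version into 'oldSlot'
 * (e.g. for AFTER ROW triggers or RETURNING), so the caller no longer needs
 * a separate table_tuple_fetch_row_version() call.
 */
static TM_Result
delete_and_fetch_old(Relation rel, ItemPointer tid, EState *estate,
                     TupleTableSlot *oldSlot, TM_FailureData *tmfd)
{
    int         options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;

    return table_tuple_delete(rel, tid,
                              estate->es_output_cid,
                              estate->es_snapshot,
                              InvalidSnapshot,  /* no crosscheck snapshot */
                              options,
                              tmfd,
                              false,            /* changingPart */
                              oldSlot);
}

Per the comments above, adding TABLE_MODIFY_LOCK_UPDATED to 'options' would
additionally make a concurrent-update result come back with the latest row
version already locked and fetched into oldSlot, which is what the reworked
EvalPlanQual handling in ExecUpdate()/ExecDelete() relies on.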

Attachment: 0003-Allow-table-AM-to-store-complex-data-structures-i-v1.patch (application/octet-stream)
From e48a79f6d35b4bf797d372aba69e51b5b34513e9 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:04:58 +0300
Subject: [PATCH 03/12] Allow table AM to store complex data structures in
 rd_amcache

The new table AM method free_rd_amcache is responsible for freeing the data
stored in rd_amcache.
---
 src/backend/access/heap/heapam_handler.c |  1 +
 src/backend/utils/cache/relcache.c       | 18 +++++++++-------
 src/include/access/tableam.h             | 27 ++++++++++++++++++++++++
 src/include/utils/rel.h                  | 10 +++++----
 4 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f9bba734899..29cf023b274 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2642,6 +2642,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 
+	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index b3faccbefe5..c802f33ac59 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -319,6 +319,7 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
 										  StrategyNumber numSupport);
 static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
 static void unlink_initfile(const char *initfilename, int elevel);
+static void release_rd_amcache(Relation rel);
 
 
 /*
@@ -2263,9 +2264,7 @@ RelationReloadIndexInfo(Relation relation)
 	RelationCloseSmgr(relation);
 
 	/* Must free any AM cached data upon relcache flush */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * If it's a shared index, we might be called before backend startup has
@@ -2485,8 +2484,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 		pfree(relation->rd_options);
 	if (relation->rd_indextuple)
 		pfree(relation->rd_indextuple);
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
+	release_rd_amcache(relation);
 	if (relation->rd_fdwroutine)
 		pfree(relation->rd_fdwroutine);
 	if (relation->rd_indexcxt)
@@ -2548,9 +2546,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 	RelationCloseSmgr(relation);
 
 	/* Free AM cached data, if any */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * Treat nailed-in system relations separately, they always need to be
@@ -6873,3 +6869,9 @@ ResOwnerReleaseRelation(Datum res)
 
 	RelationCloseCleanup((Relation) res);
 }
+
+static void
+release_rd_amcache(Relation rel)
+{
+	table_free_rd_amcache(rel);
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 774e8b66d4c..d5972d7a580 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -715,6 +715,13 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * This callback frees relation private cache data stored in rd_amcache.
+	 * If this callback is not provided, rd_amcache is assumed to point to a
+	 * single memory chunk.
+	 */
+	void		(*free_rd_amcache) (Relation rel);
+
 	/*
 	 * See table_relation_size().
 	 *
@@ -1883,6 +1890,26 @@ table_index_validate_scan(Relation table_rel,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Frees relation private cache data stored in rd_amcache.  Uses the
+ * free_rd_amcache method if provided.  Otherwise, assumes rd_amcache points
+ * to a single memory chunk.
+ */
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+	if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+	{
+		rel->rd_tableam->free_rd_amcache(rel);
+	}
+	else
+	{
+		if (rel->rd_amcache)
+			pfree(rel->rd_amcache);
+		rel->rd_amcache = NULL;
+	}
+}
+
 /*
  * Return the current size of `rel` in bytes. If `forkNumber` is
  * InvalidForkNumber, return the relation's overall size, otherwise the size
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 0ad613c4b88..7ec7a9e2805 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -220,10 +220,12 @@ typedef struct RelationData
 	 * rd_amcache is available for index and table AMs to cache private data
 	 * about the relation.  This must be just a cache since it may get reset
 	 * at any time (in particular, it will get reset by a relcache inval
-	 * message for the relation).  If used, it must point to a single memory
-	 * chunk palloc'd in CacheMemoryContext, or in rd_indexcxt for an index
-	 * relation.  A relcache reset will include freeing that chunk and setting
-	 * rd_amcache = NULL.
+	 * message for the relation).  If used by a table AM, it must point to a
+	 * single memory chunk palloc'd in CacheMemoryContext, or to a more complex
+	 * data structure in that memory context to be freed by the free_rd_amcache
+	 * method.  If used by an index AM, it must point to a single memory chunk
+	 * palloc'd in the rd_indexcxt memory context.  A relcache reset will
+	 * include freeing that data and setting rd_amcache = NULL.
 	 */
 	void	   *rd_amcache;		/* available for use by index/table AM */
 
-- 
2.39.3 (Apple Git-145)
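
As an illustration of patch 0003 (not part of the patch; all 'my_am_*' and
'MyAmCache' names are invented), a table AM that caches more than a single
palloc'd chunk could keep the whole cache in its own memory context and
release it from the new callback:

#include "postgres.h"
#include "utils/memutils.h"
#include "utils/rel.h"

typedef struct MyAmCache
{
    MemoryContext cxt;          /* context holding all cached data */
    void       *meta;           /* AM-specific structures live here */
} MyAmCache;

static void
my_am_free_rd_amcache(Relation rel)
{
    MyAmCache  *cache = (MyAmCache *) rel->rd_amcache;

    /* assumes the MyAmCache struct itself was allocated inside cache->cxt */
    if (cache != NULL)
        MemoryContextDelete(cache->cxt);
    rel->rd_amcache = NULL;
}

/* registered in the AM's TableAmRoutine as: .free_rd_amcache = my_am_free_rd_amcache */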

Attachment: 0004-Add-table-AM-tuple_is_current-method-v1.patch (application/octet-stream)
From 13454a9ccd82ff10e8cc7ce7637b843810aa16d3 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:47:53 +0300
Subject: [PATCH 04/12] Add table AM tuple_is_current method

This allows us to abstract how/whether a table AM uses transaction identifiers.
---
 src/backend/access/heap/heapam_handler.c | 14 ++++++++++++++
 src/backend/access/table/tableamapi.c    |  2 ++
 src/backend/utils/adt/ri_triggers.c      |  8 +-------
 src/include/access/tableam.h             | 14 ++++++++++++++
 4 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 29cf023b274..2d4c243028a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -75,6 +75,19 @@ heapam_slot_callbacks(Relation relation)
 	return &TTSOpsBufferHeapTuple;
 }
 
+static bool
+heapam_tuple_is_current(Relation rel, TupleTableSlot *slot)
+{
+	Datum		xminDatum;
+	TransactionId xmin;
+	bool		isnull;
+
+	xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+	Assert(!isnull);
+	xmin = DatumGetTransactionId(xminDatum);
+	return TransactionIdIsCurrentTransactionId(xmin);
+}
+
 
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
@@ -2600,6 +2613,7 @@ static const TableAmRoutine heapam_methods = {
 	.type = T_TableAmRoutine,
 
 	.slot_callbacks = heapam_slot_callbacks,
+	.tuple_is_current = heapam_tuple_is_current,
 
 	.scan_begin = heap_beginscan,
 	.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index d7798b6afb6..0d43e9c1a4f 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -49,6 +49,8 @@ GetTableAmRoutine(Oid amhandler)
 	 * easier to keep AMs up to date, e.g. when forward porting them to a new
 	 * major version.
 	 */
+	Assert(routine->tuple_is_current != NULL);
+
 	Assert(routine->scan_begin != NULL);
 	Assert(routine->scan_end != NULL);
 	Assert(routine->scan_rescan != NULL);
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 6945d99b3d5..0412cfac431 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -1263,9 +1263,6 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 {
 	const RI_ConstraintInfo *riinfo;
 	int			ri_nullcheck;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
 
 	/*
 	 * AfterTriggerSaveEvent() handles things such that this function is never
@@ -1333,10 +1330,7 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 	 * this if we knew the INSERT trigger already fired, but there is no easy
 	 * way to know that.)
 	 */
-	xminDatum = slot_getsysattr(oldslot, MinTransactionIdAttributeNumber, &isnull);
-	Assert(!isnull);
-	xmin = DatumGetTransactionId(xminDatum);
-	if (TransactionIdIsCurrentTransactionId(xmin))
+	if (table_tuple_is_current(fk_rel, oldslot))
 		return true;
 
 	/* If all old and new key values are equal, no check is needed */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index d5972d7a580..8f9caf3f3c3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -300,6 +300,11 @@ typedef struct TableAmRoutine
 	 */
 	const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
 
+	/*
+	 * Check if tuple in the slot belongs to the current transaction.
+	 */
+	bool		(*tuple_is_current) (Relation rel, TupleTableSlot *slot);
+
 
 	/* ------------------------------------------------------------------------
 	 * Table scan callbacks.
@@ -901,6 +906,15 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
  */
 extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
 
+/*
+ * Check if tuple in the slot belongs to the current transaction.
+ */
+static inline bool
+table_tuple_is_current(Relation rel, TupleTableSlot *slot)
+{
+	return rel->rd_tableam->tuple_is_current(rel, slot);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
-- 
2.39.3 (Apple Git-145)
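
For patch 0004, here is a sketch of what a non-heap implementation of the new
callback could look like (hypothetical; my_slot_get_inserting_xid is an
assumed AM-internal helper, not an existing PostgreSQL function):

#include "postgres.h"
#include "access/transam.h"
#include "access/xact.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

/* assumed AM helper: returns the xid the AM recorded for this tuple */
extern TransactionId my_slot_get_inserting_xid(TupleTableSlot *slot);

static bool
my_am_tuple_is_current(Relation rel, TupleTableSlot *slot)
{
    TransactionId xid = my_slot_get_inserting_xid(slot);

    /* the tuple is "current" if it was written by our own transaction */
    return TransactionIdIsValid(xid) &&
        TransactionIdIsCurrentTransactionId(xid);
}

Callers such as ri_triggers.c then go through table_tuple_is_current() and no
longer assume an xmin system attribute exists.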

Attachment: 0005-Generalize-relation-analyze-in-table-AM-interface-v1.patch (application/octet-stream)
From 9650c8c2cd1725a93cd29f9395575288dda9277c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 8 Jun 2023 04:20:29 +0300
Subject: [PATCH 05/12] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table, written
in acquire_sample_rows().  A custom table AM can only redefine the way to get
the next block/tuple by implementing the scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows the table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) in the relation_analyze() API function.
---
 src/backend/access/heap/heapam_handler.c | 286 +++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           | 288 +----------------------
 src/include/access/tableam.h             |  92 ++------
 src/include/commands/vacuum.h            |   5 +
 src/include/foreign/fdwapi.h             |   6 +-
 6 files changed, 317 insertions(+), 362 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2d4c243028a..ed0e436e0c5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
 
 static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   Snapshot snapshot, TupleTableSlot *slot,
@@ -1233,6 +1234,288 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * Comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b, void *arg)
+{
+	HeapTuple	ha = *(const HeapTuple *) a;
+	HeapTuple	hb = *(const HeapTuple *) b;
+	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+	if (ba < bb)
+		return -1;
+	if (ba > bb)
+		return 1;
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+static BufferAccessStrategy analyze_bstrategy;
+
+/*
+ * heapam_acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * Selected rows are returned in the caller-allocated array rows[], which
+ * must have at least targrows entries.
+ * The actual number of rows selected is returned as the function result.
+ * We also estimate the total numbers of live and dead rows in the table,
+ * and return them into *totalrows and *totaldeadrows, respectively.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ *
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+heapam_acquire_sample_rows(Relation onerel, int elevel,
+						   HeapTuple *rows, int targrows,
+						   double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstoskip = -1;	/* -1 means not set yet */
+	uint32		randseed;		/* Seed for block sampler(s) */
+	BlockNumber totalblocks;
+	TransactionId OldestXmin;
+	BlockSamplerData bs;
+	ReservoirStateData rstate;
+	TupleTableSlot *slot;
+	TableScanDesc scan;
+	BlockNumber nblocks;
+	BlockNumber blksdone = 0;
+#ifdef USE_PREFETCH
+	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
+	BlockSamplerData prefetch_bs;
+#endif
+
+	Assert(targrows > 0);
+
+	totalblocks = RelationGetNumberOfBlocks(onerel);
+
+	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
+
+	/* Prepare for sampling block numbers */
+	randseed = pg_prng_uint32(&pg_global_prng_state);
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
+
+#ifdef USE_PREFETCH
+	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
+	/* Create another BlockSampler, using the same seed, for prefetching */
+	if (prefetch_maximum)
+		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+#endif
+
+	/* Report sampling block numbers */
+	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
+								 nblocks);
+
+	/* Prepare for sampling rows */
+	reservoir_init_selection_state(&rstate, targrows);
+
+	scan = table_beginscan_analyze(onerel);
+	slot = table_slot_create(onerel, NULL);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * If we are doing prefetching, then go ahead and tell the kernel about
+	 * the first set of pages we are going to want.  This also moves our
+	 * iterator out ahead of the main one being used, where we will keep it so
+	 * that we're always pre-fetching out prefetch_maximum number of blocks
+	 * ahead.
+	 */
+	if (prefetch_maximum)
+	{
+		for (int i = 0; i < prefetch_maximum; i++)
+		{
+			BlockNumber prefetch_block;
+
+			if (!BlockSampler_HasMore(&prefetch_bs))
+				break;
+
+			prefetch_block = BlockSampler_Next(&prefetch_bs);
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+		}
+	}
+#endif
+
+	/* Outer loop over blocks to sample */
+	while (BlockSampler_HasMore(&bs))
+	{
+		bool		block_accepted;
+		BlockNumber targblock = BlockSampler_Next(&bs);
+#ifdef USE_PREFETCH
+		BlockNumber prefetch_targblock = InvalidBlockNumber;
+
+		/*
+		 * Make sure that every time the main BlockSampler is moved forward
+		 * that our prefetch BlockSampler also gets moved forward, so that we
+		 * always stay out ahead.
+		 */
+		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
+			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+#endif
+
+		vacuum_delay_point();
+
+		block_accepted = heapam_scan_analyze_next_block(scan, targblock, analyze_bstrategy);
+
+#ifdef USE_PREFETCH
+
+		/*
+		 * When pre-fetching, after we get a block, tell the kernel about the
+		 * next one we will want, if there's any left.
+		 *
+		 * We want to do this even if the table_scan_analyze_next_block() call
+		 * above decides against analyzing the block it picked.
+		 */
+		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+#endif
+
+		/*
+		 * Don't analyze if table_scan_analyze_next_block() indicated this
+		 * block is unsuitable for analyzing.
+		 */
+		if (!block_accepted)
+			continue;
+
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		{
+			/*
+			 * The first targrows sample rows are simply copied into the
+			 * reservoir. Then we start replacing tuples in the sample until
+			 * we reach the end of the relation.  This algorithm is from Jeff
+			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
+			 * works by repeatedly computing the number of tuples to skip
+			 * before selecting a tuple, which replaces a randomly chosen
+			 * element of the reservoir (current set of tuples).  At all times
+			 * the reservoir is a true random sample of the tuples we've
+			 * passed over so far, so when we fall off the end of the relation
+			 * we're done.
+			 */
+			if (numrows < targrows)
+				rows[numrows++] = ExecCopySlotHeapTuple(slot);
+			else
+			{
+				/*
+				 * t in Vitter's paper is the number of records already
+				 * processed.  If we need to compute a new S value, we must
+				 * use the not-yet-incremented value of samplerows as t.
+				 */
+				if (rowstoskip < 0)
+					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+				if (rowstoskip <= 0)
+				{
+					/*
+					 * Found a suitable tuple, so save it, replacing one old
+					 * tuple at random
+					 */
+					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
+
+					Assert(k >= 0 && k < targrows);
+					heap_freetuple(rows[k]);
+					rows[k] = ExecCopySlotHeapTuple(slot);
+				}
+
+				rowstoskip -= 1;
+			}
+
+			samplerows += 1;
+		}
+
+		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
+									 ++blksdone);
+	}
+
+	ExecDropSingleTupleTableSlot(slot);
+	table_endscan(scan);
+
+	/*
+	 * If we didn't find as many tuples as we wanted then we're done. No sort
+	 * is needed, since they're already in order.
+	 *
+	 * Otherwise we need to sort the collected tuples by position
+	 * (itempointer). It's not worth worrying about corner cases where the
+	 * tuples are already sorted.
+	 */
+	if (numrows == targrows)
+		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
+							compare_rows, NULL);
+
+	/*
+	 * Estimate total numbers of live and dead rows in relation, extrapolating
+	 * on the assumption that the average tuple density in pages we didn't
+	 * scan is the same as in the pages we did scan.  Since what we scanned is
+	 * a random sample of the pages in the relation, this should be a good
+	 * assumption.
+	 */
+	if (bs.m > 0)
+	{
+		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+	}
+	else
+	{
+		*totalrows = 0.0;
+		*totaldeadrows = 0.0;
+	}
+
+	/*
+	 * Emit some interesting relation info
+	 */
+	ereport(elevel,
+			(errmsg("\"%s\": scanned %d of %u pages, "
+					"containing %.0f live rows and %.0f dead rows; "
+					"%d rows in sample, %.0f estimated total rows",
+					RelationGetRelationName(onerel),
+					bs.m, totalblocks,
+					liverows, deadrows,
+					numrows, *totalrows)));
+
+	return numrows;
+}
+
+static inline void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = heapam_acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	analyze_bstrategy = bstrategy;
+}
+
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -2651,10 +2934,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 0d43e9c1a4f..dc2a0a0ff6c 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -90,8 +90,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index e264ffdcf28..5c8acc955ef 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -97,10 +97,6 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
-static int	acquire_sample_rows(Relation onerel, int elevel,
-								HeapTuple *rows, int targrows,
-								double *totalrows, double *totaldeadrows);
-static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows, double *totaldeadrows);
@@ -201,10 +197,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1092,277 +1087,6 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
-/*
- * acquire_sample_rows -- acquire a random sample of rows from the table
- *
- * Selected rows are returned in the caller-allocated array rows[], which
- * must have at least targrows entries.
- * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
- * and return them into *totalrows and *totaldeadrows, respectively.
- *
- * The returned list of tuples is in order by physical position in the table.
- * (We will rely on this later to derive correlation estimates.)
- *
- * As of May 2004 we use a new two-stage method:  Stage one selects up
- * to targrows random blocks (or all blocks, if there aren't so many).
- * Stage two scans these blocks and uses the Vitter algorithm to create
- * a random sample of targrows rows (or less, if there are less in the
- * sample of blocks).  The two stages are executed simultaneously: each
- * block is processed as soon as stage one returns its number and while
- * the rows are read stage two controls which ones are to be inserted
- * into the sample.
- *
- * Although every row has an equal chance of ending up in the final
- * sample, this sampling method is not perfect: not every possible
- * sample has an equal chance of being selected.  For large relations
- * the number of different blocks represented by the sample tends to be
- * too small.  We can live with that for now.  Improvements are welcome.
- *
- * An important property of this sampling method is that because we do
- * look at a statistically unbiased set of blocks, we should get
- * unbiased estimates of the average numbers of live and dead rows per
- * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
- */
-static int
-acquire_sample_rows(Relation onerel, int elevel,
-					HeapTuple *rows, int targrows,
-					double *totalrows, double *totaldeadrows)
-{
-	int			numrows = 0;	/* # rows now in reservoir */
-	double		samplerows = 0; /* total # rows collected */
-	double		liverows = 0;	/* # live rows seen */
-	double		deadrows = 0;	/* # dead rows seen */
-	double		rowstoskip = -1;	/* -1 means not set yet */
-	uint32		randseed;		/* Seed for block sampler(s) */
-	BlockNumber totalblocks;
-	TransactionId OldestXmin;
-	BlockSamplerData bs;
-	ReservoirStateData rstate;
-	TupleTableSlot *slot;
-	TableScanDesc scan;
-	BlockNumber nblocks;
-	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
-
-	Assert(targrows > 0);
-
-	totalblocks = RelationGetNumberOfBlocks(onerel);
-
-	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
-
-	/* Prepare for sampling block numbers */
-	randseed = pg_prng_uint32(&pg_global_prng_state);
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
-
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
-	/* Report sampling block numbers */
-	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
-								 nblocks);
-
-	/* Prepare for sampling rows */
-	reservoir_init_selection_state(&rstate, targrows);
-
-	scan = table_beginscan_analyze(onerel);
-	slot = table_slot_create(onerel, NULL);
-
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
-
-	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
-	{
-		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
-		vacuum_delay_point();
-
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
-		{
-			/*
-			 * The first targrows sample rows are simply copied into the
-			 * reservoir. Then we start replacing tuples in the sample until
-			 * we reach the end of the relation.  This algorithm is from Jeff
-			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
-			 * works by repeatedly computing the number of tuples to skip
-			 * before selecting a tuple, which replaces a randomly chosen
-			 * element of the reservoir (current set of tuples).  At all times
-			 * the reservoir is a true random sample of the tuples we've
-			 * passed over so far, so when we fall off the end of the relation
-			 * we're done.
-			 */
-			if (numrows < targrows)
-				rows[numrows++] = ExecCopySlotHeapTuple(slot);
-			else
-			{
-				/*
-				 * t in Vitter's paper is the number of records already
-				 * processed.  If we need to compute a new S value, we must
-				 * use the not-yet-incremented value of samplerows as t.
-				 */
-				if (rowstoskip < 0)
-					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
-				if (rowstoskip <= 0)
-				{
-					/*
-					 * Found a suitable tuple, so save it, replacing one old
-					 * tuple at random
-					 */
-					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
-
-					Assert(k >= 0 && k < targrows);
-					heap_freetuple(rows[k]);
-					rows[k] = ExecCopySlotHeapTuple(slot);
-				}
-
-				rowstoskip -= 1;
-			}
-
-			samplerows += 1;
-		}
-
-		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
-									 ++blksdone);
-	}
-
-	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
-
-	/*
-	 * If we didn't find as many tuples as we wanted then we're done. No sort
-	 * is needed, since they're already in order.
-	 *
-	 * Otherwise we need to sort the collected tuples by position
-	 * (itempointer). It's not worth worrying about corner cases where the
-	 * tuples are already sorted.
-	 */
-	if (numrows == targrows)
-		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
-							compare_rows, NULL);
-
-	/*
-	 * Estimate total numbers of live and dead rows in relation, extrapolating
-	 * on the assumption that the average tuple density in pages we didn't
-	 * scan is the same as in the pages we did scan.  Since what we scanned is
-	 * a random sample of the pages in the relation, this should be a good
-	 * assumption.
-	 */
-	if (bs.m > 0)
-	{
-		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
-		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
-	}
-	else
-	{
-		*totalrows = 0.0;
-		*totaldeadrows = 0.0;
-	}
-
-	/*
-	 * Emit some interesting relation info
-	 */
-	ereport(elevel,
-			(errmsg("\"%s\": scanned %d of %u pages, "
-					"containing %.0f live rows and %.0f dead rows; "
-					"%d rows in sample, %.0f estimated total rows",
-					RelationGetRelationName(onerel),
-					bs.m, totalblocks,
-					liverows, deadrows,
-					numrows, *totalrows)));
-
-	return numrows;
-}
-
-/*
- * Comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b, void *arg)
-{
-	HeapTuple	ha = *(const HeapTuple *) a;
-	HeapTuple	hb = *(const HeapTuple *) b;
-	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
-	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
-	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
-	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
-	if (ba < bb)
-		return -1;
-	if (ba > bb)
-		return 1;
-	if (oa < ob)
-		return -1;
-	if (oa > ob)
-		return 1;
-	return 0;
-}
-
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1452,9 +1176,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8f9caf3f3c3..2e7e9f73527 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -659,41 +660,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -714,6 +680,15 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/*
+	 * Provides a row sampling callback for the relation and the number of
+	 * pages in the relation.
+	 */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1757,42 +1732,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1898,6 +1837,17 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * Provides a row sampling callback for the relation and the number of
+ * pages in the relation.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4af02940c54..7a42784eee2 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -104,6 +104,11 @@ typedef struct ParallelVacuumState ParallelVacuumState;
  */
 typedef struct VacAttrStats *VacAttrStatsP;
 
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 typedef Datum (*AnalyzeAttrFetchFunc) (VacAttrStatsP stats, int rownum,
 									   bool *isNull);
 
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 996c62e3055..0a1c3381fc2 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.3 (Apple Git-145)
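
For patch 0005, the heapam_analyze() implementation above is essentially the
template: another table AM only needs to hand back its own sampling callback
and a page count.  A hypothetical sketch (the 'my_am_*' names are invented;
the sampler is assumed to match the AcquireSampleRowsFunc signature moved to
commands/vacuum.h by this patch):

#include "postgres.h"
#include "access/tableam.h"
#include "commands/vacuum.h"
#include "storage/bufmgr.h"

/* assumed AM-specific sampler, e.g. walking leaf pages of an index-organized table */
extern int  my_am_acquire_sample_rows(Relation onerel, int elevel,
                                      HeapTuple *rows, int targrows,
                                      double *totalrows, double *totaldeadrows);

static void
my_am_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
                       BlockNumber *totalpages, BufferAccessStrategy bstrategy)
{
    *func = my_am_acquire_sample_rows;
    *totalpages = RelationGetNumberOfBlocks(relation);
    /* bstrategy can be stashed by the AM for use inside the sampler */
}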

Attachment: 0006-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v1.patch (application/octet-stream)
From 1c00b74596e63064f7a51395b1a48a4627d0654d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH 06/12] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs need to implement INSERT ... ON CONFLICT ... with
speculative tokens.  The only flexibility they have is a custom implementation
of those tokens via the tuple_insert_speculative() and
tuple_complete_speculative() API functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a
single tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function provides clear semantics that allow different
implementations of the INSERT ... ON CONFLICT ... functionality.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ed0e436e0c5..b9933917a56 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -316,6 +316,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2916,8 +3194,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index dc2a0a0ff6c..b233f585258 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -77,8 +77,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index ab9530cbf94..717dc749500 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -137,7 +137,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -270,66 +269,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1015,12 +954,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1037,23 +983,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1076,57 +1027,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2419,144 +2326,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2e7e9f73527..4ba44856a2e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -515,19 +516,16 @@ typedef struct TableAmRoutine
 								 CommandId cid, int options,
 								 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1405,36 +1403,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into a table, checking arbiter indexes for conflicts.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it takes into account
+ * `arbiterIndexes`, the list of OIDs of arbiter indexes.
+ *
+ * If the tuple doesn't violate uniqueness on any arbiter index, then it is
+ * inserted and the slot containing the inserted tuple is returned.
+ *
+ * If the tuple violates uniqueness on some arbiter index, then this function
+ * returns NULL and doesn't insert the tuple.  Additionally, if `lockedSlot` is
+ * provided, the conflicting tuple is locked with `lockmode` and placed into
+ * `lockedSlot`.
+ *
+ * The executor state `estate` is passed to this method to allow calculation
+ * of index tuples.  The temporary tuple table slot `tempSlot` is passed for
+ * holding a potentially conflicting tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

0007-Allow-table-AM-tuple_insert-method-to-return-the--v1.patchapplication/octet-stream; name=0007-Allow-table-AM-tuple_insert-method-to-return-the--v1.patchDownload
From d9be5dde5057b35c294ab8948a2a9a7f09e75651 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:28:27 +0300
Subject: [PATCH 07/12] Allow table AM tuple_insert() method to return the
 different slot

This allows a table AM to return its native tuple slot even if a
VirtualTupleTableSlot is given as input.  A native tuple slot has knowledge
of AM-specific system attributes, which could be accessed in the future.
---
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/executor/nodeModifyTable.c   |  6 +++---
 src/include/access/tableam.h             | 20 +++++++++++---------
 3 files changed, 17 insertions(+), 13 deletions(-)
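
As a rough sketch (not part of the patch), an AM whose native slot type
carries AM-specific system attributes could copy the incoming slot and hand
its own slot back to the executor.  Here myam_get_insert_slot() and
myam_insert_internal() are hypothetical helpers:

static TupleTableSlot *
myam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
				  int options, BulkInsertState bistate)
{
	/* AM-owned slot of the AM's native slot type (hypothetical helper) */
	TupleTableSlot *myslot = myam_get_insert_slot(relation);

	/* Materialize the possibly virtual input slot into the native form. */
	ExecCopySlot(myslot, slot);

	/* AM-specific insertion; sets myslot's tts_tid and tts_tableOid. */
	myam_insert_internal(relation, myslot, cid, options, bistate);

	/* The executor continues with this slot instead of the one passed in. */
	return myslot;
}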

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index b9933917a56..ac32698f9d9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -257,7 +257,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
  * ----------------------------------------------------------------------------
  */
 
-static void
+static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 					int options, BulkInsertState bistate)
 {
@@ -274,6 +274,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 
 	if (shouldFree)
 		pfree(tuple);
+
+	return slot;
 }
 
 static void
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 717dc749500..ec1d6499dcc 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1041,9 +1041,9 @@ ExecInsert(ModifyTableContext *context,
 		else
 		{
 			/* insert the tuple normally */
-			table_tuple_insert(resultRelationDesc, slot,
-							   estate->es_output_cid,
-							   0, NULL);
+			slot = table_tuple_insert(resultRelationDesc, slot,
+									  estate->es_output_cid,
+									  0, NULL);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4ba44856a2e..6501576e6ce 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -512,9 +512,9 @@ typedef struct TableAmRoutine
 	 */
 
 	/* see table_tuple_insert() for reference about parameters */
-	void		(*tuple_insert) (Relation rel, TupleTableSlot *slot,
-								 CommandId cid, int options,
-								 struct BulkInsertStateData *bistate);
+	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
+									 CommandId cid, int options,
+									 struct BulkInsertStateData *bistate);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -1390,16 +1390,18 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
- * On return the slot's tts_tid and tts_tableOid are updated to reflect the
- * insertion. But note that any toasting of fields within the slot is NOT
- * reflected in the slots contents.
+ * Returns the slot containing the inserted tuple, which may differ from the
+ * given slot.  For instance, the source slot may be a VirtualTupleTableSlot,
+ * while the result slot corresponds to the table AM.  On return the slot's
+ * tts_tid and tts_tableOid are updated to reflect the insertion.  But note that
+ * any toasting of fields within the slot is NOT reflected in the slot's contents.
  */
-static inline void
+static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 				   int options, struct BulkInsertStateData *bistate)
 {
-	rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-								  bistate);
+	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
+										 bistate);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

0009-Custom-reloptions-for-table-AM-v1.patchapplication/octet-stream; name=0009-Custom-reloptions-for-table-AM-v1.patchDownload
From 93dc244fa2c955bb66ef02048a9ca07e0f264c43 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH 09/12] Custom reloptions for table AM

Let table AM define custom reloptions for its tables.  Also, let table AM
override reloptions for indexes built on its tables.
---
 src/backend/access/common/reloptions.c   |  9 ++--
 src/backend/access/heap/heapam_handler.c | 21 +++++++++
 src/backend/access/table/tableamapi.c    | 18 +++++++
 src/backend/commands/indexcmds.c         |  3 +-
 src/backend/commands/tablecmds.c         | 60 +++++++++++++++---------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       | 30 ++++++++++--
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 52 ++++++++++++++++++++
 9 files changed, 169 insertions(+), 30 deletions(-)
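
For example, a table AM can implement the new reloptions callback on top of
the existing build_reloptions() machinery.  A rough sketch, assuming a
hypothetical AM with a single custom integer option and a relopt kind
registered elsewhere with add_reloption_kind()/add_int_reloption():

typedef struct MyAmOptions
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	int			compress_level; /* hypothetical AM-specific option */
} MyAmOptions;

static relopt_kind myam_relopt_kind;	/* set up once, e.g. in _PG_init() */

static bytea *
myam_reloptions(char relkind, Datum reloptions, bool validate)
{
	static const relopt_parse_elt tab[] = {
		{"compress_level", RELOPT_TYPE_INT,
		 offsetof(MyAmOptions, compress_level)},
	};

	if (relkind != RELKIND_RELATION &&
		relkind != RELKIND_TOASTVALUE &&
		relkind != RELKIND_MATVIEW)
		return NULL;

	return (bytea *) build_reloptions(reloptions, validate, myam_relopt_kind,
									  sizeof(MyAmOptions), tab, lengthof(tab));
}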

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c32bb7d2b64..2f1ebaacd47 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1379,7 +1380,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1401,7 +1402,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
@@ -1411,7 +1413,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5f25244ce43..d3b8edc73ee 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2719,6 +2720,24 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	if (relkind == RELKIND_RELATION ||
+		relkind == RELKIND_TOASTVALUE ||
+		relkind == RELKIND_MATVIEW)
+		return heap_reloptions(relkind, reloptions, validate);
+
+	return NULL;
+}
+
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -3224,6 +3243,8 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index b233f585258..1bffedfc10a 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -106,6 +106,24 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0b3b8e98b80..43a914f44b1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -884,7 +884,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d3ee23d08ba..0d8c3bb25d8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -696,6 +696,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	LOCKMODE	parentLockmode;
 	const char *accessMethod = NULL;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -835,6 +836,26 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * If the statement hasn't specified an access method, but we're defining
+	 * a type of relation that needs one, use the default.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		accessMethod = stmt->accessMethod;
+
+		if (partitioned)
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("specifying a table access method is not supported on a partitioned table")));
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind))
+		accessMethod = default_table_access_method;
+
+	/* look up the access method, verify it is for a table */
+	if (accessMethod != NULL)
+		accessMethodId = get_table_am_oid(accessMethod, false);
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -843,6 +864,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -851,6 +878,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -942,26 +970,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * If the statement hasn't specified an access method, but we're defining
-	 * a type of relation that needs one, use the default.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		accessMethod = stmt->accessMethod;
-
-		if (partitioned)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("specifying a table access method is not supported on a partitioned table")));
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind))
-		accessMethod = default_table_access_method;
-
-	/* look up the access method, verify it is for a table */
-	if (accessMethod != NULL)
-		accessMethodId = get_table_am_oid(accessMethod, false);
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -14989,7 +14997,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
@@ -14999,7 +15008,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 86a3b3d8be2..efb7cff985f 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2814,7 +2814,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index c802f33ac59..bc04d68478d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -34,6 +34,7 @@
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -466,6 +467,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -477,14 +479,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* fetch the relation's relcache entry */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
@@ -495,7 +518,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 3602397cf51..2e08edbc54c 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 58ca3ecaedb..32afea79c86 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -740,6 +741,18 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * Parse table AM-specific table options.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
+	/*
+	 * Parse table AM-specific index options.  Useful for table AM to define
+	 * new index options or override existing index options.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1956,6 +1969,44 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of a particular table relation.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
+/*
+ * Parse index options.  Gives table AM a chance to override index-specific
+ * options defined in 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2134,6 +2185,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
 
 #endif							/* TABLEAM_H */
-- 
2.39.3 (Apple Git-145)

0010-Notify-table-AM-about-index-creation-v1.patchapplication/octet-stream; name=0010-Notify-table-AM-about-index-creation-v1.patchDownload
From 774446b0bbb6b1beedc7d8d2c045ed8e32db3345 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH 10/12] Notify table AM about index creation

This allows a table AM to do some preparation during index build.  In
particular, a table AM could update its specific meta-information.  That could
also be useful if the table AM overrides index implementations.
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)
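
To illustrate, a table AM keeping its own per-index metadata might use the two
new hooks roughly as follows.  This is only a sketch; myam_prepare_index_info()
and myam_register_index() are hypothetical helpers:

static bool
myam_define_index_validate(Relation rel, IndexStmt *stmt,
						   bool skip_build, void **arg)
{
	/*
	 * Check that the requested index is supported by this AM and stash
	 * anything needed later into *arg (hypothetical helper).
	 */
	*arg = myam_prepare_index_info(rel, stmt, skip_build);
	return true;
}

static bool
myam_define_index(Relation rel, Oid indoid, bool reindex,
				  bool skip_constraint_checks, bool skip_build, void *arg)
{
	/* Update AM-specific metadata now that the index exists (hypothetical). */
	myam_register_index(rel, indoid, reindex, skip_build, arg);
	return true;
}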

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d3b8edc73ee..de38174f83d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -3237,6 +3237,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 143fae01ebd..9bb09e9bfb8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3823,6 +3823,8 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 43a914f44b1..cda39711180 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -576,6 +576,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -620,6 +621,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that relationId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -647,24 +668,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1194,6 +1197,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 32afea79c86..b2b397023c7 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -690,6 +690,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1876,6 +1886,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, and it should be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notifies table AM about index creation on `rel` with oid `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.3 (Apple Git-145)

0008-Let-table-AM-insertion-methods-control-index-inse-v1.patchapplication/octet-stream; name=0008-Let-table-AM-insertion-methods-control-index-inse-v1.patchDownload
From 4a404edd4a475d7ef95158035d0f9307da488c89 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH 08/12] Let table AM insertion methods control index insertion

A new parameter for the tuple_insert() and multi_insert() methods provides a
way to skip index insertions in the executor.  In this case, the table AM can
handle index insertions itself.
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 23 ++++++++++++++++-------
 12 files changed, 58 insertions(+), 24 deletions(-)
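
As a sketch of the intended use (not part of the patch), an AM that maintains
index entries inside its own insertion path would report that the executor
should skip ExecInsertIndexTuples().  Here myam_insert_and_index() is a
hypothetical helper:

static TupleTableSlot *
myam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
				  int options, BulkInsertState bistate, bool *insert_indexes)
{
	/* AM-specific insertion that also maintains the indexes itself. */
	slot = myam_insert_and_index(relation, slot, cid, options, bistate);

	/* Tell the executor not to insert index tuples on our behalf. */
	*insert_indexes = false;

	return slot;
}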

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5baa2e632f5..f6ea2703219 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2095,7 +2095,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2444,6 +2445,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac32698f9d9..5f25244ce43 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -259,7 +259,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -275,6 +275,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bb18dacfe11..eb8af13eabe 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -283,9 +283,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 522da0ac855..a508b62a050 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index f4861652a95..840b0621294 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -399,6 +399,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -418,7 +419,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -427,7 +429,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1240,11 +1242,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index e91920ca14a..ef69d72504c 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -583,6 +583,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -599,7 +600,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index ac2e74fa3fb..10cf4933c1a 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -479,6 +479,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -493,7 +494,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 323d9bf8702..d3ee23d08ba 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6278,8 +6278,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index f39e1e55ea8..fda9877a55f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -515,6 +515,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -529,9 +530,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index ec1d6499dcc..5faf10b254f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1040,13 +1040,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e220839d73c..e280cc328e1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -274,7 +274,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6501576e6ce..58ca3ecaedb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -514,7 +514,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -529,7 +530,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1390,6 +1392,10 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * This function sets `*insert_indexes` to true if it expects the caller to
+ * insert the relevant index tuples.  If `*insert_indexes` is set to false,
+ * then the table AM takes care of index insertion itself.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot. For instance, source slot may by VirtualTupleTableSlot, but
  * the result is corresponding to table AM. On return the slot's tts_tid and
@@ -1398,10 +1404,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1459,10 +1466,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2077,7 +2085,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.3 (Apple Git-145)
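
To illustrate how the new insert_indexes out parameter is intended to be used,
here is a minimal sketch of a tuple_insert callback for a hypothetical table AM
that maintains its indexes by itself.  The myam_* helpers are placeholders
invented for the example; only the callback signature comes from the patch.

static TupleTableSlot *
myam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
				  int options, BulkInsertState bistate, bool *insert_indexes)
{
	/* Store the tuple the way this AM organizes its data. */
	myam_store_tuple(relation, slot, cid, options, bistate);

	/*
	 * An index-organized table AM can maintain its indexes right here
	 * instead of relying on the executor.
	 */
	myam_update_indexes(relation, slot);

	/* Tell the executor to skip ExecInsertIndexTuples() for this tuple. */
	*insert_indexes = false;

	return slot;
}

A heap-like AM instead sets *insert_indexes to true and lets the executor
insert the index entries, exactly as heapam_tuple_insert does above.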

0012-Introduce-RowID-bytea-tuple-identifier-v1.patch (application/octet-stream)
From 86f1699bc8ea2fafac7b26cf7f9c369a1f016326 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Jun 2023 04:26:30 +0300
Subject: [PATCH 12/12] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: the tuple identifier (tid)
and a whole-row copy.  The tuple identifier used for regular tables consists of
a 32-bit block number and a 16-bit offset.  This is limiting for some use cases,
in particular index-organized tables.  The whole-row copy is used to identify
tuples in FDWs.  That could be extended to regular tables, but it seems
overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose how its tuples are identified by providing the new get_row_ref_type()
API function.  The new system attribute RowIdAttributeNumber holds the RowID
when appropriate.  Table AM methods now accept Datum arguments as tuple
identifiers; those Datums are either tids or byteas, depending on what
table_get_row_ref_type() reports.  The ModifyTable node and triggers are aware
of RowIDs.  IndexScan and BitmapScan nodes are not aware of RowIDs and still
expect tids; table AMs that use RowIDs are expected to redefine those nodes
using hooks.
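
As a rough illustration (not part of the patch itself), a table AM that
identifies its rows by a bytea key would opt in along these lines.  The myam_*
names are placeholders invented for the example; the callback signatures and
the Datum-based tupleid convention are what the patch defines.

static RowRefType
myam_get_row_ref_type(Relation rel)
{
	return ROW_REF_ROWID;
}

static TM_Result
myam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
				  Snapshot snapshot, Snapshot crosscheck, int options,
				  TM_FailureData *tmfd, bool changingPart,
				  TupleTableSlot *oldSlot)
{
	/* With ROW_REF_ROWID, the executor passes a bytea, not an ItemPointer. */
	bytea	   *rowid = DatumGetByteaP(tupleid);

	return myam_delete_by_rowid(relation, rowid, cid, snapshot, crosscheck,
								options, tmfd, changingPart, oldSlot);
}

With ROW_REF_TID (the heapam case), the same callbacks simply call
DatumGetItemPointer(tupleid), as heapam_tuple_delete does below.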
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |  11 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 141 +++++++++----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 +-
 src/backend/parser/parse_relation.c      |   7 +-
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  56 +++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/primnodes.h            |   1 +
 src/include/utils/tuplestore.h           |   3 +
 23 files changed, 508 insertions(+), 150 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index bcff849aa90..caaee8424ea 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -982,7 +982,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index c52d40dce0f..7bfde232759 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -756,6 +756,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index de38174f83d..504ef922f18 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -47,7 +47,7 @@
 #include "utils/rel.h"
 #include "utils/sampling.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -199,7 +199,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -208,7 +208,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -373,7 +373,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -420,7 +420,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -600,12 +600,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -628,7 +629,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -645,7 +646,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -653,6 +654,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -700,7 +702,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -716,7 +718,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -724,6 +726,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2660,6 +2663,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use the TID row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -3240,6 +3252,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index eb8af13eabe..2d831a1388d 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -310,7 +310,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -366,7 +366,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 3ce6c09b44d..1759f4c2cee 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1643,7 +1643,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index c7095751221..7612c203139 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -83,7 +83,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2688,7 +2688,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2702,7 +2702,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2930,7 +2930,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2950,7 +2950,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3249,7 +3249,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3274,7 +3274,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3370,8 +3372,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3577,18 +3579,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* event size, stored in
+																		 * high-order bits */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3640,6 +3648,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -3992,14 +4003,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4032,7 +4063,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4051,6 +4082,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4091,7 +4123,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4251,6 +4302,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4282,15 +4334,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4322,13 +4376,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4364,13 +4431,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4544,7 +4621,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4647,6 +4724,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6039,6 +6117,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6132,6 +6212,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6142,6 +6228,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6161,6 +6255,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6176,10 +6278,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6217,20 +6366,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6375,6 +6510,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6396,7 +6545,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24c2b60c62a..1803d5df5e7 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4429,7 +4429,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4c5a7bbf620..dacdeabba1d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -869,13 +869,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/* open relation, if we need to access it for this mark type */
 			switch (rc->markType)
@@ -908,6 +910,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			if (erm->markType == ROW_MARK_COPY)
+				erm->refType = ROW_REF_COPY;
+			else
+				erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1295,6 +1301,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2727,7 +2734,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index fda9877a55f..f24f5e28c2f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -256,7 +256,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -440,7 +441,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -577,7 +579,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -644,7 +647,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index e459971d32e..53efde108cc 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 5faf10b254f..8a2e431a433 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -132,7 +132,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldSlot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -149,12 +149,12 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static bool ExecMergeMatched(ModifyTableContext *context,
 							 ResultRelInfo *resultRelInfo,
-							 ItemPointer tupleid,
+							 Datum tupleid,
 							 bool canSetTag);
 static void ExecMergeNotMatched(ModifyTableContext *context,
 								ResultRelInfo *resultRelInfo,
@@ -1221,7 +1221,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1252,7 +1252,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1276,7 +1276,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1356,7 +1356,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldSlot,
 		   bool processReturning,
@@ -1545,7 +1545,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldSlot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1562,7 +1562,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1611,7 +1611,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1766,7 +1766,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1843,7 +1843,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -1999,7 +1999,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldSlot)
 {
@@ -2049,7 +2049,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2139,7 +2139,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldSlot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2193,10 +2193,14 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2304,7 +2308,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldSlot);
 
 	/* Process RETURNING if present */
@@ -2336,7 +2340,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2392,7 +2408,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2411,7 +2427,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, bool canSetTag)
+		  Datum tupleid, bool canSetTag)
 {
 	bool		matched;
 
@@ -2458,7 +2474,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL;
+	matched = DatumGetPointer(tupleid) != NULL;
 	if (matched)
 		matched = ExecMergeMatched(context, resultRelInfo, tupleid, canSetTag);
 
@@ -2497,7 +2513,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static bool
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, bool canSetTag)
+				 Datum tupleid, bool canSetTag)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	TupleTableSlot *newslot;
@@ -2619,7 +2635,7 @@ lmerge_matched:
 				if (result == TM_Ok && updateCxt.updated)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2638,7 +2654,7 @@ lmerge_matched:
 									   false, TABLE_MODIFY_WAIT, NULL);
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2736,9 +2752,13 @@ lmerge_matched:
 							if (TupIsNull(epqslot))
 								return false;
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 								return false;
 
@@ -2762,11 +2782,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3265,10 +3281,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3317,6 +3333,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3366,7 +3384,8 @@ ExecModifyTable(PlanState *pstate)
 				{
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
-					ExecMerge(&context, node->resultRelInfo, NULL, node->canSetTag);
+					ExecMerge(&context, node->resultRelInfo,
+							  PointerGetDatum(NULL), node->canSetTag);
 					continue;	/* no RETURNING support yet */
 				}
 
@@ -3402,7 +3421,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3444,16 +3464,33 @@ ExecModifyTable(PlanState *pstate)
 					{
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
-						ExecMerge(&context, node->resultRelInfo, NULL, node->canSetTag);
+						ExecMerge(&context, node->resultRelInfo,
+								  PointerGetDatum(NULL), node->canSetTag);
 						continue;	/* no RETURNING support yet */
 					}
 
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3530,6 +3567,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3564,6 +3602,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -3803,10 +3844,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4118,6 +4169,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 15055077d03..a4ad0aa0d2e 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -378,7 +378,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index ef2ce552970..0d849332904 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index f456b3b0a44..837d2e92c5a 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -896,17 +897,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index b728d0c9d73..af6a82b3246 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -283,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -486,7 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
-	childrte->reftype = ROW_REF_TID;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 56feb553141..df99d6b49b3 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -1503,7 +1503,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->relid = RelationGetRelid(rel);
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1589,7 +1589,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->relid = RelationGetRelid(rel);
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3273,6 +3273,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 41a362310a8..0ec58ee7f73 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1832,6 +1832,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 0c444378227..26e5934716a 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces tuple storage into the slot.
+ * Thus, it can work with slot types other than the minimal tuple.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index 8f08682750b..d717a7cafec 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b2b397023c7..c5a24289440 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -477,7 +477,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -536,7 +536,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -547,7 +547,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -560,7 +560,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -706,6 +706,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * If this callback is not provided, rd_amcache is assumed to point to
@@ -1297,9 +1302,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1307,7 +1312,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1319,7 +1324,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1504,7 +1510,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1533,12 +1539,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1552,7 +1558,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1589,13 +1595,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1607,7 +1613,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1636,12 +1642,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1923,6 +1929,20 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  Returns ROW_REF_TID when table AM routine
+ * is not accessible.  This happens during catalog initialization.  All catalog
+ * tables are known to use heap.
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index 4903b4b7bc2..15e1fbe7700 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 278a1227fc7..766e7a6ba4b 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2072,6 +2072,7 @@ typedef struct OnConflictExpr
 typedef enum RowRefType
 {
 	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
 	ROW_REF_COPY				/* Full row copy */
 } RowRefType;
 
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 1077c5fdeaa..c9f1f57aac5 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -73,6 +73,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.3 (Apple Git-145)

0011-Introduce-RowRefType-which-describes-the-table-ro-v1.patchapplication/octet-stream; name=0011-Introduce-RowRefType-which-describes-the-table-ro-v1.patchDownload
From 4f5671f95a30ba7e4563264d90603c3812bcbafe Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH 11/12] Introduce RowRefType, which describes the table row
 identifier

Currently, a table row can be identified either by ctid or by the whole row
(for foreign tables).  But the row identifier is mixed together with the lock
mode in RowMarkType.  This commit separates the row identifier type into a
separate enum, RowRefType.
---
 src/backend/optimizer/plan/planner.c   | 16 +++++++++-----
 src/backend/optimizer/prep/preptlist.c |  4 ++--
 src/backend/optimizer/util/inherit.c   | 30 +++++++++++++++-----------
 src/backend/parser/parse_relation.c    | 10 +++++++++
 src/include/nodes/execnodes.h          |  4 ++++
 src/include/nodes/parsenodes.h         |  1 +
 src/include/nodes/plannodes.h          |  4 ++--
 src/include/nodes/primnodes.h          |  7 ++++++
 src/include/optimizer/planner.h        |  3 ++-
 src/tools/pgindent/typedefs.list       |  1 +
 10 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a8cea5efe14..4db9a62789d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2283,6 +2283,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		RowMarkClause *rc = lfirst_node(RowMarkClause, l);
 		RangeTblEntry *rte = rt_fetch(rc->rti, parse->rtable);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		/*
 		 * Currently, it is syntactically impossible to have FOR UPDATE et al
@@ -2305,8 +2306,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2322,6 +2323,7 @@ preprocess_rowmarks(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, l);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		i++;
 		if (!bms_is_member(i, rels))
@@ -2330,8 +2332,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2346,11 +2348,13 @@ preprocess_rowmarks(PlannerInfo *root)
  * Select RowMarkType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
 		/* If it's not a table at all, use ROW_MARK_COPY */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
@@ -2361,11 +2365,13 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
 		if (fdwroutine->GetForeignRowMarkType != NULL)
 			return fdwroutine->GetForeignRowMarkType(rte, strength);
 		/* Otherwise, use ROW_MARK_COPY by default */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 9d46488ef7c..ef2ce552970 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index f9d3ff1e7ac..b728d0c9d73 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -91,7 +92,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +132,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +240,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +267,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +442,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -485,6 +486,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = ROW_REF_TID;
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
@@ -574,14 +576,16 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	if (top_parentrc)
 	{
 		PlanRowMark *childrc = makeNode(PlanRowMark);
+		RowRefType	refType;
 
 		childrc->rti = childRTindex;
 		childrc->prti = top_parentrc->rti;
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&refType);
+		childrc->allRefTypes = (1 << refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +596,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 864ea9b0d5d..56feb553141 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1502,6 +1503,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->relid = RelationGetRelid(rel);
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1587,6 +1589,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->relid = RelationGetRelid(rel);
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1764,6 +1768,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2083,6 +2088,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2159,6 +2165,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2256,6 +2263,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2337,6 +2345,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2499,6 +2508,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5d7f17dee07..115d167dbc3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -448,6 +448,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row identifier for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -743,6 +746,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row identifier for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index e494309da8d..7e1994511f5 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1066,6 +1066,7 @@ typedef struct RangeTblEntry
 	int			rellockmode;	/* lock level that query requires on the rel */
 	struct TableSampleClause *tablesample;	/* sampling info, or NULL */
 	Index		perminfoindex;
+	RowRefType	reftype;		/* row identifier for relation */
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d40af8e59fe..49129cdfda4 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1351,7 +1351,7 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
@@ -1381,7 +1381,7 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index bb930afb521..278a1227fc7 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2068,4 +2068,11 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/* The row identifier */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index fc2e15496dd..b4d615f0844 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13e..79936589fe4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2398,6 +2398,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.3 (Apple Git-145)

#2Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alexander Korotkov (#1)
Re: Table AM Interface Enhancements

Hi,

On Thu, 23 Nov 2023 at 13:43, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hello PostgreSQL Hackers,

I am pleased to submit a series of patches related to the Table Access
Method (AM) interface, which I initially announced during my talk at
PGCon 2023 [1]. These patches are primarily designed to support the
OrioleDB engine, but I believe they could be beneficial for other
table AM implementations as well.

The focus of these patches is to introduce more flexibility and
capabilities into the Table AM interface. This is particularly
relevant for advanced use cases like index-organized tables,
alternative MVCC implementations, etc.

Here's a brief overview of the patches included in this set:

Note: no significant review of the patches, just a first response on
the cover letters and oddities I noticed:

Overall, this patchset adds significant API area to TableAmRoutine,
without adding the relevant documentation on how it's expected to be
used. With the overall size of the patchset also being very
significant, I don't think this patch is reviewable as is; the goal
isn't clear enough, the APIs aren't well explained, and the
interactions with the index API are left up in the air.

0001-Allow-locking-updated-tuples-in-tuple_update-and--v1.patch

Optimizes the process of locking concurrently updated tuples during
update and delete operations. Helpful for table AMs where refinding
existing tuples is expensive.

Is this essentially an optimized implementation of the "DELETE FROM
... RETURNING *" per-tuple primitive?

0003-Allow-table-AM-to-store-complex-data-structures-i-v1.patch

Allows table AM to store complex data structure in rd_amcache rather
than a single chunk of memory.

I don't think we should allow arbitrarily large and arbitrarily many
chunks of data in the relcache or table caches. Why isn't one chunk
enough?

0004-Add-table-AM-tuple_is_current-method-v1.patch

This allows us to abstract how/whether table AM uses transaction identifiers.

I'm not a fan of the indirection here. Also, assuming that table slots
don't outlive transactions, wouldn't this be a more appropriate fit
with the table tuple slot API?

0005-Generalize-relation-analyze-in-table-AM-interface-v1.patch

Provides a more flexible API for sampling tuples, beneficial for
non-standard table types like index-organized tables.

0006-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v1.patch

Provides a new table AM API method to encapsulate the whole INSERT ...
ON CONFLICT ... algorithm rather than just implementation of
speculative tokens.

Does this not still require speculative inserts, with speculative
tokens, for secondary indexes? Why make AMs implement that all over
again?

0007-Allow-table-AM-tuple_insert-method-to-return-the--v1.patch

This allows table AM to return a native tuple slot, which is aware of
table AM-specific system attributes.

This seems reasonable.

0008-Let-table-AM-insertion-methods-control-index-inse-v1.patch

Allows table AM to skip index insertions in the executor and handle
those insertions itself.

Who handles index tuple removal then? I don't see a patch that
describes index AM changes for this...

0009-Custom-reloptions-for-table-AM-v1.patch

Enables table AMs to define and override reloptions for tables and indexes.

0010-Notify-table-AM-about-index-creation-v1.patch

Allows table AMs to prepare or update specific meta-information during
index creation.

I don't think the described use case of this API is OK - a table AM
cannot know about the internals of index AMs, and is definitely not
allowed to overwrite the information of that index.
If I ask for an index that uses the "btree" index, then that needs to
be the AM actually used, or an error needs to be raised if it is
somehow incompatible with the table AM used. It can't be that we
silently update information and create an index that is explicitly not
what the user asked to create.

I also don't see updates in documentation, which I think is quite a
shame as I have trouble understanding some parts.

0012-Introduce-RowID-bytea-tuple-identifier-v1.patch

`This patch introduces 'RowID', a new bytea tuple identifier, to
overcome the limitations of the current 32-bit block number and 16-bit
offset-based tuple identifier. This is particularly useful for
index-organized tables and other advanced use cases.

We don't have any index methods that can handle anything but
block+offset TIDs, and I don't see any changes to the IndexAM APIs to
support these RowID tuples, so what's the plan here? I don't see any
of that in the commit message, nor in the rest of this patchset.

Each commit message contains a detailed explanation of the changes and
their rationale. I believe these enhancements will significantly
improve the flexibility and capabilities of the PostgreSQL Table AM
interface.

I've noticed there is not a lot of rationale for several of the
changes as to why PostgreSQL needs these changes implemented like
this, amongst which the index-related tableAM changes.

I understand that index-organized tables can be quite useful, but I
don't see design solutions to the more complex questions that would
still be required before we could host table AMs like OrioleDB's
as a first-party citizen: For index-organized tables, you also need
index AM APIs that support TIDS with more than 48 bits of data
(assuming we actually want primary keys with >48 bits of usable
space), and for undo-based logging you would probably need index APIs
for retail index tuple deletion. Neither is supplied here, nor is
described why these APIs were omitted.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#3Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Alexander Korotkov (#1)
Re: Table AM Interface Enhancements

On Nov 23, 2023, at 4:42 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

0006-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v1.patch

Provides a new table AM API method to encapsulate the whole INSERT ...
ON CONFLICT ... algorithm rather than just implementation of
speculative tokens.

I *think* I understand that you are taking the part of INSERT..ON CONFLICT that lives outside the table AM and pulling it inside so that table AM authors are free to come up with whatever implementation is more suited for them. The most straightforward way of doing so results in an EState parameter in the interface definition. That seems not so good, as the EState is a fairly complicated structure, and future changes in the executor might want to rearrange what EState tracks, which would change which values tuple_insert_with_arbiter() can depend on. Should the patch at least document which parts of the EState are expected to be in which states, and which parts should be viewed as undefined? If the implementors of table AMs rely on any/all aspects of EState, doesn't that prevent future changes to how that structure is used?

0008-Let-table-AM-insertion-methods-control-index-inse-v1.patch

Allows table AM to skip index insertions in the executor and handle
those insertions itself.

The new parameter could use more documentation.

0009-Custom-reloptions-for-table-AM-v1.patch

Enables table AMs to define and override reloptions for tables and indexes.

This could use some regression tests to exercise the custom reloptions.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Alexander Korotkov
aekorotkov@gmail.com
In reply to: Mark Dilger (#3)
Re: Table AM Interface Enhancements

On Fri, Nov 24, 2023 at 5:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Nov 23, 2023, at 4:42 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

0006-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v1.patch

Provides a new table AM API method to encapsulate the whole INSERT ...
ON CONFLICT ... algorithm rather than just implementation of
speculative tokens.

I *think* I understand that you are taking the part of INSERT..ON CONFLICT that lives outside the table AM and pulling it inside so that table AM authors are free to come up with whatever implementation is more suited for them. The most straightforward way of doing so results in an EState parameter in the interface definition. That seems not so good, as the EState is a fairly complicated structure, and future changes in the executor might want to rearrange what EState tracks, which would change which values tuple_insert_with_arbiter() can depend on.

I think this is the correct understanding.

Should the patch at least document which parts of the EState are expected to be in which states, and which parts should be viewed as undefined? If the implementors of table AMs rely on any/all aspects of EState, doesn't that prevent future changes to how that structure is used?

The new tuple_insert_with_arbiter() table AM API method needs the EState
argument to call executor functions: ExecCheckIndexConstraints(),
ExecUpdateLockMode(), and ExecInsertIndexTuples(). I think we
probably need to invent some opaque way to call these executor functions
without revealing EState to the table AM. Do you think this could work?
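
To illustrate what I have in mind, here is a rough sketch (all the names
below are hypothetical and not from the patchset): the executor would fill
such a struct and pass it to tuple_insert_with_arbiter() instead of EState,
and the table AM would call back through it.

#include "postgres.h"
#include "executor/tuptable.h"
#include "storage/itemptr.h"

/*
 * Hypothetical sketch only.  The executor-private state (EState,
 * ResultRelInfo, arbiter index list) stays hidden behind "opaque";
 * the table AM only sees the callbacks it actually needs.
 */
typedef struct InsertWithArbiterContext
{
	void	   *opaque;			/* executor-private state */

	/* check arbiter indexes; on conflict return false and set *conflictTid */
	bool		(*check_conflict) (void *opaque, TupleTableSlot *slot,
								   ItemPointer conflictTid);

	/* insert index tuples for a row the table AM has just inserted */
	void		(*insert_indexes) (void *opaque, TupleTableSlot *slot);
} InsertWithArbiterContext;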

0008-Let-table-AM-insertion-methods-control-index-inse-v1.patch

Allows table AM to skip index insertions in the executor and handle
those insertions itself.

The new parameter could use more documentation.

0009-Custom-reloptions-for-table-AM-v1.patch

Enables table AMs to define and override reloptions for tables and indexes.

This could use some regression tests to exercise the custom reloptions.

Thank you for these notes. I'll take this into account for the next
patchset version.

------
Regards,
Alexander Korotkov

#5Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Alexander Korotkov (#4)
Re: Table AM Interface Enhancements

On Nov 25, 2023, at 9:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Should the patch at least document which parts of the EState are expected to be in which states, and which parts should be viewed as undefined? If the implementors of table AMs rely on any/all aspects of EState, doesn't that prevent future changes to how that structure is used?

New tuple tuple_insert_with_arbiter() table AM API method needs EState
argument to call executor functions: ExecCheckIndexConstraints(),
ExecUpdateLockMode(), and ExecInsertIndexTuples(). I think we
probably need to invent some opaque way to call this executor function
without revealing EState to table AM. Do you think this could work?

We're clearly not accessing all of the EState, just some specific fields, such as es_per_tuple_exprcontext. I think you could at least refactor to pass the minimum amount of state information through the table AM API.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Mark Dilger (#5)
Re: Table AM Interface Enhancements

Hi, Alexander!

I think table AM extensibility is a very good idea generally, not only in
the scope of APIs that are needed in OrioleDB. Thanks for your proposals!

For patches

0001-Allow-locking-updated-tuples-in-tuple_update-and--v1.patch

0002-Add-EvalPlanQual-delete-returning-isolation-test-v1.patch

The new isolation test is related to the previous patch. These two
patches were previously discussed in [2].

As the discussion in [2] seems close to the patches being committed, and the
only reason they are not in v16 yet is that it was too close to feature freeze,
I've copied these most recent versions of patches 0001 and 0002 from this
thread into [2] to finish and commit them there.

I'm planning to review some of the other patches from the current patchset
soon.

[2]: /messages/by-id/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg@mail.gmail.com

Kind regards,
Pavel Borisov

#7Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#6)
Re: Table AM Interface Enhancements

Hi, Alexander!

I'm planning to review some of the other patches from the current patchset
soon.

I've looked into patch 0003.
The patch looks in good shape and is uncontroversial to me. Making these
memory structures dynamically allocated is simple enough, and it allows
storing complex data like lists etc. I suppose the places that expect
memory structures to be flat and allocated at once are that way only
because there was no need for more complex ones previously. If there is a
need for them, I think they could be added without much doubt, given the
simplicity of the change.

For the code:
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+ {
+ rel->rd_tableam->free_rd_amcache(rel);
+ }
+ else
+ {
+ if (rel->rd_amcache)
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }

here I suggest adding Assert(rel->rd_amcache == NULL) (or maybe better an
error report) after calling free_rd_amcache to be sure the custom
implementation has done what it should do.
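
To make it concrete, a sketch of what I mean on top of the code quoted
above:

static inline void
table_free_rd_amcache(Relation rel)
{
	if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
	{
		rel->rd_tableam->free_rd_amcache(rel);

		/* the custom implementation must not leave a dangling pointer */
		Assert(rel->rd_amcache == NULL);
	}
	else
	{
		if (rel->rd_amcache)
			pfree(rel->rd_amcache);
		rel->rd_amcache = NULL;
	}
}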

Also, I think some brief documentation about writing this custom method
would be quite relevant, maybe based on the already existing comments in
the code.

Kind regards,
Pavel

#8Nikita Malakhov
hukutoc@gmail.com
In reply to: Pavel Borisov (#7)
Re: Table AM Interface Enhancements

Hi,

Pavel, as far as I understand Alexander's idea, an assertion and
especially an ereport here do not make any sense - this method is not
supposed to report an error; it silently calls the underlying [free]
function if there is one and simply falls through otherwise. Also, take
into account that it could be located in an uninterruptible part of the
code.

On the whole topic I have to

On Wed, Nov 29, 2023 at 4:56 PM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

Hi, Alexander!

I'm planning to review some of the other patches from the current
patchset soon.

I've looked into the patch 0003.
The patch looks in good shape and is uncontroversial to me. Making memory
structures to be dynamically allocated is simple enough and it allows to
store complex data like lists etc. I consider places like this that expect
memory structures to be flat and allocated at once are because the was no
need in more complex ones previously. If there is a need for them, I think
they could be added without much doubt, provided the simplicity of the
change.

For the code:
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+ {
+ rel->rd_tableam->free_rd_amcache(rel);
+ }
+ else
+ {
+ if (rel->rd_amcache)
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }

here I suggest adding Assert(rel->rd_amcache == NULL) (or maybe better an
error report) after calling free_rd_amcache to be sure the custom
implementation has done what it should do.

Also, I think some brief documentation about writing this custom method is
quite relevant maybe based on already existing comments in the code.

Kind regards,
Pavel

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#9Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Nikita Malakhov (#8)
Re: Table AM Interface Enhancements

Hi, Nikita!

On Wed, 29 Nov 2023 at 18:27, Nikita Malakhov <hukutoc@gmail.com> wrote:

Hi,

Pavel, as far as I understand Alexander's idea, an assertion and
especially an ereport here do not make any sense - this method is not
supposed to report an error; it silently calls the underlying [free]
function if there is one and simply falls through otherwise. Also, take
into account that it could be located in an uninterruptible part of the
code.

On the whole topic I have to

On Wed, Nov 29, 2023 at 4:56 PM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

Hi, Alexander!

I'm planning to review some of the other patches from the current
patchset soon.

I've looked into patch 0003.
The patch looks in good shape and is uncontroversial to me. Making these
memory structures dynamically allocated is simple enough, and it allows
storing complex data like lists etc. I suppose the places that expect
memory structures to be flat and allocated at once are that way only
because there was no need for more complex ones previously. If there is a
need for them, I think they could be added without much doubt, given the
simplicity of the change.

For the code:
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+ if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+ {
+ rel->rd_tableam->free_rd_amcache(rel);
+ }
+ else
+ {
+ if (rel->rd_amcache)
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;
+ }

here I suggest adding Assert(rel->rd_amcache == NULL) (or maybe better an
error report) after calling free_rd_amcache to be sure the custom
implementation has done what it should do.

Also, I think some brief documentation about writing this custom method
would be quite relevant, maybe based on the already existing comments in
the code.

Kind regards,
Pavel

When we do the default single-chunk routine, we invalidate the rd_amcache pointer:

+ if (rel->rd_amcache)
+ pfree(rel->rd_amcache);
+ rel->rd_amcache = NULL;

If we delegate this to the method, my idea is to check that the method
implementation doesn't leave this pointer valid.
If it's not needed, I'm ok with it, but to me it seems that the check I
proposed makes sense.

Regards,
Pavel

#10Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#9)
Re: Table AM Interface Enhancements

Hi, Alexander!

I've reviewed patch 0004. It's clear enough and I think it does what it's
supposed to.
One thing: in the function signature
+bool (*tuple_is_current) (Relation rel, TupleTableSlot *slot);
there is a Relation argument, which is unused in both existing heapam
methods. It's also unused in the OrioleDB implementation of tuple_is_current.
For what purpose is it needed in the interface?

No other objections around this patch.

I've also looked at 0005-0007. Although it is not a thorough review, they
seem to depend on the previous patch 0004.
Additionally, the changes in 0007 look dependent on 0005. Could the
replacement of the slot inside ExecInsert(), which is already used in the
code below the call of

/* insert the tuple normally */
- table_tuple_insert(resultRelationDesc, slot,
- estate->es_output_cid,
- 0, NULL);

be done without side effects?

Kind regards,
Pavel.

#11Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#10)
Re: Table AM Interface Enhancements

Additionally, the changes in 0007 look dependent on 0005. Could the
replacement of the slot inside ExecInsert(), which is already used in the
code below the call of

/* insert the tuple normally */
- table_tuple_insert(resultRelationDesc, slot,
- estate->es_output_cid,
- 0, NULL);

be done without side effects?

I'm sorry that I didn't include all the relevant code in the previous message:

    /* insert the tuple normally */
- table_tuple_insert(resultRelationDesc, slot,
-   estate->es_output_cid,
-   0, NULL);
+ slot = table_tuple_insert(resultRelationDesc, slot,
+  estate->es_output_cid,
+  0, NULL);
(Previously, the slot variable that exists in ExecInsert() and could be used
later was not modified in the quoted code block.)

Pavel.

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Matthias van de Meent (#2)
Re: Table AM Interface Enhancements

Hi, Matthias!

On Fri, Nov 24, 2023 at 1:07 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Thu, 23 Nov 2023 at 13:43, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hello PostgreSQL Hackers,

I am pleased to submit a series of patches related to the Table Access
Method (AM) interface, which I initially announced during my talk at
PGCon 2023 [1]. These patches are primarily designed to support the
OrioleDB engine, but I believe they could be beneficial for other
table AM implementations as well.

The focus of these patches is to introduce more flexibility and
capabilities into the Table AM interface. This is particularly
relevant for advanced use cases like index-organized tables,
alternative MVCC implementations, etc.

Here's a brief overview of the patches included in this set:

Note: no significant review of the patches, just a first response on
the cover letters and oddities I noticed:

Overall, this patchset adds significant API area to TableAmRoutine,
without adding the relevant documentation on how it's expected to be
used.

I have to note that, unlike documentation for index access methods,
our documentation for table access methods doesn't have an explanation
of API functions. Instead, it just refers to tableam.h for details.
The patches touching tableam.h also revise relevant comments. These
comments are for sure a target for improvements.

With the overall size of the patchset also being very
significant

I wouldn't say that the volume is very significant. It's just 2K lines,
which is not a huge patchset. But it certainly represents changes
of great importance.

I don't think this patch is reviewable as is; the goal
isn't clear enough,

The goal is to revise the table AM API so that new full-featured
implementations can exist. AFAICS, the current API was designed
with zheap in mind, but even zheap was always shipped together with a
core patch. All other table AM implementations I've seen are
very limited. Basically, there is still no real alternative, fully
functional OLTP table AM. I believe API limitations are one of the
reasons for that.

the APIs aren't well explained, and

As I mentioned before, the table AM API is documented by the comments
in tableam.h. The comments in the patchset certainly aren't perfect,
but they are a subject for incremental improvement.

the interactions with the index API are left up in the air.

Right. These patches give table AMs more control over interactions with
indexes without touching the index API. In my PGCon 2016 talk I
proposed that a table AM could have its own implementation of an index AM.

As you mentioned before, this patchset isn't very small already.
Considering it all together with a patchset for an index AM redesign
would make it a mess. I propose we consider here the patches
that are usable by themselves even without index AM changes. The
patches tightly coupled with index AM API changes could be considered
later together with those changes.

0001-Allow-locking-updated-tuples-in-tuple_update-and--v1.patch

Optimizes the process of locking concurrently updated tuples during
update and delete operations. Helpful for table AMs where refinding
existing tuples is expensive.

Is this essentially an optimized implementation of the "DELETE FROM
... RETURNING *" per-tuple primitive?

Not really. The test for "DELETE FROM ... RETURNING *" was used just
to reproduce one of the bugs spotted in [2]. The general idea is to
avoid repeated calls to lock the tuple.
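
For illustration, with the signatures from the diffs upthread (after the
whole patchset is applied), the caller-side pattern could look roughly like
this; the wrapper function name is made up, and the exact options flags that
request locking of the updated tuple are omitted:

#include "postgres.h"
#include "access/tableam.h"
#include "utils/rel.h"

/* Illustrative sketch only; assumes the table_tuple_delete() signature above. */
static TM_Result
delete_returning_old_row(Relation rel, Datum tupleid, CommandId cid,
						 Snapshot snapshot, TupleTableSlot *oldSlot)
{
	TM_FailureData tmfd;

	/*
	 * The table AM may lock a concurrently updated tuple during the delete
	 * itself and return that row version in oldSlot, so the caller doesn't
	 * need a separate table_tuple_lock() call that refinds the tuple.
	 */
	return table_tuple_delete(rel, tupleid, cid, snapshot,
							  InvalidSnapshot,	/* no crosscheck */
							  0,				/* options */
							  &tmfd,
							  false,			/* changingPart */
							  oldSlot);
}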

0003-Allow-table-AM-to-store-complex-data-structures-i-v1.patch

Allows table AM to store complex data structure in rd_amcache rather
than a single chunk of memory.

I don't think we should allow arbitrarily large and arbitrarily many
chunks of data in the relcache or table caches.

Hmm.. What and how much PostgreSQL extensions actually cache seems to be
far beyond the API's control.

Why isn't one chunk
enough?

It's generally possible to fit everything into one chunk, but that's
extremely inconvenient when your cache contains something at least as
complex as tuple slots and descriptors. I think the reason we
still have the one-chunk restriction is that we don't have a full-featured
implementation fitting the API yet. If we had one, I can't imagine its
cache would be a single chunk.

0004-Add-table-AM-tuple_is_current-method-v1.patch

This allows us to abstract how/whether table AM uses transaction identifiers.

I'm not a fan of the indirection here. Also, assuming that table slots
don't outlive transactions, wouldn't this be a more appropriate fit
with the table tuple slot API?

This is a good idea. I will update the patch accordingly.

0005-Generalize-relation-analyze-in-table-AM-interface-v1.patch

Provides a more flexible API for sampling tuples, beneficial for
non-standard table types like index-organized tables.

0006-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v1.patch

Provides a new table AM API method to encapsulate the whole INSERT ...
ON CONFLICT ... algorithm rather than just implementation of
speculative tokens.

Does this not still require speculative inserts, with speculative
tokens, for secondary indexes? Why make AMs implement that all over
again?

The idea here is to generalize upsert and leave speculative tokens as
a detail of one particular implementation. Imagine an index-organized
table and an upsert on the primary key. For that you just need to locate
the relevant page in the tree and do an insert or an update. Speculative
tokens would be an unreasonable complication for this case.

0007-Allow-table-AM-tuple_insert-method-to-return-the--v1.patch

This allows table AM to return a native tuple slot, which is aware of
table AM-specific system attributes.

This seems reasonable.

0008-Let-table-AM-insertion-methods-control-index-inse-v1.patch

Allows table AM to skip index insertions in the executor and handle
those insertions itself.

Who handles index tuple removal then?

A table AM implementation decides what actions to perform on tuple
update/delete. The reason it doesn't really have to care about updating
indexes is that the executor already does it.
The situation is different with deletes, because the executor doesn't
do anything immediately about the corresponding index tuples. They
are deleted later by vacuum, which is also controlled by the table AM
implementation.

I don't see a patch that describes index AM changes for this...

Yes, index AM should be revised for that. See my comment about that earlier.

0009-Custom-reloptions-for-table-AM-v1.patch

Enables table AMs to define and override reloptions for tables and indexes.

0010-Notify-table-AM-about-index-creation-v1.patch

Allows table AMs to prepare or update specific meta-information during
index creation.

I don't think the described use case of this API is OK - a table AM
cannot know about the internals of index AMs, and is definitely not
allowed to overwrite the information of that index.
If I ask for an index that uses the "btree" index AM, then that needs
to be the AM actually used, or an error needs to be raised if it is
somehow incompatible with the table AM used. It can't be that we
silently update information and create an index that is explicitly not
what the user asked to create.

I agree that this currently looks more like a workaround than a proper
API change. I propose that these two be considered later, together with
the relevant index API changes.

I also don't see updates in documentation, which I think is quite a
shame as I have trouble understanding some parts.

Sorry about that. I hope I've given some answers in this message, and
I'll update the patchset comments and commit messages accordingly. I'm
open to answering any further questions.

0012-Introduce-RowID-bytea-tuple-identifier-v1.patch

This patch introduces 'RowID', a new bytea tuple identifier, to
overcome the limitations of the current 32-bit block number and 16-bit
offset-based tuple identifier. This is particularly useful for
index-organized tables and other advanced use cases.

We don't have any index methods that can handle anything but
block+offset TIDs, and I don't see any changes to the IndexAM APIs to
support these RowID tuples, so what's the plan here? I don't see any
of that in the commit message, nor in the rest of this patchset.

Each commit message contains a detailed explanation of the changes and
their rationale. I believe these enhancements will significantly
improve the flexibility and capabilities of the PostgreSQL Table AM
interface.

I've noticed there is not a lot of rationale for several of the
changes as to why PostgreSQL needs them implemented like this, among
them the index-related table AM changes.

I understand that index-organized tables can be quite useful, but I
don't see design solutions to the more complex questions that would
still be required before we could host table AMs like OrioleDB's as
first-party citizens: for index-organized tables, you also need
index AM APIs that support TIDs with more than 48 bits of data
(assuming we actually want primary keys with >48 bits of usable
space), and for undo-based logging you would probably need index APIs
for retail index tuple deletion. Neither is supplied here, nor is it
described why these APIs were omitted.

As I mentioned before, I agree that the index AM changes haven't been
presented yet. And yes, for a bytea rowID there is currently no way to
use the existing index API. However, I think this particular patch
could be useful even without index AM support. It allows table AMs to
identify rows by a custom bytea, even though such tables couldn't be
indexed yet. So, if we allow a custom table AM to implement an
index-organized table, that would have use cases even while secondary
indexes are not supported.

Links
1. https://pgconf.ru/media/2016/02/19/06_Korotkov%20Extendability.pdf

2. /messages/by-id/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg@mail.gmail.com

------
Regards,
Alexander Korotkov

#13Alexander Korotkov
aekorotkov@gmail.com
In reply to: Mark Dilger (#5)
Re: Table AM Interface Enhancements

On Mon, Nov 27, 2023 at 10:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Nov 25, 2023, at 9:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Should the patch at least document which parts of the EState are expected to be in which states, and which parts should be viewed as undefined? If the implementors of table AMs rely on any/all aspects of EState, doesn't that prevent future changes to how that structure is used?

The new tuple_insert_with_arbiter() table AM API method needs an EState
argument to call executor functions: ExecCheckIndexConstraints(),
ExecUpdateLockMode(), and ExecInsertIndexTuples(). I think we
probably need to invent some opaque way to call these executor
functions without revealing EState to the table AM. Do you think this
could work?

We're clearly not accessing all of the EState, just some specific fields, such as es_per_tuple_exprcontext. I think you could at least refactor to pass the minimum amount of state information through the table AM API.

Yes, the table AM doesn't need the full EState, just the ability to do
specific manipulations with tuples. I'll refactor the patch to provide
better isolation here.
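
One possible shape for that refactoring (purely hypothetical, not part
of the patchset) is a small context struct that wraps only the executor
facilities the AM needs, with the executor's state hidden behind an
opaque pointer:

    /* Hypothetical sketch of an opaque arbiter-insert context */
    typedef struct TupleInsertArbiterContext
    {
        ResultRelInfo *resultRelInfo;

        /* callbacks bound by the executor to its own state */
        bool        (*check_index_constraints) (void *opaque,
                                                TupleTableSlot *slot,
                                                ItemPointer conflictTid);
        List       *(*insert_index_tuples) (void *opaque,
                                            TupleTableSlot *slot);

        void       *opaque;     /* the executor's private state */
    } TupleInsertArbiterContext;

The table AM would then only ever see this struct, so future changes to
EState wouldn't affect AM implementations.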

------
Regards,
Alexander Korotkov

#14Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#13)
13 attachment(s)
Re: Table AM Interface Enhancements

On Sun, Mar 3, 2024 at 1:50 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 27, 2023 at 10:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Nov 25, 2023, at 9:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Should the patch at least document which parts of the EState are expected to be in which states, and which parts should be viewed as undefined? If the implementors of table AMs rely on any/all aspects of EState, doesn't that prevent future changes to how that structure is used?

The new tuple_insert_with_arbiter() table AM API method needs an EState
argument to call executor functions: ExecCheckIndexConstraints(),
ExecUpdateLockMode(), and ExecInsertIndexTuples(). I think we
probably need to invent some opaque way to call these executor
functions without revealing EState to the table AM. Do you think this
could work?

We're clearly not accessing all of the EState, just some specific fields, such as es_per_tuple_exprcontext. I think you could at least refactor to pass the minimum amount of state information through the table AM API.

Yes, the table AM doesn't need the full EState, just the ability to do
specific manipulations with tuples. I'll refactor the patch to provide
better isolation here.

Please find the revised patchset attached. The changes are as follows:
1. The patchset is rebased to the current master.
2. The patchset is reordered. I tried to put the less debatable patches at the top.
3. The tuple_is_current() method is moved from the Table AM API to the
slot, as proposed by Matthias van de Meent.
4. An assert is added to table_free_rd_amcache(), as proposed by Pavel Borisov.

------
Regards,
Alexander Korotkov

Attachments:

0010-Notify-table-AM-about-index-creation-v2.patch (application/octet-stream)
From 6ee870660b423fd03f97b5ec875f5439c9cd78fd Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH 10/13] Notify table AM about index creation

This allows the table AM to do some preparation along with an index build.  In
particular, the table AM could update its specific meta-information.  That
could also be useful if the table AM overrides index implementations.
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 422898a609d..534495f254f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -3219,6 +3219,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e230..bca97981051 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3840,6 +3840,8 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7299ebbe9f3..7f24687c6d9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -583,6 +583,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -629,6 +630,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that relationId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -656,24 +677,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1218,6 +1221,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1bfae380637..4ac2d868322 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -683,6 +683,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1849,6 +1859,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, and it should be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notifies table AM about index creation on `rel` with oid `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.3 (Apple Git-145)

0012-Introduce-RowRefType-which-describes-the-table-ro-v2.patch (application/octet-stream)
From ec5931098573a92a6a6bbfb9288089e207700831 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH 12/13] Introduce RowRefType, which describes the table row
 identifier

Currently, the table row could be identified by the ctid or the whole row
(foreign table).  But the row identifier is mixed together with lock mode in
RowMarkType.  This commit separates row identifier type into separate enum
RowRefType.
---
 src/backend/optimizer/plan/planner.c   | 16 +++++++++-----
 src/backend/optimizer/prep/preptlist.c |  4 ++--
 src/backend/optimizer/util/inherit.c   | 30 +++++++++++++++-----------
 src/backend/parser/parse_relation.c    | 10 +++++++++
 src/include/nodes/execnodes.h          |  4 ++++
 src/include/nodes/parsenodes.h         |  1 +
 src/include/nodes/plannodes.h          |  4 ++--
 src/include/nodes/primnodes.h          |  7 ++++++
 src/include/optimizer/planner.h        |  3 ++-
 src/tools/pgindent/typedefs.list       |  1 +
 10 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5564826cb4a..6cbabe83adf 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2279,6 +2279,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		RowMarkClause *rc = lfirst_node(RowMarkClause, l);
 		RangeTblEntry *rte = rt_fetch(rc->rti, parse->rtable);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		/*
 		 * Currently, it is syntactically impossible to have FOR UPDATE et al
@@ -2301,8 +2302,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2318,6 +2319,7 @@ preprocess_rowmarks(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, l);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		i++;
 		if (!bms_is_member(i, rels))
@@ -2326,8 +2328,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2342,11 +2344,13 @@ preprocess_rowmarks(PlannerInfo *root)
  * Select RowMarkType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
 		/* If it's not a table at all, use ROW_MARK_COPY */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
@@ -2357,11 +2361,13 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
 		if (fdwroutine->GetForeignRowMarkType != NULL)
 			return fdwroutine->GetForeignRowMarkType(rte, strength);
 		/* Otherwise, use ROW_MARK_COPY by default */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 7698bfa1a58..4599b0dc761 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 5c7acf8a901..d32b07bab57 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -91,7 +92,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +132,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +240,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +267,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +442,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -485,6 +486,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = ROW_REF_TID;
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
@@ -574,14 +576,16 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	if (top_parentrc)
 	{
 		PlanRowMark *childrc = makeNode(PlanRowMark);
+		RowRefType	refType;
 
 		childrc->rti = childRTindex;
 		childrc->prti = top_parentrc->rti;
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&refType);
+		childrc->allRefTypes = (1 << refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +596,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 427b7325db8..10f2d287b39 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1503,6 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1588,6 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1763,6 +1767,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2081,6 +2086,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2156,6 +2162,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2252,6 +2259,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2332,6 +2340,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2494,6 +2503,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 92593526725..acd9672d789 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -455,6 +455,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row identifier for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -750,6 +753,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row identifier for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7b57fddf2d0..72c8c4caf24 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1079,6 +1079,7 @@ typedef struct RangeTblEntry
 	int			rellockmode;	/* lock level that query requires on the rel */
 	Index		perminfoindex;	/* index of RTEPermissionInfo entry, or 0 */
 	struct TableSampleClause *tablesample;	/* sampling info, or NULL */
+	RowRefType	reftype;		/* row identifier for relation */
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c9..dbe5c535560 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1351,7 +1351,7 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
@@ -1381,7 +1381,7 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 8df8884001d..bc06ff99e21 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2089,4 +2089,11 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/* The row identifier */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index e1d79ffdf3c..98fc796d054 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 042d04c8de2..e5ae4288428 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2420,6 +2420,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.3 (Apple Git-145)

0011-Let-table-AM-insertion-methods-control-index-inse-v2.patch (application/octet-stream)
From 37c52b0375b1261ae45e6fab8976d75365080543 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH 11/13] Let table AM insertion methods control index insertion

A new parameter for the tuple_insert() and multi_insert() methods provides a way
to skip index insertions in the executor.  In this case, the table AM can handle
those insertions itself.
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 23 ++++++++++++++++-------
 12 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f6478f89e77..facad25d5c1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2091,7 +2091,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2440,6 +2441,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 534495f254f..7ebebf4d6ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -247,7 +247,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -263,6 +263,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8d3675be959..805d222cebc 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -273,9 +273,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58a..4d404f22f83 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 8908a440e19..b6736369771 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -397,6 +397,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -416,7 +417,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -425,7 +427,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1265,11 +1267,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..afd3dace079 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -578,6 +578,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -594,7 +595,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6d09b755564..9ec13d09846 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -476,6 +476,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -490,7 +491,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fa8eb55b189..c7ffb5c17fe 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6350,8 +6350,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 0cad843fb69..db685473fc0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -509,6 +509,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -523,9 +524,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7d64fcab00d..321f2358c12 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1035,13 +1035,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 45954b8003d..cbb73536289 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -274,7 +274,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4ac2d868322..c32a3cbcf66 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -510,7 +510,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -525,7 +526,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1394,6 +1396,10 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * This function sets `*insert_indexes` to true if it expects the caller to
+ * insert the relevant index tuples.  If `*insert_indexes` is set to false,
+ * then this function takes care of the index insertions itself.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot. For instance, source slot may by VirtualTupleTableSlot, but
  * the result is corresponding to table AM. On return the slot's tts_tid and
@@ -1402,10 +1408,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1463,10 +1470,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2157,7 +2165,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.3 (Apple Git-145)

0001-Allow-locking-updated-tuples-in-tuple_update-and--v2.patch (application/octet-stream)
From 0cf7ada441692400f62c353d9c329f7545e76ccc Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 23 Mar 2023 00:12:00 +0300
Subject: [PATCH 01/13] Allow locking updated tuples in tuple_update() and
 tuple_delete()

Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This patch eliminates step 2 by taking the lock during the first
tuple_update()/tuple_delete() call.  The heap table access method saves some
effort by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
---
 src/backend/access/heap/heapam.c         | 205 ++++++++++----
 src/backend/access/heap/heapam_handler.c |  94 +++++--
 src/backend/access/table/tableam.c       |  26 +-
 src/backend/commands/trigger.c           |  55 ++--
 src/backend/executor/execReplication.c   |  19 +-
 src/backend/executor/nodeModifyTable.c   | 329 +++++++++--------------
 src/include/access/heapam.h              |  19 +-
 src/include/access/tableam.h             |  69 +++--
 src/include/commands/trigger.h           |   4 +-
 9 files changed, 474 insertions(+), 346 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625f..f6478f89e77 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2499,10 +2499,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 }
 
 /*
- *	heap_delete - delete a tuple
+ *	heap_delete - delete a tuple, optionally fetching it into a slot
  *
  * See table_tuple_delete() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, we don't
+ * place a lock on the tuple in this function, just fetch the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2511,8 +2512,9 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, ItemPointer tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart)
+			CommandId cid, Snapshot crosscheck, int options,
+			TM_FailureData *tmfd, bool changingPart,
+			TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -2590,7 +2592,7 @@ l1:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to delete invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -2731,7 +2733,30 @@ l1:
 			tmfd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = tp;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
 		if (vmbuffer != InvalidBuffer)
@@ -2905,8 +2930,24 @@ l1:
 	 */
 	CacheInvalidateHeapTuple(relation, &tp, NULL);
 
-	/* Now we can release the buffer */
-	ReleaseBuffer(buffer);
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = tp;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
 
 	/*
 	 * Release the lmgr tuple lock, if we had it.
@@ -2938,8 +2979,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, false /* changingPart */ );
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, false /* changingPart */ , NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -2966,10 +3007,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 }
 
 /*
- *	heap_update - replace a tuple
+ *	heap_update - replace a tuple, optionally fetching it into a slot
  *
  * See table_tuple_update() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, we don't
+ * place a lock on the tuple in this function, just fetch the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2978,9 +3020,9 @@ simple_heap_delete(Relation relation, ItemPointer tid)
  */
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			CommandId cid, Snapshot crosscheck, int options,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
-			TU_UpdateIndexes *update_indexes)
+			TU_UpdateIndexes *update_indexes, TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3157,7 +3199,7 @@ l2:
 	result = HeapTupleSatisfiesUpdate(&oldtup, cid, buffer);
 
 	/* see below about the "no wait" case */
-	Assert(result != TM_BeingModified || wait);
+	Assert(result != TM_BeingModified || (options & TABLE_MODIFY_WAIT));
 
 	if (result == TM_Invisible)
 	{
@@ -3166,7 +3208,7 @@ l2:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to update invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -3370,7 +3412,30 @@ l2:
 			tmfd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = oldtup;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
 		if (vmbuffer != InvalidBuffer)
@@ -3849,7 +3914,26 @@ l2:
 	/* Now we can release the buffer(s) */
 	if (newbuf != buffer)
 		ReleaseBuffer(newbuf);
-	ReleaseBuffer(buffer);
+
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = oldtup;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
+
 	if (BufferIsValid(vmbuffer_new))
 		ReleaseBuffer(vmbuffer_new);
 	if (BufferIsValid(vmbuffer))
@@ -4057,8 +4141,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
 
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, &lockmode, update_indexes);
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, &lockmode, update_indexes, NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -4121,12 +4205,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  *		tuples.
  *
  * Output parameters:
- *	*tuple: all fields filled in
- *	*buffer: set to buffer holding tuple (pinned but not locked at exit)
+ *	*slot: BufferHeapTupleTableSlot filled with tuple
  *	*tmfd: filled in failure cases (see below)
  *
  * Function results are the same as the ones for table_tuple_lock().
  *
+ * If *slot already contains the target tuple, it takes advantage of that by
+ * skipping the ReadBuffer() call.
+ *
  * In the failure cases other than TM_Invisible, the routine fills
  * *tmfd with the tuple's t_ctid, t_xmax (resolving a possible MultiXact,
  * if necessary), and t_cmax (the last only for TM_SelfModified,
@@ -4137,15 +4223,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  * See README.tuplock for a thorough explanation of this mechanism.
  */
 TM_Result
-heap_lock_tuple(Relation relation, HeapTuple tuple,
+heap_lock_tuple(Relation relation, ItemPointer tid, TupleTableSlot *slot,
 				CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-				bool follow_updates,
-				Buffer *buffer, TM_FailureData *tmfd)
+				bool follow_updates, TM_FailureData *tmfd)
 {
 	TM_Result	result;
-	ItemPointer tid = &(tuple->t_self);
 	ItemId		lp;
 	Page		page;
+	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	BlockNumber block;
 	TransactionId xid,
@@ -4157,8 +4242,24 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	bool		skip_tuple_lock = false;
 	bool		have_tuple_lock = false;
 	bool		cleared_all_frozen = false;
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	HeapTuple	tuple = &bslot->base.tupdata;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	/* Take advantage if slot already contains the relevant tuple  */
+	if (!TTS_EMPTY(slot) &&
+		slot->tts_tableOid == relation->rd_id &&
+		ItemPointerCompare(&slot->tts_tid, tid) == 0 &&
+		BufferIsValid(bslot->buffer))
+	{
+		buffer = bslot->buffer;
+		IncrBufferRefCount(buffer);
+	}
+	else
+	{
+		buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	}
 	block = ItemPointerGetBlockNumber(tid);
 
 	/*
@@ -4167,21 +4268,22 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	 * in the middle of changing this, so we'll need to recheck after we have
 	 * the lock.
 	 */
-	if (PageIsAllVisible(BufferGetPage(*buffer)))
+	if (PageIsAllVisible(BufferGetPage(buffer)))
 		visibilitymap_pin(relation, block, &vmbuffer);
 
-	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	page = BufferGetPage(*buffer);
+	page = BufferGetPage(buffer);
 	lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
 	Assert(ItemIdIsNormal(lp));
 
+	tuple->t_self = *tid;
 	tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
 	tuple->t_len = ItemIdGetLength(lp);
 	tuple->t_tableOid = RelationGetRelid(relation);
 
 l3:
-	result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
+	result = HeapTupleSatisfiesUpdate(tuple, cid, buffer);
 
 	if (result == TM_Invisible)
 	{
@@ -4210,7 +4312,7 @@ l3:
 		infomask2 = tuple->t_data->t_infomask2;
 		ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
 
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If any subtransaction of the current top transaction already holds
@@ -4362,12 +4464,12 @@ l3:
 					{
 						result = res;
 						/* recovery code expects to have buffer lock held */
-						LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+						LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 						goto failed;
 					}
 				}
 
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4402,7 +4504,7 @@ l3:
 			if (HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
 				!HEAP_XMAX_IS_EXCL_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4430,7 +4532,7 @@ l3:
 					 * No conflict, but if the xmax changed under us in the
 					 * meantime, start over.
 					 */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 						!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 											 xwait))
@@ -4442,7 +4544,7 @@ l3:
 			}
 			else if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/* if the xmax changed in the meantime, start over */
 				if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
@@ -4470,7 +4572,7 @@ l3:
 			TransactionIdIsCurrentTransactionId(xwait))
 		{
 			/* ... but if the xmax changed in the meantime, start over */
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 				!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 									 xwait))
@@ -4492,7 +4594,7 @@ l3:
 		 */
 		if (require_sleep && (result == TM_Updated || result == TM_Deleted))
 		{
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			goto failed;
 		}
 		else if (require_sleep)
@@ -4517,7 +4619,7 @@ l3:
 				 */
 				result = TM_WouldBlock;
 				/* recovery code expects to have buffer lock held */
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 				goto failed;
 			}
 
@@ -4543,7 +4645,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4583,7 +4685,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4609,12 +4711,12 @@ l3:
 				{
 					result = res;
 					/* recovery code expects to have buffer lock held */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					goto failed;
 				}
 			}
 
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 			/*
 			 * xwait is done, but if xwait had just locked the tuple then some
@@ -4636,7 +4738,7 @@ l3:
 				 * don't check for this in the multixact case, because some
 				 * locker transactions might still be running.
 				 */
-				UpdateXmaxHintBits(tuple->t_data, *buffer, xwait);
+				UpdateXmaxHintBits(tuple->t_data, buffer, xwait);
 			}
 		}
 
@@ -4695,9 +4797,9 @@ failed:
 	 */
 	if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
 	{
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 		visibilitymap_pin(relation, block, &vmbuffer);
-		LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		goto l3;
 	}
 
@@ -4760,7 +4862,7 @@ failed:
 		cleared_all_frozen = true;
 
 
-	MarkBufferDirty(*buffer);
+	MarkBufferDirty(buffer);
 
 	/*
 	 * XLOG stuff.  You might think that we don't need an XLOG record because
@@ -4780,7 +4882,7 @@ failed:
 		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
-		XLogRegisterBuffer(0, *buffer, REGBUF_STANDARD);
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
 
 		xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
 		xlrec.xmax = xid;
@@ -4801,7 +4903,7 @@ failed:
 	result = TM_Ok;
 
 out_locked:
-	LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 out_unlocked:
 	if (BufferIsValid(vmbuffer))
@@ -4819,6 +4921,9 @@ out_unlocked:
 	if (have_tuple_lock)
 		UnlockTupleTuplock(relation, tid, mode);
 
+	/* Put the target tuple into the slot */
+	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
+
 	return result;
 }
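
The revised heap_lock_tuple() above can be called as in the following sketch
(illustrative only, not part of the patch; "rel", "tid" and "slot" are assumed
to be supplied by the caller, and only the successful path is shown):

#include "access/heapam.h"
#include "access/xact.h"
#include "executor/tuptable.h"

static TM_Result
lock_row_sketch(Relation rel, ItemPointer tid, TupleTableSlot *slot)
{
	TM_FailureData tmfd;

	/*
	 * The slot must be a buffer-heap slot.  If it already holds the tuple
	 * at "tid", heap_lock_tuple() reuses the pinned buffer instead of
	 * calling ReadBuffer() again; on return the slot holds the locked
	 * tuple version.
	 */
	Assert(TTS_IS_BUFFERTUPLE(slot));

	return heap_lock_tuple(rel, tid, slot,
						   GetCurrentCommandId(true),
						   LockTupleExclusive, LockWaitBlock,
						   true /* follow_updates */ ,
						   &tmfd);
}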
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..7c7204a2422 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -45,6 +45,12 @@
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
+static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+								   Snapshot snapshot, TupleTableSlot *slot,
+								   CommandId cid, LockTupleMode mode,
+								   LockWaitPolicy wait_policy, uint8 flags,
+								   TM_FailureData *tmfd);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -298,23 +304,55 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
-					Snapshot snapshot, Snapshot crosscheck, bool wait,
-					TM_FailureData *tmfd, bool changingPart)
+					Snapshot snapshot, Snapshot crosscheck, int options,
+					TM_FailureData *tmfd, bool changingPart,
+					TupleTableSlot *oldSlot)
 {
+	TM_Result	result;
+
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+	result = heap_delete(relation, tid, cid, crosscheck, options,
+						 tmfd, changingPart, oldSlot);
+
+	/*
+	 * If the tuple has been concurrently updated, get the lock on it (but
+	 * only if the caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option).  With the lock held, a retry of the
+	 * delete should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of tuple loaded into
+		 * oldSlot by heap_delete().
+		 */
+		result = heapam_tuple_lock(relation, tid, snapshot,
+								   oldSlot, cid, LockTupleExclusive,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
+	return result;
 }
 
 
 static TM_Result
 heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-					bool wait, TM_FailureData *tmfd,
-					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes)
+					int options, TM_FailureData *tmfd,
+					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
+					TupleTableSlot *oldSlot)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -324,8 +362,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode, update_indexes);
+	result = heap_update(relation, otid, tuple, cid, crosscheck, options,
+						 tmfd, lockmode, update_indexes, oldSlot);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
@@ -352,6 +390,31 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	if (shouldFree)
 		pfree(tuple);
 
+	/*
+	 * If the tuple has been concurrently updated, get the lock on it (but
+	 * only if the caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option).  With the lock held, a retry of the
+	 * update should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of tuple loaded into
+		 * oldSlot by heap_update().
+		 */
+		result = heapam_tuple_lock(relation, otid, snapshot,
+								   oldSlot, cid, *lockmode,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
 	return result;
 }
 
@@ -363,7 +426,6 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
-	Buffer		buffer;
 	HeapTuple	tuple = &bslot->base.tupdata;
 	bool		follow_updates;
 
@@ -373,9 +435,8 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 tuple_lock_retry:
-	tuple->t_self = *tid;
-	result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
-							 follow_updates, &buffer, tmfd);
+	result = heap_lock_tuple(relation, tid, slot, cid, mode, wait_policy,
+							 follow_updates, tmfd);
 
 	if (result == TM_Updated &&
 		(flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
@@ -383,8 +444,6 @@ tuple_lock_retry:
 		/* Should not encounter speculative tuple on recheck */
 		Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
 
-		ReleaseBuffer(buffer);
-
 		if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
 		{
 			SnapshotData SnapshotDirty;
@@ -406,6 +465,8 @@ tuple_lock_retry:
 			InitDirtySnapshot(SnapshotDirty);
 			for (;;)
 			{
+				Buffer		buffer = InvalidBuffer;
+
 				if (ItemPointerIndicatesMovedPartitions(tid))
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
@@ -500,7 +561,7 @@ tuple_lock_retry:
 					/*
 					 * This is a live tuple, so try to lock it again.
 					 */
-					ReleaseBuffer(buffer);
+					ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
 					goto tuple_lock_retry;
 				}
 
@@ -511,7 +572,7 @@ tuple_lock_retry:
 				 */
 				if (tuple->t_data == NULL)
 				{
-					Assert(!BufferIsValid(buffer));
+					ReleaseBuffer(buffer);
 					return TM_Deleted;
 				}
 
@@ -564,9 +625,6 @@ tuple_lock_retry:
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	/* store in slot, transferring existing pin */
-	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
-
 	return result;
 }
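
A table AM other than heap can honor the same option bits in its own
tuple_delete callback.  The following is a hypothetical sketch only:
"myam_delete_internal" and "myam_lock_internal" do not exist anywhere and
merely stand in for whatever AM-specific routines such an implementation
would have.

static TM_Result
myam_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
				  Snapshot snapshot, Snapshot crosscheck, int options,
				  TM_FailureData *tmfd, bool changingPart,
				  TupleTableSlot *oldSlot)
{
	TM_Result	result;

	/* AM-specific delete; fetch the old tuple only if the caller asked */
	result = myam_delete_internal(rel, tid, cid, crosscheck,
								  (options & TABLE_MODIFY_WAIT) != 0,
								  tmfd, changingPart,
								  (options & TABLE_MODIFY_FETCH_OLD_TUPLE) ?
								  oldSlot : NULL);

	/* On concurrent update, lock the latest version if requested */
	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
	{
		result = myam_lock_internal(rel, tid, snapshot, oldSlot, cid,
									LockTupleExclusive,
									(options & TABLE_MODIFY_WAIT) ?
									LockWaitBlock : LockWaitSkip,
									TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
									tmfd);
		if (result == TM_Ok)
			return TM_Updated;	/* report that the row was moved */
	}

	return result;
}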
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e57a0b7ea31..8d3675be959 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -287,16 +287,23 @@ simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
  * via ereport().
  */
 void
-simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot)
+simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch the old tuple if a slot to receive it is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_delete(rel, tid,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, false /* changingPart */ );
+								options,
+								&tmfd, false /* changingPart */ ,
+								oldSlot);
 
 	switch (result)
 	{
@@ -335,17 +342,24 @@ void
 simple_table_tuple_update(Relation rel, ItemPointer otid,
 						  TupleTableSlot *slot,
 						  Snapshot snapshot,
-						  TU_UpdateIndexes *update_indexes)
+						  TU_UpdateIndexes *update_indexes,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch the old tuple if a slot to receive it is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_update(rel, otid, slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, &lockmode, update_indexes);
+								options,
+								&tmfd, &lockmode, update_indexes,
+								oldSlot);
 
 	switch (result)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 35eb7180f7e..3309b4ebd2d 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2773,8 +2773,8 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 void
 ExecARDeleteTriggers(EState *estate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *slot,
 					 TransitionCaptureState *transition_capture,
 					 bool is_crosspart_update)
 {
@@ -2783,20 +2783,11 @@ ExecARDeleteTriggers(EState *estate,
 	if ((trigdesc && trigdesc->trig_delete_after_row) ||
 		(transition_capture && transition_capture->tcs_delete_old_table))
 	{
-		TupleTableSlot *slot = ExecGetTriggerOldSlot(estate, relinfo);
-
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
-			GetTupleForTrigger(estate,
-							   NULL,
-							   relinfo,
-							   tupleid,
-							   LockTupleExclusive,
-							   slot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else
+		/*
+		 * Put the FDW old tuple into the slot.  Otherwise, the caller is
+		 * expected to have already fetched the old tuple into the slot.
+		 */
+		if (fdw_trigtuple != NULL)
 			ExecForceStoreHeapTuple(fdw_trigtuple, slot, false);
 
 		AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
@@ -3087,18 +3078,17 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
  * Note: 'src_partinfo' and 'dst_partinfo', when non-NULL, refer to the source
  * and destination partitions, respectively, of a cross-partition update of
  * the root partitioned table mentioned in the query, given by 'relinfo'.
- * 'tupleid' in that case refers to the ctid of the "old" tuple in the source
- * partition, and 'newslot' contains the "new" tuple in the destination
- * partition.  This interface allows to support the requirements of
- * ExecCrossPartitionUpdateForeignKey(); is_crosspart_update must be true in
- * that case.
+ * 'oldslot' contains the "old" tuple in the source partition, and 'newslot'
+ * contains the "new" tuple in the destination partition.  This interface
+ * allows supporting the requirements of ExecCrossPartitionUpdateForeignKey();
+ * is_crosspart_update must be true in that case.
  */
 void
 ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 					 ResultRelInfo *src_partinfo,
 					 ResultRelInfo *dst_partinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *oldslot,
 					 TupleTableSlot *newslot,
 					 List *recheckIndexes,
 					 TransitionCaptureState *transition_capture,
@@ -3117,29 +3107,14 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 		 * separately for DELETE and INSERT to capture transition table rows.
 		 * In such case, either old tuple or new tuple can be NULL.
 		 */
-		TupleTableSlot *oldslot;
-		ResultRelInfo *tupsrc;
-
 		Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
 			   !is_crosspart_update);
 
-		tupsrc = src_partinfo ? src_partinfo : relinfo;
-		oldslot = ExecGetTriggerOldSlot(estate, tupsrc);
-
-		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
-			GetTupleForTrigger(estate,
-							   NULL,
-							   tupsrc,
-							   tupleid,
-							   LockTupleExclusive,
-							   oldslot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else if (fdw_trigtuple != NULL)
+		if (fdw_trigtuple != NULL)
+		{
+			Assert(oldslot);
 			ExecForceStoreHeapTuple(fdw_trigtuple, oldslot, false);
-		else
-			ExecClearTuple(oldslot);
+		}
 
 		AfterTriggerSaveEvent(estate, relinfo,
 							  src_partinfo, dst_partinfo,
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index d0a89cd5778..0cad843fb69 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -577,6 +577,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 	{
 		List	   *recheckIndexes = NIL;
 		TU_UpdateIndexes update_indexes;
+		TupleTableSlot *oldSlot = NULL;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -590,8 +591,12 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		if (rel->rd_rel->relispartition)
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		simple_table_tuple_update(rel, tid, slot, estate->es_snapshot,
-								  &update_indexes);
+								  &update_indexes, oldSlot);
 
 		if (resultRelInfo->ri_NumIndices > 0 && (update_indexes != TU_None))
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
@@ -602,7 +607,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		/* AFTER ROW UPDATE Triggers */
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tid, NULL, slot,
+							 NULL, oldSlot, slot,
 							 recheckIndexes, NULL, false);
 
 		list_free(recheckIndexes);
@@ -636,12 +641,18 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 
 	if (!skip_tuple)
 	{
+		TupleTableSlot *oldSlot = NULL;
+
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_delete_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		/* OK, delete the tuple */
-		simple_table_tuple_delete(rel, tid, estate->es_snapshot);
+		simple_table_tuple_delete(rel, tid, estate->es_snapshot, oldSlot);
 
 		/* AFTER ROW DELETE Triggers */
 		ExecARDeleteTriggers(estate, resultRelInfo,
-							 tid, NULL, NULL, false);
+							 NULL, oldSlot, NULL, false);
 	}
 }
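
A caller of the revised simple_table_tuple_delete() only pays for fetching
the old row when it passes a slot, as in this sketch (assumptions: "rel",
"tid" and "snapshot" come from the caller; "need_old_tuple" is a stand-in
condition such as "an AFTER ROW trigger is present"):

static void
delete_with_old_row_sketch(Relation rel, ItemPointer tid, Snapshot snapshot,
						   bool need_old_tuple)
{
	TupleTableSlot *oldSlot = NULL;

	if (need_old_tuple)
		oldSlot = table_slot_create(rel, NULL);

	/* With a NULL slot the old row is simply not fetched */
	simple_table_tuple_delete(rel, tid, snapshot, oldSlot);

	if (oldSlot != NULL)
	{
		/* oldSlot now holds the deleted row version */
		ExecDropSingleTupleTableSlot(oldSlot);
	}
}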
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4abfe82f7fb..9deeaceb35c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -125,7 +125,7 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
 											   ItemPointer tupleid,
-											   TupleTableSlot *oldslot,
+											   TupleTableSlot *oldSlot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
@@ -565,6 +565,10 @@ ExecInitInsertProjection(ModifyTableState *mtstate,
 	resultRelInfo->ri_newTupleSlot =
 		table_slot_create(resultRelInfo->ri_RelationDesc,
 						  &estate->es_tupleTable);
+	if (node->onConflictAction == ONCONFLICT_UPDATE)
+		resultRelInfo->ri_oldTupleSlot =
+			table_slot_create(resultRelInfo->ri_RelationDesc,
+							  &estate->es_tupleTable);
 
 	/* Build ProjectionInfo if needed (it probably isn't). */
 	if (need_projection)
@@ -1154,7 +1158,7 @@ ExecInsert(ModifyTableContext *context,
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
 							 NULL,
-							 NULL,
+							 resultRelInfo->ri_oldTupleSlot,
 							 slot,
 							 NULL,
 							 mtstate->mt_transition_capture,
@@ -1334,7 +1338,8 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart)
+			  ItemPointer tupleid, bool changingPart, int options,
+			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
 
@@ -1342,9 +1347,10 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 							  estate->es_output_cid,
 							  estate->es_snapshot,
 							  estate->es_crosscheck_snapshot,
-							  true /* wait for commit */ ,
+							  options,
 							  &context->tmfd,
-							  changingPart);
+							  changingPart,
+							  oldSlot);
 }
 
 /*
@@ -1356,7 +1362,8 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, bool changingPart)
+				   ItemPointer tupleid, HeapTuple oldtuple,
+				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	EState	   *estate = context->estate;
@@ -1374,8 +1381,8 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	{
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tupleid, oldtuple,
-							 NULL, NULL, mtstate->mt_transition_capture,
+							 oldtuple,
+							 slot, NULL, NULL, mtstate->mt_transition_capture,
 							 false);
 
 		/*
@@ -1386,10 +1393,30 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 
 	/* AFTER ROW DELETE Triggers */
-	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
+	ExecARDeleteTriggers(estate, resultRelInfo, oldtuple, slot,
 						 ar_delete_trig_tcs, changingPart);
 }
 
+/*
+ * Initializes the tuple slot in a ResultRelInfo for the DELETE action.
+ *
+ * We mark 'projectNewInfoValid' even though the projections themselves
+ * are not initialized here.
+ */
+static void
+ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
+						ResultRelInfo *resultRelInfo)
+{
+	EState	   *estate = mtstate->ps.state;
+
+	Assert(!resultRelInfo->ri_projectNewInfoValid);
+
+	resultRelInfo->ri_oldTupleSlot =
+		table_slot_create(resultRelInfo->ri_RelationDesc,
+						  &estate->es_tupleTable);
+	resultRelInfo->ri_projectNewInfoValid = true;
+}
+
 /* ----------------------------------------------------------------
  *		ExecDelete
  *
@@ -1417,6 +1444,7 @@ ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
+		   TupleTableSlot *oldSlot,
 		   bool processReturning,
 		   bool changingPart,
 		   bool canSetTag,
@@ -1480,6 +1508,11 @@ ExecDelete(ModifyTableContext *context,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		if (!IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * delete the tuple
 		 *
@@ -1490,7 +1523,8 @@ ExecDelete(ModifyTableContext *context,
 		 * transaction-snapshot mode transactions.
 		 */
 ldelete:
-		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart);
+		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
+							   options, oldSlot);
 
 		if (tmresult)
 			*tmresult = result;
@@ -1537,7 +1571,6 @@ ldelete:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
 
 					if (IsolationUsesXactSnapshot())
@@ -1546,87 +1579,29 @@ ldelete:
 								 errmsg("could not serialize access due to concurrent update")));
 
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ.  The latest tuple has already been
+					 * found and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					EvalPlanQualBegin(context->epqstate);
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  LockTupleExclusive, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldSlot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
 
-					switch (result)
+					/*
+					 * If requested, skip delete and pass back the updated
+					 * row.
+					 */
+					if (epqreturnslot)
 					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/*
-							 * If requested, skip delete and pass back the
-							 * updated row.
-							 */
-							if (epqreturnslot)
-							{
-								*epqreturnslot = epqslot;
-								return NULL;
-							}
-							else
-								goto ldelete;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously updated by this
-							 * command, ignore the delete, otherwise error
-							 * out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_delete() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be deleted was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						default:
-
-							/*
-							 * TM_Invisible should be impossible because we're
-							 * waiting for updated row versions, and would
-							 * already have errored out if the first version
-							 * is invisible.
-							 *
-							 * TM_Updated should be impossible, because we're
-							 * locking the latest version via
-							 * TUPLE_LOCK_FLAG_FIND_LAST_VERSION.
-							 */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
+						*epqreturnslot = epqslot;
+						return NULL;
 					}
-
-					Assert(false);
-					break;
+					else
+						goto ldelete;
 				}
 
 			case TM_Deleted:
@@ -1660,7 +1635,8 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple, changingPart);
+	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+					   oldSlot, changingPart);
 
 	/* Process RETURNING if present and if requested */
 	if (processReturning && resultRelInfo->ri_projectReturning)
@@ -1678,17 +1654,13 @@ ldelete:
 		}
 		else
 		{
+			/* Copy the old tuple into the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
 			if (oldtuple != NULL)
-			{
 				ExecForceStoreHeapTuple(oldtuple, slot, false);
-			}
 			else
-			{
-				if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
-												   SnapshotAny, slot))
-					elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
-			}
+				ExecCopySlot(slot, oldSlot);
+			Assert(!TupIsNull(slot));
 		}
 
 		rslot = ExecProcessReturning(resultRelInfo, slot, context->planSlot);
@@ -1788,12 +1760,16 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
+	/* Make sure ri_oldTupleSlot is initialized. */
+	if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+		ExecInitUpdateProjection(mtstate, resultRelInfo);
+
 	/*
 	 * Row movement, part 1.  Delete the tuple, but skip RETURNING processing.
 	 * We want to return rows from INSERT.
 	 */
 	ExecDelete(context, resultRelInfo,
-			   tupleid, oldtuple,
+			   tupleid, oldtuple, resultRelInfo->ri_oldTupleSlot,
 			   false,			/* processReturning */
 			   true,			/* changingPart */
 			   false,			/* canSetTag */
@@ -1834,21 +1810,13 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 			return true;
 		else
 		{
-			/* Fetch the most recent version of old tuple. */
-			TupleTableSlot *oldSlot;
-
-			/* ... but first, make sure ri_oldTupleSlot is initialized. */
-			if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-				ExecInitUpdateProjection(mtstate, resultRelInfo);
-			oldSlot = resultRelInfo->ri_oldTupleSlot;
-			if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
-											   tupleid,
-											   SnapshotAny,
-											   oldSlot))
-				elog(ERROR, "failed to fetch tuple being updated");
-			/* and project the new tuple to retry the UPDATE with */
+			/*
+			 * ExecDelete already fetched the most recent version of the old
+			 * tuple into resultRelInfo->ri_oldTupleSlot.  So, just project
+			 * the new tuple to retry the UPDATE with.
+			 */
 			*retry_slot = ExecGetUpdateNewTuple(resultRelInfo, epqslot,
-												oldSlot);
+												resultRelInfo->ri_oldTupleSlot);
 			return false;
 		}
 	}
@@ -1967,7 +1935,8 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-			  bool canSetTag, UpdateContext *updateCxt)
+			  bool canSetTag, int options, TupleTableSlot *oldSlot,
+			  UpdateContext *updateCxt)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2059,7 +2028,8 @@ lreplace:
 				ExecCrossPartitionUpdateForeignKey(context,
 												   resultRelInfo,
 												   insert_destrel,
-												   tupleid, slot,
+												   tupleid,
+												   resultRelInfo->ri_oldTupleSlot,
 												   inserted_tuple);
 
 			return TM_Ok;
@@ -2102,9 +2072,10 @@ lreplace:
 								estate->es_output_cid,
 								estate->es_snapshot,
 								estate->es_crosscheck_snapshot,
-								true /* wait for commit */ ,
+								options,
 								&context->tmfd, &updateCxt->lockmode,
-								&updateCxt->updateIndexes);
+								&updateCxt->updateIndexes,
+								oldSlot);
 
 	return result;
 }
@@ -2118,7 +2089,8 @@ lreplace:
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
-				   HeapTuple oldtuple, TupleTableSlot *slot)
+				   HeapTuple oldtuple, TupleTableSlot *slot,
+				   TupleTableSlot *oldSlot)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	List	   *recheckIndexes = NIL;
@@ -2134,7 +2106,7 @@ ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 	/* AFTER ROW UPDATE Triggers */
 	ExecARUpdateTriggers(context->estate, resultRelInfo,
 						 NULL, NULL,
-						 tupleid, oldtuple, slot,
+						 oldtuple, oldSlot, slot,
 						 recheckIndexes,
 						 mtstate->operation == CMD_INSERT ?
 						 mtstate->mt_oc_transition_capture :
@@ -2223,7 +2195,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 	/* Perform the root table's triggers. */
 	ExecARUpdateTriggers(context->estate,
 						 rootRelInfo, sourcePartInfo, destPartInfo,
-						 tupleid, NULL, newslot, NIL, NULL, true);
+						 NULL, oldslot, newslot, NIL, NULL, true);
 }
 
 /* ----------------------------------------------------------------
@@ -2246,6 +2218,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  *		no relevant triggers.
  *
  *		slot contains the new tuple value to be stored.
+ *		oldSlot is the slot to store the old tuple.
  *		planSlot is the output of the ModifyTable's subplan; we use it
  *		to access values from other input tables (for RETURNING),
  *		row-ID junk columns, etc.
@@ -2256,7 +2229,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-		   bool canSetTag)
+		   TupleTableSlot *oldSlot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2309,6 +2282,11 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		if (!locked && !IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
 		 * must loop back here to try again.  (We don't need to redo triggers,
@@ -2318,7 +2296,7 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		 */
 redo_act:
 		result = ExecUpdateAct(context, resultRelInfo, tupleid, oldtuple, slot,
-							   canSetTag, &updateCxt);
+							   canSetTag, options, oldSlot, &updateCxt);
 
 		/*
 		 * If ExecUpdateAct reports that a cross-partition update was done,
@@ -2369,88 +2347,30 @@ redo_act:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
-					TupleTableSlot *oldSlot;
 
 					if (IsolationUsesXactSnapshot())
 						ereport(ERROR,
 								(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 								 errmsg("could not serialize access due to concurrent update")));
+					Assert(!locked);
 
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ.  The latest tuple has already been
+					 * found and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  updateCxt.lockmode, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
-
-					switch (result)
-					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/* Make sure ri_oldTupleSlot is initialized. */
-							if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-								ExecInitUpdateProjection(context->mtstate,
-														 resultRelInfo);
-
-							/* Fetch the most recent version of old tuple. */
-							oldSlot = resultRelInfo->ri_oldTupleSlot;
-							if (!table_tuple_fetch_row_version(resultRelationDesc,
-															   tupleid,
-															   SnapshotAny,
-															   oldSlot))
-								elog(ERROR, "failed to fetch tuple being updated");
-							slot = ExecGetUpdateNewTuple(resultRelInfo,
-														 epqslot, oldSlot);
-							goto redo_act;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously modified by
-							 * this command, ignore the redundant update,
-							 * otherwise error out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_update() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						default:
-							/* see table_tuple_lock call in ExecDelete() */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
-					}
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldSlot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
+					slot = ExecGetUpdateNewTuple(resultRelInfo,
+												 epqslot,
+												 oldSlot);
+					goto redo_act;
 				}
 
 				break;
@@ -2474,7 +2394,7 @@ redo_act:
 		(estate->es_processed)++;
 
 	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
-					   slot);
+					   slot, oldSlot);
 
 	/* Process RETURNING if present */
 	if (resultRelInfo->ri_projectReturning)
@@ -2692,7 +2612,8 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	*returning = ExecUpdate(context, resultRelInfo,
 							conflictTid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
-							canSetTag);
+							existing,
+							canSetTag, true);
 
 	/*
 	 * Clear out existing tuple, as there might not be another conflict among
@@ -2934,6 +2855,7 @@ lmerge_matched:
 				{
 					result = ExecUpdateAct(context, resultRelInfo, tupleid,
 										   NULL, newslot, canSetTag,
+										   TABLE_MODIFY_WAIT, NULL,
 										   &updateCxt);
 
 					/*
@@ -2956,7 +2878,8 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot);
+									   tupleid, NULL, newslot,
+									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
 				break;
@@ -2987,12 +2910,12 @@ lmerge_matched:
 				}
 				else
 					result = ExecDeleteAct(context, resultRelInfo, tupleid,
-										   false);
+										   false, TABLE_MODIFY_WAIT, NULL);
 
 				if (result == TM_Ok)
 				{
 					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
-									   false);
+									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
 				break;
@@ -4006,12 +3929,18 @@ ExecModifyTable(PlanState *pstate)
 
 				/* Now apply the update. */
 				slot = ExecUpdate(&context, resultRelInfo, tupleid, oldtuple,
-								  slot, node->canSetTag);
+								  slot, resultRelInfo->ri_oldTupleSlot,
+								  node->canSetTag, false);
 				break;
 
 			case CMD_DELETE:
+				/* Initialize slot for DELETE to fetch the old tuple */
+				if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+					ExecInitDeleteTupleSlot(node, resultRelInfo);
+
 				slot = ExecDelete(&context, resultRelInfo, tupleid, oldtuple,
-								  true, false, node->canSetTag, NULL, NULL, NULL);
+								  resultRelInfo->ri_oldTupleSlot, true, false,
+								  node->canSetTag, NULL, NULL, NULL);
 				break;
 
 			case CMD_MERGE:
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..45954b8003d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -276,19 +276,22 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
-							 struct TM_FailureData *tmfd, bool changingPart);
+							 CommandId cid, Snapshot crosscheck, int options,
+							 struct TM_FailureData *tmfd, bool changingPart,
+							 TupleTableSlot *oldSlot);
 extern void heap_finish_speculative(Relation relation, ItemPointer tid);
 extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
 							 HeapTuple newtup,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes);
-extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
-								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-								 bool follow_updates,
-								 Buffer *buffer, struct TM_FailureData *tmfd);
+							 TU_UpdateIndexes *update_indexes,
+							 TupleTableSlot *oldSlot);
+extern TM_Result heap_lock_tuple(Relation relation, ItemPointer tid,
+								 TupleTableSlot *slot,
+								 CommandId cid, LockTupleMode mode,
+								 LockWaitPolicy wait_policy, bool follow_updates,
+								 struct TM_FailureData *tmfd);
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8249b37bbf1..467bdc09d36 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -259,6 +259,11 @@ typedef struct TM_IndexDeleteOp
 /* Follow update chain and lock latest version of tuple */
 #define TUPLE_LOCK_FLAG_FIND_LAST_VERSION		(1 << 1)
 
+/* "options" flag bits for table_tuple_update and table_tuple_delete */
+#define TABLE_MODIFY_WAIT			0x0001
+#define TABLE_MODIFY_FETCH_OLD_TUPLE 0x0002
+#define TABLE_MODIFY_LOCK_UPDATED	0x0004
+
 
 /* Typedef for callback function for table_index_build_scan */
 typedef void (*IndexBuildCallback) (Relation index,
@@ -528,9 +533,10 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
-								 bool changingPart);
+								 bool changingPart,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
@@ -539,10 +545,11 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
 								 LockTupleMode *lockmode,
-								 TU_UpdateIndexes *update_indexes);
+								 TU_UpdateIndexes *update_indexes,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
@@ -1452,7 +1459,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
 }
 
 /*
- * Delete a tuple.
+ * Delete a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_delete instead.
@@ -1463,11 +1470,21 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple is
+ *		fetched into oldSlot when the deletion is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	changingPart - true iff the tuple is being moved to another partition
  *		table due to an update of the partition key. Otherwise, false.
+ *	oldSlot - slot to save the deleted or locked tuple.  Can be NULL if
+ *		neither TABLE_MODIFY_FETCH_OLD_TUPLE nor TABLE_MODIFY_LOCK_UPDATED
+ *		option is specified.
  *
  * Normal, successful return value is TM_Ok, which means we did actually
  * delete it.  Failure return codes are TM_SelfModified, TM_Updated, and
@@ -1479,16 +1496,18 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  */
 static inline TM_Result
 table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
-				   Snapshot snapshot, Snapshot crosscheck, bool wait,
-				   TM_FailureData *tmfd, bool changingPart)
+				   Snapshot snapshot, Snapshot crosscheck, int options,
+				   TM_FailureData *tmfd, bool changingPart,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_delete(rel, tid, cid,
 										 snapshot, crosscheck,
-										 wait, tmfd, changingPart);
+										 options, tmfd, changingPart,
+										 oldSlot);
 }
 
 /*
- * Update a tuple.
+ * Update a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless you are prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_update instead.
@@ -1500,13 +1519,23 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple is
+ *		fetched into oldSlot when the update is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	lockmode - filled with lock mode acquired on tuple
  *  update_indexes - in success cases this is set to true if new index entries
  *		are required for this tuple
- *
+ *	oldSlot - slot to save the old or locked tuple.  Can be NULL if
+ *		neither TABLE_MODIFY_FETCH_OLD_TUPLE nor TABLE_MODIFY_LOCK_UPDATED
+ *		option is specified.
+ *
  * Normal, successful return value is TM_Ok, which means we did actually
  * update it.  Failure return codes are TM_SelfModified, TM_Updated, and
  * TM_BeingModified (the last only possible if wait == false).
@@ -1524,13 +1553,15 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
 static inline TM_Result
 table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-				   bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
-				   TU_UpdateIndexes *update_indexes)
+				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
+				   TU_UpdateIndexes *update_indexes,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_update(rel, otid, slot,
 										 cid, snapshot, crosscheck,
-										 wait, tmfd,
-										 lockmode, update_indexes);
+										 options, tmfd,
+										 lockmode, update_indexes,
+										 oldSlot);
 }
 
 /*
@@ -2046,10 +2077,12 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 
 extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
-									  Snapshot snapshot);
+									  Snapshot snapshot,
+									  TupleTableSlot *oldSlot);
 extern void simple_table_tuple_update(Relation rel, ItemPointer otid,
 									  TupleTableSlot *slot, Snapshot snapshot,
-									  TU_UpdateIndexes *update_indexes);
+									  TU_UpdateIndexes *update_indexes,
+									  TupleTableSlot *oldSlot);
 
 
 /* ----------------------------------------------------------------------------
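
To tie the new flags together, here is a sketch (not part of the patch) of an
executor-style caller of table_tuple_delete(); "rel", "tid", "estate" and
"oldSlot" are assumed to be set up elsewhere, and handling of the individual
TM_* results is omitted:

static TM_Result
delete_row_sketch(Relation rel, ItemPointer tid, EState *estate,
				  TupleTableSlot *oldSlot)
{
	TM_FailureData tmfd;
	int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;

	/* In READ COMMITTED we may lock and follow a concurrently updated row */
	if (!IsolationUsesXactSnapshot())
		options |= TABLE_MODIFY_LOCK_UPDATED;

	/*
	 * On TM_Ok, oldSlot holds the deleted row.  On TM_Updated with
	 * TABLE_MODIFY_LOCK_UPDATED set, it holds the locked latest version,
	 * ready to be passed to EvalPlanQual().
	 */
	return table_tuple_delete(rel, tid,
							  estate->es_output_cid,
							  estate->es_snapshot,
							  estate->es_crosscheck_snapshot,
							  options, &tmfd,
							  false /* changingPart */ ,
							  oldSlot);
}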
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index 8a5a9fe6422..cb968d03ecd 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -216,8 +216,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
 								 TM_FailureData *tmfd);
 extern void ExecARDeleteTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *slot,
 								 TransitionCaptureState *transition_capture,
 								 bool is_crosspart_update);
 extern bool ExecIRDeleteTriggers(EState *estate,
@@ -240,8 +240,8 @@ extern void ExecARUpdateTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
 								 ResultRelInfo *src_partinfo,
 								 ResultRelInfo *dst_partinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *oldslot,
 								 TupleTableSlot *newslot,
 								 List *recheckIndexes,
 								 TransitionCaptureState *transition_capture,
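
As a usage note (sketch only; "estate", "resultRelInfo" and "oldSlot" are
assumed to exist), with the reshuffled arguments the AFTER ROW DELETE trigger
call now receives the already-fetched old row in a slot instead of a ctid:

	ExecARDeleteTriggers(estate, resultRelInfo,
						 NULL,		/* fdw_trigtuple */
						 oldSlot,	/* slot holding the old row */
						 NULL,		/* transition_capture */
						 false);	/* is_crosspart_update */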
-- 
2.39.3 (Apple Git-145)

0013-Introduce-RowID-bytea-tuple-identifier-v2.patchapplication/octet-stream; name=0013-Introduce-RowID-bytea-tuple-identifier-v2.patchDownload
From d5df668612eb67f5b6f3138c0a77e38710310999 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Jun 2023 04:26:30 +0300
Subject: [PATCH 13/13] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: the tuple identifier (tid)
and a whole-row copy.  The tuple identifier used for regular tables consists
of a 32-bit block number and a 16-bit offset.  This is limiting for some use
cases, in particular index-organized tables.  The whole-row copy is used to
identify tuples in FDWs.  That could be extended to regular tables, but it
seems like overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose the way its tuples are identified by providing the new
get_row_ref_type() API function.  The new system attribute
RowIdAttributeNumber holds the RowID when appropriate.  Table AM methods now
accept Datum arguments as tuple identifiers.  Those Datums can be either tid
or bytea depending on what table_get_row_ref_type() says.  The ModifyTable
node and triggers are aware of RowID.  IndexScan and BitmapScan nodes are not
aware of RowIDs and expect tids.  Table AMs that use RowIDs are supposed to
redefine those nodes using hooks.
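
From a table AM's perspective this looks roughly as follows (illustrative
sketch only; ROW_REF_ROWID is the assumed name of the non-TID variant from the
RowRefType patch, and myam_lookup_by_rowid() is a hypothetical AM-specific
routine):

static RowRefType
myam_get_row_ref_type(Relation rel)
{
	/* rows in this AM are identified by a variable-length bytea RowID */
	return ROW_REF_ROWID;
}

static void
fetch_by_row_ref_sketch(Relation rel, Datum tupleid, TupleTableSlot *slot)
{
	if (table_get_row_ref_type(rel) == ROW_REF_TID)
	{
		/* heap-style reference: decode the Datum as an ItemPointer */
		ItemPointer tid = DatumGetItemPointer(tupleid);

		(void) table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
											 SnapshotAny, slot);
	}
	else
	{
		/* RowID reference: decode the Datum as bytea, AM-specific lookup */
		bytea	   *rowid = DatumGetByteaP(tupleid);

		myam_lookup_by_rowid(rel, rowid, slot);
	}
}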
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |  11 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 145 ++++++++-----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 +-
 src/backend/parser/parse_relation.c      |   7 +-
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  56 +++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/primnodes.h            |   1 +
 src/include/utils/tuplestore.h           |   3 +
 23 files changed, 509 insertions(+), 153 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 1ef4cff88e8..82f18810935 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -983,7 +983,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 5c89fbbef83..7b52c66939c 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -755,6 +755,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7ebebf4d6ac..ac24691bd29 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -47,7 +47,7 @@
 #include "utils/rel.h"
 #include "utils/sampling.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -186,7 +186,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -195,7 +195,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -360,7 +360,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -407,7 +407,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -587,12 +587,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -615,7 +616,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -632,7 +633,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -640,6 +641,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -687,7 +689,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -703,7 +705,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -711,6 +713,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2645,6 +2648,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use TID row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -3224,6 +3236,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222cebc..caa79c6eddd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -300,7 +300,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -356,7 +356,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74a..8765becf986 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1626,7 +1626,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3309b4ebd2d..b2248bdfd87 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -76,7 +76,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2682,7 +2682,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2696,7 +2696,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2924,7 +2924,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2944,7 +2944,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3261,7 +3261,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3286,7 +3286,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3382,8 +3384,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3589,18 +3591,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* event size, stored in
+																		 * the high-order bits */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3652,6 +3660,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -4004,14 +4015,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4044,7 +4075,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4063,6 +4094,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4103,7 +4135,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4263,6 +4314,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4294,15 +4346,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4334,13 +4388,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4376,13 +4443,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4556,7 +4633,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4659,6 +4736,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6051,6 +6129,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6144,6 +6224,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6154,6 +6240,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6173,6 +6267,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6188,10 +6290,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6229,20 +6378,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6387,6 +6522,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6408,7 +6557,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index a25ab7570fe..2fa3a0a4e36 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4552,7 +4552,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7eb1f7d0209..7b3fc3038a4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -867,13 +867,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/* open relation, if we need to access it for this mark type */
 			switch (rc->markType)
@@ -906,6 +908,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			if (erm->markType == ROW_MARK_COPY)
+				erm->refType = ROW_REF_COPY;
+			else
+				erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1269,6 +1275,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2701,7 +2708,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index db685473fc0..aad266a19ff 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -250,7 +250,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -434,7 +435,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -571,7 +573,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -638,7 +641,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 41754ddfea9..2d3ad904a64 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 321f2358c12..fef31456e17 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -124,7 +124,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldSlot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -141,13 +141,13 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple oldtuple,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static TupleTableSlot *ExecMergeMatched(ModifyTableContext *context,
 										ResultRelInfo *resultRelInfo,
-										ItemPointer tupleid,
+										Datum tupleid,
 										HeapTuple oldtuple,
 										bool canSetTag,
 										bool *matched);
@@ -1216,7 +1216,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1247,7 +1247,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1271,7 +1271,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1351,7 +1351,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldSlot,
 		   bool processReturning,
@@ -1544,7 +1544,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldSlot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1561,7 +1561,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1610,7 +1610,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1766,7 +1766,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1843,7 +1843,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -1997,7 +1997,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldSlot)
 {
@@ -2047,7 +2047,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2137,7 +2137,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldSlot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2191,10 +2191,14 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2302,7 +2306,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldSlot);
 
 	/* Process RETURNING if present */
@@ -2334,7 +2338,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2390,7 +2406,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2409,7 +2425,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag)
+		  Datum tupleid, HeapTuple oldtuple, bool canSetTag)
 {
 	TupleTableSlot *rslot = NULL;
 	bool		matched;
@@ -2458,7 +2474,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL || oldtuple != NULL;
+	matched = DatumGetPointer(tupleid) != NULL || oldtuple != NULL;
 	if (matched)
 		rslot = ExecMergeMatched(context, resultRelInfo, tupleid, oldtuple,
 								 canSetTag, &matched);
@@ -2499,7 +2515,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TupleTableSlot *
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag,
+				 Datum tupleid, HeapTuple oldtuple, bool canSetTag,
 				 bool *matched)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -2535,7 +2551,7 @@ ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * the tupleid of the target row, or an old tuple from the target wholerow
 	 * junk attr.
 	 */
-	Assert(tupleid != NULL || oldtuple != NULL);
+	Assert(DatumGetPointer(tupleid) != NULL || oldtuple != NULL);
 	if (oldtuple != NULL)
 		ExecForceStoreHeapTuple(oldtuple, resultRelInfo->ri_oldTupleSlot,
 								false);
@@ -2549,7 +2565,7 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-	if (tupleid != NULL)
+	if (DatumGetPointer(tupleid) != NULL)
 	{
 		if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 										   tupleid,
@@ -2658,7 +2674,7 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2694,7 +2710,7 @@ lmerge_matched:
 
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2818,9 +2834,13 @@ lmerge_matched:
 								return NULL;
 							}
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 							{
 								*matched = false;
@@ -2847,11 +2867,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3389,10 +3405,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3441,6 +3457,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3491,7 +3509,7 @@ ExecModifyTable(PlanState *pstate)
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 					slot = ExecMerge(&context, node->resultRelInfo,
-									 NULL, NULL, node->canSetTag);
+									 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 					/*
 					 * If we got a RETURNING result, return it to the caller.
@@ -3535,7 +3553,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3578,7 +3597,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3593,9 +3612,25 @@ ExecModifyTable(PlanState *pstate)
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3635,7 +3670,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3699,6 +3734,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3733,6 +3769,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -3976,10 +4015,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4289,6 +4338,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 864a9013b62..f4a124ac4eb 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -377,7 +377,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 4599b0dc761..3620be5b52c 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224a..83c08bbd0e1 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -895,17 +896,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType = ROW_REF_TID;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index d32b07bab57..171509aae62 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -283,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -486,7 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
-	childrte->reftype = ROW_REF_TID;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 10f2d287b39..2c80e010f2a 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -1504,7 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1590,7 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3267,6 +3267,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 9fd05b15e73..7a0fdbe3f40 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1854,6 +1854,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 947a868e569..d3a41533552 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces tuple storage into the slot.
+ * Thus, it can work with slot types other than minimal tuple.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index e88dec71ee9..867b5eb489e 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c32a3cbcf66..730e2bd94d3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -472,7 +472,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -531,7 +531,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -542,7 +542,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -555,7 +555,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -701,6 +701,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * If this callback is not provided, rd_amcache is assumed to point to
@@ -1278,9 +1283,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1288,7 +1293,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1300,7 +1305,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1485,7 +1491,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1514,12 +1520,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1533,7 +1539,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1570,13 +1576,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1588,7 +1594,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1617,12 +1623,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1904,6 +1910,20 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  Returns ROW_REF_TID when table AM routine
+ * is not accessible.  This happens during catalog initialization.  All catalog
+ * tables are known to use heap.
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index cb968d03ecd..c16e6b6e5a0 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index bc06ff99e21..90233a4baf7 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2093,6 +2093,7 @@ typedef struct OnConflictExpr
 typedef enum RowRefType
 {
 	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
 	ROW_REF_COPY				/* Full row copy */
 } RowRefType;
 
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 419613c17ba..cf291a0d17a 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -70,6 +70,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.3 (Apple Git-145)
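To see how the pieces above fit together from a table AM author's point of
view, here is a minimal sketch (not part of the patch set) of an AM that
identifies rows by a bytea RowID.  The AM name "exampleam", the helper
exampleam_fetch_by_rowid(), and the rowid encoding are assumptions made purely
for illustration; only the callback signatures (get_row_ref_type, and
tuple_fetch_row_version taking a Datum tuple identifier) come from the patches
above.

#include "postgres.h"

#include "access/tableam.h"
#include "fmgr.h"

/* hypothetical AM-internal lookup by key; not part of the patch set */
static bool exampleam_fetch_by_rowid(Relation rel, const char *key, Size keylen,
									 Snapshot snapshot, TupleTableSlot *slot);

/*
 * Minimal sketch: report that this AM addresses rows by a bytea RowID, so the
 * planner/executor carry a "rowid" junk column instead of "ctid".
 */
static RowRefType
exampleam_get_row_ref_type(Relation rel)
{
	return ROW_REF_ROWID;
}

static bool
exampleam_tuple_fetch_row_version(Relation rel, Datum tupleid,
								  Snapshot snapshot, TupleTableSlot *slot)
{
	/* tupleid now carries a bytea RowID rather than an ItemPointer */
	bytea	   *rowid = DatumGetByteaP(tupleid);

	/* look the row up by its AM-specific key (hypothetical helper) */
	return exampleam_fetch_by_rowid(rel,
									VARDATA_ANY(rowid),
									VARSIZE_ANY_EXHDR(rowid),
									snapshot, slot);
}

static const TableAmRoutine exampleam_methods = {
	.type = T_TableAmRoutine,
	.get_row_ref_type = exampleam_get_row_ref_type,
	.tuple_fetch_row_version = exampleam_tuple_fetch_row_version,
	/* the remaining callbacks are omitted from this sketch */
};

The heap AM, by contrast, keeps returning ROW_REF_TID from
heapam_get_row_ref_type() and callers simply wrap the ItemPointer as
PointerGetDatum(&tid), so existing code paths go through the same Datum-based
API without behavioral change.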

0009-Let-table-AM-override-reloptions-for-indexes-buil-v2.patch (application/octet-stream)
From e1f49d3499f2dcef0487881df245f760be89a2f2 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Mar 2024 00:53:05 +0200
Subject: [PATCH 09/13] Let table AM override reloptions for indexes built on
 its tables

---
 src/backend/access/common/reloptions.c   |  3 ++-
 src/backend/access/heap/heapam_handler.c |  8 ++++++++
 src/backend/commands/indexcmds.c         |  3 ++-
 src/backend/commands/tablecmds.c         |  9 ++++++++-
 src/backend/utils/cache/relcache.c       | 24 ++++++++++++++++++++++--
 src/include/access/tableam.h             | 23 +++++++++++++++++++++++
 6 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388bb..00088240cdd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1411,7 +1411,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 781385270b0..422898a609d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2714,6 +2714,13 @@ heapam_reloptions(char relkind, Datum reloptions, bool validate)
 	return NULL;
 }
 
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -3219,6 +3226,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
 	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7b20d103c86..7299ebbe9f3 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -899,7 +899,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d2ef8a0c383..fa8eb55b189 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15328,7 +15328,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3babfc804a7..b1a4b36aa14 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -478,15 +478,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
 			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* look up the table's pg_class entry to determine its table AM */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2a496e81610..1bfae380637 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -738,6 +739,13 @@ typedef struct TableAmRoutine
 	 */
 	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
 
+	/*
+	 * Parse table AM-specific index options.  Useful for a table AM to define
+	 * new index options or override existing ones.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1963,6 +1971,21 @@ tableam_reloptions(const TableAmRoutine *tableam, char relkind,
 extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
 							   bool validate);
 
+/*
+ * Parse index options.  Gives the table AM a chance to override index-specific
+ * options defined by 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
-- 
2.39.3 (Apple Git-145)
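
(To sketch how the new indexoptions hook can be used -- names are illustrative
only -- a table AM can either delegate to the index AM's parser, as heap does,
or substitute its own validation at this point.)

static bytea *
exampleam_indexoptions(amoptions_function amoptions, char relkind,
					   Datum reloptions, bool validate)
{
	/*
	 * A custom AM could parse and validate its own index option set here.
	 * This sketch simply delegates to the index AM, mirroring heap.
	 */
	return index_reloptions(amoptions, reloptions, validate);
}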

0005-Add-table-AM-tuple_is_current-method-v2.patch (application/octet-stream)
From 5840c7d640a0e55ae14d7da36dfd343082b2c5ee Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:47:53 +0300
Subject: [PATCH 05/13] Add table AM tuple_is_current method

This allows us to abstract how/whether a table AM uses transaction identifiers.
---
 src/backend/executor/execTuples.c   | 79 +++++++++++++++++++++++++++++
 src/backend/utils/adt/ri_triggers.c |  8 +--
 src/include/executor/tuptable.h     | 21 ++++++++
 3 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b1..2e6e441f7f8 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -60,6 +60,7 @@
 #include "access/heaptoast.h"
 #include "access/htup_details.h"
 #include "access/tupdesc_details.h"
+#include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
@@ -148,6 +149,22 @@ tts_virtual_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return 0;					/* silence compiler warnings */
 }
 
+/*
+ * VirtualTupleTableSlots never have storage tuples.  We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
+static bool
+tts_virtual_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	Assert(!TTS_EMPTY(slot));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("don't have a storage tuple in this context")));
+
+	return false;					/* silence compiler warnings */
+}
+
 /*
  * To materialize a virtual slot all the datums that aren't passed by value
  * have to be copied into the slot's memory context.  To do so, compute the
@@ -354,6 +371,29 @@ tts_heap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 						   slot->tts_tupleDescriptor, isnull);
 }
 
+static bool
+tts_heap_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot;
+	TransactionId xmin;
+
+	Assert(!TTS_EMPTY(slot));
+
+	/*
+	 * In some code paths it's possible to get here with a non-materialized
+	 * slot, in which case we can't check whether the tuple was created by
+	 * the current transaction.
+	 */
+	if (!hslot->tuple)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("don't have a storage tuple in this context")));
+
+	xmin = HeapTupleHeaderGetRawXmin(hslot->tuple->t_data);
+
+	return TransactionIdIsCurrentTransactionId(xmin);
+}
+
 static void
 tts_heap_materialize(TupleTableSlot *slot)
 {
@@ -521,6 +561,18 @@ tts_minimal_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return 0;					/* silence compiler warnings */
 }
 
+static bool
+tts_minimal_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	Assert(!TTS_EMPTY(slot));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("don't have a storage tuple in this context")));
+
+	return false;					/* silence compiler warnings */
+}
+
 static void
 tts_minimal_materialize(TupleTableSlot *slot)
 {
@@ -714,6 +766,29 @@ tts_buffer_heap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 						   slot->tts_tupleDescriptor, isnull);
 }
 
+static bool
+tts_buffer_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	TransactionId xmin;
+
+	Assert(!TTS_EMPTY(slot));
+
+	/*
+	 * In some code paths it's possible to get here with a non-materialized
+	 * slot, in which case we can't check whether the tuple was created by
+	 * the current transaction.
+	 */
+	if (!bslot->base.tuple)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("don't have a storage tuple in this context")));
+
+	xmin = HeapTupleHeaderGetRawXmin(bslot->base.tuple->t_data);
+
+	return TransactionIdIsCurrentTransactionId(xmin);
+}
+
 static void
 tts_buffer_heap_materialize(TupleTableSlot *slot)
 {
@@ -1029,6 +1104,7 @@ const TupleTableSlotOps TTSOpsVirtual = {
 	.getsomeattrs = tts_virtual_getsomeattrs,
 	.getsysattr = tts_virtual_getsysattr,
 	.materialize = tts_virtual_materialize,
+	.is_current_xact_tuple = tts_virtual_is_current_xact_tuple,
 	.copyslot = tts_virtual_copyslot,
 
 	/*
@@ -1048,6 +1124,7 @@ const TupleTableSlotOps TTSOpsHeapTuple = {
 	.clear = tts_heap_clear,
 	.getsomeattrs = tts_heap_getsomeattrs,
 	.getsysattr = tts_heap_getsysattr,
+	.is_current_xact_tuple = tts_heap_is_current_xact_tuple,
 	.materialize = tts_heap_materialize,
 	.copyslot = tts_heap_copyslot,
 	.get_heap_tuple = tts_heap_get_heap_tuple,
@@ -1065,6 +1142,7 @@ const TupleTableSlotOps TTSOpsMinimalTuple = {
 	.clear = tts_minimal_clear,
 	.getsomeattrs = tts_minimal_getsomeattrs,
 	.getsysattr = tts_minimal_getsysattr,
+	.is_current_xact_tuple = tts_minimal_is_current_xact_tuple,
 	.materialize = tts_minimal_materialize,
 	.copyslot = tts_minimal_copyslot,
 
@@ -1082,6 +1160,7 @@ const TupleTableSlotOps TTSOpsBufferHeapTuple = {
 	.clear = tts_buffer_heap_clear,
 	.getsomeattrs = tts_buffer_heap_getsomeattrs,
 	.getsysattr = tts_buffer_heap_getsysattr,
+	.is_current_xact_tuple = tts_buffer_is_current_xact_tuple,
 	.materialize = tts_buffer_heap_materialize,
 	.copyslot = tts_buffer_heap_copyslot,
 	.get_heap_tuple = tts_buffer_heap_get_heap_tuple,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 2fe93775003..62601a6d80c 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -1260,9 +1260,6 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 {
 	const RI_ConstraintInfo *riinfo;
 	int			ri_nullcheck;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
 
 	/*
 	 * AfterTriggerSaveEvent() handles things such that this function is never
@@ -1330,10 +1327,7 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 	 * this if we knew the INSERT trigger already fired, but there is no easy
 	 * way to know that.)
 	 */
-	xminDatum = slot_getsysattr(oldslot, MinTransactionIdAttributeNumber, &isnull);
-	Assert(!isnull);
-	xmin = DatumGetTransactionId(xminDatum);
-	if (TransactionIdIsCurrentTransactionId(xmin))
+	if (slot_is_current_xact_tuple(oldslot))
 		return true;
 
 	/* If all old and new key values are equal, no check is needed */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a3..c2eddda74a8 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -166,6 +166,12 @@ struct TupleTableSlotOps
 	 */
 	Datum		(*getsysattr) (TupleTableSlot *slot, int attnum, bool *isnull);
 
+	/*
+	 * Check whether the tuple was created by the current transaction.  Throws
+	 * an error if the slot doesn't contain a storage tuple.
+	 */
+	bool		(*is_current_xact_tuple) (TupleTableSlot *slot);
+
 	/*
 	 * Make the contents of the slot solely depend on the slot, and not on
 	 * underlying resources (like another memory context, buffers, etc).
@@ -426,6 +432,21 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * slot_is_current_xact_tuple - check whether the slot's current tuple was
+ *								created by the current transaction.
+ *
+ *  If the slot does not contain a storage tuple, this will throw an error.
+ *  Hence, before calling this function, callers should make sure that the
+ *  slot type supports storage tuples and that one is currently inside the
+ *  slot.
+ */
+static inline bool
+slot_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	return slot->tts_ops->is_current_xact_tuple(slot);
+}
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
-- 
2.39.3 (Apple Git-145)
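
(For a table AM with its own slot type, the new is_current_xact_tuple callback
would be implemented against that AM's versioning metadata rather than xmin.
A rough sketch; ExampleTupleTableSlot and ExampleTupleWrittenByCurrentXact are
hypothetical names, not part of the patch.)

static bool
tts_example_is_current_xact_tuple(TupleTableSlot *slot)
{
	ExampleTupleTableSlot *eslot = (ExampleTupleTableSlot *) slot;

	Assert(!TTS_EMPTY(slot));

	if (!eslot->tuple)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("don't have a storage tuple in this context")));

	/* decide "currentness" from the AM's own versioning information */
	return ExampleTupleWrittenByCurrentXact(eslot->tuple);
}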

0007-Custom-reloptions-for-table-AM-v2.patch (application/octet-stream)
From 252b2e6b733c3b190321e3d9d18112e5b23c308b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH 07/13] Custom reloptions for table AM

Let a table AM define custom reloptions for its tables.
---
 src/backend/access/common/reloptions.c   |  6 ++-
 src/backend/access/heap/heapam_handler.c | 13 ++++++
 src/backend/access/table/tableamapi.c    | 20 ++++++++++
 src/backend/commands/tablecmds.c         | 51 ++++++++++++++----------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       |  6 ++-
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 29 ++++++++++++++
 8 files changed, 106 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..963995388bb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1377,7 +1378,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1400,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 66ac541ed21..45df59fdf50 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2424,6 +2425,17 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	if (relkind == RELKIND_RELATION ||
+		relkind == RELKIND_TOASTVALUE ||
+		relkind == RELKIND_MATVIEW)
+		return heap_reloptions(relkind, reloptions, validate);
+
+	return NULL;
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2929,6 +2941,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..34ff3e38333 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,24 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3ed0618b4e6..d2ef8a0c383 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -705,6 +705,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	LOCKMODE	parentLockmode;
 	const char *accessMethod = NULL;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -844,6 +845,26 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * If the statement hasn't specified an access method, but we're defining
+	 * a type of relation that needs one, use the default.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		accessMethod = stmt->accessMethod;
+
+		if (partitioned)
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("specifying a table access method is not supported on a partitioned table")));
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind))
+		accessMethod = default_table_access_method;
+
+	/* look up the access method, verify it is for a table */
+	if (accessMethod != NULL)
+		accessMethodId = get_table_am_oid(accessMethod, false);
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -852,6 +873,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -860,6 +887,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -951,26 +979,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * If the statement hasn't specified an access method, but we're defining
-	 * a type of relation that needs one, use the default.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		accessMethod = stmt->accessMethod;
-
-		if (partitioned)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("specifying a table access method is not supported on a partitioned table")));
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind))
-		accessMethod = default_table_access_method;
-
-	/* look up the access method, verify it is for a table */
-	if (accessMethod != NULL)
-		accessMethodId = get_table_am_oid(accessMethod, false);
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15309,7 +15317,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 71e8a6f2584..d1d76016ab4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2661,7 +2661,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6d98bdfba06..3babfc804a7 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -465,6 +466,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -479,6 +481,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -494,7 +497,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270a..8ddc75df287 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b9210ea4fcb..b99fb6e4e71 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -735,6 +735,11 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * Parse table AM-specific table options.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1931,6 +1936,29 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for a given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of a particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2108,6 +2136,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
-- 
2.39.3 (Apple Git-145)
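
(A sketch of the intended use of the new reloptions callback, with illustrative
names: a table AM that wants heap-compatible options for ordinary relations and
materialized views, and no options elsewhere, could implement it as below.)

static bytea *
exampleam_reloptions(char relkind, Datum reloptions, bool validate)
{
	if (relkind == RELKIND_RELATION || relkind == RELKIND_MATVIEW)
		return heap_reloptions(relkind, reloptions, validate);

	/* no AM-specific options for TOAST tables or other relkinds */
	return NULL;
}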

0006-Generalize-relation-analyze-in-table-AM-interface-v2.patch (application/octet-stream)
From 5423c41f1400b0267e1c4437a7143153ccdc90d7 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 8 Jun 2023 04:20:29 +0300
Subject: [PATCH 06/13] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table, written
in acquire_sample_rows().  A custom table AM can only redefine the way to get the
next block/tuple by implementing the scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.
---
 src/backend/access/heap/heapam_handler.c | 286 +++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           | 288 +----------------------
 src/include/access/tableam.h             |  92 ++------
 src/include/commands/vacuum.h            |   5 +
 src/include/foreign/fdwapi.h             |   6 +-
 6 files changed, 317 insertions(+), 362 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..66ac541ed21 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
 
 static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   Snapshot snapshot, TupleTableSlot *slot,
@@ -1220,6 +1221,288 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * Comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b, void *arg)
+{
+	HeapTuple	ha = *(const HeapTuple *) a;
+	HeapTuple	hb = *(const HeapTuple *) b;
+	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+	if (ba < bb)
+		return -1;
+	if (ba > bb)
+		return 1;
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+static BufferAccessStrategy analyze_bstrategy;
+
+/*
+ * heapam_acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * Selected rows are returned in the caller-allocated array rows[], which
+ * must have at least targrows entries.
+ * The actual number of rows selected is returned as the function result.
+ * We also estimate the total numbers of live and dead rows in the table,
+ * and return them into *totalrows and *totaldeadrows, respectively.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ *
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+heapam_acquire_sample_rows(Relation onerel, int elevel,
+						   HeapTuple *rows, int targrows,
+						   double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstoskip = -1;	/* -1 means not set yet */
+	uint32		randseed;		/* Seed for block sampler(s) */
+	BlockNumber totalblocks;
+	TransactionId OldestXmin;
+	BlockSamplerData bs;
+	ReservoirStateData rstate;
+	TupleTableSlot *slot;
+	TableScanDesc scan;
+	BlockNumber nblocks;
+	BlockNumber blksdone = 0;
+#ifdef USE_PREFETCH
+	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
+	BlockSamplerData prefetch_bs;
+#endif
+
+	Assert(targrows > 0);
+
+	totalblocks = RelationGetNumberOfBlocks(onerel);
+
+	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
+
+	/* Prepare for sampling block numbers */
+	randseed = pg_prng_uint32(&pg_global_prng_state);
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
+
+#ifdef USE_PREFETCH
+	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
+	/* Create another BlockSampler, using the same seed, for prefetching */
+	if (prefetch_maximum)
+		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+#endif
+
+	/* Report sampling block numbers */
+	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
+								 nblocks);
+
+	/* Prepare for sampling rows */
+	reservoir_init_selection_state(&rstate, targrows);
+
+	scan = table_beginscan_analyze(onerel);
+	slot = table_slot_create(onerel, NULL);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * If we are doing prefetching, then go ahead and tell the kernel about
+	 * the first set of pages we are going to want.  This also moves our
+	 * iterator out ahead of the main one being used, where we will keep it so
+	 * that we're always pre-fetching out prefetch_maximum number of blocks
+	 * ahead.
+	 */
+	if (prefetch_maximum)
+	{
+		for (int i = 0; i < prefetch_maximum; i++)
+		{
+			BlockNumber prefetch_block;
+
+			if (!BlockSampler_HasMore(&prefetch_bs))
+				break;
+
+			prefetch_block = BlockSampler_Next(&prefetch_bs);
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+		}
+	}
+#endif
+
+	/* Outer loop over blocks to sample */
+	while (BlockSampler_HasMore(&bs))
+	{
+		bool		block_accepted;
+		BlockNumber targblock = BlockSampler_Next(&bs);
+#ifdef USE_PREFETCH
+		BlockNumber prefetch_targblock = InvalidBlockNumber;
+
+		/*
+		 * Make sure that every time the main BlockSampler is moved forward
+		 * that our prefetch BlockSampler also gets moved forward, so that we
+		 * always stay out ahead.
+		 */
+		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
+			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+#endif
+
+		vacuum_delay_point();
+
+		block_accepted = heapam_scan_analyze_next_block(scan, targblock, analyze_bstrategy);
+
+#ifdef USE_PREFETCH
+
+		/*
+		 * When pre-fetching, after we get a block, tell the kernel about the
+		 * next one we will want, if there's any left.
+		 *
+		 * We want to do this even if the table_scan_analyze_next_block() call
+		 * above decides against analyzing the block it picked.
+		 */
+		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+#endif
+
+		/*
+		 * Don't analyze if table_scan_analyze_next_block() indicated this
+		 * block is unsuitable for analyzing.
+		 */
+		if (!block_accepted)
+			continue;
+
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		{
+			/*
+			 * The first targrows sample rows are simply copied into the
+			 * reservoir. Then we start replacing tuples in the sample until
+			 * we reach the end of the relation.  This algorithm is from Jeff
+			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
+			 * works by repeatedly computing the number of tuples to skip
+			 * before selecting a tuple, which replaces a randomly chosen
+			 * element of the reservoir (current set of tuples).  At all times
+			 * the reservoir is a true random sample of the tuples we've
+			 * passed over so far, so when we fall off the end of the relation
+			 * we're done.
+			 */
+			if (numrows < targrows)
+				rows[numrows++] = ExecCopySlotHeapTuple(slot);
+			else
+			{
+				/*
+				 * t in Vitter's paper is the number of records already
+				 * processed.  If we need to compute a new S value, we must
+				 * use the not-yet-incremented value of samplerows as t.
+				 */
+				if (rowstoskip < 0)
+					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+				if (rowstoskip <= 0)
+				{
+					/*
+					 * Found a suitable tuple, so save it, replacing one old
+					 * tuple at random
+					 */
+					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
+
+					Assert(k >= 0 && k < targrows);
+					heap_freetuple(rows[k]);
+					rows[k] = ExecCopySlotHeapTuple(slot);
+				}
+
+				rowstoskip -= 1;
+			}
+
+			samplerows += 1;
+		}
+
+		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
+									 ++blksdone);
+	}
+
+	ExecDropSingleTupleTableSlot(slot);
+	table_endscan(scan);
+
+	/*
+	 * If we didn't find as many tuples as we wanted then we're done. No sort
+	 * is needed, since they're already in order.
+	 *
+	 * Otherwise we need to sort the collected tuples by position
+	 * (itempointer). It's not worth worrying about corner cases where the
+	 * tuples are already sorted.
+	 */
+	if (numrows == targrows)
+		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
+							compare_rows, NULL);
+
+	/*
+	 * Estimate total numbers of live and dead rows in relation, extrapolating
+	 * on the assumption that the average tuple density in pages we didn't
+	 * scan is the same as in the pages we did scan.  Since what we scanned is
+	 * a random sample of the pages in the relation, this should be a good
+	 * assumption.
+	 */
+	if (bs.m > 0)
+	{
+		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+	}
+	else
+	{
+		*totalrows = 0.0;
+		*totaldeadrows = 0.0;
+	}
+
+	/*
+	 * Emit some interesting relation info
+	 */
+	ereport(elevel,
+			(errmsg("\"%s\": scanned %d of %u pages, "
+					"containing %.0f live rows and %.0f dead rows; "
+					"%d rows in sample, %.0f estimated total rows",
+					RelationGetRelationName(onerel),
+					bs.m, totalblocks,
+					liverows, deadrows,
+					numrows, *totalrows)));
+
+	return numrows;
+}
+
+static inline void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = heapam_acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	analyze_bstrategy = bstrategy;
+}
+
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -2637,10 +2920,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index ce637a5a5d9..55b8caeadf2 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,8 +81,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..659f69ef270 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -87,10 +87,6 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
-static int	acquire_sample_rows(Relation onerel, int elevel,
-								HeapTuple *rows, int targrows,
-								double *totalrows, double *totaldeadrows);
-static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows, double *totaldeadrows);
@@ -190,10 +186,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1102,277 +1097,6 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
-/*
- * acquire_sample_rows -- acquire a random sample of rows from the table
- *
- * Selected rows are returned in the caller-allocated array rows[], which
- * must have at least targrows entries.
- * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
- * and return them into *totalrows and *totaldeadrows, respectively.
- *
- * The returned list of tuples is in order by physical position in the table.
- * (We will rely on this later to derive correlation estimates.)
- *
- * As of May 2004 we use a new two-stage method:  Stage one selects up
- * to targrows random blocks (or all blocks, if there aren't so many).
- * Stage two scans these blocks and uses the Vitter algorithm to create
- * a random sample of targrows rows (or less, if there are less in the
- * sample of blocks).  The two stages are executed simultaneously: each
- * block is processed as soon as stage one returns its number and while
- * the rows are read stage two controls which ones are to be inserted
- * into the sample.
- *
- * Although every row has an equal chance of ending up in the final
- * sample, this sampling method is not perfect: not every possible
- * sample has an equal chance of being selected.  For large relations
- * the number of different blocks represented by the sample tends to be
- * too small.  We can live with that for now.  Improvements are welcome.
- *
- * An important property of this sampling method is that because we do
- * look at a statistically unbiased set of blocks, we should get
- * unbiased estimates of the average numbers of live and dead rows per
- * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
- */
-static int
-acquire_sample_rows(Relation onerel, int elevel,
-					HeapTuple *rows, int targrows,
-					double *totalrows, double *totaldeadrows)
-{
-	int			numrows = 0;	/* # rows now in reservoir */
-	double		samplerows = 0; /* total # rows collected */
-	double		liverows = 0;	/* # live rows seen */
-	double		deadrows = 0;	/* # dead rows seen */
-	double		rowstoskip = -1;	/* -1 means not set yet */
-	uint32		randseed;		/* Seed for block sampler(s) */
-	BlockNumber totalblocks;
-	TransactionId OldestXmin;
-	BlockSamplerData bs;
-	ReservoirStateData rstate;
-	TupleTableSlot *slot;
-	TableScanDesc scan;
-	BlockNumber nblocks;
-	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
-
-	Assert(targrows > 0);
-
-	totalblocks = RelationGetNumberOfBlocks(onerel);
-
-	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
-
-	/* Prepare for sampling block numbers */
-	randseed = pg_prng_uint32(&pg_global_prng_state);
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
-
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
-	/* Report sampling block numbers */
-	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
-								 nblocks);
-
-	/* Prepare for sampling rows */
-	reservoir_init_selection_state(&rstate, targrows);
-
-	scan = table_beginscan_analyze(onerel);
-	slot = table_slot_create(onerel, NULL);
-
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
-
-	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
-	{
-		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
-		vacuum_delay_point();
-
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
-		{
-			/*
-			 * The first targrows sample rows are simply copied into the
-			 * reservoir. Then we start replacing tuples in the sample until
-			 * we reach the end of the relation.  This algorithm is from Jeff
-			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
-			 * works by repeatedly computing the number of tuples to skip
-			 * before selecting a tuple, which replaces a randomly chosen
-			 * element of the reservoir (current set of tuples).  At all times
-			 * the reservoir is a true random sample of the tuples we've
-			 * passed over so far, so when we fall off the end of the relation
-			 * we're done.
-			 */
-			if (numrows < targrows)
-				rows[numrows++] = ExecCopySlotHeapTuple(slot);
-			else
-			{
-				/*
-				 * t in Vitter's paper is the number of records already
-				 * processed.  If we need to compute a new S value, we must
-				 * use the not-yet-incremented value of samplerows as t.
-				 */
-				if (rowstoskip < 0)
-					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
-				if (rowstoskip <= 0)
-				{
-					/*
-					 * Found a suitable tuple, so save it, replacing one old
-					 * tuple at random
-					 */
-					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
-
-					Assert(k >= 0 && k < targrows);
-					heap_freetuple(rows[k]);
-					rows[k] = ExecCopySlotHeapTuple(slot);
-				}
-
-				rowstoskip -= 1;
-			}
-
-			samplerows += 1;
-		}
-
-		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
-									 ++blksdone);
-	}
-
-	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
-
-	/*
-	 * If we didn't find as many tuples as we wanted then we're done. No sort
-	 * is needed, since they're already in order.
-	 *
-	 * Otherwise we need to sort the collected tuples by position
-	 * (itempointer). It's not worth worrying about corner cases where the
-	 * tuples are already sorted.
-	 */
-	if (numrows == targrows)
-		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
-							compare_rows, NULL);
-
-	/*
-	 * Estimate total numbers of live and dead rows in relation, extrapolating
-	 * on the assumption that the average tuple density in pages we didn't
-	 * scan is the same as in the pages we did scan.  Since what we scanned is
-	 * a random sample of the pages in the relation, this should be a good
-	 * assumption.
-	 */
-	if (bs.m > 0)
-	{
-		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
-		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
-	}
-	else
-	{
-		*totalrows = 0.0;
-		*totaldeadrows = 0.0;
-	}
-
-	/*
-	 * Emit some interesting relation info
-	 */
-	ereport(elevel,
-			(errmsg("\"%s\": scanned %d of %u pages, "
-					"containing %.0f live rows and %.0f dead rows; "
-					"%d rows in sample, %.0f estimated total rows",
-					RelationGetRelationName(onerel),
-					bs.m, totalblocks,
-					liverows, deadrows,
-					numrows, *totalrows)));
-
-	return numrows;
-}
-
-/*
- * Comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b, void *arg)
-{
-	HeapTuple	ha = *(const HeapTuple *) a;
-	HeapTuple	hb = *(const HeapTuple *) b;
-	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
-	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
-	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
-	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
-	if (ba < bb)
-		return -1;
-	if (ba > bb)
-		return 1;
-	if (oa < ob)
-		return -1;
-	if (oa > ob)
-		return 1;
-	return 0;
-}
-
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1462,9 +1186,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2c43ef3f60e..b9210ea4fcb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -654,41 +655,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -709,6 +675,15 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/*
+	 * Provide a row sampling callback for the relation, along with the
+	 * number of relation pages.
+	 */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1740,42 +1715,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1881,6 +1820,17 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * Provide a row sampling callback for the relation, along with the number
+ * of relation pages.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1182a967427..d38ddc68b79 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -104,6 +104,11 @@ typedef struct ParallelVacuumState ParallelVacuumState;
  */
 typedef struct VacAttrStats *VacAttrStatsP;
 
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 typedef Datum (*AnalyzeAttrFetchFunc) (VacAttrStatsP stats, int rownum,
 									   bool *isNull);
 
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index fcde3876b28..0968e0a01ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.3 (Apple Git-145)
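
(For a non-block-based AM, the new relation_analyze hook would look roughly
like the sketch below; the exampleam_* names, including the page-estimation
helper, are hypothetical.  The AM supplies its own sampler plus an estimate of
the relation's size in pages, and may stash bstrategy for the sampler to use.)

static int
exampleam_acquire_sample_rows(Relation onerel, int elevel,
							  HeapTuple *rows, int targrows,
							  double *totalrows, double *totaldeadrows)
{
	/* walk the AM's own structures (e.g. its primary index) to fill rows[] */
	*totalrows = 0;
	*totaldeadrows = 0;
	return 0;
}

static void
exampleam_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
						   BlockNumber *totalpages,
						   BufferAccessStrategy bstrategy)
{
	*func = exampleam_acquire_sample_rows;
	*totalpages = exampleam_estimate_pages(relation);	/* hypothetical helper */
}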

0008-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v2.patch (application/octet-stream)
From 21c16745906bc5c0005f3921b05ba1268e5e31cc Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH 08/13] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs need to implement INSERT ... ON CONFLICT ... with
speculative tokens.  At most, they can provide a custom implementation of those
tokens via the tuple_insert_speculative() and tuple_complete_speculative() API
functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a single
tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function provides clear semantics, making room for different
implementations of the INSERT ... ON CONFLICT ... functionality.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 45df59fdf50..781385270b0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -306,6 +306,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2914,8 +3192,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 34ff3e38333..d9fc87665c7 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -70,8 +70,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 34962033be7..7d64fcab00d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -129,7 +129,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -265,66 +264,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1010,12 +949,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1032,23 +978,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1071,57 +1022,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2417,144 +2324,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b99fb6e4e71..2a496e81610 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -510,19 +511,16 @@ typedef struct TableAmRoutine
 									 CommandId cid, int options,
 									 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1393,36 +1391,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into the table, checking the given arbiter indexes.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it takes into account
+ * `arbiterIndexes`, which comprises the list of oids of arbiter indexes.
+ *
+ * If the tuple doesn't violate uniqueness on any arbiter index, then it is
+ * inserted and the slot containing the inserted tuple is returned.
+ *
+ * If the tuple violates uniqueness on some arbiter index, then this function
+ * returns NULL and doesn't insert the tuple.  Also, if `lockedSlot` is
+ * provided, then the conflicting tuple gets locked in `lockmode` and placed
+ * into `lockedSlot`.
+ *
+ * Executor state `estate` is passed to this method to provide the ability to
+ * calculate index tuples.  Temporary tuple table slot `tempSlot` is passed
+ * for holding the potentially conflicting tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

0004-Allow-table-AM-tuple_insert-method-to-return-the--v2.patchapplication/octet-stream; name=0004-Allow-table-AM-tuple_insert-method-to-return-the--v2.patchDownload
From a85fd10f8b50db832993c9f7531eec834525acc8 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:28:27 +0300
Subject: [PATCH 04/13] Allow table AM tuple_insert() method to return the
 different slot

This allows the table AM to return a native tuple slot even if a
VirtualTupleTableSlot is given as input.  A native tuple slot has knowledge
about system attributes, which could be accessed in the future.
---
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/executor/nodeModifyTable.c   |  6 +++---
 src/include/access/tableam.h             | 20 +++++++++++---------
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index da86ca5c31a..6abfe36dec7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -243,7 +243,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
  * ----------------------------------------------------------------------------
  */
 
-static void
+static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 					int options, BulkInsertState bistate)
 {
@@ -260,6 +260,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 
 	if (shouldFree)
 		pfree(tuple);
+
+	return slot;
 }
 
 static void
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 9deeaceb35c..34962033be7 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1129,9 +1129,9 @@ ExecInsert(ModifyTableContext *context,
 		else
 		{
 			/* insert the tuple normally */
-			table_tuple_insert(resultRelationDesc, slot,
-							   estate->es_output_cid,
-							   0, NULL);
+			slot = table_tuple_insert(resultRelationDesc, slot,
+									  estate->es_output_cid,
+									  0, NULL);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index be092e8bedb..2c43ef3f60e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -505,9 +505,9 @@ typedef struct TableAmRoutine
 	 */
 
 	/* see table_tuple_insert() for reference about parameters */
-	void		(*tuple_insert) (Relation rel, TupleTableSlot *slot,
-								 CommandId cid, int options,
-								 struct BulkInsertStateData *bistate);
+	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
+									 CommandId cid, int options,
+									 struct BulkInsertStateData *bistate);
 
 	/* see table_tuple_insert_speculative() for reference about parameters */
 	void		(*tuple_insert_speculative) (Relation rel,
@@ -1398,16 +1398,18 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
- * On return the slot's tts_tid and tts_tableOid are updated to reflect the
- * insertion. But note that any toasting of fields within the slot is NOT
- * reflected in the slots contents.
+ * Returns the slot containing the inserted tuple, which may differ from the
+ * given slot.  For instance, the source slot may be a VirtualTupleTableSlot,
+ * but the result corresponds to the table AM.  On return the slot's tts_tid and
+ * tts_tableOid are updated to reflect the insertion. But note that any
+ * toasting of fields within the slot is NOT reflected in the slots contents.
  */
-static inline void
+static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 				   int options, struct BulkInsertStateData *bistate)
 {
-	rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-								  bistate);
+	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
+										 bistate);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

0002-Add-EvalPlanQual-delete-returning-isolation-test-v2.patchapplication/octet-stream; name=0002-Add-EvalPlanQual-delete-returning-isolation-test-v2.patchDownload
From c2a38f7f33ae645a01f74fe74574681024c07e52 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Mar 2023 16:47:09 -0700
Subject: [PATCH 02/13] Add EvalPlanQual delete returning isolation test

Author: Andres Freund
Reviewed-by: Pavel Borisov
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
---
 .../isolation/expected/eval-plan-qual.out     | 30 +++++++++++++++++++
 src/test/isolation/specs/eval-plan-qual.spec  |  4 +++
 2 files changed, 34 insertions(+)

diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 73e0aeb50e7..0237271ceec 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -746,6 +746,36 @@ savings  |    600|    1200
 (2 rows)
 
 
+starting permutation: read wx2 wb1 c2 c1 read
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |    600|    1200
+savings  |    600|    1200
+(2 rows)
+
+step wx2: UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance;
+balance
+-------
+   1050
+(1 row)
+
+step wb1: DELETE FROM accounts WHERE balance = 600 RETURNING *; <waiting ...>
+step c2: COMMIT;
+step wb1: <... completed>
+accountid|balance|balance2
+---------+-------+--------
+savings  |    600|    1200
+(1 row)
+
+step c1: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |   1050|    2100
+(1 row)
+
+
 starting permutation: upsert1 upsert2 c1 c2 read
 step upsert1: 
 	WITH upsert AS
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 735c671734e..edd6d19df3a 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -76,6 +76,8 @@ setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
 step wx1	{ UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 # wy1 then wy2 checks the case where quals pass then fail
 step wy1	{ UPDATE accounts SET balance = balance + 500 WHERE accountid = 'checking' RETURNING balance; }
+# wx2 then wb1 checks the case of re-fetching up-to-date values for DELETE ... RETURNING ...
+step wb1	{ DELETE FROM accounts WHERE balance = 600 RETURNING *; }
 
 step wxext1	{ UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 step tocds1	{ UPDATE accounts SET accountid = 'cds' WHERE accountid = 'checking'; }
@@ -353,6 +355,8 @@ permutation wx1 delwcte c1 c2 read
 # test that a delete to a self-modified row throws error when
 # previously updated by a different cid
 permutation wx1 delwctefail c1 c2 read
+# test that a delete re-fetches up-to-date values for returning clause
+permutation read wx2 wb1 c2 c1 read
 
 permutation upsert1 upsert2 c1 c2 read
 permutation readp1 writep1 readp2 c1 c2
-- 
2.39.3 (Apple Git-145)

0003-Allow-table-AM-to-store-complex-data-structures-i-v2.patchapplication/octet-stream; name=0003-Allow-table-AM-to-store-complex-data-structures-i-v2.patchDownload
From 6eeeecb060d3d75644946dc9404a3a65f8928813 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:04:58 +0300
Subject: [PATCH 03/13] Allow table AM to store complex data structures in
 rd_amcache

New table AM method free_rd_amcache is responsible for freeing the rd_amcache.
---
 src/backend/access/heap/heapam_handler.c |  1 +
 src/backend/utils/cache/relcache.c       | 18 +++++++------
 src/include/access/tableam.h             | 33 ++++++++++++++++++++++++
 src/include/utils/rel.h                  | 10 ++++---
 4 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7c7204a2422..da86ca5c31a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2640,6 +2640,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 
+	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2cd19d603fb..6d98bdfba06 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -318,6 +318,7 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
 										  StrategyNumber numSupport);
 static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
 static void unlink_initfile(const char *initfilename, int elevel);
+static void release_rd_amcache(Relation rel);
 
 
 /*
@@ -2262,9 +2263,7 @@ RelationReloadIndexInfo(Relation relation)
 	RelationCloseSmgr(relation);
 
 	/* Must free any AM cached data upon relcache flush */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * If it's a shared index, we might be called before backend startup has
@@ -2484,8 +2483,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 		pfree(relation->rd_options);
 	if (relation->rd_indextuple)
 		pfree(relation->rd_indextuple);
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
+	release_rd_amcache(relation);
 	if (relation->rd_fdwroutine)
 		pfree(relation->rd_fdwroutine);
 	if (relation->rd_indexcxt)
@@ -2547,9 +2545,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 	RelationCloseSmgr(relation);
 
 	/* Free AM cached data, if any */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * Treat nailed-in system relations separately, they always need to be
@@ -6868,3 +6864,9 @@ ResOwnerReleaseRelation(Datum res)
 
 	RelationCloseCleanup((Relation) res);
 }
+
+static void
+release_rd_amcache(Relation rel)
+{
+	table_free_rd_amcache(rel);
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 467bdc09d36..be092e8bedb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -715,6 +715,13 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * This callback frees relation private cache data stored in rd_amcache.
+	 * If this callback is not provided, rd_amcache is assumed to point to a
+	 * single memory chunk.
+	 */
+	void		(*free_rd_amcache) (Relation rel);
+
 	/*
 	 * See table_relation_size().
 	 *
@@ -1878,6 +1885,32 @@ table_index_validate_scan(Relation table_rel,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Frees relation private cache data stored in rd_amcache.  Uses the
+ * free_rd_amcache method if provided.  Otherwise, assumes rd_amcache points
+ * to a single memory chunk.
+ */
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+	if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+	{
+		rel->rd_tableam->free_rd_amcache(rel);
+
+		/*
+		 * We are assuming free_rd_amcache() did clear the cache and left NULL
+		 * in rd_amcache.
+		 */
+		Assert(rel->rd_amcache == NULL);
+	}
+	else
+	{
+		if (rel->rd_amcache)
+			pfree(rel->rd_amcache);
+		rel->rd_amcache = NULL;
+	}
+}
+
 /*
  * Return the current size of `rel` in bytes. If `forkNumber` is
  * InvalidForkNumber, return the relation's overall size, otherwise the size
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 87002049538..69557fc7a2c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -221,10 +221,12 @@ typedef struct RelationData
 	 * rd_amcache is available for index and table AMs to cache private data
 	 * about the relation.  This must be just a cache since it may get reset
 	 * at any time (in particular, it will get reset by a relcache inval
-	 * message for the relation).  If used, it must point to a single memory
-	 * chunk palloc'd in CacheMemoryContext, or in rd_indexcxt for an index
-	 * relation.  A relcache reset will include freeing that chunk and setting
-	 * rd_amcache = NULL.
+	 * message for the relation).  If used by a table AM, it must point to a
+	 * single memory chunk palloc'd in CacheMemoryContext, or to a more complex
+	 * data structure in that memory context to be freed by the free_rd_amcache
+	 * method.  If used by an index AM, it must point to a single memory chunk
+	 * palloc'd in the rd_indexcxt memory context.  A relcache reset will
+	 * include freeing that data and setting rd_amcache = NULL.
 	 */
 	void	   *rd_amcache;		/* available for use by index/table AM */
 
-- 
2.39.3 (Apple Git-145)

#15Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#14)
Re: Table AM Interface Enhancements

Hi, Alexander!

On Tue, 19 Mar 2024 at 03:34, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Sun, Mar 3, 2024 at 1:50 PM Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Mon, Nov 27, 2023 at 10:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Nov 25, 2023, at 9:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Should the patch at least document which parts of the EState are
expected to be in which states, and which parts should be viewed as
undefined? If the implementors of table AMs rely on any/all aspects of
EState, doesn't that prevent future changes to how that structure is used?

New tuple tuple_insert_with_arbiter() table AM API method needs EState
argument to call executor functions: ExecCheckIndexConstraints(),
ExecUpdateLockMode(), and ExecInsertIndexTuples(). I think we
probably need to invent some opaque way to call this executor function
without revealing EState to table AM. Do you think this could work?

We're clearly not accessing all of the EState, just some specific
fields, such as es_per_tuple_exprcontext. I think you could at least
refactor to pass the minimum amount of state information through the table
AM API.

Yes, the table AM doesn't need the full EState, just the ability to do
specific manipulation with tuples. I'll refactor the patch to make a
better isolation for this.

Please find the revised patchset attached. The changes are as follows:
1. Patchset is rebased to the current master.
2. Patchset is reordered. I tried to put less debatable patches to the
top.
3. tuple_is_current() method is moved from the Table AM API to the
slot as proposed by Matthias van de Meent.
4. Assert added to the table_free_rd_amcache() as proposed by Pavel
Borisov.

Patches 0001-0002 are unchanged compared to the last version in thread [1].
In my opinion, they're still ready to be committed, which was not done only
because we were too close to feature freeze one year ago.

0003 - Assert added since the previous version. I still have a strong opinion
that allowing multi-chunked data structures instead of single chunks is
completely safe and a natural, self-justified step in improving Postgres. The
patch is simple enough and ready to be pushed.
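
To illustrate what the new callback enables, a table AM that keeps a
multi-chunk structure in rd_amcache could supply something along the lines of
the sketch below (MyAmCacheData and my_am_free_rd_amcache are hypothetical
names, not part of the patchset); table_free_rd_amcache() then asserts that
the callback left rd_amcache set to NULL:

typedef struct MyAmCacheData
{
	List	   *chunkList;		/* hypothetical per-relation metadata */
	Bitmapset  *keyAttrs;		/* hypothetical key attribute set */
} MyAmCacheData;

static void
my_am_free_rd_amcache(Relation rel)
{
	MyAmCacheData *cache = (MyAmCacheData *) rel->rd_amcache;

	if (cache != NULL)
	{
		list_free_deep(cache->chunkList);
		bms_free(cache->keyAttrs);
		pfree(cache);
	}

	/* table_free_rd_amcache() asserts that NULL is left behind */
	rel->rd_amcache = NULL;
}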

0004 (previously 0007) - Has not changed, and there is consensus that
this is reasonable. I've re-checked the current code; it looks safe with
respect to returning a different slot, which I doubted before. So consider
this patch also ready.
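
For illustration, the AM-side benefit is that tuple_insert() can now hand back
a slot native to the AM instead of the (possibly virtual) one it received.
A sketch only, where my_am_get_insert_slot() is a hypothetical helper and the
actual insertion is elided:

static TupleTableSlot *
my_am_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
				   int options, struct BulkInsertStateData *bistate)
{
	/* hypothetical: obtain a slot native to this AM for the relation */
	TupleTableSlot *nativeSlot = my_am_get_insert_slot(rel);

	/* copy the incoming (possibly virtual) slot's contents */
	ExecCopySlot(nativeSlot, slot);

	/* ... perform the actual insertion, filling tts_tid and tts_tableOid ... */

	/* the executor continues with the returned slot */
	return nativeSlot;
}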

0005 (previously 0004) - The unused argument in the is_current_xact_tuple()
signature is removed. Also, compared to v1, the code has shifted from table AM
methods to the TTS level.
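
As an illustration of why this belongs at the slot level: an AM-specific slot
that doesn't carry xmin could still answer the question from its own metadata.
A sketch, where MyAmTupleTableSlot, its tuple and undoPtr fields, and
myam_undo_ptr_is_current_xact() are purely hypothetical:

static bool
tts_myam_is_current_xact_tuple(TupleTableSlot *slot)
{
	MyAmTupleTableSlot *mslot = (MyAmTupleTableSlot *) slot;

	Assert(!TTS_EMPTY(slot));

	if (!mslot->tuple)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("don't have a storage tuple in this context")));

	/* hypothetical helper: was this tuple version written by our xact? */
	return myam_undo_ptr_is_current_xact(mslot->undoPtr);
}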

I'd propose to remove Assert(!TTS_EMPTY(slot))
for tts_minimal_is_current_xact_tuple()
and tts_virtual_is_current_xact_tuple(), as these are only error-reporting
functions that don't actually use the slot.

Comment similar to:
+/*
+ * VirtualTupleTableSlots never have a storage tuples.  We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
also applies to tts_minimal_is_current_xact_tuple()

I'd propose changes for clarity of this comment:
%s/a storage tuples/storage tuples/g
%s/generally//g

Otherwise patch 0005 also looks good to me.

I'm planning to review the remaining patches. Meanwhile, I think pushing what
is now ready and uncontroversial is a good idea.
Thank you for the work done on this patchset!

Regards,
Pavel Borisov,
Supabase.

[1]: /messages/by-id/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg@mail.gmail.com

#16Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#15)
13 attachment(s)
Re: Table AM Interface Enhancements

Hi, Pavel!

On Tue, Mar 19, 2024 at 11:34 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Tue, 19 Mar 2024 at 03:34, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sun, Mar 3, 2024 at 1:50 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 27, 2023 at 10:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Nov 25, 2023, at 9:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Should the patch at least document which parts of the EState are expected to be in which states, and which parts should be viewed as undefined? If the implementors of table AMs rely on any/all aspects of EState, doesn't that prevent future changes to how that structure is used?

New tuple tuple_insert_with_arbiter() table AM API method needs EState
argument to call executor functions: ExecCheckIndexConstraints(),
ExecUpdateLockMode(), and ExecInsertIndexTuples(). I think we
probably need to invent some opaque way to call this executor function
without revealing EState to table AM. Do you think this could work?

We're clearly not accessing all of the EState, just some specific fields, such as es_per_tuple_exprcontext. I think you could at least refactor to pass the minimum amount of state information through the table AM API.

Yes, the table AM doesn't need the full EState, just the ability to do
specific manipulation with tuples. I'll refactor the patch to make a
better isolation for this.

Please find the revised patchset attached. The changes are as follows:
1. Patchset is rebased to the current master.
2. Patchset is reordered. I tried to put less debatable patches to the top.
3. tuple_is_current() method is moved from the Table AM API to the
slot as proposed by Matthias van de Meent.
4. Assert added to the table_free_rd_amcache() as proposed by Pavel Borisov.

Patches 0001-0002 are unchanged compared to the last version in thread [1]. In my opinion, they're still ready to be committed, which was not done only because we were too close to feature freeze one year ago.

0003 - Assert added since the previous version. I still have a strong opinion that allowing multi-chunked data structures instead of single chunks is completely safe and a natural, self-justified step in improving Postgres. The patch is simple enough and ready to be pushed.

0004 (previously 0007) - Has not changed, and there is consensus that this is reasonable. I've re-checked the current code; it looks safe with respect to returning a different slot, which I doubted before. So consider this patch also ready.

0005 (previously 0004) - The unused argument in the is_current_xact_tuple() signature is removed. Also, compared to v1, the code has shifted from table AM methods to the TTS level.

I'd propose to remove Assert(!TTS_EMPTY(slot)) for tts_minimal_is_current_xact_tuple() and tts_virtual_is_current_xact_tuple(), as these are only error-reporting functions that don't actually use the slot.

Comment similar to:
+/*
+ * VirtualTupleTableSlots never have a storage tuples.  We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
also applies to tts_minimal_is_current_xact_tuple()

I'd propose changes for clarity of this comment:
%s/a storage tuples/storage tuples/g
%s/generally//g

Otherwise patch 0005 also looks good to me.

I'm planning to review the remaining patches. Meanwhile, I think pushing what is now ready and uncontroversial is a good idea.
Thank you for the work done on this patchset!

Thank you, Pavel!

Regarding 0005, I did apply the "a storage tuples" grammar fix. Regarding
the rest, I'd like to keep the tts_*_is_current_xact_tuple() methods
similar to the nearby tts_*_getsysattr(). This is why I'm keeping the
rest unchanged. I think we could refactor that later, but together with
the tts_*_getsysattr() methods.

I'm going to push 0003, 0004 and 0005 if there are no objections.

And I'll update 0001 and 0002 in their dedicated thread.

------
Regards,
Alexander Korotkov

Attachments:

0002-Add-EvalPlanQual-delete-returning-isolation-test-v3.patchapplication/octet-stream; name=0002-Add-EvalPlanQual-delete-returning-isolation-test-v3.patchDownload
From f03a4b0724edeee05442af6d61bca1e3f0d8d53c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Mar 2023 16:47:09 -0700
Subject: [PATCH 02/13] Add EvalPlanQual delete returning isolation test

Author: Andres Freund
Reviewed-by: Pavel Borisov
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
---
 .../isolation/expected/eval-plan-qual.out     | 30 +++++++++++++++++++
 src/test/isolation/specs/eval-plan-qual.spec  |  4 +++
 2 files changed, 34 insertions(+)

diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 73e0aeb50e7..0237271ceec 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -746,6 +746,36 @@ savings  |    600|    1200
 (2 rows)
 
 
+starting permutation: read wx2 wb1 c2 c1 read
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |    600|    1200
+savings  |    600|    1200
+(2 rows)
+
+step wx2: UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance;
+balance
+-------
+   1050
+(1 row)
+
+step wb1: DELETE FROM accounts WHERE balance = 600 RETURNING *; <waiting ...>
+step c2: COMMIT;
+step wb1: <... completed>
+accountid|balance|balance2
+---------+-------+--------
+savings  |    600|    1200
+(1 row)
+
+step c1: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |   1050|    2100
+(1 row)
+
+
 starting permutation: upsert1 upsert2 c1 c2 read
 step upsert1: 
 	WITH upsert AS
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 735c671734e..edd6d19df3a 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -76,6 +76,8 @@ setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
 step wx1	{ UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 # wy1 then wy2 checks the case where quals pass then fail
 step wy1	{ UPDATE accounts SET balance = balance + 500 WHERE accountid = 'checking' RETURNING balance; }
+# wx2 then wb1 checks the case of re-fetching up-to-date values for DELETE ... RETURNING ...
+step wb1	{ DELETE FROM accounts WHERE balance = 600 RETURNING *; }
 
 step wxext1	{ UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 step tocds1	{ UPDATE accounts SET accountid = 'cds' WHERE accountid = 'checking'; }
@@ -353,6 +355,8 @@ permutation wx1 delwcte c1 c2 read
 # test that a delete to a self-modified row throws error when
 # previously updated by a different cid
 permutation wx1 delwctefail c1 c2 read
+# test that a delete re-fetches up-to-date values for returning clause
+permutation read wx2 wb1 c2 c1 read
 
 permutation upsert1 upsert2 c1 c2 read
 permutation readp1 writep1 readp2 c1 c2
-- 
2.39.3 (Apple Git-145)

0005-Add-TupleTableSlotOps.is_current_xact_tuple-metho-v3.patchapplication/octet-stream; name=0005-Add-TupleTableSlotOps.is_current_xact_tuple-metho-v3.patchDownload
From 8572a71924dc39ff9642b6088117accb4f381786 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:47:53 +0300
Subject: [PATCH 05/13] Add TupleTableSlotOps.is_current_xact_tuple() method

This allows us to abstract how/whether table AM uses transaction identifiers.
A custom table AM can use a custom slot, which may not store xmin directly,
but determine whether the tuple belongs to the current transaction in some other way.
---
 src/backend/executor/execTuples.c   | 79 +++++++++++++++++++++++++++++
 src/backend/utils/adt/ri_triggers.c |  8 +--
 src/include/executor/tuptable.h     | 21 ++++++++
 3 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b1..45b85b15851 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -60,6 +60,7 @@
 #include "access/heaptoast.h"
 #include "access/htup_details.h"
 #include "access/tupdesc_details.h"
+#include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
@@ -148,6 +149,22 @@ tts_virtual_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return 0;					/* silence compiler warnings */
 }
 
+/*
+ * VirtualTupleTableSlots never have storage tuples.  We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
+static bool
+tts_virtual_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	Assert(!TTS_EMPTY(slot));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("don't have a storage tuple in this context")));
+
+	return false;					/* silence compiler warnings */
+}
+
 /*
  * To materialize a virtual slot all the datums that aren't passed by value
  * have to be copied into the slot's memory context.  To do so, compute the
@@ -354,6 +371,29 @@ tts_heap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 						   slot->tts_tupleDescriptor, isnull);
 }
 
+static bool
+tts_heap_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot;
+	TransactionId xmin;
+
+	Assert(!TTS_EMPTY(slot));
+
+	/*
+	 * In some code paths it's possible to get here with a non-materialized
+	 * slot, in which case we can't check if tuple is created by the current
+	 * transaction.
+	 */
+	if (!hslot->tuple)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("don't have a storage tuple in this context")));
+
+	xmin = HeapTupleHeaderGetRawXmin(hslot->tuple->t_data);
+
+	return TransactionIdIsCurrentTransactionId(xmin);
+}
+
 static void
 tts_heap_materialize(TupleTableSlot *slot)
 {
@@ -521,6 +561,18 @@ tts_minimal_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return 0;					/* silence compiler warnings */
 }
 
+static bool
+tts_minimal_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	Assert(!TTS_EMPTY(slot));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("don't have a storage tuple in this context")));
+
+	return false;					/* silence compiler warnings */
+}
+
 static void
 tts_minimal_materialize(TupleTableSlot *slot)
 {
@@ -714,6 +766,29 @@ tts_buffer_heap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 						   slot->tts_tupleDescriptor, isnull);
 }
 
+static bool
+tts_buffer_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	TransactionId xmin;
+
+	Assert(!TTS_EMPTY(slot));
+
+	/*
+	 * In some code paths it's possible to get here with a non-materialized
+	 * slot, in which case we can't check if tuple is created by the current
+	 * transaction.
+	 */
+	if (!bslot->base.tuple)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("don't have a storage tuple in this context")));
+
+	xmin = HeapTupleHeaderGetRawXmin(bslot->base.tuple->t_data);
+
+	return TransactionIdIsCurrentTransactionId(xmin);
+}
+
 static void
 tts_buffer_heap_materialize(TupleTableSlot *slot)
 {
@@ -1029,6 +1104,7 @@ const TupleTableSlotOps TTSOpsVirtual = {
 	.getsomeattrs = tts_virtual_getsomeattrs,
 	.getsysattr = tts_virtual_getsysattr,
 	.materialize = tts_virtual_materialize,
+	.is_current_xact_tuple = tts_virtual_is_current_xact_tuple,
 	.copyslot = tts_virtual_copyslot,
 
 	/*
@@ -1048,6 +1124,7 @@ const TupleTableSlotOps TTSOpsHeapTuple = {
 	.clear = tts_heap_clear,
 	.getsomeattrs = tts_heap_getsomeattrs,
 	.getsysattr = tts_heap_getsysattr,
+	.is_current_xact_tuple = tts_heap_is_current_xact_tuple,
 	.materialize = tts_heap_materialize,
 	.copyslot = tts_heap_copyslot,
 	.get_heap_tuple = tts_heap_get_heap_tuple,
@@ -1065,6 +1142,7 @@ const TupleTableSlotOps TTSOpsMinimalTuple = {
 	.clear = tts_minimal_clear,
 	.getsomeattrs = tts_minimal_getsomeattrs,
 	.getsysattr = tts_minimal_getsysattr,
+	.is_current_xact_tuple = tts_minimal_is_current_xact_tuple,
 	.materialize = tts_minimal_materialize,
 	.copyslot = tts_minimal_copyslot,
 
@@ -1082,6 +1160,7 @@ const TupleTableSlotOps TTSOpsBufferHeapTuple = {
 	.clear = tts_buffer_heap_clear,
 	.getsomeattrs = tts_buffer_heap_getsomeattrs,
 	.getsysattr = tts_buffer_heap_getsysattr,
+	.is_current_xact_tuple = tts_buffer_is_current_xact_tuple,
 	.materialize = tts_buffer_heap_materialize,
 	.copyslot = tts_buffer_heap_copyslot,
 	.get_heap_tuple = tts_buffer_heap_get_heap_tuple,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 2fe93775003..62601a6d80c 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -1260,9 +1260,6 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 {
 	const RI_ConstraintInfo *riinfo;
 	int			ri_nullcheck;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
 
 	/*
 	 * AfterTriggerSaveEvent() handles things such that this function is never
@@ -1330,10 +1327,7 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 	 * this if we knew the INSERT trigger already fired, but there is no easy
 	 * way to know that.)
 	 */
-	xminDatum = slot_getsysattr(oldslot, MinTransactionIdAttributeNumber, &isnull);
-	Assert(!isnull);
-	xmin = DatumGetTransactionId(xminDatum);
-	if (TransactionIdIsCurrentTransactionId(xmin))
+	if (slot_is_current_xact_tuple(oldslot))
 		return true;
 
 	/* If all old and new key values are equal, no check is needed */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a3..c2eddda74a8 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -166,6 +166,12 @@ struct TupleTableSlotOps
 	 */
 	Datum		(*getsysattr) (TupleTableSlot *slot, int attnum, bool *isnull);
 
+	/*
+	 * Check whether the tuple was created by the current transaction.  Throws
+	 * an error if the slot doesn't contain a storage tuple.
+	 */
+	bool		(*is_current_xact_tuple) (TupleTableSlot *slot);
+
 	/*
 	 * Make the contents of the slot solely depend on the slot, and not on
 	 * underlying resources (like another memory context, buffers, etc).
@@ -426,6 +432,21 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * slot_is_current_xact_tuple - check whether the slot's current tuple was
+ *								created by the current transaction.
+ *
+ *  If the slot does not contain a storage tuple, this will throw an error.
+ *  Hence, before calling this function, callers should make sure that the
+ *  slot type supports storage tuples and that one is currently stored in
+ *  the slot.
+ */
+static inline bool
+slot_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	return slot->tts_ops->is_current_xact_tuple(slot);
+}
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
-- 
2.39.3 (Apple Git-145)
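
For illustration, here is a minimal sketch (not part of the patch) of how a
caller that previously read xmin through slot_getsysattr() can switch to the
new slot_is_current_xact_tuple() helper, following the same pattern as the
ri_triggers.c change above.  The function name check_tuple_from_current_xact()
is made up for this example.

/*
 * Hypothetical example only: compare the old and new ways of asking whether
 * the tuple stored in a slot was created by the current transaction.
 */
#include "postgres.h"

#include "access/sysattr.h"
#include "access/xact.h"
#include "executor/tuptable.h"

static bool
check_tuple_from_current_xact(TupleTableSlot *slot)
{
#ifdef USE_OLD_PATTERN
	/* Old pattern: read xmin as a system attribute and compare it. */
	Datum		xminDatum;
	TransactionId xmin;
	bool		isnull;

	xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
	Assert(!isnull);
	xmin = DatumGetTransactionId(xminDatum);
	return TransactionIdIsCurrentTransactionId(xmin);
#else
	/* New pattern: let the slot implementation answer directly. */
	return slot_is_current_xact_tuple(slot);
#endif
}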

0001-Allow-locking-updated-tuples-in-tuple_update-and--v3.patchapplication/octet-stream; name=0001-Allow-locking-updated-tuples-in-tuple_update-and--v3.patchDownload
From 20179ea0e47231575cbfd807a8dd6df6fb6f7a30 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 23 Mar 2023 00:12:00 +0300
Subject: [PATCH 01/13] Allow locking updated tuples in tuple_update() and
 tuple_delete()

Currently, in read committed transaction isolation mode (the default), we have
the following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by a concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This patch eliminates step 2 by taking the lock during the first
tuple_update()/tuple_delete() call.  The heap table access method saves some
effort by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
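
For illustration, a minimal caller-side sketch (not part of the patch) of the
pattern this enables: the new TABLE_MODIFY_* options request that a
concurrently updated row is locked and fetched up front, so the TM_Updated
case can go straight to EvalPlanQual() without a separate table_tuple_lock()
call.  The wrapper delete_with_epq() is hypothetical; the flag names and the
table_tuple_delete() signature are the ones introduced below.

#include "postgres.h"

#include "access/tableam.h"
#include "access/xact.h"
#include "executor/executor.h"

/* Hypothetical wrapper showing the intended calling convention. */
static TupleTableSlot *
delete_with_epq(Relation rel, ItemPointer tid, EState *estate,
				EPQState *epqstate, Index rti, TupleTableSlot *oldSlot)
{
	TM_FailureData tmfd;
	TM_Result	result;
	int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;

	/* Locking the updated row only makes sense when EPQ rechecks happen. */
	if (!IsolationUsesXactSnapshot())
		options |= TABLE_MODIFY_LOCK_UPDATED;

	result = table_tuple_delete(rel, tid,
								estate->es_output_cid,
								estate->es_snapshot,
								estate->es_crosscheck_snapshot,
								options,
								&tmfd,
								false /* changingPart */ ,
								oldSlot);

	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
	{
		/* The latest row version is already locked and stored in oldSlot. */
		Assert(tmfd.traversed);
		return EvalPlanQual(epqstate, rel, rti, oldSlot);
	}

	/* Other outcomes (TM_Ok, TM_Deleted, TM_SelfModified) omitted here. */
	return NULL;
}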
---
 src/backend/access/heap/heapam.c         | 205 ++++++++++----
 src/backend/access/heap/heapam_handler.c |  94 +++++--
 src/backend/access/table/tableam.c       |  26 +-
 src/backend/commands/trigger.c           |  55 ++--
 src/backend/executor/execReplication.c   |  19 +-
 src/backend/executor/nodeModifyTable.c   | 329 +++++++++--------------
 src/include/access/heapam.h              |  19 +-
 src/include/access/tableam.h             |  69 +++--
 src/include/commands/trigger.h           |   4 +-
 9 files changed, 474 insertions(+), 346 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625f..f6478f89e77 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2499,10 +2499,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 }
 
 /*
- *	heap_delete - delete a tuple
+ *	heap_delete - delete a tuple, optionally fetching it into a slot
  *
  * See table_tuple_delete() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, we don't
+ * place a lock on the tuple in this function, just fetch the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2511,8 +2512,9 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, ItemPointer tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart)
+			CommandId cid, Snapshot crosscheck, int options,
+			TM_FailureData *tmfd, bool changingPart,
+			TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -2590,7 +2592,7 @@ l1:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to delete invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -2731,7 +2733,30 @@ l1:
 			tmfd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = tp;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
 		if (vmbuffer != InvalidBuffer)
@@ -2905,8 +2930,24 @@ l1:
 	 */
 	CacheInvalidateHeapTuple(relation, &tp, NULL);
 
-	/* Now we can release the buffer */
-	ReleaseBuffer(buffer);
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = tp;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
 
 	/*
 	 * Release the lmgr tuple lock, if we had it.
@@ -2938,8 +2979,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, false /* changingPart */ );
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, false /* changingPart */ , NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -2966,10 +3007,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 }
 
 /*
- *	heap_update - replace a tuple
+ *	heap_update - replace a tuple, optionally fetching it into a slot
  *
  * See table_tuple_update() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, we don't
+ * place a lock on the tuple in this function, just fetch the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2978,9 +3020,9 @@ simple_heap_delete(Relation relation, ItemPointer tid)
  */
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			CommandId cid, Snapshot crosscheck, int options,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
-			TU_UpdateIndexes *update_indexes)
+			TU_UpdateIndexes *update_indexes, TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3157,7 +3199,7 @@ l2:
 	result = HeapTupleSatisfiesUpdate(&oldtup, cid, buffer);
 
 	/* see below about the "no wait" case */
-	Assert(result != TM_BeingModified || wait);
+	Assert(result != TM_BeingModified || (options & TABLE_MODIFY_WAIT));
 
 	if (result == TM_Invisible)
 	{
@@ -3166,7 +3208,7 @@ l2:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to update invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -3370,7 +3412,30 @@ l2:
 			tmfd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = oldtup;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
 		if (vmbuffer != InvalidBuffer)
@@ -3849,7 +3914,26 @@ l2:
 	/* Now we can release the buffer(s) */
 	if (newbuf != buffer)
 		ReleaseBuffer(newbuf);
-	ReleaseBuffer(buffer);
+
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = oldtup;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
+
 	if (BufferIsValid(vmbuffer_new))
 		ReleaseBuffer(vmbuffer_new);
 	if (BufferIsValid(vmbuffer))
@@ -4057,8 +4141,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
 
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, &lockmode, update_indexes);
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, &lockmode, update_indexes, NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -4121,12 +4205,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  *		tuples.
  *
  * Output parameters:
- *	*tuple: all fields filled in
- *	*buffer: set to buffer holding tuple (pinned but not locked at exit)
+ *	*slot: BufferHeapTupleTableSlot filled with tuple
  *	*tmfd: filled in failure cases (see below)
  *
  * Function results are the same as the ones for table_tuple_lock().
  *
+ * If *slot already contains the target tuple, it takes advantage of that by
+ * skipping the ReadBuffer() call.
+ *
  * In the failure cases other than TM_Invisible, the routine fills
  * *tmfd with the tuple's t_ctid, t_xmax (resolving a possible MultiXact,
  * if necessary), and t_cmax (the last only for TM_SelfModified,
@@ -4137,15 +4223,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  * See README.tuplock for a thorough explanation of this mechanism.
  */
 TM_Result
-heap_lock_tuple(Relation relation, HeapTuple tuple,
+heap_lock_tuple(Relation relation, ItemPointer tid, TupleTableSlot *slot,
 				CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-				bool follow_updates,
-				Buffer *buffer, TM_FailureData *tmfd)
+				bool follow_updates, TM_FailureData *tmfd)
 {
 	TM_Result	result;
-	ItemPointer tid = &(tuple->t_self);
 	ItemId		lp;
 	Page		page;
+	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	BlockNumber block;
 	TransactionId xid,
@@ -4157,8 +4242,24 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	bool		skip_tuple_lock = false;
 	bool		have_tuple_lock = false;
 	bool		cleared_all_frozen = false;
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	HeapTuple	tuple = &bslot->base.tupdata;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	/* Take advantage if the slot already contains the relevant tuple */
+	if (!TTS_EMPTY(slot) &&
+		slot->tts_tableOid == relation->rd_id &&
+		ItemPointerCompare(&slot->tts_tid, tid) == 0 &&
+		BufferIsValid(bslot->buffer))
+	{
+		buffer = bslot->buffer;
+		IncrBufferRefCount(buffer);
+	}
+	else
+	{
+		buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	}
 	block = ItemPointerGetBlockNumber(tid);
 
 	/*
@@ -4167,21 +4268,22 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	 * in the middle of changing this, so we'll need to recheck after we have
 	 * the lock.
 	 */
-	if (PageIsAllVisible(BufferGetPage(*buffer)))
+	if (PageIsAllVisible(BufferGetPage(buffer)))
 		visibilitymap_pin(relation, block, &vmbuffer);
 
-	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	page = BufferGetPage(*buffer);
+	page = BufferGetPage(buffer);
 	lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
 	Assert(ItemIdIsNormal(lp));
 
+	tuple->t_self = *tid;
 	tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
 	tuple->t_len = ItemIdGetLength(lp);
 	tuple->t_tableOid = RelationGetRelid(relation);
 
 l3:
-	result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
+	result = HeapTupleSatisfiesUpdate(tuple, cid, buffer);
 
 	if (result == TM_Invisible)
 	{
@@ -4210,7 +4312,7 @@ l3:
 		infomask2 = tuple->t_data->t_infomask2;
 		ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
 
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If any subtransaction of the current top transaction already holds
@@ -4362,12 +4464,12 @@ l3:
 					{
 						result = res;
 						/* recovery code expects to have buffer lock held */
-						LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+						LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 						goto failed;
 					}
 				}
 
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4402,7 +4504,7 @@ l3:
 			if (HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
 				!HEAP_XMAX_IS_EXCL_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4430,7 +4532,7 @@ l3:
 					 * No conflict, but if the xmax changed under us in the
 					 * meantime, start over.
 					 */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 						!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 											 xwait))
@@ -4442,7 +4544,7 @@ l3:
 			}
 			else if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/* if the xmax changed in the meantime, start over */
 				if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
@@ -4470,7 +4572,7 @@ l3:
 			TransactionIdIsCurrentTransactionId(xwait))
 		{
 			/* ... but if the xmax changed in the meantime, start over */
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 				!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 									 xwait))
@@ -4492,7 +4594,7 @@ l3:
 		 */
 		if (require_sleep && (result == TM_Updated || result == TM_Deleted))
 		{
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			goto failed;
 		}
 		else if (require_sleep)
@@ -4517,7 +4619,7 @@ l3:
 				 */
 				result = TM_WouldBlock;
 				/* recovery code expects to have buffer lock held */
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 				goto failed;
 			}
 
@@ -4543,7 +4645,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4583,7 +4685,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4609,12 +4711,12 @@ l3:
 				{
 					result = res;
 					/* recovery code expects to have buffer lock held */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					goto failed;
 				}
 			}
 
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 			/*
 			 * xwait is done, but if xwait had just locked the tuple then some
@@ -4636,7 +4738,7 @@ l3:
 				 * don't check for this in the multixact case, because some
 				 * locker transactions might still be running.
 				 */
-				UpdateXmaxHintBits(tuple->t_data, *buffer, xwait);
+				UpdateXmaxHintBits(tuple->t_data, buffer, xwait);
 			}
 		}
 
@@ -4695,9 +4797,9 @@ failed:
 	 */
 	if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
 	{
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 		visibilitymap_pin(relation, block, &vmbuffer);
-		LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		goto l3;
 	}
 
@@ -4760,7 +4862,7 @@ failed:
 		cleared_all_frozen = true;
 
 
-	MarkBufferDirty(*buffer);
+	MarkBufferDirty(buffer);
 
 	/*
 	 * XLOG stuff.  You might think that we don't need an XLOG record because
@@ -4780,7 +4882,7 @@ failed:
 		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
-		XLogRegisterBuffer(0, *buffer, REGBUF_STANDARD);
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
 
 		xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
 		xlrec.xmax = xid;
@@ -4801,7 +4903,7 @@ failed:
 	result = TM_Ok;
 
 out_locked:
-	LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 out_unlocked:
 	if (BufferIsValid(vmbuffer))
@@ -4819,6 +4921,9 @@ out_unlocked:
 	if (have_tuple_lock)
 		UnlockTupleTuplock(relation, tid, mode);
 
+	/* Put the target tuple into the slot */
+	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
+
 	return result;
 }
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..7c7204a2422 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -45,6 +45,12 @@
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
+static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+								   Snapshot snapshot, TupleTableSlot *slot,
+								   CommandId cid, LockTupleMode mode,
+								   LockWaitPolicy wait_policy, uint8 flags,
+								   TM_FailureData *tmfd);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -298,23 +304,55 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
-					Snapshot snapshot, Snapshot crosscheck, bool wait,
-					TM_FailureData *tmfd, bool changingPart)
+					Snapshot snapshot, Snapshot crosscheck, int options,
+					TM_FailureData *tmfd, bool changingPart,
+					TupleTableSlot *oldSlot)
 {
+	TM_Result	result;
+
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+	result = heap_delete(relation, tid, cid, crosscheck, options,
+						 tmfd, changingPart, oldSlot);
+
+	/*
+	 * If the tuple has been concurrently updated, then get the lock on it
+	 * (but only if the caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option).  With the lock held, a retry of the
+	 * delete should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of the tuple loaded into
+		 * oldSlot by heap_delete().
+		 */
+		result = heapam_tuple_lock(relation, tid, snapshot,
+								   oldSlot, cid, LockTupleExclusive,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
+	return result;
 }
 
 
 static TM_Result
 heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-					bool wait, TM_FailureData *tmfd,
-					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes)
+					int options, TM_FailureData *tmfd,
+					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
+					TupleTableSlot *oldSlot)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -324,8 +362,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode, update_indexes);
+	result = heap_update(relation, otid, tuple, cid, crosscheck, options,
+						 tmfd, lockmode, update_indexes, oldSlot);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
@@ -352,6 +390,31 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	if (shouldFree)
 		pfree(tuple);
 
+	/*
+	 * If the tuple has been concurrently updated, then get the lock on it
+	 * (but only if the caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option).  With the lock held, a retry of the
+	 * update should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of the tuple loaded into
+		 * oldSlot by heap_update().
+		 */
+		result = heapam_tuple_lock(relation, otid, snapshot,
+								   oldSlot, cid, *lockmode,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
 	return result;
 }
 
@@ -363,7 +426,6 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
-	Buffer		buffer;
 	HeapTuple	tuple = &bslot->base.tupdata;
 	bool		follow_updates;
 
@@ -373,9 +435,8 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 tuple_lock_retry:
-	tuple->t_self = *tid;
-	result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
-							 follow_updates, &buffer, tmfd);
+	result = heap_lock_tuple(relation, tid, slot, cid, mode, wait_policy,
+							 follow_updates, tmfd);
 
 	if (result == TM_Updated &&
 		(flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
@@ -383,8 +444,6 @@ tuple_lock_retry:
 		/* Should not encounter speculative tuple on recheck */
 		Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
 
-		ReleaseBuffer(buffer);
-
 		if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
 		{
 			SnapshotData SnapshotDirty;
@@ -406,6 +465,8 @@ tuple_lock_retry:
 			InitDirtySnapshot(SnapshotDirty);
 			for (;;)
 			{
+				Buffer		buffer = InvalidBuffer;
+
 				if (ItemPointerIndicatesMovedPartitions(tid))
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
@@ -500,7 +561,7 @@ tuple_lock_retry:
 					/*
 					 * This is a live tuple, so try to lock it again.
 					 */
-					ReleaseBuffer(buffer);
+					ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
 					goto tuple_lock_retry;
 				}
 
@@ -511,7 +572,7 @@ tuple_lock_retry:
 				 */
 				if (tuple->t_data == NULL)
 				{
-					Assert(!BufferIsValid(buffer));
+					ReleaseBuffer(buffer);
 					return TM_Deleted;
 				}
 
@@ -564,9 +625,6 @@ tuple_lock_retry:
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	/* store in slot, transferring existing pin */
-	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
-
 	return result;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e57a0b7ea31..8d3675be959 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -287,16 +287,23 @@ simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
  * via ereport().
  */
 void
-simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot)
+simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch old tuple if the relevant slot is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_delete(rel, tid,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, false /* changingPart */ );
+								options,
+								&tmfd, false /* changingPart */ ,
+								oldSlot);
 
 	switch (result)
 	{
@@ -335,17 +342,24 @@ void
 simple_table_tuple_update(Relation rel, ItemPointer otid,
 						  TupleTableSlot *slot,
 						  Snapshot snapshot,
-						  TU_UpdateIndexes *update_indexes)
+						  TU_UpdateIndexes *update_indexes,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch old tuple if the relevant slot is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_update(rel, otid, slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, &lockmode, update_indexes);
+								options,
+								&tmfd, &lockmode, update_indexes,
+								oldSlot);
 
 	switch (result)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 35eb7180f7e..3309b4ebd2d 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2773,8 +2773,8 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 void
 ExecARDeleteTriggers(EState *estate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *slot,
 					 TransitionCaptureState *transition_capture,
 					 bool is_crosspart_update)
 {
@@ -2783,20 +2783,11 @@ ExecARDeleteTriggers(EState *estate,
 	if ((trigdesc && trigdesc->trig_delete_after_row) ||
 		(transition_capture && transition_capture->tcs_delete_old_table))
 	{
-		TupleTableSlot *slot = ExecGetTriggerOldSlot(estate, relinfo);
-
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
-			GetTupleForTrigger(estate,
-							   NULL,
-							   relinfo,
-							   tupleid,
-							   LockTupleExclusive,
-							   slot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else
+		/*
+		 * Put the FDW old tuple into the slot.  Otherwise, the caller is
+		 * expected to have already fetched the old tuple into the slot.
+		 */
+		if (fdw_trigtuple != NULL)
 			ExecForceStoreHeapTuple(fdw_trigtuple, slot, false);
 
 		AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
@@ -3087,18 +3078,17 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
  * Note: 'src_partinfo' and 'dst_partinfo', when non-NULL, refer to the source
  * and destination partitions, respectively, of a cross-partition update of
  * the root partitioned table mentioned in the query, given by 'relinfo'.
- * 'tupleid' in that case refers to the ctid of the "old" tuple in the source
- * partition, and 'newslot' contains the "new" tuple in the destination
- * partition.  This interface allows to support the requirements of
- * ExecCrossPartitionUpdateForeignKey(); is_crosspart_update must be true in
- * that case.
+ * 'oldslot' contains the "old" tuple in the source partition, and 'newslot'
+ * contains the "new" tuple in the destination partition.  This interface
+ * allows to support the requirements of ExecCrossPartitionUpdateForeignKey();
+ * is_crosspart_update must be true in that case.
  */
 void
 ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 					 ResultRelInfo *src_partinfo,
 					 ResultRelInfo *dst_partinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *oldslot,
 					 TupleTableSlot *newslot,
 					 List *recheckIndexes,
 					 TransitionCaptureState *transition_capture,
@@ -3117,29 +3107,14 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 		 * separately for DELETE and INSERT to capture transition table rows.
 		 * In such case, either old tuple or new tuple can be NULL.
 		 */
-		TupleTableSlot *oldslot;
-		ResultRelInfo *tupsrc;
-
 		Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
 			   !is_crosspart_update);
 
-		tupsrc = src_partinfo ? src_partinfo : relinfo;
-		oldslot = ExecGetTriggerOldSlot(estate, tupsrc);
-
-		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
-			GetTupleForTrigger(estate,
-							   NULL,
-							   tupsrc,
-							   tupleid,
-							   LockTupleExclusive,
-							   oldslot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else if (fdw_trigtuple != NULL)
+		if (fdw_trigtuple != NULL)
+		{
+			Assert(oldslot);
 			ExecForceStoreHeapTuple(fdw_trigtuple, oldslot, false);
-		else
-			ExecClearTuple(oldslot);
+		}
 
 		AfterTriggerSaveEvent(estate, relinfo,
 							  src_partinfo, dst_partinfo,
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index d0a89cd5778..0cad843fb69 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -577,6 +577,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 	{
 		List	   *recheckIndexes = NIL;
 		TU_UpdateIndexes update_indexes;
+		TupleTableSlot *oldSlot = NULL;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -590,8 +591,12 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		if (rel->rd_rel->relispartition)
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		simple_table_tuple_update(rel, tid, slot, estate->es_snapshot,
-								  &update_indexes);
+								  &update_indexes, oldSlot);
 
 		if (resultRelInfo->ri_NumIndices > 0 && (update_indexes != TU_None))
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
@@ -602,7 +607,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		/* AFTER ROW UPDATE Triggers */
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tid, NULL, slot,
+							 NULL, oldSlot, slot,
 							 recheckIndexes, NULL, false);
 
 		list_free(recheckIndexes);
@@ -636,12 +641,18 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 
 	if (!skip_tuple)
 	{
+		TupleTableSlot *oldSlot = NULL;
+
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_delete_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		/* OK, delete the tuple */
-		simple_table_tuple_delete(rel, tid, estate->es_snapshot);
+		simple_table_tuple_delete(rel, tid, estate->es_snapshot, oldSlot);
 
 		/* AFTER ROW DELETE Triggers */
 		ExecARDeleteTriggers(estate, resultRelInfo,
-							 tid, NULL, NULL, false);
+							 NULL, oldSlot, NULL, false);
 	}
 }
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4abfe82f7fb..9deeaceb35c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -125,7 +125,7 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
 											   ItemPointer tupleid,
-											   TupleTableSlot *oldslot,
+											   TupleTableSlot *oldSlot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
@@ -565,6 +565,10 @@ ExecInitInsertProjection(ModifyTableState *mtstate,
 	resultRelInfo->ri_newTupleSlot =
 		table_slot_create(resultRelInfo->ri_RelationDesc,
 						  &estate->es_tupleTable);
+	if (node->onConflictAction == ONCONFLICT_UPDATE)
+		resultRelInfo->ri_oldTupleSlot =
+			table_slot_create(resultRelInfo->ri_RelationDesc,
+							  &estate->es_tupleTable);
 
 	/* Build ProjectionInfo if needed (it probably isn't). */
 	if (need_projection)
@@ -1154,7 +1158,7 @@ ExecInsert(ModifyTableContext *context,
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
 							 NULL,
-							 NULL,
+							 resultRelInfo->ri_oldTupleSlot,
 							 slot,
 							 NULL,
 							 mtstate->mt_transition_capture,
@@ -1334,7 +1338,8 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart)
+			  ItemPointer tupleid, bool changingPart, int options,
+			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
 
@@ -1342,9 +1347,10 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 							  estate->es_output_cid,
 							  estate->es_snapshot,
 							  estate->es_crosscheck_snapshot,
-							  true /* wait for commit */ ,
+							  options /* wait for commit */ ,
 							  &context->tmfd,
-							  changingPart);
+							  changingPart,
+							  oldSlot);
 }
 
 /*
@@ -1356,7 +1362,8 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, bool changingPart)
+				   ItemPointer tupleid, HeapTuple oldtuple,
+				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	EState	   *estate = context->estate;
@@ -1374,8 +1381,8 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	{
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tupleid, oldtuple,
-							 NULL, NULL, mtstate->mt_transition_capture,
+							 oldtuple,
+							 slot, NULL, NULL, mtstate->mt_transition_capture,
 							 false);
 
 		/*
@@ -1386,10 +1393,30 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 
 	/* AFTER ROW DELETE Triggers */
-	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
+	ExecARDeleteTriggers(estate, resultRelInfo, oldtuple, slot,
 						 ar_delete_trig_tcs, changingPart);
 }
 
+/*
+ * Initializes the tuple slot in a ResultRelInfo for DELETE action.
+ *
+ * We mark 'projectNewInfoValid' even though the projections themselves
+ * are not initialized here.
+ */
+static void
+ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
+						ResultRelInfo *resultRelInfo)
+{
+	EState	   *estate = mtstate->ps.state;
+
+	Assert(!resultRelInfo->ri_projectNewInfoValid);
+
+	resultRelInfo->ri_oldTupleSlot =
+		table_slot_create(resultRelInfo->ri_RelationDesc,
+						  &estate->es_tupleTable);
+	resultRelInfo->ri_projectNewInfoValid = true;
+}
+
 /* ----------------------------------------------------------------
  *		ExecDelete
  *
@@ -1417,6 +1444,7 @@ ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
+		   TupleTableSlot *oldSlot,
 		   bool processReturning,
 		   bool changingPart,
 		   bool canSetTag,
@@ -1480,6 +1508,11 @@ ExecDelete(ModifyTableContext *context,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		if (!IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * delete the tuple
 		 *
@@ -1490,7 +1523,8 @@ ExecDelete(ModifyTableContext *context,
 		 * transaction-snapshot mode transactions.
 		 */
 ldelete:
-		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart);
+		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
+							   options, oldSlot);
 
 		if (tmresult)
 			*tmresult = result;
@@ -1537,7 +1571,6 @@ ldelete:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
 
 					if (IsolationUsesXactSnapshot())
@@ -1546,87 +1579,29 @@ ldelete:
 								 errmsg("could not serialize access due to concurrent update")));
 
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ.  The latest tuple has already been found
+					 * and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					EvalPlanQualBegin(context->epqstate);
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  LockTupleExclusive, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldSlot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
 
-					switch (result)
+					/*
+					 * If requested, skip delete and pass back the updated
+					 * row.
+					 */
+					if (epqreturnslot)
 					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/*
-							 * If requested, skip delete and pass back the
-							 * updated row.
-							 */
-							if (epqreturnslot)
-							{
-								*epqreturnslot = epqslot;
-								return NULL;
-							}
-							else
-								goto ldelete;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously updated by this
-							 * command, ignore the delete, otherwise error
-							 * out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_delete() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be deleted was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						default:
-
-							/*
-							 * TM_Invisible should be impossible because we're
-							 * waiting for updated row versions, and would
-							 * already have errored out if the first version
-							 * is invisible.
-							 *
-							 * TM_Updated should be impossible, because we're
-							 * locking the latest version via
-							 * TUPLE_LOCK_FLAG_FIND_LAST_VERSION.
-							 */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
+						*epqreturnslot = epqslot;
+						return NULL;
 					}
-
-					Assert(false);
-					break;
+					else
+						goto ldelete;
 				}
 
 			case TM_Deleted:
@@ -1660,7 +1635,8 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple, changingPart);
+	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+					   oldSlot, changingPart);
 
 	/* Process RETURNING if present and if requested */
 	if (processReturning && resultRelInfo->ri_projectReturning)
@@ -1678,17 +1654,13 @@ ldelete:
 		}
 		else
 		{
+			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
 			if (oldtuple != NULL)
-			{
 				ExecForceStoreHeapTuple(oldtuple, slot, false);
-			}
 			else
-			{
-				if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
-												   SnapshotAny, slot))
-					elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
-			}
+				ExecCopySlot(slot, oldSlot);
+			Assert(!TupIsNull(slot));
 		}
 
 		rslot = ExecProcessReturning(resultRelInfo, slot, context->planSlot);
@@ -1788,12 +1760,16 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
+	/* Make sure ri_oldTupleSlot is initialized. */
+	if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+		ExecInitUpdateProjection(mtstate, resultRelInfo);
+
 	/*
 	 * Row movement, part 1.  Delete the tuple, but skip RETURNING processing.
 	 * We want to return rows from INSERT.
 	 */
 	ExecDelete(context, resultRelInfo,
-			   tupleid, oldtuple,
+			   tupleid, oldtuple, resultRelInfo->ri_oldTupleSlot,
 			   false,			/* processReturning */
 			   true,			/* changingPart */
 			   false,			/* canSetTag */
@@ -1834,21 +1810,13 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 			return true;
 		else
 		{
-			/* Fetch the most recent version of old tuple. */
-			TupleTableSlot *oldSlot;
-
-			/* ... but first, make sure ri_oldTupleSlot is initialized. */
-			if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-				ExecInitUpdateProjection(mtstate, resultRelInfo);
-			oldSlot = resultRelInfo->ri_oldTupleSlot;
-			if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
-											   tupleid,
-											   SnapshotAny,
-											   oldSlot))
-				elog(ERROR, "failed to fetch tuple being updated");
-			/* and project the new tuple to retry the UPDATE with */
+			/*
+			 * ExecDelete already fetched the most recent version of the old
+			 * tuple into resultRelInfo->ri_oldTupleSlot.  So, just project the
+			 * new tuple to retry the UPDATE with.
+			 */
 			*retry_slot = ExecGetUpdateNewTuple(resultRelInfo, epqslot,
-												oldSlot);
+												resultRelInfo->ri_oldTupleSlot);
 			return false;
 		}
 	}
@@ -1967,7 +1935,8 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-			  bool canSetTag, UpdateContext *updateCxt)
+			  bool canSetTag, int options, TupleTableSlot *oldSlot,
+			  UpdateContext *updateCxt)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2059,7 +2028,8 @@ lreplace:
 				ExecCrossPartitionUpdateForeignKey(context,
 												   resultRelInfo,
 												   insert_destrel,
-												   tupleid, slot,
+												   tupleid,
+												   resultRelInfo->ri_oldTupleSlot,
 												   inserted_tuple);
 
 			return TM_Ok;
@@ -2102,9 +2072,10 @@ lreplace:
 								estate->es_output_cid,
 								estate->es_snapshot,
 								estate->es_crosscheck_snapshot,
-								true /* wait for commit */ ,
+								options /* wait for commit */ ,
 								&context->tmfd, &updateCxt->lockmode,
-								&updateCxt->updateIndexes);
+								&updateCxt->updateIndexes,
+								oldSlot);
 
 	return result;
 }
@@ -2118,7 +2089,8 @@ lreplace:
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
-				   HeapTuple oldtuple, TupleTableSlot *slot)
+				   HeapTuple oldtuple, TupleTableSlot *slot,
+				   TupleTableSlot *oldSlot)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	List	   *recheckIndexes = NIL;
@@ -2134,7 +2106,7 @@ ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 	/* AFTER ROW UPDATE Triggers */
 	ExecARUpdateTriggers(context->estate, resultRelInfo,
 						 NULL, NULL,
-						 tupleid, oldtuple, slot,
+						 oldtuple, oldSlot, slot,
 						 recheckIndexes,
 						 mtstate->operation == CMD_INSERT ?
 						 mtstate->mt_oc_transition_capture :
@@ -2223,7 +2195,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 	/* Perform the root table's triggers. */
 	ExecARUpdateTriggers(context->estate,
 						 rootRelInfo, sourcePartInfo, destPartInfo,
-						 tupleid, NULL, newslot, NIL, NULL, true);
+						 NULL, oldslot, newslot, NIL, NULL, true);
 }
 
 /* ----------------------------------------------------------------
@@ -2246,6 +2218,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  *		no relevant triggers.
  *
  *		slot contains the new tuple value to be stored.
+ *		oldSlot is the slot to store the old tuple.
  *		planSlot is the output of the ModifyTable's subplan; we use it
  *		to access values from other input tables (for RETURNING),
  *		row-ID junk columns, etc.
@@ -2256,7 +2229,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-		   bool canSetTag)
+		   TupleTableSlot *oldSlot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2309,6 +2282,11 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		if (!locked && !IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
 		 * must loop back here to try again.  (We don't need to redo triggers,
@@ -2318,7 +2296,7 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		 */
 redo_act:
 		result = ExecUpdateAct(context, resultRelInfo, tupleid, oldtuple, slot,
-							   canSetTag, &updateCxt);
+							   canSetTag, options, oldSlot, &updateCxt);
 
 		/*
 		 * If ExecUpdateAct reports that a cross-partition update was done,
@@ -2369,88 +2347,30 @@ redo_act:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
-					TupleTableSlot *oldSlot;
 
 					if (IsolationUsesXactSnapshot())
 						ereport(ERROR,
 								(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 								 errmsg("could not serialize access due to concurrent update")));
+					Assert(!locked);
 
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ.  The latest tuple has already been found
+					 * and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  updateCxt.lockmode, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
-
-					switch (result)
-					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/* Make sure ri_oldTupleSlot is initialized. */
-							if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-								ExecInitUpdateProjection(context->mtstate,
-														 resultRelInfo);
-
-							/* Fetch the most recent version of old tuple. */
-							oldSlot = resultRelInfo->ri_oldTupleSlot;
-							if (!table_tuple_fetch_row_version(resultRelationDesc,
-															   tupleid,
-															   SnapshotAny,
-															   oldSlot))
-								elog(ERROR, "failed to fetch tuple being updated");
-							slot = ExecGetUpdateNewTuple(resultRelInfo,
-														 epqslot, oldSlot);
-							goto redo_act;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously modified by
-							 * this command, ignore the redundant update,
-							 * otherwise error out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_update() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						default:
-							/* see table_tuple_lock call in ExecDelete() */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
-					}
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldSlot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
+					slot = ExecGetUpdateNewTuple(resultRelInfo,
+												 epqslot,
+												 oldSlot);
+					goto redo_act;
 				}
 
 				break;
@@ -2474,7 +2394,7 @@ redo_act:
 		(estate->es_processed)++;
 
 	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
-					   slot);
+					   slot, oldSlot);
 
 	/* Process RETURNING if present */
 	if (resultRelInfo->ri_projectReturning)
@@ -2692,7 +2612,8 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	*returning = ExecUpdate(context, resultRelInfo,
 							conflictTid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
-							canSetTag);
+							existing,
+							canSetTag, true);
 
 	/*
 	 * Clear out existing tuple, as there might not be another conflict among
@@ -2934,6 +2855,7 @@ lmerge_matched:
 				{
 					result = ExecUpdateAct(context, resultRelInfo, tupleid,
 										   NULL, newslot, canSetTag,
+										   TABLE_MODIFY_WAIT, NULL,
 										   &updateCxt);
 
 					/*
@@ -2956,7 +2878,8 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot);
+									   tupleid, NULL, newslot,
+									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
 				break;
@@ -2987,12 +2910,12 @@ lmerge_matched:
 				}
 				else
 					result = ExecDeleteAct(context, resultRelInfo, tupleid,
-										   false);
+										   false, TABLE_MODIFY_WAIT, NULL);
 
 				if (result == TM_Ok)
 				{
 					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
-									   false);
+									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
 				break;
@@ -4006,12 +3929,18 @@ ExecModifyTable(PlanState *pstate)
 
 				/* Now apply the update. */
 				slot = ExecUpdate(&context, resultRelInfo, tupleid, oldtuple,
-								  slot, node->canSetTag);
+								  slot, resultRelInfo->ri_oldTupleSlot,
+								  node->canSetTag, false);
 				break;
 
 			case CMD_DELETE:
+				/* Initialize slot for DELETE to fetch the old tuple */
+				if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+					ExecInitDeleteTupleSlot(node, resultRelInfo);
+
 				slot = ExecDelete(&context, resultRelInfo, tupleid, oldtuple,
-								  true, false, node->canSetTag, NULL, NULL, NULL);
+								  resultRelInfo->ri_oldTupleSlot, true, false,
+								  node->canSetTag, NULL, NULL, NULL);
 				break;
 
 			case CMD_MERGE:
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..45954b8003d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -276,19 +276,22 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
-							 struct TM_FailureData *tmfd, bool changingPart);
+							 CommandId cid, Snapshot crosscheck, int options,
+							 struct TM_FailureData *tmfd, bool changingPart,
+							 TupleTableSlot *oldSlot);
 extern void heap_finish_speculative(Relation relation, ItemPointer tid);
 extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
 							 HeapTuple newtup,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes);
-extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
-								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-								 bool follow_updates,
-								 Buffer *buffer, struct TM_FailureData *tmfd);
+							 TU_UpdateIndexes *update_indexes,
+							 TupleTableSlot *oldSlot);
+extern TM_Result heap_lock_tuple(Relation relation, ItemPointer tid,
+								 TupleTableSlot *slot,
+								 CommandId cid, LockTupleMode mode,
+								 LockWaitPolicy wait_policy, bool follow_updates,
+								 struct TM_FailureData *tmfd);
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8249b37bbf1..467bdc09d36 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -259,6 +259,11 @@ typedef struct TM_IndexDeleteOp
 /* Follow update chain and lock latest version of tuple */
 #define TUPLE_LOCK_FLAG_FIND_LAST_VERSION		(1 << 1)
 
+/* "options" flag bits for table_tuple_update and table_tuple_delete */
+#define TABLE_MODIFY_WAIT			0x0001
+#define TABLE_MODIFY_FETCH_OLD_TUPLE 0x0002
+#define TABLE_MODIFY_LOCK_UPDATED	0x0004
+
 
 /* Typedef for callback function for table_index_build_scan */
 typedef void (*IndexBuildCallback) (Relation index,
@@ -528,9 +533,10 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
-								 bool changingPart);
+								 bool changingPart,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
@@ -539,10 +545,11 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
 								 LockTupleMode *lockmode,
-								 TU_UpdateIndexes *update_indexes);
+								 TU_UpdateIndexes *update_indexes,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
@@ -1452,7 +1459,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
 }
 
 /*
- * Delete a tuple.
+ * Delete a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_delete instead.
@@ -1463,11 +1470,21 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If the TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple
+ *		is fetched into oldSlot when the deletion is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	changingPart - true iff the tuple is being moved to another partition
  *		table due to an update of the partition key. Otherwise, false.
+ *	oldSlot - slot to save the deleted or locked tuple. Can be NULL if none of
+ *		TABLE_MODIFY_FETCH_OLD_TUPLE or TABLE_MODIFY_LOCK_UPDATED options
+ *		is specified.
  *
  * Normal, successful return value is TM_Ok, which means we did actually
  * delete it.  Failure return codes are TM_SelfModified, TM_Updated, and
@@ -1479,16 +1496,18 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  */
 static inline TM_Result
 table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
-				   Snapshot snapshot, Snapshot crosscheck, bool wait,
-				   TM_FailureData *tmfd, bool changingPart)
+				   Snapshot snapshot, Snapshot crosscheck, int options,
+				   TM_FailureData *tmfd, bool changingPart,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_delete(rel, tid, cid,
 										 snapshot, crosscheck,
-										 wait, tmfd, changingPart);
+										 options, tmfd, changingPart,
+										 oldSlot);
 }
 
 /*
- * Update a tuple.
+ * Update a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless you are prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_update instead.
@@ -1500,13 +1519,23 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If the TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple
+ *		is fetched into oldSlot when the update is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	lockmode - filled with lock mode acquired on tuple
  *  update_indexes - in success cases this is set to true if new index entries
  *		are required for this tuple
- *
+ *	oldSlot - slot to save the old version of the updated or locked tuple.
+ *		Can be NULL if none of TABLE_MODIFY_FETCH_OLD_TUPLE or
+ *		TABLE_MODIFY_LOCK_UPDATED options is specified.
+ *
  * Normal, successful return value is TM_Ok, which means we did actually
  * update it.  Failure return codes are TM_SelfModified, TM_Updated, and
  * TM_BeingModified (the last only possible if wait == false).
@@ -1524,13 +1553,15 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
 static inline TM_Result
 table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-				   bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
-				   TU_UpdateIndexes *update_indexes)
+				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
+				   TU_UpdateIndexes *update_indexes,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_update(rel, otid, slot,
 										 cid, snapshot, crosscheck,
-										 wait, tmfd,
-										 lockmode, update_indexes);
+										 options, tmfd,
+										 lockmode, update_indexes,
+										 oldSlot);
 }
 
 /*
@@ -2046,10 +2077,12 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 
 extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
-									  Snapshot snapshot);
+									  Snapshot snapshot,
+									  TupleTableSlot *oldSlot);
 extern void simple_table_tuple_update(Relation rel, ItemPointer otid,
 									  TupleTableSlot *slot, Snapshot snapshot,
-									  TU_UpdateIndexes *update_indexes);
+									  TU_UpdateIndexes *update_indexes,
+									  TupleTableSlot *oldSlot);
 
 
 /* ----------------------------------------------------------------------------
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index 8a5a9fe6422..cb968d03ecd 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -216,8 +216,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
 								 TM_FailureData *tmfd);
 extern void ExecARDeleteTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *slot,
 								 TransitionCaptureState *transition_capture,
 								 bool is_crosspart_update);
 extern bool ExecIRDeleteTriggers(EState *estate,
@@ -240,8 +240,8 @@ extern void ExecARUpdateTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
 								 ResultRelInfo *src_partinfo,
 								 ResultRelInfo *dst_partinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *oldslot,
 								 TupleTableSlot *newslot,
 								 List *recheckIndexes,
 								 TransitionCaptureState *transition_capture,
-- 
2.39.3 (Apple Git-145)
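
To illustrate the caller side of the new "options" API, here is a minimal
sketch (not part of the patch); rel, tid, estate and oldSlot stand in for
whatever the surrounding executor code provides:

    TM_FailureData tmfd;
    TM_Result   result;

    /* Delete, waiting for any conflicting update to finish, and fetch the
     * old tuple version into the caller-supplied slot. */
    result = table_tuple_delete(rel, tid,
                                estate->es_output_cid,
                                estate->es_snapshot,
                                estate->es_crosscheck_snapshot,
                                TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE,
                                &tmfd,
                                false,      /* changingPart */
                                oldSlot);
    if (result == TM_Ok)
    {
        /* oldSlot now holds the deleted tuple version, e.g. for AFTER ROW
         * triggers, without an extra refetch of the tuple. */
    }

Passing TABLE_MODIFY_LOCK_UPDATED instead makes a concurrent update come back
with the latest tuple version already locked and stored in oldSlot.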

0003-Allow-table-AM-to-store-complex-data-structures-i-v3.patchapplication/octet-stream; name=0003-Allow-table-AM-to-store-complex-data-structures-i-v3.patchDownload
From 9157d026f7497b46bd33ab91d2f8b82b3b6793ec Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:04:58 +0300
Subject: [PATCH 03/13] Allow table AM to store complex data structures in
 rd_amcache

The new table AM method free_rd_amcache is responsible for freeing rd_amcache.
---
 src/backend/access/heap/heapam_handler.c |  1 +
 src/backend/utils/cache/relcache.c       | 18 +++++++------
 src/include/access/tableam.h             | 33 ++++++++++++++++++++++++
 src/include/utils/rel.h                  | 10 ++++---
 4 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7c7204a2422..da86ca5c31a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2640,6 +2640,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 
+	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2cd19d603fb..6d98bdfba06 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -318,6 +318,7 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
 										  StrategyNumber numSupport);
 static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
 static void unlink_initfile(const char *initfilename, int elevel);
+static void release_rd_amcache(Relation rel);
 
 
 /*
@@ -2262,9 +2263,7 @@ RelationReloadIndexInfo(Relation relation)
 	RelationCloseSmgr(relation);
 
 	/* Must free any AM cached data upon relcache flush */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * If it's a shared index, we might be called before backend startup has
@@ -2484,8 +2483,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 		pfree(relation->rd_options);
 	if (relation->rd_indextuple)
 		pfree(relation->rd_indextuple);
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
+	release_rd_amcache(relation);
 	if (relation->rd_fdwroutine)
 		pfree(relation->rd_fdwroutine);
 	if (relation->rd_indexcxt)
@@ -2547,9 +2545,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 	RelationCloseSmgr(relation);
 
 	/* Free AM cached data, if any */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * Treat nailed-in system relations separately, they always need to be
@@ -6868,3 +6864,9 @@ ResOwnerReleaseRelation(Datum res)
 
 	RelationCloseCleanup((Relation) res);
 }
+
+static void
+release_rd_amcache(Relation rel)
+{
+	table_free_rd_amcache(rel);
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 467bdc09d36..be092e8bedb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -715,6 +715,13 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * This callback frees relation private cache data stored in rd_amcache.
+	 * If this callback is not provided, rd_amcache is assumed to point to a
+	 * single memory chunk.
+	 */
+	void		(*free_rd_amcache) (Relation rel);
+
 	/*
 	 * See table_relation_size().
 	 *
@@ -1878,6 +1885,32 @@ table_index_validate_scan(Relation table_rel,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Frees relation private cache data stored in rd_amcache.  Uses the
+ * free_rd_amcache method if provided; otherwise assumes rd_amcache points to
+ * a single memory chunk.
+ */
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+	if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+	{
+		rel->rd_tableam->free_rd_amcache(rel);
+
+		/*
+		 * We are assuming free_rd_amcache() did clear the cache and left NULL
+		 * in rd_amcache.
+		 */
+		Assert(rel->rd_amcache == NULL);
+	}
+	else
+	{
+		if (rel->rd_amcache)
+			pfree(rel->rd_amcache);
+		rel->rd_amcache = NULL;
+	}
+}
+
 /*
  * Return the current size of `rel` in bytes. If `forkNumber` is
  * InvalidForkNumber, return the relation's overall size, otherwise the size
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 87002049538..69557fc7a2c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -221,10 +221,12 @@ typedef struct RelationData
 	 * rd_amcache is available for index and table AMs to cache private data
 	 * about the relation.  This must be just a cache since it may get reset
 	 * at any time (in particular, it will get reset by a relcache inval
-	 * message for the relation).  If used, it must point to a single memory
-	 * chunk palloc'd in CacheMemoryContext, or in rd_indexcxt for an index
-	 * relation.  A relcache reset will include freeing that chunk and setting
-	 * rd_amcache = NULL.
+	 * message for the relation).  If used by a table AM, it must point to a
+	 * single memory chunk palloc'd in CacheMemoryContext, or to a more
+	 * complex data structure in that memory context, to be freed by the
+	 * free_rd_amcache method.  If used by an index AM, it must point to a
+	 * single memory chunk palloc'd in the rd_indexcxt memory context.  A
+	 * relcache reset will include freeing that chunk and setting
+	 * rd_amcache = NULL.
 	 */
 	void	   *rd_amcache;		/* available for use by index/table AM */
 
-- 
2.39.3 (Apple Git-145)
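
For a table AM whose rd_amcache points to something more complex than one
palloc'd chunk, the callback could look roughly like this (a hypothetical
sketch; MyAmCache and the myam_ prefix are made up):

    typedef struct MyAmCache
    {
        int         nentries;
        Datum      *entries;    /* palloc'd separately in CacheMemoryContext */
    } MyAmCache;

    static void
    myam_free_rd_amcache(Relation rel)
    {
        MyAmCache  *cache = (MyAmCache *) rel->rd_amcache;

        if (cache != NULL)
        {
            if (cache->entries)
                pfree(cache->entries);
            pfree(cache);
        }

        /* table_free_rd_amcache() asserts that we leave NULL behind */
        rel->rd_amcache = NULL;
    }

The AM then sets .free_rd_amcache = myam_free_rd_amcache in its
TableAmRoutine; AMs that keep a single chunk can simply leave the callback
NULL, as the heap AM does.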

0004-Allow-table-AM-tuple_insert-method-to-return-the--v3.patchapplication/octet-stream; name=0004-Allow-table-AM-tuple_insert-method-to-return-the--v3.patchDownload
From 924d3d7d92f14fab3c8145f1c30a1eef02b849b1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:28:27 +0300
Subject: [PATCH 04/13] Allow table AM tuple_insert() method to return a
 different slot

This allows a table AM to return its native tuple slot even if a
VirtualTupleTableSlot is given as input.  The native tuple slot has its own
knowledge about system attributes, which could be accessed in the future.
---
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/executor/nodeModifyTable.c   |  6 +++---
 src/include/access/tableam.h             | 20 +++++++++++---------
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index da86ca5c31a..6abfe36dec7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -243,7 +243,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
  * ----------------------------------------------------------------------------
  */
 
-static void
+static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 					int options, BulkInsertState bistate)
 {
@@ -260,6 +260,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 
 	if (shouldFree)
 		pfree(tuple);
+
+	return slot;
 }
 
 static void
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 9deeaceb35c..34962033be7 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1129,9 +1129,9 @@ ExecInsert(ModifyTableContext *context,
 		else
 		{
 			/* insert the tuple normally */
-			table_tuple_insert(resultRelationDesc, slot,
-							   estate->es_output_cid,
-							   0, NULL);
+			slot = table_tuple_insert(resultRelationDesc, slot,
+									  estate->es_output_cid,
+									  0, NULL);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index be092e8bedb..2c43ef3f60e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -505,9 +505,9 @@ typedef struct TableAmRoutine
 	 */
 
 	/* see table_tuple_insert() for reference about parameters */
-	void		(*tuple_insert) (Relation rel, TupleTableSlot *slot,
-								 CommandId cid, int options,
-								 struct BulkInsertStateData *bistate);
+	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
+									 CommandId cid, int options,
+									 struct BulkInsertStateData *bistate);
 
 	/* see table_tuple_insert_speculative() for reference about parameters */
 	void		(*tuple_insert_speculative) (Relation rel,
@@ -1398,16 +1398,18 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
- * On return the slot's tts_tid and tts_tableOid are updated to reflect the
- * insertion. But note that any toasting of fields within the slot is NOT
- * reflected in the slots contents.
+ * Returns the slot containing the inserted tuple, which may differ from the
+ * given slot.  For instance, the source slot may be a VirtualTupleTableSlot,
+ * while the returned slot corresponds to the table AM.  On return the slot's
+ * tts_tid and tts_tableOid are updated to reflect the insertion.  But note
+ * that any toasting of fields within the slot is NOT reflected in the slot's
+ * contents.
  */
-static inline void
+static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 				   int options, struct BulkInsertStateData *bistate)
 {
-	rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-								  bistate);
+	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
+										 bistate);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)
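
On the AM side, a hypothetical implementation that returns a slot of its own
type instead of the one it was given could look like this
(myam_get_insert_slot and myam_do_insert are made-up helpers):

    static TupleTableSlot *
    myam_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
                      int options, struct BulkInsertStateData *bistate)
    {
        /* an AM-owned slot created with the AM's own slot callbacks */
        TupleTableSlot *myslot = myam_get_insert_slot(rel);

        /* bring the data over from whatever slot type the executor passed */
        ExecCopySlot(myslot, slot);

        /* AM-specific insertion; sets tts_tid/tts_tableOid on myslot */
        myam_do_insert(rel, myslot, cid, options, bistate);

        /* the executor continues with this slot, so AM-specific system
         * attributes remain accessible */
        return myslot;
    }

The heap implementation above keeps returning the slot it was given, so
existing callers see no behavior change.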

0006-Generalize-relation-analyze-in-table-AM-interface-v3.patchapplication/octet-stream; name=0006-Generalize-relation-analyze-in-table-AM-interface-v3.patchDownload
From 9bb14cdf29d16f04d81e5bc199985df3c5b0d7ed Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 8 Jun 2023 04:20:29 +0300
Subject: [PATCH 06/13] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table,
implemented in acquire_sample_rows().  A custom table AM can only redefine the
way to get the next block/tuple by implementing the scan_analyze_next_block()
and scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.
---
 src/backend/access/heap/heapam_handler.c | 286 +++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           | 288 +----------------------
 src/include/access/tableam.h             |  92 ++------
 src/include/commands/vacuum.h            |   5 +
 src/include/foreign/fdwapi.h             |   6 +-
 6 files changed, 317 insertions(+), 362 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..66ac541ed21 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
 
 static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   Snapshot snapshot, TupleTableSlot *slot,
@@ -1220,6 +1221,288 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * Comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b, void *arg)
+{
+	HeapTuple	ha = *(const HeapTuple *) a;
+	HeapTuple	hb = *(const HeapTuple *) b;
+	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+	if (ba < bb)
+		return -1;
+	if (ba > bb)
+		return 1;
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+/* Buffer access strategy passed down to heapam_acquire_sample_rows() */
+static BufferAccessStrategy analyze_bstrategy;
+
+/*
+ * heapam_acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * Selected rows are returned in the caller-allocated array rows[], which
+ * must have at least targrows entries.
+ * The actual number of rows selected is returned as the function result.
+ * We also estimate the total numbers of live and dead rows in the table,
+ * and return them into *totalrows and *totaldeadrows, respectively.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ *
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+heapam_acquire_sample_rows(Relation onerel, int elevel,
+						   HeapTuple *rows, int targrows,
+						   double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstoskip = -1;	/* -1 means not set yet */
+	uint32		randseed;		/* Seed for block sampler(s) */
+	BlockNumber totalblocks;
+	TransactionId OldestXmin;
+	BlockSamplerData bs;
+	ReservoirStateData rstate;
+	TupleTableSlot *slot;
+	TableScanDesc scan;
+	BlockNumber nblocks;
+	BlockNumber blksdone = 0;
+#ifdef USE_PREFETCH
+	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
+	BlockSamplerData prefetch_bs;
+#endif
+
+	Assert(targrows > 0);
+
+	totalblocks = RelationGetNumberOfBlocks(onerel);
+
+	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
+
+	/* Prepare for sampling block numbers */
+	randseed = pg_prng_uint32(&pg_global_prng_state);
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
+
+#ifdef USE_PREFETCH
+	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
+	/* Create another BlockSampler, using the same seed, for prefetching */
+	if (prefetch_maximum)
+		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+#endif
+
+	/* Report sampling block numbers */
+	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
+								 nblocks);
+
+	/* Prepare for sampling rows */
+	reservoir_init_selection_state(&rstate, targrows);
+
+	scan = table_beginscan_analyze(onerel);
+	slot = table_slot_create(onerel, NULL);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * If we are doing prefetching, then go ahead and tell the kernel about
+	 * the first set of pages we are going to want.  This also moves our
+	 * iterator out ahead of the main one being used, where we will keep it so
+	 * that we're always pre-fetching out prefetch_maximum number of blocks
+	 * ahead.
+	 */
+	if (prefetch_maximum)
+	{
+		for (int i = 0; i < prefetch_maximum; i++)
+		{
+			BlockNumber prefetch_block;
+
+			if (!BlockSampler_HasMore(&prefetch_bs))
+				break;
+
+			prefetch_block = BlockSampler_Next(&prefetch_bs);
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+		}
+	}
+#endif
+
+	/* Outer loop over blocks to sample */
+	while (BlockSampler_HasMore(&bs))
+	{
+		bool		block_accepted;
+		BlockNumber targblock = BlockSampler_Next(&bs);
+#ifdef USE_PREFETCH
+		BlockNumber prefetch_targblock = InvalidBlockNumber;
+
+		/*
+		 * Make sure that every time the main BlockSampler is moved forward
+		 * that our prefetch BlockSampler also gets moved forward, so that we
+		 * always stay out ahead.
+		 */
+		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
+			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+#endif
+
+		vacuum_delay_point();
+
+		block_accepted = heapam_scan_analyze_next_block(scan, targblock, analyze_bstrategy);
+
+#ifdef USE_PREFETCH
+
+		/*
+		 * When pre-fetching, after we get a block, tell the kernel about the
+		 * next one we will want, if there's any left.
+		 *
+		 * We want to do this even if the table_scan_analyze_next_block() call
+		 * above decides against analyzing the block it picked.
+		 */
+		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+#endif
+
+		/*
+		 * Don't analyze if table_scan_analyze_next_block() indicated this
+		 * block is unsuitable for analyzing.
+		 */
+		if (!block_accepted)
+			continue;
+
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		{
+			/*
+			 * The first targrows sample rows are simply copied into the
+			 * reservoir. Then we start replacing tuples in the sample until
+			 * we reach the end of the relation.  This algorithm is from Jeff
+			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
+			 * works by repeatedly computing the number of tuples to skip
+			 * before selecting a tuple, which replaces a randomly chosen
+			 * element of the reservoir (current set of tuples).  At all times
+			 * the reservoir is a true random sample of the tuples we've
+			 * passed over so far, so when we fall off the end of the relation
+			 * we're done.
+			 */
+			if (numrows < targrows)
+				rows[numrows++] = ExecCopySlotHeapTuple(slot);
+			else
+			{
+				/*
+				 * t in Vitter's paper is the number of records already
+				 * processed.  If we need to compute a new S value, we must
+				 * use the not-yet-incremented value of samplerows as t.
+				 */
+				if (rowstoskip < 0)
+					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+				if (rowstoskip <= 0)
+				{
+					/*
+					 * Found a suitable tuple, so save it, replacing one old
+					 * tuple at random
+					 */
+					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
+
+					Assert(k >= 0 && k < targrows);
+					heap_freetuple(rows[k]);
+					rows[k] = ExecCopySlotHeapTuple(slot);
+				}
+
+				rowstoskip -= 1;
+			}
+
+			samplerows += 1;
+		}
+
+		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
+									 ++blksdone);
+	}
+
+	ExecDropSingleTupleTableSlot(slot);
+	table_endscan(scan);
+
+	/*
+	 * If we didn't find as many tuples as we wanted then we're done. No sort
+	 * is needed, since they're already in order.
+	 *
+	 * Otherwise we need to sort the collected tuples by position
+	 * (itempointer). It's not worth worrying about corner cases where the
+	 * tuples are already sorted.
+	 */
+	if (numrows == targrows)
+		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
+							compare_rows, NULL);
+
+	/*
+	 * Estimate total numbers of live and dead rows in relation, extrapolating
+	 * on the assumption that the average tuple density in pages we didn't
+	 * scan is the same as in the pages we did scan.  Since what we scanned is
+	 * a random sample of the pages in the relation, this should be a good
+	 * assumption.
+	 */
+	if (bs.m > 0)
+	{
+		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+	}
+	else
+	{
+		*totalrows = 0.0;
+		*totaldeadrows = 0.0;
+	}
+
+	/*
+	 * Emit some interesting relation info
+	 */
+	ereport(elevel,
+			(errmsg("\"%s\": scanned %d of %u pages, "
+					"containing %.0f live rows and %.0f dead rows; "
+					"%d rows in sample, %.0f estimated total rows",
+					RelationGetRelationName(onerel),
+					bs.m, totalblocks,
+					liverows, deadrows,
+					numrows, *totalrows)));
+
+	return numrows;
+}
+
+static inline void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = heapam_acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	analyze_bstrategy = bstrategy;
+}
+
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -2637,10 +2920,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index ce637a5a5d9..55b8caeadf2 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,8 +81,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..659f69ef270 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -87,10 +87,6 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
-static int	acquire_sample_rows(Relation onerel, int elevel,
-								HeapTuple *rows, int targrows,
-								double *totalrows, double *totaldeadrows);
-static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows, double *totaldeadrows);
@@ -190,10 +186,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1102,277 +1097,6 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
-/*
- * acquire_sample_rows -- acquire a random sample of rows from the table
- *
- * Selected rows are returned in the caller-allocated array rows[], which
- * must have at least targrows entries.
- * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
- * and return them into *totalrows and *totaldeadrows, respectively.
- *
- * The returned list of tuples is in order by physical position in the table.
- * (We will rely on this later to derive correlation estimates.)
- *
- * As of May 2004 we use a new two-stage method:  Stage one selects up
- * to targrows random blocks (or all blocks, if there aren't so many).
- * Stage two scans these blocks and uses the Vitter algorithm to create
- * a random sample of targrows rows (or less, if there are less in the
- * sample of blocks).  The two stages are executed simultaneously: each
- * block is processed as soon as stage one returns its number and while
- * the rows are read stage two controls which ones are to be inserted
- * into the sample.
- *
- * Although every row has an equal chance of ending up in the final
- * sample, this sampling method is not perfect: not every possible
- * sample has an equal chance of being selected.  For large relations
- * the number of different blocks represented by the sample tends to be
- * too small.  We can live with that for now.  Improvements are welcome.
- *
- * An important property of this sampling method is that because we do
- * look at a statistically unbiased set of blocks, we should get
- * unbiased estimates of the average numbers of live and dead rows per
- * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
- */
-static int
-acquire_sample_rows(Relation onerel, int elevel,
-					HeapTuple *rows, int targrows,
-					double *totalrows, double *totaldeadrows)
-{
-	int			numrows = 0;	/* # rows now in reservoir */
-	double		samplerows = 0; /* total # rows collected */
-	double		liverows = 0;	/* # live rows seen */
-	double		deadrows = 0;	/* # dead rows seen */
-	double		rowstoskip = -1;	/* -1 means not set yet */
-	uint32		randseed;		/* Seed for block sampler(s) */
-	BlockNumber totalblocks;
-	TransactionId OldestXmin;
-	BlockSamplerData bs;
-	ReservoirStateData rstate;
-	TupleTableSlot *slot;
-	TableScanDesc scan;
-	BlockNumber nblocks;
-	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
-
-	Assert(targrows > 0);
-
-	totalblocks = RelationGetNumberOfBlocks(onerel);
-
-	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
-
-	/* Prepare for sampling block numbers */
-	randseed = pg_prng_uint32(&pg_global_prng_state);
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
-
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
-	/* Report sampling block numbers */
-	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
-								 nblocks);
-
-	/* Prepare for sampling rows */
-	reservoir_init_selection_state(&rstate, targrows);
-
-	scan = table_beginscan_analyze(onerel);
-	slot = table_slot_create(onerel, NULL);
-
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
-
-	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
-	{
-		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
-		vacuum_delay_point();
-
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
-		{
-			/*
-			 * The first targrows sample rows are simply copied into the
-			 * reservoir. Then we start replacing tuples in the sample until
-			 * we reach the end of the relation.  This algorithm is from Jeff
-			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
-			 * works by repeatedly computing the number of tuples to skip
-			 * before selecting a tuple, which replaces a randomly chosen
-			 * element of the reservoir (current set of tuples).  At all times
-			 * the reservoir is a true random sample of the tuples we've
-			 * passed over so far, so when we fall off the end of the relation
-			 * we're done.
-			 */
-			if (numrows < targrows)
-				rows[numrows++] = ExecCopySlotHeapTuple(slot);
-			else
-			{
-				/*
-				 * t in Vitter's paper is the number of records already
-				 * processed.  If we need to compute a new S value, we must
-				 * use the not-yet-incremented value of samplerows as t.
-				 */
-				if (rowstoskip < 0)
-					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
-				if (rowstoskip <= 0)
-				{
-					/*
-					 * Found a suitable tuple, so save it, replacing one old
-					 * tuple at random
-					 */
-					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
-
-					Assert(k >= 0 && k < targrows);
-					heap_freetuple(rows[k]);
-					rows[k] = ExecCopySlotHeapTuple(slot);
-				}
-
-				rowstoskip -= 1;
-			}
-
-			samplerows += 1;
-		}
-
-		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
-									 ++blksdone);
-	}
-
-	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
-
-	/*
-	 * If we didn't find as many tuples as we wanted then we're done. No sort
-	 * is needed, since they're already in order.
-	 *
-	 * Otherwise we need to sort the collected tuples by position
-	 * (itempointer). It's not worth worrying about corner cases where the
-	 * tuples are already sorted.
-	 */
-	if (numrows == targrows)
-		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
-							compare_rows, NULL);
-
-	/*
-	 * Estimate total numbers of live and dead rows in relation, extrapolating
-	 * on the assumption that the average tuple density in pages we didn't
-	 * scan is the same as in the pages we did scan.  Since what we scanned is
-	 * a random sample of the pages in the relation, this should be a good
-	 * assumption.
-	 */
-	if (bs.m > 0)
-	{
-		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
-		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
-	}
-	else
-	{
-		*totalrows = 0.0;
-		*totaldeadrows = 0.0;
-	}
-
-	/*
-	 * Emit some interesting relation info
-	 */
-	ereport(elevel,
-			(errmsg("\"%s\": scanned %d of %u pages, "
-					"containing %.0f live rows and %.0f dead rows; "
-					"%d rows in sample, %.0f estimated total rows",
-					RelationGetRelationName(onerel),
-					bs.m, totalblocks,
-					liverows, deadrows,
-					numrows, *totalrows)));
-
-	return numrows;
-}
-
-/*
- * Comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b, void *arg)
-{
-	HeapTuple	ha = *(const HeapTuple *) a;
-	HeapTuple	hb = *(const HeapTuple *) b;
-	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
-	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
-	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
-	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
-	if (ba < bb)
-		return -1;
-	if (ba > bb)
-		return 1;
-	if (oa < ob)
-		return -1;
-	if (oa > ob)
-		return 1;
-	return 0;
-}
-
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1462,9 +1186,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2c43ef3f60e..b9210ea4fcb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -654,41 +655,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -709,6 +675,15 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/*
+	 * Provides a row sampling callback for the relation and reports the
+	 * number of relation pages.
+	 */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1740,42 +1715,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1881,6 +1820,17 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * Provides a row sampling callback for the relation and reports the
+ * number of relation pages.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1182a967427..d38ddc68b79 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -104,6 +104,11 @@ typedef struct ParallelVacuumState ParallelVacuumState;
  */
 typedef struct VacAttrStats *VacAttrStatsP;
 
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 typedef Datum (*AnalyzeAttrFetchFunc) (VacAttrStatsP stats, int rownum,
 									   bool *isNull);
 
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index fcde3876b28..0968e0a01ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.3 (Apple Git-145)
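
With this callback, an AM that is not block-based can plug in its own sampling
routine entirely.  A rough sketch (the myam_ names, including
myam_estimate_pages, are hypothetical):

    static int
    myam_acquire_sample_rows(Relation onerel, int elevel,
                             HeapTuple *rows, int targrows,
                             double *totalrows, double *totaldeadrows)
    {
        /* An index-organized AM would walk its primary-key structure here,
         * copying up to targrows tuples into rows[] and estimating the
         * live/dead totals.  Stubbed out in this sketch. */
        *totalrows = 0.0;
        *totaldeadrows = 0.0;
        return 0;
    }

    static void
    myam_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
                          BlockNumber *totalpages,
                          BufferAccessStrategy bstrategy)
    {
        *func = myam_acquire_sample_rows;
        *totalpages = myam_estimate_pages(relation);
        /* bstrategy can be remembered for use by the sampling callback,
         * as the heap implementation above does */
    }

ANALYZE then invokes the returned AcquireSampleRowsFunc much like it invokes
an FDW's sampling function, so the AM controls the whole algorithm rather than
just the block/tuple iteration.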

0007-Custom-reloptions-for-table-AM-v3.patchapplication/octet-stream; name=0007-Custom-reloptions-for-table-AM-v3.patchDownload
From 8566f1dcece83e66ce3b49279e5f1424d4ca7f68 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH 07/13] Custom reloptions for table AM

Let a table AM define custom reloptions for its tables.
---
 src/backend/access/common/reloptions.c   |  6 ++-
 src/backend/access/heap/heapam_handler.c | 13 ++++++
 src/backend/access/table/tableamapi.c    | 20 ++++++++++
 src/backend/commands/tablecmds.c         | 51 ++++++++++++++----------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       |  6 ++-
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 29 ++++++++++++++
 8 files changed, 106 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..963995388bb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1377,7 +1378,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1400,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 66ac541ed21..45df59fdf50 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2424,6 +2425,17 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	if (relkind == RELKIND_RELATION ||
+		relkind == RELKIND_TOASTVALUE ||
+		relkind == RELKIND_MATVIEW)
+		return heap_reloptions(relkind, reloptions, validate);
+
+	return NULL;
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2929,6 +2941,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..34ff3e38333 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,24 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3ed0618b4e6..d2ef8a0c383 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -705,6 +705,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	LOCKMODE	parentLockmode;
 	const char *accessMethod = NULL;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -844,6 +845,26 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * If the statement hasn't specified an access method, but we're defining
+	 * a type of relation that needs one, use the default.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		accessMethod = stmt->accessMethod;
+
+		if (partitioned)
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("specifying a table access method is not supported on a partitioned table")));
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind))
+		accessMethod = default_table_access_method;
+
+	/* look up the access method, verify it is for a table */
+	if (accessMethod != NULL)
+		accessMethodId = get_table_am_oid(accessMethod, false);
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -852,6 +873,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -860,6 +887,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -951,26 +979,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * If the statement hasn't specified an access method, but we're defining
-	 * a type of relation that needs one, use the default.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		accessMethod = stmt->accessMethod;
-
-		if (partitioned)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("specifying a table access method is not supported on a partitioned table")));
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind))
-		accessMethod = default_table_access_method;
-
-	/* look up the access method, verify it is for a table */
-	if (accessMethod != NULL)
-		accessMethodId = get_table_am_oid(accessMethod, false);
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15309,7 +15317,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 71e8a6f2584..d1d76016ab4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2661,7 +2661,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6d98bdfba06..3babfc804a7 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -465,6 +466,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -479,6 +481,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -494,7 +497,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270a..8ddc75df287 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b9210ea4fcb..b99fb6e4e71 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -735,6 +735,11 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * Parse table AM-specific table options.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1931,6 +1936,29 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2108,6 +2136,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
-- 
2.39.3 (Apple Git-145)
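
For illustration only (nothing below is part of the patch; the myam_* names are hypothetical), a table AM could wire up the new reloptions callback roughly like this, delegating to the standard heap parser; a real AM could instead call build_reloptions() with its own option table:

static bytea *
myam_reloptions(char relkind, Datum reloptions, bool validate)
{
	/* Delegate to the standard heap parser for the relkinds we support. */
	switch (relkind)
	{
		case RELKIND_RELATION:
		case RELKIND_TOASTVALUE:
		case RELKIND_MATVIEW:
			return heap_reloptions(relkind, reloptions, validate);
		default:
			return NULL;
	}
}

static const TableAmRoutine myam_methods = {
	.type = T_TableAmRoutine,
	/* ... the other callbacks of the AM ... */
	.reloptions = myam_reloptions,
};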

Attachment: 0010-Notify-table-AM-about-index-creation-v3.patch (application/octet-stream)
From 594390f787add68c06eeb615c0196de3b24fa050 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH 10/13] Notify table AM about index creation

This allows the table AM to do some preparation for an index build.  In
particular, the table AM could update its specific meta-information.  That
could also be useful if the table AM overrides index implementations.
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 422898a609d..534495f254f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -3219,6 +3219,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e230..bca97981051 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3840,6 +3840,8 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7299ebbe9f3..7f24687c6d9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -583,6 +583,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -629,6 +630,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that tableId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -656,24 +677,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1218,6 +1221,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1bfae380637..4ac2d868322 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -683,6 +683,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1849,6 +1859,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, and it should be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notifies table AM about index creation on `rel` with oid `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.3 (Apple Git-145)
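
To sketch the intended usage (this is not part of the patch; the myam_* names and the MyamIndexArg struct are made up), a table AM could implement the two new callbacks along these lines, carrying state from validation to the post-creation hook via `arg`:

typedef struct MyamIndexArg
{
	bool		is_primary;		/* example of state carried between the calls */
} MyamIndexArg;

static bool
myam_define_index_validate(Relation rel, IndexStmt *stmt,
						   bool skip_build, void **arg)
{
	MyamIndexArg *state = palloc0(sizeof(MyamIndexArg));

	/* The AM could reject unsupported index definitions here with ereport(). */
	state->is_primary = stmt->primary;
	*arg = state;
	return true;
}

static bool
myam_define_index(Relation rel, Oid indoid, bool reindex,
				  bool skip_constraint_checks, bool skip_build, void *arg)
{
	MyamIndexArg *state = (MyamIndexArg *) arg;

	/* Update AM-specific meta-information about the new or rebuilt index. */
	if (state)
		pfree(state);
	return true;
}

Note that the reindex_index() call site above passes arg = NULL, so define_index() has to cope with a missing validation state.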

Attachment: 0008-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v3.patch (application/octet-stream)
From 52365781128edb4e898987f3f3985b1eb7413d26 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH 08/13] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs have to implement INSERT ... ON CONFLICT ... using
speculative tokens.  Their only point of customization is the implementation
of those tokens via the tuple_insert_speculative() and
tuple_complete_speculative() API functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a
single tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function provides clear semantics and makes room for
different implementations of the INSERT ... ON CONFLICT ... functionality.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 45df59fdf50..781385270b0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -306,6 +306,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2914,8 +3192,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 34ff3e38333..d9fc87665c7 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -70,8 +70,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 34962033be7..7d64fcab00d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -129,7 +129,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -265,66 +264,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1010,12 +949,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1032,23 +978,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1071,57 +1022,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2417,144 +2324,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b99fb6e4e71..2a496e81610 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -510,19 +511,16 @@ typedef struct TableAmRoutine
 									 CommandId cid, int options,
 									 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1393,36 +1391,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into a table, checking arbiter indexes.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it takes into account
+ * `arbiterIndexes`, a list of the OIDs of the arbiter indexes.
+ *
+ * If the tuple doesn't violate uniqueness on any of the arbiter indexes, it
+ * is inserted and the slot containing the inserted tuple is returned.
+ *
+ * If the tuple violates uniqueness on some arbiter index, then this function
+ * returns NULL and doesn't insert the tuple.  Additionally, if `lockedSlot`
+ * is provided, the conflicting tuple is locked in `lockmode` and placed into
+ * `lockedSlot`.
+ *
+ * Executor state `estate` is passed to this method so that index tuples can
+ * be computed.  The temporary tuple table slot `tempSlot` is passed for
+ * holding a potentially conflicting tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)
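
As a rough sketch of what this buys a non-heap AM (not part of the patch; myam_check_conflict() and myam_insert_tuple() are hypothetical helpers), a table AM whose storage enforces uniqueness internally could implement the callback without speculative tokens at all:

static TupleTableSlot *
myam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
							   TupleTableSlot *slot,
							   CommandId cid, int options,
							   struct BulkInsertStateData *bistate,
							   List *arbiterIndexes,
							   EState *estate,
							   LockTupleMode lockmode,
							   TupleTableSlot *lockedSlot,
							   TupleTableSlot *tempSlot)
{
	Relation	rel = resultRelInfo->ri_RelationDesc;
	ItemPointerData conflictTid;

	/* Check the arbiter indexes using the AM's own uniqueness machinery. */
	if (myam_check_conflict(rel, slot, arbiterIndexes, &conflictTid))
	{
		if (lockedSlot)
		{
			TM_FailureData tmfd;

			/*
			 * Lock the conflicting row so ON CONFLICT DO UPDATE can work on
			 * it.  Result handling and the retry logic are omitted in this
			 * sketch.
			 */
			(void) table_tuple_lock(rel, &conflictTid, estate->es_snapshot,
									lockedSlot, cid, lockmode,
									LockWaitBlock, 0, &tmfd);
		}
		/* NULL tells the executor that a conflict was found. */
		return NULL;
	}

	/* No conflict: store the tuple and return the AM-native slot. */
	return myam_insert_tuple(rel, slot, cid, options, bistate);
}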

Attachment: 0009-Let-table-AM-override-reloptions-for-indexes-buil-v3.patch (application/octet-stream)
From 370b8b3a53418f45ac39be17148c179de086e4f8 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Mar 2024 00:53:05 +0200
Subject: [PATCH 09/13] Let table AM override reloptions for indexes built on
 its tables

---
 src/backend/access/common/reloptions.c   |  3 ++-
 src/backend/access/heap/heapam_handler.c |  8 ++++++++
 src/backend/commands/indexcmds.c         |  3 ++-
 src/backend/commands/tablecmds.c         |  9 ++++++++-
 src/backend/utils/cache/relcache.c       | 24 ++++++++++++++++++++++--
 src/include/access/tableam.h             | 23 +++++++++++++++++++++++
 6 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388bb..00088240cdd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1411,7 +1411,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 781385270b0..422898a609d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2714,6 +2714,13 @@ heapam_reloptions(char relkind, Datum reloptions, bool validate)
 	return NULL;
 }
 
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -3219,6 +3226,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
 	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7b20d103c86..7299ebbe9f3 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -899,7 +899,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d2ef8a0c383..fa8eb55b189 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15328,7 +15328,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3babfc804a7..b1a4b36aa14 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -478,15 +478,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
 			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* look up the parent table's pg_class tuple to get its table AM */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2a496e81610..1bfae380637 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -738,6 +739,13 @@ typedef struct TableAmRoutine
 	 */
 	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
 
+	/*
+	 * Parse table AM-specific index options.  Lets a table AM define new
+	 * index options or override existing ones.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1963,6 +1971,21 @@ tableam_reloptions(const TableAmRoutine *tableam, char relkind,
 extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
 							   bool validate);
 
+/*
+ * Parse index options.  Gives table AM a chance to override index-specific
+ * options defined in 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
-- 
2.39.3 (Apple Git-145)
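
To make the override case concrete (again just a sketch, not part of the patch; MyamIndexOptions, myam_index_relopt_kind and the "compress" option are invented, and the option would have to be registered at load time via add_reloption_kind()/add_bool_reloption()), a table AM that replaces the index implementation could recognize its own option set instead of the index AM's:

static relopt_kind myam_index_relopt_kind;	/* set up by add_reloption_kind() */

typedef struct MyamIndexOptions
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	bool		compress;		/* hypothetical AM-specific index option */
} MyamIndexOptions;

static bytea *
myam_indexoptions(amoptions_function amoptions, char relkind,
				  Datum reloptions, bool validate)
{
	static const relopt_parse_elt tab[] = {
		{"compress", RELOPT_TYPE_BOOL, offsetof(MyamIndexOptions, compress)},
	};

	/*
	 * Recognize only this table AM's option set.  A less radical AM would
	 * simply call index_reloptions(amoptions, reloptions, validate), as the
	 * heap implementation above does.
	 */
	return (bytea *) build_reloptions(reloptions, validate,
									  myam_index_relopt_kind,
									  sizeof(MyamIndexOptions),
									  tab, lengthof(tab));
}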

Attachment: 0011-Let-table-AM-insertion-methods-control-index-inse-v3.patch (application/octet-stream)
From 7facb3c4e40b713ad7626307933e9094f0959d98 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH 11/13] Let table AM insertion methods control index insertion

A new parameter for the tuple_insert() and multi_insert() methods provides a
way to skip index insertions in the executor.  In this case, the table AM can
handle those insertions itself.
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 23 ++++++++++++++++-------
 12 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f6478f89e77..facad25d5c1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2091,7 +2091,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2440,6 +2441,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 534495f254f..7ebebf4d6ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -247,7 +247,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -263,6 +263,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8d3675be959..805d222cebc 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -273,9 +273,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58a..4d404f22f83 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 8908a440e19..b6736369771 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -397,6 +397,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -416,7 +417,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -425,7 +427,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1265,11 +1267,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..afd3dace079 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -578,6 +578,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -594,7 +595,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6d09b755564..9ec13d09846 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -476,6 +476,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -490,7 +491,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fa8eb55b189..c7ffb5c17fe 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6350,8 +6350,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 0cad843fb69..db685473fc0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -509,6 +509,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -523,9 +524,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7d64fcab00d..321f2358c12 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1035,13 +1035,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 45954b8003d..cbb73536289 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -274,7 +274,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4ac2d868322..c32a3cbcf66 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -510,7 +510,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -525,7 +526,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1394,6 +1396,10 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * This function sets `*insert_indexes` to true if it expects the caller to
+ * insert the relevant index tuples.  If `*insert_indexes` is set to false,
+ * then this function takes care of index insertion itself.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot. For instance, the source slot may be a VirtualTupleTableSlot,
  * but the result corresponds to the table AM. On return the slot's tts_tid and
@@ -1402,10 +1408,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1463,10 +1470,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2157,7 +2165,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.3 (Apple Git-145)

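To make the new contract concrete, here is a minimal sketch of a table AM
callback that maintains its own indexes internally and therefore asks the
executor to skip index insertion (my_am_tuple_insert and my_am_store_tuple are
hypothetical names; only the signature follows the patch):

	static TupleTableSlot *
	my_am_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
					   int options, struct BulkInsertStateData *bistate,
					   bool *insert_indexes)
	{
		/* Store the tuple and update the AM's own indexes (hypothetical). */
		my_am_store_tuple(rel, slot, cid, options, bistate);

		/* Tell the executor not to call ExecInsertIndexTuples(). */
		*insert_indexes = false;
		return slot;
	}

The executor only inserts index tuples when the flag comes back true, as shown
in the ExecInsert() and ExecSimpleRelationInsert() hunks above.
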
0012-Introduce-RowRefType-which-describes-the-table-ro-v3.patch
From 2702be92b85cd910f30dd0366110f48b8ba79b5f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH 12/13] Introduce RowRefType, which describes the table row
 identifier

Currently, a table row can be identified either by its ctid or by a whole-row
copy (foreign tables).  But the row identifier is mixed together with the lock
mode in RowMarkType.  This commit separates the row identifier type into a
separate enum, RowRefType.
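
For illustration, the planner-side usage boils down to the following pattern
(a condensed sketch of the preprocess_rowmarks() and preprocess_targetlist()
changes below; the other PlanRowMark fields are omitted):

	RowRefType	refType;
	PlanRowMark *newrc = makeNode(PlanRowMark);

	/* Lock mode and row identifier type are now chosen independently. */
	newrc->markType = select_rowmark_type(rte, rc->strength, &refType);
	newrc->allRefTypes = (1 << refType);

	/* Junk columns are driven by the reference type, not the lock mode. */
	if (newrc->allRefTypes & (1 << ROW_REF_TID))
		/* emit a "ctid" junk attribute */ ;
	if (newrc->allRefTypes & (1 << ROW_REF_COPY))
		/* emit a whole-row junk attribute */ ;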
---
 src/backend/optimizer/plan/planner.c   | 16 +++++++++-----
 src/backend/optimizer/prep/preptlist.c |  4 ++--
 src/backend/optimizer/util/inherit.c   | 30 +++++++++++++++-----------
 src/backend/parser/parse_relation.c    | 10 +++++++++
 src/include/nodes/execnodes.h          |  4 ++++
 src/include/nodes/parsenodes.h         |  1 +
 src/include/nodes/plannodes.h          |  4 ++--
 src/include/nodes/primnodes.h          |  7 ++++++
 src/include/optimizer/planner.h        |  3 ++-
 src/tools/pgindent/typedefs.list       |  1 +
 10 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5564826cb4a..6cbabe83adf 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2279,6 +2279,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		RowMarkClause *rc = lfirst_node(RowMarkClause, l);
 		RangeTblEntry *rte = rt_fetch(rc->rti, parse->rtable);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		/*
 		 * Currently, it is syntactically impossible to have FOR UPDATE et al
@@ -2301,8 +2302,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2318,6 +2319,7 @@ preprocess_rowmarks(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, l);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		i++;
 		if (!bms_is_member(i, rels))
@@ -2326,8 +2328,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2342,11 +2344,13 @@ preprocess_rowmarks(PlannerInfo *root)
  * Select RowMarkType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
 		/* If it's not a table at all, use ROW_MARK_COPY */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
@@ -2357,11 +2361,13 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
 		if (fdwroutine->GetForeignRowMarkType != NULL)
 			return fdwroutine->GetForeignRowMarkType(rte, strength);
 		/* Otherwise, use ROW_MARK_COPY by default */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 7698bfa1a58..4599b0dc761 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 5c7acf8a901..d32b07bab57 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -91,7 +92,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +132,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +240,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +267,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +442,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -485,6 +486,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = ROW_REF_TID;
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
@@ -574,14 +576,16 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	if (top_parentrc)
 	{
 		PlanRowMark *childrc = makeNode(PlanRowMark);
+		RowRefType	refType;
 
 		childrc->rti = childRTindex;
 		childrc->prti = top_parentrc->rti;
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&refType);
+		childrc->allRefTypes = (1 << refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +596,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 427b7325db8..10f2d287b39 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1503,6 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1588,6 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1763,6 +1767,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2081,6 +2086,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2156,6 +2162,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2252,6 +2259,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2332,6 +2340,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2494,6 +2503,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 92593526725..acd9672d789 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -455,6 +455,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row identifier for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -750,6 +753,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row identifier for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7b57fddf2d0..72c8c4caf24 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1079,6 +1079,7 @@ typedef struct RangeTblEntry
 	int			rellockmode;	/* lock level that query requires on the rel */
 	Index		perminfoindex;	/* index of RTEPermissionInfo entry, or 0 */
 	struct TableSampleClause *tablesample;	/* sampling info, or NULL */
+	RowRefType	reftype;		/* row identifier for relation */
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c9..dbe5c535560 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1351,7 +1351,7 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
@@ -1381,7 +1381,7 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 8df8884001d..bc06ff99e21 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2089,4 +2089,11 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/* The row identifier */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index e1d79ffdf3c..98fc796d054 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 042d04c8de2..e5ae4288428 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2420,6 +2420,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.3 (Apple Git-145)

0013-Introduce-RowID-bytea-tuple-identifier-v3.patch
From 637b33d94a676d8d826e01d9a1e8dd580dfed5a5 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Jun 2023 04:26:30 +0300
Subject: [PATCH 13/13] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: the tuple identifier (tid)
and a whole-row copy.  The tuple identifier used for regular tables consists of
a 32-bit block number and a 16-bit offset.  This is limiting for some use cases,
in particular index-organized tables.  The whole-row copy is used to identify
tuples in foreign tables.  It could be extended to regular tables, but that
seems like overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose how its tuples are identified by providing the new get_row_ref_type()
API function.  The new system attribute RowIdAttributeNumber holds the RowID
when appropriate.  Table AM methods now accept Datum arguments as tuple
identifiers.  Those Datums can be either a tid or a bytea, depending on what
table_get_row_ref_type() reports.  The ModifyTable node and triggers are aware
of RowIDs.  IndexScan and BitmapScan nodes are not aware of RowIDs and expect
tids.  Table AMs that use RowIDs are expected to redefine those nodes using
hooks.
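
As a minimal sketch, a table AM opts into bytea identifiers through the new
callback, and callers build the Datum tuple identifier accordingly
(my_am_get_row_ref_type is a hypothetical out-of-core callback; the
caller-side pattern mirrors the ExecOnConflictUpdate() hunk below):

	/* Hypothetical table AM: identify rows by an opaque bytea RowID. */
	static RowRefType
	my_am_get_row_ref_type(Relation rel)
	{
		return ROW_REF_ROWID;
	}

	/* Caller side: pick the Datum form of the row identifier for a slot. */
	Datum		tupleid;
	bool		isnull;

	if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
	{
		tupleid = slot_getsysattr(slot, RowIdAttributeNumber, &isnull);
		Assert(!isnull);
	}
	else
		tupleid = PointerGetDatum(&slot->tts_tid);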
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |  11 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 145 ++++++++-----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 +-
 src/backend/parser/parse_relation.c      |   7 +-
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  56 +++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/primnodes.h            |   1 +
 src/include/utils/tuplestore.h           |   3 +
 23 files changed, 509 insertions(+), 153 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 1ef4cff88e8..82f18810935 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -983,7 +983,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 5c89fbbef83..7b52c66939c 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -755,6 +755,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7ebebf4d6ac..ac24691bd29 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -47,7 +47,7 @@
 #include "utils/rel.h"
 #include "utils/sampling.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -186,7 +186,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -195,7 +195,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -360,7 +360,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -407,7 +407,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -587,12 +587,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -615,7 +616,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -632,7 +633,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -640,6 +641,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -687,7 +689,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -703,7 +705,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -711,6 +713,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2645,6 +2648,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use the TID row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -3224,6 +3236,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222cebc..caa79c6eddd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -300,7 +300,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -356,7 +356,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74a..8765becf986 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1626,7 +1626,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3309b4ebd2d..b2248bdfd87 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -76,7 +76,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2682,7 +2682,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2696,7 +2696,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2924,7 +2924,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2944,7 +2944,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3261,7 +3261,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3286,7 +3286,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3382,8 +3384,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3589,18 +3591,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* event size, stored
+																		 * in high-order bits */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3652,6 +3660,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -4004,14 +4015,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4044,7 +4075,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4063,6 +4094,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4103,7 +4135,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4263,6 +4314,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4294,15 +4346,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4334,13 +4388,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4376,13 +4443,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4556,7 +4633,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4659,6 +4736,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6051,6 +6129,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6144,6 +6224,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6154,6 +6240,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6173,6 +6267,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6188,10 +6290,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6229,20 +6378,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6387,6 +6522,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6408,7 +6557,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index a25ab7570fe..2fa3a0a4e36 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4552,7 +4552,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7eb1f7d0209..7b3fc3038a4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -867,13 +867,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/* open relation, if we need to access it for this mark type */
 			switch (rc->markType)
@@ -906,6 +908,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			if (erm->markType == ROW_MARK_COPY)
+				erm->refType = ROW_REF_COPY;
+			else
+				erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1269,6 +1275,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2701,7 +2708,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index db685473fc0..aad266a19ff 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -250,7 +250,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -434,7 +435,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -571,7 +573,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -638,7 +641,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 41754ddfea9..2d3ad904a64 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 321f2358c12..fef31456e17 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -124,7 +124,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldSlot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -141,13 +141,13 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple oldtuple,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static TupleTableSlot *ExecMergeMatched(ModifyTableContext *context,
 										ResultRelInfo *resultRelInfo,
-										ItemPointer tupleid,
+										Datum tupleid,
 										HeapTuple oldtuple,
 										bool canSetTag,
 										bool *matched);
@@ -1216,7 +1216,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1247,7 +1247,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1271,7 +1271,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1351,7 +1351,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldSlot,
 		   bool processReturning,
@@ -1544,7 +1544,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldSlot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1561,7 +1561,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1610,7 +1610,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1766,7 +1766,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1843,7 +1843,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -1997,7 +1997,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldSlot)
 {
@@ -2047,7 +2047,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2137,7 +2137,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldSlot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2191,10 +2191,14 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2302,7 +2306,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldSlot);
 
 	/* Process RETURNING if present */
@@ -2334,7 +2338,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2390,7 +2406,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2409,7 +2425,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag)
+		  Datum tupleid, HeapTuple oldtuple, bool canSetTag)
 {
 	TupleTableSlot *rslot = NULL;
 	bool		matched;
@@ -2458,7 +2474,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL || oldtuple != NULL;
+	matched = DatumGetPointer(tupleid) != NULL || oldtuple != NULL;
 	if (matched)
 		rslot = ExecMergeMatched(context, resultRelInfo, tupleid, oldtuple,
 								 canSetTag, &matched);
@@ -2499,7 +2515,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TupleTableSlot *
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag,
+				 Datum tupleid, HeapTuple oldtuple, bool canSetTag,
 				 bool *matched)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -2535,7 +2551,7 @@ ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * the tupleid of the target row, or an old tuple from the target wholerow
 	 * junk attr.
 	 */
-	Assert(tupleid != NULL || oldtuple != NULL);
+	Assert(DatumGetPointer(tupleid) != NULL || oldtuple != NULL);
 	if (oldtuple != NULL)
 		ExecForceStoreHeapTuple(oldtuple, resultRelInfo->ri_oldTupleSlot,
 								false);
@@ -2549,7 +2565,7 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-	if (tupleid != NULL)
+	if (DatumGetPointer(tupleid) != NULL)
 	{
 		if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 										   tupleid,
@@ -2658,7 +2674,7 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2694,7 +2710,7 @@ lmerge_matched:
 
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2818,9 +2834,13 @@ lmerge_matched:
 								return NULL;
 							}
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 							{
 								*matched = false;
@@ -2847,11 +2867,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3389,10 +3405,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3441,6 +3457,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3491,7 +3509,7 @@ ExecModifyTable(PlanState *pstate)
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 					slot = ExecMerge(&context, node->resultRelInfo,
-									 NULL, NULL, node->canSetTag);
+									 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 					/*
 					 * If we got a RETURNING result, return it to the caller.
@@ -3535,7 +3553,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3578,7 +3597,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3593,9 +3612,25 @@ ExecModifyTable(PlanState *pstate)
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3635,7 +3670,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3699,6 +3734,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3733,6 +3769,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -3976,10 +4015,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4289,6 +4338,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 864a9013b62..f4a124ac4eb 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -377,7 +377,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 4599b0dc761..3620be5b52c 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224a..83c08bbd0e1 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -895,17 +896,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType = ROW_REF_TID;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index d32b07bab57..171509aae62 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -283,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -486,7 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
-	childrte->reftype = ROW_REF_TID;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 10f2d287b39..2c80e010f2a 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -1504,7 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1590,7 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3267,6 +3267,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 9fd05b15e73..7a0fdbe3f40 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1854,6 +1854,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 947a868e569..d3a41533552 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces the tuple to be stored in the
+ * slot.  Thus, it can work with slot types other than minimal tuple.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index e88dec71ee9..867b5eb489e 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c32a3cbcf66..730e2bd94d3 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -472,7 +472,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -531,7 +531,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -542,7 +542,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -555,7 +555,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -701,6 +701,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * If this callback is not provided, rd_amcache is assumed to point to
@@ -1278,9 +1283,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1288,7 +1293,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1300,7 +1305,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1485,7 +1491,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1514,12 +1520,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1533,7 +1539,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1570,13 +1576,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1588,7 +1594,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1617,12 +1623,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1904,6 +1910,20 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  Returns ROW_REF_TID when table AM routine
+ * is not accessible.  This happens during catalog initialization.  All catalog
+ * tables are known to use heap.
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index cb968d03ecd..c16e6b6e5a0 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index bc06ff99e21..90233a4baf7 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2093,6 +2093,7 @@ typedef struct OnConflictExpr
 typedef enum RowRefType
 {
 	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
 	ROW_REF_COPY				/* Full row copy */
 } RowRefType;
 
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 419613c17ba..cf291a0d17a 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -70,6 +70,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.3 (Apple Git-145)

#17Japin Li
japinli@hotmail.com
In reply to: Alexander Korotkov (#16)
Re: Table AM Interface Enhancements

On Tue, 19 Mar 2024 at 21:05, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi, Pavel!

On Tue, Mar 19, 2024 at 11:34 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Tue, 19 Mar 2024 at 03:34, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Sun, Mar 3, 2024 at 1:50 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Nov 27, 2023 at 10:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On Nov 25, 2023, at 9:47 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Should the patch at least document which parts of the EState are expected to be in which states, and which parts should be viewed as undefined? If the implementors of table AMs rely on any/all aspects of EState, doesn't that prevent future changes to how that structure is used?

The new tuple_insert_with_arbiter() table AM API method needs an EState
argument to call executor functions: ExecCheckIndexConstraints(),
ExecUpdateLockMode(), and ExecInsertIndexTuples(). I think we
probably need to invent some opaque way to call these executor functions
without revealing EState to the table AM. Do you think this could work?

We're clearly not accessing all of the EState, just some specific fields, such as es_per_tuple_exprcontext. I think you could at least refactor to pass the minimum amount of state information through the table AM API.

Yes, the table AM doesn't need the full EState, just the ability to perform
specific manipulations on tuples. I'll refactor the patch to provide better
isolation here.
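
Purely as an illustration of that direction -- the names below are
invented and not part of the patchset -- the idea is to hand the table
AM a narrow callback bundle instead of the whole EState:

typedef struct InsertWithArbiterContext
{
	ResultRelInfo *resultRelInfo;	/* target relation info */

	/* check arbiter indexes; on conflict, fill *conflictTid and return false */
	bool		(*checkIndexConstraints) (struct InsertWithArbiterContext *ctx,
										  TupleTableSlot *slot,
										  ItemPointer conflictTid);

	/* insert index entries for a tuple the AM has already inserted */
	List	   *(*insertIndexTuples) (struct InsertWithArbiterContext *ctx,
									  TupleTableSlot *slot);

	void	   *executorPrivate;	/* opaque executor state */
} InsertWithArbiterContext;

The table AM would see only these callbacks, so the layout and lifetime
of EState would stay private to the executor.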

Please find the revised patchset attached. The changes are as follows:
1. The patchset is rebased to the current master.
2. The patchset is reordered. I tried to put the less debatable patches at the top.
3. The tuple_is_current() method is moved from the Table AM API to the
slot, as proposed by Matthias van de Meent.
4. An Assert is added to table_free_rd_amcache(), as proposed by Pavel Borisov.

Patches 0001-0002 are unchanged compared to the last version in thread [1]. In my opinion, they are still ready to be committed; that didn't happen a year ago only because it was too close to feature freeze.

0003 - Assert added since the previous version. I still hold a strong opinion that allowing multi-chunked data structures instead of single chunks is completely safe and is a natural, self-justified improvement to Postgres. The patch is simple enough and ready to be pushed.

0004 (previously 0007) - Unchanged, and there is consensus that this is reasonable. I've re-checked the current code; returning a different slot looks safe, which I doubted before. So consider this patch also ready.

0005 (previously 0004) - The unused argument in the is_current_xact_tuple() signature is removed. Also, compared to v1, the code has shifted from table AM methods to the TTS level.

I'd propose removing Assert(!TTS_EMPTY(slot)) from tts_minimal_is_current_xact_tuple() and tts_virtual_is_current_xact_tuple(), as these are only error-reporting functions that don't actually use the slot.

Comment similar to:
+/*
+ * VirtualTupleTableSlots never have a storage tuples.  We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
also applies to tts_minimal_is_current_xact_tuple()

I'd propose changes for clarity of this comment:
%s/a storage tuples/storage tuples/g
%s/generally//g

Otherwise patch 0005 also looks good to me.

I'm planning to review the remaining patches. Meanwhile, I think pushing what is now ready and uncontroversial is a good idea.
Thank you for the work done on this patchset!

Thank you, Pavel!

Regarding 0005, I did apply the "a storage tuples" grammar fix. As for
the rest, I'd like to keep the tts_*_is_current_xact_tuple() methods
similar to the nearby tts_*_getsysattr() methods. This is why I'm
keeping the rest unchanged. I think we could refactor that later, but
together with the tts_*_getsysattr() methods.

I'm going to push 0003, 0004 and 0005 if there are no objections.

And I'll update 0001 and 0002 in their dedicated thread.

When I tried to test the patch on Ubuntu 22.04 with GCC 11.4.0, I got some
warnings, as follows:

/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c: In function ‘heapam_acquire_sample_rows’:
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:1603:28: warning: implicit declaration of function ‘get_tablespace_maintenance_io_concurrency’ [-Wimplicit-function-declaration]
 1603 |         prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
      |                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:1757:30: warning: implicit declaration of function ‘floor’ [-Wimplicit-function-declaration]
 1757 |                 *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
      |                              ^~~~~
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:49:1: note: include ‘<math.h>’ or provide a declaration of ‘floor’
   48 | #include "utils/sampling.h"
  +++ |+#include <math.h>
   49 |
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:1757:30: warning: incompatible implicit declaration of built-in function ‘floor’ [-Wbuiltin-declaration-mismatch]
 1757 |                 *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
      |                              ^~~~~
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:1757:30: note: include ‘<math.h>’ or provide a declaration of ‘floor’
/home/japin/Codes/postgres/build/../src/backend/access/heap/heapam_handler.c:1603:21: warning: implicit declaration of function 'get_tablespace_maintenance_io_concurrency' is invalid in C99 [-Wimplicit-function-declaration]
        prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
                           ^

It seems you forgot to include math.h and utils/spccache.h header files
in heapam_handler.c.

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac24691bd2..04365394f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -19,6 +19,8 @@
  */
 #include "postgres.h"

+#include <math.h>
+
#include "access/genam.h"
#include "access/heapam.h"
#include "access/heaptoast.h"
@@ -46,6 +48,7 @@
#include "utils/builtins.h"
#include "utils/rel.h"
#include "utils/sampling.h"
+#include "utils/spccache.h"

static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
Snapshot snapshot, TupleTableSlot *slot,

#18Alexander Korotkov
aekorotkov@gmail.com
In reply to: Japin Li (#17)
13 attachment(s)
Re: Table AM Interface Enhancements

On Tue, Mar 19, 2024 at 4:26 PM Japin Li <japinli@hotmail.com> wrote:

On Tue, 19 Mar 2024 at 21:05, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Regarding 0005, I did apply "a storage tuples" grammar fix. Regarding
the rest of the things, I'd like to keep methods
tts_*_is_current_xact_tuple() to be similar to nearby
tts_*_getsysattr(). This is why I'm keeping the rest unchanged. I
think we could refactor that later, but together with
tts_*_getsysattr() methods.

I'm going to push 0003, 0004 and 0005 if there are no objections.

And I'll update 0001 and 0002 in their dedicated thread.

When I try to test the patch on Ubuntu 22.04 with GCC 11.4.0. There are some
warnings as following:

Thank you for catching this!
Please, find the revised patchset attached.

------
Regards,
Alexander Korotkov

Attachments:

0010-Notify-table-AM-about-index-creation-v4.patch (application/octet-stream)
From d13c92ba11e23df4cc2681cceb6ea6b1dfdef729 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH 10/13] Notify table AM about index creation

This allows a table AM to do some preparation for an index build.  In particular,
the table AM could update its specific meta-information.  That could also be useful
if the table AM overrides index implementations.
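
As a purely hypothetical illustration (exampleam is not part of this
patch), an AM could use the new callback to record the index in its own
metadata:

static bool
exampleam_define_index(Relation rel, Oid indoid, bool reindex,
					   bool skip_constraint_checks, bool skip_build,
					   void *arg)
{
	/* update AM-specific meta-information about the new index here */
	return true;
}

Heap does not need this, so heapam_methods leaves both new callbacks NULL.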
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3a7901ae92e..9c36e102934 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -3222,6 +3222,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e230..bca97981051 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3840,6 +3840,8 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7299ebbe9f3..7f24687c6d9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -583,6 +583,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -629,6 +630,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that tableId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -656,24 +677,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1218,6 +1221,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1cd6a92db6d..8b498eb6a76 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -687,6 +687,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1853,6 +1863,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, which should then be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notifies table AM about index creation on `rel` with oid `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.3 (Apple Git-145)

0012-Introduce-RowRefType-which-describes-the-table-ro-v4.patch (application/octet-stream)
From 1f29fd4a16f1c9a80bbc31742670a4509fc50ac4 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH 12/13] Introduce RowRefType, which describes the table row
 identifier

Currently, a table row can be identified either by ctid or by a whole-row copy
(for foreign tables).  But the row identifier is mixed together with the lock mode
in RowMarkType.  This commit separates the row identifier type into a separate
enum, RowRefType.
---
 src/backend/optimizer/plan/planner.c   | 16 +++++++++-----
 src/backend/optimizer/prep/preptlist.c |  4 ++--
 src/backend/optimizer/util/inherit.c   | 30 +++++++++++++++-----------
 src/backend/parser/parse_relation.c    | 10 +++++++++
 src/include/nodes/execnodes.h          |  4 ++++
 src/include/nodes/parsenodes.h         |  1 +
 src/include/nodes/plannodes.h          |  4 ++--
 src/include/nodes/primnodes.h          |  7 ++++++
 src/include/optimizer/planner.h        |  3 ++-
 src/tools/pgindent/typedefs.list       |  1 +
 10 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5564826cb4a..6cbabe83adf 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2279,6 +2279,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		RowMarkClause *rc = lfirst_node(RowMarkClause, l);
 		RangeTblEntry *rte = rt_fetch(rc->rti, parse->rtable);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		/*
 		 * Currently, it is syntactically impossible to have FOR UPDATE et al
@@ -2301,8 +2302,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2318,6 +2319,7 @@ preprocess_rowmarks(PlannerInfo *root)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, l);
 		PlanRowMark *newrc;
+		RowRefType	refType;
 
 		i++;
 		if (!bms_is_member(i, rels))
@@ -2326,8 +2328,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &refType);
+		newrc->allRefTypes = (1 << refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2342,11 +2344,13 @@ preprocess_rowmarks(PlannerInfo *root)
  * Select RowMarkType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
 		/* If it's not a table at all, use ROW_MARK_COPY */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
@@ -2357,11 +2361,13 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
 		if (fdwroutine->GetForeignRowMarkType != NULL)
 			return fdwroutine->GetForeignRowMarkType(rte, strength);
 		/* Otherwise, use ROW_MARK_COPY by default */
+		*refType = ROW_REF_COPY;
 		return ROW_MARK_COPY;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 7698bfa1a58..4599b0dc761 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 5c7acf8a901..d32b07bab57 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -91,7 +92,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +132,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +240,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +267,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +442,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -485,6 +486,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = ROW_REF_TID;
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
@@ -574,14 +576,16 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	if (top_parentrc)
 	{
 		PlanRowMark *childrc = makeNode(PlanRowMark);
+		RowRefType	refType;
 
 		childrc->rti = childRTindex;
 		childrc->prti = top_parentrc->rti;
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&refType);
+		childrc->allRefTypes = (1 << refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +596,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 427b7325db8..10f2d287b39 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1503,6 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1588,6 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = ROW_REF_TID;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1763,6 +1767,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2081,6 +2086,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2156,6 +2162,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2252,6 +2259,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2332,6 +2340,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2494,6 +2503,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 92593526725..acd9672d789 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -455,6 +455,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row identifier for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -750,6 +753,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row identifier for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7b57fddf2d0..72c8c4caf24 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1079,6 +1079,7 @@ typedef struct RangeTblEntry
 	int			rellockmode;	/* lock level that query requires on the rel */
 	Index		perminfoindex;	/* index of RTEPermissionInfo entry, or 0 */
 	struct TableSampleClause *tablesample;	/* sampling info, or NULL */
+	RowRefType	reftype;		/* row identifier for relation */
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b4ef6bc44c9..dbe5c535560 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1351,7 +1351,7 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
@@ -1381,7 +1381,7 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 8df8884001d..bc06ff99e21 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2089,4 +2089,11 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/* The row identifier */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index e1d79ffdf3c..98fc796d054 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 042d04c8de2..e5ae4288428 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2420,6 +2420,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.3 (Apple Git-145)
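
A note on the 0012 hunks above: with RowRefType in place, a parent
PlanRowMark aggregates its children's row reference types the same way
allMarkTypes used to aggregate mark types, i.e. as a bitmask of
(1 << refType).  Below is a minimal illustrative sketch of that
bookkeeping, assuming the 0012 patch is applied; the two helper
functions are hypothetical and exist only to show how the mask is set
and tested.

#include "postgres.h"

#include "nodes/plannodes.h"	/* PlanRowMark */
#include "nodes/primnodes.h"	/* RowRefType */

/* Fold one child's reference type into the parent rowmark's mask. */
static void
record_child_ref_type(PlanRowMark *parent_rc, RowRefType refType)
{
	parent_rc->allRefTypes |= (1 << refType);
}

/* Does any child identify its rows by a whole-row copy? */
static bool
needs_whole_row_junk(const PlanRowMark *parent_rc)
{
	return (parent_rc->allRefTypes & (1 << ROW_REF_COPY)) != 0;
}

preprocess_targetlist() tests exactly these bits to decide which junk
columns (ctid, whole-row copy, and -- with the 0013 patch below --
rowid) to emit for each row mark.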

0013-Introduce-RowID-bytea-tuple-identifier-v4.patch (application/octet-stream)
From aabac3f776c74b38b6e9e8c97ba5b229c9fe2d3c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 26 Jun 2023 04:26:30 +0300
Subject: [PATCH 13/13] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: by tuple identifier (tid)
or by a whole-row copy.  The tuple identifier used for regular tables consists
of a 32-bit block number and a 16-bit offset.  This is too limited for some
use cases, in particular index-organized tables.  The whole-row copy is used
to identify tuples in FDWs.  That could be extended to regular tables, but it
seems like overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose how its tuples are identified by providing the new get_row_ref_type()
API function.  The new system attribute RowIdAttributeNumber holds the RowID
when appropriate.  Table AM methods now accept Datum arguments as tuple
identifiers.  Those Datums can be either tid or bytea depending on what
table_get_row_ref_type() says.  The ModifyTable node and triggers are aware
of RowIDs.  The IndexScan and BitmapScan nodes are not aware of RowIDs and
expect tids.  Table AMs that use RowIDs are expected to redefine those nodes
using hooks.
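
To make the intended usage concrete, below is a minimal sketch of how a
table AM that identifies its rows by bytea RowIDs would plug into this
interface.  The callback names, ROW_REF_ROWID, and the Datum-based
signatures are those introduced by this patch set; the myam_* routines
and the elided lookup logic are hypothetical.

#include "postgres.h"

#include "access/tableam.h"
#include "fmgr.h"

/*
 * Report that rows of this AM are identified by bytea RowIDs.  The
 * executor will then pass bytea Datums (rather than ItemPointers) to
 * tuple_fetch_row_version(), tuple_update(), tuple_delete() and
 * tuple_lock(), and carry them in the RowIdAttributeNumber system
 * attribute and the "rowid" junk column.
 */
static RowRefType
myam_get_row_ref_type(Relation rel)
{
	return ROW_REF_ROWID;
}

static bool
myam_fetch_row_version(Relation rel, Datum tupleid,
					   Snapshot snapshot, TupleTableSlot *slot)
{
	bytea	   *rowid = DatumGetByteaP(tupleid);	/* AM-specific key */

	/* ... look the row up by its bytea key and store it into "slot" ... */
	(void) rowid;
	return false;				/* stub: no row found */
}

static const TableAmRoutine myam_methods = {
	.type = T_TableAmRoutine,
	.get_row_ref_type = myam_get_row_ref_type,
	.tuple_fetch_row_version = myam_fetch_row_version,
	/* ... remaining callbacks ... */
};

Nothing more is needed on the AM side for ModifyTable and triggers to
pick up RowIDs; as noted above, IndexScan and BitmapScan still expect
tids, so a RowID-based AM also has to override those nodes via hooks.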
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |  11 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 145 ++++++++-----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 +-
 src/backend/parser/parse_relation.c      |   7 +-
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  56 +++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/primnodes.h            |   1 +
 src/include/utils/tuplestore.h           |   3 +
 23 files changed, 509 insertions(+), 153 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 1ef4cff88e8..82f18810935 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -983,7 +983,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 5c89fbbef83..7b52c66939c 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -755,6 +755,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 811b0df5abf..04365394f1e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -50,7 +50,7 @@
 #include "utils/sampling.h"
 #include "utils/spccache.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -189,7 +189,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -198,7 +198,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -363,7 +363,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -410,7 +410,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -590,12 +590,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -618,7 +619,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -635,7 +636,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -643,6 +644,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -690,7 +692,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -706,7 +708,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -714,6 +716,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2648,6 +2651,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use TID as the row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -3227,6 +3239,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222cebc..caa79c6eddd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -300,7 +300,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -356,7 +356,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74a..8765becf986 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1626,7 +1626,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3309b4ebd2d..b2248bdfd87 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -76,7 +76,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2682,7 +2682,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2696,7 +2696,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2924,7 +2924,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2944,7 +2944,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3261,7 +3261,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3286,7 +3286,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3382,8 +3384,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3589,18 +3591,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3652,6 +3660,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -4004,14 +4015,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4044,7 +4075,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4063,6 +4094,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4103,7 +4135,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4263,6 +4314,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4294,15 +4346,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4334,13 +4388,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4376,13 +4443,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4556,7 +4633,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4659,6 +4736,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6051,6 +6129,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6144,6 +6224,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6154,6 +6240,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6173,6 +6267,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6188,10 +6290,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6229,20 +6378,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6387,6 +6522,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6408,7 +6557,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index a25ab7570fe..2fa3a0a4e36 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4552,7 +4552,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7eb1f7d0209..7b3fc3038a4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -867,13 +867,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/* open relation, if we need to access it for this mark type */
 			switch (rc->markType)
@@ -906,6 +908,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			if (erm->markType == ROW_MARK_COPY)
+				erm->refType = ROW_REF_COPY;
+			else
+				erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1269,6 +1275,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2701,7 +2708,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index db685473fc0..aad266a19ff 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -250,7 +250,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -434,7 +435,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -571,7 +573,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -638,7 +641,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 41754ddfea9..2d3ad904a64 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a64e37e9af9..90eeb99b2cd 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -124,7 +124,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldslot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -141,13 +141,13 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple oldtuple,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static TupleTableSlot *ExecMergeMatched(ModifyTableContext *context,
 										ResultRelInfo *resultRelInfo,
-										ItemPointer tupleid,
+										Datum tupleid,
 										HeapTuple oldtuple,
 										bool canSetTag,
 										bool *matched);
@@ -1221,7 +1221,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1252,7 +1252,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1280,7 +1280,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1361,7 +1361,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldslot,
 		   bool processReturning,
@@ -1558,7 +1558,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldslot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1575,7 +1575,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1624,7 +1624,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1783,7 +1783,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1860,7 +1860,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -2014,7 +2014,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldslot)
 {
@@ -2064,7 +2064,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2154,7 +2154,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldslot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2208,15 +2208,19 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
 		/*
 		 * Specify that we need to lock and fetch the last tuple version for
 		 * EPQ on appropriate transaction isolation levels if the tuple isn't
 		 * locked already.
 		 */
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2326,7 +2330,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldslot);
 
 	/* Process RETURNING if present */
@@ -2358,7 +2362,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2414,7 +2430,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2433,7 +2449,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag)
+		  Datum tupleid, HeapTuple oldtuple, bool canSetTag)
 {
 	TupleTableSlot *rslot = NULL;
 	bool		matched;
@@ -2482,7 +2498,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL || oldtuple != NULL;
+	matched = DatumGetPointer(tupleid) != NULL || oldtuple != NULL;
 	if (matched)
 		rslot = ExecMergeMatched(context, resultRelInfo, tupleid, oldtuple,
 								 canSetTag, &matched);
@@ -2523,7 +2539,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TupleTableSlot *
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag,
+				 Datum tupleid, HeapTuple oldtuple, bool canSetTag,
 				 bool *matched)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -2559,7 +2575,7 @@ ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * the tupleid of the target row, or an old tuple from the target wholerow
 	 * junk attr.
 	 */
-	Assert(tupleid != NULL || oldtuple != NULL);
+	Assert(DatumGetPointer(tupleid) != NULL || oldtuple != NULL);
 	if (oldtuple != NULL)
 		ExecForceStoreHeapTuple(oldtuple, resultRelInfo->ri_oldTupleSlot,
 								false);
@@ -2573,7 +2589,7 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-	if (tupleid != NULL)
+	if (DatumGetPointer(tupleid) != NULL)
 	{
 		if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 										   tupleid,
@@ -2682,7 +2698,7 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2718,7 +2734,7 @@ lmerge_matched:
 
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2842,9 +2858,13 @@ lmerge_matched:
 								return NULL;
 							}
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 							{
 								*matched = false;
@@ -2871,11 +2891,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3413,10 +3429,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3465,6 +3481,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3515,7 +3533,7 @@ ExecModifyTable(PlanState *pstate)
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 					slot = ExecMerge(&context, node->resultRelInfo,
-									 NULL, NULL, node->canSetTag);
+									 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 					/*
 					 * If we got a RETURNING result, return it to the caller.
@@ -3559,7 +3577,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3602,7 +3621,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3617,9 +3636,25 @@ ExecModifyTable(PlanState *pstate)
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3659,7 +3694,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3723,6 +3758,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3757,6 +3793,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -4000,10 +4039,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4313,6 +4362,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 864a9013b62..f4a124ac4eb 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -377,7 +377,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 4599b0dc761..3620be5b52c 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224a..83c08bbd0e1 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -895,17 +896,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index d32b07bab57..171509aae62 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -283,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -486,7 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
-	childrte->reftype = ROW_REF_TID;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 10f2d287b39..2c80e010f2a 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -1504,7 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1590,7 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
-	rte->reftype = ROW_REF_TID;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3267,6 +3267,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 9fd05b15e73..7a0fdbe3f40 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1854,6 +1854,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 947a868e569..d3a41533552 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces the tuple to be stored into
+ * the slot.  Thus, it can work with slot types other than minimal tuple slots.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index e88dec71ee9..867b5eb489e 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 047094dfddf..41b3206fa51 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -476,7 +476,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -535,7 +535,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -546,7 +546,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -559,7 +559,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -705,6 +705,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * If this callback is not provided, rd_amcache is assumed to point to
@@ -1282,9 +1287,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1292,7 +1297,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1304,7 +1309,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1489,7 +1495,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1518,12 +1524,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1537,7 +1543,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1574,13 +1580,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1592,7 +1598,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1621,12 +1627,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1908,6 +1914,20 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  Returns ROW_REF_TID when the table AM
+ * routine is not yet accessible, which happens during catalog initialization.
+ * All catalog tables are known to use heap.
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index cb968d03ecd..c16e6b6e5a0 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index bc06ff99e21..90233a4baf7 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2093,6 +2093,7 @@ typedef struct OnConflictExpr
 typedef enum RowRefType
 {
 	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
 	ROW_REF_COPY				/* Full row copy */
 } RowRefType;
 
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 419613c17ba..cf291a0d17a 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -70,6 +70,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.3 (Apple Git-145)

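To make the new get_row_ref_type() callback concrete, here is a minimal
sketch (not part of the patch above; the myam_* names are hypothetical) of
how a table AM built around bytea row identifiers could advertise
ROW_REF_ROWID:

    #include "postgres.h"

    #include "access/tableam.h"
    #include "nodes/primnodes.h"

    /* Hypothetical table AM that addresses rows by a bytea RowID, not a TID. */
    static RowRefType
    myam_get_row_ref_type(Relation rel)
    {
        return ROW_REF_ROWID;
    }

    static const TableAmRoutine myam_methods = {
        .type = T_TableAmRoutine,
        /* ... the AM's other callbacks would be filled in here ... */
        .get_row_ref_type = myam_get_row_ref_type,
    };

Callers go through table_get_row_ref_type(), which falls back to ROW_REF_TID
while rd_tableam is not yet set (catalog initialization), so catalog
relations keep their TID-based behavior.
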
0011-Let-table-AM-insertion-methods-control-index-inse-v4.patchapplication/octet-stream; name=0011-Let-table-AM-insertion-methods-control-index-inse-v4.patchDownload
From e4e1cba9c70201401fcaeddee4f01c6ec0401f24 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH 11/13] Let table AM insertion methods control index insertion

A new parameter for the tuple_insert() and multi_insert() methods provides a
way to skip index insertions in the executor.  In this case, the table AM can
handle index insertions itself.
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 23 ++++++++++++++++-------
 12 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f6478f89e77..facad25d5c1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2091,7 +2091,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2440,6 +2441,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9c36e102934..811b0df5abf 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -250,7 +250,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -266,6 +266,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8d3675be959..805d222cebc 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -273,9 +273,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58a..4d404f22f83 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 8908a440e19..b6736369771 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -397,6 +397,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -416,7 +417,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -425,7 +427,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1265,11 +1267,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..afd3dace079 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -578,6 +578,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -594,7 +595,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6d09b755564..9ec13d09846 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -476,6 +476,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -490,7 +491,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fa8eb55b189..c7ffb5c17fe 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6350,8 +6350,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 0cad843fb69..db685473fc0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -509,6 +509,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -523,9 +524,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8e1c8f697c6..a64e37e9af9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1040,13 +1040,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 45954b8003d..cbb73536289 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -274,7 +274,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8b498eb6a76..047094dfddf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -514,7 +514,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -529,7 +530,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1398,6 +1400,10 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * This function sets `*insert_indexes` to true if it expects the caller to
+ * insert the relevant index tuples.  If `*insert_indexes` is set to false,
+ * then this function takes care of index insertions itself.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot.  For instance, the source slot may be a VirtualTupleTableSlot,
  * but the result corresponds to the table AM.  On return the slot's tts_tid and
@@ -1406,10 +1412,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1467,10 +1474,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2161,7 +2169,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.3 (Apple Git-145)

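As a rough sketch of the contract added by the patch above (illustrative
only; myam_* is a hypothetical AM, not part of the series): a table AM that
maintains its index entries itself during tuple_insert() clears
*insert_indexes, and the executor then skips ExecInsertIndexTuples() for that
row, while the heap AM keeps reporting true so existing behavior is
unchanged:

    #include "postgres.h"

    #include "access/tableam.h"
    #include "executor/tuptable.h"

    static TupleTableSlot *
    myam_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
                      int options, struct BulkInsertStateData *bistate,
                      bool *insert_indexes)
    {
        /* ... write the tuple into AM-specific storage and its own indexes ... */

        /*
         * Index entries were already maintained here, so tell the executor
         * not to call ExecInsertIndexTuples() for this tuple.
         */
        *insert_indexes = false;

        return slot;
    }
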
0001-Allow-locking-updated-tuples-in-tuple_update-and--v4.patchapplication/octet-stream; name=0001-Allow-locking-updated-tuples-in-tuple_update-and--v4.patchDownload
From a7d74aa9ca4485e94ff2e4bad6f65facd3574015 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 19 Mar 2024 15:18:41 +0200
Subject: [PATCH 01/13] Allow locking updated tuples in tuple_update() and
 tuple_delete()

Currently, in read committed transaction isolation mode (the default), we have
the following sequence of actions when tuple_update()/tuple_delete() finds
that the tuple has been updated by a concurrent transaction.

1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
   returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
   calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
   This attempt should be successful, since the tuple was previously locked.

This commit eliminates step 2 by taking the lock during the first
tuple_update()/tuple_delete() call.  The heap table access method saves some
effort by checking the updated tuple once instead of twice.  Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.

Also, this commit makes tuple_update()/tuple_delete() optionally save the old
tuple into a dedicated slot.  That saves the effort of re-fetching tuples in
certain cases.

The code in nodeModifyTable.c is simplified by removing the nested switch/case.

Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
---
 src/backend/access/heap/heapam.c         | 205 +++++++++----
 src/backend/access/heap/heapam_handler.c |  94 ++++--
 src/backend/access/table/tableam.c       |  26 +-
 src/backend/commands/trigger.c           |  55 +---
 src/backend/executor/execReplication.c   |  19 +-
 src/backend/executor/nodeModifyTable.c   | 353 ++++++++++-------------
 src/include/access/heapam.h              |  19 +-
 src/include/access/tableam.h             |  73 +++--
 src/include/commands/trigger.h           |   4 +-
 9 files changed, 502 insertions(+), 346 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625f..f6478f89e77 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2499,10 +2499,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 }
 
 /*
- *	heap_delete - delete a tuple
+ *	heap_delete - delete a tuple, optionally fetching it into a slot
  *
  * See table_tuple_delete() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, we don't
+ * place a lock on the tuple in this function, just fetch the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2511,8 +2512,9 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, ItemPointer tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart)
+			CommandId cid, Snapshot crosscheck, int options,
+			TM_FailureData *tmfd, bool changingPart,
+			TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -2590,7 +2592,7 @@ l1:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to delete invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -2731,7 +2733,30 @@ l1:
 			tmfd->cmax = HeapTupleHeaderGetCmax(tp.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = tp;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive);
 		if (vmbuffer != InvalidBuffer)
@@ -2905,8 +2930,24 @@ l1:
 	 */
 	CacheInvalidateHeapTuple(relation, &tp, NULL);
 
-	/* Now we can release the buffer */
-	ReleaseBuffer(buffer);
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = tp;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
 
 	/*
 	 * Release the lmgr tuple lock, if we had it.
@@ -2938,8 +2979,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, false /* changingPart */ );
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, false /* changingPart */ , NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -2966,10 +3007,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 }
 
 /*
- *	heap_update - replace a tuple
+ *	heap_update - replace a tuple, optionally fetching it into a slot
  *
  * See table_tuple_update() for an explanation of the parameters, except that
- * this routine directly takes a tuple rather than a slot.
+ * this routine directly takes a tuple rather than a slot.  Also, this function
+ * does not place a lock on the tuple; it only fetches the existing version.
  *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax (resolving a possible MultiXact, if necessary), and t_cmax (the last
@@ -2978,9 +3020,9 @@ simple_heap_delete(Relation relation, ItemPointer tid)
  */
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			CommandId cid, Snapshot crosscheck, int options,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
-			TU_UpdateIndexes *update_indexes)
+			TU_UpdateIndexes *update_indexes, TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3157,7 +3199,7 @@ l2:
 	result = HeapTupleSatisfiesUpdate(&oldtup, cid, buffer);
 
 	/* see below about the "no wait" case */
-	Assert(result != TM_BeingModified || wait);
+	Assert(result != TM_BeingModified || (options & TABLE_MODIFY_WAIT));
 
 	if (result == TM_Invisible)
 	{
@@ -3166,7 +3208,7 @@ l2:
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("attempted to update invisible tuple")));
 	}
-	else if (result == TM_BeingModified && wait)
+	else if (result == TM_BeingModified && (options & TABLE_MODIFY_WAIT))
 	{
 		TransactionId xwait;
 		uint16		infomask;
@@ -3370,7 +3412,30 @@ l2:
 			tmfd->cmax = HeapTupleHeaderGetCmax(oldtup.t_data);
 		else
 			tmfd->cmax = InvalidCommandId;
-		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * If we're asked to lock the updated tuple, we just fetch the
+		 * existing tuple.  That lets the caller save some resources on
+		 * placing the lock.
+		 */
+		if (result == TM_Updated &&
+			(options & TABLE_MODIFY_LOCK_UPDATED))
+		{
+			BufferHeapTupleTableSlot *bslot;
+
+			Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+			bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			bslot->base.tupdata = oldtup;
+			ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+										   oldSlot,
+										   buffer);
+		}
+		else
+		{
+			UnlockReleaseBuffer(buffer);
+		}
 		if (have_tuple_lock)
 			UnlockTupleTuplock(relation, &(oldtup.t_self), *lockmode);
 		if (vmbuffer != InvalidBuffer)
@@ -3849,7 +3914,26 @@ l2:
 	/* Now we can release the buffer(s) */
 	if (newbuf != buffer)
 		ReleaseBuffer(newbuf);
-	ReleaseBuffer(buffer);
+
+	/* Fetch the old tuple version if we're asked for that. */
+	if (options & TABLE_MODIFY_FETCH_OLD_TUPLE)
+	{
+		BufferHeapTupleTableSlot *bslot;
+
+		Assert(TTS_IS_BUFFERTUPLE(oldSlot));
+		bslot = (BufferHeapTupleTableSlot *) oldSlot;
+
+		bslot->base.tupdata = oldtup;
+		ExecStorePinnedBufferHeapTuple(&bslot->base.tupdata,
+									   oldSlot,
+									   buffer);
+	}
+	else
+	{
+		/* Now we can release the buffer */
+		ReleaseBuffer(buffer);
+	}
+
 	if (BufferIsValid(vmbuffer_new))
 		ReleaseBuffer(vmbuffer_new);
 	if (BufferIsValid(vmbuffer))
@@ -4057,8 +4141,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
 
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
-						 true /* wait for commit */ ,
-						 &tmfd, &lockmode, update_indexes);
+						 TABLE_MODIFY_WAIT /* wait for commit */ ,
+						 &tmfd, &lockmode, update_indexes, NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -4121,12 +4205,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  *		tuples.
  *
  * Output parameters:
- *	*tuple: all fields filled in
- *	*buffer: set to buffer holding tuple (pinned but not locked at exit)
+ *	*slot: BufferHeapTupleTableSlot filled with tuple
  *	*tmfd: filled in failure cases (see below)
  *
  * Function results are the same as the ones for table_tuple_lock().
  *
+ * If *slot already contains the target tuple, this function takes advantage
+ * of that by skipping the ReadBuffer() call.
+ *
  * In the failure cases other than TM_Invisible, the routine fills
  * *tmfd with the tuple's t_ctid, t_xmax (resolving a possible MultiXact,
  * if necessary), and t_cmax (the last only for TM_SelfModified,
@@ -4137,15 +4223,14 @@ get_mxact_status_for_lock(LockTupleMode mode, bool is_update)
  * See README.tuplock for a thorough explanation of this mechanism.
  */
 TM_Result
-heap_lock_tuple(Relation relation, HeapTuple tuple,
+heap_lock_tuple(Relation relation, ItemPointer tid, TupleTableSlot *slot,
 				CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-				bool follow_updates,
-				Buffer *buffer, TM_FailureData *tmfd)
+				bool follow_updates, TM_FailureData *tmfd)
 {
 	TM_Result	result;
-	ItemPointer tid = &(tuple->t_self);
 	ItemId		lp;
 	Page		page;
+	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	BlockNumber block;
 	TransactionId xid,
@@ -4157,8 +4242,24 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	bool		skip_tuple_lock = false;
 	bool		have_tuple_lock = false;
 	bool		cleared_all_frozen = false;
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	HeapTuple	tuple = &bslot->base.tupdata;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	*buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	/* Take advantage if the slot already contains the relevant tuple */
+	if (!TTS_EMPTY(slot) &&
+		slot->tts_tableOid == relation->rd_id &&
+		ItemPointerCompare(&slot->tts_tid, tid) == 0 &&
+		BufferIsValid(bslot->buffer))
+	{
+		buffer = bslot->buffer;
+		IncrBufferRefCount(buffer);
+	}
+	else
+	{
+		buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+	}
 	block = ItemPointerGetBlockNumber(tid);
 
 	/*
@@ -4167,21 +4268,22 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,
 	 * in the middle of changing this, so we'll need to recheck after we have
 	 * the lock.
 	 */
-	if (PageIsAllVisible(BufferGetPage(*buffer)))
+	if (PageIsAllVisible(BufferGetPage(buffer)))
 		visibilitymap_pin(relation, block, &vmbuffer);
 
-	LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	page = BufferGetPage(*buffer);
+	page = BufferGetPage(buffer);
 	lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
 	Assert(ItemIdIsNormal(lp));
 
+	tuple->t_self = *tid;
 	tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
 	tuple->t_len = ItemIdGetLength(lp);
 	tuple->t_tableOid = RelationGetRelid(relation);
 
 l3:
-	result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
+	result = HeapTupleSatisfiesUpdate(tuple, cid, buffer);
 
 	if (result == TM_Invisible)
 	{
@@ -4210,7 +4312,7 @@ l3:
 		infomask2 = tuple->t_data->t_infomask2;
 		ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
 
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If any subtransaction of the current top transaction already holds
@@ -4362,12 +4464,12 @@ l3:
 					{
 						result = res;
 						/* recovery code expects to have buffer lock held */
-						LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+						LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 						goto failed;
 					}
 				}
 
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4402,7 +4504,7 @@ l3:
 			if (HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
 				!HEAP_XMAX_IS_EXCL_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/*
 				 * Make sure it's still an appropriate lock, else start over.
@@ -4430,7 +4532,7 @@ l3:
 					 * No conflict, but if the xmax changed under us in the
 					 * meantime, start over.
 					 */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 						!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 											 xwait))
@@ -4442,7 +4544,7 @@ l3:
 			}
 			else if (HEAP_XMAX_IS_KEYSHR_LOCKED(infomask))
 			{
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 				/* if the xmax changed in the meantime, start over */
 				if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
@@ -4470,7 +4572,7 @@ l3:
 			TransactionIdIsCurrentTransactionId(xwait))
 		{
 			/* ... but if the xmax changed in the meantime, start over */
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
 				!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
 									 xwait))
@@ -4492,7 +4594,7 @@ l3:
 		 */
 		if (require_sleep && (result == TM_Updated || result == TM_Deleted))
 		{
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 			goto failed;
 		}
 		else if (require_sleep)
@@ -4517,7 +4619,7 @@ l3:
 				 */
 				result = TM_WouldBlock;
 				/* recovery code expects to have buffer lock held */
-				LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+				LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 				goto failed;
 			}
 
@@ -4543,7 +4645,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4583,7 +4685,7 @@ l3:
 						{
 							result = TM_WouldBlock;
 							/* recovery code expects to have buffer lock held */
-							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+							LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 							goto failed;
 						}
 						break;
@@ -4609,12 +4711,12 @@ l3:
 				{
 					result = res;
 					/* recovery code expects to have buffer lock held */
-					LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+					LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 					goto failed;
 				}
 			}
 
-			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 			/*
 			 * xwait is done, but if xwait had just locked the tuple then some
@@ -4636,7 +4738,7 @@ l3:
 				 * don't check for this in the multixact case, because some
 				 * locker transactions might still be running.
 				 */
-				UpdateXmaxHintBits(tuple->t_data, *buffer, xwait);
+				UpdateXmaxHintBits(tuple->t_data, buffer, xwait);
 			}
 		}
 
@@ -4695,9 +4797,9 @@ failed:
 	 */
 	if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
 	{
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 		visibilitymap_pin(relation, block, &vmbuffer);
-		LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 		goto l3;
 	}
 
@@ -4760,7 +4862,7 @@ failed:
 		cleared_all_frozen = true;
 
 
-	MarkBufferDirty(*buffer);
+	MarkBufferDirty(buffer);
 
 	/*
 	 * XLOG stuff.  You might think that we don't need an XLOG record because
@@ -4780,7 +4882,7 @@ failed:
 		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
-		XLogRegisterBuffer(0, *buffer, REGBUF_STANDARD);
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
 
 		xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
 		xlrec.xmax = xid;
@@ -4801,7 +4903,7 @@ failed:
 	result = TM_Ok;
 
 out_locked:
-	LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 out_unlocked:
 	if (BufferIsValid(vmbuffer))
@@ -4819,6 +4921,9 @@ out_unlocked:
 	if (have_tuple_lock)
 		UnlockTupleTuplock(relation, tid, mode);
 
+	/* Put the target tuple into the slot */
+	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
+
 	return result;
 }
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..7c7204a2422 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -45,6 +45,12 @@
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
+static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+								   Snapshot snapshot, TupleTableSlot *slot,
+								   CommandId cid, LockTupleMode mode,
+								   LockWaitPolicy wait_policy, uint8 flags,
+								   TM_FailureData *tmfd);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -298,23 +304,55 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
-					Snapshot snapshot, Snapshot crosscheck, bool wait,
-					TM_FailureData *tmfd, bool changingPart)
+					Snapshot snapshot, Snapshot crosscheck, int options,
+					TM_FailureData *tmfd, bool changingPart,
+					TupleTableSlot *oldSlot)
 {
+	TM_Result	result;
+
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+	result = heap_delete(relation, tid, cid, crosscheck, options,
+						 tmfd, changingPart, oldSlot);
+
+	/*
+	 * If the tuple has been concurrently updated, then get the lock on it
+	 * (but only if the caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option).  With the lock held, a retry of the
+	 * delete should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of the tuple loaded into
+		 * oldSlot by heap_delete().
+		 */
+		result = heapam_tuple_lock(relation, tid, snapshot,
+								   oldSlot, cid, LockTupleExclusive,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
+	return result;
 }
 
 
 static TM_Result
 heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-					bool wait, TM_FailureData *tmfd,
-					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes)
+					int options, TM_FailureData *tmfd,
+					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
+					TupleTableSlot *oldSlot)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -324,8 +362,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode, update_indexes);
+	result = heap_update(relation, otid, tuple, cid, crosscheck, options,
+						 tmfd, lockmode, update_indexes, oldSlot);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
@@ -352,6 +390,31 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	if (shouldFree)
 		pfree(tuple);
 
+	/*
+	 * If the tuple has been concurrently updated, then get the lock on it
+	 * (but only if the caller asked for this by setting the
+	 * TABLE_MODIFY_LOCK_UPDATED option).  With the lock held, a retry of the
+	 * update should succeed even if there are more concurrent update
+	 * attempts.
+	 */
+	if (result == TM_Updated && (options & TABLE_MODIFY_LOCK_UPDATED))
+	{
+		/*
+		 * heapam_tuple_lock() will take advantage of the tuple loaded into
+		 * oldSlot by heap_update().
+		 */
+		result = heapam_tuple_lock(relation, otid, snapshot,
+								   oldSlot, cid, *lockmode,
+								   (options & TABLE_MODIFY_WAIT) ?
+								   LockWaitBlock :
+								   LockWaitSkip,
+								   TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
+								   tmfd);
+
+		if (result == TM_Ok)
+			return TM_Updated;
+	}
+
 	return result;
 }
 
@@ -363,7 +426,6 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 {
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
-	Buffer		buffer;
 	HeapTuple	tuple = &bslot->base.tupdata;
 	bool		follow_updates;
 
@@ -373,9 +435,8 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 tuple_lock_retry:
-	tuple->t_self = *tid;
-	result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
-							 follow_updates, &buffer, tmfd);
+	result = heap_lock_tuple(relation, tid, slot, cid, mode, wait_policy,
+							 follow_updates, tmfd);
 
 	if (result == TM_Updated &&
 		(flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
@@ -383,8 +444,6 @@ tuple_lock_retry:
 		/* Should not encounter speculative tuple on recheck */
 		Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
 
-		ReleaseBuffer(buffer);
-
 		if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
 		{
 			SnapshotData SnapshotDirty;
@@ -406,6 +465,8 @@ tuple_lock_retry:
 			InitDirtySnapshot(SnapshotDirty);
 			for (;;)
 			{
+				Buffer		buffer = InvalidBuffer;
+
 				if (ItemPointerIndicatesMovedPartitions(tid))
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
@@ -500,7 +561,7 @@ tuple_lock_retry:
 					/*
 					 * This is a live tuple, so try to lock it again.
 					 */
-					ReleaseBuffer(buffer);
+					ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
 					goto tuple_lock_retry;
 				}
 
@@ -511,7 +572,7 @@ tuple_lock_retry:
 				 */
 				if (tuple->t_data == NULL)
 				{
-					Assert(!BufferIsValid(buffer));
+					ReleaseBuffer(buffer);
 					return TM_Deleted;
 				}
 
@@ -564,9 +625,6 @@ tuple_lock_retry:
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	/* store in slot, transferring existing pin */
-	ExecStorePinnedBufferHeapTuple(tuple, slot, buffer);
-
 	return result;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index e57a0b7ea31..8d3675be959 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -287,16 +287,23 @@ simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
  * via ereport().
  */
 void
-simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot)
+simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch old tuple if the relevant slot is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_delete(rel, tid,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, false /* changingPart */ );
+								options,
+								&tmfd, false /* changingPart */ ,
+								oldSlot);
 
 	switch (result)
 	{
@@ -335,17 +342,24 @@ void
 simple_table_tuple_update(Relation rel, ItemPointer otid,
 						  TupleTableSlot *slot,
 						  Snapshot snapshot,
-						  TU_UpdateIndexes *update_indexes)
+						  TU_UpdateIndexes *update_indexes,
+						  TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
+	int			options = TABLE_MODIFY_WAIT;	/* wait for commit */
+
+	/* Fetch old tuple if the relevant slot is provided */
+	if (oldSlot)
+		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
 	result = table_tuple_update(rel, otid, slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
-								true /* wait for commit */ ,
-								&tmfd, &lockmode, update_indexes);
+								options,
+								&tmfd, &lockmode, update_indexes,
+								oldSlot);
 
 	switch (result)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 35eb7180f7e..3309b4ebd2d 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2773,8 +2773,8 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 void
 ExecARDeleteTriggers(EState *estate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *slot,
 					 TransitionCaptureState *transition_capture,
 					 bool is_crosspart_update)
 {
@@ -2783,20 +2783,11 @@ ExecARDeleteTriggers(EState *estate,
 	if ((trigdesc && trigdesc->trig_delete_after_row) ||
 		(transition_capture && transition_capture->tcs_delete_old_table))
 	{
-		TupleTableSlot *slot = ExecGetTriggerOldSlot(estate, relinfo);
-
-		Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
-		if (fdw_trigtuple == NULL)
-			GetTupleForTrigger(estate,
-							   NULL,
-							   relinfo,
-							   tupleid,
-							   LockTupleExclusive,
-							   slot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else
+		/*
+		 * Put the FDW old tuple into the slot.  Otherwise, the caller is
+		 * expected to have already fetched the old tuple into the slot.
+		 */
+		if (fdw_trigtuple != NULL)
 			ExecForceStoreHeapTuple(fdw_trigtuple, slot, false);
 
 		AfterTriggerSaveEvent(estate, relinfo, NULL, NULL,
@@ -3087,18 +3078,17 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
  * Note: 'src_partinfo' and 'dst_partinfo', when non-NULL, refer to the source
  * and destination partitions, respectively, of a cross-partition update of
  * the root partitioned table mentioned in the query, given by 'relinfo'.
- * 'tupleid' in that case refers to the ctid of the "old" tuple in the source
- * partition, and 'newslot' contains the "new" tuple in the destination
- * partition.  This interface allows to support the requirements of
- * ExecCrossPartitionUpdateForeignKey(); is_crosspart_update must be true in
- * that case.
+ * 'oldslot' contains the "old" tuple in the source partition, and 'newslot'
+ * contains the "new" tuple in the destination partition.  This interface
+ * supports the requirements of ExecCrossPartitionUpdateForeignKey();
+ * is_crosspart_update must be true in that case.
  */
 void
 ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 					 ResultRelInfo *src_partinfo,
 					 ResultRelInfo *dst_partinfo,
-					 ItemPointer tupleid,
 					 HeapTuple fdw_trigtuple,
+					 TupleTableSlot *oldslot,
 					 TupleTableSlot *newslot,
 					 List *recheckIndexes,
 					 TransitionCaptureState *transition_capture,
@@ -3117,29 +3107,14 @@ ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 		 * separately for DELETE and INSERT to capture transition table rows.
 		 * In such case, either old tuple or new tuple can be NULL.
 		 */
-		TupleTableSlot *oldslot;
-		ResultRelInfo *tupsrc;
-
 		Assert((src_partinfo != NULL && dst_partinfo != NULL) ||
 			   !is_crosspart_update);
 
-		tupsrc = src_partinfo ? src_partinfo : relinfo;
-		oldslot = ExecGetTriggerOldSlot(estate, tupsrc);
-
-		if (fdw_trigtuple == NULL && ItemPointerIsValid(tupleid))
-			GetTupleForTrigger(estate,
-							   NULL,
-							   tupsrc,
-							   tupleid,
-							   LockTupleExclusive,
-							   oldslot,
-							   NULL,
-							   NULL,
-							   NULL);
-		else if (fdw_trigtuple != NULL)
+		if (fdw_trigtuple != NULL)
+		{
+			Assert(oldslot);
 			ExecForceStoreHeapTuple(fdw_trigtuple, oldslot, false);
-		else
-			ExecClearTuple(oldslot);
+		}
 
 		AfterTriggerSaveEvent(estate, relinfo,
 							  src_partinfo, dst_partinfo,
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index d0a89cd5778..0cad843fb69 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -577,6 +577,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 	{
 		List	   *recheckIndexes = NIL;
 		TU_UpdateIndexes update_indexes;
+		TupleTableSlot *oldSlot = NULL;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -590,8 +591,12 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		if (rel->rd_rel->relispartition)
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_update_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		simple_table_tuple_update(rel, tid, slot, estate->es_snapshot,
-								  &update_indexes);
+								  &update_indexes, oldSlot);
 
 		if (resultRelInfo->ri_NumIndices > 0 && (update_indexes != TU_None))
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
@@ -602,7 +607,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		/* AFTER ROW UPDATE Triggers */
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tid, NULL, slot,
+							 NULL, oldSlot, slot,
 							 recheckIndexes, NULL, false);
 
 		list_free(recheckIndexes);
@@ -636,12 +641,18 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 
 	if (!skip_tuple)
 	{
+		TupleTableSlot *oldSlot = NULL;
+
+		if (resultRelInfo->ri_TrigDesc &&
+			resultRelInfo->ri_TrigDesc->trig_delete_after_row)
+			oldSlot = ExecGetTriggerOldSlot(estate, resultRelInfo);
+
 		/* OK, delete the tuple */
-		simple_table_tuple_delete(rel, tid, estate->es_snapshot);
+		simple_table_tuple_delete(rel, tid, estate->es_snapshot, oldSlot);
 
 		/* AFTER ROW DELETE Triggers */
 		ExecARDeleteTriggers(estate, resultRelInfo,
-							 tid, NULL, NULL, false);
+							 NULL, oldSlot, NULL, false);
 	}
 }
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4abfe82f7fb..79257416426 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -566,6 +566,15 @@ ExecInitInsertProjection(ModifyTableState *mtstate,
 		table_slot_create(resultRelInfo->ri_RelationDesc,
 						  &estate->es_tupleTable);
 
+	/*
+	 * In the ON CONFLICT UPDATE case, we will also need a slot for the old
+	 * tuple, so that the updated tuple can be computed based on it.
+	 */
+	if (node->onConflictAction == ONCONFLICT_UPDATE)
+		resultRelInfo->ri_oldTupleSlot =
+			table_slot_create(resultRelInfo->ri_RelationDesc,
+							  &estate->es_tupleTable);
+
 	/* Build ProjectionInfo if needed (it probably isn't). */
 	if (need_projection)
 	{
@@ -1154,7 +1163,7 @@ ExecInsert(ModifyTableContext *context,
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
 							 NULL,
-							 NULL,
+							 resultRelInfo->ri_oldTupleSlot,
 							 slot,
 							 NULL,
 							 mtstate->mt_transition_capture,
@@ -1334,7 +1343,8 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart)
+			  ItemPointer tupleid, bool changingPart, int options,
+			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
 
@@ -1342,9 +1352,10 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 							  estate->es_output_cid,
 							  estate->es_snapshot,
 							  estate->es_crosscheck_snapshot,
-							  true /* wait for commit */ ,
+							  options,
 							  &context->tmfd,
-							  changingPart);
+							  changingPart,
+							  oldSlot);
 }
 
 /*
@@ -1353,10 +1364,15 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  * Closing steps of tuple deletion; this invokes AFTER FOR EACH ROW triggers,
  * including the UPDATE triggers if the deletion is being done as part of a
  * cross-partition tuple move.
+ *
+ * For regular tables, the old tuple is already fetched into 'slot'.  For
+ * FDWs, the old tuple is given as 'oldtuple' and is stored into 'slot' when
+ * needed.
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, bool changingPart)
+				   ItemPointer tupleid, HeapTuple oldtuple,
+				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	EState	   *estate = context->estate;
@@ -1374,8 +1390,8 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	{
 		ExecARUpdateTriggers(estate, resultRelInfo,
 							 NULL, NULL,
-							 tupleid, oldtuple,
-							 NULL, NULL, mtstate->mt_transition_capture,
+							 oldtuple,
+							 slot, NULL, NULL, mtstate->mt_transition_capture,
 							 false);
 
 		/*
@@ -1386,10 +1402,30 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 
 	/* AFTER ROW DELETE Triggers */
-	ExecARDeleteTriggers(estate, resultRelInfo, tupleid, oldtuple,
+	ExecARDeleteTriggers(estate, resultRelInfo, oldtuple, slot,
 						 ar_delete_trig_tcs, changingPart);
 }
 
+/*
+ * Initializes the tuple slot in a ResultRelInfo for DELETE action.
+ *
+ * We mark 'projectNewInfoValid' even though the projections themselves
+ * are not initialized here.
+ */
+static void
+ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
+						ResultRelInfo *resultRelInfo)
+{
+	EState	   *estate = mtstate->ps.state;
+
+	Assert(!resultRelInfo->ri_projectNewInfoValid);
+
+	resultRelInfo->ri_oldTupleSlot =
+		table_slot_create(resultRelInfo->ri_RelationDesc,
+						  &estate->es_tupleTable);
+	resultRelInfo->ri_projectNewInfoValid = true;
+}
+
 /* ----------------------------------------------------------------
  *		ExecDelete
  *
@@ -1409,7 +1445,8 @@ ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  *		part of an UPDATE of partition-key, then the slot returned by
  *		EvalPlanQual() is passed back using output parameter epqreturnslot.
  *
- *		Returns RETURNING result if any, otherwise NULL.
+ *		Returns RETURNING result if any, otherwise NULL.  The deleted tuple
+ *		is additionally stored into oldslot, independently of that result.
  * ----------------------------------------------------------------
  */
 static TupleTableSlot *
@@ -1417,6 +1454,7 @@ ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid,
 		   HeapTuple oldtuple,
+		   TupleTableSlot *oldslot,
 		   bool processReturning,
 		   bool changingPart,
 		   bool canSetTag,
@@ -1480,6 +1518,15 @@ ExecDelete(ModifyTableContext *context,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		/*
+		 * Specify that we need to lock and fetch the last tuple version for
+		 * EPQ when the transaction isolation level allows it.
+		 */
+		if (!IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * delete the tuple
 		 *
@@ -1490,7 +1537,8 @@ ExecDelete(ModifyTableContext *context,
 		 * transaction-snapshot mode transactions.
 		 */
 ldelete:
-		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart);
+		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
+							   options, oldslot);
 
 		if (tmresult)
 			*tmresult = result;
@@ -1537,7 +1585,6 @@ ldelete:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
 
 					if (IsolationUsesXactSnapshot())
@@ -1546,87 +1593,29 @@ ldelete:
 								 errmsg("could not serialize access due to concurrent update")));
 
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ. The latest tuple is already found
+					 * and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					EvalPlanQualBegin(context->epqstate);
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  LockTupleExclusive, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldslot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
 
-					switch (result)
+					/*
+					 * If requested, skip delete and pass back the updated
+					 * row.
+					 */
+					if (epqreturnslot)
 					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/*
-							 * If requested, skip delete and pass back the
-							 * updated row.
-							 */
-							if (epqreturnslot)
-							{
-								*epqreturnslot = epqslot;
-								return NULL;
-							}
-							else
-								goto ldelete;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously updated by this
-							 * command, ignore the delete, otherwise error
-							 * out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_delete() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be deleted was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						default:
-
-							/*
-							 * TM_Invisible should be impossible because we're
-							 * waiting for updated row versions, and would
-							 * already have errored out if the first version
-							 * is invisible.
-							 *
-							 * TM_Updated should be impossible, because we're
-							 * locking the latest version via
-							 * TUPLE_LOCK_FLAG_FIND_LAST_VERSION.
-							 */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
+						*epqreturnslot = epqslot;
+						return NULL;
 					}
-
-					Assert(false);
-					break;
+					else
+						goto ldelete;
 				}
 
 			case TM_Deleted:
@@ -1660,7 +1649,8 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple, changingPart);
+	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+					   oldslot, changingPart);
 
 	/* Process RETURNING if present and if requested */
 	if (processReturning && resultRelInfo->ri_projectReturning)
@@ -1678,17 +1668,13 @@ ldelete:
 		}
 		else
 		{
+			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
 			if (oldtuple != NULL)
-			{
 				ExecForceStoreHeapTuple(oldtuple, slot, false);
-			}
 			else
-			{
-				if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
-												   SnapshotAny, slot))
-					elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
-			}
+				ExecCopySlot(slot, oldslot);
+			Assert(!TupIsNull(slot));
 		}
 
 		rslot = ExecProcessReturning(resultRelInfo, slot, context->planSlot);
@@ -1788,12 +1774,19 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
+	/*
+	 * Make sure ri_oldTupleSlot is initialized.  ExecDelete() will save the
+	 * old tuple there, so we don't have to re-fetch it afterwards.
+	 */
+	if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+		ExecInitUpdateProjection(mtstate, resultRelInfo);
+
 	/*
 	 * Row movement, part 1.  Delete the tuple, but skip RETURNING processing.
 	 * We want to return rows from INSERT.
 	 */
 	ExecDelete(context, resultRelInfo,
-			   tupleid, oldtuple,
+			   tupleid, oldtuple, resultRelInfo->ri_oldTupleSlot,
 			   false,			/* processReturning */
 			   true,			/* changingPart */
 			   false,			/* canSetTag */
@@ -1834,21 +1827,13 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
 			return true;
 		else
 		{
-			/* Fetch the most recent version of old tuple. */
-			TupleTableSlot *oldSlot;
-
-			/* ... but first, make sure ri_oldTupleSlot is initialized. */
-			if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-				ExecInitUpdateProjection(mtstate, resultRelInfo);
-			oldSlot = resultRelInfo->ri_oldTupleSlot;
-			if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
-											   tupleid,
-											   SnapshotAny,
-											   oldSlot))
-				elog(ERROR, "failed to fetch tuple being updated");
-			/* and project the new tuple to retry the UPDATE with */
+			/*
+			 * ExecDelete already fetched the most recent version of the old
+			 * tuple into resultRelInfo->ri_oldTupleSlot.  So, just project the
+			 * new tuple to retry the UPDATE with.
+			 */
 			*retry_slot = ExecGetUpdateNewTuple(resultRelInfo, epqslot,
-												oldSlot);
+												resultRelInfo->ri_oldTupleSlot);
 			return false;
 		}
 	}
@@ -1967,7 +1952,8 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-			  bool canSetTag, UpdateContext *updateCxt)
+			  bool canSetTag, int options, TupleTableSlot *oldSlot,
+			  UpdateContext *updateCxt)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2059,7 +2045,8 @@ lreplace:
 				ExecCrossPartitionUpdateForeignKey(context,
 												   resultRelInfo,
 												   insert_destrel,
-												   tupleid, slot,
+												   tupleid,
+												   resultRelInfo->ri_oldTupleSlot,
 												   inserted_tuple);
 
 			return TM_Ok;
@@ -2102,9 +2089,10 @@ lreplace:
 								estate->es_output_cid,
 								estate->es_snapshot,
 								estate->es_crosscheck_snapshot,
-								true /* wait for commit */ ,
+								options,
 								&context->tmfd, &updateCxt->lockmode,
-								&updateCxt->updateIndexes);
+								&updateCxt->updateIndexes,
+								oldSlot);
 
 	return result;
 }
@@ -2118,7 +2106,8 @@ lreplace:
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
-				   HeapTuple oldtuple, TupleTableSlot *slot)
+				   HeapTuple oldtuple, TupleTableSlot *slot,
+				   TupleTableSlot *oldslot)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	List	   *recheckIndexes = NIL;
@@ -2134,7 +2123,7 @@ ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
 	/* AFTER ROW UPDATE Triggers */
 	ExecARUpdateTriggers(context->estate, resultRelInfo,
 						 NULL, NULL,
-						 tupleid, oldtuple, slot,
+						 oldtuple, oldslot, slot,
 						 recheckIndexes,
 						 mtstate->operation == CMD_INSERT ?
 						 mtstate->mt_oc_transition_capture :
@@ -2223,7 +2212,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 	/* Perform the root table's triggers. */
 	ExecARUpdateTriggers(context->estate,
 						 rootRelInfo, sourcePartInfo, destPartInfo,
-						 tupleid, NULL, newslot, NIL, NULL, true);
+						 NULL, oldslot, newslot, NIL, NULL, true);
 }
 
 /* ----------------------------------------------------------------
@@ -2246,6 +2235,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  *		no relevant triggers.
  *
  *		slot contains the new tuple value to be stored.
+ *		oldslot is the slot to store the old tuple.
  *		planSlot is the output of the ModifyTable's subplan; we use it
  *		to access values from other input tables (for RETURNING),
  *		row-ID junk columns, etc.
@@ -2256,7 +2246,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
-		   bool canSetTag)
+		   TupleTableSlot *oldslot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -2309,6 +2299,16 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
+		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+
+		/*
+		 * Specify that we need to lock and fetch the last tuple version for
+		 * EPQ when the transaction isolation level allows it and the tuple
+		 * isn't locked already.
+		 */
+		if (!locked && !IsolationUsesXactSnapshot())
+			options |= TABLE_MODIFY_LOCK_UPDATED;
+
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
 		 * must loop back here to try again.  (We don't need to redo triggers,
@@ -2318,7 +2318,7 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 		 */
 redo_act:
 		result = ExecUpdateAct(context, resultRelInfo, tupleid, oldtuple, slot,
-							   canSetTag, &updateCxt);
+							   canSetTag, options, oldslot, &updateCxt);
 
 		/*
 		 * If ExecUpdateAct reports that a cross-partition update was done,
@@ -2369,88 +2369,32 @@ redo_act:
 
 			case TM_Updated:
 				{
-					TupleTableSlot *inputslot;
 					TupleTableSlot *epqslot;
-					TupleTableSlot *oldSlot;
 
 					if (IsolationUsesXactSnapshot())
 						ereport(ERROR,
 								(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 								 errmsg("could not serialize access due to concurrent update")));
 
+					/* Shouldn't get here if the tuple was previously locked */
+					Assert(!locked);
+
 					/*
-					 * Already know that we're going to need to do EPQ, so
-					 * fetch tuple directly into the right slot.
+					 * We need to do EPQ. The latest tuple is already found
+					 * and locked as a result of TABLE_MODIFY_LOCK_UPDATED.
 					 */
-					inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
-												 resultRelInfo->ri_RangeTableIndex);
-
-					result = table_tuple_lock(resultRelationDesc, tupleid,
-											  estate->es_snapshot,
-											  inputslot, estate->es_output_cid,
-											  updateCxt.lockmode, LockWaitBlock,
-											  TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
-											  &context->tmfd);
-
-					switch (result)
-					{
-						case TM_Ok:
-							Assert(context->tmfd.traversed);
-
-							epqslot = EvalPlanQual(context->epqstate,
-												   resultRelationDesc,
-												   resultRelInfo->ri_RangeTableIndex,
-												   inputslot);
-							if (TupIsNull(epqslot))
-								/* Tuple not passing quals anymore, exiting... */
-								return NULL;
-
-							/* Make sure ri_oldTupleSlot is initialized. */
-							if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
-								ExecInitUpdateProjection(context->mtstate,
-														 resultRelInfo);
-
-							/* Fetch the most recent version of old tuple. */
-							oldSlot = resultRelInfo->ri_oldTupleSlot;
-							if (!table_tuple_fetch_row_version(resultRelationDesc,
-															   tupleid,
-															   SnapshotAny,
-															   oldSlot))
-								elog(ERROR, "failed to fetch tuple being updated");
-							slot = ExecGetUpdateNewTuple(resultRelInfo,
-														 epqslot, oldSlot);
-							goto redo_act;
-
-						case TM_Deleted:
-							/* tuple already deleted; nothing to do */
-							return NULL;
-
-						case TM_SelfModified:
-
-							/*
-							 * This can be reached when following an update
-							 * chain from a tuple updated by another session,
-							 * reaching a tuple that was already updated in
-							 * this transaction. If previously modified by
-							 * this command, ignore the redundant update,
-							 * otherwise error out.
-							 *
-							 * See also TM_SelfModified response to
-							 * table_tuple_update() above.
-							 */
-							if (context->tmfd.cmax != estate->es_output_cid)
-								ereport(ERROR,
-										(errcode(ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION),
-										 errmsg("tuple to be updated was already modified by an operation triggered by the current command"),
-										 errhint("Consider using an AFTER trigger instead of a BEFORE trigger to propagate changes to other rows.")));
-							return NULL;
-
-						default:
-							/* see table_tuple_lock call in ExecDelete() */
-							elog(ERROR, "unexpected table_tuple_lock status: %u",
-								 result);
-							return NULL;
-					}
+					Assert(context->tmfd.traversed);
+					epqslot = EvalPlanQual(context->epqstate,
+										   resultRelationDesc,
+										   resultRelInfo->ri_RangeTableIndex,
+										   oldslot);
+					if (TupIsNull(epqslot))
+						/* Tuple not passing quals anymore, exiting... */
+						return NULL;
+					slot = ExecGetUpdateNewTuple(resultRelInfo,
+												 epqslot,
+												 oldslot);
+					goto redo_act;
 				}
 
 				break;
@@ -2474,7 +2418,7 @@ redo_act:
 		(estate->es_processed)++;
 
 	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
-					   slot);
+					   slot, oldslot);
 
 	/* Process RETURNING if present */
 	if (resultRelInfo->ri_projectReturning)
@@ -2692,7 +2636,8 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	*returning = ExecUpdate(context, resultRelInfo,
 							conflictTid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
-							canSetTag);
+							existing,
+							canSetTag, true);
 
 	/*
 	 * Clear out existing tuple, as there might not be another conflict among
@@ -2934,6 +2879,7 @@ lmerge_matched:
 				{
 					result = ExecUpdateAct(context, resultRelInfo, tupleid,
 										   NULL, newslot, canSetTag,
+										   TABLE_MODIFY_WAIT, NULL,
 										   &updateCxt);
 
 					/*
@@ -2956,7 +2902,8 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot);
+									   tupleid, NULL, newslot,
+									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
 				break;
@@ -2987,12 +2934,12 @@ lmerge_matched:
 				}
 				else
 					result = ExecDeleteAct(context, resultRelInfo, tupleid,
-										   false);
+										   false, TABLE_MODIFY_WAIT, NULL);
 
 				if (result == TM_Ok)
 				{
 					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
-									   false);
+									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
 				break;
@@ -4006,12 +3953,18 @@ ExecModifyTable(PlanState *pstate)
 
 				/* Now apply the update. */
 				slot = ExecUpdate(&context, resultRelInfo, tupleid, oldtuple,
-								  slot, node->canSetTag);
+								  slot, resultRelInfo->ri_oldTupleSlot,
+								  node->canSetTag, false);
 				break;
 
 			case CMD_DELETE:
+				/* Initialize slot for DELETE to fetch the old tuple */
+				if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
+					ExecInitDeleteTupleSlot(node, resultRelInfo);
+
 				slot = ExecDelete(&context, resultRelInfo, tupleid, oldtuple,
-								  true, false, node->canSetTag, NULL, NULL, NULL);
+								  resultRelInfo->ri_oldTupleSlot, true, false,
+								  node->canSetTag, NULL, NULL, NULL);
 				break;
 
 			case CMD_MERGE:
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..45954b8003d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -276,19 +276,22 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
-							 struct TM_FailureData *tmfd, bool changingPart);
+							 CommandId cid, Snapshot crosscheck, int options,
+							 struct TM_FailureData *tmfd, bool changingPart,
+							 TupleTableSlot *oldSlot);
 extern void heap_finish_speculative(Relation relation, ItemPointer tid);
 extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
 							 HeapTuple newtup,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes);
-extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
-								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
-								 bool follow_updates,
-								 Buffer *buffer, struct TM_FailureData *tmfd);
+							 TU_UpdateIndexes *update_indexes,
+							 TupleTableSlot *oldSlot);
+extern TM_Result heap_lock_tuple(Relation relation, ItemPointer tid,
+								 TupleTableSlot *slot,
+								 CommandId cid, LockTupleMode mode,
+								 LockWaitPolicy wait_policy, bool follow_updates,
+								 struct TM_FailureData *tmfd);
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8249b37bbf1..b35a22506c0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -259,6 +259,15 @@ typedef struct TM_IndexDeleteOp
 /* Follow update chain and lock latest version of tuple */
 #define TUPLE_LOCK_FLAG_FIND_LAST_VERSION		(1 << 1)
 
+/* "options" flag bits for table_tuple_update and table_tuple_delete */
+
+/* Wait for any conflicting update to commit/abort */
+#define TABLE_MODIFY_WAIT			0x0001
+/* Fetch the existing tuple into a dedicated slot */
+#define TABLE_MODIFY_FETCH_OLD_TUPLE 0x0002
+/* On concurrent update, follow the update chain and lock latest version of tuple */
+#define TABLE_MODIFY_LOCK_UPDATED	0x0004
+
 
 /* Typedef for callback function for table_index_build_scan */
 typedef void (*IndexBuildCallback) (Relation index,
@@ -528,9 +537,10 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
-								 bool changingPart);
+								 bool changingPart,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
@@ -539,10 +549,11 @@ typedef struct TableAmRoutine
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
-								 bool wait,
+								 int options,
 								 TM_FailureData *tmfd,
 								 LockTupleMode *lockmode,
-								 TU_UpdateIndexes *update_indexes);
+								 TU_UpdateIndexes *update_indexes,
+								 TupleTableSlot *oldSlot);
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
@@ -1452,7 +1463,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
 }
 
 /*
- * Delete a tuple.
+ * Delete a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_delete instead.
@@ -1463,11 +1474,21 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple is
+ *		fetched into oldSlot when the deletion is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	changingPart - true iff the tuple is being moved to another partition
  *		table due to an update of the partition key. Otherwise, false.
+ *	oldSlot - slot to save the deleted or locked tuple. Can be NULL if neither
+ *		TABLE_MODIFY_FETCH_OLD_TUPLE nor TABLE_MODIFY_LOCK_UPDATED is
+ *		specified.
  *
  * Normal, successful return value is TM_Ok, which means we did actually
  * delete it.  Failure return codes are TM_SelfModified, TM_Updated, and
@@ -1479,16 +1500,18 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  */
 static inline TM_Result
 table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
-				   Snapshot snapshot, Snapshot crosscheck, bool wait,
-				   TM_FailureData *tmfd, bool changingPart)
+				   Snapshot snapshot, Snapshot crosscheck, int options,
+				   TM_FailureData *tmfd, bool changingPart,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_delete(rel, tid, cid,
 										 snapshot, crosscheck,
-										 wait, tmfd, changingPart);
+										 options, tmfd, changingPart,
+										 oldSlot);
 }
 
 /*
- * Update a tuple.
+ * Update a tuple (and optionally lock the last tuple version).
  *
  * NB: do not call this directly unless you are prepared to deal with
  * concurrent-update conditions.  Use simple_table_tuple_update instead.
@@ -1500,13 +1523,23 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
  *	crosscheck - if not InvalidSnapshot, also check old tuple against this
- *	wait - true if should wait for any conflicting update to commit/abort
+ *	options:
+ *		If TABLE_MODIFY_WAIT, wait for any conflicting update to commit/abort.
+ *		If TABLE_MODIFY_FETCH_OLD_TUPLE option is given, the existing tuple is
+ *		fetched into oldSlot when the update is successful.
+ *		If TABLE_MODIFY_LOCK_UPDATED option is given and the tuple is
+ *		concurrently updated, then the last tuple version is locked and fetched
+ *		into oldSlot.
+ *
  * Output parameters:
  *	tmfd - filled in failure cases (see below)
  *	lockmode - filled with lock mode acquired on tuple
  *  update_indexes - in success cases this is set to true if new index entries
  *		are required for this tuple
- *
+ *	oldSlot - slot to save the old or locked tuple. Can be NULL if neither
+ *		TABLE_MODIFY_FETCH_OLD_TUPLE nor TABLE_MODIFY_LOCK_UPDATED is
+ *		specified.
+ *
  * Normal, successful return value is TM_Ok, which means we did actually
  * update it.  Failure return codes are TM_SelfModified, TM_Updated, and
  * TM_BeingModified (the last only possible if wait == false).
@@ -1524,13 +1557,15 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
 static inline TM_Result
 table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
-				   bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
-				   TU_UpdateIndexes *update_indexes)
+				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
+				   TU_UpdateIndexes *update_indexes,
+				   TupleTableSlot *oldSlot)
 {
 	return rel->rd_tableam->tuple_update(rel, otid, slot,
 										 cid, snapshot, crosscheck,
-										 wait, tmfd,
-										 lockmode, update_indexes);
+										 options, tmfd,
+										 lockmode, update_indexes,
+										 oldSlot);
 }
 
 /*
@@ -2046,10 +2081,12 @@ table_scan_sample_next_tuple(TableScanDesc scan,
 
 extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
-									  Snapshot snapshot);
+									  Snapshot snapshot,
+									  TupleTableSlot *oldSlot);
 extern void simple_table_tuple_update(Relation rel, ItemPointer otid,
 									  TupleTableSlot *slot, Snapshot snapshot,
-									  TU_UpdateIndexes *update_indexes);
+									  TU_UpdateIndexes *update_indexes,
+									  TupleTableSlot *oldSlot);
 
 
 /* ----------------------------------------------------------------------------
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index 8a5a9fe6422..cb968d03ecd 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -216,8 +216,8 @@ extern bool ExecBRDeleteTriggers(EState *estate,
 								 TM_FailureData *tmfd);
 extern void ExecARDeleteTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *slot,
 								 TransitionCaptureState *transition_capture,
 								 bool is_crosspart_update);
 extern bool ExecIRDeleteTriggers(EState *estate,
@@ -240,8 +240,8 @@ extern void ExecARUpdateTriggers(EState *estate,
 								 ResultRelInfo *relinfo,
 								 ResultRelInfo *src_partinfo,
 								 ResultRelInfo *dst_partinfo,
-								 ItemPointer tupleid,
 								 HeapTuple fdw_trigtuple,
+								 TupleTableSlot *oldslot,
 								 TupleTableSlot *newslot,
 								 List *recheckIndexes,
 								 TransitionCaptureState *transition_capture,
-- 
2.39.3 (Apple Git-145)

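A minimal sketch of the calling convention the patch above introduces
(illustration only, not part of the patch; the helper name
exec_delete_fetching_old is made up, while the option bits, the oldSlot
contract, and the functions called are taken from the patch):

static TM_Result
exec_delete_fetching_old(Relation rel, ItemPointer tid, EState *estate,
						 TupleTableSlot *oldSlot, TM_FailureData *tmfd)
{
	int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;

	/* Follow and lock the latest row version for EPQ in READ COMMITTED */
	if (!IsolationUsesXactSnapshot())
		options |= TABLE_MODIFY_LOCK_UPDATED;

	/*
	 * On TM_Ok the deleted tuple is left in oldSlot; on TM_Updated with
	 * TABLE_MODIFY_LOCK_UPDATED the locked latest version is left there,
	 * ready to be handed to EvalPlanQual().
	 */
	return table_tuple_delete(rel, tid, estate->es_output_cid,
							  estate->es_snapshot,
							  estate->es_crosscheck_snapshot,
							  options, tmfd,
							  false /* changingPart */ ,
							  oldSlot);
}

The same pattern applies to table_tuple_update(): instead of the old
table_tuple_lock() retry dance, the caller only checks tmfd.traversed and runs
EvalPlanQual() on oldSlot, which is what the reworked ExecDelete() and
ExecUpdate() above do.
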
0009-Let-table-AM-override-reloptions-for-indexes-buil-v4.patchapplication/octet-stream; name=0009-Let-table-AM-override-reloptions-for-indexes-buil-v4.patchDownload
From d622cb0858c543411a5349b20c51a19d6659079c Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Mar 2024 00:53:05 +0200
Subject: [PATCH 09/13] Let table AM override reloptions for indexes built on
 its tables

---
 src/backend/access/common/reloptions.c   |  3 ++-
 src/backend/access/heap/heapam_handler.c |  8 ++++++++
 src/backend/commands/indexcmds.c         |  3 ++-
 src/backend/commands/tablecmds.c         |  9 ++++++++-
 src/backend/utils/cache/relcache.c       | 24 ++++++++++++++++++++++--
 src/include/access/tableam.h             | 23 +++++++++++++++++++++++
 6 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388bb..00088240cdd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1411,7 +1411,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0b2098f41fc..3a7901ae92e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2717,6 +2717,13 @@ heapam_reloptions(char relkind, Datum reloptions, bool validate)
 	return NULL;
 }
 
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -3222,6 +3229,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
 	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7b20d103c86..7299ebbe9f3 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -899,7 +899,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d2ef8a0c383..fa8eb55b189 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15328,7 +15328,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3babfc804a7..b1a4b36aa14 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -478,15 +478,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
 			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* look up the table AM of the index's base relation */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 707946dfde6..1cd6a92db6d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -742,6 +743,13 @@ typedef struct TableAmRoutine
 	 */
 	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
 
+	/*
+	 * Parse table AM-specific index options.  Useful for a table AM to define
+	 * new index options or override existing ones.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1967,6 +1975,21 @@ tableam_reloptions(const TableAmRoutine *tableam, char relkind,
 extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
 							   bool validate);
 
+/*
+ * Parse index options.  Gives table AM a chance to override index-specific
+ * options defined in 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
-- 
2.39.3 (Apple Git-145)

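For reference, here is a rough sketch (not from the patch) of the new callback
as a custom table AM might wire it up; my_am_indexoptions and my_am_methods are
hypothetical names, and the fallback to index_reloptions() mirrors
heapam_indexoptions() above:

static bytea *
my_am_indexoptions(amoptions_function amoptions, char relkind,
				   Datum reloptions, bool validate)
{
	/*
	 * A real AM could parse and validate its own index options here; by
	 * default, defer to the index AM's reloptions parser.
	 */
	return index_reloptions(amoptions, reloptions, validate);
}

static const TableAmRoutine my_am_methods = {
	.type = T_TableAmRoutine,
	/* ... all other callbacks ... */
	.indexoptions = my_am_indexoptions,
};

Callers reach the callback through tableam_indexoptions(), which falls back to
plain index_reloptions() when no table AM routine is available.
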
0008-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v4.patchapplication/octet-stream; name=0008-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v4.patchDownload
From d665ed5f4ef332e24aec5eada0bcbf716d62c791 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH 08/13] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs have to implement INSERT ... ON CONFLICT ... with
speculative tokens; the only room for customization is a table AM's own
implementation of those tokens via the tuple_insert_speculative() and
tuple_complete_speculative() API functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a
single tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function provides clear semantics, allowing table AMs to
implement INSERT ... ON CONFLICT ... differently.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)

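Before the diff itself, a short sketch (illustration only, not part of the
patch) of the calling convention the executor relies on; the variable names
follow the ExecInsert() changes below:

	/*
	 * A NULL result means a committed conflicting tuple was found.  For
	 * ON CONFLICT DO UPDATE the conflicting row has been locked into
	 * "existing" (an empty "existing" asks the caller to retry from
	 * scratch); for DO NOTHING the visibility check has already been done.
	 * A non-NULL result is the slot holding the inserted tuple.
	 */
	inserted = table_tuple_insert_with_arbiter(resultRelInfo, slot,
											   estate->es_output_cid,
											   0 /* options */ ,
											   NULL /* bistate */ ,
											   arbiterIndexes, estate,
											   lockmode, existing,
											   returningSlot);
	if (inserted == NULL)
	{
		/* conflict: run the ON CONFLICT DO UPDATE / DO NOTHING path */
	}
	else
		slot = inserted;	/* proceed as with a plain insert */
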
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 14e0952d7f2..0b2098f41fc 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -309,6 +309,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2917,8 +3195,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 34ff3e38333..d9fc87665c7 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -70,8 +70,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d1917f2fea7..8e1c8f697c6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -129,7 +129,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -265,66 +264,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1015,12 +954,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1037,23 +983,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1076,57 +1027,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2441,144 +2348,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 15bca56f054..707946dfde6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -514,19 +515,16 @@ typedef struct TableAmRoutine
 									 CommandId cid, int options,
 									 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1397,36 +1395,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into a table, checking the given arbiter indexes
+ * for uniqueness conflicts.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it additionally
+ * takes `arbiterIndexes`, the list of OIDs of the arbiter indexes.
+ *
+ * If the tuple doesn't violate uniqueness on any arbiter index, it should be
+ * inserted and the slot containing the inserted tuple is returned.
+ *
+ * If the tuple violates uniqueness on some arbiter index, this function
+ * returns NULL and doesn't insert the tuple.  Also, if 'lockedSlot' is
+ * provided, the conflicting tuple is locked in `lockmode` and placed into
+ * `lockedSlot`.
+ *
+ * Executor state `estate` is passed to this method so that index tuples can
+ * be computed.  Temporary tuple table slot `tempSlot` is passed for holding
+ * a potentially conflicting tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

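For illustration only (this sketch is not part of the patch set): a caller-side
view of the new table_tuple_insert_with_arbiter() contract described above.
All local names here (example_insert_on_conflict, existingSlot, tempSlot) are
invented for the example.

static void
example_insert_on_conflict(ResultRelInfo *resultRelInfo, EState *estate,
						   TupleTableSlot *slot, List *arbiterIndexes,
						   LockTupleMode lockmode,
						   TupleTableSlot *existingSlot,
						   TupleTableSlot *tempSlot)
{
	TupleTableSlot *inserted;

	/* Ask the table AM to insert, checking the arbiter indexes itself */
	inserted = table_tuple_insert_with_arbiter(resultRelInfo, slot,
											   estate->es_output_cid,
											   0, NULL /* bistate */,
											   arbiterIndexes, estate,
											   lockmode,
											   existingSlot, tempSlot);
	if (inserted == NULL)
	{
		/*
		 * A conflict was found.  The conflicting tuple is locked and stored
		 * in 'existingSlot'; the ON CONFLICT action would be applied here.
		 */
	}
	else
	{
		/* No conflict: 'inserted' is the AM's native slot for the new row. */
	}
}

The point of the design is that conflict detection, insertion, and locking of
the conflicting tuple all happen inside the table AM, so the executor no
longer needs to drive speculative tokens itself.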
0007-Custom-reloptions-for-table-AM-v4.patch (application/octet-stream)
From 1fb3470d80609405ad6c1d60b6311a9a10ee4938 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH 07/13] Custom reloptions for table AM

Let a table AM define custom reloptions for its tables.
---
 src/backend/access/common/reloptions.c   |  6 ++-
 src/backend/access/heap/heapam_handler.c | 13 ++++++
 src/backend/access/table/tableamapi.c    | 20 ++++++++++
 src/backend/commands/tablecmds.c         | 51 ++++++++++++++----------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       |  6 ++-
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 29 ++++++++++++++
 8 files changed, 106 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..963995388bb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1377,7 +1378,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1400,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5d125aad6bc..14e0952d7f2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -25,6 +25,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2427,6 +2428,17 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	if (relkind == RELKIND_RELATION ||
+		relkind == RELKIND_TOASTVALUE ||
+		relkind == RELKIND_MATVIEW)
+		return heap_reloptions(relkind, reloptions, validate);
+
+	return NULL;
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2932,6 +2944,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..34ff3e38333 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,24 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3ed0618b4e6..d2ef8a0c383 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -705,6 +705,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	LOCKMODE	parentLockmode;
 	const char *accessMethod = NULL;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -844,6 +845,26 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * If the statement hasn't specified an access method, but we're defining
+	 * a type of relation that needs one, use the default.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		accessMethod = stmt->accessMethod;
+
+		if (partitioned)
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("specifying a table access method is not supported on a partitioned table")));
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind))
+		accessMethod = default_table_access_method;
+
+	/* look up the access method, verify it is for a table */
+	if (accessMethod != NULL)
+		accessMethodId = get_table_am_oid(accessMethod, false);
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -852,6 +873,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -860,6 +887,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -951,26 +979,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * If the statement hasn't specified an access method, but we're defining
-	 * a type of relation that needs one, use the default.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		accessMethod = stmt->accessMethod;
-
-		if (partitioned)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("specifying a table access method is not supported on a partitioned table")));
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind))
-		accessMethod = default_table_access_method;
-
-	/* look up the access method, verify it is for a table */
-	if (accessMethod != NULL)
-		accessMethodId = get_table_am_oid(accessMethod, false);
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15309,7 +15317,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 71e8a6f2584..d1d76016ab4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2661,7 +2661,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6d98bdfba06..3babfc804a7 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -465,6 +466,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -479,6 +481,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -494,7 +497,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270a..8ddc75df287 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f3c7f865ef..15bca56f054 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -739,6 +739,11 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * Parse table AM-specific table options.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1935,6 +1940,29 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for the given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of a particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2112,6 +2140,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
-- 
2.39.3 (Apple Git-145)

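To illustrate what the new reloptions callback enables for an out-of-core
table AM (illustrative only: my_am_reloptions, MyAmOptions, and the
"compress_level" option are invented for this example and exist nowhere in
the patch set):

#include "postgres.h"

#include "access/reloptions.h"
#include "catalog/pg_class.h"

/* relopt_kind obtained from add_reloption_kind(), e.g. at module load time */
static relopt_kind my_am_relopt_kind;

typedef struct MyAmOptions
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	int			compress_level;	/* invented AM-specific option */
} MyAmOptions;

/* Candidate for the new TableAmRoutine.reloptions callback */
static bytea *
my_am_reloptions(char relkind, Datum reloptions, bool validate)
{
	static const relopt_parse_elt tab[] = {
		{"compress_level", RELOPT_TYPE_INT,
		 offsetof(MyAmOptions, compress_level)},
	};

	if (relkind != RELKIND_RELATION &&
		relkind != RELKIND_TOASTVALUE &&
		relkind != RELKIND_MATVIEW)
		return NULL;

	return (bytea *) build_reloptions(reloptions, validate,
									  my_am_relopt_kind,
									  sizeof(MyAmOptions),
									  tab, lengthof(tab));
}

The AM would register my_am_relopt_kind with add_reloption_kind() and the
option itself with add_int_reloption() at library load time, and point
TableAmRoutine.reloptions at my_am_reloptions().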
0006-Generalize-relation-analyze-in-table-AM-interface-v4.patch (application/octet-stream)
From 32d31f362b00bd3be143a33d2414691d0910e056 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 8 Jun 2023 04:20:29 +0300
Subject: [PATCH 06/13] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table, written
in acquire_sample_rows().  A custom table AM can only redefine the way to get the
next block/tuple by implementing the scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) in the relation_analyze() API function.
---
 src/backend/access/heap/heapam_handler.c | 289 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           | 288 +---------------------
 src/include/access/tableam.h             |  92 ++------
 src/include/commands/vacuum.h            |   5 +
 src/include/foreign/fdwapi.h             |   6 +-
 6 files changed, 320 insertions(+), 362 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..5d125aad6bc 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -19,6 +19,8 @@
  */
 #include "postgres.h"
 
+#include <math.h>
+
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/heaptoast.h"
@@ -44,6 +46,8 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
+#include "utils/spccache.h"
 
 static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   Snapshot snapshot, TupleTableSlot *slot,
@@ -1220,6 +1224,288 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * Comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b, void *arg)
+{
+	HeapTuple	ha = *(const HeapTuple *) a;
+	HeapTuple	hb = *(const HeapTuple *) b;
+	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+	if (ba < bb)
+		return -1;
+	if (ba > bb)
+		return 1;
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+static BufferAccessStrategy analyze_bstrategy;
+
+/*
+ * heapam_acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * Selected rows are returned in the caller-allocated array rows[], which
+ * must have at least targrows entries.
+ * The actual number of rows selected is returned as the function result.
+ * We also estimate the total numbers of live and dead rows in the table,
+ * and return them into *totalrows and *totaldeadrows, respectively.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ *
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+heapam_acquire_sample_rows(Relation onerel, int elevel,
+						   HeapTuple *rows, int targrows,
+						   double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstoskip = -1;	/* -1 means not set yet */
+	uint32		randseed;		/* Seed for block sampler(s) */
+	BlockNumber totalblocks;
+	TransactionId OldestXmin;
+	BlockSamplerData bs;
+	ReservoirStateData rstate;
+	TupleTableSlot *slot;
+	TableScanDesc scan;
+	BlockNumber nblocks;
+	BlockNumber blksdone = 0;
+#ifdef USE_PREFETCH
+	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
+	BlockSamplerData prefetch_bs;
+#endif
+
+	Assert(targrows > 0);
+
+	totalblocks = RelationGetNumberOfBlocks(onerel);
+
+	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
+
+	/* Prepare for sampling block numbers */
+	randseed = pg_prng_uint32(&pg_global_prng_state);
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
+
+#ifdef USE_PREFETCH
+	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
+	/* Create another BlockSampler, using the same seed, for prefetching */
+	if (prefetch_maximum)
+		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+#endif
+
+	/* Report sampling block numbers */
+	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
+								 nblocks);
+
+	/* Prepare for sampling rows */
+	reservoir_init_selection_state(&rstate, targrows);
+
+	scan = table_beginscan_analyze(onerel);
+	slot = table_slot_create(onerel, NULL);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * If we are doing prefetching, then go ahead and tell the kernel about
+	 * the first set of pages we are going to want.  This also moves our
+	 * iterator out ahead of the main one being used, where we will keep it so
+	 * that we're always pre-fetching out prefetch_maximum number of blocks
+	 * ahead.
+	 */
+	if (prefetch_maximum)
+	{
+		for (int i = 0; i < prefetch_maximum; i++)
+		{
+			BlockNumber prefetch_block;
+
+			if (!BlockSampler_HasMore(&prefetch_bs))
+				break;
+
+			prefetch_block = BlockSampler_Next(&prefetch_bs);
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+		}
+	}
+#endif
+
+	/* Outer loop over blocks to sample */
+	while (BlockSampler_HasMore(&bs))
+	{
+		bool		block_accepted;
+		BlockNumber targblock = BlockSampler_Next(&bs);
+#ifdef USE_PREFETCH
+		BlockNumber prefetch_targblock = InvalidBlockNumber;
+
+		/*
+		 * Make sure that every time the main BlockSampler is moved forward
+		 * that our prefetch BlockSampler also gets moved forward, so that we
+		 * always stay out ahead.
+		 */
+		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
+			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+#endif
+
+		vacuum_delay_point();
+
+		block_accepted = heapam_scan_analyze_next_block(scan, targblock, analyze_bstrategy);
+
+#ifdef USE_PREFETCH
+
+		/*
+		 * When pre-fetching, after we get a block, tell the kernel about the
+		 * next one we will want, if there's any left.
+		 *
+		 * We want to do this even if the table_scan_analyze_next_block() call
+		 * above decides against analyzing the block it picked.
+		 */
+		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+#endif
+
+		/*
+		 * Don't analyze if table_scan_analyze_next_block() indicated this
+		 * block is unsuitable for analyzing.
+		 */
+		if (!block_accepted)
+			continue;
+
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		{
+			/*
+			 * The first targrows sample rows are simply copied into the
+			 * reservoir. Then we start replacing tuples in the sample until
+			 * we reach the end of the relation.  This algorithm is from Jeff
+			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
+			 * works by repeatedly computing the number of tuples to skip
+			 * before selecting a tuple, which replaces a randomly chosen
+			 * element of the reservoir (current set of tuples).  At all times
+			 * the reservoir is a true random sample of the tuples we've
+			 * passed over so far, so when we fall off the end of the relation
+			 * we're done.
+			 */
+			if (numrows < targrows)
+				rows[numrows++] = ExecCopySlotHeapTuple(slot);
+			else
+			{
+				/*
+				 * t in Vitter's paper is the number of records already
+				 * processed.  If we need to compute a new S value, we must
+				 * use the not-yet-incremented value of samplerows as t.
+				 */
+				if (rowstoskip < 0)
+					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+				if (rowstoskip <= 0)
+				{
+					/*
+					 * Found a suitable tuple, so save it, replacing one old
+					 * tuple at random
+					 */
+					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
+
+					Assert(k >= 0 && k < targrows);
+					heap_freetuple(rows[k]);
+					rows[k] = ExecCopySlotHeapTuple(slot);
+				}
+
+				rowstoskip -= 1;
+			}
+
+			samplerows += 1;
+		}
+
+		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
+									 ++blksdone);
+	}
+
+	ExecDropSingleTupleTableSlot(slot);
+	table_endscan(scan);
+
+	/*
+	 * If we didn't find as many tuples as we wanted then we're done. No sort
+	 * is needed, since they're already in order.
+	 *
+	 * Otherwise we need to sort the collected tuples by position
+	 * (itempointer). It's not worth worrying about corner cases where the
+	 * tuples are already sorted.
+	 */
+	if (numrows == targrows)
+		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
+							compare_rows, NULL);
+
+	/*
+	 * Estimate total numbers of live and dead rows in relation, extrapolating
+	 * on the assumption that the average tuple density in pages we didn't
+	 * scan is the same as in the pages we did scan.  Since what we scanned is
+	 * a random sample of the pages in the relation, this should be a good
+	 * assumption.
+	 */
+	if (bs.m > 0)
+	{
+		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+	}
+	else
+	{
+		*totalrows = 0.0;
+		*totaldeadrows = 0.0;
+	}
+
+	/*
+	 * Emit some interesting relation info
+	 */
+	ereport(elevel,
+			(errmsg("\"%s\": scanned %d of %u pages, "
+					"containing %.0f live rows and %.0f dead rows; "
+					"%d rows in sample, %.0f estimated total rows",
+					RelationGetRelationName(onerel),
+					bs.m, totalblocks,
+					liverows, deadrows,
+					numrows, *totalrows)));
+
+	return numrows;
+}
+
+static inline void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = heapam_acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	analyze_bstrategy = bstrategy;
+}
+
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -2637,10 +2923,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index ce637a5a5d9..55b8caeadf2 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,8 +81,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..659f69ef270 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -87,10 +87,6 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
-static int	acquire_sample_rows(Relation onerel, int elevel,
-								HeapTuple *rows, int targrows,
-								double *totalrows, double *totaldeadrows);
-static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows, double *totaldeadrows);
@@ -190,10 +186,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1102,277 +1097,6 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
-/*
- * acquire_sample_rows -- acquire a random sample of rows from the table
- *
- * Selected rows are returned in the caller-allocated array rows[], which
- * must have at least targrows entries.
- * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
- * and return them into *totalrows and *totaldeadrows, respectively.
- *
- * The returned list of tuples is in order by physical position in the table.
- * (We will rely on this later to derive correlation estimates.)
- *
- * As of May 2004 we use a new two-stage method:  Stage one selects up
- * to targrows random blocks (or all blocks, if there aren't so many).
- * Stage two scans these blocks and uses the Vitter algorithm to create
- * a random sample of targrows rows (or less, if there are less in the
- * sample of blocks).  The two stages are executed simultaneously: each
- * block is processed as soon as stage one returns its number and while
- * the rows are read stage two controls which ones are to be inserted
- * into the sample.
- *
- * Although every row has an equal chance of ending up in the final
- * sample, this sampling method is not perfect: not every possible
- * sample has an equal chance of being selected.  For large relations
- * the number of different blocks represented by the sample tends to be
- * too small.  We can live with that for now.  Improvements are welcome.
- *
- * An important property of this sampling method is that because we do
- * look at a statistically unbiased set of blocks, we should get
- * unbiased estimates of the average numbers of live and dead rows per
- * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
- */
-static int
-acquire_sample_rows(Relation onerel, int elevel,
-					HeapTuple *rows, int targrows,
-					double *totalrows, double *totaldeadrows)
-{
-	int			numrows = 0;	/* # rows now in reservoir */
-	double		samplerows = 0; /* total # rows collected */
-	double		liverows = 0;	/* # live rows seen */
-	double		deadrows = 0;	/* # dead rows seen */
-	double		rowstoskip = -1;	/* -1 means not set yet */
-	uint32		randseed;		/* Seed for block sampler(s) */
-	BlockNumber totalblocks;
-	TransactionId OldestXmin;
-	BlockSamplerData bs;
-	ReservoirStateData rstate;
-	TupleTableSlot *slot;
-	TableScanDesc scan;
-	BlockNumber nblocks;
-	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
-
-	Assert(targrows > 0);
-
-	totalblocks = RelationGetNumberOfBlocks(onerel);
-
-	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
-
-	/* Prepare for sampling block numbers */
-	randseed = pg_prng_uint32(&pg_global_prng_state);
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
-
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
-	/* Report sampling block numbers */
-	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
-								 nblocks);
-
-	/* Prepare for sampling rows */
-	reservoir_init_selection_state(&rstate, targrows);
-
-	scan = table_beginscan_analyze(onerel);
-	slot = table_slot_create(onerel, NULL);
-
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
-
-	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
-	{
-		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
-		vacuum_delay_point();
-
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
-		{
-			/*
-			 * The first targrows sample rows are simply copied into the
-			 * reservoir. Then we start replacing tuples in the sample until
-			 * we reach the end of the relation.  This algorithm is from Jeff
-			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
-			 * works by repeatedly computing the number of tuples to skip
-			 * before selecting a tuple, which replaces a randomly chosen
-			 * element of the reservoir (current set of tuples).  At all times
-			 * the reservoir is a true random sample of the tuples we've
-			 * passed over so far, so when we fall off the end of the relation
-			 * we're done.
-			 */
-			if (numrows < targrows)
-				rows[numrows++] = ExecCopySlotHeapTuple(slot);
-			else
-			{
-				/*
-				 * t in Vitter's paper is the number of records already
-				 * processed.  If we need to compute a new S value, we must
-				 * use the not-yet-incremented value of samplerows as t.
-				 */
-				if (rowstoskip < 0)
-					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
-				if (rowstoskip <= 0)
-				{
-					/*
-					 * Found a suitable tuple, so save it, replacing one old
-					 * tuple at random
-					 */
-					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
-
-					Assert(k >= 0 && k < targrows);
-					heap_freetuple(rows[k]);
-					rows[k] = ExecCopySlotHeapTuple(slot);
-				}
-
-				rowstoskip -= 1;
-			}
-
-			samplerows += 1;
-		}
-
-		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
-									 ++blksdone);
-	}
-
-	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
-
-	/*
-	 * If we didn't find as many tuples as we wanted then we're done. No sort
-	 * is needed, since they're already in order.
-	 *
-	 * Otherwise we need to sort the collected tuples by position
-	 * (itempointer). It's not worth worrying about corner cases where the
-	 * tuples are already sorted.
-	 */
-	if (numrows == targrows)
-		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
-							compare_rows, NULL);
-
-	/*
-	 * Estimate total numbers of live and dead rows in relation, extrapolating
-	 * on the assumption that the average tuple density in pages we didn't
-	 * scan is the same as in the pages we did scan.  Since what we scanned is
-	 * a random sample of the pages in the relation, this should be a good
-	 * assumption.
-	 */
-	if (bs.m > 0)
-	{
-		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
-		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
-	}
-	else
-	{
-		*totalrows = 0.0;
-		*totaldeadrows = 0.0;
-	}
-
-	/*
-	 * Emit some interesting relation info
-	 */
-	ereport(elevel,
-			(errmsg("\"%s\": scanned %d of %u pages, "
-					"containing %.0f live rows and %.0f dead rows; "
-					"%d rows in sample, %.0f estimated total rows",
-					RelationGetRelationName(onerel),
-					bs.m, totalblocks,
-					liverows, deadrows,
-					numrows, *totalrows)));
-
-	return numrows;
-}
-
-/*
- * Comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b, void *arg)
-{
-	HeapTuple	ha = *(const HeapTuple *) a;
-	HeapTuple	hb = *(const HeapTuple *) b;
-	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
-	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
-	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
-	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
-	if (ba < bb)
-		return -1;
-	if (ba > bb)
-		return 1;
-	if (oa < ob)
-		return -1;
-	if (oa > ob)
-		return 1;
-	return 0;
-}
-
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1462,9 +1186,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index f192e09313b..5f3c7f865ef 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -658,41 +659,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -713,6 +679,15 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/*
+	 * Provide a row sampling callback for the relation and report the
+	 * number of relation pages.
+	 */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1744,42 +1719,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1885,6 +1824,17 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * Provide a row sampling callback for the relation and report the number
+ * of relation pages.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1182a967427..d38ddc68b79 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -104,6 +104,11 @@ typedef struct ParallelVacuumState ParallelVacuumState;
  */
 typedef struct VacAttrStats *VacAttrStatsP;
 
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 typedef Datum (*AnalyzeAttrFetchFunc) (VacAttrStatsP stats, int rownum,
 									   bool *isNull);
 
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index fcde3876b28..0968e0a01ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.3 (Apple Git-145)

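As an illustration of the new contract (not taken from the patch; the my_am_*
names are invented), a non-block-oriented AM could plug in a sampler that
walks its own storage instead of the block-based reservoir sampling above:

/*
 * Invented sketch of an AM-provided sampler.  A real implementation would
 * walk the AM's own storage, copy sampled tuples into rows[0..targrows-1],
 * and count live/dead rows.
 */
static int
my_am_acquire_sample_rows(Relation onerel, int elevel,
						  HeapTuple *rows, int targrows,
						  double *totalrows, double *totaldeadrows)
{
	int			numrows = 0;

	/* ... AM-specific sampling would go here ... */

	*totalrows = 0.0;
	*totaldeadrows = 0.0;
	return numrows;
}

/* Candidate for the new TableAmRoutine.relation_analyze callback */
static void
my_am_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
					   BlockNumber *totalpages,
					   BufferAccessStrategy bstrategy)
{
	*func = my_am_acquire_sample_rows;
	*totalpages = RelationGetNumberOfBlocks(relation);
}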
0005-Add-TupleTableSlotOps.is_current_xact_tuple-metho-v4.patch (application/octet-stream)
From 624145a674a33bedc98d3fb0e4dd5ef0b3cdd445 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:47:53 +0300
Subject: [PATCH 05/13] Add TupleTableSlotOps.is_current_xact_tuple() method

This allows us to abstract how/whether a table AM uses transaction identifiers.
A custom table AM can use a custom slot, which may not store xmin directly
but instead determines whether a tuple belongs to the current transaction in
some other way.
---
 src/backend/executor/execTuples.c   | 79 +++++++++++++++++++++++++++++
 src/backend/utils/adt/ri_triggers.c |  8 +--
 src/include/executor/tuptable.h     | 21 ++++++++
 3 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index a7aa2ee02b1..45b85b15851 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -60,6 +60,7 @@
 #include "access/heaptoast.h"
 #include "access/htup_details.h"
 #include "access/tupdesc_details.h"
+#include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "nodes/nodeFuncs.h"
@@ -148,6 +149,22 @@ tts_virtual_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return 0;					/* silence compiler warnings */
 }
 
+/*
+ * VirtualTupleTableSlots never have storage tuples.  We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
+static bool
+tts_virtual_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	Assert(!TTS_EMPTY(slot));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("don't have a storage tuple in this context")));
+
+	return false;					/* silence compiler warnings */
+}
+
 /*
  * To materialize a virtual slot all the datums that aren't passed by value
  * have to be copied into the slot's memory context.  To do so, compute the
@@ -354,6 +371,29 @@ tts_heap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 						   slot->tts_tupleDescriptor, isnull);
 }
 
+static bool
+tts_heap_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot;
+	TransactionId xmin;
+
+	Assert(!TTS_EMPTY(slot));
+
+	/*
+	 * In some code paths it's possible to get here with a non-materialized
+	 * slot, in which case we can't check if tuple is created by the current
+	 * transaction.
+	 */
+	if (!hslot->tuple)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("don't have a storage tuple in this context")));
+
+	xmin = HeapTupleHeaderGetRawXmin(hslot->tuple->t_data);
+
+	return TransactionIdIsCurrentTransactionId(xmin);
+}
+
 static void
 tts_heap_materialize(TupleTableSlot *slot)
 {
@@ -521,6 +561,18 @@ tts_minimal_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return 0;					/* silence compiler warnings */
 }
 
+static bool
+tts_minimal_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	Assert(!TTS_EMPTY(slot));
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("don't have a storage tuple in this context")));
+
+	return false;					/* silence compiler warnings */
+}
+
 static void
 tts_minimal_materialize(TupleTableSlot *slot)
 {
@@ -714,6 +766,29 @@ tts_buffer_heap_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 						   slot->tts_tupleDescriptor, isnull);
 }
 
+static bool
+tts_buffer_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+	TransactionId xmin;
+
+	Assert(!TTS_EMPTY(slot));
+
+	/*
+	 * In some code paths it's possible to get here with a non-materialized
+	 * slot, in which case we can't check if tuple is created by the current
+	 * transaction.
+	 */
+	if (!bslot->base.tuple)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("don't have a storage tuple in this context")));
+
+	xmin = HeapTupleHeaderGetRawXmin(bslot->base.tuple->t_data);
+
+	return TransactionIdIsCurrentTransactionId(xmin);
+}
+
 static void
 tts_buffer_heap_materialize(TupleTableSlot *slot)
 {
@@ -1029,6 +1104,7 @@ const TupleTableSlotOps TTSOpsVirtual = {
 	.getsomeattrs = tts_virtual_getsomeattrs,
 	.getsysattr = tts_virtual_getsysattr,
 	.materialize = tts_virtual_materialize,
+	.is_current_xact_tuple = tts_virtual_is_current_xact_tuple,
 	.copyslot = tts_virtual_copyslot,
 
 	/*
@@ -1048,6 +1124,7 @@ const TupleTableSlotOps TTSOpsHeapTuple = {
 	.clear = tts_heap_clear,
 	.getsomeattrs = tts_heap_getsomeattrs,
 	.getsysattr = tts_heap_getsysattr,
+	.is_current_xact_tuple = tts_heap_is_current_xact_tuple,
 	.materialize = tts_heap_materialize,
 	.copyslot = tts_heap_copyslot,
 	.get_heap_tuple = tts_heap_get_heap_tuple,
@@ -1065,6 +1142,7 @@ const TupleTableSlotOps TTSOpsMinimalTuple = {
 	.clear = tts_minimal_clear,
 	.getsomeattrs = tts_minimal_getsomeattrs,
 	.getsysattr = tts_minimal_getsysattr,
+	.is_current_xact_tuple = tts_minimal_is_current_xact_tuple,
 	.materialize = tts_minimal_materialize,
 	.copyslot = tts_minimal_copyslot,
 
@@ -1082,6 +1160,7 @@ const TupleTableSlotOps TTSOpsBufferHeapTuple = {
 	.clear = tts_buffer_heap_clear,
 	.getsomeattrs = tts_buffer_heap_getsomeattrs,
 	.getsysattr = tts_buffer_heap_getsysattr,
+	.is_current_xact_tuple = tts_buffer_is_current_xact_tuple,
 	.materialize = tts_buffer_heap_materialize,
 	.copyslot = tts_buffer_heap_copyslot,
 	.get_heap_tuple = tts_buffer_heap_get_heap_tuple,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 2fe93775003..62601a6d80c 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -1260,9 +1260,6 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 {
 	const RI_ConstraintInfo *riinfo;
 	int			ri_nullcheck;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
 
 	/*
 	 * AfterTriggerSaveEvent() handles things such that this function is never
@@ -1330,10 +1327,7 @@ RI_FKey_fk_upd_check_required(Trigger *trigger, Relation fk_rel,
 	 * this if we knew the INSERT trigger already fired, but there is no easy
 	 * way to know that.)
 	 */
-	xminDatum = slot_getsysattr(oldslot, MinTransactionIdAttributeNumber, &isnull);
-	Assert(!isnull);
-	xmin = DatumGetTransactionId(xminDatum);
-	if (TransactionIdIsCurrentTransactionId(xmin))
+	if (slot_is_current_xact_tuple(oldslot))
 		return true;
 
 	/* If all old and new key values are equal, no check is needed */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 6133dbcd0a3..c2eddda74a8 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -166,6 +166,12 @@ struct TupleTableSlotOps
 	 */
 	Datum		(*getsysattr) (TupleTableSlot *slot, int attnum, bool *isnull);
 
+	/*
+	 * Check whether the tuple was created by the current transaction.
+	 * Throws an error if the slot doesn't contain a storage tuple.
+	 */
+	bool		(*is_current_xact_tuple) (TupleTableSlot *slot);
+
 	/*
 	 * Make the contents of the slot solely depend on the slot, and not on
 	 * underlying resources (like another memory context, buffers, etc).
@@ -426,6 +432,21 @@ slot_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return slot->tts_ops->getsysattr(slot, attnum, isnull);
 }
 
+/*
+ * slot_is_current_xact_tuple - check whether the slot's current tuple was
+ *								created by the current transaction.
+ *
+ * If the slot does not contain a storage tuple, this will throw an error.
+ * Hence, before calling this function, callers should make sure that the
+ * slot type supports storage tuples and that there is currently one inside
+ * the slot.
+ */
+static inline bool
+slot_is_current_xact_tuple(TupleTableSlot *slot)
+{
+	return slot->tts_ops->is_current_xact_tuple(slot);
+}
+
 /*
  * ExecClearTuple - clear the slot's contents
  */
-- 
2.39.3 (Apple Git-145)

0004-Allow-table-AM-tuple_insert-method-to-return-the--v4.patchapplication/octet-stream; name=0004-Allow-table-AM-tuple_insert-method-to-return-the--v4.patchDownload
From 3e883bea42d20c75bd5a432a6f32ef19f4ea2c6a Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:28:27 +0300
Subject: [PATCH 04/13] Allow table AM tuple_insert() method to return the
 different slot

This allows a table AM to return its native tuple slot even if a
VirtualTupleTableSlot is given as input.  A native tuple slot has knowledge
of table AM-specific system attributes, which could be accessed in the future.
---
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/executor/nodeModifyTable.c   |  6 +++---
 src/include/access/tableam.h             | 20 +++++++++++---------
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index da86ca5c31a..6abfe36dec7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -243,7 +243,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
  * ----------------------------------------------------------------------------
  */
 
-static void
+static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 					int options, BulkInsertState bistate)
 {
@@ -260,6 +260,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 
 	if (shouldFree)
 		pfree(tuple);
+
+	return slot;
 }
 
 static void
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 79257416426..d1917f2fea7 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1134,9 +1134,9 @@ ExecInsert(ModifyTableContext *context,
 		else
 		{
 			/* insert the tuple normally */
-			table_tuple_insert(resultRelationDesc, slot,
-							   estate->es_output_cid,
-							   0, NULL);
+			slot = table_tuple_insert(resultRelationDesc, slot,
+									  estate->es_output_cid,
+									  0, NULL);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 20b965545db..f192e09313b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -509,9 +509,9 @@ typedef struct TableAmRoutine
 	 */
 
 	/* see table_tuple_insert() for reference about parameters */
-	void		(*tuple_insert) (Relation rel, TupleTableSlot *slot,
-								 CommandId cid, int options,
-								 struct BulkInsertStateData *bistate);
+	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
+									 CommandId cid, int options,
+									 struct BulkInsertStateData *bistate);
 
 	/* see table_tuple_insert_speculative() for reference about parameters */
 	void		(*tuple_insert_speculative) (Relation rel,
@@ -1402,16 +1402,18 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
- * On return the slot's tts_tid and tts_tableOid are updated to reflect the
- * insertion. But note that any toasting of fields within the slot is NOT
- * reflected in the slots contents.
+ * Returns the slot containing the inserted tuple, which may differ from the
+ * given slot.  For instance, the source slot may be a VirtualTupleTableSlot,
+ * but the result corresponds to the table AM.  On return the slot's tts_tid
+ * and tts_tableOid are updated to reflect the insertion.  But note that any
+ * toasting of fields within the slot is NOT reflected in the slot's contents.
  */
-static inline void
+static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 				   int options, struct BulkInsertStateData *bistate)
 {
-	rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-								  bistate);
+	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
+										 bistate);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

0003-Allow-table-AM-to-store-complex-data-structures-i-v4.patchapplication/octet-stream; name=0003-Allow-table-AM-to-store-complex-data-structures-i-v4.patchDownload
From 38948f0ac674cdd3b8403b52e757e1d49923d50f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 7 Jun 2023 13:04:58 +0300
Subject: [PATCH 03/13] Allow table AM to store complex data structures in
 rd_amcache

New table AM method free_rd_amcache is responsible for freeing the rd_amcache.
---
 src/backend/access/heap/heapam_handler.c |  1 +
 src/backend/utils/cache/relcache.c       | 18 +++++++------
 src/include/access/tableam.h             | 33 ++++++++++++++++++++++++
 src/include/utils/rel.h                  | 10 ++++---
 4 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7c7204a2422..da86ca5c31a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2640,6 +2640,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 
+	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2cd19d603fb..6d98bdfba06 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -318,6 +318,7 @@ static OpClassCacheEnt *LookupOpclassInfo(Oid operatorClassOid,
 										  StrategyNumber numSupport);
 static void RelationCacheInitFileRemoveInDir(const char *tblspcpath);
 static void unlink_initfile(const char *initfilename, int elevel);
+static void release_rd_amcache(Relation rel);
 
 
 /*
@@ -2262,9 +2263,7 @@ RelationReloadIndexInfo(Relation relation)
 	RelationCloseSmgr(relation);
 
 	/* Must free any AM cached data upon relcache flush */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * If it's a shared index, we might be called before backend startup has
@@ -2484,8 +2483,7 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
 		pfree(relation->rd_options);
 	if (relation->rd_indextuple)
 		pfree(relation->rd_indextuple);
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
+	release_rd_amcache(relation);
 	if (relation->rd_fdwroutine)
 		pfree(relation->rd_fdwroutine);
 	if (relation->rd_indexcxt)
@@ -2547,9 +2545,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 	RelationCloseSmgr(relation);
 
 	/* Free AM cached data, if any */
-	if (relation->rd_amcache)
-		pfree(relation->rd_amcache);
-	relation->rd_amcache = NULL;
+	release_rd_amcache(relation);
 
 	/*
 	 * Treat nailed-in system relations separately, they always need to be
@@ -6868,3 +6864,9 @@ ResOwnerReleaseRelation(Datum res)
 
 	RelationCloseCleanup((Relation) res);
 }
+
+static void
+release_rd_amcache(Relation rel)
+{
+	table_free_rd_amcache(rel);
+}
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b35a22506c0..20b965545db 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -719,6 +719,13 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * This callback frees relation private cache data stored in rd_amcache.
+	 * If this callback is not provided, rd_amcache is assumed to point to a
+	 * single memory chunk.
+	 */
+	void		(*free_rd_amcache) (Relation rel);
+
 	/*
 	 * See table_relation_size().
 	 *
@@ -1882,6 +1889,32 @@ table_index_validate_scan(Relation table_rel,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Frees relation private cache data stored in rd_amcache.  Uses the
+ * free_rd_amcache method if provided; otherwise assumes rd_amcache points to
+ * a single memory chunk.
+ */
+static inline void
+table_free_rd_amcache(Relation rel)
+{
+	if (rel->rd_tableam && rel->rd_tableam->free_rd_amcache)
+	{
+		rel->rd_tableam->free_rd_amcache(rel);
+
+		/*
+		 * We are assuming free_rd_amcache() did clear the cache and left NULL
+		 * in rd_amcache.
+		 */
+		Assert(rel->rd_amcache == NULL);
+	}
+	else
+	{
+		if (rel->rd_amcache)
+			pfree(rel->rd_amcache);
+		rel->rd_amcache = NULL;
+	}
+}
+
 /*
  * Return the current size of `rel` in bytes. If `forkNumber` is
  * InvalidForkNumber, return the relation's overall size, otherwise the size
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 87002049538..69557fc7a2c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -221,10 +221,12 @@ typedef struct RelationData
 	 * rd_amcache is available for index and table AMs to cache private data
 	 * about the relation.  This must be just a cache since it may get reset
 	 * at any time (in particular, it will get reset by a relcache inval
-	 * message for the relation).  If used, it must point to a single memory
-	 * chunk palloc'd in CacheMemoryContext, or in rd_indexcxt for an index
-	 * relation.  A relcache reset will include freeing that chunk and setting
-	 * rd_amcache = NULL.
+	 * message for the relation).  If used by a table AM, it must point to a
+	 * single memory chunk palloc'd in CacheMemoryContext, or to a more complex
+	 * data structure in that memory context to be freed by the free_rd_amcache
+	 * method.  If used by an index AM, it must point to a single memory chunk
+	 * palloc'd in the rd_indexcxt memory context.  A relcache reset will
+	 * include freeing that chunk and setting rd_amcache = NULL.
 	 */
 	void	   *rd_amcache;		/* available for use by index/table AM */
 
-- 
2.39.3 (Apple Git-145)

0002-Add-EvalPlanQual-delete-returning-isolation-test-v4.patchapplication/octet-stream; name=0002-Add-EvalPlanQual-delete-returning-isolation-test-v4.patchDownload
From ecec5757f7209da1c1933925eb8da7f8ec2788d0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 22 Mar 2023 16:47:09 -0700
Subject: [PATCH 02/13] Add EvalPlanQual delete returning isolation test

Author: Andres Freund
Reviewed-by: Pavel Borisov
Discussion: https://www.postgresql.org/message-id/flat/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
---
 .../isolation/expected/eval-plan-qual.out     | 30 +++++++++++++++++++
 src/test/isolation/specs/eval-plan-qual.spec  |  4 +++
 2 files changed, 34 insertions(+)

diff --git a/src/test/isolation/expected/eval-plan-qual.out b/src/test/isolation/expected/eval-plan-qual.out
index 73e0aeb50e7..0237271ceec 100644
--- a/src/test/isolation/expected/eval-plan-qual.out
+++ b/src/test/isolation/expected/eval-plan-qual.out
@@ -746,6 +746,36 @@ savings  |    600|    1200
 (2 rows)
 
 
+starting permutation: read wx2 wb1 c2 c1 read
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |    600|    1200
+savings  |    600|    1200
+(2 rows)
+
+step wx2: UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance;
+balance
+-------
+   1050
+(1 row)
+
+step wb1: DELETE FROM accounts WHERE balance = 600 RETURNING *; <waiting ...>
+step c2: COMMIT;
+step wb1: <... completed>
+accountid|balance|balance2
+---------+-------+--------
+savings  |    600|    1200
+(1 row)
+
+step c1: COMMIT;
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |   1050|    2100
+(1 row)
+
+
 starting permutation: upsert1 upsert2 c1 c2 read
 step upsert1: 
 	WITH upsert AS
diff --git a/src/test/isolation/specs/eval-plan-qual.spec b/src/test/isolation/specs/eval-plan-qual.spec
index 735c671734e..edd6d19df3a 100644
--- a/src/test/isolation/specs/eval-plan-qual.spec
+++ b/src/test/isolation/specs/eval-plan-qual.spec
@@ -76,6 +76,8 @@ setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
 step wx1	{ UPDATE accounts SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 # wy1 then wy2 checks the case where quals pass then fail
 step wy1	{ UPDATE accounts SET balance = balance + 500 WHERE accountid = 'checking' RETURNING balance; }
+# wx2 then wb1 checks the case of re-fetching up-to-date values for DELETE ... RETURNING ...
+step wb1	{ DELETE FROM accounts WHERE balance = 600 RETURNING *; }
 
 step wxext1	{ UPDATE accounts_ext SET balance = balance - 200 WHERE accountid = 'checking' RETURNING balance; }
 step tocds1	{ UPDATE accounts SET accountid = 'cds' WHERE accountid = 'checking'; }
@@ -353,6 +355,8 @@ permutation wx1 delwcte c1 c2 read
 # test that a delete to a self-modified row throws error when
 # previously updated by a different cid
 permutation wx1 delwctefail c1 c2 read
+# test that a delete re-fetches up-to-date values for returning clause
+permutation read wx2 wb1 c2 c1 read
 
 permutation upsert1 upsert2 c1 c2 read
 permutation readp1 writep1 readp2 c1 c2
-- 
2.39.3 (Apple Git-145)

#19Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#18)
Re: Table AM Interface Enhancements

Hi, Alexander!

For 0007:

Code inside

+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+   if (relkind == RELKIND_RELATION ||
+       relkind == RELKIND_TOASTVALUE ||
+       relkind == RELKIND_MATVIEW)
+       return heap_reloptions(relkind, reloptions, validate);
+
+   return NULL;

looks redundant with what is done inside heap_reloptions(). Was this on
purpose? Is it possible to leave only "return heap_reloptions()"?
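
To be explicit, this is the kind of simplification I have in mind (just a
sketch; as far as I can tell, heap_reloptions() already returns NULL for
relkinds it doesn't handle):

static bytea *
heapam_reloptions(char relkind, Datum reloptions, bool validate)
{
	/* heap_reloptions() itself ignores relkinds it doesn't support */
	return heap_reloptions(relkind, reloptions, validate);
}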

This looks like a duplicate:
src/include/access/reloptions.h:extern bytea
*index_reloptions(amoptions_function amoptions, Datum reloptions,
src/include/access/tableam.h:extern bytea
*index_reloptions(amoptions_function amoptions, Datum reloptions,

Otherwise the patch looks good and does what it sets out to do.

Regards,
Pavel Borisov.

#20Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#19)
Re: Table AM Interface Enhancements

Hi, Alexander!

On Wed, 20 Mar 2024 at 09:22, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

Hi, Alexander!

For 0007:

Code inside

+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+   if (relkind == RELKIND_RELATION ||
+       relkind == RELKIND_TOASTVALUE ||
+       relkind == RELKIND_MATVIEW)
+       return heap_reloptions(relkind, reloptions, validate);
+
+   return NULL;

looks redundant to what is done inside heap_reloptions(). Was this on
purpose? Is it possible to leave only "return heap_reloptions()" ?

This looks like a duplicate:
src/include/access/reloptions.h:extern bytea
*index_reloptions(amoptions_function amoptions, Datum reloptions,
src/include/access/tableam.h:extern bytea
*index_reloptions(amoptions_function amoptions, Datum reloptions,

Otherwise the patch looks good and doing what it's proposed to do.

For patch 0006:

The change for analyze is in the same style as the previous table AM
extensibility patches.

The existing table_scan_analyze_next_tuple/table_scan_analyze_next_block
extensibility is dropped in favour of the more general method
table_relation_analyze. I haven't found any existing extensions on GitHub
that use these table AM callbacks, so it's probably fine to remove
extensibility that didn't get any traction for many years.
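
For illustration, this is roughly what the new extension point looks like
from a table AM's side (only a sketch: my_relation_analyze and
my_acquire_sample_rows are made-up names, it assumes access/tableam.h,
commands/vacuum.h and storage/bufmgr.h are included, and the callback
signature is as in the patch):

/* AM-specific sampling routine that ANALYZE calls back into */
static int
my_acquire_sample_rows(Relation onerel, int elevel,
					   HeapTuple *rows, int targrows,
					   double *totalrows, double *totaldeadrows)
{
	/* fill rows[] with up to targrows sampled tuples here */
	*totalrows = 0;
	*totaldeadrows = 0;
	return 0;
}

/* relation_analyze callback: hand ANALYZE our sampler and a page count */
static void
my_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
					BlockNumber *totalpages, BufferAccessStrategy bstrategy)
{
	*func = my_acquire_sample_rows;
	*totalpages = RelationGetNumberOfBlocks(relation);
}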

The patch contains a big block of copy-pasted code. I've checked that the
code is the same apart from function name replacements to use the table AM
instead of the heap AM. I'd propose restoring the static function
declarations at the head of the file, which were removed in the patch, and
placing heapam_acquire_sample_rows() above compare_rows() so that the
functions are copied as one whole code block. This is only about patch
readability, not a principled change.

-static int acquire_sample_rows(Relation onerel, int elevel,
- HeapTuple *rows, int targrows,
- double *totalrows, double *totaldeadrows);
-static int compare_rows(const void *a, const void *b, void *arg)

Might there also be a better place than vacuum.h for the
typedef int (*AcquireSampleRowsFunc)? Maybe sampling.h?

The other patch that I'd like to review is 0012:

For the typedef enum RowRefType, I think some comments would be useful to
avoid confusion about changes like
-               newrc->allMarkTypes = (1 << newrc->markType);
+              newrc->allRefTypes = (1 << refType);

Also I think the semantic difference between ROW_REF_COPY and ROW_MARK_COPY
should be mentioned in the comments and/or the commit message. This may
include a description of how the different ref types are assigned in
parse_relation.c.
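
Something along these lines next to the enum definition would already help
(the member names and wording here are just my reading of the patch, not a
quote from it):

/*
 * RowRefType describes how a row is identified, independently of the lock
 * strength requested for it, which stays in RowMarkType.
 */
typedef enum RowRefType
{
	ROW_REF_TID,				/* row is identified by its ctid */
	ROW_REF_COPY				/* whole row value is copied, for relations
								 * without a usable row identifier */
} RowRefType;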

In a comment there is a small confusion between markType and refType:

* The parent's allRefTypes field gets the OR of (1<<refType) across all
* its children (this definition allows children to use different
markTypes).

Both patches look good to me and are ready, though they may need minimal
comments/cosmetic work.

Regards,
Pavel Borisov

#21Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#19)
1 attachment(s)
Re: Table AM Interface Enhancements

Hi, Alexander!

Thank you for working on this patchset and pushing some of these patches!

I tried to write comments for tts_minimal_is_current_xact_tuple()
and tts_minimal_getsomeattrs() so that they match the comments on the
corresponding functions for the heap and virtual tuple slots, as I proposed
earlier in the thread. (tts_minimal_getsysattr is not introduced by the
current patchset, but anyway.)

Meanwhile I found that the (never-appearing) error message
for tts_minimal_is_current_xact_tuple needs to be corrected. Please see the
patch in the attachment.

Regards,
Pavel Borisov

Attachments:

Add-comments-on-MinimalTupleSlots-usage.patchapplication/octet-stream; name=Add-comments-on-MinimalTupleSlots-usage.patchDownload
From d3614733cf7904edc3040c6fe140ea1e75544529 Mon Sep 17 00:00:00 2001
From: Pavel Borisov <pashkin.elfe@gmail.com>
Date: Fri, 22 Mar 2024 08:41:20 +0400
Subject: [PATCH] Add comments on MinimalTupleSlots usage.

Also refine error message introduced in 0997e0af273d80add
---
 src/backend/executor/execTuples.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 7a7c786041..0ae240ced4 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -549,6 +549,10 @@ tts_minimal_getsomeattrs(TupleTableSlot *slot, int natts)
 	slot_deform_heap_tuple(slot, mslot->tuple, &mslot->off, natts);
 }
 
+/*
+ * MinimalTupleTableSlots never provide system attributes. We generally
+ * shouldn't get here, but provide a user-friendly message if we do.
+ */
 static Datum
 tts_minimal_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 {
@@ -561,6 +565,11 @@ tts_minimal_getsysattr(TupleTableSlot *slot, int attnum, bool *isnull)
 	return 0;					/* silence compiler warnings */
 }
 
+/*
+ * Within MinimalTuple abstraction transaction information is unavailable.
+ * We generally shouldn't get here, but provide a user-friendly message if
+ * we do.
+ */
 static bool
 tts_minimal_is_current_xact_tuple(TupleTableSlot *slot)
 {
@@ -568,7 +577,7 @@ tts_minimal_is_current_xact_tuple(TupleTableSlot *slot)
 
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("don't have a storage tuple in this context")));
+			 errmsg("don't have transaction information for this type of tuple")));
 
 	return false;				/* silence compiler warnings */
 }
-- 
2.39.2 (Apple Git-143)

#22Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#21)
Re: Table AM Interface Enhancements

On Fri, 22 Mar 2024 at 08:51, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

Hi, Alexander!

Thank you for working on this patchset and pushing some of these patches!

I tried to write comments for tts_minimal_is_current_xact_tuple()
and tts_minimal_getsomeattrs() for them to be the same as for the same
functions for heap and virtual tuple slots, as I proposed above in the
thread. (tts_minimal_getsysattr is not introduced by the current patchset,
but anyway)

Meanwhile I found that (never appearing) error message
for tts_minimal_is_current_xact_tuple needs to be corrected. Please see the
patch in the attachment.

I need to correct myself: it's for tts_minimal_getsysattr(), not
tts_minimal_getsomeattrs().

Pavel.

#23Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#22)
8 attachment(s)
Re: Table AM Interface Enhancements

On Fri, Mar 22, 2024 at 6:56 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Fri, 22 Mar 2024 at 08:51, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

Hi, Alexander!

Thank you for working on this patchset and pushing some of these patches!

I tried to write comments for tts_minimal_is_current_xact_tuple() and tts_minimal_getsomeattrs() for them to be the same as for the same functions for heap and virtual tuple slots, as I proposed above in the thread. (tts_minimal_getsysattr is not introduced by the current patchset, but anyway)

Meanwhile I found that (never appearing) error message for tts_minimal_is_current_xact_tuple needs to be corrected. Please see the patch in the attachment.

I need to correct myself: it's for tts_minimal_getsysattr() not tts_minimal_getsomeattrs()

Pushed.

The revised remainder of the patchset is attached.
0001 (was 0006) – I prefer the definition of AcquireSampleRowsFunc to
stay in vacuum.h. If we moved it to sampling.h, we would have to add
includes there to define Relation, HeapTuple, etc., and I'd like to
avoid that kind of change. Also, I've deleted
table_beginscan_analyze(), because it's only called from the
tableam-specific AcquireSampleRowsFunc. I've also added comments to
heapam_scan_analyze_next_block() and heapam_scan_analyze_next_tuple(),
given that there are now no relevant comments for them in tableam.h.
I've removed some redundancies from acquire_sample_rows() and added
comments to AcquireSampleRowsFunc based on what the FDW docs say about
this function, plus some other small edits. As you suggested, I've
restored the declarations for acquire_sample_rows() and compare_rows().

0002 (was 0007) – I've turned the redundant "if" that you pointed out
into an assert. I've also added some comments, most notably a comment
for TableAmRoutine.reloptions based on the indexam docs.
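
In other words, the reworked check is roughly this (a sketch, not the exact
hunk from the patch):

static bytea *
heapam_reloptions(char relkind, Datum reloptions, bool validate)
{
	/* only these relkinds are expected to reach the heap AM here */
	Assert(relkind == RELKIND_RELATION ||
		   relkind == RELKIND_TOASTVALUE ||
		   relkind == RELKIND_MATVIEW);

	return heap_reloptions(relkind, reloptions, validate);
}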

0007 (was 0012) – This patch doesn't make much sense without removing
ROW_MARK_COPY. What an oversight on my part! I managed to remove
ROW_MARK_COPY so that the tests pass, added a lot of comments, and made
other improvements. The problem is that I haven't managed to research
all the consequences of this patch for FDWs, and I think there are open
design questions. In particular, how should ROW_REF_COPY work with row
marks other than ROW_MARK_REFERENCE, and should it work at all? This
would require some consensus, which doesn't seem feasible to reach
before feature freeze. So I think this is not a subject for v17.

Other patches are without changes.

------
Regards,
Alexander Korotkov

Attachments:

0004-Let-table-AM-override-reloptions-for-indexes-buil-v5.patchapplication/octet-stream; name=0004-Let-table-AM-override-reloptions-for-indexes-buil-v5.patchDownload
From 5cfa13187983e7d6352eb776ba9856dd9ec332d4 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Mar 2024 00:53:05 +0200
Subject: [PATCH 4/8] Let table AM override reloptions for indexes built on its
 tables

---
 src/backend/access/common/reloptions.c   |  3 ++-
 src/backend/access/heap/heapam_handler.c |  8 ++++++++
 src/backend/commands/indexcmds.c         |  3 ++-
 src/backend/commands/tablecmds.c         |  9 +++++++-
 src/backend/utils/cache/relcache.c       | 24 ++++++++++++++++++++--
 src/include/access/tableam.h             | 26 ++++++++++++++++++++++++
 6 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388bb..00088240cdd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1411,7 +1411,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 9ce6fea82c5..c67a9aba389 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2725,6 +2725,13 @@ heapam_reloptions(char relkind, Datum reloptions, bool validate)
 	return heap_reloptions(relkind, reloptions, validate);
 }
 
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -3230,6 +3237,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
 	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487b..e78598c10e1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -899,7 +899,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6fc815666bf..eccd1131a5c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15539,7 +15539,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 039c0d3eef4..4343deb4ee3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -477,15 +477,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
 			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* fetch the relation's relcache entry */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 87b85c1c93c..e4fdd0b5ec6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -757,6 +758,13 @@ typedef struct TableAmRoutine
 	 */
 	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
 
+	/*
+	 * Parse table AM-specific index options.  Useful for table AM to define
+	 * new index options or override existing index options.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1971,6 +1979,24 @@ tableam_reloptions(const TableAmRoutine *tableam, char relkind,
 	return tableam->reloptions(relkind, reloptions, validate);
 }
 
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
+/*
+ * Parse index options.  Gives table AM a chance to override index-specific
+ * options defined in 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
-- 
2.39.3 (Apple Git-145)

0005-Notify-table-AM-about-index-creation-v5.patchapplication/octet-stream; name=0005-Notify-table-AM-about-index-creation-v5.patchDownload
From 196baba104ba6c78b00f90007788bea8bd4ba96f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH 5/8] Notify table AM about index creation

This allows a table AM to do some preparation alongside an index build.  In
particular, the table AM could update its specific meta-information.  That
could also be useful if the table AM overrides index implementations.
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c67a9aba389..1ebda05797f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -3230,6 +3230,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e230..bca97981051 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3840,6 +3840,8 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e78598c10e1..2570e7a24a1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -583,6 +583,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -629,6 +630,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that tableId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -656,24 +677,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1218,6 +1221,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e4fdd0b5ec6..4e9dab67969 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -684,6 +684,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1860,6 +1870,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let the table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, which should then be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notifies table AM about index creation on `rel` with oid `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.3 (Apple Git-145)

0003-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v5.patchapplication/octet-stream; name=0003-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v5.patchDownload
From 374d0b8093bffa380eb30c3b88e60ee21025bd47 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH 3/8] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs need to implement INSERT ... ON CONFLICT ... with
speculative tokens.  The only customization available is a custom
implementation of those tokens via the tuple_insert_speculative() and
tuple_complete_speculative() API functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a
single tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function provides clear semantics, allowing genuinely
different implementations of the INSERT ... ON CONFLICT ... functionality.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a586850ea2b..9ce6fea82c5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -314,6 +314,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2925,8 +3203,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index d9e23ef3175..c38ab936cde 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -70,8 +70,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d1917f2fea7..8e1c8f697c6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -129,7 +129,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -265,66 +264,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1015,12 +954,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1037,23 +983,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1076,57 +1027,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2441,144 +2348,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c882d7b8ad1..87b85c1c93c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -514,19 +515,16 @@ typedef struct TableAmRoutine
 									 CommandId cid, int options,
 									 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1400,36 +1398,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into a table, checking the arbiter indexes.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it also takes
+ * `arbiterIndexes`, the list of OIDs of the arbiter indexes.
+ *
+ * If the tuple doesn't violate uniqueness on any arbiter index, it is
+ * inserted and the slot containing the inserted tuple is returned.
+ *
+ * If the tuple violates uniqueness on some arbiter index, this function
+ * returns NULL and doesn't insert the tuple.  Additionally, if 'lockedSlot'
+ * is provided, the conflicting tuple gets locked in `lockmode` and placed
+ * into `lockedSlot`.
+ *
+ * The executor state `estate` is passed so the method can compute index
+ * tuples.  The temporary tuple table slot `tempSlot` is provided to hold a
+ * potentially conflicting tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)
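
For illustration, here is a rough sketch of how a table AM could implement the
new tuple_insert_with_arbiter() callback.  All myam_* names are hypothetical,
and the helper functions are only declared stubs standing in for AM-specific
logic.

/*
 * Hypothetical sketch of a tuple_insert_with_arbiter() implementation.
 * None of the myam_* functions exist in the patch set.
 */
#include "postgres.h"

#include "access/tableam.h"
#include "nodes/execnodes.h"

/* Assumed AM-internal helpers (illustrative only). */
static bool myam_find_conflict(Relation rel, TupleTableSlot *slot,
                               List *arbiterIndexes, EState *estate,
                               TupleTableSlot *conflictSlot);
static bool myam_lock_row(Relation rel, TupleTableSlot *conflictSlot,
                          CommandId cid, LockTupleMode lockmode,
                          TupleTableSlot *lockedSlot);
static TupleTableSlot *myam_do_insert(Relation rel, TupleTableSlot *slot,
                                      CommandId cid, int options,
                                      struct BulkInsertStateData *bistate);

static TupleTableSlot *
myam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
                               TupleTableSlot *slot,
                               CommandId cid, int options,
                               struct BulkInsertStateData *bistate,
                               List *arbiterIndexes,
                               EState *estate,
                               LockTupleMode lockmode,
                               TupleTableSlot *lockedSlot,
                               TupleTableSlot *tempSlot)
{
    Relation    rel = resultRelInfo->ri_RelationDesc;

    for (;;)
    {
        /* Check the arbiter indexes for a conflicting row. */
        if (myam_find_conflict(rel, slot, arbiterIndexes, estate, tempSlot))
        {
            /* DO NOTHING: report the conflict by returning NULL. */
            if (lockedSlot == NULL)
                return NULL;

            /* DO UPDATE: lock the conflicting row for the executor. */
            if (myam_lock_row(rel, tempSlot, cid, lockmode, lockedSlot))
                return NULL;

            /* The conflicting row vanished concurrently; retry. */
            continue;
        }

        /* No conflict: insert and return the slot holding the new row. */
        return myam_do_insert(rel, slot, cid, options, bistate);
    }
}

The point is that the whole conflict-check/lock/insert loop becomes the AM's
responsibility, so an AM like this never needs speculative insertion tokens.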

Attachment: 0001-Generalize-relation-analyze-in-table-AM-interface-v5.patch (application/octet-stream)
From d91d9aacaf3b02b15ca7fc61c62e8902ea4bf24d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 23:41:11 +0200
Subject: [PATCH 1/8] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table, written
in acquire_sample_rows().  A custom table AM can only redefine the way to get the
next block/tuple by implementing the scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.
---
 src/backend/access/heap/heapam_handler.c | 306 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           | 288 +--------------------
 src/include/access/tableam.h             | 106 ++------
 src/include/commands/vacuum.h            |  15 ++
 src/include/foreign/fdwapi.h             |   6 +-
 6 files changed, 344 insertions(+), 379 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..30ac13b9751 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -19,6 +19,8 @@
  */
 #include "postgres.h"
 
+#include <math.h>
+
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/heaptoast.h"
@@ -44,6 +46,8 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
+#include "utils/spccache.h"
 
 static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   Snapshot snapshot, TupleTableSlot *slot,
@@ -51,6 +55,11 @@ static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   LockWaitPolicy wait_policy, uint8 flags,
 								   TM_FailureData *tmfd);
 
+static int	acquire_sample_rows(Relation onerel, int elevel,
+								HeapTuple *rows, int targrows,
+								double *totalrows, double *totaldeadrows);
+static int	compare_rows(const void *a, const void *b, void *arg);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -1052,7 +1061,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-static bool
+/*
+ * Prepare to analyze block `blockno` of `scan`.  The scan has been started
+ * with heap_beginscan(..., SO_TYPE_ANALYZE).
+ *
+ * This routine holds a buffer pin and lock on the heap page.  They are held
+ * until heapam_scan_analyze_next_tuple() returns false, that is, until all
+ * the items of the heap page have been analyzed.
+ */
+static void
 heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 							   BufferAccessStrategy bstrategy)
 {
@@ -1072,11 +1089,18 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
 										blockno, RBM_NORMAL, bstrategy);
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-
-	/* in heap all blocks can contain tuples, so always return true */
-	return true;
 }
 
+/*
+ * Iterate over tuples in the block selected with
+ * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
+ * is found, true is returned and a tuple is stored in `slot`.  When no more
+ * tuples for sampling, false is returned and the pin and lock acquired by
+ * heapam_scan_analyze_next_block() are released.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
 static bool
 heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
@@ -1220,6 +1244,277 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * Comparator for sorting rows[] array
+ */
+static int
+compare_rows(const void *a, const void *b, void *arg)
+{
+	HeapTuple	ha = *(const HeapTuple *) a;
+	HeapTuple	hb = *(const HeapTuple *) b;
+	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
+	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
+	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
+	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
+
+	if (ba < bb)
+		return -1;
+	if (ba > bb)
+		return 1;
+	if (oa < ob)
+		return -1;
+	if (oa > ob)
+		return 1;
+	return 0;
+}
+
+static BufferAccessStrategy analyze_bstrategy;
+
+/*
+ * acquire_sample_rows -- acquire a random sample of rows from the table
+ *
+ * Selected rows are returned in the caller-allocated array rows[], which
+ * must have at least targrows entries.
+ * The actual number of rows selected is returned as the function result.
+ * We also estimate the total numbers of live and dead rows in the table,
+ * and return them into *totalrows and *totaldeadrows, respectively.
+ *
+ * The returned list of tuples is in order by physical position in the table.
+ * (We will rely on this later to derive correlation estimates.)
+ *
+ * As of May 2004 we use a new two-stage method:  Stage one selects up
+ * to targrows random blocks (or all blocks, if there aren't so many).
+ * Stage two scans these blocks and uses the Vitter algorithm to create
+ * a random sample of targrows rows (or less, if there are less in the
+ * sample of blocks).  The two stages are executed simultaneously: each
+ * block is processed as soon as stage one returns its number and while
+ * the rows are read stage two controls which ones are to be inserted
+ * into the sample.
+ *
+ * Although every row has an equal chance of ending up in the final
+ * sample, this sampling method is not perfect: not every possible
+ * sample has an equal chance of being selected.  For large relations
+ * the number of different blocks represented by the sample tends to be
+ * too small.  We can live with that for now.  Improvements are welcome.
+ *
+ * An important property of this sampling method is that because we do
+ * look at a statistically unbiased set of blocks, we should get
+ * unbiased estimates of the average numbers of live and dead rows per
+ * block.  The previous sampling method put too much credence in the row
+ * density near the start of the table.
+ */
+static int
+acquire_sample_rows(Relation onerel, int elevel,
+					HeapTuple *rows, int targrows,
+					double *totalrows, double *totaldeadrows)
+{
+	int			numrows = 0;	/* # rows now in reservoir */
+	double		samplerows = 0; /* total # rows collected */
+	double		liverows = 0;	/* # live rows seen */
+	double		deadrows = 0;	/* # dead rows seen */
+	double		rowstoskip = -1;	/* -1 means not set yet */
+	uint32		randseed;		/* Seed for block sampler(s) */
+	BlockNumber totalblocks;
+	TransactionId OldestXmin;
+	BlockSamplerData bs;
+	ReservoirStateData rstate;
+	TupleTableSlot *slot;
+	TableScanDesc scan;
+	BlockNumber nblocks;
+	BlockNumber blksdone = 0;
+#ifdef USE_PREFETCH
+	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
+	BlockSamplerData prefetch_bs;
+#endif
+
+	Assert(targrows > 0);
+
+	totalblocks = RelationGetNumberOfBlocks(onerel);
+
+	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
+	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
+
+	/* Prepare for sampling block numbers */
+	randseed = pg_prng_uint32(&pg_global_prng_state);
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
+
+#ifdef USE_PREFETCH
+	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
+	/* Create another BlockSampler, using the same seed, for prefetching */
+	if (prefetch_maximum)
+		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+#endif
+
+	/* Report sampling block numbers */
+	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
+								 nblocks);
+
+	/* Prepare for sampling rows */
+	reservoir_init_selection_state(&rstate, targrows);
+
+	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	slot = table_slot_create(onerel, NULL);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * If we are doing prefetching, then go ahead and tell the kernel about
+	 * the first set of pages we are going to want.  This also moves our
+	 * iterator out ahead of the main one being used, where we will keep it so
+	 * that we're always pre-fetching out prefetch_maximum number of blocks
+	 * ahead.
+	 */
+	if (prefetch_maximum)
+	{
+		for (int i = 0; i < prefetch_maximum; i++)
+		{
+			BlockNumber prefetch_block;
+
+			if (!BlockSampler_HasMore(&prefetch_bs))
+				break;
+
+			prefetch_block = BlockSampler_Next(&prefetch_bs);
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+		}
+	}
+#endif
+
+	/* Outer loop over blocks to sample */
+	while (BlockSampler_HasMore(&bs))
+	{
+		BlockNumber targblock = BlockSampler_Next(&bs);
+#ifdef USE_PREFETCH
+		BlockNumber prefetch_targblock = InvalidBlockNumber;
+
+		/*
+		 * Make sure that every time the main BlockSampler is moved forward
+		 * that our prefetch BlockSampler also gets moved forward, so that we
+		 * always stay out ahead.
+		 */
+		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
+			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+#endif
+
+		vacuum_delay_point();
+
+		heapam_scan_analyze_next_block(scan, targblock, analyze_bstrategy);
+
+#ifdef USE_PREFETCH
+
+		/*
+		 * When pre-fetching, after we get a block, tell the kernel about the
+		 * next one we will want, if there's any left.
+		 */
+		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+#endif
+
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		{
+			/*
+			 * The first targrows sample rows are simply copied into the
+			 * reservoir. Then we start replacing tuples in the sample until
+			 * we reach the end of the relation.  This algorithm is from Jeff
+			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
+			 * works by repeatedly computing the number of tuples to skip
+			 * before selecting a tuple, which replaces a randomly chosen
+			 * element of the reservoir (current set of tuples).  At all times
+			 * the reservoir is a true random sample of the tuples we've
+			 * passed over so far, so when we fall off the end of the relation
+			 * we're done.
+			 */
+			if (numrows < targrows)
+				rows[numrows++] = ExecCopySlotHeapTuple(slot);
+			else
+			{
+				/*
+				 * t in Vitter's paper is the number of records already
+				 * processed.  If we need to compute a new S value, we must
+				 * use the not-yet-incremented value of samplerows as t.
+				 */
+				if (rowstoskip < 0)
+					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+				if (rowstoskip <= 0)
+				{
+					/*
+					 * Found a suitable tuple, so save it, replacing one old
+					 * tuple at random
+					 */
+					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
+
+					Assert(k >= 0 && k < targrows);
+					heap_freetuple(rows[k]);
+					rows[k] = ExecCopySlotHeapTuple(slot);
+				}
+
+				rowstoskip -= 1;
+			}
+
+			samplerows += 1;
+		}
+
+		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
+									 ++blksdone);
+	}
+
+	ExecDropSingleTupleTableSlot(slot);
+	table_endscan(scan);
+
+	/*
+	 * If we didn't find as many tuples as we wanted then we're done. No sort
+	 * is needed, since they're already in order.
+	 *
+	 * Otherwise we need to sort the collected tuples by position
+	 * (itempointer). It's not worth worrying about corner cases where the
+	 * tuples are already sorted.
+	 */
+	if (numrows == targrows)
+		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
+							compare_rows, NULL);
+
+	/*
+	 * Estimate total numbers of live and dead rows in relation, extrapolating
+	 * on the assumption that the average tuple density in pages we didn't
+	 * scan is the same as in the pages we did scan.  Since what we scanned is
+	 * a random sample of the pages in the relation, this should be a good
+	 * assumption.
+	 */
+	if (bs.m > 0)
+	{
+		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
+		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+	}
+	else
+	{
+		*totalrows = 0.0;
+		*totaldeadrows = 0.0;
+	}
+
+	/*
+	 * Emit some interesting relation info
+	 */
+	ereport(elevel,
+			(errmsg("\"%s\": scanned %d of %u pages, "
+					"containing %.0f live rows and %.0f dead rows; "
+					"%d rows in sample, %.0f estimated total rows",
+					RelationGetRelationName(onerel),
+					bs.m, totalblocks,
+					liverows, deadrows,
+					numrows, *totalrows)));
+
+	return numrows;
+}
+
+static inline void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	analyze_bstrategy = bstrategy;
+}
+
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -2637,10 +2932,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index ce637a5a5d9..55b8caeadf2 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,8 +81,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..659f69ef270 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -87,10 +87,6 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
-static int	acquire_sample_rows(Relation onerel, int elevel,
-								HeapTuple *rows, int targrows,
-								double *totalrows, double *totaldeadrows);
-static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows, double *totaldeadrows);
@@ -190,10 +186,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1102,277 +1097,6 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
-/*
- * acquire_sample_rows -- acquire a random sample of rows from the table
- *
- * Selected rows are returned in the caller-allocated array rows[], which
- * must have at least targrows entries.
- * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
- * and return them into *totalrows and *totaldeadrows, respectively.
- *
- * The returned list of tuples is in order by physical position in the table.
- * (We will rely on this later to derive correlation estimates.)
- *
- * As of May 2004 we use a new two-stage method:  Stage one selects up
- * to targrows random blocks (or all blocks, if there aren't so many).
- * Stage two scans these blocks and uses the Vitter algorithm to create
- * a random sample of targrows rows (or less, if there are less in the
- * sample of blocks).  The two stages are executed simultaneously: each
- * block is processed as soon as stage one returns its number and while
- * the rows are read stage two controls which ones are to be inserted
- * into the sample.
- *
- * Although every row has an equal chance of ending up in the final
- * sample, this sampling method is not perfect: not every possible
- * sample has an equal chance of being selected.  For large relations
- * the number of different blocks represented by the sample tends to be
- * too small.  We can live with that for now.  Improvements are welcome.
- *
- * An important property of this sampling method is that because we do
- * look at a statistically unbiased set of blocks, we should get
- * unbiased estimates of the average numbers of live and dead rows per
- * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
- */
-static int
-acquire_sample_rows(Relation onerel, int elevel,
-					HeapTuple *rows, int targrows,
-					double *totalrows, double *totaldeadrows)
-{
-	int			numrows = 0;	/* # rows now in reservoir */
-	double		samplerows = 0; /* total # rows collected */
-	double		liverows = 0;	/* # live rows seen */
-	double		deadrows = 0;	/* # dead rows seen */
-	double		rowstoskip = -1;	/* -1 means not set yet */
-	uint32		randseed;		/* Seed for block sampler(s) */
-	BlockNumber totalblocks;
-	TransactionId OldestXmin;
-	BlockSamplerData bs;
-	ReservoirStateData rstate;
-	TupleTableSlot *slot;
-	TableScanDesc scan;
-	BlockNumber nblocks;
-	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
-
-	Assert(targrows > 0);
-
-	totalblocks = RelationGetNumberOfBlocks(onerel);
-
-	/* Need a cutoff xmin for HeapTupleSatisfiesVacuum */
-	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
-
-	/* Prepare for sampling block numbers */
-	randseed = pg_prng_uint32(&pg_global_prng_state);
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
-
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
-	/* Report sampling block numbers */
-	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
-								 nblocks);
-
-	/* Prepare for sampling rows */
-	reservoir_init_selection_state(&rstate, targrows);
-
-	scan = table_beginscan_analyze(onerel);
-	slot = table_slot_create(onerel, NULL);
-
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
-
-	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
-	{
-		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
-		vacuum_delay_point();
-
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
-		{
-			/*
-			 * The first targrows sample rows are simply copied into the
-			 * reservoir. Then we start replacing tuples in the sample until
-			 * we reach the end of the relation.  This algorithm is from Jeff
-			 * Vitter's paper (see full citation in utils/misc/sampling.c). It
-			 * works by repeatedly computing the number of tuples to skip
-			 * before selecting a tuple, which replaces a randomly chosen
-			 * element of the reservoir (current set of tuples).  At all times
-			 * the reservoir is a true random sample of the tuples we've
-			 * passed over so far, so when we fall off the end of the relation
-			 * we're done.
-			 */
-			if (numrows < targrows)
-				rows[numrows++] = ExecCopySlotHeapTuple(slot);
-			else
-			{
-				/*
-				 * t in Vitter's paper is the number of records already
-				 * processed.  If we need to compute a new S value, we must
-				 * use the not-yet-incremented value of samplerows as t.
-				 */
-				if (rowstoskip < 0)
-					rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
-
-				if (rowstoskip <= 0)
-				{
-					/*
-					 * Found a suitable tuple, so save it, replacing one old
-					 * tuple at random
-					 */
-					int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
-
-					Assert(k >= 0 && k < targrows);
-					heap_freetuple(rows[k]);
-					rows[k] = ExecCopySlotHeapTuple(slot);
-				}
-
-				rowstoskip -= 1;
-			}
-
-			samplerows += 1;
-		}
-
-		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
-									 ++blksdone);
-	}
-
-	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
-
-	/*
-	 * If we didn't find as many tuples as we wanted then we're done. No sort
-	 * is needed, since they're already in order.
-	 *
-	 * Otherwise we need to sort the collected tuples by position
-	 * (itempointer). It's not worth worrying about corner cases where the
-	 * tuples are already sorted.
-	 */
-	if (numrows == targrows)
-		qsort_interruptible(rows, numrows, sizeof(HeapTuple),
-							compare_rows, NULL);
-
-	/*
-	 * Estimate total numbers of live and dead rows in relation, extrapolating
-	 * on the assumption that the average tuple density in pages we didn't
-	 * scan is the same as in the pages we did scan.  Since what we scanned is
-	 * a random sample of the pages in the relation, this should be a good
-	 * assumption.
-	 */
-	if (bs.m > 0)
-	{
-		*totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
-		*totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
-	}
-	else
-	{
-		*totalrows = 0.0;
-		*totaldeadrows = 0.0;
-	}
-
-	/*
-	 * Emit some interesting relation info
-	 */
-	ereport(elevel,
-			(errmsg("\"%s\": scanned %d of %u pages, "
-					"containing %.0f live rows and %.0f dead rows; "
-					"%d rows in sample, %.0f estimated total rows",
-					RelationGetRelationName(onerel),
-					bs.m, totalblocks,
-					liverows, deadrows,
-					numrows, *totalrows)));
-
-	return numrows;
-}
-
-/*
- * Comparator for sorting rows[] array
- */
-static int
-compare_rows(const void *a, const void *b, void *arg)
-{
-	HeapTuple	ha = *(const HeapTuple *) a;
-	HeapTuple	hb = *(const HeapTuple *) b;
-	BlockNumber ba = ItemPointerGetBlockNumber(&ha->t_self);
-	OffsetNumber oa = ItemPointerGetOffsetNumber(&ha->t_self);
-	BlockNumber bb = ItemPointerGetBlockNumber(&hb->t_self);
-	OffsetNumber ob = ItemPointerGetOffsetNumber(&hb->t_self);
-
-	if (ba < bb)
-		return -1;
-	if (ba > bb)
-		return 1;
-	if (oa < ob)
-		return -1;
-	if (oa > ob)
-		return 1;
-	return 0;
-}
-
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1462,9 +1186,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fc0e7027157..8ed4e7295ad 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -658,41 +659,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -713,6 +679,12 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/* See table_relation_analyze() */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1008,19 +980,6 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
-/*
- * table_beginscan_analyze is an alternative entry point for setting up a
- * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
- * the same data structure although the behavior is rather different.
- */
-static inline TableScanDesc
-table_beginscan_analyze(Relation rel)
-{
-	uint32		flags = SO_TYPE_ANALYZE;
-
-	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
-}
-
 /*
  * End relation scan.
  */
@@ -1746,42 +1705,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1887,6 +1810,21 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * table_relation_analyze - fill the information needed for a sampling
+ *							statistics acquisition
+ *
+ * The pointer to a function that will collect sample rows from the table
+ * should be stored to `*func`, and the estimated size of the table in pages
+ * should be stored to `*totalpages`.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1182a967427..90af4647aa9 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -175,6 +175,21 @@ typedef struct VacAttrStats
 	int			rowstride;
 } VacAttrStats;
 
+/*
+ * AcquireSampleRowsFunc - a function to collect sample rows for statistics.
+ *
+ * A random sample of up to `targrows` rows should be collected from the
+ * table and stored into the caller-provided `rows` array.  The actual number
+ * of rows collected must be returned.  In addition, store estimates of the
+ * total numbers of live and dead rows in the table into the output parameters
+ * `*totalrows` and `*totaldeadrows`.  (Set `*totaldeadrows` to zero if the
+ * storage does not have any concept of dead rows.)
+ */
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 /* flag bits for VacuumParams->options */
 #define VACOPT_VACUUM 0x01		/* do VACUUM */
 #define VACOPT_ANALYZE 0x02		/* do ANALYZE */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index fcde3876b28..0968e0a01ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.3 (Apple Git-145)
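
For illustration, here is a rough sketch of how a non-heap table AM could plug
into the new relation_analyze() callback.  The myam_* names are hypothetical,
and the sampler body is a stub where a real AM would collect its sample.

/*
 * Hypothetical sketch of the relation_analyze() callback for an AM "myam".
 */
#include "postgres.h"

#include "access/tableam.h"
#include "commands/vacuum.h"
#include "storage/bufmgr.h"

/*
 * Sampler handed back to ANALYZE.  A real implementation would walk the AM's
 * own storage and fill rows[] with up to targrows tuples; this stub just
 * reports an empty sample.
 */
static int
myam_acquire_sample_rows(Relation onerel, int elevel,
                         HeapTuple *rows, int targrows,
                         double *totalrows, double *totaldeadrows)
{
    *totalrows = 0.0;
    *totaldeadrows = 0.0;
    return 0;
}

/*
 * relation_analyze callback: hand ANALYZE the sampler and a size estimate.
 * The sampling loop itself now lives entirely inside the table AM.
 */
static void
myam_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
                      BlockNumber *totalpages, BufferAccessStrategy bstrategy)
{
    *func = myam_acquire_sample_rows;
    *totalpages = RelationGetNumberOfBlocks(relation);
    /* bstrategy could be stashed here if the sampler needs it. */
}

ANALYZE then calls the returned sampler the same way it calls an FDW's
AcquireSampleRowsFunc.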

Attachment: 0002-Custom-reloptions-for-table-AM-v5.patch (application/octet-stream)
From dd263df5bd0ddab823829fdbb52e7c76027fcea1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH 2/8] Custom reloptions for table AM

Let table AM define custom reloptions for its tables.
---
 src/backend/access/common/reloptions.c   |  6 ++-
 src/backend/access/heap/heapam_handler.c | 12 ++++++
 src/backend/access/table/tableamapi.c    | 25 +++++++++++++
 src/backend/commands/tablecmds.c         | 47 ++++++++++++++----------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       |  6 ++-
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 43 ++++++++++++++++++++++
 8 files changed, 122 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..963995388bb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1377,7 +1378,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1400,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 30ac13b9751..a586850ea2b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -25,6 +25,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2436,6 +2437,16 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	return heap_reloptions(relkind, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2941,6 +2952,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..d9e23ef3175 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,29 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+/*
+ * GetTableAmRoutineByAmOid
+ *		Given the table access method oid get its TableAmRoutine struct, which
+ *		will be palloc'd in the caller's memory context.
+ */
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a02c5b05b6..6fc815666bf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -715,6 +715,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -850,6 +851,24 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * Select access method to use: an explicitly indicated one, or (in the
+	 * case of a partitioned table) the parent's, if it has one.
+	 */
+	if (stmt->accessMethod != NULL)
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	else if (stmt->partbound)
+	{
+		Assert(list_length(inheritOids) == 1);
+		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+	}
+	else
+		accessMethodId = InvalidOid;
+
+	/* still nothing? use the default */
+	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+		accessMethodId = get_table_am_oid(default_table_access_method, false);
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -858,6 +877,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -866,6 +891,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -957,24 +983,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * Select access method to use: an explicitly indicated one, or (in the
-	 * case of a partitioned table) the parent's, if it has one.
-	 */
-	if (stmt->accessMethod != NULL)
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	else if (stmt->partbound)
-	{
-		Assert(list_length(inheritOids) == 1);
-		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
-	}
-	else
-		accessMethodId = InvalidOid;
-
-	/* still nothing? use the default */
-	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-		accessMethodId = get_table_am_oid(default_table_access_method, false);
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15520,7 +15528,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 71e8a6f2584..d1d76016ab4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2661,7 +2661,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 1f419c2a6dd..039c0d3eef4 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -464,6 +465,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -478,6 +480,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -493,7 +496,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270a..8ddc75df287 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8ed4e7295ad..c882d7b8ad1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -737,6 +737,28 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * This callback parses and validates the reloptions array for a table.
+	 *
+	 * This is called only when a non-null reloptions array exists for the
+	 * table.  'reloptions' is a text array containing entries of the form
+	 * "name=value".  The function should construct a bytea value, which will
+	 * be copied into the rd_options field of the table's relcache entry. The
+	 * data contents of the bytea value are open for the access method to
+	 * define.
+	 *
+	 * When 'validate' is true, the function should report a suitable error
+	 * message if any of the options are unrecognized or have invalid values;
+	 * when 'validate' is false, invalid entries should be silently ignored.
+	 * ('validate' is false when loading options already stored in pg_catalog;
+	 * an invalid entry could only be found if the access method has changed
+	 * its rules for options, and in that case ignoring obsolete entries is
+	 * appropriate.)
+	 *
+	 * It is OK to return NULL if default behavior is wanted.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1925,6 +1947,26 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2102,6 +2144,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
-- 
2.39.3 (Apple Git-145)
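
For illustration, here is a rough sketch of how a table AM could implement the
new reloptions callback on top of the existing custom-reloption machinery.  The
myam_* names and the compress_level option are made up for the example.

#include "postgres.h"

#include "access/reloptions.h"

/* Hypothetical parsed-options struct for an AM named "myam". */
typedef struct MyAmOptions
{
    int32       vl_len_;            /* varlena header (do not touch directly) */
    int         compress_level;     /* illustrative custom option */
} MyAmOptions;

static relopt_kind myam_relopt_kind;

/* Called once at load time, e.g. from the extension's _PG_init(). */
static void
myam_register_reloptions(void)
{
    myam_relopt_kind = add_reloption_kind();
    add_int_reloption(myam_relopt_kind, "compress_level",
                      "Compression level used by myam",
                      1, 0, 9, AccessExclusiveLock);
}

/* The new table AM 'reloptions' callback: parse and validate the array. */
static bytea *
myam_reloptions(char relkind, Datum reloptions, bool validate)
{
    static const relopt_parse_elt tab[] = {
        {"compress_level", RELOPT_TYPE_INT,
         offsetof(MyAmOptions, compress_level)},
    };

    return (bytea *) build_reloptions(reloptions, validate,
                                      myam_relopt_kind,
                                      sizeof(MyAmOptions),
                                      tab, lengthof(tab));
}

The callback only parses and validates; storing the resulting bytea into the
relcache entry's rd_options is still done by the core code, as for heap.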

Attachment: 0006-Let-table-AM-insertion-methods-control-index-inse-v5.patch (application/octet-stream)
From 37a63f955783a6440a4f8dbf19696d6b7fb6c474 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH 6/8] Let table AM insertion methods control index insertion

A new parameter for the tuple_insert() and multi_insert() methods provides a way
to skip index insertions in the executor.  In this case, the table AM can handle
the insertions itself.
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 23 ++++++++++++++++-------
 12 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9a8c8e33487..04d2e097375 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2088,7 +2088,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2437,6 +2438,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1ebda05797f..8ddb90e7ce1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -255,7 +255,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -271,6 +271,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8d3675be959..805d222cebc 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -273,9 +273,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58a..4d404f22f83 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 8908a440e19..b6736369771 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -397,6 +397,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -416,7 +417,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -425,7 +427,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1265,11 +1267,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..afd3dace079 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -578,6 +578,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -594,7 +595,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6d09b755564..9ec13d09846 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -476,6 +476,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -490,7 +491,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eccd1131a5c..6d16a9a402a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6356,8 +6356,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 0cad843fb69..db685473fc0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -509,6 +509,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -523,9 +524,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8e1c8f697c6..a64e37e9af9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1040,13 +1040,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f1122453738..98269861360 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -282,7 +282,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4e9dab67969..dedaf1f758e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -514,7 +514,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -529,7 +530,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1400,6 +1402,10 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * This function sets `*insert_indexes` to true if it expects the caller to
+ * insert the relevant index tuples.  If `*insert_indexes` is set to false,
+ * this function takes care of index insertions itself.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot. For instance, the source slot may be VirtualTupleTableSlot, but
  * the result slot may correspond to the table AM. On return the slot's
@@ -1409,10 +1415,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1470,10 +1477,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2168,7 +2176,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.3 (Apple Git-145)
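
For illustration only (not part of the patch): a table AM that maintains its
own indexes, such as an index-organized table, could use the new parameter
along the lines of the sketch below.  my_am_tuple_insert and my_am_store_tuple
are hypothetical.

/* Sketch of a tuple_insert callback for an AM that maintains its own indexes. */
static TupleTableSlot *
my_am_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
				   int options, struct BulkInsertStateData *bistate,
				   bool *insert_indexes)
{
	/* store the tuple using AM-specific storage (assumed helper) */
	my_am_store_tuple(rel, slot, cid, options, bistate);

	/* ... update the AM-managed indexes here ... */

	/*
	 * Tell the executor not to call ExecInsertIndexTuples(): this AM has
	 * already taken care of its indexes.
	 */
	*insert_indexes = false;

	return slot;
}

The heap implementation keeps the current behavior by simply setting
*insert_indexes to true, as heapam_tuple_insert does in the patch.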

0007-Introduce-RowRefType-which-describes-the-table-ro-v5.patchapplication/octet-stream; name=0007-Introduce-RowRefType-which-describes-the-table-ro-v5.patchDownload
From f2051e60633cf81c2e85d4cf6f29f458e568f412 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH 7/8] Introduce RowRefType, which describes the table row
 identifier

Currently, a table row can be identified either by its ctid or by a copy of the
whole row (for foreign tables).  But the row identifier is mixed together with
the lock mode in RowMarkType.  This commit separates the row identifier type
into a separate enum, RowRefType.
---
 contrib/postgres_fdw/postgres_fdw.c    |  2 +-
 doc/src/sgml/fdwhandler.sgml           | 22 ++++++++----
 src/backend/executor/execMain.c        | 35 ++++++++++++--------
 src/backend/optimizer/plan/planner.c   | 33 +++++++++++-------
 src/backend/optimizer/prep/preptlist.c |  4 +--
 src/backend/optimizer/util/inherit.c   | 27 +++++++--------
 src/include/foreign/fdwapi.h           |  3 +-
 src/include/nodes/execnodes.h          |  4 +++
 src/include/nodes/plannodes.h          | 46 ++++++++++++++++----------
 src/include/optimizer/planner.h        |  3 +-
 src/tools/pgindent/typedefs.list       |  1 +
 11 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 142dcfc9957..b0000790292 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -7636,7 +7636,7 @@ make_tuple_from_result_row(PGresult *res,
 	 * If we have a CTID to return, install it in both t_self and t_ctid.
 	 * t_self is the normal place, but if the tuple is converted to a
 	 * composite Datum, t_self will be lost; setting t_ctid allows CTID to be
-	 * preserved during EvalPlanQual re-evaluations (see ROW_MARK_COPY code).
+	 * preserved during EvalPlanQual re-evaluations (see ROW_REF_COPY code).
 	 */
 	if (ctid)
 		tuple->t_self = tuple->t_data->t_ctid = *ctid;
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index b80320504d6..51bc0e1029a 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1160,13 +1160,16 @@ ExecForeignTruncate(List *rels,
 <programlisting>
 RowMarkType
 GetForeignRowMarkType(RangeTblEntry *rte,
-                      LockClauseStrength strength);
+                      LockClauseStrength strength,
+                      RowRefType *refType);
 </programlisting>
 
      Report which row-marking option to use for a foreign table.
-     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table
-     and <literal>strength</literal> describes the lock strength requested by the
-     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any.  The result must be
+     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table;
+     <literal>strength</literal> describes the lock strength requested by the
+     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any;
+     <literal>refType</literal> points to a <literal>RowRefType</literal> value
+     specifying how the row should be referenced.  The result must be
      a member of the <literal>RowMarkType</literal> enum type.
     </para>
 
@@ -1177,9 +1180,16 @@ GetForeignRowMarkType(RangeTblEntry *rte,
      or <command>DELETE</command>.
     </para>
 
+    <para>
+     If the value pointed to by <literal>refType</literal> is not changed,
+     the <literal>ROW_REF_COPY</literal> option is used.
+    </para>
+
     <para>
      If the <function>GetForeignRowMarkType</function> pointer is set to
-     <literal>NULL</literal>, the <literal>ROW_MARK_COPY</literal> option is always used.
+     <literal>NULL</literal>, the <literal>ROW_MARK_REFERENCE</literal> option
+     for the row mark type and the <literal>ROW_REF_COPY</literal> option for the row
+     reference type are always used.
      (This implies that <function>RefetchForeignRow</function> will never be called,
      so it need not be provided either.)
     </para>
@@ -1213,7 +1223,7 @@ RefetchForeignRow(EState *estate,
      defined by <literal>erm-&gt;markType</literal>, which is the value
      previously returned by <function>GetForeignRowMarkType</function>.
      (<literal>ROW_MARK_REFERENCE</literal> means to just re-fetch the tuple
-     without acquiring any lock, and <literal>ROW_MARK_COPY</literal> will
+     without acquiring any lock.  <literal>ROW_MARK_COPY</literal> shouldn't and will
      never be seen by this routine.)
     </para>
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7eb1f7d0209..3b03f03a98d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -875,22 +875,19 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			/* get relation's OID (will produce InvalidOid if subquery) */
 			relid = exec_rt_fetch(rc->rti, estate)->relid;
 
-			/* open relation, if we need to access it for this mark type */
-			switch (rc->markType)
+			/*
+			 * Open relation, if we need to access it for this reference type.
+			 */
+			switch (rc->refType)
 			{
-				case ROW_MARK_EXCLUSIVE:
-				case ROW_MARK_NOKEYEXCLUSIVE:
-				case ROW_MARK_SHARE:
-				case ROW_MARK_KEYSHARE:
-				case ROW_MARK_REFERENCE:
+				case ROW_REF_TID:
 					relation = ExecGetRangeTableRelation(estate, rc->rti);
 					break;
-				case ROW_MARK_COPY:
-					/* no physical table access is required */
+				case ROW_REF_COPY:
 					relation = NULL;
 					break;
 				default:
-					elog(ERROR, "unrecognized markType: %d", rc->markType);
+					elog(ERROR, "unrecognized refType: %d", rc->refType);
 					relation = NULL;	/* keep compiler quiet */
 					break;
 			}
@@ -906,6 +903,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			erm->refType = rc->refType;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -2402,10 +2400,14 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
 
 	aerm->rowmark = erm;
 
-	/* Look up the resjunk columns associated with this rowmark */
-	if (erm->markType != ROW_MARK_COPY)
+	/*
+	 * Look up the resjunk columns associated with this rowmark's reference
+	 * type.
+	 */
+	if (erm->refType != ROW_REF_COPY)
 	{
 		/* need ctid for all methods other than COPY */
+		Assert(erm->refType == ROW_REF_TID);
 		snprintf(resname, sizeof(resname), "ctid%u", erm->rowmarkId);
 		aerm->ctidAttNo = ExecFindJunkAttributeInTlist(targetlist,
 													   resname);
@@ -2656,7 +2658,12 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		}
 	}
 
-	if (erm->markType == ROW_MARK_REFERENCE)
+	/*
+	 * For a non-locked relation, the row mark type should be
+	 * ROW_MARK_REFERENCE.  Fetch the tuple according to the reference type.
+	 */
+	Assert(erm->markType == ROW_MARK_REFERENCE);
+	if (erm->refType == ROW_REF_TID)
 	{
 		Assert(erm->relation != NULL);
 
@@ -2709,7 +2716,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 	}
 	else
 	{
-		Assert(erm->markType == ROW_MARK_COPY);
+		Assert(erm->refType == ROW_REF_COPY);
 
 		/* fetch the whole-row Var for the relation */
 		datum = ExecGetJunkAttribute(epqstate->origslot,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 38d070fa004..4b9c9deee84 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2309,7 +2309,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		 * Ignore RowMarkClauses for subqueries; they aren't real tables and
 		 * can't support true locking.  Subqueries that got flattened into the
 		 * main query should be ignored completely.  Any that didn't will get
-		 * ROW_MARK_COPY items in the next loop.
+		 * ROW_REF_COPY items in the next loop.
 		 */
 		if (rte->rtekind != RTE_RELATION)
 			continue;
@@ -2319,8 +2319,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2344,8 +2344,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2357,29 +2357,38 @@ preprocess_rowmarks(PlannerInfo *root)
 }
 
 /*
- * Select RowMarkType to use for a given table
+ * Select RowMarkType and RowRefType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
-		/* If it's not a table at all, use ROW_MARK_COPY */
-		return ROW_MARK_COPY;
+		/*
+		 * If it's not a table at all, use ROW_MARK_REFERENCE and
+		 * ROW_REF_COPY.
+		 */
+		*refType = ROW_REF_COPY;
+		return ROW_MARK_REFERENCE;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
 
+		/* Set the row reference type to ROW_REF_COPY by default */
+		*refType = ROW_REF_COPY;
+
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength);
-		/* Otherwise, use ROW_MARK_COPY by default */
-		return ROW_MARK_COPY;
+			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+		/* Otherwise, use ROW_MARK_REFERENCE by default */
+		return ROW_MARK_REFERENCE;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = ROW_REF_TID;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 7698bfa1a58..4599b0dc761 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 5c7acf8a901..b4b076d1cb1 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -91,7 +91,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +131,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +239,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +266,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +441,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -580,8 +580,9 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&childrc->refType);
+		childrc->allRefTypes = (1 << childrc->refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +593,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 0968e0a01ec..868e04e813e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -129,7 +129,8 @@ typedef TupleTableSlot *(*IterateDirectModify_function) (ForeignScanState *node)
 typedef void (*EndDirectModify_function) (ForeignScanState *node);
 
 typedef RowMarkType (*GetForeignRowMarkType_function) (RangeTblEntry *rte,
-													   LockClauseStrength strength);
+													   LockClauseStrength strength,
+													   RowRefType *refType);
 
 typedef void (*RefetchForeignRow_function) (EState *estate,
 											ExecRowMark *erm,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1774c56ae31..a1ccf6e6811 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -455,6 +455,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row identifier for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -750,6 +753,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row identifier for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7f3db5105db..d7f9c389dac 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1311,16 +1311,8 @@ typedef struct Limit
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we have to uniquely
  * identify all the source rows, not only those from the target relations, so
- * that we can perform EvalPlanQual rechecking at need.  For plain tables we
- * can just fetch the TID, much as for a target relation; this case is
- * represented by ROW_MARK_REFERENCE.  Otherwise (for example for VALUES or
- * FUNCTION scans) we have to copy the whole row value.  ROW_MARK_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_MARK_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_MARK_REFERENCE instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
+ * that we can perform EvalPlanQual rechecking at need.  ROW_MARK_REFERENCE
+ * represents this case.
  */
 typedef enum RowMarkType
 {
@@ -1329,9 +1321,29 @@ typedef enum RowMarkType
 	ROW_MARK_SHARE,				/* obtain shared tuple lock */
 	ROW_MARK_KEYSHARE,			/* obtain keyshare tuple lock */
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
-	ROW_MARK_COPY,				/* physically copy the row value */
 } RowMarkType;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * In the future we may allow more types of row identifiers.
+ */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY,				/* Full row copy */
+} RowRefType;
+
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
@@ -1340,8 +1352,7 @@ typedef enum RowMarkType
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we create a separate
  * PlanRowMark node for each non-target relation in the query.  Relations that
- * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE (if
- * regular tables or supported foreign tables) or ROW_MARK_COPY (if not).
+ * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE.
  *
  * Initially all PlanRowMarks have rti == prti and isParent == false.
  * When the planner discovers that a relation is the root of an inheritance
@@ -1351,16 +1362,16 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
  * information sufficient to identify the locked or fetched rows.  When
- * markType != ROW_MARK_COPY, these columns are named
+ * refType != ROW_REF_COPY, these columns are named
  *		tableoid%u			OID of table
  *		ctid%u				TID of row
  * The tableoid column is only present for an inheritance hierarchy.
- * When markType == ROW_MARK_COPY, there is instead a single column named
+ * When refType == ROW_REF_COPY, there is instead a single column named
  *		wholerow%u			whole-row value of relation
  * (An inheritance hierarchy could have all three resjunk output columns,
  * if some children use a different markType than others.)
@@ -1381,7 +1392,8 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	RowRefType	refType;		/* see enum above */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index e1d79ffdf3c..98fc796d054 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..6ce0a586bf1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2433,6 +2433,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.3 (Apple Git-145)
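
For illustration only (not part of the patch): an FDW that can refetch rows by
TID, for example one backed by local storage, could use the extended
GetForeignRowMarkType signature along the lines of the sketch below.
my_fdw_get_row_mark_type and my_fdw_has_local_ctids are hypothetical.

/*
 * Sketch of GetForeignRowMarkType with the new RowRefType out parameter.
 * The caller initializes *refType to ROW_REF_COPY, so leaving it untouched
 * keeps the whole-row-copy behavior.
 */
static RowMarkType
my_fdw_get_row_mark_type(RangeTblEntry *rte, LockClauseStrength strength,
						 RowRefType *refType)
{
	if (my_fdw_has_local_ctids(rte->relid))	/* assumed helper */
		*refType = ROW_REF_TID;				/* reference rows by TID */

	return ROW_MARK_REFERENCE;
}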

0008-Introduce-RowID-bytea-tuple-identifier-v5.patchapplication/octet-stream; name=0008-Introduce-RowID-bytea-tuple-identifier-v5.patchDownload
From a341e85698df4d8f6d4d9bd9206346e0592b9a00 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 21:00:37 +0200
Subject: [PATCH 8/8] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: the tuple identifier (tid)
and a whole-row copy.  The tuple identifier used for regular tables consists of
a 32-bit block number and a 16-bit offset.  That seems limiting for some use
cases, in particular index-organized tables.  The whole-row copy is used to
identify tuples in FDWs.  It could be extended to regular tables, but that
seems like overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose how its tuples are identified by providing the new get_row_ref_type()
API function.  A new system attribute, RowIdAttributeNumber, holds the RowID
when appropriate.  Table AM methods now accept Datum arguments as tuple
identifiers.  Those Datums can be either tid or bytea, depending on what
table_get_row_ref_type() says.  The ModifyTable node and triggers are aware of
RowIDs.  IndexScan and BitmapScan nodes are not aware of RowIDs and expect
tids; table AMs that use RowIDs are expected to redefine those nodes using
hooks.
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |   9 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 145 ++++++++-----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/plan/planner.c     |  11 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 ++
 src/backend/parser/parse_relation.c      |  13 ++
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  58 ++++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/parsenodes.h           |   2 +
 src/include/nodes/plannodes.h            |  21 --
 src/include/nodes/primnodes.h            |  22 ++
 src/include/utils/tuplestore.h           |   3 +
 26 files changed, 548 insertions(+), 175 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f71f1854e0a..7bfa2a2fc44 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -984,7 +984,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 5c89fbbef83..7b52c66939c 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -755,6 +755,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8ddb90e7ce1..6a1bd3ae476 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -50,7 +50,7 @@
 #include "utils/sampling.h"
 #include "utils/spccache.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -194,7 +194,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -203,7 +203,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -368,7 +368,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -415,7 +415,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -595,12 +595,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -623,7 +624,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -640,7 +641,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -648,6 +649,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -695,7 +697,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -711,7 +713,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -719,6 +721,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2657,6 +2660,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use TID row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -3235,6 +3247,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222cebc..caa79c6eddd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -300,7 +300,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -356,7 +356,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74a..8765becf986 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1626,7 +1626,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 3309b4ebd2d..b2248bdfd87 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -76,7 +76,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2682,7 +2682,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2696,7 +2696,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2924,7 +2924,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2944,7 +2944,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3261,7 +3261,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3286,7 +3286,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3382,8 +3384,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3589,18 +3591,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* total event size,
+																		 * in bytes */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3652,6 +3660,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -4004,14 +4015,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4044,7 +4075,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4063,6 +4094,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4103,7 +4135,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4263,6 +4314,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4294,15 +4346,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4334,13 +4388,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4376,13 +4443,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4556,7 +4633,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4659,6 +4736,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6051,6 +6129,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6144,6 +6224,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6154,6 +6240,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6173,6 +6267,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6188,10 +6290,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6229,20 +6378,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6387,6 +6522,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6408,7 +6557,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24a3990a30a..c8ce4d45ff4 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4888,7 +4888,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3b03f03a98d..514d9b28c48 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -867,13 +867,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/*
 			 * Open relation, if we need to access it for this reference type.
@@ -903,7 +905,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
-			erm->refType = rc->refType;
+			erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1267,6 +1269,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2708,7 +2711,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index db685473fc0..aad266a19ff 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -250,7 +250,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -434,7 +435,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -571,7 +573,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -638,7 +641,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 41754ddfea9..2d3ad904a64 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a64e37e9af9..90eeb99b2cd 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -124,7 +124,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldslot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -141,13 +141,13 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple oldtuple,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static TupleTableSlot *ExecMergeMatched(ModifyTableContext *context,
 										ResultRelInfo *resultRelInfo,
-										ItemPointer tupleid,
+										Datum tupleid,
 										HeapTuple oldtuple,
 										bool canSetTag,
 										bool *matched);
@@ -1221,7 +1221,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1252,7 +1252,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1280,7 +1280,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1361,7 +1361,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldslot,
 		   bool processReturning,
@@ -1558,7 +1558,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldslot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1575,7 +1575,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1624,7 +1624,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1783,7 +1783,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1860,7 +1860,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -2014,7 +2014,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldslot)
 {
@@ -2064,7 +2064,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2154,7 +2154,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldslot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2208,15 +2208,19 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
 		/*
 		 * Specify that we need to lock and fetch the last tuple version for
 		 * EPQ on appropriate transaction isolation levels if the tuple isn't
 		 * locked already.
 		 */
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2326,7 +2330,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldslot);
 
 	/* Process RETURNING if present */
@@ -2358,7 +2362,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2414,7 +2430,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2433,7 +2449,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag)
+		  Datum tupleid, HeapTuple oldtuple, bool canSetTag)
 {
 	TupleTableSlot *rslot = NULL;
 	bool		matched;
@@ -2482,7 +2498,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL || oldtuple != NULL;
+	matched = DatumGetPointer(tupleid) != NULL || oldtuple != NULL;
 	if (matched)
 		rslot = ExecMergeMatched(context, resultRelInfo, tupleid, oldtuple,
 								 canSetTag, &matched);
@@ -2523,7 +2539,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TupleTableSlot *
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag,
+				 Datum tupleid, HeapTuple oldtuple, bool canSetTag,
 				 bool *matched)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -2559,7 +2575,7 @@ ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * the tupleid of the target row, or an old tuple from the target wholerow
 	 * junk attr.
 	 */
-	Assert(tupleid != NULL || oldtuple != NULL);
+	Assert(DatumGetPointer(tupleid) != NULL || oldtuple != NULL);
 	if (oldtuple != NULL)
 		ExecForceStoreHeapTuple(oldtuple, resultRelInfo->ri_oldTupleSlot,
 								false);
@@ -2573,7 +2589,7 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-	if (tupleid != NULL)
+	if (DatumGetPointer(tupleid) != NULL)
 	{
 		if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 										   tupleid,
@@ -2682,7 +2698,7 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2718,7 +2734,7 @@ lmerge_matched:
 
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2842,9 +2858,13 @@ lmerge_matched:
 								return NULL;
 							}
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 							{
 								*matched = false;
@@ -2871,11 +2891,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3413,10 +3429,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3465,6 +3481,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3515,7 +3533,7 @@ ExecModifyTable(PlanState *pstate)
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 					slot = ExecMerge(&context, node->resultRelInfo,
-									 NULL, NULL, node->canSetTag);
+									 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 					/*
 					 * If we got a RETURNING result, return it to the caller.
@@ -3559,7 +3577,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3602,7 +3621,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3617,9 +3636,25 @@ ExecModifyTable(PlanState *pstate)
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3659,7 +3694,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3723,6 +3758,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3757,6 +3793,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -4000,10 +4039,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4313,6 +4362,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 864a9013b62..f4a124ac4eb 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -377,7 +377,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4b9c9deee84..ee648bedd4a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2376,19 +2376,24 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
+		RowMarkType result = ROW_MARK_REFERENCE;
 
 		/* Set row reference type as ROW_REF_COPY by default */
 		*refType = ROW_REF_COPY;
 
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+			result = fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+
+		/* XXX: should we fill this before? */
+		rte->reftype = *refType;
+
 		/* Otherwise, use ROW_MARK_REFERENCE by default */
-		return ROW_MARK_REFERENCE;
+		return result;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
-		*refType = ROW_REF_TID;
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 4599b0dc761..3620be5b52c 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224a..83c08bbd0e1 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -895,17 +896,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType = ROW_REF_TID;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index b4b076d1cb1..4a5a167d833 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -282,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -485,6 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 427b7325db8..2c80e010f2a 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1503,6 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1588,6 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1763,6 +1767,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2081,6 +2086,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2156,6 +2162,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2252,6 +2259,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2332,6 +2340,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2494,6 +2503,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3257,6 +3267,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 9fd05b15e73..7a0fdbe3f40 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1854,6 +1854,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 947a868e569..d3a41533552 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces storing the tuple into the
+ * slot.  Thus, it can work with slot types other than the minimal tuple slot.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index e88dec71ee9..867b5eb489e 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index dedaf1f758e..5be4c53af5e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -476,7 +476,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -535,7 +535,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -546,7 +546,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -559,7 +559,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -702,6 +702,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * After the call all memory related to rd_amcache must be freed,
@@ -1284,9 +1289,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1294,7 +1299,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1306,7 +1311,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1492,7 +1498,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1521,12 +1527,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1540,7 +1546,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1577,13 +1583,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1595,7 +1601,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1624,12 +1630,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1915,6 +1921,22 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  Returns ROW_REF_TID when table AM routine
+ * is not accessible.  This happens during catalog initialization.  All catalog
+ * tables are known to use heap.
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+		return ROW_REF_COPY;
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index cb968d03ecd..c16e6b6e5a0 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b89baef95d3..04d8cef6c68 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1089,6 +1089,8 @@ typedef struct RangeTblEntry
 	Index		perminfoindex pg_node_attr(query_jumble_ignore);
 	/* sampling info, or NULL */
 	struct TableSampleClause *tablesample;
+	/* row identifier for relation */
+	RowRefType	reftype;
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d7f9c389dac..d850411aa95 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1323,27 +1323,6 @@ typedef enum RowMarkType
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
 } RowMarkType;
 
-/*
- * RowRefType -
- *	  enums for types of row identifiers
- *
- * For plain tables we can just fetch the TID, much as for a target relation;
- * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
- * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_REF_TID instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
- * In future we may allow more types of row identifiers.
- */
-typedef enum RowRefType
-{
-	ROW_REF_TID,				/* Item pointer (block, offset) */
-	ROW_REF_COPY				/* Full row copy */
-} RowRefType;
-
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 376f67e6a5f..84cf7837de1 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2211,4 +2211,26 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * In future we may allow more types of row identifiers.
+ */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 419613c17ba..cf291a0d17a 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -70,6 +70,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.3 (Apple Git-145)

#24Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#23)
Re: Table AM Interface Enhancements

Hi, Alexander!

The revised rest of the patchset is attached.

0001 (was 0006) – I prefer the definition of AcquireSampleRowsFunc to
stay in vacuum.h. If we move it to sampling.h then we would have to
add there includes to define Relation, HeapTuple etc. I'd like to
avoid this kind of change. Also, I've deleted
table_beginscan_analyze(), because it's only called from
tableam-specific AcquireSampleRowsFunc. Also I put some comments to
heapam_scan_analyze_next_block() and heapam_scan_analyze_next_tuple()
given that there are now no relevant comments for them in tableam.h.
I've removed some redundancies from acquire_sample_rows(). And added
comments to AcquireSampleRowsFunc based on what we have in FDW docs
for this function. Did some small edits as well. As you suggested,
turned back declarations for acquire_sample_rows() and compare_rows().

In my earlier comment in the thread I was not asking to bring back the old name
acquire_sample_rows(); I only meant that the declarations and the order of the
functions should form one code block. To me heapam_acquire_sample_rows() looks
like a better name for the heap implementation of *AcquireSampleRowsFunc(). I
suggest restoring the name heapam_acquire_sample_rows() from v4. Sorry for the
confusion.

Changing the return type of the static function that always returned true for
heap looks good to me:
static void heapam_scan_analyze_next_block

The same goes for removing the comparison against the always-true "block_accepted"
in (heapam_)acquire_sample_rows()

Removing table_beginscan_analyze() and calling scan_begin() directly is not in
the same style as the other table_beginscan_* functions. Though this is not a
change in functionality, I'd leave this part as it was in v4. Also, v5
introduces a comment that still refers to it:

src/backend/access/heap/heapam_handler.c: * with table_beginscan_analyze()

For comments I'd propose:
%s/In addition, store estimates/In addition, a function should store
estimates/g
%s/zerp/zero/g

0002 (was 0007) – I've turned the redundant "if", which you've pointed
out, into an assert. Also, added some comments, most notably comment
for TableAmRoutine.reloptions based on the indexam docs.

%s/validate sthe/validates the/g

This seems not needed, it's already inited to InvalidOid before.
+else
+accessMethod = default_table_access_method;

+ accessMethodId = InvalidOid;

This code came from 374c7a22904. I don't insist on this simplification in
patch 0002.

Overall both patches look good to me.

Regards,
Pavel Borisov.

#25Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#24)
Re: Table AM Interface Enhancements
This seems not needed, it's already inited to InvalidOid before.
+else
+accessMethod = default_table_access_method;

+ accessMethodId = InvalidOid;

This code came from 374c7a22904. I don't insist on this simplification in
patch 0002.

A correction of the code quote from the previous message:

+else
+ accessMethodId = InvalidOid;

#26Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#24)
8 attachment(s)
Re: Table AM Interface Enhancements

Hi, Pavel!

Thank you for your feedback. The revised patch set is attached.

I found that vacuum.c has a lot of heap-specific code, so it should be OK for
analyze.c to keep heap-specific code as well.  Therefore, I rolled back the
movement of functions between files, which makes the patch smaller and easier
to review.

On Wed, Mar 27, 2024 at 2:52 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

The revised rest of the patchset is attached.
0001 (was 0006) – I prefer the definition of AcquireSampleRowsFunc to
stay in vacuum.h. If we move it to sampling.h then we would have to
add there includes to define Relation, HeapTuple etc. I'd like to
avoid this kind of change. Also, I've deleted
table_beginscan_analyze(), because it's only called from
tableam-specific AcquireSampleRowsFunc. Also I put some comments to
heapam_scan_analyze_next_block() and heapam_scan_analyze_next_tuple()
given that there are now no relevant comments for them in tableam.h.
I've removed some redundancies from acquire_sample_rows(). And added
comments to AcquireSampleRowsFunc based on what we have in FDW docs
for this function. Did some small edits as well. As you suggested,
turned back declarations for acquire_sample_rows() and compare_rows().

In my earlier comment in the thread I was not asking to bring back the old name acquire_sample_rows(); I only meant that the declarations and the order of the functions should form one code block. To me heapam_acquire_sample_rows() looks like a better name for the heap implementation of *AcquireSampleRowsFunc(). I suggest restoring the name heapam_acquire_sample_rows() from v4. Sorry for the confusion.

I found that the function name acquire_sample_rows is referenced in quite a
few places in the source code.  So, I would prefer to keep the old name in
order to keep the changes minimal.

Changing the return type of the static function that always returned true for heap looks good to me:
static void heapam_scan_analyze_next_block

The same goes for removing the comparison against the always-true "block_accepted" in (heapam_)acquire_sample_rows()

Ok!

Removing table_beginscan_analyze() and calling scan_begin() directly is not in the same style as the other table_beginscan_* functions. Though this is not a change in functionality, I'd leave this part as it was in v4.

With the patch, this method has no usages outside of the table AM.  I don't
think we should keep a method that has no clear external usage pattern.  But I
agree that starting a scan with heap_beginscan() and ending it with
table_endscan() is not correct.  Now this scan ends with heap_endscan().
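
To illustrate what I mean (this is only a rough sketch, not the actual patch
hunk; the variable names and the surrounding function body are assumptions),
the sampling code now keeps the whole begin/end pair at the heap level:

TableScanDesc scan;

/* equivalent of what table_beginscan_analyze() used to do */
scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);

/* ... pick sample blocks and tuples into rows[] ... */

heap_endscan(scan);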

Also, v5 introduces a comment that still refers to it:

src/backend/access/heap/heapam_handler.c: * with table_beginscan_analyze()

Corrected.

For comments I'd propose:
%s/In addition, store estimates/In addition, a function should store estimates/g
%s/zerp/zero/g

Fixed.

0002 (was 0007) – I've turned the redundant "if", which you've pointed
out, into an assert. Also, added some comments, most notably comment
for TableAmRoutine.reloptions based on the indexam docs.

%s/validate sthe/validates the/g

Fixed.

This seems not needed, it's already inited to InvalidOid before.
+else
+accessMethod = default_table_access_method;
+       accessMethodId = InvalidOid;

This code came from 374c7a22904. I don't insist on this simplification in patch 0002.

This is a minor redundancy, but I prefer to keep it; it makes it obvious that
the patch just moved this code.
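
As a side note for table AM authors following this thread: with 0002 an AM can
plug in its own reloptions parsing.  A minimal sketch (not part of the patch;
everything named myam_* is hypothetical) could look like this, mirroring
heapam_reloptions() from the attached patch:

static bytea *
myam_reloptions(char relkind, Datum reloptions, bool validate)
{
	Assert(relkind == RELKIND_RELATION ||
		   relkind == RELKIND_TOASTVALUE ||
		   relkind == RELKIND_MATVIEW);

	/* parse and validate the options with the AM's own options table */
	return myam_parse_reloptions(reloptions, validate);
}

static const TableAmRoutine myam_methods = {
	.type = T_TableAmRoutine,
	/* ... all the other required callbacks ... */
	.reloptions = myam_reloptions,
};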

------
Regards,
Alexander Korotkov

Attachments:

0002-Custom-reloptions-for-table-AM-v6.patchapplication/octet-stream; name=0002-Custom-reloptions-for-table-AM-v6.patchDownload
From f9bb96f0fb59d5b8c20094bd4a79754170a778b5 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH 2/8] Custom reloptions for table AM

Let table AM define custom reloptions for its tables.
---
 src/backend/access/common/reloptions.c   |  6 ++-
 src/backend/access/heap/heapam_handler.c | 12 ++++++
 src/backend/access/table/tableamapi.c    | 25 +++++++++++++
 src/backend/commands/tablecmds.c         | 47 ++++++++++++++----------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       |  6 ++-
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 43 ++++++++++++++++++++++
 8 files changed, 122 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..963995388bb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1377,7 +1378,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1400,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ba3e4f4f838..05ce127cf06 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -25,6 +25,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2159,6 +2160,16 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	return heap_reloptions(relkind, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2664,6 +2675,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..d9e23ef3175 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,29 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+/*
+ * GetTableAmRoutineByAmOid
+ *		Given the table access method oid get its TableAmRoutine struct, which
+ *		will be palloc'd in the caller's memory context.
+ */
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a02c5b05b6..6fc815666bf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -715,6 +715,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -850,6 +851,24 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * Select access method to use: an explicitly indicated one, or (in the
+	 * case of a partitioned table) the parent's, if it has one.
+	 */
+	if (stmt->accessMethod != NULL)
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	else if (stmt->partbound)
+	{
+		Assert(list_length(inheritOids) == 1);
+		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+	}
+	else
+		accessMethodId = InvalidOid;
+
+	/* still nothing? use the default */
+	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+		accessMethodId = get_table_am_oid(default_table_access_method, false);
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -858,6 +877,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -866,6 +891,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -957,24 +983,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * Select access method to use: an explicitly indicated one, or (in the
-	 * case of a partitioned table) the parent's, if it has one.
-	 */
-	if (stmt->accessMethod != NULL)
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	else if (stmt->partbound)
-	{
-		Assert(list_length(inheritOids) == 1);
-		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
-	}
-	else
-		accessMethodId = InvalidOid;
-
-	/* still nothing? use the default */
-	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-		accessMethodId = get_table_am_oid(default_table_access_method, false);
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15520,7 +15528,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 71e8a6f2584..d1d76016ab4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2661,7 +2661,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 1f419c2a6dd..039c0d3eef4 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -464,6 +465,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -478,6 +480,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -493,7 +496,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270a..8ddc75df287 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8ed4e7295ad..cf68ec48ebf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -737,6 +737,28 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * This callback parses and validates the reloptions array for a table.
+	 *
+	 * This is called only when a non-null reloptions array exists for the
+	 * table.  'reloptions' is a text array containing entries of the form
+	 * "name=value".  The function should construct a bytea value, which will
+	 * be copied into the rd_options field of the table's relcache entry. The
+	 * data contents of the bytea value are open for the access method to
+	 * define.
+	 *
+	 * When 'validate' is true, the function should report a suitable error
+	 * message if any of the options are unrecognized or have invalid values;
+	 * when 'validate' is false, invalid entries should be silently ignored.
+	 * ('validate' is false when loading options already stored in pg_catalog;
+	 * an invalid entry could only be found if the access method has changed
+	 * its rules for options, and in that case ignoring obsolete entries is
+	 * appropriate.)
+	 *
+	 * It is OK to return NULL if default behavior is wanted.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1925,6 +1947,26 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2102,6 +2144,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
-- 
2.39.3 (Apple Git-145)

0003-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v6.patch (application/octet-stream)
From 867222d594ba461becc363f667ce4816ee4d7763 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH 3/8] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs need to implement INSERT ... ON CONFLICT ... with
speculative tokens.  They could just have a custom implementation of those
tokens using tuple_insert_speculative() and tuple_complete_speculative() API
functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a
single tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function provides clear semantics, allowing different
implementations of the INSERT ... ON CONFLICT ... functionality.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 05ce127cf06..b89caa16e7f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -308,6 +308,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return false;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return false;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2648,8 +2926,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index d9e23ef3175..c38ab936cde 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -70,8 +70,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d1917f2fea7..8e1c8f697c6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -129,7 +129,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -265,66 +264,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1015,12 +954,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1037,23 +983,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1076,57 +1027,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2441,144 +2348,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index cf68ec48ebf..c4cdae5903c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -514,19 +515,16 @@ typedef struct TableAmRoutine
 									 CommandId cid, int options,
 									 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1400,36 +1398,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into a table, taking arbiter indexes into account.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it takes into account
+ * `arbiterIndexes`, which comprises the list of oids of arbiter indexes.
+ *
+ * If the tuple doesn't violate uniqueness on any arbiter index, then it should
+ * be inserted and the slot containing the inserted tuple is returned.
+ *
+ * If tuple violates uniqueness on any arbiter index, then this function
+ * returns NULL and doesn't insert the tuple.  Also, if 'lockedSlot' is
+ * provided, then conflicting tuple gets locked in `lockmode` and placed into
+ * `lockedSlot`.
+ *
+ * Executor state `estate` is passed to this method to provide the ability to
+ * calculate index tuples.  Temporary tuple table slot `tempSlot` is passed
+ * for holding the potentially conflicting tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)
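
As a side note for prospective implementors: a hypothetical table AM that
enforces arbiter uniqueness internally (and maintains its own indexes) might
fill in the new callback roughly as sketched below. The myam_* name is
invented, and the retry loop and most of the error handling of the heap
implementation are omitted:

#include "postgres.h"
#include "access/tableam.h"
#include "executor/executor.h"

static TupleTableSlot *
myam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
							   TupleTableSlot *slot,
							   CommandId cid, int options,
							   struct BulkInsertStateData *bistate,
							   List *arbiterIndexes,
							   EState *estate,
							   LockTupleMode lockmode,
							   TupleTableSlot *lockedSlot,
							   TupleTableSlot *tempSlot)
{
	Relation	rel = resultRelInfo->ri_RelationDesc;
	ItemPointerData conflictTid;
	TM_FailureData tmfd;

	/* look for a committed conflicting row among the arbiter indexes */
	if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
								   &conflictTid, arbiterIndexes))
	{
		/* conflict: lock the existing row if the caller asked for it */
		if (lockedSlot)
			(void) table_tuple_lock(rel, &conflictTid, estate->es_snapshot,
									lockedSlot, cid, lockmode,
									LockWaitBlock, 0, &tmfd);
		/* returning NULL tells the executor nothing was inserted */
		return NULL;
	}

	/* no conflict: delegate to the AM's regular tuple_insert path */
	table_tuple_insert(rel, slot, cid, options, bistate);
	return slot;
}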

0005-Notify-table-AM-about-index-creation-v6.patch (application/octet-stream)
From 8a4c9670056c5df67bde845d33af74da95f33dbf Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH 5/8] Notify table AM about index creation

This allows a table AM to do some preparation during index build.  In particular,
the table AM could update its specific meta-information.  That could also be
useful if the table AM overrides index implementations.
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5762a7dbf14..e9f2193e5e2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2953,6 +2953,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e230..bca97981051 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3840,6 +3840,8 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e78598c10e1..2570e7a24a1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -583,6 +583,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -629,6 +630,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that tableId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -656,24 +677,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1218,6 +1221,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 48f078309f7..db0559788a4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -684,6 +684,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1860,6 +1870,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, and it should be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notifies table AM about index creation on `rel` with oid `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.3 (Apple Git-145)
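
For illustration, a hypothetical table AM might use the two new callbacks
along these lines. The myam_* names and the meta-information bookkeeping are
invented; a real AM would do whatever preparation it needs:

#include "postgres.h"
#include "nodes/parsenodes.h"
#include "utils/rel.h"

/* invented AM-local helper, declared here only to keep the sketch complete */
extern void myam_register_index(Relation rel, Oid indoid, bool reindex);

static bool
myam_define_index_validate(Relation rel, IndexStmt *stmt,
						   bool skip_build, void **arg)
{
	/* e.g. refuse index AMs this table AM cannot support */
	if (strcmp(stmt->accessMethod, "hash") == 0)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("myam does not support hash indexes")));

	*arg = NULL;				/* nothing to pass on to define_index() */
	return true;
}

static bool
myam_define_index(Relation rel, Oid indoid, bool reindex,
				  bool skip_constraint_checks, bool skip_build, void *arg)
{
	/* record the new index in AM-specific meta-information */
	myam_register_index(rel, indoid, reindex);
	return true;
}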

0004-Let-table-AM-override-reloptions-for-indexes-buil-v6.patch (application/octet-stream)
From ab26c563af5019dcd6b08781c08c9722796363c1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Mar 2024 00:53:05 +0200
Subject: [PATCH 4/8] Let table AM override reloptions for indexes built on its
 tables

---
 src/backend/access/common/reloptions.c   |  3 ++-
 src/backend/access/heap/heapam_handler.c |  8 ++++++++
 src/backend/commands/indexcmds.c         |  3 ++-
 src/backend/commands/tablecmds.c         |  9 +++++++-
 src/backend/utils/cache/relcache.c       | 24 ++++++++++++++++++++--
 src/include/access/tableam.h             | 26 ++++++++++++++++++++++++
 6 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388bb..00088240cdd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1411,7 +1411,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index b89caa16e7f..5762a7dbf14 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2448,6 +2448,13 @@ heapam_reloptions(char relkind, Datum reloptions, bool validate)
 	return heap_reloptions(relkind, reloptions, validate);
 }
 
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2953,6 +2960,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
 	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487b..e78598c10e1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -899,7 +899,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6fc815666bf..eccd1131a5c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15539,7 +15539,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 039c0d3eef4..4343deb4ee3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -477,15 +477,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
 			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* fetch the relation's relcache entry */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c4cdae5903c..48f078309f7 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -757,6 +758,13 @@ typedef struct TableAmRoutine
 	 */
 	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
 
+	/*
+	 * Parse table AM-specific index options.  Useful for table AM to define
+	 * new index options or override existing index options.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1971,6 +1979,24 @@ tableam_reloptions(const TableAmRoutine *tableam, char relkind,
 	return tableam->reloptions(relkind, reloptions, validate);
 }
 
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
+/*
+ * Parse index options.  Gives table AM a chance to override index-specific
+ * options defined in 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
-- 
2.39.3 (Apple Git-145)
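
For illustration, a hypothetical table AM that only adds its own validation on
top of the index AM's option parsing could define the callback roughly as
follows (the myam_* names are invented):

#include "postgres.h"
#include "access/amapi.h"
#include "access/reloptions.h"

/* invented AM-local helper, declared here only to keep the sketch complete */
extern void myam_check_index_options(char relkind, bytea *options);

static bytea *
myam_indexoptions(amoptions_function amoptions, char relkind,
				  Datum reloptions, bool validate)
{
	/* start from the index AM's own parsing ... */
	bytea	   *result = index_reloptions(amoptions, reloptions, validate);

	/* ... then veto settings this table AM cannot honor */
	if (validate && result != NULL)
		myam_check_index_options(relkind, result);

	return result;
}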

0001-Generalize-relation-analyze-in-table-AM-interface-v6.patch (application/octet-stream)
From a59f6888c74b0de889f42e6c4ae267cf0ffbf934 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 23:41:11 +0200
Subject: [PATCH 1/8] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table written
in acquire_sample_rows().  Custom table AM can just redefine the way to get the
next block/tuple by implementing scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.
---
 src/backend/access/heap/heapam_handler.c |  33 +++++--
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           |  54 ++++++------
 src/include/access/heapam.h              |   9 ++
 src/include/access/tableam.h             | 106 +++++------------------
 src/include/commands/vacuum.h            |  19 ++++
 src/include/foreign/fdwapi.h             |   6 +-
 7 files changed, 104 insertions(+), 125 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..ba3e4f4f838 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -19,6 +19,8 @@
  */
 #include "postgres.h"
 
+#include <math.h>
+
 #include "access/genam.h"
 #include "access/heapam.h"
 #include "access/heaptoast.h"
@@ -44,13 +46,14 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/sampling.h"
+#include "utils/spccache.h"
 
 static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
 								   TM_FailureData *tmfd);
-
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -1052,7 +1055,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-static bool
+/*
+ * Prepare to analyze block `blockno` of `scan`.  The scan has been started
+ * with SO_TYPE_ANALYZE option.
+ *
+ * This routine holds a buffer pin and lock on the heap page.  They are held
+ * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
+ * items of the heap page are analyzed.
+ */
+void
 heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 							   BufferAccessStrategy bstrategy)
 {
@@ -1072,12 +1083,19 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
 										blockno, RBM_NORMAL, bstrategy);
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-
-	/* in heap all blocks can contain tuples, so always return true */
-	return true;
 }
 
-static bool
+/*
+ * Iterate over tuples in the block selected with
+ * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
+ * is found, true is returned and the tuple is stored in `slot`.  When no more
+ * tuples remain for sampling, false is returned and the pin and lock acquired
+ * by heapam_scan_analyze_next_block() are released.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
+bool
 heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
 							   TupleTableSlot *slot)
@@ -2637,10 +2655,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index ce637a5a5d9..55b8caeadf2 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,8 +81,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..2fb39f3ede1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -17,6 +17,7 @@
 #include <math.h>
 
 #include "access/detoast.h"
+#include "access/heapam.h"
 #include "access/genam.h"
 #include "access/multixact.h"
 #include "access/relation.h"
@@ -190,10 +191,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1103,15 +1103,15 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 }
 
 /*
- * acquire_sample_rows -- acquire a random sample of rows from the table
+ * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
  * Selected rows are returned in the caller-allocated array rows[], which
  * must have at least targrows entries.
  * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
+ * We also estimate the total numbers of live and dead rows in the heap,
  * and return them into *totalrows and *totaldeadrows, respectively.
  *
- * The returned list of tuples is in order by physical position in the table.
+ * The returned list of tuples is in order by physical position in the heap.
  * (We will rely on this later to derive correlation estimates.)
  *
  * As of May 2004 we use a new two-stage method:  Stage one selects up
@@ -1133,7 +1133,7 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * look at a statistically unbiased set of blocks, we should get
  * unbiased estimates of the average numbers of live and dead rows per
  * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
+ * density near the start of the heap.
  */
 static int
 acquire_sample_rows(Relation onerel, int elevel,
@@ -1184,7 +1184,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = table_beginscan_analyze(onerel);
+	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
 #ifdef USE_PREFETCH
@@ -1214,7 +1214,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{
-		bool		block_accepted;
 		BlockNumber targblock = BlockSampler_Next(&bs);
 #ifdef USE_PREFETCH
 		BlockNumber prefetch_targblock = InvalidBlockNumber;
@@ -1230,29 +1229,19 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		vacuum_delay_point();
 
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
 
 #ifdef USE_PREFETCH
 
 		/*
 		 * When pre-fetching, after we get a block, tell the kernel about the
 		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
 		 */
 		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
 			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
 #endif
 
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1302,7 +1291,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
+	heap_endscan(scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
@@ -1373,6 +1362,19 @@ compare_rows(const void *a, const void *b, void *arg)
 	return 0;
 }
 
+/*
+ * heapam_analyze -- implementation of relation_analyze() table access method
+ *					 callback for heap
+ */
+void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	vac_strategy = bstrategy;
+}
+
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1462,9 +1464,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f1122453738..91fbc950343 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -369,6 +369,15 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
+/* in heap/heapam_handler.c */
+extern void heapam_scan_analyze_next_block(TableScanDesc scan,
+										   BlockNumber blockno,
+										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
+										   TransactionId OldestXmin,
+										   double *liverows, double *deadrows,
+										   TupleTableSlot *slot);
+
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fc0e7027157..8ed4e7295ad 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -658,41 +659,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -713,6 +679,12 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/* See table_relation_analyze() */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1008,19 +980,6 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
-/*
- * table_beginscan_analyze is an alternative entry point for setting up a
- * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
- * the same data structure although the behavior is rather different.
- */
-static inline TableScanDesc
-table_beginscan_analyze(Relation rel)
-{
-	uint32		flags = SO_TYPE_ANALYZE;
-
-	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
-}
-
 /*
  * End relation scan.
  */
@@ -1746,42 +1705,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1887,6 +1810,21 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * table_relation_analyze - fill the information for sampling statistics
+ *							acquisition
+ *
+ * The pointer to a function that will collect sample rows from the table
+ * should be stored in `*func`, and the estimated size of the table in pages
+ * should be stored in `*totalpages`.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1182a967427..68068dd9003 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -175,6 +175,21 @@ typedef struct VacAttrStats
 	int			rowstride;
 } VacAttrStats;
 
+/*
+ * AcquireSampleRowsFunc - a function for sampling statistics collection.
+ *
+ * A random sample of up to `targrows` rows should be collected from the
+ * table and stored into the caller-provided `rows` array.  The actual number
+ * of rows collected must be returned.  In addition, the function should store
+ * estimates of the total numbers of live and dead rows in the table into the
+ * output parameters `*totalrows` and `*totaldeadrows`.  (Set `*totaldeadrows`
+ * to zero if the storage does not have any concept of dead rows.)
+ */
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 /* flag bits for VacuumParams->options */
 #define VACOPT_VACUUM 0x01		/* do VACUUM */
 #define VACOPT_ANALYZE 0x02		/* do ANALYZE */
@@ -380,6 +395,10 @@ extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
+extern void heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+						   BlockNumber *totalpages,
+						   BufferAccessStrategy bstrategy);
+
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index fcde3876b28..0968e0a01ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.3 (Apple Git-145)

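For illustration, here is a minimal sketch (not part of the patch set) of how an
out-of-core table AM might implement the new relation_analyze() callback.  The
my_* names are hypothetical placeholders; only the callback and
AcquireSampleRowsFunc signatures come from the patch above.

#include "postgres.h"

#include "access/tableam.h"
#include "commands/vacuum.h"
#include "storage/bufmgr.h"

/* Remembered for the sampler, mirroring vac_strategy in analyze.c. */
static BufferAccessStrategy my_bstrategy;

/* Hypothetical sampler with the AcquireSampleRowsFunc signature. */
static int
my_acquire_sample_rows(Relation onerel, int elevel,
					   HeapTuple *rows, int targrows,
					   double *totalrows, double *totaldeadrows)
{
	/*
	 * Walk the AM's own storage (e.g. an index-organized tree), fill rows[]
	 * with up to targrows sampled tuples, and report the estimates.
	 */
	*totalrows = 0;
	*totaldeadrows = 0;
	return 0;
}

/* relation_analyze() callback: hand back the sampler and a size estimate. */
static void
my_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
					BlockNumber *totalpages, BufferAccessStrategy bstrategy)
{
	*func = my_acquire_sample_rows;
	*totalpages = RelationGetNumberOfBlocks(relation);
	my_bstrategy = bstrategy;
}

The callback would then be wired into the AM's TableAmRoutine as
.relation_analyze = my_relation_analyze, much like the patch does for heap with
heapam_analyze().
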
Attachment: 0006-Let-table-AM-insertion-methods-control-index-inse-v6.patch (application/octet-stream)
From 45d1ec6e6df89ff3146a2924ff7618398e77204f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH 6/8] Let table AM insertion methods control index insertion

A new parameter for the tuple_insert() and multi_insert() methods provides a way
to skip index insertions in the executor.  In this case, the table AM handles
index insertions itself.
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 23 ++++++++++++++++-------
 12 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2f6527df0dc..b661d9811eb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2088,7 +2088,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2437,6 +2438,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e9f2193e5e2..f42a8a80128 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -249,7 +249,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -265,6 +265,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8d3675be959..805d222cebc 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -273,9 +273,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58a..4d404f22f83 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 8908a440e19..b6736369771 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -397,6 +397,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -416,7 +417,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -425,7 +427,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1265,11 +1267,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..afd3dace079 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -578,6 +578,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -594,7 +595,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6d09b755564..9ec13d09846 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -476,6 +476,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -490,7 +491,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eccd1131a5c..6d16a9a402a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6356,8 +6356,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 0cad843fb69..db685473fc0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -509,6 +509,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -523,9 +524,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8e1c8f697c6..a64e37e9af9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1040,13 +1040,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 91fbc950343..32a3fbce961 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -282,7 +282,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index db0559788a4..d6a7aace722 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -514,7 +514,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -529,7 +530,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1400,6 +1402,10 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * This function sets `*insert_indexes` to true if it expects the caller to
+ * insert the relevant index tuples.  If `*insert_indexes` is set to false,
+ * then this function takes care of index insertions itself.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot. For instance, the source slot may be VirtualTupleTableSlot, but
  * the result slot may correspond to the table AM. On return the slot's
@@ -1409,10 +1415,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1470,10 +1477,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2168,7 +2176,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.3 (Apple Git-145)

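As a hedged illustration (not part of the patch set), a table AM that maintains
its own index entries might implement the extended tuple_insert() callback
roughly as follows.  The my_store_tuple() and my_update_indexes() helpers are
hypothetical placeholders for the AM's internal routines.

/* Sketch of a tuple_insert() callback that suppresses executor index inserts. */
static TupleTableSlot *
my_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
				int options, struct BulkInsertStateData *bistate,
				bool *insert_indexes)
{
	/* Store the tuple in the AM's native format (placeholder helper). */
	my_store_tuple(rel, slot, cid, options, bistate);

	/* The AM inserts the corresponding index entries itself ... */
	my_update_indexes(rel, slot);

	/* ... so tell the executor not to call ExecInsertIndexTuples(). */
	*insert_indexes = false;

	return slot;
}

Heap keeps the old behaviour by simply setting *insert_indexes to true, as the
patch does in heapam_tuple_insert() and heap_multi_insert().
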
Attachment: 0007-Introduce-RowRefType-which-describes-the-table-ro-v6.patch (application/octet-stream)
From 5fe31cfc153ce2b62e2e672e67191c3c17eee35b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH 7/8] Introduce RowRefType, which describes the table row
 identifier

Currently, a table row can be identified either by ctid or by a copy of the
whole row (for foreign tables).  But the row identifier is mixed together with
the lock mode in RowMarkType.  This commit separates the row identifier type
into a separate enum, RowRefType.
---
 contrib/postgres_fdw/postgres_fdw.c    |  2 +-
 doc/src/sgml/fdwhandler.sgml           | 22 ++++++++----
 src/backend/executor/execMain.c        | 35 ++++++++++++--------
 src/backend/optimizer/plan/planner.c   | 33 +++++++++++-------
 src/backend/optimizer/prep/preptlist.c |  4 +--
 src/backend/optimizer/util/inherit.c   | 27 +++++++--------
 src/include/foreign/fdwapi.h           |  3 +-
 src/include/nodes/execnodes.h          |  4 +++
 src/include/nodes/plannodes.h          | 46 ++++++++++++++++----------
 src/include/optimizer/planner.h        |  3 +-
 src/tools/pgindent/typedefs.list       |  1 +
 11 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 142dcfc9957..b0000790292 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -7636,7 +7636,7 @@ make_tuple_from_result_row(PGresult *res,
 	 * If we have a CTID to return, install it in both t_self and t_ctid.
 	 * t_self is the normal place, but if the tuple is converted to a
 	 * composite Datum, t_self will be lost; setting t_ctid allows CTID to be
-	 * preserved during EvalPlanQual re-evaluations (see ROW_MARK_COPY code).
+	 * preserved during EvalPlanQual re-evaluations (see ROW_REF_COPY code).
 	 */
 	if (ctid)
 		tuple->t_self = tuple->t_data->t_ctid = *ctid;
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index b80320504d6..51bc0e1029a 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1160,13 +1160,16 @@ ExecForeignTruncate(List *rels,
 <programlisting>
 RowMarkType
 GetForeignRowMarkType(RangeTblEntry *rte,
-                      LockClauseStrength strength);
+                      LockClauseStrength strength,
+                      RowRefType *refType);
 </programlisting>
 
      Report which row-marking option to use for a foreign table.
-     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table
-     and <literal>strength</literal> describes the lock strength requested by the
-     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any.  The result must be
+     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table;
+     <literal>strength</literal> describes the lock strength requested by the
+     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any;
+     <literal>refType</literal> points to a <literal>RowRefType</literal> value
+     specifying the way to reference the row.  The result must be
      a member of the <literal>RowMarkType</literal> enum type.
     </para>
 
@@ -1177,9 +1180,16 @@ GetForeignRowMarkType(RangeTblEntry *rte,
      or <command>DELETE</command>.
     </para>
 
+    <para>
+     If the value pointed to by <literal>refType</literal> is not changed,
+     the <literal>ROW_REF_COPY</literal> option is used.
+    </para>
+
     <para>
      If the <function>GetForeignRowMarkType</function> pointer is set to
-     <literal>NULL</literal>, the <literal>ROW_MARK_COPY</literal> option is always used.
+     <literal>NULL</literal>, the <literal>ROW_MARK_REFERENCE</literal> option
+     for the row mark type and the <literal>ROW_REF_COPY</literal> option for
+     the row reference type are always used.
      (This implies that <function>RefetchForeignRow</function> will never be called,
      so it need not be provided either.)
     </para>
@@ -1213,7 +1223,7 @@ RefetchForeignRow(EState *estate,
      defined by <literal>erm-&gt;markType</literal>, which is the value
      previously returned by <function>GetForeignRowMarkType</function>.
      (<literal>ROW_MARK_REFERENCE</literal> means to just re-fetch the tuple
-     without acquiring any lock, and <literal>ROW_MARK_COPY</literal> will
+     without acquiring any lock.  <literal>ROW_REF_COPY</literal> references will
      never be seen by this routine.)
     </para>
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7eb1f7d0209..3b03f03a98d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -875,22 +875,19 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			/* get relation's OID (will produce InvalidOid if subquery) */
 			relid = exec_rt_fetch(rc->rti, estate)->relid;
 
-			/* open relation, if we need to access it for this mark type */
-			switch (rc->markType)
+			/*
+			 * Open relation, if we need to access it for this reference type.
+			 */
+			switch (rc->refType)
 			{
-				case ROW_MARK_EXCLUSIVE:
-				case ROW_MARK_NOKEYEXCLUSIVE:
-				case ROW_MARK_SHARE:
-				case ROW_MARK_KEYSHARE:
-				case ROW_MARK_REFERENCE:
+				case ROW_REF_TID:
 					relation = ExecGetRangeTableRelation(estate, rc->rti);
 					break;
-				case ROW_MARK_COPY:
-					/* no physical table access is required */
+				case ROW_REF_COPY:
 					relation = NULL;
 					break;
 				default:
-					elog(ERROR, "unrecognized markType: %d", rc->markType);
+					elog(ERROR, "unrecognized refType: %d", rc->refType);
 					relation = NULL;	/* keep compiler quiet */
 					break;
 			}
@@ -906,6 +903,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			erm->refType = rc->refType;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -2402,10 +2400,14 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
 
 	aerm->rowmark = erm;
 
-	/* Look up the resjunk columns associated with this rowmark */
-	if (erm->markType != ROW_MARK_COPY)
+	/*
+	 * Look up the resjunk columns associated with this rowmark's reference
+	 * type.
+	 */
+	if (erm->refType != ROW_REF_COPY)
 	{
 		/* need ctid for all methods other than COPY */
+		Assert(erm->refType == ROW_REF_TID);
 		snprintf(resname, sizeof(resname), "ctid%u", erm->rowmarkId);
 		aerm->ctidAttNo = ExecFindJunkAttributeInTlist(targetlist,
 													   resname);
@@ -2656,7 +2658,12 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		}
 	}
 
-	if (erm->markType == ROW_MARK_REFERENCE)
+	/*
+	 * For a non-locked relation, the row mark type should be
+	 * ROW_MARK_REFERENCE.  Fetch the tuple according to the reference type.
+	 */
+	Assert(erm->markType == ROW_MARK_REFERENCE);
+	if (erm->refType == ROW_REF_TID)
 	{
 		Assert(erm->relation != NULL);
 
@@ -2709,7 +2716,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 	}
 	else
 	{
-		Assert(erm->markType == ROW_MARK_COPY);
+		Assert(erm->refType == ROW_REF_COPY);
 
 		/* fetch the whole-row Var for the relation */
 		datum = ExecGetJunkAttribute(epqstate->origslot,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 38d070fa004..4b9c9deee84 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2309,7 +2309,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		 * Ignore RowMarkClauses for subqueries; they aren't real tables and
 		 * can't support true locking.  Subqueries that got flattened into the
 		 * main query should be ignored completely.  Any that didn't will get
-		 * ROW_MARK_COPY items in the next loop.
+		 * ROW_REF_COPY items in the next loop.
 		 */
 		if (rte->rtekind != RTE_RELATION)
 			continue;
@@ -2319,8 +2319,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2344,8 +2344,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2357,29 +2357,38 @@ preprocess_rowmarks(PlannerInfo *root)
 }
 
 /*
- * Select RowMarkType to use for a given table
+ * Select RowMarkType and RowRefType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
-		/* If it's not a table at all, use ROW_MARK_COPY */
-		return ROW_MARK_COPY;
+		/*
+		 * If it's not a table at all, use ROW_MARK_REFERENCE and
+		 * ROW_REF_COPY.
+		 */
+		*refType = ROW_REF_COPY;
+		return ROW_MARK_REFERENCE;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
 
+		/* Set row reference type as ROW_REF_COPY by default */
+		*refType = ROW_REF_COPY;
+
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength);
-		/* Otherwise, use ROW_MARK_COPY by default */
-		return ROW_MARK_COPY;
+			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+		/* Otherwise, use ROW_MARK_REFERENCE by default */
+		return ROW_MARK_REFERENCE;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = ROW_REF_TID;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 7698bfa1a58..4599b0dc761 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 5c7acf8a901..b4b076d1cb1 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -91,7 +91,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +131,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +239,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +266,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +441,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -580,8 +580,9 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&childrc->refType);
+		childrc->allRefTypes = (1 << childrc->refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +593,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 0968e0a01ec..868e04e813e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -129,7 +129,8 @@ typedef TupleTableSlot *(*IterateDirectModify_function) (ForeignScanState *node)
 typedef void (*EndDirectModify_function) (ForeignScanState *node);
 
 typedef RowMarkType (*GetForeignRowMarkType_function) (RangeTblEntry *rte,
-													   LockClauseStrength strength);
+													   LockClauseStrength strength,
+													   RowRefType *refType);
 
 typedef void (*RefetchForeignRow_function) (EState *estate,
 											ExecRowMark *erm,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1774c56ae31..a1ccf6e6811 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -455,6 +455,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row identifier for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -750,6 +753,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row identifier for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7f3db5105db..d7f9c389dac 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1311,16 +1311,8 @@ typedef struct Limit
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we have to uniquely
  * identify all the source rows, not only those from the target relations, so
- * that we can perform EvalPlanQual rechecking at need.  For plain tables we
- * can just fetch the TID, much as for a target relation; this case is
- * represented by ROW_MARK_REFERENCE.  Otherwise (for example for VALUES or
- * FUNCTION scans) we have to copy the whole row value.  ROW_MARK_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_MARK_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_MARK_REFERENCE instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
+ * that we can perform EvalPlanQual rechecking at need.  ROW_MARK_REFERENCE
+ * represents this case.
  */
 typedef enum RowMarkType
 {
@@ -1329,9 +1321,29 @@ typedef enum RowMarkType
 	ROW_MARK_SHARE,				/* obtain shared tuple lock */
 	ROW_MARK_KEYSHARE,			/* obtain keyshare tuple lock */
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
-	ROW_MARK_COPY,				/* physically copy the row value */
 } RowMarkType;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * In future we may allow more types of row identifiers.
+ */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
@@ -1340,8 +1352,7 @@ typedef enum RowMarkType
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we create a separate
  * PlanRowMark node for each non-target relation in the query.  Relations that
- * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE (if
- * regular tables or supported foreign tables) or ROW_MARK_COPY (if not).
+ * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE.
  *
  * Initially all PlanRowMarks have rti == prti and isParent == false.
  * When the planner discovers that a relation is the root of an inheritance
@@ -1351,16 +1362,16 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
  * information sufficient to identify the locked or fetched rows.  When
- * markType != ROW_MARK_COPY, these columns are named
+ * refType != ROW_REF_COPY, these columns are named
  *		tableoid%u			OID of table
  *		ctid%u				TID of row
  * The tableoid column is only present for an inheritance hierarchy.
- * When markType == ROW_MARK_COPY, there is instead a single column named
+ * When refType == ROW_REF_COPY, there is instead a single column named
  *		wholerow%u			whole-row value of relation
  * (An inheritance hierarchy could have all three resjunk output columns,
  * if some children use a different markType than others.)
@@ -1381,7 +1392,8 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	RowRefType	refType;		/* see enum above */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index e1d79ffdf3c..98fc796d054 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..6ce0a586bf1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2433,6 +2433,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.3 (Apple Git-145)

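To make the new out parameter concrete, here is a minimal sketch (not from the
patch set) of how an FDW that can refetch rows by TID might use the extended
GetForeignRowMarkType() callback; the function name is hypothetical.

static RowMarkType
myGetForeignRowMarkType(RangeTblEntry *rte, LockClauseStrength strength,
						RowRefType *refType)
{
	/* Rows can be refetched by TID, so avoid carrying whole-row copies. */
	*refType = ROW_REF_TID;

	/* Locking behaviour is still expressed through the row mark type. */
	switch (strength)
	{
		case LCS_NONE:
			return ROW_MARK_REFERENCE;
		case LCS_FORKEYSHARE:
			return ROW_MARK_KEYSHARE;
		case LCS_FORSHARE:
			return ROW_MARK_SHARE;
		case LCS_FORNOKEYUPDATE:
			return ROW_MARK_NOKEYEXCLUSIVE;
		case LCS_FORUPDATE:
			return ROW_MARK_EXCLUSIVE;
	}
	return ROW_MARK_REFERENCE;	/* keep compiler quiet */
}

An FDW that leaves *refType untouched keeps the default ROW_REF_COPY behaviour,
matching the documentation change above.
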
Attachment: 0008-Introduce-RowID-bytea-tuple-identifier-v6.patch (application/octet-stream)
From 5b718f1f6e0282e1beddfc75555d17f326d2d1a6 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 21:00:37 +0200
Subject: [PATCH 8/8] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: the tuple identifier (tid)
and a whole-row copy.  The tuple identifier used for regular tables consists of
a 32-bit block number and a 16-bit offset.  This seems limited for some use
cases, in particular index-organized tables.  The whole-row copy is used to
identify tuples in FDWs.  That could be extended to regular tables, but it
seems overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose the way tuples are identified by providing the new get_row_ref_type()
API function.  The new system attribute RowIdAttributeNumber holds the RowID
when appropriate.  Table AM methods now accept Datum arguments as tuple
identifiers.  Those Datums can be either tid or bytea, depending on what
table_get_row_ref_type() says.  The ModifyTable node and triggers are aware of
RowID.  IndexScan and BitmapScan nodes are not aware of RowIDs and expect tids.
Table AMs that use RowIDs are supposed to redefine those nodes using hooks.
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |   9 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 145 ++++++++-----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/plan/planner.c     |  11 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 ++
 src/backend/parser/parse_relation.c      |  13 ++
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  58 ++++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/parsenodes.h           |   2 +
 src/include/nodes/plannodes.h            |  21 --
 src/include/nodes/primnodes.h            |  22 ++
 src/include/utils/tuplestore.h           |   3 +
 26 files changed, 548 insertions(+), 175 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f71f1854e0a..7bfa2a2fc44 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -984,7 +984,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 5c89fbbef83..7b52c66939c 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -755,6 +755,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f42a8a80128..aea497ec73b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -50,7 +50,7 @@
 #include "utils/sampling.h"
 #include "utils/spccache.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -188,7 +188,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -197,7 +197,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -362,7 +362,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -409,7 +409,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -589,12 +589,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -617,7 +618,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -634,7 +635,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -642,6 +643,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -689,7 +691,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -705,7 +707,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -713,6 +715,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2380,6 +2383,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use the TID row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -2958,6 +2970,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222cebc..caa79c6eddd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -300,7 +300,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -356,7 +356,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74a..8765becf986 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1626,7 +1626,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 84494c4b81f..4f83e521a35 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -76,7 +76,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2682,7 +2682,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2696,7 +2696,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2924,7 +2924,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2944,7 +2944,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3261,7 +3261,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3286,7 +3286,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3382,8 +3384,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3589,18 +3591,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* high-order
+																		 * bits */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3652,6 +3660,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -4004,14 +4015,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4044,7 +4075,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4063,6 +4094,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4103,7 +4135,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4263,6 +4314,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4294,15 +4346,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4334,13 +4388,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4376,13 +4443,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4556,7 +4633,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4659,6 +4736,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6051,6 +6129,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6144,6 +6224,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6154,6 +6240,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6173,6 +6267,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6188,10 +6290,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6229,20 +6378,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6387,6 +6522,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6408,7 +6557,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24a3990a30a..c8ce4d45ff4 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4888,7 +4888,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3b03f03a98d..514d9b28c48 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -867,13 +867,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/*
 			 * Open relation, if we need to access it for this reference type.
@@ -903,7 +905,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
-			erm->refType = rc->refType;
+			erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1267,6 +1269,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2708,7 +2711,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index db685473fc0..aad266a19ff 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -250,7 +250,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -434,7 +435,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -571,7 +573,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -638,7 +641,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 41754ddfea9..2d3ad904a64 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a64e37e9af9..90eeb99b2cd 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -124,7 +124,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldslot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -141,13 +141,13 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple oldtuple,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static TupleTableSlot *ExecMergeMatched(ModifyTableContext *context,
 										ResultRelInfo *resultRelInfo,
-										ItemPointer tupleid,
+										Datum tupleid,
 										HeapTuple oldtuple,
 										bool canSetTag,
 										bool *matched);
@@ -1221,7 +1221,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1252,7 +1252,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1280,7 +1280,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1361,7 +1361,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldslot,
 		   bool processReturning,
@@ -1558,7 +1558,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldslot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1575,7 +1575,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1624,7 +1624,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1783,7 +1783,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1860,7 +1860,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -2014,7 +2014,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldslot)
 {
@@ -2064,7 +2064,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2154,7 +2154,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldslot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2208,15 +2208,19 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
 		/*
 		 * Specify that we need to lock and fetch the last tuple version for
 		 * EPQ on appropriate transaction isolation levels if the tuple isn't
 		 * locked already.
 		 */
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2326,7 +2330,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldslot);
 
 	/* Process RETURNING if present */
@@ -2358,7 +2362,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2414,7 +2430,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2433,7 +2449,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag)
+		  Datum tupleid, HeapTuple oldtuple, bool canSetTag)
 {
 	TupleTableSlot *rslot = NULL;
 	bool		matched;
@@ -2482,7 +2498,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL || oldtuple != NULL;
+	matched = DatumGetPointer(tupleid) != NULL || oldtuple != NULL;
 	if (matched)
 		rslot = ExecMergeMatched(context, resultRelInfo, tupleid, oldtuple,
 								 canSetTag, &matched);
@@ -2523,7 +2539,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TupleTableSlot *
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag,
+				 Datum tupleid, HeapTuple oldtuple, bool canSetTag,
 				 bool *matched)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -2559,7 +2575,7 @@ ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * the tupleid of the target row, or an old tuple from the target wholerow
 	 * junk attr.
 	 */
-	Assert(tupleid != NULL || oldtuple != NULL);
+	Assert(DatumGetPointer(tupleid) != NULL || oldtuple != NULL);
 	if (oldtuple != NULL)
 		ExecForceStoreHeapTuple(oldtuple, resultRelInfo->ri_oldTupleSlot,
 								false);
@@ -2573,7 +2589,7 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-	if (tupleid != NULL)
+	if (DatumGetPointer(tupleid) != NULL)
 	{
 		if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 										   tupleid,
@@ -2682,7 +2698,7 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2718,7 +2734,7 @@ lmerge_matched:
 
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2842,9 +2858,13 @@ lmerge_matched:
 								return NULL;
 							}
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 							{
 								*matched = false;
@@ -2871,11 +2891,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3413,10 +3429,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3465,6 +3481,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3515,7 +3533,7 @@ ExecModifyTable(PlanState *pstate)
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 					slot = ExecMerge(&context, node->resultRelInfo,
-									 NULL, NULL, node->canSetTag);
+									 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 					/*
 					 * If we got a RETURNING result, return it to the caller.
@@ -3559,7 +3577,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3602,7 +3621,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3617,9 +3636,25 @@ ExecModifyTable(PlanState *pstate)
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3659,7 +3694,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3723,6 +3758,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3757,6 +3793,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -4000,10 +4039,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4313,6 +4362,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 864a9013b62..f4a124ac4eb 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -377,7 +377,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4b9c9deee84..ee648bedd4a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2376,19 +2376,24 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
+		RowMarkType result = ROW_MARK_REFERENCE;
 
 		/* Set row reference type as ROW_REF_COPY by default */
 		*refType = ROW_REF_COPY;
 
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+			result = fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+
+		/* XXX: should we fill this before? */
+		rte->reftype = *refType;
+
 		/* Otherwise, use ROW_MARK_REFERENCE by default */
-		return ROW_MARK_REFERENCE;
+		return result;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
-		*refType = ROW_REF_TID;
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 4599b0dc761..3620be5b52c 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224a..83c08bbd0e1 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -895,17 +896,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index b4b076d1cb1..4a5a167d833 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -282,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -485,6 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 427b7325db8..2c80e010f2a 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1503,6 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1588,6 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1763,6 +1767,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2081,6 +2086,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2156,6 +2162,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2252,6 +2259,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2332,6 +2340,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2494,6 +2503,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3257,6 +3267,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 9fd05b15e73..7a0fdbe3f40 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1854,6 +1854,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 947a868e569..d3a41533552 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces tuple storage to the slot.
+ * Thus, it can work with slot types other than minimal tuples.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index e88dec71ee9..867b5eb489e 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index d6a7aace722..04ec41062fa 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -476,7 +476,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -535,7 +535,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -546,7 +546,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -559,7 +559,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -702,6 +702,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * After the call all memory related to rd_amcache must be freed,
@@ -1284,9 +1289,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1294,7 +1299,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1306,7 +1311,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1492,7 +1498,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1521,12 +1527,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1540,7 +1546,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1577,13 +1583,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1595,7 +1601,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1624,12 +1630,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1915,6 +1921,22 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  Returns ROW_REF_TID when table AM routine
+ * is not accessible.  This happens during catalog initialization.  All catalog
+ * tables are known to use heap.
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+		return ROW_REF_COPY;
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index cb968d03ecd..c16e6b6e5a0 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b89baef95d3..04d8cef6c68 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1089,6 +1089,8 @@ typedef struct RangeTblEntry
 	Index		perminfoindex pg_node_attr(query_jumble_ignore);
 	/* sampling info, or NULL */
 	struct TableSampleClause *tablesample;
+	/* row identifier for relation */
+	RowRefType	reftype;
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d7f9c389dac..d850411aa95 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1323,27 +1323,6 @@ typedef enum RowMarkType
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
 } RowMarkType;
 
-/*
- * RowRefType -
- *	  enums for types of row identifiers
- *
- * For plain tables we can just fetch the TID, much as for a target relation;
- * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
- * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_REF_TID instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
- * In future we may allow more types of row identifiers.
- */
-typedef enum RowRefType
-{
-	ROW_REF_TID,				/* Item pointer (block, offset) */
-	ROW_REF_COPY				/* Full row copy */
-} RowRefType;
-
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 376f67e6a5f..84cf7837de1 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2211,4 +2211,26 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * In future we may allow more types of row identifiers.
+ */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 419613c17ba..cf291a0d17a 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -70,6 +70,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.3 (Apple Git-145)
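
As a purely illustrative sketch (not part of the patch above): a hypothetical
table AM that identifies rows by a bytea RowID rather than a TID might
implement the new get_row_ref_type() callback along these lines, with callers
then picking the type up via table_get_row_ref_type(); the AM name and the
fallback logic are assumptions:

/* Hypothetical sketch: report bytea row identifiers for regular relations
 * and materialized views, and fall back to full-row copies otherwise. */
static RowRefType
myam_get_row_ref_type(Relation rel)
{
	if (rel->rd_rel->relkind == RELKIND_RELATION ||
		rel->rd_rel->relkind == RELKIND_MATVIEW)
		return ROW_REF_ROWID;

	return ROW_REF_COPY;
}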

#27Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#26)
Re: Table AM Interface Enhancements

Hi, Alexander!
Thank you for working on these patches.
On Thu, 28 Mar 2024 at 02:14, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Hi, Pavel!

Thank you for your feedback. The revised patch set is attached.

I found that vacuum.c has a lot of heap-specific code. Thus, it
should be OK for analyze.c to keep heap-specific code. Therefore, I
rolled back the movement of functions between files. That leads to a
smaller patch, easier to review.

I agree with you.
And with the changes remaining in heapam_handler.c I suppose we can also
remove the includes introduced:

#include <math.h>
#include "utils/sampling.h"
#include "utils/spccache.h"

On Wed, Mar 27, 2024 at 2:52 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

The revised rest of the patchset is attached.
0001 (was 0006) – I prefer the definition of AcquireSampleRowsFunc to
stay in vacuum.h. If we moved it to sampling.h, we would have to add
includes there to define Relation, HeapTuple, etc. I'd like to avoid
this kind of change. Also, I've deleted table_beginscan_analyze(),
because it's only called from the tableam-specific
AcquireSampleRowsFunc. I also added comments to
heapam_scan_analyze_next_block() and heapam_scan_analyze_next_tuple(),
given that there are now no relevant comments for them in tableam.h.
I've removed some redundancies from acquire_sample_rows() and added
comments to AcquireSampleRowsFunc based on what we have in the FDW
docs for this function, plus some small edits. As you suggested, I
restored the declarations for acquire_sample_rows() and compare_rows().

In my comment in the thread I was not thinking about returning to the
old name acquire_sample_rows(); it was only about the declarations and
the order of functions being one code block. To me,
heapam_acquire_sample_rows() looks better as a name for the heap
implementation of *AcquireSampleRowsFunc(). I suggest returning to the
name heapam_acquire_sample_rows() from v4. Sorry for the confusion.

I found that the function name acquire_sample_rows is referenced in
quite a few places in the source code. So, I would prefer to keep the
old name to minimize the changes.

The full list shows only a couple of changes in analyze.c and several
comments elsewhere.

contrib/postgres_fdw/postgres_fdw.c: * of the relation. Same algorithm as in acquire_sample_rows in
src/backend/access/heap/vacuumlazy.c: * match what analyze.c's acquire_sample_rows() does, otherwise VACUUM
src/backend/access/heap/vacuumlazy.c: * The logic here is a bit simpler than acquire_sample_rows(), as
src/backend/access/heap/vacuumlazy.c: * what acquire_sample_rows() does.
src/backend/access/heap/vacuumlazy.c: * acquire_sample_rows() does, so be consistent.
src/backend/access/heap/vacuumlazy.c: * acquire_sample_rows() will recognize the same LP_DEAD items as dead
src/backend/commands/analyze.c:static int acquire_sample_rows(Relation onerel, int elevel,
src/backend/commands/analyze.c: acquirefunc = acquire_sample_rows;
src/backend/commands/analyze.c: * acquire_sample_rows -- acquire a random sample of rows from the table
src/backend/commands/analyze.c:acquire_sample_rows(Relation onerel, int elevel,
src/backend/commands/analyze.c: * This has the same API as acquire_sample_rows, except that rows are
src/backend/commands/analyze.c: acquirefunc = acquire_sample_rows;

My point in renaming was to make it clear that it's the heap
implementation of acquire_sample_rows, which could be useful if the
heap implementations of table AM methods are later reworked into a
separate place. But I'm also OK with the existing naming.

The changed type of the static function that always returned true for
heap looks good to me:

static void heapam_scan_analyze_next_block

The same goes for removing the comparison of the always-true
"block_accepted" in (heapam_)acquire_sample_rows().

Ok!

Removing table_beginscan_analyze() and calling scan_begin() directly is
not in the same style as the other table_beginscan_* functions. Though
this is not a change in functionality, I'd leave this part as it was in
v4.

With the patch, this method has no usages outside of table AM code.
I don't think we should keep a method that doesn't have clear external
usage patterns. But I agree that starting a scan with heap_beginscan()
and ending it with table_endscan() is not correct. The scan is now
ended with heap_endscan().
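
For reference, a minimal sketch of the resulting scan lifecycle (not
actual patch code; the sampling loop is elided and the function only
illustrates the pairing discussed above):

/* Sketch: the heap sampling code now pairs heap_beginscan() with
 * heap_endscan() instead of going through table_beginscan_analyze(). */
static void
sample_scan_lifecycle_sketch(Relation onerel)
{
	TableScanDesc scan;

	/* start an analyze-type scan directly at the heap level */
	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);

	/*
	 * ... block/tuple sampling loop goes here, e.g. using the
	 * heapam_scan_analyze_next_block()/heapam_scan_analyze_next_tuple()
	 * helpers mentioned above ...
	 */

	/* finish with the matching heap-level call */
	heap_endscan(scan);
}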

Good!

Also, a comment about it was introduced in v5:

src/backend/access/heap/heapam_handler.c: * with table_beginscan_analyze()

Corrected.

For comments I'd propose:

%s/In addition, store estimates/In addition, a function should store estimates/g

%s/zerp/zero/g

Fixed.

0002 (was 0007) – I've turned the redundant "if" you pointed out into
an assert. I've also added some comments, most notably a comment for
TableAmRoutine.reloptions based on the indexam docs.

%s/validate sthe/validates the/g

Fixed.

This seems not needed; it's already initialized to InvalidOid before.
+else
+accessMethod = default_table_access_method;
+       accessMethodId = InvalidOid;

This code came from 374c7a22904. I don't insist on this simplification
in patch 0002.

This is a minor redundancy. I prefer to keep it; it makes it obvious
that the patch just moved this code.

I agree with the rest.

Regards,
Pavel Borisov

#28Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#27)
Re: Table AM Interface Enhancements

Hi, Alexander!

The other extensibility improvement that seems quite clear and
uncontroversial to me is 0006.

It simply shifts the decision on whether tuple inserts should also
trigger inserts into the related indexes down to the table AM level. It
doesn't change the current heap insert behavior, so it's safe for the
existing heap access method. But new table access methods could
redefine this (only for tables created with those AMs) and perform
index inserts independently of ExecInsertIndexTuples inside their own
implementations of the tuple_insert/tuple_multi_insert methods.

I'd propose changing the comment:

1405 * This function sets `*insert_indexes` to true if expects caller to return
1406 * the relevant index tuples. If `*insert_indexes` is set to false, then
1407 * this function cares about indexes itself.

in the following way

Tableam implementation of tuple_insert should set `*insert_indexes` to true
if it expects the caller to insert the relevant index tuples (as in heap
implementation). It should set `*insert_indexes` to false if it cares
about index inserts itself and doesn't want the caller to do index inserts.
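
To illustrate (purely a hypothetical sketch, not from the patchset): a
table AM that maintains its indexes internally would use the new
out-parameter as below. The function name and the rest of the signature
are assumptions; the heap implementation would instead set
*insert_indexes to true and keep the current behavior.

static void
myam_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
				  int options, struct BulkInsertStateData *bistate,
				  bool *insert_indexes)
{
	/* ... AM-specific insertion, including its own index maintenance ... */

	/* tell the executor not to call ExecInsertIndexTuples() for this tuple */
	*insert_indexes = false;
}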

Maybe the commit message should also be reformulated to describe better
who should do what.

I think that, with a rebase and corrections in the comments/commit
message, patch 0006 is ready to be committed.

Regards,
Pavel Borisov.

#29Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#28)
8 attachment(s)
Re: Table AM Interface Enhancements

Hi Pavel!

Revised patchset is attached.

On Thu, Mar 28, 2024 at 3:12 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

The other extensibility improvement that seems quite clear and uncontroversial to me is 0006.

It simply shifts the decision on whether tuple inserts should also trigger inserts into the related indexes down to the table AM level. It doesn't change the current heap insert behavior, so it's safe for the existing heap access method. But new table access methods could redefine this (only for tables created with those AMs) and perform index inserts independently of ExecInsertIndexTuples inside their own implementations of the tuple_insert/tuple_multi_insert methods.

I'd propose changing the comment:

1405 * This function sets `*insert_indexes` to true if expects caller to return
1406 * the relevant index tuples. If `*insert_indexes` is set to false, then
1407 * this function cares about indexes itself.

in the following way

Tableam implementation of tuple_insert should set `*insert_indexes` to true
if it expects the caller to insert the relevant index tuples (as in heap
implementation). It should set `*insert_indexes` to false if it cares
about index inserts itself and doesn't want the caller to do index inserts.

Changed as you proposed.

Maybe the commit message should also be reformulated to describe better who should do what.

Done.

Also, I removed extra includes in 0001 as you proposed and edited the
commit message in 0002.

I think that, with a rebase and corrections in the comments/commit message, patch 0006 is ready to be committed.

I'm going to push 0001, 0002 and 0006 if no objections.

------
Regards,
Alexander Korotkov

Attachments:

0002-Custom-reloptions-for-table-AM-v7.patchapplication/octet-stream; name=0002-Custom-reloptions-for-table-AM-v7.patchDownload
From 50780c2e708bc17a0287bb715a0a5ae63319d7d6 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH 2/8] Custom reloptions for table AM

Let table AM define custom reloptions for its tables.  This allows
specifying AM-specific parameters in the WITH clause when creating a table.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Pavel Borisov, Matthias van de Meent
---
 src/backend/access/common/reloptions.c   |  6 ++-
 src/backend/access/heap/heapam_handler.c | 12 ++++++
 src/backend/access/table/tableamapi.c    | 25 +++++++++++++
 src/backend/commands/tablecmds.c         | 47 ++++++++++++++----------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       |  6 ++-
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 43 ++++++++++++++++++++++
 8 files changed, 122 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..963995388bb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1377,7 +1378,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1400,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a7ef0cf72d3..26b3be9779d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2155,6 +2156,16 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	return heap_reloptions(relkind, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2660,6 +2671,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..d9e23ef3175 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,29 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+/*
+ * GetTableAmRoutineByAmOid
+ *		Given the table access method oid get its TableAmRoutine struct, which
+ *		will be palloc'd in the caller's memory context.
+ */
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8a02c5b05b6..6fc815666bf 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -715,6 +715,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -850,6 +851,24 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * Select access method to use: an explicitly indicated one, or (in the
+	 * case of a partitioned table) the parent's, if it has one.
+	 */
+	if (stmt->accessMethod != NULL)
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	else if (stmt->partbound)
+	{
+		Assert(list_length(inheritOids) == 1);
+		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+	}
+	else
+		accessMethodId = InvalidOid;
+
+	/* still nothing? use the default */
+	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+		accessMethodId = get_table_am_oid(default_table_access_method, false);
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -858,6 +877,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -866,6 +891,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -957,24 +983,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * Select access method to use: an explicitly indicated one, or (in the
-	 * case of a partitioned table) the parent's, if it has one.
-	 */
-	if (stmt->accessMethod != NULL)
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	else if (stmt->partbound)
-	{
-		Assert(list_length(inheritOids) == 1);
-		accessMethodId = get_rel_relam(linitial_oid(inheritOids));
-	}
-	else
-		accessMethodId = InvalidOid;
-
-	/* still nothing? use the default */
-	if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-		accessMethodId = get_table_am_oid(default_table_access_method, false);
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15520,7 +15528,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 71e8a6f2584..d1d76016ab4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2661,7 +2661,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 1f419c2a6dd..039c0d3eef4 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -464,6 +465,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -478,6 +480,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -493,7 +496,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270a..8ddc75df287 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8ed4e7295ad..cf68ec48ebf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -737,6 +737,28 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * This callback parses and validates the reloptions array for a table.
+	 *
+	 * This is called only when a non-null reloptions array exists for the
+	 * table.  'reloptions' is a text array containing entries of the form
+	 * "name=value".  The function should construct a bytea value, which will
+	 * be copied into the rd_options field of the table's relcache entry. The
+	 * data contents of the bytea value are open for the access method to
+	 * define.
+	 *
+	 * When 'validate' is true, the function should report a suitable error
+	 * message if any of the options are unrecognized or have invalid values;
+	 * when 'validate' is false, invalid entries should be silently ignored.
+	 * ('validate' is false when loading options already stored in pg_catalog;
+	 * an invalid entry could only be found if the access method has changed
+	 * its rules for options, and in that case ignoring obsolete entries is
+	 * appropriate.)
+	 *
+	 * It is OK to return NULL if default behavior is wanted.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1925,6 +1947,26 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2102,6 +2144,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
-- 
2.39.3 (Apple Git-145)
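
For illustration only (not part of the patch above): a table AM providing its
own storage parameters might implement the new reloptions callback with
build_reloptions().  The option kind, the options struct, and all names below
are assumptions; the relopt_kind would typically be registered once with
add_reloption_kind() (and the option itself with add_int_reloption()) when the
extension is loaded.

typedef struct MyAmOptions
{
	int32		vl_len_;		/* varlena header, don't touch directly */
	int			compress_level; /* hypothetical AM-specific option */
} MyAmOptions;

static relopt_kind myam_relopt_kind;	/* set up at extension load time */

static bytea *
myam_reloptions(char relkind, Datum reloptions, bool validate)
{
	static const relopt_parse_elt tab[] = {
		{"compress_level", RELOPT_TYPE_INT,
		 offsetof(MyAmOptions, compress_level)},
	};

	return (bytea *) build_reloptions(reloptions, validate, myam_relopt_kind,
									  sizeof(MyAmOptions), tab, lengthof(tab));
}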

0003-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v7.patchapplication/octet-stream; name=0003-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT-v7.patchDownload
From 792330f67dfea59d67af069448db801c0f601209 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH 3/8] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs need to implement INSERT ... ON CONFLICT ... with
speculative tokens.  They could just have a custom implementation of those
tokens using tuple_insert_speculative() and tuple_complete_speculative() API
functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a
single tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function provides clear semantics, allowing different
implementations of the INSERT ... ON CONFLICT ... functionality.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 26b3be9779d..590413bab9a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -304,6 +304,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2644,8 +2922,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index d9e23ef3175..c38ab936cde 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -70,8 +70,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d1917f2fea7..8e1c8f697c6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -129,7 +129,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -265,66 +264,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1015,12 +954,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1037,23 +983,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1076,57 +1027,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2441,144 +2348,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index cf68ec48ebf..c4cdae5903c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -514,19 +515,16 @@ typedef struct TableAmRoutine
 									 CommandId cid, int options,
 									 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1400,36 +1398,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into a table, checking the given arbiter indexes.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it takes
+ * `arbiterIndexes`, a list of OIDs of arbiter indexes, into account.
+ *
+ * If the tuple doesn't violate uniqueness on any arbiter index, it is
+ * inserted and the slot containing the inserted tuple is returned.
+ *
+ * If the tuple violates uniqueness on some arbiter index, this function
+ * returns NULL and doesn't insert the tuple.  Additionally, if `lockedSlot`
+ * is provided, the conflicting tuple gets locked in `lockmode` and placed
+ * into `lockedSlot`.
+ *
+ * The executor state `estate` is passed to this method so it can compute
+ * index tuples.  The temporary tuple table slot `tempSlot` is passed for
+ * holding a potentially conflicting tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.3 (Apple Git-145)

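For illustration, here is a minimal sketch of how a custom table AM might implement the new tuple_insert_with_arbiter() callback.  Only the callback signature and its contract come from the patch above; the my_am_* helpers are assumptions standing in for AM-specific logic.

/*
 * Minimal sketch of a tuple_insert_with_arbiter() implementation for a
 * hypothetical table AM.  The my_am_* helpers are assumptions; only the
 * callback signature and contract come from the patch above.
 */
#include "postgres.h"

#include "access/tableam.h"
#include "nodes/execnodes.h"
#include "nodes/pg_list.h"

/* assumed AM-specific helpers, not part of the patch set */
extern bool my_am_find_conflict(Relation rel, TupleTableSlot *slot,
								List *arbiterIndexes, EState *estate,
								TupleTableSlot *tempSlot);
extern void my_am_lock_tuple(Relation rel, TupleTableSlot *conflictSlot,
							 CommandId cid, LockTupleMode lockmode,
							 TupleTableSlot *lockedSlot);
extern TupleTableSlot *my_am_plain_insert(Relation rel, TupleTableSlot *slot,
										  CommandId cid, int options,
										  struct BulkInsertStateData *bistate);

static TupleTableSlot *
my_am_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
								TupleTableSlot *slot,
								CommandId cid, int options,
								struct BulkInsertStateData *bistate,
								List *arbiterIndexes,
								EState *estate,
								LockTupleMode lockmode,
								TupleTableSlot *lockedSlot,
								TupleTableSlot *tempSlot)
{
	Relation	rel = resultRelInfo->ri_RelationDesc;

	/*
	 * Look for a row conflicting with `slot` on any arbiter index.  On a
	 * conflict the row is placed into tempSlot.
	 */
	if (my_am_find_conflict(rel, slot, arbiterIndexes, estate, tempSlot))
	{
		/*
		 * Conflict: don't insert.  If the caller asked for it, lock the
		 * conflicting row and hand it back via lockedSlot.
		 */
		if (lockedSlot)
			my_am_lock_tuple(rel, tempSlot, cid, lockmode, lockedSlot);
		return NULL;
	}

	/* No conflict: perform an ordinary insertion and return its slot. */
	return my_am_plain_insert(rel, slot, cid, options, bistate);
}
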
0001-Generalize-relation-analyze-in-table-AM-interface-v7.patchapplication/octet-stream; name=0001-Generalize-relation-analyze-in-table-AM-interface-v7.patchDownload
From 5ba8cf52c3de4e5c116b784110ceb5657c0a1e4d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 23:41:11 +0200
Subject: [PATCH 1/8] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table, written
in acquire_sample_rows().  A custom table AM can only redefine the way to get
the next block/tuple by implementing the scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows a table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent
---
 src/backend/access/heap/heapam_handler.c |  29 +++++--
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           |  54 ++++++------
 src/include/access/heapam.h              |   9 ++
 src/include/access/tableam.h             | 106 +++++------------------
 src/include/commands/vacuum.h            |  19 ++++
 src/include/foreign/fdwapi.h             |   6 +-
 7 files changed, 100 insertions(+), 125 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..a7ef0cf72d3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -50,7 +50,6 @@ static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
 								   TM_FailureData *tmfd);
-
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -1052,7 +1051,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-static bool
+/*
+ * Prepare to analyze block `blockno` of `scan`.  The scan has been started
+ * with the SO_TYPE_ANALYZE option.
+ *
+ * This routine acquires a buffer pin and lock on the heap page.  They are
+ * held until heapam_scan_analyze_next_tuple() returns false, that is, until
+ * all the items of the heap page are analyzed.
+ */
+void
 heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 							   BufferAccessStrategy bstrategy)
 {
@@ -1072,12 +1079,19 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
 										blockno, RBM_NORMAL, bstrategy);
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-
-	/* in heap all blocks can contain tuples, so always return true */
-	return true;
 }
 
-static bool
+/*
+ * Iterate over tuples in the block selected with
+ * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
+ * is found, true is returned and the tuple is stored in `slot`.  When no more
+ * tuples remain for sampling, false is returned, and the pin and lock acquired
+ * by heapam_scan_analyze_next_block() are released.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
+bool
 heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
 							   TupleTableSlot *slot)
@@ -2637,10 +2651,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index ce637a5a5d9..55b8caeadf2 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,8 +81,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..2fb39f3ede1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -17,6 +17,7 @@
 #include <math.h>
 
 #include "access/detoast.h"
+#include "access/heapam.h"
 #include "access/genam.h"
 #include "access/multixact.h"
 #include "access/relation.h"
@@ -190,10 +191,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1103,15 +1103,15 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 }
 
 /*
- * acquire_sample_rows -- acquire a random sample of rows from the table
+ * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
  * Selected rows are returned in the caller-allocated array rows[], which
  * must have at least targrows entries.
  * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
+ * We also estimate the total numbers of live and dead rows in the heap,
  * and return them into *totalrows and *totaldeadrows, respectively.
  *
- * The returned list of tuples is in order by physical position in the table.
+ * The returned list of tuples is in order by physical position in the heap.
  * (We will rely on this later to derive correlation estimates.)
  *
  * As of May 2004 we use a new two-stage method:  Stage one selects up
@@ -1133,7 +1133,7 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * look at a statistically unbiased set of blocks, we should get
  * unbiased estimates of the average numbers of live and dead rows per
  * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
+ * density near the start of the heap.
  */
 static int
 acquire_sample_rows(Relation onerel, int elevel,
@@ -1184,7 +1184,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = table_beginscan_analyze(onerel);
+	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
 #ifdef USE_PREFETCH
@@ -1214,7 +1214,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{
-		bool		block_accepted;
 		BlockNumber targblock = BlockSampler_Next(&bs);
 #ifdef USE_PREFETCH
 		BlockNumber prefetch_targblock = InvalidBlockNumber;
@@ -1230,29 +1229,19 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		vacuum_delay_point();
 
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
 
 #ifdef USE_PREFETCH
 
 		/*
 		 * When pre-fetching, after we get a block, tell the kernel about the
 		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
 		 */
 		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
 			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
 #endif
 
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1302,7 +1291,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
+	heap_endscan(scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
@@ -1373,6 +1362,19 @@ compare_rows(const void *a, const void *b, void *arg)
 	return 0;
 }
 
+/*
+ * heapam_analyze -- implementation of relation_analyze() table access method
+ *					 callback for heap
+ */
+void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	vac_strategy = bstrategy;
+}
+
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1462,9 +1464,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f1122453738..91fbc950343 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -369,6 +369,15 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
+/* in heap/heapam_handler.c */
+extern void heapam_scan_analyze_next_block(TableScanDesc scan,
+										   BlockNumber blockno,
+										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
+										   TransactionId OldestXmin,
+										   double *liverows, double *deadrows,
+										   TupleTableSlot *slot);
+
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fc0e7027157..8ed4e7295ad 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -658,41 +659,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -713,6 +679,12 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/* See table_relation_analyze() */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1008,19 +980,6 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
-/*
- * table_beginscan_analyze is an alternative entry point for setting up a
- * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
- * the same data structure although the behavior is rather different.
- */
-static inline TableScanDesc
-table_beginscan_analyze(Relation rel)
-{
-	uint32		flags = SO_TYPE_ANALYZE;
-
-	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
-}
-
 /*
  * End relation scan.
  */
@@ -1746,42 +1705,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1887,6 +1810,21 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * table_relation_analyze - fill the information for sampling statistics
+ *							acquisition
+ *
+ * The pointer to a function that will collect sample rows from the table
+ * should be stored into `*func`, and the estimated size of the table in
+ * pages should be stored into `*totalpages`.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1182a967427..68068dd9003 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -175,6 +175,21 @@ typedef struct VacAttrStats
 	int			rowstride;
 } VacAttrStats;
 
+/*
+ * AcquireSampleRowsFunc - a function for sampling statistics collection.
+ *
+ * A random sample of up to `targrows` rows should be collected from the
+ * table and stored into the caller-provided `rows` array.  The actual number
+ * of rows collected must be returned.  In addition, the function should store
+ * estimates of the total numbers of live and dead rows in the table into the
+ * output parameters `*totalrows` and `*totaldeadrows`.  (Set `*totaldeadrows`
+ * to zero if the storage does not have any concept of dead rows.)
+ */
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 /* flag bits for VacuumParams->options */
 #define VACOPT_VACUUM 0x01		/* do VACUUM */
 #define VACOPT_ANALYZE 0x02		/* do ANALYZE */
@@ -380,6 +395,10 @@ extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
+extern void heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+						   BlockNumber *totalpages,
+						   BufferAccessStrategy bstrategy);
+
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index fcde3876b28..0968e0a01ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.3 (Apple Git-145)

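To make the intended use of the new callback concrete, here is a rough sketch of a relation_analyze() implementation for a custom table AM, modeled on heapam_analyze() above.  my_am_sample_rows() is an assumption standing in for an AM-specific sampler; the callback signatures come from the patch.

#include "postgres.h"

#include "access/tableam.h"
#include "commands/vacuum.h"
#include "storage/bufmgr.h"

/*
 * Assumed AM-specific sampler, e.g. one that walks an index-organized tree
 * instead of sampling heap blocks.  It must follow the AcquireSampleRowsFunc
 * contract: fill rows[], return the number of rows collected, and set
 * *totalrows / *totaldeadrows.
 */
static int
my_am_sample_rows(Relation relation, int elevel,
				  HeapTuple *rows, int targrows,
				  double *totalrows, double *totaldeadrows)
{
	int			numrows = 0;

	/* ... AM-specific sampling would go here ... */

	*totalrows = 0;			/* estimated live rows */
	*totaldeadrows = 0;		/* this AM has no notion of dead rows */
	return numrows;
}

static void
my_am_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
{
	/* Hand the whole sampling algorithm over to the AM-specific function. */
	*func = my_am_sample_rows;
	/* A page count estimate; a real AM might compute its own figure. */
	*totalpages = RelationGetNumberOfBlocks(relation);
}
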
0005-Notify-table-AM-about-index-creation-v7.patchapplication/octet-stream; name=0005-Notify-table-AM-about-index-creation-v7.patchDownload
From cd95e8ab0ae24c6586ed6d41b448728b8130da1a Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH 5/8] Notify table AM about index creation

This allows a table AM to do some preparation during index build.  In
particular, the table AM could update its specific meta-information.  This
could also be useful if the table AM overrides index implementations.
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c560f70ba25..1c029ce6ab2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2949,6 +2949,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e230..bca97981051 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3840,6 +3840,8 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e78598c10e1..2570e7a24a1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -583,6 +583,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -629,6 +630,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that tableId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -656,24 +677,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1218,6 +1221,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 48f078309f7..db0559788a4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -684,6 +684,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1860,6 +1870,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let the table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, which should then be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notify the table AM about index creation on `rel` with OID `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.3 (Apple Git-145)

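As a rough illustration, a table AM that keeps its own per-index metadata could implement the two new callbacks along the following lines.  The MyAmIndexInfo structure and the my_am_* helpers are assumptions; only the callback signatures come from the patch above.

#include "postgres.h"

#include "access/tableam.h"
#include "nodes/parsenodes.h"

/* assumed AM-specific state carried from validation to creation */
typedef struct MyAmIndexInfo
{
	bool		skip_build;
} MyAmIndexInfo;

/* assumed AM-specific helpers, not part of the patch set */
extern bool my_am_index_stmt_is_supported(IndexStmt *stmt);
extern void my_am_register_index(Relation rel, Oid indoid, bool reindex,
								 MyAmIndexInfo *info);

static bool
my_am_define_index_validate(Relation rel, IndexStmt *stmt,
							bool skip_build, void **arg)
{
	MyAmIndexInfo *info;

	/* Reject index definitions this AM cannot support. */
	if (!my_am_index_stmt_is_supported(stmt))
		ereport(ERROR,
				(errmsg("index definition is not supported by this table AM")));

	/* Remember what we need until table_define_index() is called. */
	info = palloc0(sizeof(MyAmIndexInfo));
	info->skip_build = skip_build;
	*arg = info;
	return true;
}

static bool
my_am_define_index(Relation rel, Oid indoid, bool reindex,
				   bool skip_constraint_checks, bool skip_build, void *arg)
{
	/* Update AM-specific meta-information now that the index exists. */
	my_am_register_index(rel, indoid, reindex, (MyAmIndexInfo *) arg);
	return true;
}
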
0004-Let-table-AM-override-reloptions-for-indexes-buil-v7.patchapplication/octet-stream; name=0004-Let-table-AM-override-reloptions-for-indexes-buil-v7.patchDownload
From 4acba5b4be933225f42acc5df4d7a52b4f4afdc3 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Mar 2024 00:53:05 +0200
Subject: [PATCH 4/8] Let table AM override reloptions for indexes built on its
 tables

---
 src/backend/access/common/reloptions.c   |  3 ++-
 src/backend/access/heap/heapam_handler.c |  8 ++++++++
 src/backend/commands/indexcmds.c         |  3 ++-
 src/backend/commands/tablecmds.c         |  9 +++++++-
 src/backend/utils/cache/relcache.c       | 24 ++++++++++++++++++++--
 src/include/access/tableam.h             | 26 ++++++++++++++++++++++++
 6 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388bb..00088240cdd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1411,7 +1411,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 590413bab9a..c560f70ba25 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2444,6 +2444,13 @@ heapam_reloptions(char relkind, Datum reloptions, bool validate)
 	return heap_reloptions(relkind, reloptions, validate);
 }
 
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2949,6 +2956,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
 	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487b..e78598c10e1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -899,7 +899,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6fc815666bf..eccd1131a5c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15539,7 +15539,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 039c0d3eef4..4343deb4ee3 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -477,15 +477,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
 			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* fetch the relation's relcache entry */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c4cdae5903c..48f078309f7 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -757,6 +758,13 @@ typedef struct TableAmRoutine
 	 */
 	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
 
+	/*
+	 * Parse table AM-specific index options.  Useful for a table AM to define
+	 * new index options or override existing ones.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1971,6 +1979,24 @@ tableam_reloptions(const TableAmRoutine *tableam, char relkind,
 	return tableam->reloptions(relkind, reloptions, validate);
 }
 
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
+/*
+ * Parse index options.  Gives table AM a chance to override index-specific
+ * options defined in 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
-- 
2.39.3 (Apple Git-145)

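Here is a sketch of what a non-heap indexoptions callback could look like.  my_am_apply_index_defaults() is an assumption; the fallback to index_reloptions() mirrors heapam_indexoptions() above.

#include "postgres.h"

#include "access/amapi.h"
#include "access/reloptions.h"
#include "access/tableam.h"

/* assumed AM-specific hook for injecting or overriding index options */
extern bytea *my_am_apply_index_defaults(char relkind, bytea *options);

static bytea *
my_am_indexoptions(amoptions_function amoptions, char relkind,
				   Datum reloptions, bool validate)
{
	bytea	   *options;

	/* Start from the index AM's own parsing, as the heap implementation does. */
	options = index_reloptions(amoptions, reloptions, validate);

	/* Then let the table AM adjust the result (assumed helper). */
	return my_am_apply_index_defaults(relkind, options);
}
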
0006-Let-table-AM-insertion-methods-control-index-inse-v7.patchapplication/octet-stream; name=0006-Let-table-AM-insertion-methods-control-index-inse-v7.patchDownload
From e71b52cd0e0aeaf1c15052394b3ced1203c167a1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH 6/8] Let table AM insertion methods control index insertion

Previously, the executor inserted index tuples unconditionally after calling
the table AM interface methods tuple_insert() and multi_insert().  This commit
introduces a new parameter, insert_indexes, for these two methods.  Setting
'*insert_indexes' to true preserves the current behavior.  Setting it to false
indicates that the table AM handles index inserts itself and doesn't want the
caller to do that.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 24 +++++++++++++++++-------
 12 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2f6527df0dc..b661d9811eb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2088,7 +2088,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2437,6 +2438,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c029ce6ab2..09429fd9ef0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -245,7 +245,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -261,6 +261,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8d3675be959..805d222cebc 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -273,9 +273,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58a..4d404f22f83 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 8908a440e19..b6736369771 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -397,6 +397,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -416,7 +417,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -425,7 +427,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1265,11 +1267,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..afd3dace079 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -578,6 +578,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -594,7 +595,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6d09b755564..9ec13d09846 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -476,6 +476,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -490,7 +491,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eccd1131a5c..6d16a9a402a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6356,8 +6356,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 0cad843fb69..db685473fc0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -509,6 +509,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -523,9 +524,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8e1c8f697c6..a64e37e9af9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1040,13 +1040,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 91fbc950343..32a3fbce961 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -282,7 +282,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index db0559788a4..7f97af067f0 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -514,7 +514,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -529,7 +530,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1400,6 +1402,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * The table AM implementation of tuple_insert should set `*insert_indexes` to
+ * true if it expects the caller to insert the relevant index tuples (as the
+ * heap implementation does).  It should set `*insert_indexes` to false if it
+ * handles index inserts itself and doesn't want the caller to do them.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot. For instance, the source slot may be VirtualTupleTableSlot, but
  * the result slot may correspond to the table AM. On return the slot's
@@ -1409,10 +1416,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1470,10 +1478,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2168,7 +2177,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.3 (Apple Git-145)

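To show the new contract in isolation, here is a minimal sketch of a tuple_insert() callback for an AM that maintains its indexes internally (say, an index-organized table).  my_am_insert_row() is an assumption; the *insert_indexes contract comes from the patch above.

#include "postgres.h"

#include "access/tableam.h"

/* assumed AM-specific insertion that also maintains the AM's own indexes */
extern void my_am_insert_row(Relation rel, TupleTableSlot *slot, CommandId cid,
							 int options, struct BulkInsertStateData *bistate);

static TupleTableSlot *
my_am_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
				   int options, struct BulkInsertStateData *bistate,
				   bool *insert_indexes)
{
	/* The AM updates its own index structures as part of the insertion. */
	my_am_insert_row(rel, slot, cid, options, bistate);

	/*
	 * Tell the executor not to call ExecInsertIndexTuples() for this tuple;
	 * index maintenance has already been taken care of by the AM.
	 */
	*insert_indexes = false;

	return slot;
}
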
0007-Introduce-RowRefType-which-describes-the-table-ro-v7.patchapplication/octet-stream; name=0007-Introduce-RowRefType-which-describes-the-table-ro-v7.patchDownload
From e452a3f9afa7877b58f733ee38b72b5186dd1bf3 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH 7/8] Introduce RowRefType, which describes the table row
 identifier

Currently, a table row can be identified either by ctid or by the whole row
(for foreign tables).  But the row identifier is mixed together with the lock
mode in RowMarkType.  This commit separates the row identifier type into a
separate enum, RowRefType.
---
 contrib/postgres_fdw/postgres_fdw.c    |  2 +-
 doc/src/sgml/fdwhandler.sgml           | 22 ++++++++----
 src/backend/executor/execMain.c        | 35 ++++++++++++--------
 src/backend/optimizer/plan/planner.c   | 33 +++++++++++-------
 src/backend/optimizer/prep/preptlist.c |  4 +--
 src/backend/optimizer/util/inherit.c   | 27 +++++++--------
 src/include/foreign/fdwapi.h           |  3 +-
 src/include/nodes/execnodes.h          |  4 +++
 src/include/nodes/plannodes.h          | 46 ++++++++++++++++----------
 src/include/optimizer/planner.h        |  3 +-
 src/tools/pgindent/typedefs.list       |  1 +
 11 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 142dcfc9957..b0000790292 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -7636,7 +7636,7 @@ make_tuple_from_result_row(PGresult *res,
 	 * If we have a CTID to return, install it in both t_self and t_ctid.
 	 * t_self is the normal place, but if the tuple is converted to a
 	 * composite Datum, t_self will be lost; setting t_ctid allows CTID to be
-	 * preserved during EvalPlanQual re-evaluations (see ROW_MARK_COPY code).
+	 * preserved during EvalPlanQual re-evaluations (see ROW_REF_COPY code).
 	 */
 	if (ctid)
 		tuple->t_self = tuple->t_data->t_ctid = *ctid;
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index b80320504d6..51bc0e1029a 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1160,13 +1160,16 @@ ExecForeignTruncate(List *rels,
 <programlisting>
 RowMarkType
 GetForeignRowMarkType(RangeTblEntry *rte,
-                      LockClauseStrength strength);
+                      LockClauseStrength strength,
+                      RowRefType *refType);
 </programlisting>
 
      Report which row-marking option to use for a foreign table.
-     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table
-     and <literal>strength</literal> describes the lock strength requested by the
-     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any.  The result must be
+     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table;
+     <literal>strength</literal> describes the lock strength requested by the
+     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any;
+     <literal>refType</literal> points to a <literal>RowRefType</literal> value
+     specifying the way to reference the row.  The result must be
      a member of the <literal>RowMarkType</literal> enum type.
     </para>
 
@@ -1177,9 +1180,16 @@ GetForeignRowMarkType(RangeTblEntry *rte,
      or <command>DELETE</command>.
     </para>
 
+    <para>
+     If the value pointed to by <literal>refType</literal> is not changed,
+     the <literal>ROW_REF_COPY</literal> option is used.
+    </para>
+
     <para>
      If the <function>GetForeignRowMarkType</function> pointer is set to
-     <literal>NULL</literal>, the <literal>ROW_MARK_COPY</literal> option is always used.
+     <literal>NULL</literal>, the <literal>ROW_MARK_REFERENCE</literal> option
+     for the row mark type and the <literal>ROW_REF_COPY</literal> option for
+     the row reference type are always used.
      (This implies that <function>RefetchForeignRow</function> will never be called,
      so it need not be provided either.)
     </para>
@@ -1213,7 +1223,7 @@ RefetchForeignRow(EState *estate,
      defined by <literal>erm-&gt;markType</literal>, which is the value
      previously returned by <function>GetForeignRowMarkType</function>.
      (<literal>ROW_MARK_REFERENCE</literal> means to just re-fetch the tuple
-     without acquiring any lock, and <literal>ROW_MARK_COPY</literal> will
+     without acquiring any lock.  <literal>ROW_MARK_COPY</literal> should
      never be seen by this routine.)
     </para>
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7eb1f7d0209..3b03f03a98d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -875,22 +875,19 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			/* get relation's OID (will produce InvalidOid if subquery) */
 			relid = exec_rt_fetch(rc->rti, estate)->relid;
 
-			/* open relation, if we need to access it for this mark type */
-			switch (rc->markType)
+			/*
+			 * Open relation, if we need to access it for this reference type.
+			 */
+			switch (rc->refType)
 			{
-				case ROW_MARK_EXCLUSIVE:
-				case ROW_MARK_NOKEYEXCLUSIVE:
-				case ROW_MARK_SHARE:
-				case ROW_MARK_KEYSHARE:
-				case ROW_MARK_REFERENCE:
+				case ROW_REF_TID:
 					relation = ExecGetRangeTableRelation(estate, rc->rti);
 					break;
-				case ROW_MARK_COPY:
-					/* no physical table access is required */
+				case ROW_REF_COPY:
 					relation = NULL;
 					break;
 				default:
-					elog(ERROR, "unrecognized markType: %d", rc->markType);
+					elog(ERROR, "unrecognized refType: %d", rc->refType);
 					relation = NULL;	/* keep compiler quiet */
 					break;
 			}
@@ -906,6 +903,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			erm->refType = rc->refType;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -2402,10 +2400,14 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
 
 	aerm->rowmark = erm;
 
-	/* Look up the resjunk columns associated with this rowmark */
-	if (erm->markType != ROW_MARK_COPY)
+	/*
+	 * Look up the resjunk columns associated with this rowmark's reference
+	 * type.
+	 */
+	if (erm->refType != ROW_REF_COPY)
 	{
 		/* need ctid for all methods other than COPY */
+		Assert(erm->refType == ROW_REF_TID);
 		snprintf(resname, sizeof(resname), "ctid%u", erm->rowmarkId);
 		aerm->ctidAttNo = ExecFindJunkAttributeInTlist(targetlist,
 													   resname);
@@ -2656,7 +2658,12 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		}
 	}
 
-	if (erm->markType == ROW_MARK_REFERENCE)
+	/*
+	 * For a non-locked relation, the row mark type should be
+	 * ROW_MARK_REFERENCE.  Fetch the tuple according to the reference type.
+	 */
+	Assert(erm->markType == ROW_MARK_REFERENCE);
+	if (erm->refType == ROW_REF_TID)
 	{
 		Assert(erm->relation != NULL);
 
@@ -2709,7 +2716,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 	}
 	else
 	{
-		Assert(erm->markType == ROW_MARK_COPY);
+		Assert(erm->refType == ROW_REF_COPY);
 
 		/* fetch the whole-row Var for the relation */
 		datum = ExecGetJunkAttribute(epqstate->origslot,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 38d070fa004..4b9c9deee84 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2309,7 +2309,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		 * Ignore RowMarkClauses for subqueries; they aren't real tables and
 		 * can't support true locking.  Subqueries that got flattened into the
 		 * main query should be ignored completely.  Any that didn't will get
-		 * ROW_MARK_COPY items in the next loop.
+		 * ROW_REF_COPY items in the next loop.
 		 */
 		if (rte->rtekind != RTE_RELATION)
 			continue;
@@ -2319,8 +2319,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2344,8 +2344,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2357,29 +2357,38 @@ preprocess_rowmarks(PlannerInfo *root)
 }
 
 /*
- * Select RowMarkType to use for a given table
+ * Select RowMarkType and RowRefType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
-		/* If it's not a table at all, use ROW_MARK_COPY */
-		return ROW_MARK_COPY;
+		/*
+		 * If it's not a table at all, use ROW_MARK_REFERENCE and
+		 * ROW_REF_COPY.
+		 */
+		*refType = ROW_REF_COPY;
+		return ROW_MARK_REFERENCE;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
 
+		/* Set row reference type as ROW_REF_COPY by default */
+		*refType = ROW_REF_COPY;
+
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength);
-		/* Otherwise, use ROW_MARK_COPY by default */
-		return ROW_MARK_COPY;
+			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+		/* Otherwise, use ROW_MARK_REFERENCE by default */
+		return ROW_MARK_REFERENCE;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = ROW_REF_TID;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 7698bfa1a58..4599b0dc761 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 5c7acf8a901..b4b076d1cb1 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -91,7 +91,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +131,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +239,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +266,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +441,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -580,8 +580,9 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&childrc->refType);
+		childrc->allRefTypes = (1 << childrc->refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +593,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 0968e0a01ec..868e04e813e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -129,7 +129,8 @@ typedef TupleTableSlot *(*IterateDirectModify_function) (ForeignScanState *node)
 typedef void (*EndDirectModify_function) (ForeignScanState *node);
 
 typedef RowMarkType (*GetForeignRowMarkType_function) (RangeTblEntry *rte,
-													   LockClauseStrength strength);
+													   LockClauseStrength strength,
+													   RowRefType *refType);
 
 typedef void (*RefetchForeignRow_function) (EState *estate,
 											ExecRowMark *erm,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1774c56ae31..a1ccf6e6811 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -455,6 +455,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row reference type for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -750,6 +753,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row reference type for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7f3db5105db..d7f9c389dac 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1311,16 +1311,8 @@ typedef struct Limit
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we have to uniquely
  * identify all the source rows, not only those from the target relations, so
- * that we can perform EvalPlanQual rechecking at need.  For plain tables we
- * can just fetch the TID, much as for a target relation; this case is
- * represented by ROW_MARK_REFERENCE.  Otherwise (for example for VALUES or
- * FUNCTION scans) we have to copy the whole row value.  ROW_MARK_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_MARK_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_MARK_REFERENCE instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
+ * that we can perform EvalPlanQual rechecking at need.  ROW_MARK_REFERENCE
+ * represents this case.
  */
 typedef enum RowMarkType
 {
@@ -1329,9 +1321,29 @@ typedef enum RowMarkType
 	ROW_MARK_SHARE,				/* obtain shared tuple lock */
 	ROW_MARK_KEYSHARE,			/* obtain keyshare tuple lock */
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
-	ROW_MARK_COPY,				/* physically copy the row value */
 } RowMarkType;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * In future we may allow more types of row identifiers.
+ */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
@@ -1340,8 +1352,7 @@ typedef enum RowMarkType
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we create a separate
  * PlanRowMark node for each non-target relation in the query.  Relations that
- * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE (if
- * regular tables or supported foreign tables) or ROW_MARK_COPY (if not).
+ * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE.
  *
  * Initially all PlanRowMarks have rti == prti and isParent == false.
  * When the planner discovers that a relation is the root of an inheritance
@@ -1351,16 +1362,16 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
  * information sufficient to identify the locked or fetched rows.  When
- * markType != ROW_MARK_COPY, these columns are named
+ * refType != ROW_REF_COPY, these columns are named
  *		tableoid%u			OID of table
  *		ctid%u				TID of row
  * The tableoid column is only present for an inheritance hierarchy.
- * When markType == ROW_MARK_COPY, there is instead a single column named
+ * When refType == ROW_REF_COPY, there is instead a single column named
  *		wholerow%u			whole-row value of relation
  * (An inheritance hierarchy could have all three resjunk output columns,
  * if some children use a different markType than others.)
@@ -1381,7 +1392,8 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	RowRefType	refType;		/* see enum above */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index e1d79ffdf3c..98fc796d054 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..6ce0a586bf1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2433,6 +2433,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.3 (Apple Git-145)

0008-Introduce-RowID-bytea-tuple-identifier-v7.patch (application/octet-stream)
From dbd598b017c19b9b0e4a4d9e95c6916c02ae81a1 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 21:00:37 +0200
Subject: [PATCH 8/8] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: the tuple identifier (tid)
and a whole-row copy.  The tuple identifier used for regular tables consists
of a 32-bit block number and a 16-bit offset.  This seems limited for some
use cases, in particular index-organized tables.  The whole-row copy is used
to identify tuples in FDWs.  That could be extended to regular tables, but it
seems overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose the way its tuples are identified by providing the new
get_row_ref_type() API function.  The new system attribute
RowIdAttributeNumber holds the RowID when appropriate.  Table AM methods now
accept Datum arguments as tuple identifiers.  Those Datums can be either tid
or bytea depending on what table_get_row_ref_type() says.  The ModifyTable
node and triggers are aware of RowID.  IndexScan and BitmapScan nodes are not
aware of RowIDs and expect tids.  Table AMs that use RowIDs are supposed to
redefine those nodes using hooks.
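
As an illustration only (the "myam" name below is hypothetical and not part
of this patch), a table AM that identifies tuples by bytea RowIDs would
implement the new callback along these lines:

    #include "access/tableam.h"

    /* Hypothetical table AM that addresses tuples by bytea RowIDs */
    static RowRefType
    myam_get_row_ref_type(Relation rel)
    {
        return ROW_REF_ROWID;
    }

    static const TableAmRoutine myam_methods = {
        .type = T_TableAmRoutine,
        /* ... other table AM callbacks ... */
        .get_row_ref_type = myam_get_row_ref_type,
    };

The heap AM keeps returning ROW_REF_TID from this callback, so TID-based
behavior is unchanged for regular tables.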
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |   9 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 145 ++++++++-----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/plan/planner.c     |  11 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 ++
 src/backend/parser/parse_relation.c      |  13 ++
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  58 ++++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/parsenodes.h           |   2 +
 src/include/nodes/plannodes.h            |  21 --
 src/include/nodes/primnodes.h            |  22 ++
 src/include/utils/tuplestore.h           |   3 +
 26 files changed, 548 insertions(+), 175 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f71f1854e0a..7bfa2a2fc44 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -984,7 +984,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 5c89fbbef83..7b52c66939c 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -755,6 +755,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09429fd9ef0..13330ca7159 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -46,7 +46,7 @@
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -184,7 +184,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -193,7 +193,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -358,7 +358,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -405,7 +405,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -585,12 +585,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -613,7 +614,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -630,7 +631,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -638,6 +639,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -685,7 +687,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -701,7 +703,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -709,6 +711,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2376,6 +2379,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use the TID row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -2954,6 +2966,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222cebc..caa79c6eddd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -300,7 +300,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -356,7 +356,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74a..8765becf986 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1626,7 +1626,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 84494c4b81f..4f83e521a35 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -76,7 +76,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2682,7 +2682,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2696,7 +2696,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2924,7 +2924,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2944,7 +2944,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3261,7 +3261,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3286,7 +3286,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3382,8 +3384,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3589,18 +3591,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* total event size,
+																		 * high-order bits */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3652,6 +3660,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -4004,14 +4015,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4044,7 +4075,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4063,6 +4094,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4103,7 +4135,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4263,6 +4314,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4294,15 +4346,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4334,13 +4388,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4376,13 +4443,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4556,7 +4633,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4659,6 +4736,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6051,6 +6129,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6144,6 +6224,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6154,6 +6240,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6173,6 +6267,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6188,10 +6290,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6229,20 +6378,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6387,6 +6522,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6408,7 +6557,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24a3990a30a..c8ce4d45ff4 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4888,7 +4888,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3b03f03a98d..514d9b28c48 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -867,13 +867,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/*
 			 * Open relation, if we need to access it for this reference type.
@@ -903,7 +905,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
-			erm->refType = rc->refType;
+			erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1267,6 +1269,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2708,7 +2711,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index db685473fc0..aad266a19ff 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -250,7 +250,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -434,7 +435,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -571,7 +573,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -638,7 +641,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 41754ddfea9..2d3ad904a64 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a64e37e9af9..90eeb99b2cd 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -124,7 +124,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldslot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -141,13 +141,13 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple oldtuple,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static TupleTableSlot *ExecMergeMatched(ModifyTableContext *context,
 										ResultRelInfo *resultRelInfo,
-										ItemPointer tupleid,
+										Datum tupleid,
 										HeapTuple oldtuple,
 										bool canSetTag,
 										bool *matched);
@@ -1221,7 +1221,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1252,7 +1252,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1280,7 +1280,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1361,7 +1361,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldslot,
 		   bool processReturning,
@@ -1558,7 +1558,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldslot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1575,7 +1575,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1624,7 +1624,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1783,7 +1783,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1860,7 +1860,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -2014,7 +2014,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldslot)
 {
@@ -2064,7 +2064,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2154,7 +2154,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldslot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2208,15 +2208,19 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
 		/*
 		 * Specify that we need to lock and fetch the last tuple version for
 		 * EPQ on appropriate transaction isolation levels if the tuple isn't
 		 * locked already.
 		 */
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2326,7 +2330,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldslot);
 
 	/* Process RETURNING if present */
@@ -2358,7 +2362,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2414,7 +2430,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2433,7 +2449,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag)
+		  Datum tupleid, HeapTuple oldtuple, bool canSetTag)
 {
 	TupleTableSlot *rslot = NULL;
 	bool		matched;
@@ -2482,7 +2498,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL || oldtuple != NULL;
+	matched = DatumGetPointer(tupleid) != NULL || oldtuple != NULL;
 	if (matched)
 		rslot = ExecMergeMatched(context, resultRelInfo, tupleid, oldtuple,
 								 canSetTag, &matched);
@@ -2523,7 +2539,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TupleTableSlot *
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag,
+				 Datum tupleid, HeapTuple oldtuple, bool canSetTag,
 				 bool *matched)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -2559,7 +2575,7 @@ ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * the tupleid of the target row, or an old tuple from the target wholerow
 	 * junk attr.
 	 */
-	Assert(tupleid != NULL || oldtuple != NULL);
+	Assert(DatumGetPointer(tupleid) != NULL || oldtuple != NULL);
 	if (oldtuple != NULL)
 		ExecForceStoreHeapTuple(oldtuple, resultRelInfo->ri_oldTupleSlot,
 								false);
@@ -2573,7 +2589,7 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-	if (tupleid != NULL)
+	if (DatumGetPointer(tupleid) != NULL)
 	{
 		if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 										   tupleid,
@@ -2682,7 +2698,7 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2718,7 +2734,7 @@ lmerge_matched:
 
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2842,9 +2858,13 @@ lmerge_matched:
 								return NULL;
 							}
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 							{
 								*matched = false;
@@ -2871,11 +2891,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3413,10 +3429,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3465,6 +3481,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3515,7 +3533,7 @@ ExecModifyTable(PlanState *pstate)
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 					slot = ExecMerge(&context, node->resultRelInfo,
-									 NULL, NULL, node->canSetTag);
+									 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 					/*
 					 * If we got a RETURNING result, return it to the caller.
@@ -3559,7 +3577,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3602,7 +3621,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3617,9 +3636,25 @@ ExecModifyTable(PlanState *pstate)
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3659,7 +3694,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3723,6 +3758,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3757,6 +3793,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -4000,10 +4039,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4313,6 +4362,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 864a9013b62..f4a124ac4eb 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -377,7 +377,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4b9c9deee84..ee648bedd4a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2376,19 +2376,24 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
+		RowMarkType result = ROW_MARK_REFERENCE;
 
 		/* Set row reference type as ROW_REF_COPY by default */
 		*refType = ROW_REF_COPY;
 
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+			result = fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+
+		/* XXX: should we fill this before? */
+		rte->reftype = *refType;
+
 		/* Otherwise, use ROW_MARK_REFERENCE by default */
-		return ROW_MARK_REFERENCE;
+		return result;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
-		*refType = ROW_REF_TID;
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 4599b0dc761..3620be5b52c 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224a..83c08bbd0e1 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -895,17 +896,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType = ROW_REF_TID;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index b4b076d1cb1..4a5a167d833 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -282,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -485,6 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 427b7325db8..2c80e010f2a 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1503,6 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1588,6 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1763,6 +1767,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2081,6 +2086,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2156,6 +2162,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2252,6 +2259,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2332,6 +2340,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2494,6 +2503,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3257,6 +3267,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 9fd05b15e73..7a0fdbe3f40 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1854,6 +1854,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 947a868e569..d3a41533552 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces the tuple to be stored in the
+ * slot.  Thus, it can work with slot types other than minimal tuple.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index e88dec71ee9..867b5eb489e 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f97af067f0..4ad6ebb1043 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -476,7 +476,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -535,7 +535,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -546,7 +546,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -559,7 +559,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -702,6 +702,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * After the call all memory related to rd_amcache must be freed,
@@ -1284,9 +1289,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1294,7 +1299,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1306,7 +1311,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1493,7 +1499,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1522,12 +1528,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1541,7 +1547,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1578,13 +1584,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1596,7 +1602,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1625,12 +1631,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1916,6 +1922,22 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  Returns ROW_REF_TID when the table AM
+ * routine is not accessible.  This happens during catalog initialization;
+ * all catalog tables are known to use heap.
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+		return ROW_REF_COPY;
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index cb968d03ecd..c16e6b6e5a0 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b89baef95d3..04d8cef6c68 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1089,6 +1089,8 @@ typedef struct RangeTblEntry
 	Index		perminfoindex pg_node_attr(query_jumble_ignore);
 	/* sampling info, or NULL */
 	struct TableSampleClause *tablesample;
+	/* row identifier for relation */
+	RowRefType	reftype;
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d7f9c389dac..d850411aa95 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1323,27 +1323,6 @@ typedef enum RowMarkType
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
 } RowMarkType;
 
-/*
- * RowRefType -
- *	  enums for types of row identifiers
- *
- * For plain tables we can just fetch the TID, much as for a target relation;
- * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
- * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_REF_TID instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
- * In future we may allow more types of row identifiers.
- */
-typedef enum RowRefType
-{
-	ROW_REF_TID,				/* Item pointer (block, offset) */
-	ROW_REF_COPY				/* Full row copy */
-} RowRefType;
-
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 376f67e6a5f..84cf7837de1 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2211,4 +2211,26 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * In future we may allow more types of row identifiers.
+ */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 419613c17ba..cf291a0d17a 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -70,6 +70,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.3 (Apple Git-145)

#30Japin Li
japinli@hotmail.com
In reply to: Alexander Korotkov (#29)
Re: Table AM Interface Enhancements

On Thu, 28 Mar 2024 at 21:26, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi Pavel!

Revised patchset is attached.

On Thu, Mar 28, 2024 at 3:12 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

The other extensibility that seems quite clear and uncontroversial to me is 0006.

It simply shifts the decision on whether a tuple insert should also insert into the related indexes down to the table AM level. It doesn't change the current heap insert behavior, so it's safe for the existing heap access method. But new table access methods could redefine this (only for tables created with those AMs) and perform index inserts independently of ExecInsertIndexTuples inside their own implementations of the tuple_insert/tuple_multi_insert methods.

I'd propose changing the comment:

1405 * This function sets `*insert_indexes` to true if expects caller to return
1406 * the relevant index tuples. If `*insert_indexes` is set to false, then
1407 * this function cares about indexes itself.

in the following way

Tableam implementation of tuple_insert should set `*insert_indexes` to true
if it expects the caller to insert the relevant index tuples (as in heap
implementation). It should set `*insert_indexes` to false if it cares
about index inserts itself and doesn't want the caller to do index inserts.

Changed as you proposed.
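
To make the new contract a bit more concrete, a table AM that maintains its
own indexes could look roughly like this (just a sketch; myam_store_tuple()
is a made-up internal helper, not anything in the patchset):

static TupleTableSlot *
myam_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
				  int options, struct BulkInsertStateData *bistate,
				  bool *insert_indexes)
{
	/* store the tuple using this AM's own (hypothetical) machinery */
	myam_store_tuple(rel, slot, cid, options, bistate);

	/*
	 * The AM has already updated its indexes, so tell the executor to skip
	 * ExecInsertIndexTuples() for this tuple.
	 */
	*insert_indexes = false;

	return slot;
}

The heap implementation keeps setting *insert_indexes = true, so the executor
behaves exactly as before for heap tables.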

Maybe the commit message should also be reformulated to describe more clearly who should do what.

Done.

Also, I removed extra includes in 0001 as you proposed and edited the
commit message in 0002.

I think that, with the rebase and the corrections in the comments/commit message, patch 0006 is ready to be committed.

I'm going to push 0001, 0002 and 0006 if no objections.

Thanks for updating the patches. Here are some minor suggestions.

0003

+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,

I'm not entirely certain whether the "inline" keyword has any effect.

0004

+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+                                       Datum reloptions, bool validate)
+{
+       return index_reloptions(amoptions, reloptions, validate);
+}

Could you please explain why we are not verifying the relkind like
heapam_reloptions()?
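
For instance, I would have expected something roughly like the following
(only a sketch to illustrate the question; maybe there is a reason to skip
the check):

static bytea *
heapam_indexoptions(amoptions_function amoptions, char relkind,
					Datum reloptions, bool validate)
{
	switch (relkind)
	{
		case RELKIND_INDEX:
		case RELKIND_PARTITIONED_INDEX:
			return index_reloptions(amoptions, reloptions, validate);
		default:
			/* other relkinds should never reach the index options path */
			return NULL;
	}
}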

- case RELKIND_VIEW:
case RELKIND_MATVIEW:
+ case RELKIND_VIEW:
case RELKIND_PARTITIONED_TABLE:

I think this change is unnecessary.

+                       {
+                               Form_pg_class classForm;
+                               HeapTuple       classTup;
+
+                               /* fetch the relation's relcache entry */
+                               if (relation->rd_index->indrelid >= FirstNormalObjectId)
+                               {
+                                       classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+                                       classForm = (Form_pg_class) GETSTRUCT(classTup);
+                                       if (classForm->relam >= FirstNormalObjectId)
+                                               tableam = GetTableAmRoutineByAmOid(classForm->relam);
+                                       else
+                                               tableam = GetHeapamTableAmRoutine();
+                                       heap_freetuple(classTup);
+                               }
+                               else
+                               {
+                                       tableam = GetHeapamTableAmRoutine();
+                               }
+                               amoptsfn = relation->rd_indam->amoptions;
+                       }

- We can reduce the indentation by moving the classForm and classTup
declarations into the if branch (roughly as sketched below).
- Perhaps we could remove the braces of the else branch to keep the code
style consistent.
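
For example, something along these lines (untested, only to show the shape):

		case RELKIND_INDEX:
		case RELKIND_PARTITIONED_INDEX:
			if (relation->rd_index->indrelid >= FirstNormalObjectId)
			{
				HeapTuple	classTup;
				Form_pg_class classForm;

				/* look up the parent table's pg_class entry to find its AM */
				classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
				classForm = (Form_pg_class) GETSTRUCT(classTup);
				if (classForm->relam >= FirstNormalObjectId)
					tableam = GetTableAmRoutineByAmOid(classForm->relam);
				else
					tableam = GetHeapamTableAmRoutine();
				heap_freetuple(classTup);
			}
			else
				tableam = GetHeapamTableAmRoutine();
			amoptsfn = relation->rd_indam->amoptions;
			break;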

--
Regards,
Japin Li

#31Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Japin Li (#30)
8 attachment(s)
Re: Table AM Interface Enhancements

I found that after yesterday's e2395cdbe83a, 0002 doesn't apply.
Rebased the whole patchset.

Pavel

Attachments:

v8-0006-Let-table-AM-insertion-methods-control-index-inse.patchapplication/octet-stream; name=v8-0006-Let-table-AM-insertion-methods-control-index-inse.patchDownload
From f3e292a96d32560248eccbdef3db46f249d02ea6 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 01:02:39 +0300
Subject: [PATCH v8 6/8] Let table AM insertion methods control index insertion

Previously, the executor inserted index tuples unconditionally after calling
the table AM interface methods tuple_insert() and multi_insert().  This commit
introduces the new parameter insert_indexes for these two methods.  Setting
'*insert_indexes' to true preserves the current behavior.  Setting it to false
indicates that the table AM handles index inserts itself and doesn't want the
caller to do that.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent
---
 src/backend/access/heap/heapam.c         |  4 +++-
 src/backend/access/heap/heapam_handler.c |  4 +++-
 src/backend/access/table/tableam.c       |  6 ++++--
 src/backend/catalog/indexing.c           |  4 +++-
 src/backend/commands/copyfrom.c          | 13 +++++++++----
 src/backend/commands/createas.c          |  4 +++-
 src/backend/commands/matview.c           |  4 +++-
 src/backend/commands/tablecmds.c         |  6 +++++-
 src/backend/executor/execReplication.c   |  6 ++++--
 src/backend/executor/nodeModifyTable.c   |  6 ++++--
 src/include/access/heapam.h              |  2 +-
 src/include/access/tableam.h             | 24 +++++++++++++++++-------
 12 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2f6527df0d..b661d9811e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2088,7 +2088,8 @@ heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveF
  */
 void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
-				  CommandId cid, int options, BulkInsertState bistate)
+				  CommandId cid, int options, BulkInsertState bistate,
+				  bool *insert_indexes)
 {
 	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
@@ -2437,6 +2438,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		slots[i]->tts_tid = heaptuples[i]->t_self;
 
 	pgstat_count_heap_insert(relation, ntuples);
+	*insert_indexes = true;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c029ce6ab..09429fd9ef 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -245,7 +245,7 @@ heapam_tuple_satisfies_snapshot(Relation rel, TupleTableSlot *slot,
 
 static TupleTableSlot *
 heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
-					int options, BulkInsertState bistate)
+					int options, BulkInsertState bistate, bool *insert_indexes)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -261,6 +261,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	if (shouldFree)
 		pfree(tuple);
 
+	*insert_indexes = true;
+
 	return slot;
 }
 
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 8d3675be95..805d222ceb 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -273,9 +273,11 @@ table_tuple_get_latest_tid(TableScanDesc scan, ItemPointer tid)
  * default command ID and not allowing access to the speedup options.
  */
 void
-simple_table_tuple_insert(Relation rel, TupleTableSlot *slot)
+simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+						  bool *insert_indexes)
 {
-	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL);
+	table_tuple_insert(rel, slot, GetCurrentCommandId(true), 0, NULL,
+					   insert_indexes);
 }
 
 /*
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index d0d1abda58..4d404f22f8 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -273,12 +273,14 @@ void
 CatalogTuplesMultiInsertWithInfo(Relation heapRel, TupleTableSlot **slot,
 								 int ntuples, CatalogIndexState indstate)
 {
+	bool		insertIndexes;
+
 	/* Nothing to do */
 	if (ntuples <= 0)
 		return;
 
 	heap_multi_insert(heapRel, slot, ntuples,
-					  GetCurrentCommandId(true), 0, NULL);
+					  GetCurrentCommandId(true), 0, NULL, &insertIndexes);
 
 	/*
 	 * There is no equivalent to heap_multi_insert for the catalog indexes, so
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 8908a440e1..b673636977 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -397,6 +397,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 		bool		line_buf_valid = cstate->line_buf_valid;
 		uint64		save_cur_lineno = cstate->cur_lineno;
 		MemoryContext oldcontext;
+		bool		insertIndexes;
 
 		Assert(buffer->bistate != NULL);
 
@@ -416,7 +417,8 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 						   nused,
 						   mycid,
 						   ti_options,
-						   buffer->bistate);
+						   buffer->bistate,
+						   &insertIndexes);
 		MemoryContextSwitchTo(oldcontext);
 
 		for (i = 0; i < nused; i++)
@@ -425,7 +427,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			 * If there are any indexes, update them for all the inserted
 			 * tuples, and run AFTER ROW INSERT triggers.
 			 */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			{
 				List	   *recheckIndexes;
 
@@ -1265,11 +1267,14 @@ CopyFrom(CopyFromState cstate)
 					}
 					else
 					{
+						bool		insertIndexes;
+
 						/* OK, store the tuple and create index entries for it */
 						table_tuple_insert(resultRelInfo->ri_RelationDesc,
-										   myslot, mycid, ti_options, bistate);
+										   myslot, mycid, ti_options, bistate,
+										   &insertIndexes);
 
-						if (resultRelInfo->ri_NumIndices > 0)
+						if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 							recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 																   myslot,
 																   estate,
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc5..afd3dace07 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -578,6 +578,7 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	bool		insertIndexes;
 
 	/* Nothing to insert if WITH NO DATA is specified. */
 	if (!myState->into->skipData)
@@ -594,7 +595,8 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 						   slot,
 						   myState->output_cid,
 						   myState->ti_options,
-						   myState->bistate);
+						   myState->bistate,
+						   &insertIndexes);
 	}
 
 	/* We know this is a newly created relation, so there are no indexes */
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6d09b75556..9ec13d0984 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -476,6 +476,7 @@ static bool
 transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_transientrel *myState = (DR_transientrel *) self;
+	bool		insertIndexes;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -490,7 +491,8 @@ transientrel_receive(TupleTableSlot *slot, DestReceiver *self)
 					   slot,
 					   myState->output_cid,
 					   myState->ti_options,
-					   myState->bistate);
+					   myState->bistate,
+					   &insertIndexes);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 313ca1ae81..bb69875ea7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6360,8 +6360,12 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/* Write the tuple out to the new relation */
 			if (newrel)
+			{
+				bool		insertIndexes;
+
 				table_tuple_insert(newrel, insertslot, mycid,
-								   ti_options, bistate);
+								   ti_options, bistate, &insertIndexes);
+			}
 
 			ResetExprContext(econtext);
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 0cad843fb6..db685473fc 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -509,6 +509,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
+		bool		insertIndexes;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -523,9 +524,10 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		/* OK, store the tuple and create index entries for it */
-		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
+		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot,
+								  &insertIndexes);
 
-		if (resultRelInfo->ri_NumIndices > 0)
+		if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
 												   NULL, NIL, false);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8e1c8f697c..a64e37e9af 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1040,13 +1040,15 @@ ExecInsert(ModifyTableContext *context,
 		}
 		else
 		{
+			bool		insertIndexes;
+
 			/* insert the tuple normally */
 			slot = table_tuple_insert(resultRelationDesc, slot,
 									  estate->es_output_cid,
-									  0, NULL);
+									  0, NULL, &insertIndexes);
 
 			/* insert index entries for tuple */
-			if (resultRelInfo->ri_NumIndices > 0)
+			if (insertIndexes && resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
 													   false, NULL, NIL,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 91fbc95034..32a3fbce96 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -282,7 +282,7 @@ extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
-							  BulkInsertState bistate);
+							  BulkInsertState bistate, bool *insert_indexes);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, int options,
 							 struct TM_FailureData *tmfd, bool changingPart,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index db0559788a..7f97af067f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -514,7 +514,8 @@ typedef struct TableAmRoutine
 	/* see table_tuple_insert() for reference about parameters */
 	TupleTableSlot *(*tuple_insert) (Relation rel, TupleTableSlot *slot,
 									 CommandId cid, int options,
-									 struct BulkInsertStateData *bistate);
+									 struct BulkInsertStateData *bistate,
+									 bool *insert_indexes);
 
 	/* see table_tuple_insert_with_arbiter() for reference about parameters */
 	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
@@ -529,7 +530,8 @@ typedef struct TableAmRoutine
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
-								 CommandId cid, int options, struct BulkInsertStateData *bistate);
+								 CommandId cid, int options, struct BulkInsertStateData *bistate,
+								 bool *insert_indexes);
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
@@ -1400,6 +1402,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  * behavior) is also just passed through to RelationGetBufferForTuple. If
  * `bistate` is provided, table_finish_bulk_insert() needs to be called.
  *
+ * Tableam implementation of tuple_insert should set `*insert_indexes` to true
+ * if it expects the caller to insert the relevant index tuples (as the heap
+ * implementation does).  It should set `*insert_indexes` to false if it cares
+ * about index inserts itself and doesn't want the caller to do index inserts.
+ *
  * Returns the slot containing the inserted tuple, which may differ from the
  * given slot. For instance, the source slot may be VirtualTupleTableSlot, but
  * the result slot may correspond to the table AM. On return the slot's
@@ -1409,10 +1416,11 @@ table_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
  */
 static inline TupleTableSlot *
 table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
-				   int options, struct BulkInsertStateData *bistate)
+				   int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	return rel->rd_tableam->tuple_insert(rel, slot, cid, options,
-										 bistate);
+										 bistate, insert_indexes);
 }
 
 /*
@@ -1470,10 +1478,11 @@ table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
  */
 static inline void
 table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
-				   CommandId cid, int options, struct BulkInsertStateData *bistate)
+				   CommandId cid, int options, struct BulkInsertStateData *bistate,
+				   bool *insert_indexes)
 {
 	rel->rd_tableam->multi_insert(rel, slots, nslots,
-								  cid, options, bistate);
+								  cid, options, bistate, insert_indexes);
 }
 
 /*
@@ -2168,7 +2177,8 @@ table_scan_sample_next_tuple(TableScanDesc scan,
  * ----------------------------------------------------------------------------
  */
 
-extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot);
+extern void simple_table_tuple_insert(Relation rel, TupleTableSlot *slot,
+									  bool *insert_indexes);
 extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *oldSlot);
-- 
2.39.2 (Apple Git-143)

v8-0004-Let-table-AM-override-reloptions-for-indexes-buil.patchapplication/octet-stream; name=v8-0004-Let-table-AM-override-reloptions-for-indexes-buil.patchDownload
From af9b25040a6203ac198c1022231a070b71bbb616 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Thu, 14 Mar 2024 00:53:05 +0200
Subject: [PATCH v8 4/8] Let table AM override reloptions for indexes built on
 its tables

---
 src/backend/access/common/reloptions.c   |  3 ++-
 src/backend/access/heap/heapam_handler.c |  8 ++++++++
 src/backend/commands/indexcmds.c         |  3 ++-
 src/backend/commands/tablecmds.c         |  9 +++++++-
 src/backend/utils/cache/relcache.c       | 24 ++++++++++++++++++++--
 src/include/access/tableam.h             | 26 ++++++++++++++++++++++++
 6 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388b..00088240cd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1411,7 +1411,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			options = index_reloptions(amoptions, datum, false);
+			options = tableam_indexoptions(tableam, amoptions, classForm->relkind,
+										   datum, false);
 			break;
 		case RELKIND_FOREIGN_TABLE:
 			options = NULL;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 590413bab9..c560f70ba2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2444,6 +2444,13 @@ heapam_reloptions(char relkind, Datum reloptions, bool validate)
 	return heap_reloptions(relkind, reloptions, validate);
 }
 
+static bytea *
+heapam_indexoptions(amoptions_function amoptions, char relkind,
+					Datum reloptions, bool validate)
+{
+	return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2949,6 +2956,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
 	.reloptions = heapam_reloptions,
+	.indexoptions = heapam_indexoptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d9016ef487..e78598c10e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -899,7 +899,8 @@ DefineIndex(Oid tableId,
 	reloptions = transformRelOptions((Datum) 0, stmt->options,
 									 NULL, NULL, false, false);
 
-	(void) index_reloptions(amoptions, reloptions, true);
+	(void) tableam_indexoptions(rel->rd_tableam, amoptions, RELKIND_INDEX,
+								reloptions, true);
 
 	/*
 	 * Prepare arguments for index_create, primarily an IndexInfo structure.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3fcb9cd078..313ca1ae81 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15543,7 +15543,14 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
+			{
+				Relation	tbl = relation_open(rel->rd_index->indrelid,
+												AccessShareLock);
+
+				tableam_indexoptions(tbl->rd_tableam, rel->rd_indam->amoptions,
+									 rel->rd_rel->relkind, newOptions, true);
+				relation_close(tbl, AccessShareLock);
+			}
 			break;
 		default:
 			ereport(ERROR,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 039c0d3eef..4343deb4ee 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -477,15 +477,35 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
-		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
+		case RELKIND_VIEW:
 		case RELKIND_PARTITIONED_TABLE:
 			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
 		case RELKIND_PARTITIONED_INDEX:
-			amoptsfn = relation->rd_indam->amoptions;
+			{
+				Form_pg_class classForm;
+				HeapTuple	classTup;
+
+				/* fetch the relation's relcache entry */
+				if (relation->rd_index->indrelid >= FirstNormalObjectId)
+				{
+					classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relation->rd_index->indrelid));
+					classForm = (Form_pg_class) GETSTRUCT(classTup);
+					if (classForm->relam >= FirstNormalObjectId)
+						tableam = GetTableAmRoutineByAmOid(classForm->relam);
+					else
+						tableam = GetHeapamTableAmRoutine();
+					heap_freetuple(classTup);
+				}
+				else
+				{
+					tableam = GetHeapamTableAmRoutine();
+				}
+				amoptsfn = relation->rd_indam->amoptions;
+			}
 			break;
 		default:
 			return;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c4cdae5903..48f078309f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -17,6 +17,7 @@
 #ifndef TABLEAM_H
 #define TABLEAM_H
 
+#include "access/amapi.h"
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
@@ -757,6 +758,13 @@ typedef struct TableAmRoutine
 	 */
 	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
 
+	/*
+	 * Parse table AM-specific index options.  Useful for table AM to define
+	 * new index options or override existing index options.
+	 */
+	bytea	   *(*indexoptions) (amoptions_function amoptions, char relkind,
+								 Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1971,6 +1979,24 @@ tableam_reloptions(const TableAmRoutine *tableam, char relkind,
 	return tableam->reloptions(relkind, reloptions, validate);
 }
 
+extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
+							   bool validate);
+
+/*
+ * Parse index options.  Gives table AM a chance to override index-specific
+ * options defined in 'amoptions'.
+ */
+static inline bytea *
+tableam_indexoptions(const TableAmRoutine *tableam,
+					 amoptions_function amoptions, char relkind,
+					 Datum reloptions, bool validate)
+{
+	if (tableam)
+		return tableam->indexoptions(amoptions, relkind, reloptions, validate);
+	else
+		return index_reloptions(amoptions, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
-- 
2.39.2 (Apple Git-143)

v8-0007-Introduce-RowRefType-which-describes-the-table-ro.patchapplication/octet-stream; name=v8-0007-Introduce-RowRefType-which-describes-the-table-ro.patchDownload
From 67f2ed3b94574fdb223c672bf1b12c09349e060b Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:55:32 +0300
Subject: [PATCH v8 7/8] Introduce RowRefType, which describes the table row
 identifier

Currently, the table row could be identified by the ctid or the whole row
(foreign table).  But the row identifier is mixed together with lock mode in
RowMarkType.  This commit separates row identifier type into separate enum
RowRefType.
---
 contrib/postgres_fdw/postgres_fdw.c    |  2 +-
 doc/src/sgml/fdwhandler.sgml           | 22 ++++++++----
 src/backend/executor/execMain.c        | 35 ++++++++++++--------
 src/backend/optimizer/plan/planner.c   | 33 +++++++++++-------
 src/backend/optimizer/prep/preptlist.c |  4 +--
 src/backend/optimizer/util/inherit.c   | 27 +++++++--------
 src/include/foreign/fdwapi.h           |  3 +-
 src/include/nodes/execnodes.h          |  4 +++
 src/include/nodes/plannodes.h          | 46 ++++++++++++++++----------
 src/include/optimizer/planner.h        |  3 +-
 src/tools/pgindent/typedefs.list       |  1 +
 11 files changed, 113 insertions(+), 67 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 142dcfc995..b000079029 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -7636,7 +7636,7 @@ make_tuple_from_result_row(PGresult *res,
 	 * If we have a CTID to return, install it in both t_self and t_ctid.
 	 * t_self is the normal place, but if the tuple is converted to a
 	 * composite Datum, t_self will be lost; setting t_ctid allows CTID to be
-	 * preserved during EvalPlanQual re-evaluations (see ROW_MARK_COPY code).
+	 * preserved during EvalPlanQual re-evaluations (see ROW_REF_COPY code).
 	 */
 	if (ctid)
 		tuple->t_self = tuple->t_data->t_ctid = *ctid;
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index b80320504d..51bc0e1029 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1160,13 +1160,16 @@ ExecForeignTruncate(List *rels,
 <programlisting>
 RowMarkType
 GetForeignRowMarkType(RangeTblEntry *rte,
-                      LockClauseStrength strength);
+                      LockClauseStrength strength,
+                      RowRefType *refType);
 </programlisting>
 
      Report which row-marking option to use for a foreign table.
-     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table
-     and <literal>strength</literal> describes the lock strength requested by the
-     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any.  The result must be
+     <literal>rte</literal> is the <structname>RangeTblEntry</structname> node for the table;
+     <literal>strength</literal> describes the lock strength requested by the
+     relevant <literal>FOR UPDATE/SHARE</literal> clause, if any;
+     <literal>refType</literal> points to a <literal>RowRefType</literal>
+     value specifying how the row is to be referenced.  The result must be
      a member of the <literal>RowMarkType</literal> enum type.
     </para>
 
@@ -1177,9 +1180,16 @@ GetForeignRowMarkType(RangeTblEntry *rte,
      or <command>DELETE</command>.
     </para>
 
+    <para>
+     If the value pointed to by <literal>refType</literal> is left unchanged,
+     the <literal>ROW_REF_COPY</literal> option is used.
+    </para>
+
     <para>
      If the <function>GetForeignRowMarkType</function> pointer is set to
-     <literal>NULL</literal>, the <literal>ROW_MARK_COPY</literal> option is always used.
+     <literal>NULL</literal>, the <literal>ROW_MARK_REFERENCE</literal> option
+     for row mark type and <literal>ROW_REF_COPY</literal> option for the row
+     reference type are always used.
      (This implies that <function>RefetchForeignRow</function> will never be called,
      so it need not be provided either.)
     </para>
@@ -1213,7 +1223,7 @@ RefetchForeignRow(EState *estate,
      defined by <literal>erm-&gt;markType</literal>, which is the value
      previously returned by <function>GetForeignRowMarkType</function>.
      (<literal>ROW_MARK_REFERENCE</literal> means to just re-fetch the tuple
-     without acquiring any lock, and <literal>ROW_MARK_COPY</literal> will
+     without acquiring any lock.  Rows referenced via <literal>ROW_REF_COPY</literal> will
      never be seen by this routine.)
     </para>
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7eb1f7d020..3b03f03a98 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -875,22 +875,19 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			/* get relation's OID (will produce InvalidOid if subquery) */
 			relid = exec_rt_fetch(rc->rti, estate)->relid;
 
-			/* open relation, if we need to access it for this mark type */
-			switch (rc->markType)
+			/*
+			 * Open relation, if we need to access it for this reference type.
+			 */
+			switch (rc->refType)
 			{
-				case ROW_MARK_EXCLUSIVE:
-				case ROW_MARK_NOKEYEXCLUSIVE:
-				case ROW_MARK_SHARE:
-				case ROW_MARK_KEYSHARE:
-				case ROW_MARK_REFERENCE:
+				case ROW_REF_TID:
 					relation = ExecGetRangeTableRelation(estate, rc->rti);
 					break;
-				case ROW_MARK_COPY:
-					/* no physical table access is required */
+				case ROW_REF_COPY:
 					relation = NULL;
 					break;
 				default:
-					elog(ERROR, "unrecognized markType: %d", rc->markType);
+					elog(ERROR, "unrecognized refType: %d", rc->refType);
 					relation = NULL;	/* keep compiler quiet */
 					break;
 			}
@@ -906,6 +903,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
+			erm->refType = rc->refType;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -2402,10 +2400,14 @@ ExecBuildAuxRowMark(ExecRowMark *erm, List *targetlist)
 
 	aerm->rowmark = erm;
 
-	/* Look up the resjunk columns associated with this rowmark */
-	if (erm->markType != ROW_MARK_COPY)
+	/*
+	 * Look up the resjunk columns associated with this rowmark's reference
+	 * type.
+	 */
+	if (erm->refType != ROW_REF_COPY)
 	{
 		/* need ctid for all methods other than COPY */
+		Assert(erm->refType == ROW_REF_TID);
 		snprintf(resname, sizeof(resname), "ctid%u", erm->rowmarkId);
 		aerm->ctidAttNo = ExecFindJunkAttributeInTlist(targetlist,
 													   resname);
@@ -2656,7 +2658,12 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		}
 	}
 
-	if (erm->markType == ROW_MARK_REFERENCE)
+	/*
+	 * For a non-locked relation, the row mark type should be
+	 * ROW_MARK_REFERENCE.  Fetch the tuple according to its reference type.
+	 */
+	Assert(erm->markType == ROW_MARK_REFERENCE);
+	if (erm->refType == ROW_REF_TID)
 	{
 		Assert(erm->relation != NULL);
 
@@ -2709,7 +2716,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 	}
 	else
 	{
-		Assert(erm->markType == ROW_MARK_COPY);
+		Assert(erm->refType == ROW_REF_COPY);
 
 		/* fetch the whole-row Var for the relation */
 		datum = ExecGetJunkAttribute(epqstate->origslot,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 38d070fa00..4b9c9deee8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2309,7 +2309,7 @@ preprocess_rowmarks(PlannerInfo *root)
 		 * Ignore RowMarkClauses for subqueries; they aren't real tables and
 		 * can't support true locking.  Subqueries that got flattened into the
 		 * main query should be ignored completely.  Any that didn't will get
-		 * ROW_MARK_COPY items in the next loop.
+		 * ROW_REF_COPY items in the next loop.
 		 */
 		if (rte->rtekind != RTE_RELATION)
 			continue;
@@ -2319,8 +2319,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = rc->rti;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, rc->strength);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, rc->strength, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = rc->strength;
 		newrc->waitPolicy = rc->waitPolicy;
 		newrc->isParent = false;
@@ -2344,8 +2344,8 @@ preprocess_rowmarks(PlannerInfo *root)
 		newrc = makeNode(PlanRowMark);
 		newrc->rti = newrc->prti = i;
 		newrc->rowmarkId = ++(root->glob->lastRowMarkId);
-		newrc->markType = select_rowmark_type(rte, LCS_NONE);
-		newrc->allMarkTypes = (1 << newrc->markType);
+		newrc->markType = select_rowmark_type(rte, LCS_NONE, &newrc->refType);
+		newrc->allRefTypes = (1 << newrc->refType);
 		newrc->strength = LCS_NONE;
 		newrc->waitPolicy = LockWaitBlock;	/* doesn't matter */
 		newrc->isParent = false;
@@ -2357,29 +2357,38 @@ preprocess_rowmarks(PlannerInfo *root)
 }
 
 /*
- * Select RowMarkType to use for a given table
+ * Select RowMarkType and RowRefType to use for a given table
  */
 RowMarkType
-select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength)
+select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
+					RowRefType *refType)
 {
 	if (rte->rtekind != RTE_RELATION)
 	{
-		/* If it's not a table at all, use ROW_MARK_COPY */
-		return ROW_MARK_COPY;
+		/*
+		 * If it's not a table at all, use ROW_MARK_REFERENCE and
+		 * ROW_REF_COPY.
+		 */
+		*refType = ROW_REF_COPY;
+		return ROW_MARK_REFERENCE;
 	}
 	else if (rte->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
 
+		/* Set row reference type as ROW_REF_COPY by default */
+		*refType = ROW_REF_COPY;
+
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength);
-		/* Otherwise, use ROW_MARK_COPY by default */
-		return ROW_MARK_COPY;
+			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+		/* Otherwise, use ROW_MARK_REFERENCE by default */
+		return ROW_MARK_REFERENCE;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
+		*refType = ROW_REF_TID;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 7698bfa1a5..4599b0dc76 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -210,7 +210,7 @@ preprocess_targetlist(PlannerInfo *root)
 		if (rc->rti != rc->prti)
 			continue;
 
-		if (rc->allMarkTypes & ~(1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_TID))
 		{
 			/* Need to fetch TID */
 			var = makeVar(rc->rti,
@@ -226,7 +226,7 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
-		if (rc->allMarkTypes & (1 << ROW_MARK_COPY))
+		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
 			var = makeWholeRowVar(rt_fetch(rc->rti, range_table),
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 5c7acf8a90..b4b076d1cb 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -91,7 +91,7 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	LOCKMODE	lockmode;
 	PlanRowMark *oldrc;
 	bool		old_isParent = false;
-	int			old_allMarkTypes = 0;
+	int			old_allRefTypes = 0;
 
 	Assert(rte->inh);			/* else caller error */
 
@@ -131,8 +131,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	{
 		old_isParent = oldrc->isParent;
 		oldrc->isParent = true;
-		/* Save initial value of allMarkTypes before children add to it */
-		old_allMarkTypes = oldrc->allMarkTypes;
+		/* Save initial value of allRefTypes before children add to it */
+		old_allRefTypes = oldrc->allRefTypes;
 	}
 
 	/* Scan the inheritance set and expand it */
@@ -239,15 +239,15 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	if (oldrc)
 	{
-		int			new_allMarkTypes = oldrc->allMarkTypes;
+		int			new_allRefTypes = oldrc->allRefTypes;
 		Var		   *var;
 		TargetEntry *tle;
 		char		resname[32];
 		List	   *newvars = NIL;
 
 		/* Add TID junk Var if needed, unless we had it already */
-		if (new_allMarkTypes & ~(1 << ROW_MARK_COPY) &&
-			!(old_allMarkTypes & ~(1 << ROW_MARK_COPY)))
+		if (new_allRefTypes & (1 << ROW_REF_TID) &&
+			!(old_allRefTypes & (1 << ROW_REF_TID)))
 		{
 			/* Need to fetch TID */
 			var = makeVar(oldrc->rti,
@@ -266,8 +266,8 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 		}
 
 		/* Add whole-row junk Var if needed, unless we had it already */
-		if ((new_allMarkTypes & (1 << ROW_MARK_COPY)) &&
-			!(old_allMarkTypes & (1 << ROW_MARK_COPY)))
+		if ((new_allRefTypes & (1 << ROW_REF_COPY)) &&
+			!(old_allRefTypes & (1 << ROW_REF_COPY)))
 		{
 			var = makeWholeRowVar(planner_rt_fetch(oldrc->rti, root),
 								  oldrc->rti,
@@ -441,7 +441,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
  * where the hierarchy is flattened during RTE expansion.)
  *
  * PlanRowMarks still carry the top-parent's RTI, and the top-parent's
- * allMarkTypes field still accumulates values from all descendents.
+ * allRefTypes field still accumulates values from all descendents.
  *
  * "parentrte" and "parentRTindex" are immediate parent's RTE and
  * RTI. "top_parentrc" is top parent's PlanRowMark.
@@ -580,8 +580,9 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		childrc->rowmarkId = top_parentrc->rowmarkId;
 		/* Reselect rowmark type, because relkind might not match parent */
 		childrc->markType = select_rowmark_type(childrte,
-												top_parentrc->strength);
-		childrc->allMarkTypes = (1 << childrc->markType);
+												top_parentrc->strength,
+												&childrc->refType);
+		childrc->allRefTypes = (1 << childrc->refType);
 		childrc->strength = top_parentrc->strength;
 		childrc->waitPolicy = top_parentrc->waitPolicy;
 
@@ -592,8 +593,8 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 		 */
 		childrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE);
 
-		/* Include child's rowmark type in top parent's allMarkTypes */
-		top_parentrc->allMarkTypes |= childrc->allMarkTypes;
+		/* Include child's rowmark type in top parent's allRefTypes */
+		top_parentrc->allRefTypes |= childrc->allRefTypes;
 
 		root->rowMarks = lappend(root->rowMarks, childrc);
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 0968e0a01e..868e04e813 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -129,7 +129,8 @@ typedef TupleTableSlot *(*IterateDirectModify_function) (ForeignScanState *node)
 typedef void (*EndDirectModify_function) (ForeignScanState *node);
 
 typedef RowMarkType (*GetForeignRowMarkType_function) (RangeTblEntry *rte,
-													   LockClauseStrength strength);
+													   LockClauseStrength strength,
+													   RowRefType *refType);
 
 typedef void (*RefetchForeignRow_function) (EState *estate,
 											ExecRowMark *erm,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1774c56ae3..a1ccf6e681 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -455,6 +455,9 @@ typedef struct ResultRelInfo
 	/* relation descriptor for result relation */
 	Relation	ri_RelationDesc;
 
+	/* row identifier type for result relation */
+	RowRefType	ri_RowRefType;
+
 	/* # of indices existing on result relation */
 	int			ri_NumIndices;
 
@@ -750,6 +753,7 @@ typedef struct ExecRowMark
 	Index		prti;			/* parent range table index, if child */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum in nodes/plannodes.h */
+	RowRefType	refType;		/* row identifier type for relation */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED */
 	bool		ermActive;		/* is this mark relevant for current tuple? */
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 7f3db5105d..d7f9c389da 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1311,16 +1311,8 @@ typedef struct Limit
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we have to uniquely
  * identify all the source rows, not only those from the target relations, so
- * that we can perform EvalPlanQual rechecking at need.  For plain tables we
- * can just fetch the TID, much as for a target relation; this case is
- * represented by ROW_MARK_REFERENCE.  Otherwise (for example for VALUES or
- * FUNCTION scans) we have to copy the whole row value.  ROW_MARK_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_MARK_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_MARK_REFERENCE instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
+ * that we can perform EvalPlanQual rechecking at need.  ROW_MARK_REFERENCE
+ * represents this case.
  */
 typedef enum RowMarkType
 {
@@ -1329,9 +1321,29 @@ typedef enum RowMarkType
 	ROW_MARK_SHARE,				/* obtain shared tuple lock */
 	ROW_MARK_KEYSHARE,			/* obtain keyshare tuple lock */
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
-	ROW_MARK_COPY,				/* physically copy the row value */
 } RowMarkType;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * In future we may allow more types of row identifiers.
+ * In the future we may allow more types of row identifiers.
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
@@ -1340,8 +1352,7 @@ typedef enum RowMarkType
  *
  * When doing UPDATE/DELETE/MERGE/SELECT FOR UPDATE/SHARE, we create a separate
  * PlanRowMark node for each non-target relation in the query.  Relations that
- * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE (if
- * regular tables or supported foreign tables) or ROW_MARK_COPY (if not).
+ * are not specified as FOR UPDATE/SHARE are marked ROW_MARK_REFERENCE.
  *
  * Initially all PlanRowMarks have rti == prti and isParent == false.
  * When the planner discovers that a relation is the root of an inheritance
@@ -1351,16 +1362,16 @@ typedef enum RowMarkType
  * child relations will also have entries with isParent = true.  The child
  * entries have rti == child rel's RT index and prti == top parent's RT index,
  * and can therefore be recognized as children by the fact that prti != rti.
- * The parent's allMarkTypes field gets the OR of (1<<markType) across all
+ * The parent's allRefTypes field gets the OR of (1<<refType) across all
  * its children (this definition allows children to use different markTypes).
  *
  * The planner also adds resjunk output columns to the plan that carry
  * information sufficient to identify the locked or fetched rows.  When
- * markType != ROW_MARK_COPY, these columns are named
+ * refType != ROW_REF_COPY, these columns are named
  *		tableoid%u			OID of table
  *		ctid%u				TID of row
  * The tableoid column is only present for an inheritance hierarchy.
- * When markType == ROW_MARK_COPY, there is instead a single column named
+ * When refType == ROW_REF_COPY, there is instead a single column named
  *		wholerow%u			whole-row value of relation
  * (An inheritance hierarchy could have all three resjunk output columns,
  * if some children use a different markType than others.)
@@ -1381,7 +1392,8 @@ typedef struct PlanRowMark
 	Index		prti;			/* range table index of parent relation */
 	Index		rowmarkId;		/* unique identifier for resjunk columns */
 	RowMarkType markType;		/* see enum above */
-	int			allMarkTypes;	/* OR of (1<<markType) for all children */
+	RowRefType	refType;		/* see enum above */
+	int			allRefTypes;	/* OR of (1<<refType) for all children */
 	LockClauseStrength strength;	/* LockingClause's strength, or LCS_NONE */
 	LockWaitPolicy waitPolicy;	/* NOWAIT and SKIP LOCKED options */
 	bool		isParent;		/* true if this is a "dummy" parent entry */
diff --git a/src/include/optimizer/planner.h b/src/include/optimizer/planner.h
index e1d79ffdf3..98fc796d05 100644
--- a/src/include/optimizer/planner.h
+++ b/src/include/optimizer/planner.h
@@ -47,7 +47,8 @@ extern PlannerInfo *subquery_planner(PlannerGlobal *glob, Query *parse,
 									 bool hasRecursion, double tuple_fraction);
 
 extern RowMarkType select_rowmark_type(RangeTblEntry *rte,
-									   LockClauseStrength strength);
+									   LockClauseStrength strength,
+									   RowRefType *refType);
 
 extern bool limit_needed(Query *parse);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaea..6ce0a586bf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2433,6 +2433,7 @@ RowExpr
 RowIdentityVarInfo
 RowMarkClause
 RowMarkType
+RowRefType
 RowSecurityDesc
 RowSecurityPolicy
 RtlGetLastNtStatus_t
-- 
2.39.2 (Apple Git-143)
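
For illustration (not part of the patch), here is a minimal sketch of how an
FDW with a usable rowid concept might adapt to the new signature; the function
name is hypothetical.  An FDW without such a concept would simply leave
*refType untouched, keeping the ROW_REF_COPY default the planner sets before
calling it:

#include "postgres.h"

#include "foreign/fdwapi.h"
#include "nodes/parsenodes.h"
#include "nodes/plannodes.h"

/*
 * Hypothetical FDW callback under the new signature: request TID-based row
 * references instead of whole-row copies, and plain refetching for locking.
 */
static RowMarkType
myfdw_GetForeignRowMarkType(RangeTblEntry *rte,
							LockClauseStrength strength,
							RowRefType *refType)
{
	*refType = ROW_REF_TID;		/* rows can be refetched by TID */
	return ROW_MARK_REFERENCE;	/* refetch on demand, no remote tuple lock */
}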

v8-0005-Notify-table-AM-about-index-creation.patch (application/octet-stream)
From a8dd842d479c38038d7b9b56ac0123b0d74a1e91 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 17 Jun 2023 22:01:01 +0300
Subject: [PATCH v8 5/8] Notify table AM about index creation

This allows a table AM to do some preparation at index build time.  In
particular, the table AM could update its specific meta-information.  This
could also be useful if the table AM overrides index implementations.
---
 src/backend/access/heap/heapam_handler.c |  2 ++
 src/backend/catalog/index.c              |  2 ++
 src/backend/commands/indexcmds.c         | 41 +++++++++++++----------
 src/include/access/tableam.h             | 42 ++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c560f70ba2..1c029ce6ab 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2949,6 +2949,8 @@ static const TableAmRoutine heapam_methods = {
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
 	.relation_analyze = heapam_analyze,
+	.define_index_validate = NULL,
+	.define_index = NULL,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b6a7c60e23..bca9798105 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3840,6 +3840,8 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 
 	/* Close rels, but keep locks */
 	index_close(iRel, NoLock);
+	table_define_index(heapRelation, indexId, true,
+					   skip_constraint_checks, false, NULL);
 	table_close(heapRelation, NoLock);
 
 	if (progress)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e78598c10e..2570e7a24a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -583,6 +583,7 @@ DefineIndex(Oid tableId,
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
+	void	   *arg;
 
 	root_save_nestlevel = NewGUCNestLevel();
 
@@ -629,6 +630,26 @@ DefineIndex(Oid tableId,
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID,
 								 InvalidOid);
 
+	/*
+	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
+	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
+	 * (but not VACUUM).
+	 *
+	 * NB: Caller is responsible for making sure that tableId refers to the
+	 * relation on which the index should be built; except in bootstrap mode,
+	 * this will typically require the caller to have already locked the
+	 * relation.  To avoid lock upgrade hazards, that lock should be at least
+	 * as strong as the one we take here.
+	 *
+	 * NB: If the lock strength here ever changes, code that is run by
+	 * parallel workers under the control of certain particular ambuild
+	 * functions will need to be updated, too.
+	 */
+	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
+	rel = table_open(tableId, lockmode);
+
+	table_define_index_validate(rel, stmt, skip_build, &arg);
+
 	/*
 	 * count key attributes in index
 	 */
@@ -656,24 +677,6 @@ DefineIndex(Oid tableId,
 				 errmsg("cannot use more than %d columns in an index",
 						INDEX_MAX_KEYS)));
 
-	/*
-	 * Only SELECT ... FOR UPDATE/SHARE are allowed while doing a standard
-	 * index build; but for concurrent builds we allow INSERT/UPDATE/DELETE
-	 * (but not VACUUM).
-	 *
-	 * NB: Caller is responsible for making sure that tableId refers to the
-	 * relation on which the index should be built; except in bootstrap mode,
-	 * this will typically require the caller to have already locked the
-	 * relation.  To avoid lock upgrade hazards, that lock should be at least
-	 * as strong as the one we take here.
-	 *
-	 * NB: If the lock strength here ever changes, code that is run by
-	 * parallel workers under the control of certain particular ambuild
-	 * functions will need to be updated, too.
-	 */
-	lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
-	rel = table_open(tableId, lockmode);
-
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
 	 * as that user.  Also lock down security-restricted operations.  We
@@ -1218,6 +1221,8 @@ DefineIndex(Oid tableId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	table_define_index(rel, address.objectId, false, false,
+					   skip_build, arg);
 	if (!OidIsValid(indexRelationId))
 	{
 		/*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 48f078309f..db0559788a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -684,6 +684,16 @@ typedef struct TableAmRoutine
 									 BlockNumber *totalpages,
 									 BufferAccessStrategy bstrategy);
 
+	/* See table_define_index_validate() */
+	bool		(*define_index_validate) (Relation rel, IndexStmt *stmt,
+										  bool skip_build, void **arg);
+
+	/* See table_define_index() */
+	bool		(*define_index) (Relation rel, Oid indoid, bool reindex,
+								 bool skip_constraint_checks, bool skip_build,
+								 void *arg);
+
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1860,6 +1870,38 @@ table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 										   totalpages, bstrategy);
 }
 
+/*
+ * Let table AM validate the index to be created on `rel` with statement
+ * `*stmt`.  `skip_build` indicates that only catalog entries are to be
+ * created without index data.  This method can save some information into
+ * `arg`, which should later be passed to table_define_index().
+ */
+static inline bool
+table_define_index_validate(Relation rel, IndexStmt *stmt,
+							bool skip_build, void **arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index_validate)
+		return rel->rd_tableam->define_index_validate(rel, stmt,
+													  skip_build, arg);
+	else
+		return true;
+}
+
+/*
+ * Notifies table AM about index creation on `rel` with oid `indoid`.
+ */
+static inline bool
+table_define_index(Relation rel, Oid indoid, bool reindex,
+				   bool skip_constraint_checks, bool skip_build, void *arg)
+{
+	if (rel->rd_tableam && rel->rd_tableam->define_index)
+		return rel->rd_tableam->define_index(rel, indoid, reindex,
+											 skip_constraint_checks,
+											 skip_build, arg);
+	else
+		return true;
+}
+
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
  * ----------------------------------------------------------------------------
-- 
2.39.2 (Apple Git-143)
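
As an illustration (not part of the patch), here is a minimal sketch of how a
table AM might use the two new hooks; all "myam" names are hypothetical.
define_index_validate() runs before DefineIndex() does its work and may stash
private state through *arg, while define_index() runs once the index OID is
known (or on REINDEX) and receives that state back:

#include "postgres.h"

#include "access/tableam.h"
#include "nodes/parsenodes.h"

/* hypothetical private state passed between the two hooks */
typedef struct MyAmIndexState
{
	bool		skip_build;
} MyAmIndexState;

static bool
myam_define_index_validate(Relation rel, IndexStmt *stmt,
						   bool skip_build, void **arg)
{
	MyAmIndexState *state = palloc0(sizeof(MyAmIndexState));

	state->skip_build = skip_build;
	*arg = state;				/* handed back to define_index() below */
	return true;
}

static bool
myam_define_index(Relation rel, Oid indoid, bool reindex,
				  bool skip_constraint_checks, bool skip_build, void *arg)
{
	/* update AM-specific metadata for the new (or rebuilt) index here */
	return true;
}

Both would be registered in the AM's TableAmRoutine via .define_index_validate
and .define_index.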

v8-0008-Introduce-RowID-bytea-tuple-identifier.patch (application/octet-stream)
From f13df39e1ff3a090a8e774a33d6121394c34bec3 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 21:00:37 +0200
Subject: [PATCH v8 8/8] Introduce RowID -- bytea tuple identifier

Currently, there are two ways to reference a tuple: the tuple identifier (tid)
and a whole-row copy.  The tuple identifier used for regular tables consists of
a 32-bit block number and a 16-bit offset.  That is limiting for some use
cases, in particular index-organized tables.  The whole-row copy is used to
identify tuples in FDWs.  It could be extended to regular tables, but that
seems overkill.

This commit introduces RowID -- a new bytea tuple identifier.  A table AM can
choose how its tuples are identified by providing the new get_row_ref_type()
API function.  The new system attribute RowIdAttributeNumber holds the RowID
when appropriate.  Table AM methods now accept Datum arguments as tuple
identifiers.  Those Datums can be either tid or bytea, depending on what
table_get_row_ref_type() says.  The ModifyTable node and triggers are aware of
RowID.  IndexScan and BitmapScan nodes are not aware of RowIDs and expect tids.
Table AMs that use RowIDs are expected to redefine those nodes using hooks.
---
 contrib/amcheck/verify_nbtree.c          |   3 +-
 src/backend/access/common/heaptuple.c    |   4 +
 src/backend/access/heap/heapam_handler.c |  33 ++-
 src/backend/access/table/tableam.c       |   4 +-
 src/backend/catalog/aclchk.c             |   2 +-
 src/backend/commands/trigger.c           | 251 ++++++++++++++++++-----
 src/backend/executor/execExprInterp.c    |   4 +-
 src/backend/executor/execMain.c          |   9 +-
 src/backend/executor/execReplication.c   |  12 +-
 src/backend/executor/nodeLockRows.c      |  17 +-
 src/backend/executor/nodeModifyTable.c   | 145 ++++++++-----
 src/backend/executor/nodeTidscan.c       |   2 +-
 src/backend/optimizer/plan/planner.c     |  11 +-
 src/backend/optimizer/prep/preptlist.c   |  16 ++
 src/backend/optimizer/util/appendinfo.c  |  33 ++-
 src/backend/optimizer/util/inherit.c     |  20 ++
 src/backend/parser/parse_relation.c      |  13 ++
 src/backend/rewrite/rewriteHandler.c     |   1 +
 src/backend/utils/sort/tuplestore.c      |  30 +++
 src/include/access/sysattr.h             |   3 +-
 src/include/access/tableam.h             |  58 ++++--
 src/include/commands/trigger.h           |   4 +-
 src/include/nodes/parsenodes.h           |   2 +
 src/include/nodes/plannodes.h            |  21 --
 src/include/nodes/primnodes.h            |  22 ++
 src/include/utils/tuplestore.h           |   3 +
 26 files changed, 548 insertions(+), 175 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f71f1854e0..7bfa2a2fc4 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -984,7 +984,8 @@ heap_entry_is_visible(BtreeCheckState *state, ItemPointer tid)
 	TupleTableSlot *slot = table_slot_create(state->heaprel, NULL);
 
 	tid_visible = table_tuple_fetch_row_version(state->heaprel,
-												tid, state->snapshot, slot);
+												PointerGetDatum(tid),
+												state->snapshot, slot);
 	if (slot != NULL)
 		ExecDropSingleTupleTableSlot(slot);
 
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 5c89fbbef8..7b52c66939 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -755,6 +755,10 @@ heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc, bool *isnull)
 		case TableOidAttributeNumber:
 			result = ObjectIdGetDatum(tup->t_tableOid);
 			break;
+		case RowIdAttributeNumber:
+			*isnull = true;
+			result = 0;
+			break;
 		default:
 			elog(ERROR, "invalid attnum: %d", attnum);
 			result = 0;			/* keep compiler quiet */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 09429fd9ef..13330ca715 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -46,7 +46,7 @@
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
-static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+static TM_Result heapam_tuple_lock(Relation relation, Datum tupleid,
 								   Snapshot snapshot, TupleTableSlot *slot,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
@@ -184,7 +184,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 static bool
 heapam_fetch_row_version(Relation relation,
-						 ItemPointer tid,
+						 Datum tupleid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot)
 {
@@ -193,7 +193,7 @@ heapam_fetch_row_version(Relation relation,
 
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
-	bslot->base.tupdata.t_self = *tid;
+	bslot->base.tupdata.t_self = *DatumGetItemPointer(tupleid);
 	if (heap_fetch(relation, snapshot, &bslot->base.tupdata, &buffer, false))
 	{
 		/* store in slot, transferring existing pin */
@@ -358,7 +358,7 @@ ExecCheckTIDVisible(EState *estate,
 	if (!IsolationUsesXactSnapshot())
 		return;
 
-	if (!table_tuple_fetch_row_version(rel, tid,
+	if (!table_tuple_fetch_row_version(rel, PointerGetDatum(tid),
 									   SnapshotAny, tempSlot))
 		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
 	ExecCheckTupleVisible(estate, rel, tempSlot);
@@ -405,7 +405,7 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 				 * here means our previous conclusion that the tuple is
 				 * conclusively committed is not true anymore.
 				 */
-				test = table_tuple_lock(rel, &conflictTid,
+				test = table_tuple_lock(rel, PointerGetDatum(&conflictTid),
 										estate->es_snapshot,
 										lockedSlot, estate->es_output_cid,
 										lockmode, LockWaitBlock, 0,
@@ -585,12 +585,13 @@ heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
 }
 
 static TM_Result
-heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
+heapam_tuple_delete(Relation relation, Datum tupleid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
 					TM_FailureData *tmfd, bool changingPart,
 					TupleTableSlot *oldSlot)
 {
 	TM_Result	result;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 
 	/*
 	 * Currently Deleting of index tuples are handled at vacuum, in case if
@@ -613,7 +614,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_delete().
 		 */
-		result = heapam_tuple_lock(relation, tid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, LockTupleExclusive,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -630,7 +631,7 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 
 
 static TM_Result
-heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
+heapam_tuple_update(Relation relation, Datum tupleid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					int options, TM_FailureData *tmfd,
 					LockTupleMode *lockmode, TU_UpdateIndexes *update_indexes,
@@ -638,6 +639,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	ItemPointer otid = DatumGetItemPointer(tupleid);
 	TM_Result	result;
 
 	/* Update the tuple with table oid */
@@ -685,7 +687,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 		 * heapam_tuple_lock() will take advantage of tuple loaded into
 		 * oldSlot by heap_update().
 		 */
-		result = heapam_tuple_lock(relation, otid, snapshot,
+		result = heapam_tuple_lock(relation, tupleid, snapshot,
 								   oldSlot, cid, *lockmode,
 								   (options & TABLE_MODIFY_WAIT) ?
 								   LockWaitBlock :
@@ -701,7 +703,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 }
 
 static TM_Result
-heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
+heapam_tuple_lock(Relation relation, Datum tupleid, Snapshot snapshot,
 				  TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				  LockWaitPolicy wait_policy, uint8 flags,
 				  TM_FailureData *tmfd)
@@ -709,6 +711,7 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
 	TM_Result	result;
 	HeapTuple	tuple = &bslot->base.tupdata;
+	ItemPointer tid = DatumGetItemPointer(tupleid);
 	bool		follow_updates;
 
 	follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
@@ -2376,6 +2379,15 @@ heapam_scan_get_blocks_done(HeapScanDesc hscan)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * All heap tables use TID row identifier.
+ */
+static RowRefType
+heapam_get_row_ref_type(Relation rel)
+{
+	return ROW_REF_TID;
+}
+
 /*
  * Check to see whether the table needs a TOAST table.  It does only if
  * (1) there are any toastable attributes, and (2) the maximum length
@@ -2954,6 +2966,7 @@ static const TableAmRoutine heapam_methods = {
 	.define_index_validate = NULL,
 	.define_index = NULL,
 
+	.get_row_ref_type = heapam_get_row_ref_type,
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222ceb..caa79c6edd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -300,7 +300,7 @@ simple_table_tuple_delete(Relation rel, ItemPointer tid, Snapshot snapshot,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_delete(rel, tid,
+	result = table_tuple_delete(rel, PointerGetDatum(tid),
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
@@ -356,7 +356,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 	if (oldSlot)
 		options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
 
-	result = table_tuple_update(rel, otid, slot,
+	result = table_tuple_update(rel, PointerGetDatum(otid), slot,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								options,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 7abf3c2a74..8765becf98 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1626,7 +1626,7 @@ expand_all_col_privileges(Oid table_oid, Form_pg_class classForm,
 	AttrNumber	curr_att;
 
 	Assert(classForm->relnatts - FirstLowInvalidHeapAttributeNumber < num_col_privileges);
-	for (curr_att = FirstLowInvalidHeapAttributeNumber + 1;
+	for (curr_att = FirstLowInvalidHeapAttributeNumber + 2;
 		 curr_att <= classForm->relnatts;
 		 curr_att++)
 	{
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 84494c4b81..4f83e521a3 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -76,7 +76,7 @@ static void SetTriggerFlags(TriggerDesc *trigdesc, Trigger *trigger);
 static bool GetTupleForTrigger(EState *estate,
 							   EPQState *epqstate,
 							   ResultRelInfo *relinfo,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   LockTupleMode lockmode,
 							   TupleTableSlot *oldslot,
 							   TupleTableSlot **epqslot,
@@ -2682,7 +2682,7 @@ ExecASDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot **epqslot,
 					 TM_Result *tmresult,
@@ -2696,7 +2696,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
 	bool		should_free = false;
 	int			i;
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -2924,7 +2924,7 @@ ExecASUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
 bool
 ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 					 ResultRelInfo *relinfo,
-					 ItemPointer tupleid,
+					 Datum tupleid,
 					 HeapTuple fdw_trigtuple,
 					 TupleTableSlot *newslot,
 					 TM_Result *tmresult,
@@ -2944,7 +2944,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
 	/* Determine lock mode to use */
 	lockmode = ExecUpdateLockMode(estate, relinfo);
 
-	Assert(HeapTupleIsValid(fdw_trigtuple) ^ ItemPointerIsValid(tupleid));
+	Assert(HeapTupleIsValid(fdw_trigtuple) ^ (DatumGetPointer(tupleid) != NULL));
 	if (fdw_trigtuple == NULL)
 	{
 		TupleTableSlot *epqslot_candidate = NULL;
@@ -3261,7 +3261,7 @@ static bool
 GetTupleForTrigger(EState *estate,
 				   EPQState *epqstate,
 				   ResultRelInfo *relinfo,
-				   ItemPointer tid,
+				   Datum tupleid,
 				   LockTupleMode lockmode,
 				   TupleTableSlot *oldslot,
 				   TupleTableSlot **epqslot,
@@ -3286,7 +3286,9 @@ GetTupleForTrigger(EState *estate,
 		 */
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
-		test = table_tuple_lock(relation, tid, estate->es_snapshot, oldslot,
+
+		test = table_tuple_lock(relation, tupleid,
+								estate->es_snapshot, oldslot,
 								estate->es_output_cid,
 								lockmode, LockWaitBlock,
 								lockflags,
@@ -3382,8 +3384,8 @@ GetTupleForTrigger(EState *estate,
 		 * We expect the tuple to be present, thus very simple error handling
 		 * suffices.
 		 */
-		if (!table_tuple_fetch_row_version(relation, tid, SnapshotAny,
-										   oldslot))
+		if (!table_tuple_fetch_row_version(relation, tupleid,
+										   SnapshotAny, oldslot))
 			elog(ERROR, "failed to fetch tuple for trigger");
 	}
 
@@ -3589,18 +3591,24 @@ typedef SetConstraintStateData *SetConstraintState;
  * cycles.  So we need only ensure that ats_firing_id is zero when attaching
  * a new event to an existing AfterTriggerSharedData record.
  */
-typedef uint32 TriggerFlags;
-
-#define AFTER_TRIGGER_OFFSET			0x07FFFFFF	/* must be low-order bits */
-#define AFTER_TRIGGER_DONE				0x80000000
-#define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+typedef uint64 TriggerFlags;
+
+#define AFTER_TRIGGER_SIZE				UINT64CONST(0xFFFF000000000)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_SIZE_SHIFT		(36)
+#define AFTER_TRIGGER_OFFSET			UINT64CONST(0x000000FFFFFFF)	/* must be low-order
+																		 * bits */
+#define AFTER_TRIGGER_DONE				UINT64CONST(0x0000800000000)
+#define AFTER_TRIGGER_IN_PROGRESS		UINT64CONST(0x0000400000000)
 /* bits describing the size and tuple sources of this event */
-#define AFTER_TRIGGER_FDW_REUSE			0x00000000
-#define AFTER_TRIGGER_FDW_FETCH			0x20000000
-#define AFTER_TRIGGER_1CTID				0x10000000
-#define AFTER_TRIGGER_2CTID				0x30000000
-#define AFTER_TRIGGER_CP_UPDATE			0x08000000
-#define AFTER_TRIGGER_TUP_BITS			0x38000000
+#define AFTER_TRIGGER_FDW_REUSE			UINT64CONST(0x0000000000000)
+#define AFTER_TRIGGER_FDW_FETCH			UINT64CONST(0x0000200000000)
+#define AFTER_TRIGGER_1CTID				UINT64CONST(0x0000100000000)
+#define AFTER_TRIGGER_ROWID1			UINT64CONST(0x0000010000000)
+#define AFTER_TRIGGER_2CTID				UINT64CONST(0x0000300000000)
+#define AFTER_TRIGGER_ROWID2			UINT64CONST(0x0000020000000)
+#define AFTER_TRIGGER_CP_UPDATE			UINT64CONST(0x0000080000000)
+#define AFTER_TRIGGER_TUP_BITS			UINT64CONST(0x0000380000000)
 typedef struct AfterTriggerSharedData *AfterTriggerShared;
 
 typedef struct AfterTriggerSharedData
@@ -3652,6 +3660,9 @@ typedef struct AfterTriggerEventDataZeroCtids
 }			AfterTriggerEventDataZeroCtids;
 
 #define SizeofTriggerEvent(evt) \
+	(((evt)->ate_flags & AFTER_TRIGGER_SIZE) >> AFTER_TRIGGER_SIZE_SHIFT)
+
+#define BasicSizeofTriggerEvent(evt) \
 	(((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_CP_UPDATE ? \
 	 sizeof(AfterTriggerEventData) : \
 	 (((evt)->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ? \
@@ -4004,14 +4015,34 @@ afterTriggerCopyBitmap(Bitmapset *src)
  */
 static void
 afterTriggerAddEvent(AfterTriggerEventList *events,
-					 AfterTriggerEvent event, AfterTriggerShared evtshared)
+					 AfterTriggerEvent event, AfterTriggerShared evtshared,
+					 bytea *rowid1, bytea *rowid2)
 {
-	Size		eventsize = SizeofTriggerEvent(event);
-	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
+	Size		basiceventsize = MAXALIGN(BasicSizeofTriggerEvent(event));
+	Size		eventsize;
+	Size		needed;
 	AfterTriggerEventChunk *chunk;
 	AfterTriggerShared newshared;
 	AfterTriggerEvent newevent;
 
+	if (SizeofTriggerEvent(event) == 0)
+	{
+		eventsize = basiceventsize;
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+			eventsize += MAXALIGN(VARSIZE(rowid1));
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			eventsize += MAXALIGN(VARSIZE(rowid2));
+
+		event->ate_flags |= eventsize << AFTER_TRIGGER_SIZE_SHIFT;
+	}
+	else
+	{
+		eventsize = SizeofTriggerEvent(event);
+	}
+
+	needed = eventsize + sizeof(AfterTriggerSharedData);
+
 	/*
 	 * If empty list or not enough room in the tail chunk, make a new chunk.
 	 * We assume here that a new shared record will always be needed.
@@ -4044,7 +4075,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
 		 * records will be aligned safely.
 		 */
-#define MIN_CHUNK_SIZE 1024
+#define MIN_CHUNK_SIZE (1024*4)
 #define MAX_CHUNK_SIZE (1024*1024)
 
 #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
@@ -4063,6 +4094,7 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 				chunksize *= 2; /* okay, double it */
 			else
 				chunksize /= 2; /* too many shared records */
+			chunksize = Max(chunksize, MIN_CHUNK_SIZE);
 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
 		}
 		chunk = MemoryContextAlloc(afterTriggers.event_cxt, chunksize);
@@ -4103,7 +4135,26 @@ afterTriggerAddEvent(AfterTriggerEventList *events,
 
 	/* Insert the data */
 	newevent = (AfterTriggerEvent) chunk->freeptr;
-	memcpy(newevent, event, eventsize);
+	if (!rowid1 && !rowid2)
+	{
+		memcpy(newevent, event, eventsize);
+	}
+	else
+	{
+		Pointer		ptr = chunk->freeptr;
+
+		memcpy(newevent, event, basiceventsize);
+		ptr += basiceventsize;
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+		{
+			memcpy(ptr, rowid1, MAXALIGN(VARSIZE(rowid1)));
+			ptr += MAXALIGN(VARSIZE(rowid1));
+		}
+
+		if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+			memcpy(ptr, rowid2, MAXALIGN(VARSIZE(rowid2)));
+	}
 	/* ... and link the new event to its shared record */
 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
@@ -4263,6 +4314,7 @@ AfterTriggerExecute(EState *estate,
 	int			tgindx;
 	bool		should_free_trig = false;
 	bool		should_free_new = false;
+	Pointer		ptr;
 
 	/*
 	 * Locate trigger in trigdesc.
@@ -4294,15 +4346,17 @@ AfterTriggerExecute(EState *estate,
 			{
 				Tuplestorestate *fdw_tuplestore = GetCurrentFDWTuplestore();
 
-				if (!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot1))
+				if (!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot1))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
 
 				if ((evtshared->ats_event & TRIGGER_EVENT_OPMASK) ==
 					TRIGGER_EVENT_UPDATE &&
-					!tuplestore_gettupleslot(fdw_tuplestore, true, false,
-											 trig_tuple_slot2))
+					!tuplestore_force_gettupleslot(fdw_tuplestore, true, false,
+												   trig_tuple_slot2))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
+				trig_tuple_slot1->tts_tid = event->ate_ctid1;
+				trig_tuple_slot2->tts_tid = event->ate_ctid2;
 			}
 			/* fall through */
 		case AFTER_TRIGGER_FDW_REUSE:
@@ -4334,13 +4388,26 @@ AfterTriggerExecute(EState *estate,
 			break;
 
 		default:
-			if (ItemPointerIsValid(&(event->ate_ctid1)))
+			ptr = (Pointer) event + MAXALIGN(BasicSizeofTriggerEvent(event));
+			if (ItemPointerIsValid(&(event->ate_ctid1)) ||
+				(event->ate_flags & AFTER_TRIGGER_ROWID1))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *src_slot = ExecGetTriggerOldSlot(estate,
 																 src_relInfo);
 
-				if (!table_tuple_fetch_row_version(src_rel,
-												   &(event->ate_ctid1),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID1)
+				{
+					tupleid = PointerGetDatum(ptr);
+					ptr += MAXALIGN(VARSIZE(ptr));
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid1));
+				}
+
+				if (!table_tuple_fetch_row_version(src_rel, tupleid,
 												   SnapshotAny,
 												   src_slot))
 					elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
@@ -4376,13 +4443,23 @@ AfterTriggerExecute(EState *estate,
 			/* don't touch ctid2 if not there */
 			if (((event->ate_flags & AFTER_TRIGGER_TUP_BITS) == AFTER_TRIGGER_2CTID ||
 				 (event->ate_flags & AFTER_TRIGGER_CP_UPDATE)) &&
-				ItemPointerIsValid(&(event->ate_ctid2)))
+				(ItemPointerIsValid(&(event->ate_ctid2)) ||
+				 (event->ate_flags & AFTER_TRIGGER_ROWID2)))
 			{
+				Datum		tupleid;
+
 				TupleTableSlot *dst_slot = ExecGetTriggerNewSlot(estate,
 																 dst_relInfo);
 
-				if (!table_tuple_fetch_row_version(dst_rel,
-												   &(event->ate_ctid2),
+				if (event->ate_flags & AFTER_TRIGGER_ROWID2)
+				{
+					tupleid = PointerGetDatum(ptr);
+				}
+				else
+				{
+					tupleid = PointerGetDatum(&(event->ate_ctid2));
+				}
+				if (!table_tuple_fetch_row_version(dst_rel, tupleid,
 												   SnapshotAny,
 												   dst_slot))
 					elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
@@ -4556,7 +4633,7 @@ afterTriggerMarkEvents(AfterTriggerEventList *events,
 		{
 			deferred_found = true;
 			/* add it to move_list */
-			afterTriggerAddEvent(move_list, event, evtshared);
+			afterTriggerAddEvent(move_list, event, evtshared, NULL, NULL);
 			/* mark original copy "done" so we don't do it again */
 			event->ate_flags |= AFTER_TRIGGER_DONE;
 		}
@@ -4659,6 +4736,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 					trigdesc = rInfo->ri_TrigDesc;
 					finfo = rInfo->ri_TrigFunctions;
 					instr = rInfo->ri_TrigInstrument;
+
 					if (slot1 != NULL)
 					{
 						ExecDropSingleTupleTableSlot(slot1);
@@ -6051,6 +6129,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	int			tgtype_level;
 	int			i;
 	Tuplestorestate *fdw_tuplestore = NULL;
+	bytea	   *rowId1 = NULL;
+	bytea	   *rowId2 = NULL;
 
 	/*
 	 * Check state.  We use a normal test not Assert because it is possible to
@@ -6144,6 +6224,12 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	 * if so.  This preserves the behavior that statement-level triggers fire
 	 * just once per statement and fire after row-level triggers.
 	 */
+
+	/* Determine flags */
+	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		new_event.ate_flags = (row_trigger && event == TRIGGER_EVENT_UPDATE) ?
+			AFTER_TRIGGER_2CTID : AFTER_TRIGGER_1CTID;
+
 	switch (event)
 	{
 		case TRIGGER_EVENT_INSERT:
@@ -6154,6 +6240,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(newslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6173,6 +6267,14 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				Assert(newslot == NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
+				if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+				{
+					bool		isnull;
+
+					rowId1 = DatumGetByteaP(slot_getsysattr(oldslot, RowIdAttributeNumber, &isnull));
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+					Assert(!isnull);
+				}
 			}
 			else
 			{
@@ -6188,10 +6290,57 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			tgtype_event = TRIGGER_TYPE_UPDATE;
 			if (row_trigger)
 			{
+				bool		src_rowid = false,
+							dst_rowid = false;
+
 				Assert(oldslot != NULL);
 				Assert(newslot != NULL);
 				ItemPointerCopy(&(oldslot->tts_tid), &(new_event.ate_ctid1));
 				ItemPointerCopy(&(newslot->tts_tid), &(new_event.ate_ctid2));
+				if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+				{
+					Relation	src_rel = src_partinfo->ri_RelationDesc;
+					Relation	dst_rel = dst_partinfo->ri_RelationDesc;
+
+					src_rowid = table_get_row_ref_type(src_rel) ==
+						ROW_REF_ROWID;
+					dst_rowid = table_get_row_ref_type(dst_rel) ==
+						ROW_REF_ROWID;
+				}
+				else
+				{
+					if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
+					{
+						src_rowid = true;
+						dst_rowid = true;
+					}
+				}
+
+				if (src_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(oldslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId1 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID1;
+				}
+
+				if (dst_rowid)
+				{
+					Datum		val;
+					bool		isnull;
+
+					val = slot_getsysattr(newslot,
+										  RowIdAttributeNumber,
+										  &isnull);
+					rowId2 = DatumGetByteaP(val);
+					Assert(!isnull);
+					new_event.ate_flags |= AFTER_TRIGGER_ROWID2;
+				}
 
 				/*
 				 * Also remember the OIDs of partitions to fetch these tuples
@@ -6229,20 +6378,6 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 			break;
 	}
 
-	/* Determine flags */
-	if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
-	{
-		if (row_trigger && event == TRIGGER_EVENT_UPDATE)
-		{
-			if (relkind == RELKIND_PARTITIONED_TABLE)
-				new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
-			else
-				new_event.ate_flags = AFTER_TRIGGER_2CTID;
-		}
-		else
-			new_event.ate_flags = AFTER_TRIGGER_1CTID;
-	}
-
 	/* else, we'll initialize ate_flags for each trigger */
 
 	tgtype_level = (row_trigger ? TRIGGER_TYPE_ROW : TRIGGER_TYPE_STATEMENT);
@@ -6387,6 +6522,20 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 				continue;		/* Uniqueness definitely not violated */
 		}
 
+		/* Determine flags */
+		if (!(relkind == RELKIND_FOREIGN_TABLE && row_trigger))
+		{
+			if (row_trigger && event == TRIGGER_EVENT_UPDATE)
+			{
+				if (relkind == RELKIND_PARTITIONED_TABLE)
+					new_event.ate_flags = AFTER_TRIGGER_CP_UPDATE;
+				else
+					new_event.ate_flags = AFTER_TRIGGER_2CTID;
+			}
+			else
+				new_event.ate_flags = AFTER_TRIGGER_1CTID;
+		}
+
 		/*
 		 * Fill in event structure and add it to the current query's queue.
 		 * Note we set ats_table to NULL whenever this trigger doesn't use
@@ -6408,7 +6557,7 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 		new_shared.ats_modifiedcols = afterTriggerCopyBitmap(modifiedCols);
 
 		afterTriggerAddEvent(&afterTriggers.query_stack[afterTriggers.query_depth].events,
-							 &new_event, &new_shared);
+							 &new_event, &new_shared, rowId1, rowId2);
 	}
 
 	/*
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24a3990a30..c8ce4d45ff 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -4888,7 +4888,9 @@ ExecEvalSysVar(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 						op->resnull);
 	*op->resvalue = d;
 	/* this ought to be unreachable, but it's cheap enough to check */
-	if (unlikely(*op->resnull))
+	if (op->d.var.attnum != RowIdAttributeNumber &&
+		op->d.var.attnum != SelfItemPointerAttributeNumber &&
+		unlikely(*op->resnull))
 		elog(ERROR, "failed to fetch attribute from slot");
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 3b03f03a98..514d9b28c4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -867,13 +867,15 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Oid			relid;
 			Relation	relation;
 			ExecRowMark *erm;
+			RangeTblEntry *rangeEntry;
 
 			/* ignore "parent" rowmarks; they are irrelevant at runtime */
 			if (rc->isParent)
 				continue;
 
 			/* get relation's OID (will produce InvalidOid if subquery) */
-			relid = exec_rt_fetch(rc->rti, estate)->relid;
+			rangeEntry = exec_rt_fetch(rc->rti, estate);
+			relid = rangeEntry->relid;
 
 			/*
 			 * Open relation, if we need to access it for this reference type.
@@ -903,7 +905,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			erm->prti = rc->prti;
 			erm->rowmarkId = rc->rowmarkId;
 			erm->markType = rc->markType;
-			erm->refType = rc->refType;
+			erm->refType = rangeEntry->reftype;
 			erm->strength = rc->strength;
 			erm->waitPolicy = rc->waitPolicy;
 			erm->ermActive = false;
@@ -1267,6 +1269,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 	resultRelInfo->ri_ChildToRootMap = NULL;
 	resultRelInfo->ri_ChildToRootMapValid = false;
 	resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
+	resultRelInfo->ri_RowRefType = table_get_row_ref_type(resultRelationDesc);
 }
 
 /*
@@ -2708,7 +2711,7 @@ EvalPlanQualFetchRowMark(EPQState *epqstate, Index rti, TupleTableSlot *slot)
 		{
 			/* ordinary table, fetch the tuple */
 			if (!table_tuple_fetch_row_version(erm->relation,
-											   (ItemPointer) DatumGetPointer(datum),
+											   datum,
 											   SnapshotAny, slot))
 				elog(ERROR, "failed to fetch tuple for EvalPlanQual recheck");
 			return true;
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index db685473fc..aad266a19f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -250,7 +250,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -434,7 +435,8 @@ retry:
 
 		PushActiveSnapshot(GetLatestSnapshot());
 
-		res = table_tuple_lock(rel, &(outslot->tts_tid), GetLatestSnapshot(),
+		res = table_tuple_lock(rel, PointerGetDatum(&(outslot->tts_tid)),
+							   GetLatestSnapshot(),
 							   outslot,
 							   GetCurrentCommandId(false),
 							   lockmode,
@@ -571,7 +573,8 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_update_before_row)
 	{
 		if (!ExecBRUpdateTriggers(estate, epqstate, resultRelInfo,
-								  tid, NULL, slot, NULL, NULL))
+								  PointerGetDatum(tid), NULL, slot,
+								  NULL, NULL))
 			skip_tuple = true;	/* "do nothing" */
 	}
 
@@ -638,7 +641,8 @@ ExecSimpleRelationDelete(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigDesc->trig_delete_before_row)
 	{
 		skip_tuple = !ExecBRDeleteTriggers(estate, epqstate, resultRelInfo,
-										   tid, NULL, NULL, NULL, NULL);
+										   PointerGetDatum(tid), NULL, NULL,
+										   NULL, NULL);
 	}
 
 	if (!skip_tuple)
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 41754ddfea..2d3ad904a6 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -27,6 +27,7 @@
 #include "executor/nodeLockRows.h"
 #include "foreign/fdwapi.h"
 #include "miscadmin.h"
+#include "utils/datum.h"
 #include "utils/rel.h"
 
 
@@ -157,7 +158,16 @@ lnext:
 		}
 
 		/* okay, try to lock (and fetch) the tuple */
-		tid = *((ItemPointer) DatumGetPointer(datum));
+		if (erm->refType == ROW_REF_TID)
+		{
+			tid = *((ItemPointer) DatumGetPointer(datum));
+			datum = PointerGetDatum(&tid);
+		}
+		else
+		{
+			Assert(erm->refType == ROW_REF_ROWID);
+			datum = datumCopy(datum, false, -1);
+		}
 		switch (erm->markType)
 		{
 			case ROW_MARK_EXCLUSIVE:
@@ -182,12 +192,15 @@ lnext:
 		if (!IsolationUsesXactSnapshot())
 			lockflags |= TUPLE_LOCK_FLAG_FIND_LAST_VERSION;
 
-		test = table_tuple_lock(erm->relation, &tid, estate->es_snapshot,
+		test = table_tuple_lock(erm->relation, datum, estate->es_snapshot,
 								markSlot, estate->es_output_cid,
 								lockmode, erm->waitPolicy,
 								lockflags,
 								&tmfd);
 
+		if (erm->refType == ROW_REF_ROWID)
+			pfree(DatumGetPointer(datum));
+
 		switch (test)
 		{
 			case TM_WouldBlock:
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a64e37e9af..90eeb99b2c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -124,7 +124,7 @@ static void ExecPendingInserts(EState *estate);
 static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   ResultRelInfo *sourcePartInfo,
 											   ResultRelInfo *destPartInfo,
-											   ItemPointer tupleid,
+											   Datum tupleid,
 											   TupleTableSlot *oldslot,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
@@ -141,13 +141,13 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 static TupleTableSlot *ExecMerge(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple oldtuple,
 								 bool canSetTag);
 static void ExecInitMerge(ModifyTableState *mtstate, EState *estate);
 static TupleTableSlot *ExecMergeMatched(ModifyTableContext *context,
 										ResultRelInfo *resultRelInfo,
-										ItemPointer tupleid,
+										Datum tupleid,
 										HeapTuple oldtuple,
 										bool canSetTag,
 										bool *matched);
@@ -1221,7 +1221,7 @@ ExecPendingInserts(EState *estate)
  */
 static bool
 ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   Datum tupleid, HeapTuple oldtuple,
 				   TupleTableSlot **epqreturnslot, TM_Result *result)
 {
 	if (result)
@@ -1252,7 +1252,7 @@ ExecDeletePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, bool changingPart, int options,
+			  Datum tupleid, bool changingPart, int options,
 			  TupleTableSlot *oldSlot)
 {
 	EState	   *estate = context->estate;
@@ -1280,7 +1280,7 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static void
 ExecDeleteEpilogue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple,
+				   HeapTuple oldtuple,
 				   TupleTableSlot *slot, bool changingPart)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -1361,7 +1361,7 @@ ExecInitDeleteTupleSlot(ModifyTableState *mtstate,
 static TupleTableSlot *
 ExecDelete(ModifyTableContext *context,
 		   ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid,
+		   Datum tupleid,
 		   HeapTuple oldtuple,
 		   TupleTableSlot *oldslot,
 		   bool processReturning,
@@ -1558,7 +1558,7 @@ ldelete:
 	if (tupleDeleted)
 		*tupleDeleted = true;
 
-	ExecDeleteEpilogue(context, resultRelInfo, tupleid, oldtuple,
+	ExecDeleteEpilogue(context, resultRelInfo, oldtuple,
 					   oldslot, changingPart);
 
 	/* Process RETURNING if present and if requested */
@@ -1575,7 +1575,7 @@ ldelete:
 			/* FDW must have provided a slot containing the deleted row */
 			Assert(!TupIsNull(slot));
 		}
-		else
+		else if (!slot || TupIsNull(slot))
 		{
 			/* Copy old tuple to the returning slot */
 			slot = ExecGetReturningSlot(estate, resultRelInfo);
@@ -1624,7 +1624,7 @@ ldelete:
 static bool
 ExecCrossPartitionUpdate(ModifyTableContext *context,
 						 ResultRelInfo *resultRelInfo,
-						 ItemPointer tupleid, HeapTuple oldtuple,
+						 Datum tupleid, HeapTuple oldtuple,
 						 TupleTableSlot *slot,
 						 bool canSetTag,
 						 UpdateContext *updateCxt,
@@ -1783,7 +1783,7 @@ ExecCrossPartitionUpdate(ModifyTableContext *context,
  */
 static bool
 ExecUpdatePrologue(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+				   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 				   TM_Result *result)
 {
 	Relation	resultRelationDesc = resultRelInfo->ri_RelationDesc;
@@ -1860,7 +1860,7 @@ ExecUpdatePrepareSlot(ResultRelInfo *resultRelInfo,
  */
 static TM_Result
 ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-			  ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+			  Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 			  bool canSetTag, int options, TupleTableSlot *oldSlot,
 			  UpdateContext *updateCxt)
 {
@@ -2014,7 +2014,7 @@ lreplace:
  */
 static void
 ExecUpdateEpilogue(ModifyTableContext *context, UpdateContext *updateCxt,
-				   ResultRelInfo *resultRelInfo, ItemPointer tupleid,
+				   ResultRelInfo *resultRelInfo,
 				   HeapTuple oldtuple, TupleTableSlot *slot,
 				   TupleTableSlot *oldslot)
 {
@@ -2064,7 +2064,7 @@ static void
 ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 								   ResultRelInfo *sourcePartInfo,
 								   ResultRelInfo *destPartInfo,
-								   ItemPointer tupleid,
+								   Datum tupleid,
 								   TupleTableSlot *oldslot,
 								   TupleTableSlot *newslot)
 {
@@ -2154,7 +2154,7 @@ ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		   ItemPointer tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
+		   Datum tupleid, HeapTuple oldtuple, TupleTableSlot *slot,
 		   TupleTableSlot *oldslot, bool canSetTag, bool locked)
 {
 	EState	   *estate = context->estate;
@@ -2208,15 +2208,19 @@ ExecUpdate(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	}
 	else
 	{
-		int			options = TABLE_MODIFY_WAIT | TABLE_MODIFY_FETCH_OLD_TUPLE;
+		int			options = TABLE_MODIFY_WAIT;
 
 		/*
 		 * Specify that we need to lock and fetch the last tuple version for
 		 * EPQ on appropriate transaction isolation levels if the tuple isn't
 		 * locked already.
 		 */
-		if (!locked && !IsolationUsesXactSnapshot())
-			options |= TABLE_MODIFY_LOCK_UPDATED;
+		if (!locked)
+		{
+			options |= TABLE_MODIFY_FETCH_OLD_TUPLE;
+			if (!IsolationUsesXactSnapshot())
+				options |= TABLE_MODIFY_LOCK_UPDATED;
+		}
 
 		/*
 		 * If we generate a new candidate tuple after EvalPlanQual testing, we
@@ -2326,7 +2330,7 @@ redo_act:
 	if (canSetTag)
 		(estate->es_processed)++;
 
-	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, tupleid, oldtuple,
+	ExecUpdateEpilogue(context, &updateCxt, resultRelInfo, oldtuple,
 					   slot, oldslot);
 
 	/* Process RETURNING if present */
@@ -2358,7 +2362,19 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	ItemPointer conflictTid = &existing->tts_tid;
+	Datum		tupleid;
+
+	if (table_get_row_ref_type(resultRelInfo->ri_RelationDesc) == ROW_REF_ROWID)
+	{
+		bool		isnull;
+
+		tupleid = slot_getsysattr(existing, RowIdAttributeNumber, &isnull);
+		Assert(!isnull);
+	}
+	else
+	{
+		tupleid = PointerGetDatum(&existing->tts_tid);
+	}
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
@@ -2414,7 +2430,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
 
 	/* Execute UPDATE with projection */
 	*returning = ExecUpdate(context, resultRelInfo,
-							conflictTid, NULL,
+							tupleid, NULL,
 							resultRelInfo->ri_onConflict->oc_ProjSlot,
 							existing,
 							canSetTag, true);
@@ -2433,7 +2449,7 @@ ExecOnConflictUpdate(ModifyTableContext *context,
  */
 static TupleTableSlot *
 ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-		  ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag)
+		  Datum tupleid, HeapTuple oldtuple, bool canSetTag)
 {
 	TupleTableSlot *rslot = NULL;
 	bool		matched;
@@ -2482,7 +2498,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * from ExecMergeNotMatched to ExecMergeMatched, there is no risk of a
 	 * livelock.
 	 */
-	matched = tupleid != NULL || oldtuple != NULL;
+	matched = DatumGetPointer(tupleid) != NULL || oldtuple != NULL;
 	if (matched)
 		rslot = ExecMergeMatched(context, resultRelInfo, tupleid, oldtuple,
 								 canSetTag, &matched);
@@ -2523,7 +2539,7 @@ ExecMerge(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
  */
 static TupleTableSlot *
 ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
-				 ItemPointer tupleid, HeapTuple oldtuple, bool canSetTag,
+				 Datum tupleid, HeapTuple oldtuple, bool canSetTag,
 				 bool *matched)
 {
 	ModifyTableState *mtstate = context->mtstate;
@@ -2559,7 +2575,7 @@ ExecMergeMatched(ModifyTableContext *context, ResultRelInfo *resultRelInfo,
 	 * the tupleid of the target row, or an old tuple from the target wholerow
 	 * junk attr.
 	 */
-	Assert(tupleid != NULL || oldtuple != NULL);
+	Assert(DatumGetPointer(tupleid) != NULL || oldtuple != NULL);
 	if (oldtuple != NULL)
 		ExecForceStoreHeapTuple(oldtuple, resultRelInfo->ri_oldTupleSlot,
 								false);
@@ -2573,7 +2589,7 @@ lmerge_matched:
 	 * EvalPlanQual returns us a new tuple, which may not be visible to our
 	 * MVCC snapshot.
 	 */
-	if (tupleid != NULL)
+	if (DatumGetPointer(tupleid) != NULL)
 	{
 		if (!table_tuple_fetch_row_version(resultRelInfo->ri_RelationDesc,
 										   tupleid,
@@ -2682,7 +2698,7 @@ lmerge_matched:
 				if (result == TM_Ok)
 				{
 					ExecUpdateEpilogue(context, &updateCxt, resultRelInfo,
-									   tupleid, NULL, newslot,
+									   NULL, newslot,
 									   resultRelInfo->ri_oldTupleSlot);
 					mtstate->mt_merge_updated += 1;
 				}
@@ -2718,7 +2734,7 @@ lmerge_matched:
 
 				if (result == TM_Ok)
 				{
-					ExecDeleteEpilogue(context, resultRelInfo, tupleid, NULL,
+					ExecDeleteEpilogue(context, resultRelInfo, NULL,
 									   resultRelInfo->ri_oldTupleSlot, false);
 					mtstate->mt_merge_deleted += 1;
 				}
@@ -2842,9 +2858,13 @@ lmerge_matched:
 								return NULL;
 							}
 
-							(void) ExecGetJunkAttribute(epqslot,
-														resultRelInfo->ri_RowIdAttNo,
-														&isNull);
+							/*
+							 * Update tupleid to that of the new tuple, for
+							 * the refetch we do at the top.
+							 */
+							tupleid = ExecGetJunkAttribute(epqslot,
+														   resultRelInfo->ri_RowIdAttNo,
+														   &isNull);
 							if (isNull)
 							{
 								*matched = false;
@@ -2871,11 +2891,7 @@ lmerge_matched:
 							 * apply all the MATCHED rules again, to ensure
 							 * that the first qualifying WHEN MATCHED action
 							 * is executed.
-							 *
-							 * Update tupleid to that of the new tuple, for
-							 * the refetch we do at the top.
 							 */
-							ItemPointerCopy(&context->tmfd.ctid, tupleid);
 							goto lmerge_matched;
 
 						case TM_Deleted:
@@ -3413,10 +3429,10 @@ ExecModifyTable(PlanState *pstate)
 	PlanState  *subplanstate;
 	TupleTableSlot *slot;
 	TupleTableSlot *oldSlot;
+	Datum		tupleid;
 	ItemPointerData tuple_ctid;
 	HeapTupleData oldtupdata;
 	HeapTuple	oldtuple;
-	ItemPointer tupleid;
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3465,6 +3481,8 @@ ExecModifyTable(PlanState *pstate)
 	 */
 	for (;;)
 	{
+		RowRefType	refType;
+
 		/*
 		 * Reset the per-output-tuple exprcontext.  This is needed because
 		 * triggers expect to use that context as workspace.  It's a bit ugly
@@ -3515,7 +3533,7 @@ ExecModifyTable(PlanState *pstate)
 					EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 					slot = ExecMerge(&context, node->resultRelInfo,
-									 NULL, NULL, node->canSetTag);
+									 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 					/*
 					 * If we got a RETURNING result, return it to the caller.
@@ -3559,7 +3577,8 @@ ExecModifyTable(PlanState *pstate)
 		EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 		slot = context.planSlot;
 
-		tupleid = NULL;
+		refType = resultRelInfo->ri_RowRefType;
+		tupleid = PointerGetDatum(NULL);
 		oldtuple = NULL;
 
 		/*
@@ -3602,7 +3621,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3617,9 +3636,25 @@ ExecModifyTable(PlanState *pstate)
 					elog(ERROR, "ctid is NULL");
 				}
 
-				tupleid = (ItemPointer) DatumGetPointer(datum);
-				tuple_ctid = *tupleid;	/* be sure we don't free ctid!! */
-				tupleid = &tuple_ctid;
+				if (refType == ROW_REF_TID)
+				{
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "ctid is NULL");
+
+					tuple_ctid = *((ItemPointer) DatumGetPointer(datum));	/* be sure we don't free
+																			 * ctid!! */
+					tupleid = PointerGetDatum(&tuple_ctid);
+				}
+				else
+				{
+					Assert(refType == ROW_REF_ROWID);
+					/* shouldn't ever get a null result... */
+					if (isNull)
+						elog(ERROR, "rowid is NULL");
+
+					tupleid = datumCopy(datum, false, -1);
+				}
 			}
 
 			/*
@@ -3659,7 +3694,7 @@ ExecModifyTable(PlanState *pstate)
 						EvalPlanQualSetSlot(&node->mt_epqstate, context.planSlot);
 
 						slot = ExecMerge(&context, node->resultRelInfo,
-										 NULL, NULL, node->canSetTag);
+										 PointerGetDatum(NULL), NULL, node->canSetTag);
 
 						/*
 						 * If we got a RETURNING result, return it to the
@@ -3723,6 +3758,7 @@ ExecModifyTable(PlanState *pstate)
 					/* Fetch the most recent version of old tuple. */
 					Relation	relation = resultRelInfo->ri_RelationDesc;
 
+					Assert(DatumGetPointer(tupleid) != NULL);
 					if (!table_tuple_fetch_row_version(relation, tupleid,
 													   SnapshotAny,
 													   oldSlot))
@@ -3757,6 +3793,9 @@ ExecModifyTable(PlanState *pstate)
 				break;
 		}
 
+		if (refType == ROW_REF_ROWID && DatumGetPointer(tupleid) != NULL)
+			pfree(DatumGetPointer(tupleid));
+
 		/*
 		 * If we got a RETURNING result, return it to caller.  We'll continue
 		 * the work on next call.
@@ -4000,10 +4039,20 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 				relkind == RELKIND_MATVIEW ||
 				relkind == RELKIND_PARTITIONED_TABLE)
 			{
-				resultRelInfo->ri_RowIdAttNo =
-					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
-				if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
-					elog(ERROR, "could not find junk ctid column");
+				if (resultRelInfo->ri_RowRefType == ROW_REF_TID)
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk ctid column");
+				}
+				else
+				{
+					resultRelInfo->ri_RowIdAttNo =
+						ExecFindJunkAttributeInTlist(subplan->targetlist, "rowid");
+					if (!AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo))
+						elog(ERROR, "could not find junk rowid column");
+				}
 			}
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
@@ -4313,6 +4362,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 		estate->es_auxmodifytables = lcons(mtstate,
 										   estate->es_auxmodifytables);
 
+
+
 	return mtstate;
 }
 
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 864a9013b6..f4a124ac4e 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -377,7 +377,7 @@ TidNext(TidScanState *node)
 		if (node->tss_isCurrentOf)
 			table_tuple_get_latest_tid(scan, &tid);
 
-		if (table_tuple_fetch_row_version(heapRelation, &tid, snapshot, slot))
+		if (table_tuple_fetch_row_version(heapRelation, PointerGetDatum(&tid), snapshot, slot))
 			return slot;
 
 		/* Bad TID or failed snapshot qual; try next */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4b9c9deee8..ee648bedd4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2376,19 +2376,24 @@ select_rowmark_type(RangeTblEntry *rte, LockClauseStrength strength,
 	{
 		/* Let the FDW select the rowmark type, if it wants to */
 		FdwRoutine *fdwroutine = GetFdwRoutineByRelId(rte->relid);
+		RowMarkType result = ROW_MARK_REFERENCE;
 
 		/* Set row reference type as ROW_REF_COPY by default */
 		*refType = ROW_REF_COPY;
 
 		if (fdwroutine->GetForeignRowMarkType != NULL)
-			return fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+			result = fdwroutine->GetForeignRowMarkType(rte, strength, refType);
+
+		/* XXX: should we fill this before? */
+		rte->reftype = *refType;
+
 		/* Otherwise, use ROW_MARK_REFERENCE by default */
-		return ROW_MARK_REFERENCE;
+		return result;
 	}
 	else
 	{
 		/* Regular table, apply the appropriate lock type */
-		*refType = ROW_REF_TID;
+		*refType = rte->reftype;
 		switch (strength)
 		{
 			case LCS_NONE:
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 4599b0dc76..3620be5b52 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -226,6 +226,22 @@ preprocess_targetlist(PlannerInfo *root)
 								  true);
 			tlist = lappend(tlist, tle);
 		}
+		if (rc->allRefTypes & (1 << ROW_REF_ROWID))
+		{
+			/* Need to fetch RowID */
+			var = makeVar(rc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", rc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			tlist = lappend(tlist, tle);
+		}
 		if (rc->allRefTypes & (1 << ROW_REF_COPY))
 		{
 			/* Need the whole row as a junk var */
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index 6ba4eba224..83c08bbd0e 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "foreign/fdwapi.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
@@ -895,17 +896,35 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_PARTITIONED_TABLE)
 	{
+		RowRefType	refType = ROW_REF_TID;
+
+		refType = table_get_row_ref_type(target_relation);
+
 		/*
 		 * Emit CTID so that executor can find the row to merge, update or
 		 * delete.
 		 */
-		var = makeVar(rtindex,
-					  SelfItemPointerAttributeNumber,
-					  TIDOID,
-					  -1,
-					  InvalidOid,
-					  0);
-		add_row_identity_var(root, var, rtindex, "ctid");
+		if (refType == ROW_REF_TID)
+		{
+			var = makeVar(rtindex,
+						  SelfItemPointerAttributeNumber,
+						  TIDOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "ctid");
+		}
+		else
+		{
+			Assert(refType == ROW_REF_ROWID);
+			var = makeVar(rtindex,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			add_row_identity_var(root, var, rtindex, "rowid");
+		}
 	}
 	else if (relkind == RELKIND_FOREIGN_TABLE)
 	{
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index b4b076d1cb..4a5a167d83 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -16,6 +16,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_type.h"
@@ -282,6 +283,24 @@ expand_inherited_rtentry(PlannerInfo *root, RelOptInfo *rel,
 			newvars = lappend(newvars, var);
 		}
 
+		if ((new_allRefTypes & (1 << ROW_REF_ROWID)) &&
+			!(old_allRefTypes & (1 << ROW_REF_ROWID)))
+		{
+			var = makeVar(oldrc->rti,
+						  RowIdAttributeNumber,
+						  BYTEAOID,
+						  -1,
+						  InvalidOid,
+						  0);
+			snprintf(resname, sizeof(resname), "rowid%u", oldrc->rowmarkId);
+			tle = makeTargetEntry((Expr *) var,
+								  list_length(root->processed_tlist) + 1,
+								  pstrdup(resname),
+								  true);
+			root->processed_tlist = lappend(root->processed_tlist, tle);
+			newvars = lappend(newvars, var);
+		}
+
 		/* Add tableoid junk Var, unless we had it already */
 		if (!old_isParent)
 		{
@@ -485,6 +504,7 @@ expand_single_inheritance_child(PlannerInfo *root, RangeTblEntry *parentrte,
 	Assert(parentrte->rtekind == RTE_RELATION); /* else this is dubious */
 	childrte->relid = childOID;
 	childrte->relkind = childrel->rd_rel->relkind;
+	childrte->reftype = table_get_row_ref_type(childrel);
 	/* A partitioned child will need to be expanded further. */
 	if (childrte->relkind == RELKIND_PARTITIONED_TABLE)
 	{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 427b7325db..2c80e010f2 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -20,6 +20,7 @@
 #include "access/relation.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/heap.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
@@ -1503,6 +1504,7 @@ addRangeTableEntry(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1588,6 +1590,7 @@ addRangeTableEntryForRelation(ParseState *pstate,
 	rte->inh = inh;
 	rte->relkind = rel->rd_rel->relkind;
 	rte->rellockmode = lockmode;
+	rte->reftype = table_get_row_ref_type(rel);
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -1656,6 +1659,7 @@ addRangeTableEntryForSubquery(ParseState *pstate,
 	rte->rtekind = RTE_SUBQUERY;
 	rte->subquery = subquery;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_subquery", NIL);
 	numaliases = list_length(eref->colnames);
@@ -1763,6 +1767,7 @@ addRangeTableEntryForFunction(ParseState *pstate,
 	rte->functions = NIL;		/* we'll fill this list below */
 	rte->funcordinality = rangefunc->ordinality;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Choose the RTE alias name.  We default to using the first function's
@@ -2081,6 +2086,7 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
@@ -2156,6 +2162,7 @@ addRangeTableEntryForValues(ParseState *pstate,
 	rte->coltypmods = coltypmods;
 	rte->colcollations = colcollations;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 
@@ -2252,6 +2259,7 @@ addRangeTableEntryForJoin(ParseState *pstate,
 	rte->joinrightcols = rightcols;
 	rte->join_using_alias = join_using_alias;
 	rte->alias = alias;
+	rte->reftype = ROW_REF_COPY;
 
 	eref = alias ? copyObject(alias) : makeAlias("unnamed_join", NIL);
 	numaliases = list_length(eref->colnames);
@@ -2332,6 +2340,7 @@ addRangeTableEntryForCTE(ParseState *pstate,
 	rte->rtekind = RTE_CTE;
 	rte->ctename = cte->ctename;
 	rte->ctelevelsup = levelsup;
+	rte->reftype = ROW_REF_COPY;
 
 	/* Self-reference if and only if CTE's parse analysis isn't completed */
 	rte->self_reference = !IsA(cte->ctequery, Query);
@@ -2494,6 +2503,7 @@ addRangeTableEntryForENR(ParseState *pstate,
 	 * if they access transition tables linked to a table that is altered.
 	 */
 	rte->relid = enrmd->reliddesc;
+	rte->reftype = ROW_REF_COPY;
 
 	/*
 	 * Build the list of effective column names using user-supplied aliases
@@ -3257,6 +3267,9 @@ get_rte_attribute_name(RangeTblEntry *rte, AttrNumber attnum)
 		attnum > 0 && attnum <= list_length(rte->alias->colnames))
 		return strVal(list_nth(rte->alias->colnames, attnum - 1));
 
+	if (attnum == RowIdAttributeNumber)
+		return "rowid";
+
 	/*
 	 * If the RTE is a relation, go to the system catalogs not the
 	 * eref->colnames list.  This is a little slower but it will give the
diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index 9fd05b15e7..7a0fdbe3f4 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -1854,6 +1854,7 @@ ApplyRetrieveRule(Query *parsetree,
 	rte = rt_fetch(rt_index, parsetree->rtable);
 
 	rte->rtekind = RTE_SUBQUERY;
+	rte->reftype = ROW_REF_COPY;
 	rte->subquery = rule_action;
 	rte->security_barrier = RelationIsSecurityView(relation);
 
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 947a868e56..d3a4153355 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1100,6 +1100,36 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+/*
+ * Same as tuplestore_gettupleslot(), but forces tuple storage into the slot.
+ * Thus, it can work with slot types other than minimal tuple.
+ */
+bool
+tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+							  bool copy, TupleTableSlot *slot)
+{
+	MinimalTuple tuple;
+	bool		should_free;
+
+	tuple = (MinimalTuple) tuplestore_gettuple(state, forward, &should_free);
+
+	if (tuple)
+	{
+		if (copy && !should_free)
+		{
+			tuple = heap_copy_minimal_tuple(tuple);
+			should_free = true;
+		}
+		ExecForceStoreMinimalTuple(tuple, slot, should_free);
+		return true;
+	}
+	else
+	{
+		ExecClearTuple(slot);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
diff --git a/src/include/access/sysattr.h b/src/include/access/sysattr.h
index e88dec71ee..867b5eb489 100644
--- a/src/include/access/sysattr.h
+++ b/src/include/access/sysattr.h
@@ -24,6 +24,7 @@
 #define MaxTransactionIdAttributeNumber			(-4)
 #define MaxCommandIdAttributeNumber				(-5)
 #define TableOidAttributeNumber					(-6)
-#define FirstLowInvalidHeapAttributeNumber		(-7)
+#define RowIdAttributeNumber					(-7)
+#define FirstLowInvalidHeapAttributeNumber		(-8)
 
 #endif							/* SYSATTR_H */
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7f97af067f..4ad6ebb104 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -476,7 +476,7 @@ typedef struct TableAmRoutine
 	 * test, returns true, false otherwise.
 	 */
 	bool		(*tuple_fetch_row_version) (Relation rel,
-											ItemPointer tid,
+											Datum tupleid,
 											Snapshot snapshot,
 											TupleTableSlot *slot);
 
@@ -535,7 +535,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_delete() for reference about parameters */
 	TM_Result	(*tuple_delete) (Relation rel,
-								 ItemPointer tid,
+								 Datum tupleid,
 								 CommandId cid,
 								 Snapshot snapshot,
 								 Snapshot crosscheck,
@@ -546,7 +546,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_update() for reference about parameters */
 	TM_Result	(*tuple_update) (Relation rel,
-								 ItemPointer otid,
+								 Datum tupleid,
 								 TupleTableSlot *slot,
 								 CommandId cid,
 								 Snapshot snapshot,
@@ -559,7 +559,7 @@ typedef struct TableAmRoutine
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
-							   ItemPointer tid,
+							   Datum tupleid,
 							   Snapshot snapshot,
 							   TupleTableSlot *slot,
 							   CommandId cid,
@@ -702,6 +702,11 @@ typedef struct TableAmRoutine
 	 * ------------------------------------------------------------------------
 	 */
 
+	/*
+	 * Get the type of row identifier in the table.
+	 */
+	RowRefType	(*get_row_ref_type) (Relation rel);
+
 	/*
 	 * This callback frees relation private cache data stored in rd_amcache.
 	 * After the call all memory related to rd_amcache must be freed,
@@ -1284,9 +1289,9 @@ extern bool table_index_fetch_tuple_check(Relation rel,
 
 
 /*
- * Fetch tuple at `tid` into `slot`, after doing a visibility test according to
- * `snapshot`. If a tuple was found and passed the visibility test, returns
- * true, false otherwise.
+ * Fetch tuple identified by `tupleid` into `slot`, after doing a visibility
+ * test according to `snapshot`. If a tuple was found and passed the visibility
+ * test, returns true, false otherwise.
  *
  * See table_index_fetch_tuple's comment about what the difference between
  * these functions is. It is correct to use this function outside of index
@@ -1294,7 +1299,7 @@ extern bool table_index_fetch_tuple_check(Relation rel,
  */
 static inline bool
 table_tuple_fetch_row_version(Relation rel,
-							  ItemPointer tid,
+							  Datum tupleid,
 							  Snapshot snapshot,
 							  TupleTableSlot *slot)
 {
@@ -1306,7 +1311,8 @@ table_tuple_fetch_row_version(Relation rel,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
 
-	return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
+	return rel->rd_tableam->tuple_fetch_row_version(rel, tupleid,
+													snapshot, slot);
 }
 
 /*
@@ -1493,7 +1499,7 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	tid - TID of tuple to be deleted
+ *	tupleid - identifier of tuple to be deleted
  *	cid - delete command ID (used for visibility test, and stored into
  *		cmax if successful)
  *	crosscheck - if not InvalidSnapshot, also check tuple against this
@@ -1522,12 +1528,12 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
+table_tuple_delete(Relation rel, Datum tupleid, CommandId cid,
 				   Snapshot snapshot, Snapshot crosscheck, int options,
 				   TM_FailureData *tmfd, bool changingPart,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_delete(rel, tid, cid,
+	return rel->rd_tableam->tuple_delete(rel, tupleid, cid,
 										 snapshot, crosscheck,
 										 options, tmfd, changingPart,
 										 oldSlot);
@@ -1541,7 +1547,7 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  *
  * Input parameters:
  *	relation - table to be modified (caller must hold suitable lock)
- *	otid - TID of old tuple to be replaced
+ *	tupleid - identifier of old tuple to be replaced
  *	slot - newly constructed tuple data to store
  *	cid - update command ID (used for visibility test, and stored into
  *		cmax/cmin if successful)
@@ -1578,13 +1584,13 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid,
  * for additional info.
  */
 static inline TM_Result
-table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
+table_tuple_update(Relation rel, Datum tupleid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   int options, TM_FailureData *tmfd, LockTupleMode *lockmode,
 				   TU_UpdateIndexes *update_indexes,
 				   TupleTableSlot *oldSlot)
 {
-	return rel->rd_tableam->tuple_update(rel, otid, slot,
+	return rel->rd_tableam->tuple_update(rel, tupleid, slot,
 										 cid, snapshot, crosscheck,
 										 options, tmfd,
 										 lockmode, update_indexes,
@@ -1596,7 +1602,7 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  *
  * Input parameters:
  *	relation: relation containing tuple (caller must hold suitable lock)
- *	tid: TID of tuple to lock
+ *	tupleid: identifier of tuple to lock
  *	snapshot: snapshot to use for visibility determinations
  *	cid: current command ID (used for visibility test, and stored into
  *		tuple's cmax if lock is successful)
@@ -1625,12 +1631,12 @@ table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
  * comments for struct TM_FailureData for additional info.
  */
 static inline TM_Result
-table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
+table_tuple_lock(Relation rel, Datum tupleid, Snapshot snapshot,
 				 TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
 				 LockWaitPolicy wait_policy, uint8 flags,
 				 TM_FailureData *tmfd)
 {
-	return rel->rd_tableam->tuple_lock(rel, tid, snapshot, slot,
+	return rel->rd_tableam->tuple_lock(rel, tupleid, snapshot, slot,
 									   cid, mode, wait_policy,
 									   flags, tmfd);
 }
@@ -1916,6 +1922,22 @@ table_define_index(Relation rel, Oid indoid, bool reindex,
  * ----------------------------------------------------------------------------
  */
 
+/*
+ * Get the type of row identifier.  When the table AM routine is not accessible,
+ * return ROW_REF_COPY for foreign tables and ROW_REF_TID otherwise (e.g. during
+ * catalog initialization, where all catalog tables are known to use heap).
+ */
+static inline RowRefType
+table_get_row_ref_type(Relation rel)
+{
+	if (rel->rd_tableam)
+		return rel->rd_tableam->get_row_ref_type(rel);
+	else if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+		return ROW_REF_COPY;
+	else
+		return ROW_REF_TID;
+}
+
 /*
  * Frees relation private cache data stored in rd_amcache.  Uses
  * free_rd_amcache method if provided.  Assumes rd_amcache to point to single
diff --git a/src/include/commands/trigger.h b/src/include/commands/trigger.h
index cb968d03ec..c16e6b6e5a 100644
--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -209,7 +209,7 @@ extern void ExecASDeleteTriggers(EState *estate,
 extern bool ExecBRDeleteTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot **epqslot,
 								 TM_Result *tmresult,
@@ -231,7 +231,7 @@ extern void ExecASUpdateTriggers(EState *estate,
 extern bool ExecBRUpdateTriggers(EState *estate,
 								 EPQState *epqstate,
 								 ResultRelInfo *relinfo,
-								 ItemPointer tupleid,
+								 Datum tupleid,
 								 HeapTuple fdw_trigtuple,
 								 TupleTableSlot *newslot,
 								 TM_Result *tmresult,
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b89baef95d..04d8cef6c6 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1089,6 +1089,8 @@ typedef struct RangeTblEntry
 	Index		perminfoindex pg_node_attr(query_jumble_ignore);
 	/* sampling info, or NULL */
 	struct TableSampleClause *tablesample;
+	/* row identifier type for the relation */
+	RowRefType	reftype;
 
 	/*
 	 * Fields valid for a subquery RTE (else NULL):
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d7f9c389da..d850411aa9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1323,27 +1323,6 @@ typedef enum RowMarkType
 	ROW_MARK_REFERENCE,			/* just fetch the TID, don't lock it */
 } RowMarkType;
 
-/*
- * RowRefType -
- *	  enums for types of row identifiers
- *
- * For plain tables we can just fetch the TID, much as for a target relation;
- * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
- * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
- * pretty inefficient, since most of the time we'll never need the data; but
- * fortunately the overhead is usually not performance-critical in practice.
- * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
- * a concept of rowid it can request to use ROW_REF_TID instead.
- * (Again, this probably doesn't make sense if a physical remote fetch is
- * needed, but for FDWs that map to local storage it might be credible.)
- * In future we may allow more types of row identifiers.
- */
-typedef enum RowRefType
-{
-	ROW_REF_TID,				/* Item pointer (block, offset) */
-	ROW_REF_COPY				/* Full row copy */
-} RowRefType;
-
 #define RowMarkRequiresRowShareLock(marktype)  ((marktype) <= ROW_MARK_KEYSHARE)
 
 /*
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 376f67e6a5..84cf7837de 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -2211,4 +2211,26 @@ typedef struct OnConflictExpr
 	List	   *exclRelTlist;	/* tlist of the EXCLUDED pseudo relation */
 } OnConflictExpr;
 
+/*
+ * RowRefType -
+ *	  enums for types of row identifiers
+ *
+ * For plain tables we can just fetch the TID, much as for a target relation;
+ * this case is represented by ROW_REF_TID.  Otherwise (for example for VALUES
+ * or FUNCTION scans) we have to copy the whole row value.  ROW_REF_COPY is
+ * pretty inefficient, since most of the time we'll never need the data; but
+ * fortunately the overhead is usually not performance-critical in practice.
+ * By default we use ROW_REF_COPY for foreign tables, but if the FDW has
+ * a concept of rowid it can request to use ROW_REF_TID instead.
+ * (Again, this probably doesn't make sense if a physical remote fetch is
+ * needed, but for FDWs that map to local storage it might be credible.)
+ * ROW_REF_ROWID identifies a row by a table-AM-defined bytea value.
+ */
+typedef enum RowRefType
+{
+	ROW_REF_TID,				/* Item pointer (block, offset) */
+	ROW_REF_ROWID,				/* Bytea row id */
+	ROW_REF_COPY				/* Full row copy */
+} RowRefType;
+
 #endif							/* PRIMNODES_H */
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 419613c17b..cf291a0d17 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -70,6 +70,9 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
 
+extern bool tuplestore_force_gettupleslot(Tuplestorestate *state, bool forward,
+										  bool copy, TupleTableSlot *slot);
+
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
 extern bool tuplestore_skiptuples(Tuplestorestate *state,
-- 
2.39.2 (Apple Git-143)
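
To illustrate the row-identifier abstraction introduced by the patch above (an
editorial sketch only, not part of the patch set): a table AM opts into bytea
identifiers through the new get_row_ref_type() callback, and executor-level
code then obtains the tuple identifier as a Datum rather than an ItemPointer.
The my_am_* names below are hypothetical; the slot_getsysattr()/
RowIdAttributeNumber and tts_tid usage mirrors the nodeModifyTable.c changes
shown in the diff.

/*
 * Editorial sketch, not part of the patches.  A hypothetical table AM "my_am"
 * reports bytea row identifiers; the helper below shows the caller-side
 * pattern for building the tupleid Datum, as ExecOnConflictUpdate() and
 * ExecModifyTable() now do.
 */
#include "postgres.h"

#include "access/sysattr.h"
#include "access/tableam.h"
#include "executor/tuptable.h"

/* Table AM callback: rows are addressed by a bytea RowID, not by a TID */
static RowRefType
my_am_get_row_ref_type(Relation rel)
{
	return ROW_REF_ROWID;
}

/* Build a tupleid Datum for the row held in "slot", as the executor now does */
static Datum
my_am_tupleid_from_slot(Relation rel, TupleTableSlot *slot)
{
	if (table_get_row_ref_type(rel) == ROW_REF_ROWID)
	{
		bool		isnull;
		Datum		rowid;

		/* the AM exposes its identifier as the rowid system attribute */
		rowid = slot_getsysattr(slot, RowIdAttributeNumber, &isnull);
		Assert(!isnull);
		return rowid;
	}

	/* heap-like AMs keep passing the item pointer stored in the slot */
	return PointerGetDatum(&slot->tts_tid);
}

Passing a Datum everywhere lets heap keep its cheap ItemPointer behaviour,
while bytea-based AMs pay a datumCopy() only where the executor must retain
the identifier, as in the nodeLockRows.c and ExecModifyTable() hunks above.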

v8-0003-Generalize-table-AM-API-for-INSERT-.-ON-CONFLICT.patch (application/octet-stream)
From 8ad65fded0b1b7a9825920f4f17293732bf1b436 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Fri, 9 Jun 2023 00:05:52 +0300
Subject: [PATCH v8 3/8] Generalize table AM API for INSERT ... ON CONFLICT ...

Currently, all table AMs have to implement INSERT ... ON CONFLICT ... with
speculative tokens; the only thing an AM can customize is how those tokens are
implemented, via the tuple_insert_speculative() and
tuple_complete_speculative() API functions.

This commit changes the INSERT ... ON CONFLICT ... implementation to use a
single tuple_insert_with_arbiter() API function, which encapsulates the whole
algorithm.  This new function has clear semantics and lets table AMs provide
their own implementations of the INSERT ... ON CONFLICT ... functionality.
---
 src/backend/access/heap/heapam_handler.c | 281 ++++++++++++++++++++++-
 src/backend/access/table/tableamapi.c    |   3 +-
 src/backend/executor/nodeModifyTable.c   | 270 ++--------------------
 src/include/access/tableam.h             |  84 +++----
 4 files changed, 348 insertions(+), 290 deletions(-)
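
Before the diff itself, a rough sketch of the contract the new callback
implies for an alternative AM (editorial illustration only, not part of the
patch).  It is inferred from heapam_tuple_insert_with_arbiter() and the
reworked ExecInsert() below: return the slot when the row was inserted; on
finding a committed conflict, either lock it into lockedSlot (ON CONFLICT DO
UPDATE) or fetch it into tempSlot for the visibility check (DO NOTHING) and
return NULL.  The my_am_* names and my_am_insert_internal() are hypothetical.

/*
 * Editorial sketch, not part of the patch: a minimal shape for the new
 * callback in a hypothetical AM that detects conflicts without speculative
 * tokens.  Retries, error handling, and serializability checks (see
 * ExecCheckTupleVisible() in the heapam version below) are omitted.
 */
#include "postgres.h"

#include "access/tableam.h"
#include "executor/executor.h"
#include "utils/snapmgr.h"

/* hypothetical AM-internal insertion routine, including index maintenance */
extern void my_am_insert_internal(Relation rel, TupleTableSlot *slot,
								  CommandId cid, int options,
								  struct BulkInsertStateData *bistate);

static TupleTableSlot *
my_am_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
								TupleTableSlot *slot,
								CommandId cid, int options,
								struct BulkInsertStateData *bistate,
								List *arbiterIndexes,
								EState *estate,
								LockTupleMode lockmode,
								TupleTableSlot *lockedSlot,
								TupleTableSlot *tempSlot)
{
	Relation	rel = resultRelInfo->ri_RelationDesc;
	ItemPointerData conflictTid;

	/* Probe the arbiter indexes for a conflicting, committed row. */
	if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
								   &conflictTid, arbiterIndexes))
	{
		if (lockedSlot)
		{
			TM_FailureData tmfd;

			/* ON CONFLICT DO UPDATE: hand the locked row back to ExecInsert() */
			if (table_tuple_lock(rel, &conflictTid, estate->es_snapshot,
								 lockedSlot, estate->es_output_cid,
								 lockmode, LockWaitBlock, 0, &tmfd) != TM_Ok)
				ExecClearTuple(lockedSlot); /* empty slot => caller retries */
		}
		else
		{
			/* ON CONFLICT DO NOTHING: fetch only for the visibility check */
			(void) table_tuple_fetch_row_version(rel, &conflictTid,
												 SnapshotAny, tempSlot);
		}
		return NULL;			/* nothing was inserted */
	}

	/* No conflict: do the AM-specific insertion and report success. */
	my_am_insert_internal(rel, slot, cid, options, bistate);
	return slot;
}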

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 26b3be9779..590413bab9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -304,6 +304,284 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
 		pfree(tuple);
 }
 
+/*
+ * ExecCheckTupleVisible -- verify tuple is visible
+ *
+ * It would not be consistent with guarantees of the higher isolation levels to
+ * proceed with avoiding insertion (taking speculative insertion's alternative
+ * path) on the basis of another tuple that is not visible to MVCC snapshot.
+ * Check for the need to raise a serialization failure, and do so as necessary.
+ */
+static void
+ExecCheckTupleVisible(EState *estate,
+					  Relation rel,
+					  TupleTableSlot *slot)
+{
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
+	{
+		Datum		xminDatum;
+		TransactionId xmin;
+		bool		isnull;
+
+		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
+		Assert(!isnull);
+		xmin = DatumGetTransactionId(xminDatum);
+
+		/*
+		 * We should not raise a serialization failure if the conflict is
+		 * against a tuple inserted by our own transaction, even if it's not
+		 * visible to our snapshot.  (This would happen, for example, if
+		 * conflicting keys are proposed for insertion in a single command.)
+		 */
+		if (!TransactionIdIsCurrentTransactionId(xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("could not serialize access due to concurrent update")));
+	}
+}
+
+/*
+ * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
+ */
+static void
+ExecCheckTIDVisible(EState *estate,
+					Relation rel,
+					ItemPointer tid,
+					TupleTableSlot *tempSlot)
+{
+	/* Redundantly check isolation level */
+	if (!IsolationUsesXactSnapshot())
+		return;
+
+	if (!table_tuple_fetch_row_version(rel, tid,
+									   SnapshotAny, tempSlot))
+		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
+	ExecCheckTupleVisible(estate, rel, tempSlot);
+	ExecClearTuple(tempSlot);
+}
+
+static inline TupleTableSlot *
+heapam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								 TupleTableSlot *slot,
+								 CommandId cid, int options,
+								 struct BulkInsertStateData *bistate,
+								 List *arbiterIndexes,
+								 EState *estate,
+								 LockTupleMode lockmode,
+								 TupleTableSlot *lockedSlot,
+								 TupleTableSlot *tempSlot)
+{
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+	uint32		specToken;
+	ItemPointerData conflictTid;
+	bool		specConflict;
+	List	   *recheckIndexes = NIL;
+
+	while (true)
+	{
+		specConflict = false;
+		if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
+									   arbiterIndexes))
+		{
+			if (lockedSlot)
+			{
+				TM_Result	test;
+				TM_FailureData tmfd;
+				Datum		xminDatum;
+				TransactionId xmin;
+				bool		isnull;
+
+				/* Determine lock mode to use */
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+
+				/*
+				 * Lock tuple for update.  Don't follow updates when tuple
+				 * cannot be locked without doing so.  A row locking conflict
+				 * here means our previous conclusion that the tuple is
+				 * conclusively committed is not true anymore.
+				 */
+				test = table_tuple_lock(rel, &conflictTid,
+										estate->es_snapshot,
+										lockedSlot, estate->es_output_cid,
+										lockmode, LockWaitBlock, 0,
+										&tmfd);
+				switch (test)
+				{
+					case TM_Ok:
+						/* success! */
+						break;
+
+					case TM_Invisible:
+
+						/*
+						 * This can occur when a just inserted tuple is
+						 * updated again in the same command. E.g. because
+						 * multiple rows with the same conflicting key values
+						 * are inserted.
+						 *
+						 * This is somewhat similar to the ExecUpdate()
+						 * TM_SelfModified case.  We do not want to proceed
+						 * because it would lead to the same row being updated
+						 * a second time in some unspecified order, and in
+						 * contrast to plain UPDATEs there's no historical
+						 * behavior to break.
+						 *
+						 * It is the user's responsibility to prevent this
+						 * situation from occurring.  These problems are why
+						 * the SQL standard similarly specifies that for SQL
+						 * MERGE, an exception must be raised in the event of
+						 * an attempt to update the same row twice.
+						 */
+						xminDatum = slot_getsysattr(lockedSlot,
+													MinTransactionIdAttributeNumber,
+													&isnull);
+						Assert(!isnull);
+						xmin = DatumGetTransactionId(xminDatum);
+
+						if (TransactionIdIsCurrentTransactionId(xmin))
+							ereport(ERROR,
+									(errcode(ERRCODE_CARDINALITY_VIOLATION),
+							/* translator: %s is a SQL command name */
+									 errmsg("%s command cannot affect row a second time",
+											"ON CONFLICT DO UPDATE"),
+									 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
+
+						/* This shouldn't happen */
+						elog(ERROR, "attempted to lock invisible tuple");
+						break;
+
+					case TM_SelfModified:
+
+						/*
+						 * This state should never be reached. As a dirty
+						 * snapshot is used to find conflicting tuples,
+						 * speculative insertion wouldn't have seen this row
+						 * to conflict with.
+						 */
+						elog(ERROR, "unexpected self-updated tuple");
+						break;
+
+					case TM_Updated:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent update")));
+
+						/*
+						 * As long as we don't support an UPDATE of INSERT ON
+						 * CONFLICT for a partitioned table we shouldn't reach
+						 * to a case where tuple to be lock is moved to
+						 * another partition due to concurrent update of the
+						 * partition key.
+						 */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+
+						/*
+						 * Tell caller to try again from the very start.
+						 *
+						 * It does not make sense to use the usual
+						 * EvalPlanQual() style loop here, as the new version
+						 * of the row might not conflict anymore, or the
+						 * conflicting tuple has actually been deleted.
+						 */
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					case TM_Deleted:
+						if (IsolationUsesXactSnapshot())
+							ereport(ERROR,
+									(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+									 errmsg("could not serialize access due to concurrent delete")));
+
+						/* see TM_Updated case */
+						Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
+						ExecClearTuple(lockedSlot);
+						return NULL;
+
+					default:
+						elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
+				}
+
+				/* Success, the tuple is locked. */
+
+				/*
+				 * Verify that the tuple is visible to our MVCC snapshot if
+				 * the current isolation level mandates that.
+				 *
+				 * It's not sufficient to rely on the check within
+				 * ExecUpdate() as e.g. CONFLICT ... WHERE clause may prevent
+				 * us from reaching that.
+				 *
+				 * This means we only ever continue when a new command in the
+				 * current transaction could see the row, even though in READ
+				 * COMMITTED mode the tuple will not be visible according to
+				 * the current statement's snapshot.  This is in line with the
+				 * way UPDATE deals with newer tuple versions.
+				 */
+				ExecCheckTupleVisible(estate, rel, lockedSlot);
+				return NULL;
+			}
+			else
+			{
+				ExecCheckTIDVisible(estate, rel, &conflictTid, tempSlot);
+				return NULL;
+			}
+		}
+
+		/*
+		 * Before we start insertion proper, acquire our "speculative
+		 * insertion lock".  Others can use that to wait for us to decide if
+		 * we're going to go ahead with the insertion, instead of waiting for
+		 * the whole transaction to complete.
+		 */
+		specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
+
+		/* insert the tuple, with the speculative token */
+		heapam_tuple_insert_speculative(rel, slot,
+										estate->es_output_cid,
+										0,
+										NULL,
+										specToken);
+
+		/* insert index entries for tuple */
+		recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
+											   slot, estate, false, true,
+											   &specConflict,
+											   arbiterIndexes,
+											   false);
+
+		/* adjust the tuple's state accordingly */
+		heapam_tuple_complete_speculative(rel, slot,
+										  specToken, !specConflict);
+
+		/*
+		 * Wake up anyone waiting for our decision.  They will re-check the
+		 * tuple, see that it's no longer speculative, and wait on our XID as
+		 * if this was a regularly inserted tuple all along.  Or if we killed
+		 * the tuple, they will see it's dead, and proceed as if the tuple
+		 * never existed.
+		 */
+		SpeculativeInsertionLockRelease(GetCurrentTransactionId());
+
+		/*
+		 * If there was a conflict, start from the beginning.  We'll do the
+		 * pre-check again, which will now find the conflicting tuple (unless
+		 * it aborts before we get there).
+		 */
+		if (specConflict)
+		{
+			list_free(recheckIndexes);
+			CHECK_FOR_INTERRUPTS();
+			continue;
+		}
+
+		return slot;
+	}
+}
+
 static TM_Result
 heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 					Snapshot snapshot, Snapshot crosscheck, int options,
@@ -2644,8 +2922,7 @@ static const TableAmRoutine heapam_methods = {
 	.index_fetch_tuple = heapam_index_fetch_tuple,
 
 	.tuple_insert = heapam_tuple_insert,
-	.tuple_insert_speculative = heapam_tuple_insert_speculative,
-	.tuple_complete_speculative = heapam_tuple_complete_speculative,
+	.tuple_insert_with_arbiter = heapam_tuple_insert_with_arbiter,
 	.multi_insert = heap_multi_insert,
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index d9e23ef317..c38ab936cd 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -70,8 +70,7 @@ GetTableAmRoutine(Oid amhandler)
 	 * Could be made optional, but would require throwing error during
 	 * parse-analysis.
 	 */
-	Assert(routine->tuple_insert_speculative != NULL);
-	Assert(routine->tuple_complete_speculative != NULL);
+	Assert(routine->tuple_insert_with_arbiter != NULL);
 
 	Assert(routine->multi_insert != NULL);
 	Assert(routine->tuple_delete != NULL);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d1917f2fea..8e1c8f697c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -129,7 +129,6 @@ static void ExecCrossPartitionUpdateForeignKey(ModifyTableContext *context,
 											   TupleTableSlot *newslot);
 static bool ExecOnConflictUpdate(ModifyTableContext *context,
 								 ResultRelInfo *resultRelInfo,
-								 ItemPointer conflictTid,
 								 TupleTableSlot *excludedSlot,
 								 bool canSetTag,
 								 TupleTableSlot **returning);
@@ -265,66 +264,6 @@ ExecProcessReturning(ResultRelInfo *resultRelInfo,
 	return ExecProject(projectReturning);
 }
 
-/*
- * ExecCheckTupleVisible -- verify tuple is visible
- *
- * It would not be consistent with guarantees of the higher isolation levels to
- * proceed with avoiding insertion (taking speculative insertion's alternative
- * path) on the basis of another tuple that is not visible to MVCC snapshot.
- * Check for the need to raise a serialization failure, and do so as necessary.
- */
-static void
-ExecCheckTupleVisible(EState *estate,
-					  Relation rel,
-					  TupleTableSlot *slot)
-{
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_satisfies_snapshot(rel, slot, estate->es_snapshot))
-	{
-		Datum		xminDatum;
-		TransactionId xmin;
-		bool		isnull;
-
-		xminDatum = slot_getsysattr(slot, MinTransactionIdAttributeNumber, &isnull);
-		Assert(!isnull);
-		xmin = DatumGetTransactionId(xminDatum);
-
-		/*
-		 * We should not raise a serialization failure if the conflict is
-		 * against a tuple inserted by our own transaction, even if it's not
-		 * visible to our snapshot.  (This would happen, for example, if
-		 * conflicting keys are proposed for insertion in a single command.)
-		 */
-		if (!TransactionIdIsCurrentTransactionId(xmin))
-			ereport(ERROR,
-					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-					 errmsg("could not serialize access due to concurrent update")));
-	}
-}
-
-/*
- * ExecCheckTIDVisible -- convenience variant of ExecCheckTupleVisible()
- */
-static void
-ExecCheckTIDVisible(EState *estate,
-					ResultRelInfo *relinfo,
-					ItemPointer tid,
-					TupleTableSlot *tempSlot)
-{
-	Relation	rel = relinfo->ri_RelationDesc;
-
-	/* Redundantly check isolation level */
-	if (!IsolationUsesXactSnapshot())
-		return;
-
-	if (!table_tuple_fetch_row_version(rel, tid, SnapshotAny, tempSlot))
-		elog(ERROR, "failed to fetch conflicting tuple for ON CONFLICT");
-	ExecCheckTupleVisible(estate, rel, tempSlot);
-	ExecClearTuple(tempSlot);
-}
-
 /*
  * Initialize to compute stored generated columns for a tuple
  *
@@ -1015,12 +954,19 @@ ExecInsert(ModifyTableContext *context,
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
-			uint32		specToken;
-			ItemPointerData conflictTid;
-			bool		specConflict;
 			List	   *arbiterIndexes;
+			TupleTableSlot *existing = NULL,
+					   *returningSlot,
+					   *inserted;
+			LockTupleMode lockmode = LockTupleExclusive;
 
 			arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
+			returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
+			if (onconflict == ONCONFLICT_UPDATE)
+			{
+				lockmode = ExecUpdateLockMode(estate, resultRelInfo);
+				existing = resultRelInfo->ri_onConflict->oc_Existing;
+			}
 
 			/*
 			 * Do a non-conclusive check for conflicts first.
@@ -1037,23 +983,28 @@ ExecInsert(ModifyTableContext *context,
 			 */
 	vlock:
 			CHECK_FOR_INTERRUPTS();
-			specConflict = false;
-			if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate,
-										   &conflictTid, arbiterIndexes))
+			inserted = table_tuple_insert_with_arbiter(resultRelInfo,
+													   slot, estate->es_output_cid,
+													   0, NULL, arbiterIndexes, estate,
+													   lockmode, existing, returningSlot);
+			if (!inserted)
 			{
 				/* committed conflict tuple found */
 				if (onconflict == ONCONFLICT_UPDATE)
 				{
+					TupleTableSlot *returning = NULL;
+
+					if (TTS_EMPTY(existing))
+						goto vlock;
+
 					/*
 					 * In case of ON CONFLICT DO UPDATE, execute the UPDATE
 					 * part.  Be prepared to retry if the UPDATE fails because
 					 * of another concurrent UPDATE/DELETE to the conflict
 					 * tuple.
 					 */
-					TupleTableSlot *returning = NULL;
-
 					if (ExecOnConflictUpdate(context, resultRelInfo,
-											 &conflictTid, slot, canSetTag,
+											 slot, canSetTag,
 											 &returning))
 					{
 						InstrCountTuples2(&mtstate->ps, 1);
@@ -1076,57 +1027,13 @@ ExecInsert(ModifyTableContext *context,
 					 * ExecGetReturningSlot() in the DO NOTHING case...
 					 */
 					Assert(onconflict == ONCONFLICT_NOTHING);
-					ExecCheckTIDVisible(estate, resultRelInfo, &conflictTid,
-										ExecGetReturningSlot(estate, resultRelInfo));
 					InstrCountTuples2(&mtstate->ps, 1);
 					return NULL;
 				}
 			}
-
-			/*
-			 * Before we start insertion proper, acquire our "speculative
-			 * insertion lock".  Others can use that to wait for us to decide
-			 * if we're going to go ahead with the insertion, instead of
-			 * waiting for the whole transaction to complete.
-			 */
-			specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
-
-			/* insert the tuple, with the speculative token */
-			table_tuple_insert_speculative(resultRelationDesc, slot,
-										   estate->es_output_cid,
-										   0,
-										   NULL,
-										   specToken);
-
-			/* insert index entries for tuple */
-			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-												   slot, estate, false, true,
-												   &specConflict,
-												   arbiterIndexes,
-												   false);
-
-			/* adjust the tuple's state accordingly */
-			table_tuple_complete_speculative(resultRelationDesc, slot,
-											 specToken, !specConflict);
-
-			/*
-			 * Wake up anyone waiting for our decision.  They will re-check
-			 * the tuple, see that it's no longer speculative, and wait on our
-			 * XID as if this was a regularly inserted tuple all along.  Or if
-			 * we killed the tuple, they will see it's dead, and proceed as if
-			 * the tuple never existed.
-			 */
-			SpeculativeInsertionLockRelease(GetCurrentTransactionId());
-
-			/*
-			 * If there was a conflict, start from the beginning.  We'll do
-			 * the pre-check again, which will now find the conflicting tuple
-			 * (unless it aborts before we get there).
-			 */
-			if (specConflict)
+			else
 			{
-				list_free(recheckIndexes);
-				goto vlock;
+				slot = inserted;
 			}
 
 			/* Since there was no insertion conflict, we're done */
@@ -2441,144 +2348,15 @@ redo_act:
 static bool
 ExecOnConflictUpdate(ModifyTableContext *context,
 					 ResultRelInfo *resultRelInfo,
-					 ItemPointer conflictTid,
 					 TupleTableSlot *excludedSlot,
 					 bool canSetTag,
 					 TupleTableSlot **returning)
 {
 	ModifyTableState *mtstate = context->mtstate;
 	ExprContext *econtext = mtstate->ps.ps_ExprContext;
-	Relation	relation = resultRelInfo->ri_RelationDesc;
 	ExprState  *onConflictSetWhere = resultRelInfo->ri_onConflict->oc_WhereClause;
 	TupleTableSlot *existing = resultRelInfo->ri_onConflict->oc_Existing;
-	TM_FailureData tmfd;
-	LockTupleMode lockmode;
-	TM_Result	test;
-	Datum		xminDatum;
-	TransactionId xmin;
-	bool		isnull;
-
-	/* Determine lock mode to use */
-	lockmode = ExecUpdateLockMode(context->estate, resultRelInfo);
-
-	/*
-	 * Lock tuple for update.  Don't follow updates when tuple cannot be
-	 * locked without doing so.  A row locking conflict here means our
-	 * previous conclusion that the tuple is conclusively committed is not
-	 * true anymore.
-	 */
-	test = table_tuple_lock(relation, conflictTid,
-							context->estate->es_snapshot,
-							existing, context->estate->es_output_cid,
-							lockmode, LockWaitBlock, 0,
-							&tmfd);
-	switch (test)
-	{
-		case TM_Ok:
-			/* success! */
-			break;
-
-		case TM_Invisible:
-
-			/*
-			 * This can occur when a just inserted tuple is updated again in
-			 * the same command. E.g. because multiple rows with the same
-			 * conflicting key values are inserted.
-			 *
-			 * This is somewhat similar to the ExecUpdate() TM_SelfModified
-			 * case.  We do not want to proceed because it would lead to the
-			 * same row being updated a second time in some unspecified order,
-			 * and in contrast to plain UPDATEs there's no historical behavior
-			 * to break.
-			 *
-			 * It is the user's responsibility to prevent this situation from
-			 * occurring.  These problems are why the SQL standard similarly
-			 * specifies that for SQL MERGE, an exception must be raised in
-			 * the event of an attempt to update the same row twice.
-			 */
-			xminDatum = slot_getsysattr(existing,
-										MinTransactionIdAttributeNumber,
-										&isnull);
-			Assert(!isnull);
-			xmin = DatumGetTransactionId(xminDatum);
-
-			if (TransactionIdIsCurrentTransactionId(xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_CARDINALITY_VIOLATION),
-				/* translator: %s is a SQL command name */
-						 errmsg("%s command cannot affect row a second time",
-								"ON CONFLICT DO UPDATE"),
-						 errhint("Ensure that no rows proposed for insertion within the same command have duplicate constrained values.")));
-
-			/* This shouldn't happen */
-			elog(ERROR, "attempted to lock invisible tuple");
-			break;
-
-		case TM_SelfModified:
-
-			/*
-			 * This state should never be reached. As a dirty snapshot is used
-			 * to find conflicting tuples, speculative insertion wouldn't have
-			 * seen this row to conflict with.
-			 */
-			elog(ERROR, "unexpected self-updated tuple");
-			break;
-
-		case TM_Updated:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent update")));
-
-			/*
-			 * As long as we don't support an UPDATE of INSERT ON CONFLICT for
-			 * a partitioned table we shouldn't reach to a case where tuple to
-			 * be lock is moved to another partition due to concurrent update
-			 * of the partition key.
-			 */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-
-			/*
-			 * Tell caller to try again from the very start.
-			 *
-			 * It does not make sense to use the usual EvalPlanQual() style
-			 * loop here, as the new version of the row might not conflict
-			 * anymore, or the conflicting tuple has actually been deleted.
-			 */
-			ExecClearTuple(existing);
-			return false;
-
-		case TM_Deleted:
-			if (IsolationUsesXactSnapshot())
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("could not serialize access due to concurrent delete")));
-
-			/* see TM_Updated case */
-			Assert(!ItemPointerIndicatesMovedPartitions(&tmfd.ctid));
-			ExecClearTuple(existing);
-			return false;
-
-		default:
-			elog(ERROR, "unrecognized table_tuple_lock status: %u", test);
-	}
-
-	/* Success, the tuple is locked. */
-
-	/*
-	 * Verify that the tuple is visible to our MVCC snapshot if the current
-	 * isolation level mandates that.
-	 *
-	 * It's not sufficient to rely on the check within ExecUpdate() as e.g.
-	 * CONFLICT ... WHERE clause may prevent us from reaching that.
-	 *
-	 * This means we only ever continue when a new command in the current
-	 * transaction could see the row, even though in READ COMMITTED mode the
-	 * tuple will not be visible according to the current statement's
-	 * snapshot.  This is in line with the way UPDATE deals with newer tuple
-	 * versions.
-	 */
-	ExecCheckTupleVisible(context->estate, relation, existing);
+	ItemPointer conflictTid = &existing->tts_tid;
 
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index cf68ec48eb..c4cdae5903 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -514,19 +515,16 @@ typedef struct TableAmRoutine
 									 CommandId cid, int options,
 									 struct BulkInsertStateData *bistate);
 
-	/* see table_tuple_insert_speculative() for reference about parameters */
-	void		(*tuple_insert_speculative) (Relation rel,
-											 TupleTableSlot *slot,
-											 CommandId cid,
-											 int options,
-											 struct BulkInsertStateData *bistate,
-											 uint32 specToken);
-
-	/* see table_tuple_complete_speculative() for reference about parameters */
-	void		(*tuple_complete_speculative) (Relation rel,
-											   TupleTableSlot *slot,
-											   uint32 specToken,
-											   bool succeeded);
+	/* see table_tuple_insert_with_arbiter() for reference about parameters */
+	TupleTableSlot *(*tuple_insert_with_arbiter) (ResultRelInfo *resultRelInfo,
+												  TupleTableSlot *slot,
+												  CommandId cid, int options,
+												  struct BulkInsertStateData *bistate,
+												  List *arbiterIndexes,
+												  EState *estate,
+												  LockTupleMode lockmode,
+												  TupleTableSlot *lockedSlot,
+												  TupleTableSlot *tempSlot);
 
 	/* see table_multi_insert() for reference about parameters */
 	void		(*multi_insert) (Relation rel, TupleTableSlot **slots, int nslots,
@@ -1400,36 +1398,42 @@ table_tuple_insert(Relation rel, TupleTableSlot *slot, CommandId cid,
 }
 
 /*
- * Perform a "speculative insertion". These can be backed out afterwards
- * without aborting the whole transaction.  Other sessions can wait for the
- * speculative insertion to be confirmed, turning it into a regular tuple, or
- * aborted, as if it never existed.  Speculatively inserted tuples behave as
- * "value locks" of short duration, used to implement INSERT .. ON CONFLICT.
+ * Insert a tuple from a slot into table AM routine with arbiter indexes.
  *
- * A transaction having performed a speculative insertion has to either abort,
- * or finish the speculative insertion with
- * table_tuple_complete_speculative(succeeded = ...).
- */
-static inline void
-table_tuple_insert_speculative(Relation rel, TupleTableSlot *slot,
-							   CommandId cid, int options,
-							   struct BulkInsertStateData *bistate,
-							   uint32 specToken)
-{
-	rel->rd_tableam->tuple_insert_speculative(rel, slot, cid, options,
-											  bistate, specToken);
-}
-
-/*
- * Complete "speculative insertion" started in the same transaction. If
- * succeeded is true, the tuple is fully inserted, if false, it's removed.
+ * This function is similar to table_tuple_insert(), but it takes into account
+ * `arbiterIndexes`, which comprises the list of oids of arbiter indexes.
+ *
+ * If tuple doesn't violates uniqueness on all arbiter indexes, then it should
+ * be inserted and the slot containing inserted tuple is returned.
+ *
+ * If tuple violates uniqueness on any arbiter index, then this function
+ * returns NULL and doesn't insert the tuple.  Also, if 'lockedSlot' is
+ * provided, then conflicting tuple gets locked in `lockmode` and placed into
+ * `lockedSlot`.
+ *
+ * Executor state `estate` is passed to this method to provide ability to
+ * calculate index tuples.  Temporary tuple table slot `tempSlot` is passed
+ * for holding of potentially conflicing tuple.
  */
-static inline void
-table_tuple_complete_speculative(Relation rel, TupleTableSlot *slot,
-								 uint32 specToken, bool succeeded)
+static inline TupleTableSlot *
+table_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
+								TupleTableSlot *slot,
+								CommandId cid, int options,
+								struct BulkInsertStateData *bistate,
+								List *arbiterIndexes,
+								EState *estate,
+								LockTupleMode lockmode,
+								TupleTableSlot *lockedSlot,
+								TupleTableSlot *tempSlot)
 {
-	rel->rd_tableam->tuple_complete_speculative(rel, slot, specToken,
-												succeeded);
+	Relation	rel = resultRelInfo->ri_RelationDesc;
+
+	return rel->rd_tableam->tuple_insert_with_arbiter(resultRelInfo,
+													  slot, cid, options,
+													  bistate, arbiterIndexes,
+													  estate,
+													  lockmode, lockedSlot,
+													  tempSlot);
 }
 
 /*
-- 
2.39.2 (Apple Git-143)

v8-0001-Generalize-relation-analyze-in-table-AM-interface.patchapplication/octet-stream; name=v8-0001-Generalize-relation-analyze-in-table-AM-interface.patchDownload
From 0c22cdef034942c3459433fb396d2a9bcdc2fa4d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Tue, 26 Mar 2024 23:41:11 +0200
Subject: [PATCH v8 1/8] Generalize relation analyze in table AM interface

Currently, there is just one algorithm for sampling tuples from a table written
in acquire_sample_rows().  Custom table AM can just redefine the way to get the
next block/tuple by implementing scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.

This approach doesn't seem general enough.  For instance, it's unclear how to
sample index-organized tables this way.  This commit allows table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent
---
 src/backend/access/heap/heapam_handler.c |  29 +++++--
 src/backend/access/table/tableamapi.c    |   2 -
 src/backend/commands/analyze.c           |  54 ++++++------
 src/include/access/heapam.h              |   9 ++
 src/include/access/tableam.h             | 106 +++++------------------
 src/include/commands/vacuum.h            |  19 ++++
 src/include/foreign/fdwapi.h             |   6 +-
 7 files changed, 100 insertions(+), 125 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec..a7ef0cf72d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -50,7 +50,6 @@ static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
 								   TM_FailureData *tmfd);
-
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -1052,7 +1051,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-static bool
+/*
+ * Prepare to analyze block `blockno` of `scan`.  The scan has been started
+ * with SO_TYPE_ANALYZE option.
+ *
+ * This routine holds a buffer pin and lock on the heap page.  They are held
+ * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
+ * items of the heap page are analyzed.
+ */
+void
 heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 							   BufferAccessStrategy bstrategy)
 {
@@ -1072,12 +1079,19 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
 										blockno, RBM_NORMAL, bstrategy);
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-
-	/* in heap all blocks can contain tuples, so always return true */
-	return true;
 }
 
-static bool
+/*
+ * Iterate over tuples in the block selected with
+ * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
+ * is found, true is returned and a tuple is stored in `slot`.  When no more
+ * tuples for sampling, false is returned and the pin and lock acquired by
+ * heapam_scan_analyze_next_block() are released.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
+bool
 heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
 							   TupleTableSlot *slot)
@@ -2637,10 +2651,9 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
-	.scan_analyze_next_block = heapam_scan_analyze_next_block,
-	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
+	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index ce637a5a5d..55b8caeadf 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,8 +81,6 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
-	Assert(routine->scan_analyze_next_block != NULL);
-	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4c..2fb39f3ede 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -17,6 +17,7 @@
 #include <math.h>
 
 #include "access/detoast.h"
+#include "access/heapam.h"
 #include "access/genam.h"
 #include "access/multixact.h"
 #include "access/relation.h"
@@ -190,10 +191,9 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Regular table, so we'll use the regular row acquisition function */
-		acquirefunc = acquire_sample_rows;
-		/* Also get regular table's size */
-		relpages = RelationGetNumberOfBlocks(onerel);
+		/* Use row acquisition function provided by table AM */
+		table_relation_analyze(onerel, &acquirefunc,
+							   &relpages, vac_strategy);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1103,15 +1103,15 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 }
 
 /*
- * acquire_sample_rows -- acquire a random sample of rows from the table
+ * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
  * Selected rows are returned in the caller-allocated array rows[], which
  * must have at least targrows entries.
  * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the table,
+ * We also estimate the total numbers of live and dead rows in the heap,
  * and return them into *totalrows and *totaldeadrows, respectively.
  *
- * The returned list of tuples is in order by physical position in the table.
+ * The returned list of tuples is in order by physical position in the heap.
  * (We will rely on this later to derive correlation estimates.)
  *
  * As of May 2004 we use a new two-stage method:  Stage one selects up
@@ -1133,7 +1133,7 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * look at a statistically unbiased set of blocks, we should get
  * unbiased estimates of the average numbers of live and dead rows per
  * block.  The previous sampling method put too much credence in the row
- * density near the start of the table.
+ * density near the start of the heap.
  */
 static int
 acquire_sample_rows(Relation onerel, int elevel,
@@ -1184,7 +1184,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = table_beginscan_analyze(onerel);
+	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
 #ifdef USE_PREFETCH
@@ -1214,7 +1214,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{
-		bool		block_accepted;
 		BlockNumber targblock = BlockSampler_Next(&bs);
 #ifdef USE_PREFETCH
 		BlockNumber prefetch_targblock = InvalidBlockNumber;
@@ -1230,29 +1229,19 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		vacuum_delay_point();
 
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
 
 #ifdef USE_PREFETCH
 
 		/*
 		 * When pre-fetching, after we get a block, tell the kernel about the
 		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
 		 */
 		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
 			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
 #endif
 
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
-		if (!block_accepted)
-			continue;
-
-		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1302,7 +1291,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
-	table_endscan(scan);
+	heap_endscan(scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
@@ -1373,6 +1362,19 @@ compare_rows(const void *a, const void *b, void *arg)
 	return 0;
 }
 
+/*
+ * heapam_analyze -- implementation of relation_analyze() table access method
+ *					 callback for heap
+ */
+void
+heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	*func = acquire_sample_rows;
+	*totalpages = RelationGetNumberOfBlocks(relation);
+	vac_strategy = bstrategy;
+}
+
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1462,9 +1464,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Regular table, so use the regular row acquisition function */
-			acquirefunc = acquire_sample_rows;
-			relpages = RelationGetNumberOfBlocks(childrel);
+			/* Use row acquisition function provided by table AM */
+			table_relation_analyze(childrel, &acquirefunc,
+								   &relpages, vac_strategy);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f112245373..91fbc95034 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -369,6 +369,15 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
+/* in heap/heapam_handler.c */
+extern void heapam_scan_analyze_next_block(TableScanDesc scan,
+										   BlockNumber blockno,
+										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
+										   TransactionId OldestXmin,
+										   double *liverows, double *deadrows,
+										   TupleTableSlot *slot);
+
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fc0e702715..8ed4e7295a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,6 +20,7 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
+#include "commands/vacuum.h"
 #include "executor/tuptable.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -658,41 +659,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
-	 *
-	 * The callback may acquire resources like locks that are held until
-	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
-	 * to hold a lock until all tuples on a block have been analyzed by
-	 * scan_analyze_next_tuple.
-	 *
-	 * The callback can return false if the block is not suitable for
-	 * sampling, e.g. because it's a metapage that could never contain tuples.
-	 *
-	 * XXX: This obviously is primarily suited for block-based AMs. It's not
-	 * clear what a good interface for non block based AMs would be, so there
-	 * isn't one yet.
-	 */
-	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
-
-	/*
-	 * See table_scan_analyze_next_tuple().
-	 *
-	 * Not every AM might have a meaningful concept of dead rows, in which
-	 * case it's OK to not increment *deadrows - but note that that may
-	 * influence autovacuum scheduling (see comment for relation_vacuum
-	 * callback).
-	 */
-	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
-											TransactionId OldestXmin,
-											double *liverows,
-											double *deadrows,
-											TupleTableSlot *slot);
-
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -713,6 +679,12 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
+	/* See table_relation_analyze() */
+	void		(*relation_analyze) (Relation relation,
+									 AcquireSampleRowsFunc *func,
+									 BlockNumber *totalpages,
+									 BufferAccessStrategy bstrategy);
+
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1008,19 +980,6 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
-/*
- * table_beginscan_analyze is an alternative entry point for setting up a
- * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
- * the same data structure although the behavior is rather different.
- */
-static inline TableScanDesc
-table_beginscan_analyze(Relation rel)
-{
-	uint32		flags = SO_TYPE_ANALYZE;
-
-	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
-}
-
 /*
  * End relation scan.
  */
@@ -1746,42 +1705,6 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
- * started with table_beginscan_analyze().  Note that this routine might
- * acquire resources like locks that are held until
- * table_scan_analyze_next_tuple() returns false.
- *
- * Returns false if block is unsuitable for sampling, true otherwise.
- */
-static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * table_scan_analyze_next_block() (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If a
- * tuple that's suitable for sampling is found, true is returned and a tuple
- * is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-static inline bool
-table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							  double *liverows, double *deadrows,
-							  TupleTableSlot *slot)
-{
-	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
-															liverows, deadrows,
-															slot);
-}
-
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1887,6 +1810,21 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
+/*
+ * table_relation_analyze - fill the information for a sampling statistics
+ *							acquisition
+ *
+ * The pointer to a function that will collect sample rows from the table
+ * should be stored to `*func`, plus the estimated size of the table in pages
+ * should be stored to `*totalpages`.
+ */
+static inline void
+table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+{
+	relation->rd_tableam->relation_analyze(relation, func,
+										   totalpages, bstrategy);
+}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1182a96742..68068dd900 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -175,6 +175,21 @@ typedef struct VacAttrStats
 	int			rowstride;
 } VacAttrStats;
 
+/*
+ * AcquireSampleRowsFunc - a function for the sampling statistics collection.
+ *
+ * A random sample of up to `targrows` rows should be collected from the
+ * table and stored into the caller-provided `rows` array.  The actual number
+ * of rows collected must be returned.  In addition, a function should store
+ * estimates of the total numbers of live and dead rows in the table into the
+ * output parameters `*totalrows` and `*totaldeadrows`.  (Set `*totaldeadrows`
+ * to zero if the storage does not have any concept of dead rows.)
+ */
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 /* flag bits for VacuumParams->options */
 #define VACOPT_VACUUM 0x01		/* do VACUUM */
 #define VACOPT_ANALYZE 0x02		/* do ANALYZE */
@@ -380,6 +395,10 @@ extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
+extern void heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
+						   BlockNumber *totalpages,
+						   BufferAccessStrategy bstrategy);
+
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index fcde3876b2..0968e0a01e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -148,11 +149,6 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
 											  BlockNumber *totalpages);
-- 
2.39.2 (Apple Git-143)

v8-0002-Custom-reloptions-for-table-AM.patchapplication/octet-stream; name=v8-0002-Custom-reloptions-for-table-AM.patchDownload
From d732e27d56212be475ad32dcc8654de337a77b85 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 12 Jun 2023 23:16:01 +0300
Subject: [PATCH v8 2/8] Custom reloptions for table AM

Let table AM define custom reloptions for its tables.  This allows specifying
AM-specific parameters via the WITH clause when creating a table.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Pavel Borisov, Matthias van de Meent
---
 src/backend/access/common/reloptions.c   |  6 ++-
 src/backend/access/heap/heapam_handler.c | 12 ++++++
 src/backend/access/table/tableamapi.c    | 25 +++++++++++
 src/backend/commands/tablecmds.c         | 55 ++++++++++++++----------
 src/backend/postmaster/autovacuum.c      |  4 +-
 src/backend/utils/cache/relcache.c       |  6 ++-
 src/include/access/reloptions.h          |  2 +
 src/include/access/tableam.h             | 43 ++++++++++++++++++
 8 files changed, 126 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d8559..963995388b 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -1377,7 +1378,7 @@ untransformRelOptions(Datum options)
  */
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1400,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a7ef0cf72d..26b3be9779 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2155,6 +2156,16 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, bool validate)
+{
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	return heap_reloptions(relkind, reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2660,6 +2671,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf..d9e23ef317 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,29 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+/*
+ * GetTableAmRoutineByAmOid
+ *		Given the table access method oid get its TableAmRoutine struct, which
+ *		will be palloc'd in the caller's memory context.
+ */
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6741e721ae..3fcb9cd078 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -715,6 +715,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -850,6 +851,28 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * For relations with table AM and partitioned tables, select access
+	 * method to use: an explicitly indicated one, or (in the case of a
+	 * partitioned table) the parent's, if it has one.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		if (stmt->partbound)
+		{
+			Assert(list_length(inheritOids) == 1);
+			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		}
+
+		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
+	}
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -858,6 +881,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -866,6 +895,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			break;
 		default:
 			(void) heap_reloptions(relkind, reloptions, true);
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -957,28 +987,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * For relations with table AM and partitioned tables, select access
-	 * method to use: an explicitly indicated one, or (in the case of a
-	 * partitioned table) the parent's, if it has one.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		if (stmt->partbound)
-		{
-			Assert(list_length(inheritOids) == 1);
-			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
-		}
-
-		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-			accessMethodId = get_table_am_oid(default_table_access_method, false);
-	}
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15524,7 +15532,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 71e8a6f258..d1d76016ab 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2661,7 +2661,9 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL);
 	if (relopts == NULL)
 		return NULL;
 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 1f419c2a6d..039c0d3eef 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -464,6 +465,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
@@ -478,6 +480,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -493,7 +496,8 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270..8ddc75df28 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,6 +225,7 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
+								const TableAmRoutine *tableam,
 								amoptions_function amoptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8ed4e7295a..cf68ec48eb 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -737,6 +737,28 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * This callback parses and validates the reloptions array for a table.
+	 *
+	 * This is called only when a non-null reloptions array exists for the
+	 * table.  'reloptions' is a text array containing entries of the form
+	 * "name=value".  The function should construct a bytea value, which will
+	 * be copied into the rd_options field of the table's relcache entry. The
+	 * data contents of the bytea value are open for the access method to
+	 * define.
+	 *
+	 * When 'validate' is true, the function should report a suitable error
+	 * message if any of the options are unrecognized or have invalid values;
+	 * when 'validate' is false, invalid entries should be silently ignored.
+	 * ('validate' is false when loading options already stored in pg_catalog;
+	 * an invalid entry could only be found if the access method has changed
+	 * its rules for options, and in that case ignoring obsolete entries is
+	 * appropriate.)
+	 *
+	 * It is OK to return NULL if default behavior is wanted.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1925,6 +1947,26 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2102,6 +2144,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
-- 
2.39.2 (Apple Git-143)

#32Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#31)
Re: Table AM Interface Enhancements

I've looked at patch 0003.

Generally, it does something similar to 0001: it exposes a more generalized
method, tuple_insert_with_arbiter, that encapsulates
tuple_insert_speculative/tuple_complete_speculative and at the same time
makes this extensible, i.e. lets extensions provide a different implementation
for custom table AMs other than heap. Though the code rearrangement is a
little more complicated, the patch is clear. It doesn't change the behavior
for heap tables.

tuple_insert_speculative/tuple_complete_speculative are removed from the
table AM methods. I think it would not be hard for existing users of this
interface to adapt to the changes.
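
For an out-of-tree AM, the adaptation would look roughly like the minimal
sketch below. The "myam_" names are placeholders and only the callback
signature follows the patched tableam.h, so treat this as an illustration
rather than a recipe:

#include "postgres.h"
#include "access/tableam.h"

static TupleTableSlot *
myam_tuple_insert_with_arbiter(ResultRelInfo *resultRelInfo,
                               TupleTableSlot *slot,
                               CommandId cid, int options,
                               struct BulkInsertStateData *bistate,
                               List *arbiterIndexes,
                               EState *estate,
                               LockTupleMode lockmode,
                               TupleTableSlot *lockedSlot,
                               TupleTableSlot *tempSlot)
{
    /*
     * Expected contract: check the arbiter indexes for a conflicting row;
     * on conflict return NULL, optionally locking the conflicting row in
     * `lockmode` and storing it into `lockedSlot`; otherwise insert the
     * tuple (plus index entries, if the AM maintains them itself) and
     * return the slot holding the inserted row.
     */
    elog(ERROR, "myam: tuple_insert_with_arbiter not implemented");
    return NULL;                /* keep compiler quiet */
}

static const TableAmRoutine myam_methods = {
    .type = T_TableAmRoutine,
    /* ... the other required callbacks go here ... */
    .tuple_insert_with_arbiter = myam_tuple_insert_with_arbiter,
};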

The code block ExecCheckTupleVisible -- ExecCheckTIDVisible is moved
to heapam_handler.c. I've checked that the code block is unchanged, except
that ExecCheckTIDVisible now gets the Relation from the caller instead of
constructing it from ResultRelInfo.

Also, two big code blocks are moved from ExecOnConflictUpdate and ExecInsert
to a new method heapam_tuple_insert_with_arbiter. They correspond to the old
code with several minor modifications.

For ExecOnConflictUpdate, the comment needs to be revised. This one refers to
the shifted code:

* Try to lock tuple for update as part of speculative insertion.

Probably it is worth moving it into a comment for
heapam_tuple_insert_with_arbiter.

For heapam_tuple_insert_with_arbiter, both "return NULL" statements could be
shifted one level up, to the end of this block:

if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, &conflictTid,
                               arbiterIndexes))

Also, I'd add a comment for heapam_tuple_insert_with_arbiter:
/* See comments for table_tuple_insert_with_arbiter() */

A comment to be corrected:
src/backend/access/heap/heapam.c: * implement table_tuple_insert_speculative()

As Japin said, I'd also propose removing "inline"
from heapam_tuple_insert_with_arbiter.

More corrections for comments:
%s/If tuple doesn't violates/If tuple doesn't violate/g
%s/which comprises the list of/list, which comprises/g
%s/conflicting tuple gets locked/conflicting tuple should be locked/g

I think for better code look this could be removed:

vlock:
CHECK_FOR_INTERRUPTS();

together with CHECK_FOR_INTERRUPTS(); in heapam_tuple_insert_with_arbiter
placed in the beginning of while loop.

Overall the patch looks good enough to me.

Regards,
Pavel

#33Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#32)
Re: Table AM Interface Enhancements

I think for better code look this could be removed:

vlock:
CHECK_FOR_INTERRUPTS();

together with CHECK_FOR_INTERRUPTS(); in heapam_tuple_insert_with_arbiter
placed in the beginning of while loop.

To clarify: I wrote this only about the CHECK_FOR_INTERRUPTS()
rearrangement.

Regards,
Pavel

#34Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#33)
Re: Table AM Interface Enhancements

Hi, Pavel!

I've pushed 0001, 0002 and 0006.

On Fri, Mar 29, 2024 at 5:23 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

I think for better code look this could be removed:

vlock:
CHECK_FOR_INTERRUPTS();

together with CHECK_FOR_INTERRUPTS(); in heapam_tuple_insert_with_arbiter placed in the beginning of while loop.

To clarify: I wrote this only about the CHECK_FOR_INTERRUPTS() rearrangement.

Thank you for your review of this patch. But I still think there is a
problem: this patch moves part of the executor into the table AM, which
then directly uses executor data structures and functions. This works, but
it is not committable since it breaks the encapsulation.

I think the way forward might be to introduce a new API, which would
isolate executor details from the table AM. We may introduce a new data
structure, InsertWithArbiterContext, which would contain EState and a
set of callbacks that would keep the table AM from calling the executor
directly. That should be our goal for pg18. Right now, this is too close
to feature freeze to design a new API.
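
For illustration only (nothing below exists in the tree; the structure name,
field names and callbacks are all invented), such a context could look
roughly like this:

#include "postgres.h"
#include "nodes/execnodes.h"
#include "executor/tuptable.h"
#include "storage/itemptr.h"

/*
 * Hypothetical sketch of the proposed structure.  The idea is that the
 * table AM never touches EState directly and instead goes through the
 * callbacks supplied by the executor.
 */
typedef struct InsertWithArbiterContext
{
    EState     *estate;            /* kept opaque to the table AM */
    List       *arbiterIndexes;    /* OIDs of arbiter indexes */

    /* executor services the table AM may call back into */
    bool        (*check_index_constraints) (struct InsertWithArbiterContext *ctx,
                                            TupleTableSlot *slot,
                                            ItemPointer conflictTid);
    List       *(*insert_index_tuples) (struct InsertWithArbiterContext *ctx,
                                        TupleTableSlot *slot,
                                        bool *specConflict);
} InsertWithArbiterContext;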

------
Regards,
Alexander Korotkov

#35Andrei Lepikhov
a.lepikhov@postgrespro.ru
In reply to: Alexander Korotkov (#34)
Re: Table AM Interface Enhancements

On 31/3/2024 00:33, Alexander Korotkov wrote:

I think the way forward might be to introduce a new API, which would
isolate executor details from the table AM. We may introduce a new data
structure, InsertWithArbiterContext, which would contain EState and a
set of callbacks that would keep the table AM from calling the executor
directly. That should be our goal for pg18. Right now, this is too close
to feature freeze to design a new API.

I'm a bit late, but have you ever considered adding some sort of index
probing routine to the AM interface for estimation purposes?
I am working on the problem of dubious estimations. For example, we may
have no MCV statistics, the MCV statistics may not cover equality over
multiple clauses, or we may detect that the constant value is out of the
histogram range. In such cases (especially for [parameterized] JOINs),
the optimizer could have a chance to probe the index and avoid huge
underestimation. This makes sense especially for multicolumn
filters/clauses.
With a probing AM method, we could invent something for this challenge.
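
For illustration only (no such callback exists, and its name, signature and
placement are invented here), an estimation probe could be as simple as:

#include "postgres.h"
#include "access/skey.h"
#include "utils/rel.h"

/*
 * Hypothetical probing callback: briefly descend the index and return an
 * estimate of the number of entries matching the given scan keys, to be
 * consulted when the statistics-based estimate looks unreliable.
 */
typedef double (*amestimateprobe_function) (Relation indexRelation,
                                            int nkeys,
                                            ScanKey keys);

The planner would presumably call it only when the regular statistics give
a low-confidence answer, so the extra index descents stay bounded.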

--
regards,
Andrei Lepikhov
Postgres Professional

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#35)
Re: Table AM Interface Enhancements

Coverity complained about what you did in RelationParseRelOptions
in c95c25f9a:

*** CID 1595992: Null pointer dereferences (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
493
494 /*
495 * Fetch reloptions from tuple; have to use a hardwired descriptor because
496 * we might not have any other for pg_class yet (consider executing this
497 * code for pg_class itself)
498 */

CID 1595992: Null pointer dereferences (FORWARD_NULL)
Passing null pointer "tableam" to "extractRelOptions", which dereferences it.

499 options = extractRelOptions(tuple, GetPgClassDescriptor(),
500 tableam, amoptsfn);
501

I see that extractRelOptions only uses the tableam argument for some
relkinds, and RelationParseRelOptions does set it up for those
relkinds --- but Coverity's complaint isn't without merit, because
those two switch statements are looking at *different copies of the
relkind*, which in theory could be different. This all seems quite
messy and poorly factored. Can't we do better? Why do we need to
involve two copies of allegedly the same pg_class tuple, anyhow?

regards, tom lane

#37Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#36)
Re: Table AM Interface Enhancements

On Mon, Apr 1, 2024 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Coverity complained about what you did in RelationParseRelOptions
in c95c25f9a:

*** CID 1595992: Null pointer dereferences (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
493
494 /*
495 * Fetch reloptions from tuple; have to use a hardwired descriptor because
496 * we might not have any other for pg_class yet (consider executing this
497 * code for pg_class itself)
498 */

CID 1595992: Null pointer dereferences (FORWARD_NULL)
Passing null pointer "tableam" to "extractRelOptions", which dereferences it.

499 options = extractRelOptions(tuple, GetPgClassDescriptor(),
500 tableam, amoptsfn);
501

I see that extractRelOptions only uses the tableam argument for some
relkinds, and RelationParseRelOptions does set it up for those
relkinds --- but Coverity's complaint isn't without merit, because
those two switch statements are looking at *different copies of the
relkind*, which in theory could be different. This all seems quite
messy and poorly factored. Can't we do better? Why do we need to
involve two copies of allegedly the same pg_class tuple, anyhow?

Thank you for reporting this, Tom.
I'm planning to investigate this later today.

------
Regards,
Alexander Korotkov

#38Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#36)
1 attachment(s)
Re: Table AM Interface Enhancements

On Mon, Apr 1, 2024 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Coverity complained about what you did in RelationParseRelOptions
in c95c25f9a:

*** CID 1595992: Null pointer dereferences (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
493
494 /*
495 * Fetch reloptions from tuple; have to use a hardwired descriptor because
496 * we might not have any other for pg_class yet (consider executing this
497 * code for pg_class itself)
498 */

CID 1595992: Null pointer dereferences (FORWARD_NULL)
Passing null pointer "tableam" to "extractRelOptions", which dereferences it.

499 options = extractRelOptions(tuple, GetPgClassDescriptor(),
500 tableam, amoptsfn);
501

I see that extractRelOptions only uses the tableam argument for some
relkinds, and RelationParseRelOptions does set it up for those
relkinds --- but Coverity's complaint isn't without merit, because
those two switch statements are looking at *different copies of the
relkind*, which in theory could be different. This all seems quite
messy and poorly factored. Can't we do better? Why do we need to
involve two copies of allegedly the same pg_class tuple, anyhow?

I wasn't registered at Coverity yet. Now I've registered and am
waiting for approval to access the PostgreSQL analysis data.

I wonder why Coverity complains about tableam, but not amoptsfn.
Their usage patterns are very similar.

It appears that relation->rd_rel isn't the full copy of pg_class tuple
(see AllocateRelationDesc). RelationParseRelOptions() is going to
update relation->rd_options, and thus needs a full pg_class tuple to
fetch options out of it. However, it is really unnecessary to access
both tuples at the same time. We can use a full tuple, not
relation->rd_rel, in both cases. See the attached patch.

------
Regards,
Alexander Korotkov

Attachments:

fix-RelationParseRelOptions-Coverity-complain.patchapplication/octet-stream; name=fix-RelationParseRelOptions-Coverity-complain.patchDownload
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 039c0d3eef4..e9780e7237f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -466,14 +466,15 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	bytea	   *options;
 	amoptions_function amoptsfn;
 	const TableAmRoutine *tableam = NULL;
-
+	Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+;
 	relation->rd_options = NULL;
 
 	/*
 	 * Look up any AM-specific parse function; fall out if relkind should not
 	 * have options.
 	 */
-	switch (relation->rd_rel->relkind)
+	switch (classForm->relkind)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
#39Japin Li
japinli@hotmail.com
In reply to: Alexander Korotkov (#38)
Re: Table AM Interface Enhancements

On Tue, 02 Apr 2024 at 07:59, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Apr 1, 2024 at 7:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Coverity complained about what you did in RelationParseRelOptions
in c95c25f9a:

*** CID 1595992: Null pointer dereferences (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/cache/relcache.c: 499 in RelationParseRelOptions()
493
494 /*
495 * Fetch reloptions from tuple; have to use a hardwired descriptor because
496 * we might not have any other for pg_class yet (consider executing this
497 * code for pg_class itself)
498 */

CID 1595992: Null pointer dereferences (FORWARD_NULL)
Passing null pointer "tableam" to "extractRelOptions", which dereferences it.

499 options = extractRelOptions(tuple, GetPgClassDescriptor(),
500 tableam, amoptsfn);
501

I see that extractRelOptions only uses the tableam argument for some
relkinds, and RelationParseRelOptions does set it up for those
relkinds --- but Coverity's complaint isn't without merit, because
those two switch statements are looking at *different copies of the
relkind*, which in theory could be different. This all seems quite
messy and poorly factored. Can't we do better? Why do we need to
involve two copies of allegedly the same pg_class tuple, anyhow?

I wasn't registered at Coverity yet. Now I've registered and am
waiting for approval to access the PostgreSQL analysis data.

I wonder why Coverity complains about tableam, but not amoptsfn.
Their usage patterns are very similar.

It appears that relation->rd_rel isn't a full copy of the pg_class
tuple (see AllocateRelationDesc). RelationParseRelOptions() is going
to update relation->rd_options, and thus needs the full pg_class tuple
to fetch the options from. However, it is really unnecessary to access
both tuples at the same time. We can use the full tuple, not
relation->rd_rel, in both cases. See the attached patch.

------
Regards,
Alexander Korotkov

+       Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+;

There is an additional semicolon in the code.

--
Regards,
Japin Li

#40Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Korotkov (#34)
Re: Table AM Interface Enhancements

On Sat, 2024-03-30 at 23:33 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

Sorry to jump in to this discussion so late. I had worked on something
like the custom reloptions (0002) in the past, and there were some
complications that don't seem to be addressed in commit c95c25f9af.

* At minimum I think it needs some direction (comments, docs, tests)
that show how it's supposed to be used.

* The bytea returned by the reloptions() method is not in a trivial
format. It's a StdRdOptions struct with string values stored after the
end of the struct. To build the bytea internally, there's some
infrastructure like allocateReloptStruct() and fillRelOptions(), and
it's not very easy to extend those to support a few custom options.

* If we ever decide to add a string option to StdRdOptions, I think the
design breaks, because the code that looks for those string values
wouldn't know how to skip over the custom options. Perhaps we can just
promise to never do that, but we should make it explicit somehow.

* Most existing heap reloptions (other than fillfactor) are used by
other parts of the system (like autovacuum), so they should be
considered valid for any AM. Most AMs will just want to add a handful
of their own options on top, so it would be good to demonstrate how
this should be done (see the sketch after this list).

* There are still places that are inappropriately calling
heap_reloptions directly. For instance, in ProcessUtilitySlow(), it
seems to assume that a toast table is a heap?
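
To make that point concrete, here is a minimal sketch of what an AM author
can already do easily with the generic reloptions machinery: register an
option kind and parse into a private struct (all names here are
hypothetical, just for illustration). What is not demonstrated anywhere,
and is the hard part, is layering this on top of the StdRdOptions contents
that autovacuum and other parts of the system expect:

#include "postgres.h"
#include "access/reloptions.h"

/* Hypothetical options struct for a table AM called "myam". */
typedef struct MyAmOptions
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	bool		compress;		/* the AM's own knob */
} MyAmOptions;

static relopt_kind myam_relopt_kind;

/* Called once, e.g. from the module's _PG_init(). */
static void
myam_define_reloptions(void)
{
	myam_relopt_kind = add_reloption_kind();
	add_bool_reloption(myam_relopt_kind, "compress",
					   "enable compression for myam tables",
					   false, AccessExclusiveLock);
}

/* Parse and validate WITH (...) options for a myam table. */
static bytea *
myam_reloptions(char relkind, Datum reloptions, bool validate)
{
	static const relopt_parse_elt tab[] = {
		{"compress", RELOPT_TYPE_BOOL, offsetof(MyAmOptions, compress)}
	};

	return (bytea *) build_reloptions(reloptions, validate,
									  myam_relopt_kind,
									  sizeof(MyAmOptions),
									  tab, lengthof(tab));
}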

Regards,
Jeff Davis

#41Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#40)
Re: Table AM Interface Enhancements

Hi, Jeff!

On Tue, Apr 2, 2024 at 8:19 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2024-03-30 at 23:33 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

Sorry to jump in to this discussion so late. I had worked on something
like the custom reloptions (0002) in the past, and there were some
complications that don't seem to be addressed in commit c95c25f9af.

* At minimum I think it needs some direction (comments, docs, tests)
that show how it's supposed to be used.

* The bytea returned by the reloptions() method is not in a trivial
format. It's a StdRdOptions struct with string values stored after the
end of the struct. To build the bytea internally, there's some
infrastructure like allocateReloptStruct() and fillRelOptions(), and
it's not very easy to extend those to support a few custom options.

* If we ever decide to add a string option to StdRdOptions, I think the
design breaks, because the code that looks for those string values
wouldn't know how to skip over the custom options. Perhaps we can just
promise to never do that, but we should make it explicit somehow.

* Most existing heap reloptions (other than fillfactor) are used by
other parts of the system (like autovacuum) so should be considered
valid for any AM. Most AMs will just want to add a handful of their own
options on top, so it would be good to demonstrate how this should be
done.

* There are still places that are inappropriately calling
heap_reloptions directly. For instance, in ProcessUtilitySlow(), it
seems to assume that a toast table is a heap?

Thank you for the detailed explanation. This piece definitely needs
more work. I've just reverted c95c25f9af.

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions. I would rather introduce a new data structure
for the table options that need to be accessed outside of the table AM.
Then reloptions become a black box used directly only by the table AM,
while the table AM has freedom in what to store in reloptions and how
to calculate the externally-visible options. What do you think?
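
To be a bit more concrete, the kind of split I have in mind is roughly the
following (the names are purely illustrative, nothing is final here):

/*
 * Options the rest of the system (planner, autovacuum, logical decoding)
 * needs to know about.  The table AM fills this in however it likes.
 */
typedef struct CommonTableOptions
{
	AutoVacOpts autovacuum;			/* autovacuum-related options */
	bool		user_catalog_table; /* use as an additional catalog relation */
	int			parallel_workers;	/* max number of parallel workers */
	bool		vacuum_truncate;	/* allow vacuum to truncate the relation */
} CommonTableOptions;

/*
 * pg_class.reloptions itself stays a black box: the table AM parses it into
 * whatever bytea layout it prefers and derives CommonTableOptions from it,
 * or computes those values in any other way.
 */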

------
Regards,
Alexander Korotkov

#42Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Korotkov (#41)
Re: Table AM Interface Enhancements

On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions.  I would rather introduce a new data
structure for the table options that need to be accessed outside of
the table AM.  Then reloptions become a black box used directly only
by the table AM, while the table AM has freedom in what to store in
reloptions and how to calculate the externally-visible options.  What
do you think?

Hi Alexander!

I agree with all of that. It will take some refactoring to get there,
though.

One idea is to store StdRdOptions like normal, but if an unrecognized
option is found, ask the table AM if it understands the option. In that
case I think we'd just use a different field in pg_class so that it can
use whatever format it wants to represent its options.
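
Very roughly, I'm picturing something like this at the interface level
(a hypothetical signature, just to illustrate the idea; it is not part of
any patch):

/*
 * Hypothetical table AM callback: called for each reloption that the core
 * StdRdOptions parser does not recognize.  Returns true if the AM accepts
 * the option; accepted options would then be stored in a separate pg_class
 * field in whatever format the AM prefers.
 */
typedef bool (*relopt_fallback_function) (const char *name,
										  const char *value,
										  bool validate);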

Regards,
Jeff Davis

#43Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Jeff Davis (#42)
1 attachment(s)
Re: Table AM Interface Enhancements

Hi, hackers!

On Tue, 2 Apr 2024 at 19:17, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions. I would rather introduce a new data structure
for the table options that need to be accessed outside of the table AM.
Then reloptions become a black box used directly only by the table AM,
while the table AM has freedom in what to store in reloptions and how
to calculate the externally-visible options. What do you think?

Hi Alexander!

I agree with all of that. It will take some refactoring to get there,
though.

One idea is to store StdRdOptions like normal, but if an unrecognized
option is found, ask the table AM if it understands the option. In that
case I think we'd just use a different field in pg_class so that it can
use whatever format it wants to represent its options.

Regards,
Jeff Davis

I tried to rework the patch regarding table AM according to the input
from Alexander and Jeff.

It splits table reloptions into two categories:
- common to all tables (stored in a fixed-size structure that can be
accessed from outside)
- table-AM specific (variable size, parsed and accessed by the access
method only)

Please find a patch attached.

Attachments:

v9-0001-Custom-reloptions-for-table-AM.patchapplication/octet-stream; name=v9-0001-Custom-reloptions-for-table-AM.patchDownload
From 8ca761b060b05ce189da3048077809373e996b84 Mon Sep 17 00:00:00 2001
From: Pavel Borisov <pashkin.elfe@gmail.com>
Date: Thu, 4 Apr 2024 13:35:29 +0400
Subject: [PATCH v9] Custom reloptions for table AM

Let the table AM define custom reloptions for its tables. This allows
specifying AM-specific parameters via the WITH clause when creating a table.

Reloptions are split into options common to all tables and table-AM-specific
options that are parsed and accessed only by table AM functions.

The code may use some parts from prior work by Hao Wu.
---
 src/backend/access/common/reloptions.c   | 111 +++++++++++-------
 src/backend/access/heap/heapam.c         |   2 +-
 src/backend/access/heap/heapam_handler.c |  12 ++
 src/backend/access/heap/heaptoast.c      |   9 +-
 src/backend/access/heap/hio.c            |   2 +-
 src/backend/access/heap/pruneheap.c      |   2 +-
 src/backend/access/heap/rewriteheap.c    |   2 +-
 src/backend/access/table/tableam.c       |   2 +-
 src/backend/access/table/tableamapi.c    |  25 ++++
 src/backend/commands/createas.c          |  13 ++-
 src/backend/commands/tablecmds.c         |  63 ++++++----
 src/backend/commands/vacuum.c            |   8 +-
 src/backend/postmaster/autovacuum.c      |   8 +-
 src/backend/tcop/utility.c               |   9 +-
 src/backend/utils/cache/relcache.c       |  14 ++-
 src/include/access/reloptions.h          |   7 +-
 src/include/access/tableam.h             |  44 +++++++
 src/include/utils/rel.h                  | 140 ++++++++++++-----------
 18 files changed, 320 insertions(+), 153 deletions(-)

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d8559..1055b397ef 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -44,7 +45,7 @@
  * value, upper and lower bounds (if applicable); for strings, consider a
  * validation routine.
  * (ii) add a record below (or use add_<type>_reloption).
- * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
+ * (iii) add it to the appropriate options struct (perhaps commonReloptions)
  * (iv) add it to the appropriate handling routine (perhaps
  * default_reloptions)
  * (v) make sure the lock level is set correctly for that operation
@@ -1374,10 +1375,16 @@ untransformRelOptions(Datum options)
  * tupdesc is pg_class' tuple descriptor.  amoptions is a pointer to the index
  * AM's options parser function in the case of a tuple corresponding to an
  * index, or NULL otherwise.
+ *
+ * common_reloptions should be provided NULL if we don't need to get reloptions, just validate
+ * If we provide not NULL, then table am will fill *common_reloptions with parsed list or NULL.
+ * NULL in *common_reloptions means the caller should use defaults.
  */
+
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions,
+				  commonReloptions **common_reloptions)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1406,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, common_reloptions, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
@@ -1695,7 +1703,7 @@ parse_one_reloption(relopt_value *option, char *text_str, int text_len,
  * Given the result from parseRelOptions, allocate a struct that's of the
  * specified base size plus any extra space that's needed for string variables.
  *
- * "base" should be sizeof(struct) of the reloptions struct (StdRdOptions or
+ * "base" should be sizeof(struct) of the reloptions struct (commonReloptions or
  * equivalent).
  */
 static void *
@@ -1750,7 +1758,6 @@ fillRelOptions(void *rdopts, Size basesize,
 	for (i = 0; i < numoptions; i++)
 	{
 		int			j;
-		bool		found = false;
 
 		for (j = 0; j < numelems; j++)
 		{
@@ -1819,72 +1826,65 @@ fillRelOptions(void *rdopts, Size basesize,
 							 options[i].gen->type);
 						break;
 				}
-				found = true;
 				break;
 			}
 		}
-		if (validate && !found)
-			elog(ERROR, "reloption \"%s\" not found in parse table",
-				 options[i].gen->name);
 	}
 	SET_VARSIZE(rdopts, offset);
 }
 
 
 /*
- * Option parser for anything that uses StdRdOptions.
+ * Option parser for anything that uses commonReloptions.
  */
 bytea *
 default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 {
 	static const relopt_parse_elt tab[] = {
-		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
 		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
 		{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_threshold)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, vacuum_threshold)},
 		{"autovacuum_vacuum_insert_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_threshold)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_threshold)},
 		{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_max_age)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, freeze_max_age)},
 		{"autovacuum_freeze_table_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_table_age)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, freeze_table_age)},
 		{"autovacuum_multixact_freeze_min_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_min_age)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_min_age)},
 		{"autovacuum_multixact_freeze_max_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_max_age)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_max_age)},
 		{"autovacuum_multixact_freeze_table_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_table_age)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_table_age)},
 		{"log_autovacuum_min_duration", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, log_min_duration)},
-		{"toast_tuple_target", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, toast_tuple_target)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, log_min_duration)},
 		{"autovacuum_vacuum_cost_delay", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_delay)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_delay)},
 		{"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_scale_factor)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, vacuum_scale_factor)},
 		{"autovacuum_vacuum_insert_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_scale_factor)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_scale_factor)},
 		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_scale_factor)},
+		offsetof(commonReloptions, autovacuum) + offsetof(AutoVacOpts, analyze_scale_factor)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, user_catalog_table)},
+		offsetof(commonReloptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, parallel_workers)},
+		offsetof(commonReloptions, parallel_workers)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
-		offsetof(StdRdOptions, vacuum_index_cleanup)},
+		offsetof(commonReloptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, vacuum_truncate)}
+		offsetof(commonReloptions, vacuum_truncate)}
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate, kind,
-									  sizeof(StdRdOptions),
+									  sizeof(commonReloptions),
 									  tab, lengthof(tab));
 }
 
@@ -1917,7 +1917,6 @@ build_reloptions(Datum reloptions, bool validate,
 
 	/* parse options specific to given relation option kind */
 	options = parseRelOptions(reloptions, validate, kind, &numoptions);
-	Assert(numoptions <= num_relopt_elems);
 
 	/* if none set, we're done */
 	if (numoptions == 0)
@@ -2016,30 +2015,60 @@ view_reloptions(Datum reloptions, bool validate)
  * Parse options for heaps, views and toast tables.
  */
 bytea *
-heap_reloptions(char relkind, Datum reloptions, bool validate)
+heap_reloptions(char relkind, Datum reloptions, commonReloptions **common_reloptions, bool validate)
 {
-	StdRdOptions *rdopts;
+	commonReloptions *rdopts;
+	heapReloptions *hopts;
+	static const relopt_parse_elt tab[] = {
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(heapReloptions, fillfactor)},
+		{"toast_tuple_target", RELOPT_TYPE_INT,
+		offsetof(heapReloptions, toast_tuple_target)},
+	};
 
 	switch (relkind)
 	{
 		case RELKIND_TOASTVALUE:
-			rdopts = (StdRdOptions *)
+			rdopts = (commonReloptions *)
 				default_reloptions(reloptions, validate, RELOPT_KIND_TOAST);
 			if (rdopts != NULL)
 			{
 				/* adjust default-only parameters for TOAST relations */
-				rdopts->fillfactor = 100;
 				rdopts->autovacuum.analyze_threshold = -1;
 				rdopts->autovacuum.analyze_scale_factor = -1;
 			}
-			return (bytea *) rdopts;
+			hopts = (heapReloptions *) build_reloptions(reloptions, validate,
+									  RELOPT_KIND_TOAST,
+									  sizeof(heapReloptions),
+									  tab, lengthof(tab));
+			if (hopts != NULL)
+			{
+				/* adjust default-only parameters for TOAST relations */
+				hopts->fillfactor = 100;
+			}
+			break;
 		case RELKIND_RELATION:
 		case RELKIND_MATVIEW:
-			return default_reloptions(reloptions, validate, RELOPT_KIND_HEAP);
+			rdopts = (commonReloptions *) default_reloptions(reloptions, validate, RELOPT_KIND_HEAP);
+			hopts = (heapReloptions *) build_reloptions(reloptions, validate,
+									  RELOPT_KIND_HEAP,
+									  sizeof(heapReloptions),
+									  tab, lengthof(tab));
+			break;
 		default:
 			/* other relkinds are not supported */
 			return NULL;
 	}
+
+	/*
+	 * Get here only to validate and do not care about output values of both
+	 * heapReloptions and commonReloptions.
+	 */
+	if (common_reloptions == NULL)
+		return NULL;
+
+	*common_reloptions =  rdopts;
+
+	return (bytea *)hopts;
 }
 
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e..404b92d1cc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2129,7 +2129,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	Assert(!(options & HEAP_INSERT_NO_LOGICAL));
 
 	needwal = RelationNeedsWAL(relation);
-	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
+	saveFreeSpace = HeapGetTargetPageFreeSpace(relation,
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Toast and set header data in all the slots */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3e7a6b5548..961512df9c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2157,6 +2158,16 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions, commonReloptions **common_reloptions, bool validate)
+{
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	return heap_reloptions(relkind, reloptions, common_reloptions, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2678,6 +2689,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/heap/heaptoast.c b/src/backend/access/heap/heaptoast.c
index a420e16530..74517e0a7e 100644
--- a/src/backend/access/heap/heaptoast.c
+++ b/src/backend/access/heap/heaptoast.c
@@ -32,6 +32,13 @@
 #include "access/toast_internals.h"
 #include "utils/fmgroids.h"
 
+/*
+ * HeapGetToastTupleTarget
+ *      Returns the heap relation's toast_tuple_target.  Note multiple eval of argument!
+ */
+#define HeapGetToastTupleTarget(relation, defaulttarg) \
+		((heapReloptions *) (relation)->rd_options ? \
+		((heapReloptions *) (relation)->rd_options)->toast_tuple_target : (defaulttarg))
 
 /* ----------
  * heap_toast_delete -
@@ -174,7 +181,7 @@ heap_toast_insert_or_update(Relation rel, HeapTuple newtup, HeapTuple oldtup,
 		hoff += BITMAPLEN(numAttrs);
 	hoff = MAXALIGN(hoff);
 	/* now convert to a limit on the tuple data size */
-	maxDataLen = RelationGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff;
+	maxDataLen = HeapGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff;
 
 	/*
 	 * Look for attributes with attstorage EXTENDED to compress.  Also find
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7c662cdf46..6c46a7759b 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -536,7 +536,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 						len, MaxHeapTupleSize)));
 
 	/* Compute desired extra freespace due to fillfactor option */
-	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
+	saveFreeSpace = HeapGetTargetPageFreeSpace(relation,
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d2eecaf7eb..98f5af76db 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -235,7 +235,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * important than sometimes getting a wrong answer in what is after all
 	 * just a heuristic estimate.
 	 */
-	minfree = RelationGetTargetPageFreeSpace(relation,
+	minfree = HeapGetTargetPageFreeSpace(relation,
 											 HEAP_DEFAULT_FILLFACTOR);
 	minfree = Max(minfree, BLCKSZ / 10);
 
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 473f3aa9be..0801af0558 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -641,7 +641,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 						len, MaxHeapTupleSize)));
 
 	/* Compute desired extra freespace due to fillfactor option */
-	saveFreeSpace = RelationGetTargetPageFreeSpace(state->rs_new_rel,
+	saveFreeSpace = HeapGetTargetPageFreeSpace(state->rs_new_rel,
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222ceb..5d5f0e68fd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -750,7 +750,7 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 		 * The other branch considers it implicitly by calculating density
 		 * from actual relpages/reltuples statistics.
 		 */
-		fillfactor = RelationGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
+		fillfactor = HeapGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
 
 		tuple_width = get_rel_data_width(rel, attr_widths);
 		tuple_width += overhead_bytes_per_tuple;
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf..d9e23ef317 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,29 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+/*
+ * GetTableAmRoutineByAmOid
+ *		Given the table access method oid get its TableAmRoutine struct, which
+ *		will be palloc'd in the caller's memory context.
+ */
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index afd3dace07..d5a7d726f9 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -85,6 +85,9 @@ create_ctas_internal(List *attrList, IntoClause *into)
 	Datum		toast_options;
 	static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
 	ObjectAddress intoRelationAddr;
+	const TableAmRoutine *tableam = NULL;
+	Oid		    accessMethodId = InvalidOid;
+	Relation	rel;
 
 	/* This code supports both CREATE TABLE AS and CREATE MATERIALIZED VIEW */
 	is_matview = (into->viewQuery != NULL);
@@ -125,7 +128,15 @@ create_ctas_internal(List *attrList, IntoClause *into)
 										validnsps,
 										true, false);
 
-	(void) heap_reloptions(RELKIND_TOASTVALUE, toast_options, true);
+	rel = relation_open(intoRelationAddr.objectId, AccessShareLock);
+	accessMethodId = table_relation_toast_am(rel);
+	relation_close(rel, AccessShareLock);
+
+	if(OidIsValid(accessMethodId))
+	{
+		tableam = GetTableAmRoutineByAmOid(accessMethodId);
+		(void) tableam_reloptions(tableam, RELKIND_TOASTVALUE, toast_options, NULL, true);
+	}
 
 	NewRelationCreateToastTable(intoRelationAddr.objectId, toast_options);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 317b89f67c..de1a126cee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -715,6 +715,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -850,6 +851,28 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * For relations with table AM and partitioned tables, select access
+	 * method to use: an explicitly indicated one, or (in the case of a
+	 * partitioned table) the parent's, if it has one.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		if (stmt->partbound)
+		{
+			Assert(list_length(inheritOids) == 1);
+			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		}
+
+		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
+	}
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -858,6 +881,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, NULL, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -865,7 +894,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			(void) partitioned_table_reloptions(reloptions, true);
 			break;
 		default:
-			(void) heap_reloptions(relkind, reloptions, true);
+			if (OidIsValid(accessMethodId))
+			{
+				tableam = GetTableAmRoutineByAmOid(accessMethodId);
+				(void) tableam_reloptions(tableam, relkind, reloptions, NULL, true);
+			}
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -957,28 +991,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * For relations with table AM and partitioned tables, select access
-	 * method to use: an explicitly indicated one, or (in the case of a
-	 * partitioned table) the parent's, if it has one.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		if (stmt->partbound)
-		{
-			Assert(list_length(inheritOids) == 1);
-			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
-		}
-
-		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-			accessMethodId = get_table_am_oid(default_table_access_method, false);
-	}
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15528,7 +15540,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, NULL, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
@@ -15641,7 +15654,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 										 defList, "toast", validnsps, false,
 										 operation == AT_ResetRelOptions);
 
-		(void) heap_reloptions(RELKIND_TOASTVALUE, newOptions, true);
+		(void) table_reloptions(rel, RELKIND_TOASTVALUE, newOptions, NULL, true);
 
 		memset(repl_val, 0, sizeof(repl_val));
 		memset(repl_null, false, sizeof(repl_null));
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b589279d49..9d2ad3e216 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2121,11 +2121,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
 	{
 		StdRdOptIndexCleanup vacuum_index_cleanup;
 
-		if (rel->rd_options == NULL)
+		if (rel->rd_common_options == NULL)
 			vacuum_index_cleanup = STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO;
 		else
 			vacuum_index_cleanup =
-				((StdRdOptions *) rel->rd_options)->vacuum_index_cleanup;
+				rel->rd_common_options->vacuum_index_cleanup;
 
 		if (vacuum_index_cleanup == STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO)
 			params->index_cleanup = VACOPTVALUE_AUTO;
@@ -2145,8 +2145,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
 	 */
 	if (params->truncate == VACOPTVALUE_UNSPECIFIED)
 	{
-		if (rel->rd_options == NULL ||
-			((StdRdOptions *) rel->rd_options)->vacuum_truncate)
+		if (rel->rd_common_options == NULL ||
+			(rel->rd_common_options->vacuum_truncate))
 			params->truncate = VACOPTVALUE_ENABLED;
 		else
 			params->truncate = VACOPTVALUE_DISABLED;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c367ede6f8..9f0d56abca 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2673,19 +2673,21 @@ deleted2:
 static AutoVacOpts *
 extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 {
-	bytea	   *relopts;
+	commonReloptions *relopts;
 	AutoVacOpts *av;
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
+	(void) extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL, &relopts);
 	if (relopts == NULL)
 		return NULL;
 
 	av = palloc(sizeof(AutoVacOpts));
-	memcpy(av, &(((StdRdOptions *) relopts)->autovacuum), sizeof(AutoVacOpts));
+	memcpy(av, &(relopts->autovacuum), sizeof(AutoVacOpts));
 	pfree(relopts);
 
 	return av;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fa66b8017e..b1d88133dd 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1156,6 +1156,8 @@ ProcessUtilitySlow(ParseState *pstate,
 							CreateStmt *cstmt = (CreateStmt *) stmt;
 							Datum		toast_options;
 							static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
+							const TableAmRoutine *tableam = NULL;
+							Oid	  accessMethodId;
 
 							/* Remember transformed RangeVar for LIKE */
 							table_rv = cstmt->relation;
@@ -1185,8 +1187,13 @@ ProcessUtilitySlow(ParseState *pstate,
 																validnsps,
 																true,
 																false);
-							(void) heap_reloptions(RELKIND_TOASTVALUE,
+
+							/* TOAST has default access method */
+							accessMethodId = get_table_am_oid(default_table_access_method, false);
+							tableam = GetTableAmRoutineByAmOid(accessMethodId);
+							(void) tableam_reloptions(tableam, RELKIND_TOASTVALUE,
 												   toast_options,
+												   NULL,
 												   true);
 
 							NewRelationCreateToastTable(address.objectId,
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3fe74dabd0..3d7e97d09f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -464,8 +465,11 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	const TableAmRoutine *tableam = NULL;
+	commonReloptions *common_reloptions = NULL;
 
 	relation->rd_options = NULL;
+	relation->rd_common_options = NULL;
 
 	/*
 	 * Look up any AM-specific parse function; fall out if relkind should not
@@ -478,6 +482,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -493,7 +498,14 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn, &common_reloptions);
+
+	if (common_reloptions)
+	{
+		relation->rd_common_options = MemoryContextAlloc(CacheMemoryContext, sizeof(struct commonReloptions));
+		memcpy(relation->rd_common_options, common_reloptions, sizeof(struct commonReloptions));
+	}
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270..c6d74d1a88 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,7 +225,9 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-								amoptions_function amoptions);
+								const TableAmRoutine *tableam,
+								amoptions_function amoptions,
+								commonReloptions **common_reloptions);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
 							  Size relopt_struct_size,
@@ -235,7 +238,7 @@ extern void *build_local_reloptions(local_relopts *relopts, Datum options,
 
 extern bytea *default_reloptions(Datum reloptions, bool validate,
 								 relopt_kind kind);
-extern bytea *heap_reloptions(char relkind, Datum reloptions, bool validate);
+extern bytea *heap_reloptions(char relkind, Datum reloptions, commonReloptions **common_reloptions, bool validate);
 extern bytea *view_reloptions(Datum reloptions, bool validate);
 extern bytea *partitioned_table_reloptions(Datum reloptions, bool validate);
 extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e7eeb75409..437e18f9b4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -739,6 +739,29 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * This callback parses and validates the reloptions array for a table.
+	 *
+	 * This is called only when a non-null reloptions array exists for the
+	 * table.  'reloptions' is a text array containing entries of the form
+	 * "name=value".  The function should construct a bytea value, which will
+	 * be copied into the rd_options field of the table's relcache entry. The
+	 * data contents of the bytea value are open for the access method to
+	 * define.
+	 *
+	 * When 'validate' is true, the function should report a suitable error
+	 * message if any of the options are unrecognized or have invalid values;
+	 * when 'validate' is false, invalid entries should be silently ignored.
+	 * ('validate' is false when loading options already stored in pg_catalog;
+	 * an invalid entry could only be found if the access method has changed
+	 * its rules for options, and in that case ignoring obsolete entries is
+	 * appropriate.)
+	 *
+	 * It is OK to return NULL if default behavior is wanted.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions,
+							   commonReloptions **common_reloptions, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1935,6 +1958,26 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, commonReloptions **common_reloptions, bool validate)
+{
+	return rel->rd_tableam->reloptions(relkind, reloptions, common_reloptions, validate);
+}
+
+/*
+ * Parse table options without knowledge of particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, commonReloptions **common_reloptions, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, common_reloptions, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2113,6 +2156,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f25f769af2..8dfb94a365 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -48,6 +48,50 @@ typedef struct LockInfoData
 
 typedef LockInfoData *LockInfo;
 
+ /* autovacuum-related reloptions. */
+typedef struct AutoVacOpts
+{
+	bool		enabled;
+	int			vacuum_threshold;
+	int			vacuum_ins_threshold;
+	int			analyze_threshold;
+	int			vacuum_cost_limit;
+	int			freeze_min_age;
+	int			freeze_max_age;
+	int			freeze_table_age;
+	int			multixact_freeze_min_age;
+	int			multixact_freeze_max_age;
+	int			multixact_freeze_table_age;
+	int			log_min_duration;
+	float8		vacuum_cost_delay;
+	float8		vacuum_scale_factor;
+	float8		vacuum_ins_scale_factor;
+	float8		analyze_scale_factor;
+} AutoVacOpts;
+
+/* StdRdOptions->vacuum_index_cleanup values */
+typedef enum StdRdOptIndexCleanup
+{
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO = 0,
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_OFF,
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_ON,
+} StdRdOptIndexCleanup;
+
+/*
+ * commonReloptions
+ *		Contents of rd_common_options for tables.
+ */
+typedef struct commonReloptions
+{
+	int32       vl_len_;		/* maybe not needed */
+	AutoVacOpts autovacuum;		/* autovacuum-related options */
+	bool		user_catalog_table; /* use as an additional catalog relation */
+	int			parallel_workers;	/* max number of parallel workers */
+	StdRdOptIndexCleanup vacuum_index_cleanup;	/* controls index vacuuming */
+	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
+} commonReloptions;
+
+
 /*
  * Here are the contents of a relation cache entry.
  */
@@ -168,11 +212,16 @@ typedef struct RelationData
 	PublicationDesc *rd_pubdesc;	/* publication descriptor, or NULL */
 
 	/*
-	 * rd_options is set whenever rd_rel is loaded into the relcache entry.
-	 * Note that you can NOT look into rd_rel for this data.  NULL means "use
-	 * defaults".
+	 * rd_options and rd_common_options are set whenever rd_rel is loaded into
+	 * the relcache entry. Note that you can NOT look into rd_rel for this data.
+	 * NULLs means "use defaults".
+	 */
+	commonReloptions *rd_common_options; /* parsed pg_class.reloptions common for all tables */
+	/*
+	 * am-specific part of pg_class.reloptions parsed by table am specific structure
+	 * (e.g. struct heapReloptions) Contents are not to be accessed outside of table am
 	 */
-	bytea	   *rd_options;		/* parsed pg_class.reloptions */
+	bytea	   *rd_options;
 
 	/*
 	 * Oid of the handler for this relation. For an index this is a function
@@ -297,88 +346,41 @@ typedef struct ForeignKeyCacheInfo
 	Oid			conpfeqop[INDEX_MAX_KEYS] pg_node_attr(array_size(nkeys));
 } ForeignKeyCacheInfo;
 
-
 /*
- * StdRdOptions
- *		Standard contents of rd_options for heaps.
- *
- * RelationGetFillFactor() and RelationGetTargetPageFreeSpace() can only
- * be applied to relations that use this format or a superset for
- * private options data.
+ * heapReloptions
+ *		Contents of rd_options specific for heap tables.
  */
- /* autovacuum-related reloptions. */
-typedef struct AutoVacOpts
-{
-	bool		enabled;
-	int			vacuum_threshold;
-	int			vacuum_ins_threshold;
-	int			analyze_threshold;
-	int			vacuum_cost_limit;
-	int			freeze_min_age;
-	int			freeze_max_age;
-	int			freeze_table_age;
-	int			multixact_freeze_min_age;
-	int			multixact_freeze_max_age;
-	int			multixact_freeze_table_age;
-	int			log_min_duration;
-	float8		vacuum_cost_delay;
-	float8		vacuum_scale_factor;
-	float8		vacuum_ins_scale_factor;
-	float8		analyze_scale_factor;
-} AutoVacOpts;
-
-/* StdRdOptions->vacuum_index_cleanup values */
-typedef enum StdRdOptIndexCleanup
-{
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO = 0,
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_OFF,
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_ON,
-} StdRdOptIndexCleanup;
-
-typedef struct StdRdOptions
+typedef struct heapReloptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	int			toast_tuple_target; /* target for tuple toasting */
-	AutoVacOpts autovacuum;		/* autovacuum-related options */
-	bool		user_catalog_table; /* use as an additional catalog relation */
-	int			parallel_workers;	/* max number of parallel workers */
-	StdRdOptIndexCleanup vacuum_index_cleanup;	/* controls index vacuuming */
-	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
-} StdRdOptions;
+} heapReloptions;
 
 #define HEAP_MIN_FILLFACTOR			10
 #define HEAP_DEFAULT_FILLFACTOR		100
 
 /*
- * RelationGetToastTupleTarget
- *		Returns the relation's toast_tuple_target.  Note multiple eval of argument!
+ * HeapGetFillFactor
+ *		Returns the heap relation's fillfactor.  Note multiple eval of argument!
  */
-#define RelationGetToastTupleTarget(relation, defaulttarg) \
+#define HeapGetFillFactor(relation, defaultff) \
 	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->toast_tuple_target : (defaulttarg))
+	 ((heapReloptions *) (relation)->rd_options)->fillfactor : (defaultff))
 
 /*
- * RelationGetFillFactor
- *		Returns the relation's fillfactor.  Note multiple eval of argument!
- */
-#define RelationGetFillFactor(relation, defaultff) \
-	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->fillfactor : (defaultff))
-
-/*
- * RelationGetTargetPageUsage
+ * HeapGetTargetPageUsage
  *		Returns the relation's desired space usage per page in bytes.
  */
-#define RelationGetTargetPageUsage(relation, defaultff) \
-	(BLCKSZ * RelationGetFillFactor(relation, defaultff) / 100)
+#define HeapGetTargetPageUsage(relation, defaultff) \
+	(BLCKSZ * HeapGetFillFactor(relation, defaultff) / 100)
 
 /*
- * RelationGetTargetPageFreeSpace
+ * HeapGetTargetPageFreeSpace
  *		Returns the relation's desired freespace per page in bytes.
  */
-#define RelationGetTargetPageFreeSpace(relation, defaultff) \
-	(BLCKSZ * (100 - RelationGetFillFactor(relation, defaultff)) / 100)
+#define HeapGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - HeapGetFillFactor(relation, defaultff)) / 100)
 
 /*
  * RelationIsUsedAsCatalogTable
@@ -386,10 +388,10 @@ typedef struct StdRdOptions
  *		from the pov of logical decoding.  Note multiple eval of argument!
  */
 #define RelationIsUsedAsCatalogTable(relation)	\
-	((relation)->rd_options && \
+	((relation)->rd_common_options && \
 	 ((relation)->rd_rel->relkind == RELKIND_RELATION || \
 	  (relation)->rd_rel->relkind == RELKIND_MATVIEW) ? \
-	 ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)
+	 (relation)->rd_common_options->user_catalog_table : false)
 
 /*
  * RelationGetParallelWorkers
@@ -397,8 +399,8 @@ typedef struct StdRdOptions
  *		Note multiple eval of argument!
  */
 #define RelationGetParallelWorkers(relation, defaultpw) \
-	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->parallel_workers : (defaultpw))
+	((relation)->rd_common_options ? \
+	 (relation)->rd_common_options->parallel_workers : (defaultpw))
 
 /* ViewOptions->check_option values */
 typedef enum ViewOptCheckOption
-- 
2.39.2 (Apple Git-143)

#44Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#43)
1 attachment(s)
Re: Table AM Interface Enhancements

Hi, Pavel!

On Fri, Apr 5, 2024 at 6:58 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Tue, 2 Apr 2024 at 19:17, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions. I would rather introduce a new data structure
for the table options that need to be accessed outside of the table AM.
Then reloptions become a black box used directly only by the table AM,
while the table AM has freedom in what to store in reloptions and how
to calculate the externally-visible options. What do you think?

Hi Alexander!

I agree with all of that. It will take some refactoring to get there,
though.

One idea is to store StdRdOptions like normal, but if an unrecognized
option is found, ask the table AM if it understands the option. In that
case I think we'd just use a different field in pg_class so that it can
use whatever format it wants to represent its options.

Regards,
Jeff Davis

I tried to rework the patch regarding table AM according to the input from Alexander and Jeff.

It splits table reloptions into two categories:
- common to all tables (stored in a fixed-size structure that can be accessed from outside)
- table-AM specific (variable size, parsed and accessed by the access method only)

Thank you for your work. Please check the revised patch.

It makes CommonRdOptions a separate data structure, not directly
involved in parsing the reloptions. Instead, the table AM can fill it
based on its reloptions or calculate it in some other way. The patch
comes with a test module, test_tam_options, which provides a
heap-based table AM. This table AM has an "enable_parallel" reloption,
which is used as the basis for setting CommonRdOptions.parallel_workers.
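
The idea, roughly, is the following (this is only a sketch; the
authoritative version is the test_tam_options module in the attached
patch, and the names below only approximate it):

/* AM-specific options with a single boolean knob. */
typedef struct TestTamOptions
{
	int32		vl_len_;			/* varlena header */
	bool		enable_parallel;	/* allow parallel scans of this table */
} TestTamOptions;

/* Assumed to be registered earlier with add_reloption_kind(). */
static relopt_kind test_tam_relopt_kind;

/*
 * The AM parses its own reloptions and, based on them, fills the
 * externally visible CommonRdOptions: here parallel scans are disabled
 * by forcing parallel_workers to 0 whenever enable_parallel is off.
 */
static bytea *
test_tam_reloptions(char relkind, Datum reloptions,
					CommonRdOptions *common, bool validate)
{
	static const relopt_parse_elt tab[] = {
		{"enable_parallel", RELOPT_TYPE_BOOL,
		 offsetof(TestTamOptions, enable_parallel)}
	};
	TestTamOptions *opts;

	opts = (TestTamOptions *) build_reloptions(reloptions, validate,
											   test_tam_relopt_kind,
											   sizeof(TestTamOptions),
											   tab, lengthof(tab));

	if (common != NULL && opts != NULL && !opts->enable_parallel)
		common->parallel_workers = 0;

	return (bytea *) opts;
}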

------
Regards,
Alexander Korotkov

Attachments:

v10-0001-Custom-reloptions-for-table-AM.patchapplication/octet-stream; name=v10-0001-Custom-reloptions-for-table-AM.patchDownload
From 85d1400eab1903ee78804ad5b061c740d3415012 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 6 Apr 2024 14:01:50 +0300
Subject: [PATCH v10] Custom reloptions for table AM

Let the table AM define custom reloptions for its tables. This allows specifying
AM-specific parameters via the WITH clause when creating a table.

The reloptions that may be used outside of the table AM are now extracted
into the CommonRdOptions data structure.  At the table AM's discretion, these
options can be specified directly by the user or calculated in some other way.

The new test module test_tam_options evaluates the ability to set up custom
reloptions and calculate fields of CommonRdOptions based on them.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Pavel Borisov, Matthias van de Meent, Jeff Davis
---
 src/backend/access/common/reloptions.c        | 121 +++++++++-----
 src/backend/access/heap/heapam.c              |   4 +-
 src/backend/access/heap/heapam_handler.c      |  13 ++
 src/backend/access/heap/heaptoast.c           |   9 +-
 src/backend/access/heap/hio.c                 |   4 +-
 src/backend/access/heap/pruneheap.c           |   4 +-
 src/backend/access/heap/rewriteheap.c         |   4 +-
 src/backend/access/table/tableam.c            |   2 +-
 src/backend/access/table/tableamapi.c         |  25 +++
 src/backend/commands/createas.c               |  13 +-
 src/backend/commands/tablecmds.c              |  63 +++++---
 src/backend/commands/vacuum.c                 |  10 +-
 src/backend/optimizer/util/plancat.c          |   2 +-
 src/backend/postmaster/autovacuum.c           |  12 +-
 src/backend/tcop/utility.c                    |  13 +-
 src/backend/utils/cache/relcache.c            |  33 +++-
 src/include/access/reloptions.h               |  10 +-
 src/include/access/tableam.h                  |  50 ++++++
 src/include/utils/rel.h                       | 148 +++++++++---------
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_tam_options/.gitignore  |   4 +
 src/test/modules/test_tam_options/Makefile    |  23 +++
 .../expected/test_tam_options.out             |  36 +++++
 src/test/modules/test_tam_options/meson.build |  33 ++++
 .../test_tam_options/sql/test_tam_options.sql |  25 +++
 .../test_tam_options--1.0.sql                 |  12 ++
 .../test_tam_options/test_tam_options.c       |  66 ++++++++
 .../test_tam_options/test_tam_options.control |   4 +
 src/tools/pgindent/typedefs.list              |   3 +-
 30 files changed, 584 insertions(+), 164 deletions(-)
 create mode 100644 src/test/modules/test_tam_options/.gitignore
 create mode 100644 src/test/modules/test_tam_options/Makefile
 create mode 100644 src/test/modules/test_tam_options/expected/test_tam_options.out
 create mode 100644 src/test/modules/test_tam_options/meson.build
 create mode 100644 src/test/modules/test_tam_options/sql/test_tam_options.sql
 create mode 100644 src/test/modules/test_tam_options/test_tam_options--1.0.sql
 create mode 100644 src/test/modules/test_tam_options/test_tam_options.c
 create mode 100644 src/test/modules/test_tam_options/test_tam_options.control

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..c1de092a42d 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -44,7 +45,7 @@
  * value, upper and lower bounds (if applicable); for strings, consider a
  * validation routine.
  * (ii) add a record below (or use add_<type>_reloption).
- * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
+ * (iii) add it to the appropriate options struct (perhaps HeapRdOptions)
  * (iv) add it to the appropriate handling routine (perhaps
  * default_reloptions)
  * (v) make sure the lock level is set correctly for that operation
@@ -1374,10 +1375,16 @@ untransformRelOptions(Datum options)
  * tupdesc is pg_class' tuple descriptor.  amoptions is a pointer to the index
  * AM's options parser function in the case of a tuple corresponding to an
  * index, or NULL otherwise.
+ *
+ * If common pointer is provided, then the corresponding struct will be
+ * filled with options that table AM exposes for external usage.  That must
+ * be filled with defaults before passing here.
  */
+
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions,
+				  CommonRdOptions *common)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1406,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, common, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
@@ -1695,7 +1703,7 @@ parse_one_reloption(relopt_value *option, char *text_str, int text_len,
  * Given the result from parseRelOptions, allocate a struct that's of the
  * specified base size plus any extra space that's needed for string variables.
  *
- * "base" should be sizeof(struct) of the reloptions struct (StdRdOptions or
+ * "base" should be sizeof(struct) of the reloptions struct (HeapRdOptions or
  * equivalent).
  */
 static void *
@@ -1832,59 +1840,95 @@ fillRelOptions(void *rdopts, Size basesize,
 
 
 /*
- * Option parser for anything that uses StdRdOptions.
+ * Option parser for anything that uses HeapRdOptions.
  */
-bytea *
+static bytea *
 default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 {
 	static const relopt_parse_elt tab[] = {
-		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(HeapRdOptions, fillfactor)},
 		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, enabled)},
 		{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_threshold)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_threshold)},
 		{"autovacuum_vacuum_insert_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_threshold)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_ins_threshold)},
 		{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_cost_limit)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_max_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, freeze_max_age)},
 		{"autovacuum_freeze_table_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_table_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, freeze_table_age)},
 		{"autovacuum_multixact_freeze_min_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_min_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, multixact_freeze_min_age)},
 		{"autovacuum_multixact_freeze_max_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_max_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, multixact_freeze_max_age)},
 		{"autovacuum_multixact_freeze_table_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_table_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, multixact_freeze_table_age)},
 		{"log_autovacuum_min_duration", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, log_min_duration)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, log_min_duration)},
 		{"toast_tuple_target", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, toast_tuple_target)},
+		offsetof(HeapRdOptions, toast_tuple_target)},
 		{"autovacuum_vacuum_cost_delay", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_delay)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_cost_delay)},
 		{"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_scale_factor)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_scale_factor)},
 		{"autovacuum_vacuum_insert_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_scale_factor)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_ins_scale_factor)},
 		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_scale_factor)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, analyze_scale_factor)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, user_catalog_table)},
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, parallel_workers)},
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, parallel_workers)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
-		offsetof(StdRdOptions, vacuum_index_cleanup)},
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, vacuum_truncate)}
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, vacuum_truncate)}
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate, kind,
-									  sizeof(StdRdOptions),
+									  sizeof(HeapRdOptions),
 									  tab, lengthof(tab));
 }
 
@@ -2016,26 +2060,33 @@ view_reloptions(Datum reloptions, bool validate)
  * Parse options for heaps, views and toast tables.
  */
 bytea *
-heap_reloptions(char relkind, Datum reloptions, bool validate)
+heap_reloptions(char relkind, Datum reloptions,
+				CommonRdOptions *common, bool validate)
 {
-	StdRdOptions *rdopts;
+	HeapRdOptions *rdopts;
 
 	switch (relkind)
 	{
 		case RELKIND_TOASTVALUE:
-			rdopts = (StdRdOptions *)
+			rdopts = (HeapRdOptions *)
 				default_reloptions(reloptions, validate, RELOPT_KIND_TOAST);
 			if (rdopts != NULL)
 			{
 				/* adjust default-only parameters for TOAST relations */
 				rdopts->fillfactor = 100;
-				rdopts->autovacuum.analyze_threshold = -1;
-				rdopts->autovacuum.analyze_scale_factor = -1;
+				rdopts->common.autovacuum.analyze_threshold = -1;
+				rdopts->common.autovacuum.analyze_scale_factor = -1;
 			}
+			if (rdopts != NULL && common != NULL)
+				*common = rdopts->common;
 			return (bytea *) rdopts;
 		case RELKIND_RELATION:
 		case RELKIND_MATVIEW:
-			return default_reloptions(reloptions, validate, RELOPT_KIND_HEAP);
+			rdopts = (HeapRdOptions *)
+				default_reloptions(reloptions, validate, RELOPT_KIND_HEAP);
+			if (rdopts != NULL && common != NULL)
+				*common = rdopts->common;
+			return (bytea *) rdopts;
 		default:
 			/* other relkinds are not supported */
 			return NULL;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 01bb2f4cc16..b401a8bd90a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2144,8 +2144,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	Assert(!(options & HEAP_INSERT_NO_LOGICAL));
 
 	needwal = RelationNeedsWAL(relation);
-	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
-												   HEAP_DEFAULT_FILLFACTOR);
+	saveFreeSpace = HeapGetTargetPageFreeSpace(relation,
+											   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Toast and set header data in all the slots */
 	heaptuples = palloc(ntuples * sizeof(HeapTuple));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..4f20522fa94 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2158,6 +2159,17 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions,
+				  CommonRdOptions *common, bool validate)
+{
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	return heap_reloptions(relkind, reloptions, common, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2707,6 +2719,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/heap/heaptoast.c b/src/backend/access/heap/heaptoast.c
index a420e165304..cffee7e1b7a 100644
--- a/src/backend/access/heap/heaptoast.c
+++ b/src/backend/access/heap/heaptoast.c
@@ -32,6 +32,13 @@
 #include "access/toast_internals.h"
 #include "utils/fmgroids.h"
 
+/*
+ * HeapGetToastTupleTarget
+ *      Returns the heap relation's toast_tuple_target.  Note multiple eval of argument!
+ */
+#define HeapGetToastTupleTarget(relation, defaulttarg) \
+		((HeapRdOptions *) (relation)->rd_options ? \
+		((HeapRdOptions *) (relation)->rd_options)->toast_tuple_target : (defaulttarg))
 
 /* ----------
  * heap_toast_delete -
@@ -174,7 +181,7 @@ heap_toast_insert_or_update(Relation rel, HeapTuple newtup, HeapTuple oldtup,
 		hoff += BITMAPLEN(numAttrs);
 	hoff = MAXALIGN(hoff);
 	/* now convert to a limit on the tuple data size */
-	maxDataLen = RelationGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff;
+	maxDataLen = HeapGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff;
 
 	/*
 	 * Look for attributes with attstorage EXTENDED to compress.  Also find
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7c662cdf46e..ed731179b45 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -536,8 +536,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 						len, MaxHeapTupleSize)));
 
 	/* Compute desired extra freespace due to fillfactor option */
-	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
-												   HEAP_DEFAULT_FILLFACTOR);
+	saveFreeSpace = HeapGetTargetPageFreeSpace(relation,
+											   HEAP_DEFAULT_FILLFACTOR);
 
 	/*
 	 * Since pages without tuples can still have line pointers, we consider
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d2eecaf7ebc..20142dd2144 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -235,8 +235,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * important than sometimes getting a wrong answer in what is after all
 	 * just a heuristic estimate.
 	 */
-	minfree = RelationGetTargetPageFreeSpace(relation,
-											 HEAP_DEFAULT_FILLFACTOR);
+	minfree = HeapGetTargetPageFreeSpace(relation,
+										 HEAP_DEFAULT_FILLFACTOR);
 	minfree = Max(minfree, BLCKSZ / 10);
 
 	if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 473f3aa9bef..2bbf121146e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -641,8 +641,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 						len, MaxHeapTupleSize)));
 
 	/* Compute desired extra freespace due to fillfactor option */
-	saveFreeSpace = RelationGetTargetPageFreeSpace(state->rs_new_rel,
-												   HEAP_DEFAULT_FILLFACTOR);
+	saveFreeSpace = HeapGetTargetPageFreeSpace(state->rs_new_rel,
+											   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
 	page = (Page) state->rs_buffer;
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222cebc..5d5f0e68fd7 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -750,7 +750,7 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 		 * The other branch considers it implicitly by calculating density
 		 * from actual relpages/reltuples statistics.
 		 */
-		fillfactor = RelationGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
+		fillfactor = HeapGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
 
 		tuple_width = get_rel_data_width(rel, attr_widths);
 		tuple_width += overhead_bytes_per_tuple;
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..d9e23ef3175 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,29 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+/*
+ * GetTableAmRoutineByAmOid
+ *		Given a table access method OID, get its TableAmRoutine struct, which
+ *		will be palloc'd in the caller's memory context.
+ */
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index afd3dace079..c5df96e374a 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -85,6 +85,9 @@ create_ctas_internal(List *attrList, IntoClause *into)
 	Datum		toast_options;
 	static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
 	ObjectAddress intoRelationAddr;
+	const TableAmRoutine *tableam = NULL;
+	Oid			accessMethodId = InvalidOid;
+	Relation	rel;
 
 	/* This code supports both CREATE TABLE AS and CREATE MATERIALIZED VIEW */
 	is_matview = (into->viewQuery != NULL);
@@ -125,7 +128,15 @@ create_ctas_internal(List *attrList, IntoClause *into)
 										validnsps,
 										true, false);
 
-	(void) heap_reloptions(RELKIND_TOASTVALUE, toast_options, true);
+	rel = relation_open(intoRelationAddr.objectId, AccessShareLock);
+	accessMethodId = table_relation_toast_am(rel);
+	relation_close(rel, AccessShareLock);
+
+	if (OidIsValid(accessMethodId))
+	{
+		tableam = GetTableAmRoutineByAmOid(accessMethodId);
+		(void) tableam_reloptions(tableam, RELKIND_TOASTVALUE, toast_options, NULL, true);
+	}
 
 	NewRelationCreateToastTable(intoRelationAddr.objectId, toast_options);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 582890a3025..3c74eef3e67 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -720,6 +720,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -855,6 +856,28 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * For relations with table AM and partitioned tables, select access
+	 * method to use: an explicitly indicated one, or (in the case of a
+	 * partitioned table) the parent's, if it has one.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		if (stmt->partbound)
+		{
+			Assert(list_length(inheritOids) == 1);
+			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		}
+
+		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
+	}
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -863,6 +886,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, NULL, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -870,7 +899,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			(void) partitioned_table_reloptions(reloptions, true);
 			break;
 		default:
-			(void) heap_reloptions(relkind, reloptions, true);
+			if (OidIsValid(accessMethodId))
+			{
+				tableam = GetTableAmRoutineByAmOid(accessMethodId);
+				(void) tableam_reloptions(tableam, relkind, reloptions, NULL, true);
+			}
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -962,28 +996,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * For relations with table AM and partitioned tables, select access
-	 * method to use: an explicitly indicated one, or (in the case of a
-	 * partitioned table) the parent's, if it has one.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		if (stmt->partbound)
-		{
-			Assert(list_length(inheritOids) == 1);
-			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
-		}
-
-		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-			accessMethodId = get_table_am_oid(default_table_access_method, false);
-	}
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15571,7 +15583,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, NULL, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
@@ -15684,7 +15697,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 										 defList, "toast", validnsps, false,
 										 operation == AT_ResetRelOptions);
 
-		(void) heap_reloptions(RELKIND_TOASTVALUE, newOptions, true);
+		(void) table_reloptions(rel, RELKIND_TOASTVALUE, newOptions, NULL, true);
 
 		memset(repl_val, 0, sizeof(repl_val));
 		memset(repl_null, false, sizeof(repl_null));
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b589279d49f..ba13fc0ad6c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2121,11 +2121,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
 	{
 		StdRdOptIndexCleanup vacuum_index_cleanup;
 
-		if (rel->rd_options == NULL)
-			vacuum_index_cleanup = STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO;
-		else
-			vacuum_index_cleanup =
-				((StdRdOptions *) rel->rd_options)->vacuum_index_cleanup;
+		vacuum_index_cleanup =
+			rel->rd_common_options.vacuum_index_cleanup;
 
 		if (vacuum_index_cleanup == STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO)
 			params->index_cleanup = VACOPTVALUE_AUTO;
@@ -2145,8 +2142,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
 	 */
 	if (params->truncate == VACOPTVALUE_UNSPECIFIED)
 	{
-		if (rel->rd_options == NULL ||
-			((StdRdOptions *) rel->rd_options)->vacuum_truncate)
+		if (rel->rd_common_options.vacuum_truncate)
 			params->truncate = VACOPTVALUE_ENABLED;
 		else
 			params->truncate = VACOPTVALUE_DISABLED;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 6bb53e4346f..bea4440f71b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -190,7 +190,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 						  &rel->pages, &rel->tuples, &rel->allvisfrac);
 
 	/* Retrieve the parallel_workers reloption, or -1 if not set. */
-	rel->rel_parallel_workers = RelationGetParallelWorkers(relation, -1);
+	rel->rel_parallel_workers = RelationGetParallelWorkers(relation);
 
 	/*
 	 * Make list of indexes.  Ignore indexes on system catalogs if told to.
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c367ede6f88..7cb79ebcedd 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2674,19 +2674,21 @@ static AutoVacOpts *
 extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 {
 	bytea	   *relopts;
+	CommonRdOptions common;
 	AutoVacOpts *av;
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
-	if (relopts == NULL)
-		return NULL;
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL, &common);
+	if (relopts)
+		pfree(relopts);
 
 	av = palloc(sizeof(AutoVacOpts));
-	memcpy(av, &(((StdRdOptions *) relopts)->autovacuum), sizeof(AutoVacOpts));
-	pfree(relopts);
+	memcpy(av, &(common.autovacuum), sizeof(AutoVacOpts));
 
 	return av;
 }
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fa66b8017ed..241ce7d3434 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -1156,6 +1156,8 @@ ProcessUtilitySlow(ParseState *pstate,
 							CreateStmt *cstmt = (CreateStmt *) stmt;
 							Datum		toast_options;
 							static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
+							const TableAmRoutine *tableam = NULL;
+							Oid			accessMethodId;
 
 							/* Remember transformed RangeVar for LIKE */
 							table_rv = cstmt->relation;
@@ -1185,9 +1187,14 @@ ProcessUtilitySlow(ParseState *pstate,
 																validnsps,
 																true,
 																false);
-							(void) heap_reloptions(RELKIND_TOASTVALUE,
-												   toast_options,
-												   true);
+
+							/* TOAST has default access method */
+							accessMethodId = get_table_am_oid(default_table_access_method, false);
+							tableam = GetTableAmRoutineByAmOid(accessMethodId);
+							(void) tableam_reloptions(tableam, RELKIND_TOASTVALUE,
+													  toast_options,
+													  NULL,
+													  true);
 
 							NewRelationCreateToastTable(address.objectId,
 														toast_options);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3fe74dabd00..0e3b4cc93fe 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -464,9 +465,36 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	CommonRdOptions *common = &relation->rd_common_options;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
+	/*
+	 * Fill rd_common_options with default values.  These might later be
+	 * changed by extractRelOptions().
+	 */
+	common->autovacuum.enabled = true;
+	common->autovacuum.vacuum_threshold = -1;
+	common->autovacuum.vacuum_ins_threshold = -2;
+	common->autovacuum.analyze_threshold = -1;
+	common->autovacuum.vacuum_cost_limit = -1;
+	common->autovacuum.freeze_min_age = -1;
+	common->autovacuum.freeze_max_age = -1;
+	common->autovacuum.freeze_table_age = -1;
+	common->autovacuum.multixact_freeze_min_age = -1;
+	common->autovacuum.multixact_freeze_max_age = -1;
+	common->autovacuum.multixact_freeze_table_age = -1;
+	common->autovacuum.log_min_duration = -1;
+	common->autovacuum.vacuum_cost_delay = -1;
+	common->autovacuum.vacuum_scale_factor = -1;
+	common->autovacuum.vacuum_ins_scale_factor = -1;
+	common->autovacuum.analyze_scale_factor = -1;
+	common->parallel_workers = -1;
+	common->user_catalog_table = false;
+	common->vacuum_index_cleanup = STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO;
+	common->vacuum_truncate = true;
+
 	/*
 	 * Look up any AM-specific parse function; fall out if relkind should not
 	 * have options.
@@ -478,6 +506,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -493,7 +522,9 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn,
+								&relation->rd_common_options);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270a..342b9cdd6ed 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,7 +225,9 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-								amoptions_function amoptions);
+								const TableAmRoutine *tableam,
+								amoptions_function amoptions,
+								CommonRdOptions *common);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
 							  Size relopt_struct_size,
@@ -233,9 +236,8 @@ extern void *build_reloptions(Datum reloptions, bool validate,
 extern void *build_local_reloptions(local_relopts *relopts, Datum options,
 									bool validate);
 
-extern bytea *default_reloptions(Datum reloptions, bool validate,
-								 relopt_kind kind);
-extern bytea *heap_reloptions(char relkind, Datum reloptions, bool validate);
+extern bytea *heap_reloptions(char relkind, Datum reloptions,
+							  CommonRdOptions *common, bool validate);
 extern bytea *view_reloptions(Datum reloptions, bool validate);
 extern bytea *partitioned_table_reloptions(Datum reloptions, bool validate);
 extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index be198fa3158..ec827ac12bf 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -746,6 +746,34 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * This callback parses and validates the reloptions array for a table.
+	 *
+	 * This is called only when a non-null reloptions array exists for the
+	 * table.  'reloptions' is a text array containing entries of the form
+	 * "name=value".  The function should construct a bytea value, which will
+	 * be copied into the rd_options field of the table's relcache entry. The
+	 * data contents of the bytea value are open for the access method to
+	 * define.
+	 *
+	 * '*common' points to the common option values that the table access
+	 * method exposes to autovacuum, the query planner, and others.  The
+	 * caller should fill them with default values.  The table access method
+	 * may modify them based on the options specified by the user.
+	 *
+	 * When 'validate' is true, the function should report a suitable error
+	 * message if any of the options are unrecognized or have invalid values;
+	 * when 'validate' is false, invalid entries should be silently ignored.
+	 * ('validate' is false when loading options already stored in pg_catalog;
+	 * an invalid entry could only be found if the access method has changed
+	 * its rules for options, and in that case ignoring obsolete entries is
+	 * appropriate.)
+	 *
+	 * It is OK to return NULL if default behavior is wanted.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions,
+							   CommonRdOptions *common, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1945,6 +1973,27 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse table options without knowledge of a particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, CommonRdOptions *common, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, common, validate);
+}
+
+/*
+ * Parse options for a given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, CommonRdOptions *common, bool validate)
+{
+	return tableam_reloptions(rel->rd_tableam, relkind, reloptions,
+							  common, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2123,6 +2172,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f25f769af2b..bed1c0b554f 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -48,6 +48,52 @@ typedef struct LockInfoData
 
 typedef LockInfoData *LockInfo;
 
+ /* autovacuum-related reloptions. */
+typedef struct AutoVacOpts
+{
+	bool		enabled;
+	int			vacuum_threshold;
+	int			vacuum_ins_threshold;
+	int			analyze_threshold;
+	int			vacuum_cost_limit;
+	int			freeze_min_age;
+	int			freeze_max_age;
+	int			freeze_table_age;
+	int			multixact_freeze_min_age;
+	int			multixact_freeze_max_age;
+	int			multixact_freeze_table_age;
+	int			log_min_duration;
+	float8		vacuum_cost_delay;
+	float8		vacuum_scale_factor;
+	float8		vacuum_ins_scale_factor;
+	float8		analyze_scale_factor;
+} AutoVacOpts;
+
+/* StdRdOptions->vacuum_index_cleanup values */
+typedef enum StdRdOptIndexCleanup
+{
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO = 0,
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_OFF,
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_ON,
+} StdRdOptIndexCleanup;
+
+/*
+ * CommonRdOptions
+ *		Contents of rd_common_options for tables.  It contains the options
+ *		that the table access method exposes to autovacuum, the query planner,
+ *		and others.  Depending on the table AM, these options may be set
+ *		directly by the user or derived in some other way.
+ */
+typedef struct CommonRdOptions
+{
+	AutoVacOpts autovacuum;		/* autovacuum-related options */
+	bool		user_catalog_table; /* use as an additional catalog relation */
+	int			parallel_workers;	/* max number of parallel workers */
+	StdRdOptIndexCleanup vacuum_index_cleanup;	/* controls index vacuuming */
+	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
+} CommonRdOptions;
+
+
 /*
  * Here are the contents of a relation cache entry.
  */
@@ -168,11 +214,19 @@ typedef struct RelationData
 	PublicationDesc *rd_pubdesc;	/* publication descriptor, or NULL */
 
 	/*
-	 * rd_options is set whenever rd_rel is loaded into the relcache entry.
-	 * Note that you can NOT look into rd_rel for this data.  NULL means "use
-	 * defaults".
+	 * rd_options and rd_common_options are set whenever rd_rel is loaded into
+	 * the relcache entry. Note that you can NOT look into rd_rel for this
+	 * data. NULLs means "use defaults".
+	 */
+	CommonRdOptions rd_common_options;	/* options that the table AM exposes
+										 * for external usage */
+
+	/*
+	 * AM-specific part of pg_class.reloptions, parsed into a table-AM-specific
+	 * structure (e.g. struct HeapRdOptions).  Contents are not to be accessed
+	 * outside of the table AM.
 	 */
-	bytea	   *rd_options;		/* parsed pg_class.reloptions */
+	bytea	   *rd_options;
 
 	/*
 	 * Oid of the handler for this relation. For an index this is a function
@@ -297,88 +351,42 @@ typedef struct ForeignKeyCacheInfo
 	Oid			conpfeqop[INDEX_MAX_KEYS] pg_node_attr(array_size(nkeys));
 } ForeignKeyCacheInfo;
 
-
 /*
- * StdRdOptions
- *		Standard contents of rd_options for heaps.
- *
- * RelationGetFillFactor() and RelationGetTargetPageFreeSpace() can only
- * be applied to relations that use this format or a superset for
- * private options data.
+ * HeapRdOptions
+ *		Contents of rd_options specific for heap tables.
  */
- /* autovacuum-related reloptions. */
-typedef struct AutoVacOpts
-{
-	bool		enabled;
-	int			vacuum_threshold;
-	int			vacuum_ins_threshold;
-	int			analyze_threshold;
-	int			vacuum_cost_limit;
-	int			freeze_min_age;
-	int			freeze_max_age;
-	int			freeze_table_age;
-	int			multixact_freeze_min_age;
-	int			multixact_freeze_max_age;
-	int			multixact_freeze_table_age;
-	int			log_min_duration;
-	float8		vacuum_cost_delay;
-	float8		vacuum_scale_factor;
-	float8		vacuum_ins_scale_factor;
-	float8		analyze_scale_factor;
-} AutoVacOpts;
-
-/* StdRdOptions->vacuum_index_cleanup values */
-typedef enum StdRdOptIndexCleanup
-{
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO = 0,
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_OFF,
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_ON,
-} StdRdOptIndexCleanup;
-
-typedef struct StdRdOptions
+typedef struct HeapRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	CommonRdOptions common;
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	int			toast_tuple_target; /* target for tuple toasting */
-	AutoVacOpts autovacuum;		/* autovacuum-related options */
-	bool		user_catalog_table; /* use as an additional catalog relation */
-	int			parallel_workers;	/* max number of parallel workers */
-	StdRdOptIndexCleanup vacuum_index_cleanup;	/* controls index vacuuming */
-	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
-} StdRdOptions;
+} HeapRdOptions;
 
 #define HEAP_MIN_FILLFACTOR			10
 #define HEAP_DEFAULT_FILLFACTOR		100
 
 /*
- * RelationGetToastTupleTarget
- *		Returns the relation's toast_tuple_target.  Note multiple eval of argument!
+ * HeapGetFillFactor
+ *		Returns the heap relation's fillfactor.  Note multiple eval of argument!
  */
-#define RelationGetToastTupleTarget(relation, defaulttarg) \
+#define HeapGetFillFactor(relation, defaultff) \
 	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->toast_tuple_target : (defaulttarg))
+	 ((HeapRdOptions *) (relation)->rd_options)->fillfactor : (defaultff))
 
 /*
- * RelationGetFillFactor
- *		Returns the relation's fillfactor.  Note multiple eval of argument!
- */
-#define RelationGetFillFactor(relation, defaultff) \
-	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->fillfactor : (defaultff))
-
-/*
- * RelationGetTargetPageUsage
+ * HeapGetTargetPageUsage
  *		Returns the relation's desired space usage per page in bytes.
  */
-#define RelationGetTargetPageUsage(relation, defaultff) \
-	(BLCKSZ * RelationGetFillFactor(relation, defaultff) / 100)
+#define HeapGetTargetPageUsage(relation, defaultff) \
+	(BLCKSZ * HeapGetFillFactor(relation, defaultff) / 100)
 
 /*
- * RelationGetTargetPageFreeSpace
+ * HeapGetTargetPageFreeSpace
  *		Returns the relation's desired freespace per page in bytes.
  */
-#define RelationGetTargetPageFreeSpace(relation, defaultff) \
-	(BLCKSZ * (100 - RelationGetFillFactor(relation, defaultff)) / 100)
+#define HeapGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - HeapGetFillFactor(relation, defaultff)) / 100)
 
 /*
  * RelationIsUsedAsCatalogTable
@@ -386,19 +394,17 @@ typedef struct StdRdOptions
  *		from the pov of logical decoding.  Note multiple eval of argument!
  */
 #define RelationIsUsedAsCatalogTable(relation)	\
-	((relation)->rd_options && \
-	 ((relation)->rd_rel->relkind == RELKIND_RELATION || \
+	(((relation)->rd_rel->relkind == RELKIND_RELATION || \
 	  (relation)->rd_rel->relkind == RELKIND_MATVIEW) ? \
-	 ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)
+	 (relation)->rd_common_options.user_catalog_table : false)
 
 /*
  * RelationGetParallelWorkers
  *		Returns the relation's parallel_workers reloption setting.
  *		Note multiple eval of argument!
  */
-#define RelationGetParallelWorkers(relation, defaultpw) \
-	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->parallel_workers : (defaultpw))
+#define RelationGetParallelWorkers(relation) \
+	((relation)->rd_common_options.parallel_workers)
 
 /* ViewOptions->check_option values */
 typedef enum ViewOptCheckOption
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 256799f520a..8f661525399 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -36,6 +36,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_tam_options \
 		  test_tidstore \
 		  unsafe_tests \
 		  worker_spi \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index d8fe059d236..235e342dfaa 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -35,6 +35,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_tam_options')
 subdir('test_tidstore')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_tam_options/.gitignore b/src/test/modules/test_tam_options/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_tam_options/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_tam_options/Makefile b/src/test/modules/test_tam_options/Makefile
new file mode 100644
index 00000000000..bd6d4599a14
--- /dev/null
+++ b/src/test/modules/test_tam_options/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tam_options/Makefile
+
+MODULE_big = test_tam_options
+OBJS = \
+	$(WIN32RES) \
+	test_tam_options.o
+PGFILEDESC = "test_tam_options - test code for table access method reloptions"
+
+EXTENSION = test_tam_options
+DATA = test_tam_options--1.0.sql
+
+REGRESS = test_tam_options
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tam_options
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tam_options/expected/test_tam_options.out b/src/test/modules/test_tam_options/expected/test_tam_options.out
new file mode 100644
index 00000000000..c921afcb270
--- /dev/null
+++ b/src/test/modules/test_tam_options/expected/test_tam_options.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_tam_options;
+-- encourage use of parallel plans
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 4;
+CREATE TABLE test (i int) USING heap_alter_options;
+INSERT INTO test SELECT i FROM generate_series(1, 10000) i;
+VACUUM ANALYZE test;
+EXPLAIN (costs off)
+SELECT * FROM test;
+           QUERY PLAN            
+---------------------------------
+ Gather
+   Workers Planned: 4
+   ->  Parallel Seq Scan on test
+(3 rows)
+
+ALTER TABLE test SET (enable_parallel = OFF);
+EXPLAIN (costs off)
+SELECT * FROM test;
+    QUERY PLAN    
+------------------
+ Seq Scan on test
+(1 row)
+
+ALTER TABLE test SET (enable_parallel = ON);
+EXPLAIN (costs off)
+SELECT * FROM test;
+           QUERY PLAN            
+---------------------------------
+ Gather
+   Workers Planned: 4
+   ->  Parallel Seq Scan on test
+(3 rows)
+
diff --git a/src/test/modules/test_tam_options/meson.build b/src/test/modules/test_tam_options/meson.build
new file mode 100644
index 00000000000..d41a32a6803
--- /dev/null
+++ b/src/test/modules/test_tam_options/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+test_tam_options_sources = files(
+  'test_tam_options.c',
+)
+
+if host_system == 'windows'
+  test_tam_options_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_tam_options',
+    '--FILEDESC', 'test_tam_options -  test code for table access method reloptions',])
+endif
+
+test_tam_options = shared_module('test_tam_options',
+  test_tam_options_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_tam_options
+
+test_install_data += files(
+  'test_tam_options.control',
+  'test_tam_options--1.0.sql',
+)
+
+tests += {
+  'name': 'test_tam_options',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'test_tam_options',
+    ],
+  },
+}
diff --git a/src/test/modules/test_tam_options/sql/test_tam_options.sql b/src/test/modules/test_tam_options/sql/test_tam_options.sql
new file mode 100644
index 00000000000..4f975046568
--- /dev/null
+++ b/src/test/modules/test_tam_options/sql/test_tam_options.sql
@@ -0,0 +1,25 @@
+CREATE EXTENSION test_tam_options;
+
+-- encourage use of parallel plans
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 4;
+
+CREATE TABLE test (i int) USING heap_alter_options;
+
+INSERT INTO test SELECT i FROM generate_series(1, 10000) i;
+VACUUM ANALYZE test;
+
+EXPLAIN (costs off)
+SELECT * FROM test;
+
+ALTER TABLE test SET (enable_parallel = OFF);
+
+EXPLAIN (costs off)
+SELECT * FROM test;
+
+ALTER TABLE test SET (enable_parallel = ON);
+
+EXPLAIN (costs off)
+SELECT * FROM test;
diff --git a/src/test/modules/test_tam_options/test_tam_options--1.0.sql b/src/test/modules/test_tam_options/test_tam_options--1.0.sql
new file mode 100644
index 00000000000..07569f7b5f1
--- /dev/null
+++ b/src/test/modules/test_tam_options/test_tam_options--1.0.sql
@@ -0,0 +1,12 @@
+/* src/test/modules/test_tam_options/test_tam_options--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tam_options" to load this file. \quit
+
+CREATE FUNCTION heap_alter_options_tam_handler(internal)
+RETURNS table_am_handler
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE ACCESS METHOD heap_alter_options TYPE TABLE
+HANDLER heap_alter_options_tam_handler;
diff --git a/src/test/modules/test_tam_options/test_tam_options.c b/src/test/modules/test_tam_options/test_tam_options.c
new file mode 100644
index 00000000000..27008616376
--- /dev/null
+++ b/src/test/modules/test_tam_options/test_tam_options.c
@@ -0,0 +1,66 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tam_options.c
+ *		Test code for table access method reloptions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_tam_options/test_tam_options.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/reloptions.h"
+#include "access/tableam.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(heap_alter_options_tam_handler);
+
+/* Alternative relation options for heap */
+typedef struct
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	bool		enable_parallel; /* enable parallel scans? */
+} HeapAlterRdOptions;
+
+static bytea *
+heap_alter_reloptions(char relkind, Datum reloptions,
+					  CommonRdOptions *common, bool validate)
+{
+	local_relopts	relopts;
+	HeapAlterRdOptions *result;
+
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	init_local_reloptions(&relopts, sizeof(HeapAlterRdOptions));
+	add_local_bool_reloption(&relopts, "enable_parallel",
+							 "enable parallel scan", true,
+							 offsetof(HeapAlterRdOptions, enable_parallel));
+
+	result = (HeapAlterRdOptions *) build_local_reloptions(&relopts,
+														   reloptions,
+														   validate);
+
+	if (result != NULL && common != NULL)
+	{
+		common->parallel_workers = result->enable_parallel ? -1 : 0;
+	}
+
+	return (bytea *) result;
+}
+
+Datum
+heap_alter_options_tam_handler(PG_FUNCTION_ARGS)
+{
+	static TableAmRoutine tam_routine;
+
+	tam_routine = *GetHeapamTableAmRoutine();
+	tam_routine.reloptions = heap_alter_reloptions;
+
+	PG_RETURN_POINTER(&tam_routine);
+}
diff --git a/src/test/modules/test_tam_options/test_tam_options.control b/src/test/modules/test_tam_options/test_tam_options.control
new file mode 100644
index 00000000000..dd6682edcdf
--- /dev/null
+++ b/src/test/modules/test_tam_options/test_tam_options.control
@@ -0,0 +1,4 @@
+comment = 'Test code for table access method reloptions'
+default_version = '1.0'
+module_pathname = '$libdir/test_tam_options'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e608fd39d96..c40d047a2c2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -440,6 +440,7 @@ CommentStmt
 CommitTimestampEntry
 CommitTimestampShared
 CommonEntry
+CommonRdOptions
 CommonTableExpr
 CompareScalarsContext
 CompiledExprState
@@ -1129,6 +1130,7 @@ HeadlineParsedText
 HeadlineWordEntry
 HeapCheckContext
 HeapPageFreeze
+HeapRdOptions
 HeapScanDesc
 HeapTuple
 HeapTupleData
@@ -2717,7 +2719,6 @@ StatsElem
 StatsExtInfo
 StdAnalyzeData
 StdRdOptIndexCleanup
-StdRdOptions
 Step
 StopList
 StrategyNumber
-- 
2.39.3 (Apple Git-145)

#45Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#44)
Re: Table AM Interface Enhancements

Hi, Alexander!

On Sun, 7 Apr 2024 at 07:33, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Hi, Pavel!

On Fri, Apr 5, 2024 at 6:58 PM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

On Tue, 2 Apr 2024 at 19:17, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions. I would rather introduce a new data
structure with the table options that need to be accessed outside of
the table AM. Then reloptions will be a black box only directly used in
the table AM, while the table AM has freedom over what to store in
reloptions and how to calculate externally-visible options. What do you think?

Hi Alexander!

I agree with all of that. It will take some refactoring to get there,
though.

One idea is to store StdRdOptions like normal, but if an unrecognized
option is found, ask the table AM if it understands the option. In that
case I think we'd just use a different field in pg_class so that it can
use whatever format it wants to represent its options.

Regards,
Jeff Davis

I tried to rework the patch regarding table AM according to the input from
Alexander and Jeff.

It splits table reloptions into two categories:
- common for all tables (stored in a fixed-size structure that can be
accessed from outside)
- table-AM specific (variable size, parsed and accessed by the access
method only; see the sketch below)
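
For illustration, here is how that split ends up looking in the patch; both
struct definitions are condensed from its rel.h changes. The fixed-size
common part is kept in the relcache entry, while heap simply embeds it in
its private rd_options struct; other AMs may lay out their rd_options
however they like as long as they fill CommonRdOptions in their reloptions
callback:

/* Fixed-size options a table AM exposes to autovacuum, the planner, etc. */
typedef struct CommonRdOptions
{
	AutoVacOpts autovacuum;		/* autovacuum-related options */
	bool		user_catalog_table; /* use as an additional catalog relation */
	int			parallel_workers;	/* max number of parallel workers */
	StdRdOptIndexCleanup vacuum_index_cleanup;	/* controls index vacuuming */
	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
} CommonRdOptions;

/* Heap's private rd_options; contents are not accessed outside the AM. */
typedef struct HeapRdOptions
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	CommonRdOptions common;		/* the externally visible part */
	int			fillfactor;		/* page fill factor in percent (0..100) */
	int			toast_tuple_target; /* target for tuple toasting */
} HeapRdOptions;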

Thank you for your work. Please check the revised patch.

It makes CommonRdOptions a separate data structure, not directly
involved in parsing the reloptions. Instead, the table AM can fill it
based on its reloptions or calculate it some other way. The patch comes
with a test module providing a heap-based table AM. This table AM has
an "enable_parallel" reloption, which is used as the basis for setting
the value of CommonRdOptions.parallel_workers.
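
To make the mechanism concrete, here is the core of that test module's
callback, lightly condensed from test_tam_options.c in the patch (the
Asserts and the handler function are omitted). It declares one AM-specific
boolean reloption and exposes its effect through CommonRdOptions instead of
relying on a shared struct layout:

/* AM-private reloptions struct; the layout is entirely up to the AM. */
typedef struct
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	bool		enable_parallel;	/* enable parallel scans? */
} HeapAlterRdOptions;

static bytea *
heap_alter_reloptions(char relkind, Datum reloptions,
					  CommonRdOptions *common, bool validate)
{
	local_relopts relopts;
	HeapAlterRdOptions *result;

	/* declare the AM-specific option and parse the reloptions array */
	init_local_reloptions(&relopts, sizeof(HeapAlterRdOptions));
	add_local_bool_reloption(&relopts, "enable_parallel",
							 "enable parallel scan", true,
							 offsetof(HeapAlterRdOptions, enable_parallel));
	result = (HeapAlterRdOptions *) build_local_reloptions(&relopts,
															reloptions,
															validate);

	/* expose the externally visible consequence of the option */
	if (result != NULL && common != NULL)
		common->parallel_workers = result->enable_parallel ? -1 : 0;

	return (bytea *) result;
}

At the SQL level the new option then behaves like any other reloption, e.g.
ALTER TABLE test SET (enable_parallel = off), which the regression test uses
to check that parallel plans are no longer chosen for the table.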

To me, patch v10 looks good.

I think the comment for RelationData now applies only to rd_options, not
to rd_common_options.

NULLs means "use defaults".

Regards,
Pavel

#46Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#45)
1 attachment(s)
Re: Table AM Interface Enhancements

Hi, Alexander!

On Sun, 7 Apr 2024 at 12:34, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

Hi, Alexander!

On Sun, 7 Apr 2024 at 07:33, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Hi, Pavel!

On Fri, Apr 5, 2024 at 6:58 PM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

On Tue, 2 Apr 2024 at 19:17, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:

I don't like the idea that every custom table AM's reloptions should
begin with StdRdOptions. I would rather introduce a new data
structure with the table options that need to be accessed outside of
the table AM. Then reloptions will be a black box only directly used in
the table AM, while the table AM has freedom over what to store in
reloptions and how to calculate externally-visible options. What do you think?

Hi Alexander!

I agree with all of that. It will take some refactoring to get there,
though.

One idea is to store StdRdOptions like normal, but if an unrecognized
option is found, ask the table AM if it understands the option. In that
case I think we'd just use a different field in pg_class so that it can
use whatever format it wants to represent its options.

Regards,
Jeff Davis

I tried to rework the patch regarding table AM according to the input
from Alexander and Jeff.

It splits table reloptions into two categories:
- common for all tables (stored in a fixed-size structure that can be
accessed from outside)
- table-AM specific (variable size, parsed and accessed by the access
method only)

Thank you for your work. Please check the revised patch.

It makes CommonRdOptions a separate data structure, not directly
involved in parsing the reloptions. Instead, the table AM can fill it
based on its reloptions or calculate it some other way. The patch comes
with a test module providing a heap-based table AM. This table AM has
an "enable_parallel" reloption, which is used as the basis for setting
the value of CommonRdOptions.parallel_workers.

To me, patch v10 looks good.

I think the comment for RelationData now applies only to rd_options, not
to rd_common_options.

NULLs means "use defaults".

Regards,
Pavel

I made minor changes to the patch. Please find v11 attached.

Regards,
Pavel.

Attachments:

v11-0001-Custom-reloptions-for-table-AM.patchapplication/octet-stream; name=v11-0001-Custom-reloptions-for-table-AM.patchDownload
From 30890f0f11070376631db2bb3a7d96aa30209da8 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Sat, 6 Apr 2024 14:01:50 +0300
Subject: [PATCH v11] Custom reloptions for table AM

Let the table AM define custom reloptions for its tables.  This allows
specifying AM-specific parameters via the WITH clause when creating a table.

The reloptions that need to be accessible outside of the table AM are now
extracted into the CommonRdOptions data structure.  Depending on the table AM,
these options may be set directly by the user or derived in some other way.

The new test module test_tam_options exercises the ability to define custom
reloptions and to derive CommonRdOptions fields from them.

The code may use some parts from prior work by Hao Wu.

Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Pavel Borisov, Matthias van de Meent, Jeff Davis
---
 src/backend/access/common/reloptions.c        | 121 +++++++++-----
 src/backend/access/heap/heapam.c              |   4 +-
 src/backend/access/heap/heapam_handler.c      |  13 ++
 src/backend/access/heap/heaptoast.c           |   9 +-
 src/backend/access/heap/hio.c                 |   4 +-
 src/backend/access/heap/pruneheap.c           |   4 +-
 src/backend/access/heap/rewriteheap.c         |   4 +-
 src/backend/access/table/tableam.c            |   2 +-
 src/backend/access/table/tableamapi.c         |  25 +++
 src/backend/commands/createas.c               |  13 +-
 src/backend/commands/tablecmds.c              |  63 +++++---
 src/backend/commands/vacuum.c                 |  10 +-
 src/backend/optimizer/util/plancat.c          |   2 +-
 src/backend/postmaster/autovacuum.c           |  12 +-
 src/backend/tcop/utility.c                    |  20 ++-
 src/backend/utils/cache/relcache.c            |  33 +++-
 src/include/access/reloptions.h               |  10 +-
 src/include/access/tableam.h                  |  52 +++++-
 src/include/utils/rel.h                       | 148 +++++++++---------
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 src/test/modules/test_tam_options/.gitignore  |   4 +
 src/test/modules/test_tam_options/Makefile    |  23 +++
 .../expected/test_tam_options.out             |  36 +++++
 src/test/modules/test_tam_options/meson.build |  33 ++++
 .../test_tam_options/sql/test_tam_options.sql |  25 +++
 .../test_tam_options--1.0.sql                 |  12 ++
 .../test_tam_options/test_tam_options.c       |  66 ++++++++
 .../test_tam_options/test_tam_options.control |   4 +
 src/tools/pgindent/typedefs.list              |   3 +-
 30 files changed, 592 insertions(+), 165 deletions(-)
 create mode 100644 src/test/modules/test_tam_options/.gitignore
 create mode 100644 src/test/modules/test_tam_options/Makefile
 create mode 100644 src/test/modules/test_tam_options/expected/test_tam_options.out
 create mode 100644 src/test/modules/test_tam_options/meson.build
 create mode 100644 src/test/modules/test_tam_options/sql/test_tam_options.sql
 create mode 100644 src/test/modules/test_tam_options/test_tam_options--1.0.sql
 create mode 100644 src/test/modules/test_tam_options/test_tam_options.c
 create mode 100644 src/test/modules/test_tam_options/test_tam_options.control

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d8559..c1de092a42 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -24,6 +24,7 @@
 #include "access/nbtree.h"
 #include "access/reloptions.h"
 #include "access/spgist_private.h"
+#include "access/tableam.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
@@ -44,7 +45,7 @@
  * value, upper and lower bounds (if applicable); for strings, consider a
  * validation routine.
  * (ii) add a record below (or use add_<type>_reloption).
- * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
+ * (iii) add it to the appropriate options struct (perhaps HeapRdOptions)
  * (iv) add it to the appropriate handling routine (perhaps
  * default_reloptions)
  * (v) make sure the lock level is set correctly for that operation
@@ -1374,10 +1375,16 @@ untransformRelOptions(Datum options)
  * tupdesc is pg_class' tuple descriptor.  amoptions is a pointer to the index
  * AM's options parser function in the case of a tuple corresponding to an
  * index, or NULL otherwise.
+ *
+ * If a 'common' pointer is provided, the corresponding struct will be
+ * filled with the options that the table AM exposes for external usage.
+ * The caller must fill it with defaults before passing it here.
  */
+
 bytea *
 extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-				  amoptions_function amoptions)
+				  const TableAmRoutine *tableam, amoptions_function amoptions,
+				  CommonRdOptions *common)
 {
 	bytea	   *options;
 	bool		isnull;
@@ -1399,7 +1406,8 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = tableam_reloptions(tableam, classForm->relkind,
+										 datum, common, false);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			options = partitioned_table_reloptions(datum, false);
@@ -1695,7 +1703,7 @@ parse_one_reloption(relopt_value *option, char *text_str, int text_len,
  * Given the result from parseRelOptions, allocate a struct that's of the
  * specified base size plus any extra space that's needed for string variables.
  *
- * "base" should be sizeof(struct) of the reloptions struct (StdRdOptions or
+ * "base" should be sizeof(struct) of the reloptions struct (HeapRdOptions or
  * equivalent).
  */
 static void *
@@ -1832,59 +1840,95 @@ fillRelOptions(void *rdopts, Size basesize,
 
 
 /*
- * Option parser for anything that uses StdRdOptions.
+ * Option parser for anything that uses HeapRdOptions.
  */
-bytea *
+static bytea *
 default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 {
 	static const relopt_parse_elt tab[] = {
-		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
+		{"fillfactor", RELOPT_TYPE_INT, offsetof(HeapRdOptions, fillfactor)},
 		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, enabled)},
 		{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_threshold)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_threshold)},
 		{"autovacuum_vacuum_insert_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_threshold)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_ins_threshold)},
 		{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_cost_limit)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_max_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, freeze_max_age)},
 		{"autovacuum_freeze_table_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_table_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, freeze_table_age)},
 		{"autovacuum_multixact_freeze_min_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_min_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, multixact_freeze_min_age)},
 		{"autovacuum_multixact_freeze_max_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_max_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, multixact_freeze_max_age)},
 		{"autovacuum_multixact_freeze_table_age", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_table_age)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, multixact_freeze_table_age)},
 		{"log_autovacuum_min_duration", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, log_min_duration)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, log_min_duration)},
 		{"toast_tuple_target", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, toast_tuple_target)},
+		offsetof(HeapRdOptions, toast_tuple_target)},
 		{"autovacuum_vacuum_cost_delay", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_delay)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_cost_delay)},
 		{"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_scale_factor)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_scale_factor)},
 		{"autovacuum_vacuum_insert_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_ins_scale_factor)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, vacuum_ins_scale_factor)},
 		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_scale_factor)},
+			offsetof(HeapRdOptions, common) +
+			offsetof(CommonRdOptions, autovacuum) +
+		offsetof(AutoVacOpts, analyze_scale_factor)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, user_catalog_table)},
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, user_catalog_table)},
 		{"parallel_workers", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, parallel_workers)},
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, parallel_workers)},
 		{"vacuum_index_cleanup", RELOPT_TYPE_ENUM,
-		offsetof(StdRdOptions, vacuum_index_cleanup)},
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, vacuum_index_cleanup)},
 		{"vacuum_truncate", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, vacuum_truncate)}
+			offsetof(HeapRdOptions, common) +
+		offsetof(CommonRdOptions, vacuum_truncate)}
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate, kind,
-									  sizeof(StdRdOptions),
+									  sizeof(HeapRdOptions),
 									  tab, lengthof(tab));
 }
 
@@ -2016,26 +2060,33 @@ view_reloptions(Datum reloptions, bool validate)
  * Parse options for heaps, views and toast tables.
  */
 bytea *
-heap_reloptions(char relkind, Datum reloptions, bool validate)
+heap_reloptions(char relkind, Datum reloptions,
+				CommonRdOptions *common, bool validate)
 {
-	StdRdOptions *rdopts;
+	HeapRdOptions *rdopts;
 
 	switch (relkind)
 	{
 		case RELKIND_TOASTVALUE:
-			rdopts = (StdRdOptions *)
+			rdopts = (HeapRdOptions *)
 				default_reloptions(reloptions, validate, RELOPT_KIND_TOAST);
 			if (rdopts != NULL)
 			{
 				/* adjust default-only parameters for TOAST relations */
 				rdopts->fillfactor = 100;
-				rdopts->autovacuum.analyze_threshold = -1;
-				rdopts->autovacuum.analyze_scale_factor = -1;
+				rdopts->common.autovacuum.analyze_threshold = -1;
+				rdopts->common.autovacuum.analyze_scale_factor = -1;
 			}
+			if (rdopts != NULL && common != NULL)
+				*common = rdopts->common;
 			return (bytea *) rdopts;
 		case RELKIND_RELATION:
 		case RELKIND_MATVIEW:
-			return default_reloptions(reloptions, validate, RELOPT_KIND_HEAP);
+			rdopts = (HeapRdOptions *)
+				default_reloptions(reloptions, validate, RELOPT_KIND_HEAP);
+			if (rdopts != NULL && common != NULL)
+				*common = rdopts->common;
+			return (bytea *) rdopts;
 		default:
 			/* other relkinds are not supported */
 			return NULL;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2663f52d1a..087bb63d1b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2279,8 +2279,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	Assert(!(options & HEAP_INSERT_NO_LOGICAL));
 
 	needwal = RelationNeedsWAL(relation);
-	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
-												   HEAP_DEFAULT_FILLFACTOR);
+	saveFreeSpace = HeapGetTargetPageFreeSpace(relation,
+											   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Toast and set header data in all the slots */
 	heaptuples = palloc(ntuples * sizeof(HeapTuple));
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a7..4f20522fa9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -23,6 +23,7 @@
 #include "access/heapam.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
+#include "access/reloptions.h"
 #include "access/rewriteheap.h"
 #include "access/syncscan.h"
 #include "access/tableam.h"
@@ -2158,6 +2159,17 @@ heapam_relation_toast_am(Relation rel)
 	return rel->rd_rel->relam;
 }
 
+static bytea *
+heapam_reloptions(char relkind, Datum reloptions,
+				  CommonRdOptions *common, bool validate)
+{
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	return heap_reloptions(relkind, reloptions, common, validate);
+}
+
 
 /* ------------------------------------------------------------------------
  * Planner related callbacks for the heap AM
@@ -2707,6 +2719,7 @@ static const TableAmRoutine heapam_methods = {
 	.relation_needs_toast_table = heapam_relation_needs_toast_table,
 	.relation_toast_am = heapam_relation_toast_am,
 	.relation_fetch_toast_slice = heap_fetch_toast_slice,
+	.reloptions = heapam_reloptions,
 
 	.relation_estimate_size = heapam_estimate_rel_size,
 
diff --git a/src/backend/access/heap/heaptoast.c b/src/backend/access/heap/heaptoast.c
index a420e16530..cffee7e1b7 100644
--- a/src/backend/access/heap/heaptoast.c
+++ b/src/backend/access/heap/heaptoast.c
@@ -32,6 +32,13 @@
 #include "access/toast_internals.h"
 #include "utils/fmgroids.h"
 
+/*
+ * HeapGetToastTupleTarget
+ *      Returns the heap relation's toast_tuple_target.  Note multiple eval of argument!
+ */
+#define HeapGetToastTupleTarget(relation, defaulttarg) \
+		((HeapRdOptions *) (relation)->rd_options ? \
+		((HeapRdOptions *) (relation)->rd_options)->toast_tuple_target : (defaulttarg))
 
 /* ----------
  * heap_toast_delete -
@@ -174,7 +181,7 @@ heap_toast_insert_or_update(Relation rel, HeapTuple newtup, HeapTuple oldtup,
 		hoff += BITMAPLEN(numAttrs);
 	hoff = MAXALIGN(hoff);
 	/* now convert to a limit on the tuple data size */
-	maxDataLen = RelationGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff;
+	maxDataLen = HeapGetToastTupleTarget(rel, TOAST_TUPLE_TARGET) - hoff;
 
 	/*
 	 * Look for attributes with attstorage EXTENDED to compress.  Also find
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7c662cdf46..ed731179b4 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -536,8 +536,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 						len, MaxHeapTupleSize)));
 
 	/* Compute desired extra freespace due to fillfactor option */
-	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
-												   HEAP_DEFAULT_FILLFACTOR);
+	saveFreeSpace = HeapGetTargetPageFreeSpace(relation,
+											   HEAP_DEFAULT_FILLFACTOR);
 
 	/*
 	 * Since pages without tuples can still have line pointers, we consider
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d2eecaf7eb..20142dd214 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -235,8 +235,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * important than sometimes getting a wrong answer in what is after all
 	 * just a heuristic estimate.
 	 */
-	minfree = RelationGetTargetPageFreeSpace(relation,
-											 HEAP_DEFAULT_FILLFACTOR);
+	minfree = HeapGetTargetPageFreeSpace(relation,
+										 HEAP_DEFAULT_FILLFACTOR);
 	minfree = Max(minfree, BLCKSZ / 10);
 
 	if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 473f3aa9be..2bbf121146 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -641,8 +641,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 						len, MaxHeapTupleSize)));
 
 	/* Compute desired extra freespace due to fillfactor option */
-	saveFreeSpace = RelationGetTargetPageFreeSpace(state->rs_new_rel,
-												   HEAP_DEFAULT_FILLFACTOR);
+	saveFreeSpace = HeapGetTargetPageFreeSpace(state->rs_new_rel,
+											   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Now we can check to see if there's enough free space already. */
 	page = (Page) state->rs_buffer;
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 805d222ceb..5d5f0e68fd 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -750,7 +750,7 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 		 * The other branch considers it implicitly by calculating density
 		 * from actual relpages/reltuples statistics.
 		 */
-		fillfactor = RelationGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
+		fillfactor = HeapGetFillFactor(rel, HEAP_DEFAULT_FILLFACTOR);
 
 		tuple_width = get_rel_data_width(rel, attr_widths);
 		tuple_width += overhead_bytes_per_tuple;
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf..d9e23ef317 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -13,9 +13,11 @@
 
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "catalog/pg_am.h"
 #include "commands/defrem.h"
 #include "miscadmin.h"
 #include "utils/guc_hooks.h"
+#include "utils/syscache.h"
 
 
 /*
@@ -98,6 +100,29 @@ GetTableAmRoutine(Oid amhandler)
 	return routine;
 }
 
+/*
+ * GetTableAmRoutineByAmOid
+ *		Given a table access method OID, get its TableAmRoutine struct, which
+ *		will be palloc'd in the caller's memory context.
+ */
+const TableAmRoutine *
+GetTableAmRoutineByAmOid(Oid amoid)
+{
+	HeapTuple	ht_am;
+	Form_pg_am	amrec;
+	const TableAmRoutine *tableam = NULL;
+
+	ht_am = SearchSysCache1(AMOID, ObjectIdGetDatum(amoid));
+	if (!HeapTupleIsValid(ht_am))
+		elog(ERROR, "cache lookup failed for access method %u",
+			 amoid);
+	amrec = (Form_pg_am) GETSTRUCT(ht_am);
+
+	tableam = GetTableAmRoutine(amrec->amhandler);
+	ReleaseSysCache(ht_am);
+	return tableam;
+}
+
 /* check_hook: validate new default_table_access_method */
 bool
 check_default_table_access_method(char **newval, void **extra, GucSource source)
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index afd3dace07..c5df96e374 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -85,6 +85,9 @@ create_ctas_internal(List *attrList, IntoClause *into)
 	Datum		toast_options;
 	static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
 	ObjectAddress intoRelationAddr;
+	const TableAmRoutine *tableam = NULL;
+	Oid			accessMethodId = InvalidOid;
+	Relation	rel;
 
 	/* This code supports both CREATE TABLE AS and CREATE MATERIALIZED VIEW */
 	is_matview = (into->viewQuery != NULL);
@@ -125,7 +128,15 @@ create_ctas_internal(List *attrList, IntoClause *into)
 										validnsps,
 										true, false);
 
-	(void) heap_reloptions(RELKIND_TOASTVALUE, toast_options, true);
+	rel = relation_open(intoRelationAddr.objectId, AccessShareLock);
+	accessMethodId = table_relation_toast_am(rel);
+	relation_close(rel, AccessShareLock);
+
+	if (OidIsValid(accessMethodId))
+	{
+		tableam = GetTableAmRoutineByAmOid(accessMethodId);
+		(void) tableam_reloptions(tableam, RELKIND_TOASTVALUE, toast_options, NULL, true);
+	}
 
 	NewRelationCreateToastTable(intoRelationAddr.objectId, toast_options);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 582890a302..3c74eef3e6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -720,6 +720,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	ObjectAddress address;
 	LOCKMODE	parentLockmode;
 	Oid			accessMethodId = InvalidOid;
+	const TableAmRoutine *tableam = NULL;
 
 	/*
 	 * Truncate relname to appropriate length (probably a waste of time, as
@@ -855,6 +856,28 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 	if (!OidIsValid(ownerId))
 		ownerId = GetUserId();
 
+	/*
+	 * For relations with table AM and partitioned tables, select access
+	 * method to use: an explicitly indicated one, or (in the case of a
+	 * partitioned table) the parent's, if it has one.
+	 */
+	if (stmt->accessMethod != NULL)
+	{
+		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
+		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
+	}
+	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		if (stmt->partbound)
+		{
+			Assert(list_length(inheritOids) == 1);
+			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
+		}
+
+		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
+			accessMethodId = get_table_am_oid(default_table_access_method, false);
+	}
+
 	/*
 	 * Parse and validate reloptions, if any.
 	 */
@@ -863,6 +886,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	switch (relkind)
 	{
+		case RELKIND_RELATION:
+		case RELKIND_TOASTVALUE:
+		case RELKIND_MATVIEW:
+			tableam = GetTableAmRoutineByAmOid(accessMethodId);
+			(void) tableam_reloptions(tableam, relkind, reloptions, NULL, true);
+			break;
 		case RELKIND_VIEW:
 			(void) view_reloptions(reloptions, true);
 			break;
@@ -870,7 +899,12 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 			(void) partitioned_table_reloptions(reloptions, true);
 			break;
 		default:
-			(void) heap_reloptions(relkind, reloptions, true);
+			if (OidIsValid(accessMethodId))
+			{
+				tableam = GetTableAmRoutineByAmOid(accessMethodId);
+				(void) tableam_reloptions(tableam, relkind, reloptions, NULL, true);
+			}
+			break;
 	}
 
 	if (stmt->ofTypename)
@@ -962,28 +996,6 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		}
 	}
 
-	/*
-	 * For relations with table AM and partitioned tables, select access
-	 * method to use: an explicitly indicated one, or (in the case of a
-	 * partitioned table) the parent's, if it has one.
-	 */
-	if (stmt->accessMethod != NULL)
-	{
-		Assert(RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE);
-		accessMethodId = get_table_am_oid(stmt->accessMethod, false);
-	}
-	else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		if (stmt->partbound)
-		{
-			Assert(list_length(inheritOids) == 1);
-			accessMethodId = get_rel_relam(linitial_oid(inheritOids));
-		}
-
-		if (RELKIND_HAS_TABLE_AM(relkind) && !OidIsValid(accessMethodId))
-			accessMethodId = get_table_am_oid(default_table_access_method, false);
-	}
-
 	/*
 	 * Create the relation.  Inherited defaults and constraints are passed in
 	 * for immediate handling --- since they don't need parsing, they can be
@@ -15571,7 +15583,8 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-			(void) heap_reloptions(rel->rd_rel->relkind, newOptions, true);
+			(void) table_reloptions(rel, rel->rd_rel->relkind,
+									newOptions, NULL, true);
 			break;
 		case RELKIND_PARTITIONED_TABLE:
 			(void) partitioned_table_reloptions(newOptions, true);
@@ -15684,7 +15697,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 										 defList, "toast", validnsps, false,
 										 operation == AT_ResetRelOptions);
 
-		(void) heap_reloptions(RELKIND_TOASTVALUE, newOptions, true);
+		(void) table_reloptions(rel, RELKIND_TOASTVALUE, newOptions, NULL, true);
 
 		memset(repl_val, 0, sizeof(repl_val));
 		memset(repl_null, false, sizeof(repl_null));
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b589279d49..ba13fc0ad6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2121,11 +2121,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
 	{
 		StdRdOptIndexCleanup vacuum_index_cleanup;
 
-		if (rel->rd_options == NULL)
-			vacuum_index_cleanup = STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO;
-		else
-			vacuum_index_cleanup =
-				((StdRdOptions *) rel->rd_options)->vacuum_index_cleanup;
+		vacuum_index_cleanup =
+			rel->rd_common_options.vacuum_index_cleanup;
 
 		if (vacuum_index_cleanup == STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO)
 			params->index_cleanup = VACOPTVALUE_AUTO;
@@ -2145,8 +2142,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
 	 */
 	if (params->truncate == VACOPTVALUE_UNSPECIFIED)
 	{
-		if (rel->rd_options == NULL ||
-			((StdRdOptions *) rel->rd_options)->vacuum_truncate)
+		if (rel->rd_common_options.vacuum_truncate)
 			params->truncate = VACOPTVALUE_ENABLED;
 		else
 			params->truncate = VACOPTVALUE_DISABLED;
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 6bb53e4346..bea4440f71 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -190,7 +190,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 						  &rel->pages, &rel->tuples, &rel->allvisfrac);
 
 	/* Retrieve the parallel_workers reloption, or -1 if not set. */
-	rel->rel_parallel_workers = RelationGetParallelWorkers(relation, -1);
+	rel->rel_parallel_workers = RelationGetParallelWorkers(relation);
 
 	/*
 	 * Make list of indexes.  Ignore indexes on system catalogs if told to.
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c367ede6f8..7cb79ebced 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2674,19 +2674,21 @@ static AutoVacOpts *
 extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
 {
 	bytea	   *relopts;
+	CommonRdOptions common;
 	AutoVacOpts *av;
 
 	Assert(((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_RELATION ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_MATVIEW ||
 		   ((Form_pg_class) GETSTRUCT(tup))->relkind == RELKIND_TOASTVALUE);
 
-	relopts = extractRelOptions(tup, pg_class_desc, NULL);
-	if (relopts == NULL)
-		return NULL;
+	relopts = extractRelOptions(tup, pg_class_desc,
+								GetTableAmRoutineByAmOid(((Form_pg_class) GETSTRUCT(tup))->relam),
+								NULL, &common);
+	if (relopts)
+		pfree(relopts);
 
 	av = palloc(sizeof(AutoVacOpts));
-	memcpy(av, &(((StdRdOptions *) relopts)->autovacuum), sizeof(AutoVacOpts));
-	pfree(relopts);
+	memcpy(av, &(common.autovacuum), sizeof(AutoVacOpts));
 
 	return av;
 }
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index fa66b8017e..abdd94ad95 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -65,6 +65,7 @@
 #include "utils/acl.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
+#include "access/relation.h"
 
 /* Hook for plugins to get control in ProcessUtility() */
 ProcessUtility_hook_type ProcessUtility_hook = NULL;
@@ -1156,6 +1157,9 @@ ProcessUtilitySlow(ParseState *pstate,
 							CreateStmt *cstmt = (CreateStmt *) stmt;
 							Datum		toast_options;
 							static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
+							const TableAmRoutine *tableam = NULL;
+							Oid			accessMethodId;
+							Relation        rel;
 
 							/* Remember transformed RangeVar for LIKE */
 							table_rv = cstmt->relation;
@@ -1185,9 +1189,19 @@ ProcessUtilitySlow(ParseState *pstate,
 																validnsps,
 																true,
 																false);
-							(void) heap_reloptions(RELKIND_TOASTVALUE,
-												   toast_options,
-												   true);
+
+							rel = relation_open(address.objectId, AccessShareLock);
+							accessMethodId = table_relation_toast_am(rel);
+							relation_close(rel, AccessShareLock);
+
+							if (OidIsValid(accessMethodId))
+							{
+								tableam = GetTableAmRoutineByAmOid(accessMethodId);
+								(void) tableam_reloptions(tableam, RELKIND_TOASTVALUE,
+														  toast_options,
+														  NULL,
+														  true);
+							}
 
 							NewRelationCreateToastTable(address.objectId,
 														toast_options);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3fe74dabd0..0e3b4cc93f 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -33,6 +33,7 @@
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
+#include "access/relation.h"
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -464,9 +465,36 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 {
 	bytea	   *options;
 	amoptions_function amoptsfn;
+	CommonRdOptions *common = &relation->rd_common_options;
+	const TableAmRoutine *tableam = NULL;
 
 	relation->rd_options = NULL;
 
+	/*
+	 * Fill rd_common_options with default values.  These might later be
+	 * changed by extractRelOptions().
+	 */
+	common->autovacuum.enabled = true;
+	common->autovacuum.vacuum_threshold = -1;
+	common->autovacuum.vacuum_ins_threshold = -2;
+	common->autovacuum.analyze_threshold = -1;
+	common->autovacuum.vacuum_cost_limit = -1;
+	common->autovacuum.freeze_min_age = -1;
+	common->autovacuum.freeze_max_age = -1;
+	common->autovacuum.freeze_table_age = -1;
+	common->autovacuum.multixact_freeze_min_age = -1;
+	common->autovacuum.multixact_freeze_max_age = -1;
+	common->autovacuum.multixact_freeze_table_age = -1;
+	common->autovacuum.log_min_duration = -1;
+	common->autovacuum.vacuum_cost_delay = -1;
+	common->autovacuum.vacuum_scale_factor = -1;
+	common->autovacuum.vacuum_ins_scale_factor = -1;
+	common->autovacuum.analyze_scale_factor = -1;
+	common->parallel_workers = -1;
+	common->user_catalog_table = false;
+	common->vacuum_index_cleanup = STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO;
+	common->vacuum_truncate = true;
+
 	/*
 	 * Look up any AM-specific parse function; fall out if relkind should not
 	 * have options.
@@ -478,6 +506,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 		case RELKIND_VIEW:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
+			tableam = relation->rd_tableam;
 			amoptsfn = NULL;
 			break;
 		case RELKIND_INDEX:
@@ -493,7 +522,9 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
 	 * we might not have any other for pg_class yet (consider executing this
 	 * code for pg_class itself)
 	 */
-	options = extractRelOptions(tuple, GetPgClassDescriptor(), amoptsfn);
+	options = extractRelOptions(tuple, GetPgClassDescriptor(),
+								tableam, amoptsfn,
+								&relation->rd_common_options);
 
 	/*
 	 * Copy parsed data into CacheMemoryContext.  To guard against the
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 81829b8270..342b9cdd6e 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -21,6 +21,7 @@
 
 #include "access/amapi.h"
 #include "access/htup.h"
+#include "access/tableam.h"
 #include "access/tupdesc.h"
 #include "nodes/pg_list.h"
 #include "storage/lock.h"
@@ -224,7 +225,9 @@ extern Datum transformRelOptions(Datum oldOptions, List *defList,
 								 bool acceptOidsOff, bool isReset);
 extern List *untransformRelOptions(Datum options);
 extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
-								amoptions_function amoptions);
+								const TableAmRoutine *tableam,
+								amoptions_function amoptions,
+								CommonRdOptions *common);
 extern void *build_reloptions(Datum reloptions, bool validate,
 							  relopt_kind kind,
 							  Size relopt_struct_size,
@@ -233,9 +236,8 @@ extern void *build_reloptions(Datum reloptions, bool validate,
 extern void *build_local_reloptions(local_relopts *relopts, Datum options,
 									bool validate);
 
-extern bytea *default_reloptions(Datum reloptions, bool validate,
-								 relopt_kind kind);
-extern bytea *heap_reloptions(char relkind, Datum reloptions, bool validate);
+extern bytea *heap_reloptions(char relkind, Datum reloptions,
+							  CommonRdOptions *common, bool validate);
 extern bytea *view_reloptions(Datum reloptions, bool validate);
 extern bytea *partitioned_table_reloptions(Datum reloptions, bool validate);
 extern bytea *index_reloptions(amoptions_function amoptions, Datum reloptions,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index be198fa315..c38e83fe74 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -746,6 +746,34 @@ typedef struct TableAmRoutine
 											   int32 slicelength,
 											   struct varlena *result);
 
+	/*
+	 * This callback parses and validates the reloptions array for a table.
+	 *
+	 * This is called only when a non-null reloptions array exists for the
+	 * table.  'reloptions' is a text array containing entries of the form
+	 * "name=value".  The function should construct a bytea value, which will
+	 * be copied into the rd_options field of the table's relcache entry. The
+	 * data contents of the bytea value are open for the access method to
+	 * define.
+	 *
+	 * '*common' holds the common values that the table access method
+	 * exposes to autovacuum, the query planner, and others.  The caller
+	 * should fill them with default values.  The table access method may
+	 * modify them based on the options specified by the user.
+	 *
+	 * When 'validate' is true, the function should report a suitable error
+	 * message if any of the options are unrecognized or have invalid values;
+	 * when 'validate' is false, invalid entries should be silently ignored.
+	 * ('validate' is false when loading options already stored in pg_catalog;
+	 * an invalid entry could only be found if the access method has changed
+	 * its rules for options, and in that case ignoring obsolete entries is
+	 * appropriate.)
+	 *
+	 * It is OK to return NULL if default behavior is wanted.
+	 */
+	bytea	   *(*reloptions) (char relkind, Datum reloptions,
+							   CommonRdOptions *common, bool validate);
+
 
 	/* ------------------------------------------------------------------------
 	 * Planner related functions.
@@ -1908,7 +1936,7 @@ table_relation_needs_toast_table(Relation rel)
 static inline Oid
 table_relation_toast_am(Relation rel)
 {
-	return rel->rd_tableam->relation_toast_am(rel);
+	return rel->rd_tableam ? rel->rd_tableam->relation_toast_am(rel) : InvalidOid;
 }
 
 /*
@@ -1945,6 +1973,27 @@ table_relation_fetch_toast_slice(Relation toastrel, Oid valueid,
 													 result);
 }
 
+/*
+ * Parse table options without knowledge of particular table.
+ */
+static inline bytea *
+tableam_reloptions(const TableAmRoutine *tableam, char relkind,
+				   Datum reloptions, CommonRdOptions *common, bool validate)
+{
+	return tableam->reloptions(relkind, reloptions, common, validate);
+}
+
+/*
+ * Parse options for given table.
+ */
+static inline bytea *
+table_reloptions(Relation rel, char relkind,
+				 Datum reloptions, CommonRdOptions *common, bool validate)
+{
+	return tableam_reloptions(rel->rd_tableam, relkind, reloptions,
+							  common, validate);
+}
+
 
 /* ----------------------------------------------------------------------------
  * Planner related functionality
@@ -2123,6 +2172,7 @@ extern void table_block_relation_estimate_size(Relation rel,
  */
 
 extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
+extern const TableAmRoutine *GetTableAmRoutineByAmOid(Oid amoid);
 
 /* ----------------------------------------------------------------------------
  * Functions in heapam_handler.c
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f25f769af2..4e6c4074bc 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -48,6 +48,52 @@ typedef struct LockInfoData
 
 typedef LockInfoData *LockInfo;
 
+ /* autovacuum-related reloptions. */
+typedef struct AutoVacOpts
+{
+	bool		enabled;
+	int			vacuum_threshold;
+	int			vacuum_ins_threshold;
+	int			analyze_threshold;
+	int			vacuum_cost_limit;
+	int			freeze_min_age;
+	int			freeze_max_age;
+	int			freeze_table_age;
+	int			multixact_freeze_min_age;
+	int			multixact_freeze_max_age;
+	int			multixact_freeze_table_age;
+	int			log_min_duration;
+	float8		vacuum_cost_delay;
+	float8		vacuum_scale_factor;
+	float8		vacuum_ins_scale_factor;
+	float8		analyze_scale_factor;
+} AutoVacOpts;
+
+/* StdRdOptions->vacuum_index_cleanup values */
+typedef enum StdRdOptIndexCleanup
+{
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO = 0,
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_OFF,
+	STDRD_OPTION_VACUUM_INDEX_CLEANUP_ON,
+} StdRdOptIndexCleanup;
+
+/*
+ * CommonRdOptions
+ *		Contents of rd_common_options for tables.  It contains the options
+ *		that the table access method exposes to autovacuum, the query planner,
+ *		and others.  At the table AM's discretion, these options may be taken
+ *		directly from user-specified reloptions or calculated in some way.
+ */
+typedef struct CommonRdOptions
+{
+	AutoVacOpts autovacuum;		/* autovacuum-related options */
+	bool		user_catalog_table; /* use as an additional catalog relation */
+	int			parallel_workers;	/* max number of parallel workers */
+	StdRdOptIndexCleanup vacuum_index_cleanup;	/* controls index vacuuming */
+	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
+} CommonRdOptions;
+
+
 /*
  * Here are the contents of a relation cache entry.
  */
@@ -168,11 +214,19 @@ typedef struct RelationData
 	PublicationDesc *rd_pubdesc;	/* publication descriptor, or NULL */
 
 	/*
-	 * rd_options is set whenever rd_rel is loaded into the relcache entry.
-	 * Note that you can NOT look into rd_rel for this data.  NULL means "use
-	 * defaults".
+	 * rd_options and rd_common_options are set whenever rd_rel is loaded into
+	 * the relcache entry. Note that you can NOT look into rd_rel for this
+	 * data.
+	 */
+	CommonRdOptions rd_common_options;	/* options that the table AM exposes
+										 * for external usage */
+
+	/*
+	 * AM-specific part of pg_class.reloptions, parsed into a table-AM-specific
+	 * structure (e.g. struct HeapRdOptions).  Contents are not to be accessed
+	 * outside of the table AM.  NULL means "use defaults".
 	 */
-	bytea	   *rd_options;		/* parsed pg_class.reloptions */
+	bytea	   *rd_options;
 
 	/*
 	 * Oid of the handler for this relation. For an index this is a function
@@ -297,88 +351,42 @@ typedef struct ForeignKeyCacheInfo
 	Oid			conpfeqop[INDEX_MAX_KEYS] pg_node_attr(array_size(nkeys));
 } ForeignKeyCacheInfo;
 
-
 /*
- * StdRdOptions
- *		Standard contents of rd_options for heaps.
- *
- * RelationGetFillFactor() and RelationGetTargetPageFreeSpace() can only
- * be applied to relations that use this format or a superset for
- * private options data.
+ * HeapRdOptions
+ *		Contents of rd_options specific for heap tables.
  */
- /* autovacuum-related reloptions. */
-typedef struct AutoVacOpts
-{
-	bool		enabled;
-	int			vacuum_threshold;
-	int			vacuum_ins_threshold;
-	int			analyze_threshold;
-	int			vacuum_cost_limit;
-	int			freeze_min_age;
-	int			freeze_max_age;
-	int			freeze_table_age;
-	int			multixact_freeze_min_age;
-	int			multixact_freeze_max_age;
-	int			multixact_freeze_table_age;
-	int			log_min_duration;
-	float8		vacuum_cost_delay;
-	float8		vacuum_scale_factor;
-	float8		vacuum_ins_scale_factor;
-	float8		analyze_scale_factor;
-} AutoVacOpts;
-
-/* StdRdOptions->vacuum_index_cleanup values */
-typedef enum StdRdOptIndexCleanup
-{
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_AUTO = 0,
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_OFF,
-	STDRD_OPTION_VACUUM_INDEX_CLEANUP_ON,
-} StdRdOptIndexCleanup;
-
-typedef struct StdRdOptions
+typedef struct HeapRdOptions
 {
 	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	CommonRdOptions common;
 	int			fillfactor;		/* page fill factor in percent (0..100) */
 	int			toast_tuple_target; /* target for tuple toasting */
-	AutoVacOpts autovacuum;		/* autovacuum-related options */
-	bool		user_catalog_table; /* use as an additional catalog relation */
-	int			parallel_workers;	/* max number of parallel workers */
-	StdRdOptIndexCleanup vacuum_index_cleanup;	/* controls index vacuuming */
-	bool		vacuum_truncate;	/* enables vacuum to truncate a relation */
-} StdRdOptions;
+} HeapRdOptions;
 
 #define HEAP_MIN_FILLFACTOR			10
 #define HEAP_DEFAULT_FILLFACTOR		100
 
 /*
- * RelationGetToastTupleTarget
- *		Returns the relation's toast_tuple_target.  Note multiple eval of argument!
+ * HeapGetFillFactor
+ *		Returns the heap relation's fillfactor.  Note multiple eval of argument!
  */
-#define RelationGetToastTupleTarget(relation, defaulttarg) \
+#define HeapGetFillFactor(relation, defaultff) \
 	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->toast_tuple_target : (defaulttarg))
+	 ((HeapRdOptions *) (relation)->rd_options)->fillfactor : (defaultff))
 
 /*
- * RelationGetFillFactor
- *		Returns the relation's fillfactor.  Note multiple eval of argument!
- */
-#define RelationGetFillFactor(relation, defaultff) \
-	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->fillfactor : (defaultff))
-
-/*
- * RelationGetTargetPageUsage
+ * HeapGetTargetPageUsage
  *		Returns the relation's desired space usage per page in bytes.
  */
-#define RelationGetTargetPageUsage(relation, defaultff) \
-	(BLCKSZ * RelationGetFillFactor(relation, defaultff) / 100)
+#define HeapGetTargetPageUsage(relation, defaultff) \
+	(BLCKSZ * HeapGetFillFactor(relation, defaultff) / 100)
 
 /*
- * RelationGetTargetPageFreeSpace
+ * HeapGetTargetPageFreeSpace
  *		Returns the relation's desired freespace per page in bytes.
  */
-#define RelationGetTargetPageFreeSpace(relation, defaultff) \
-	(BLCKSZ * (100 - RelationGetFillFactor(relation, defaultff)) / 100)
+#define HeapGetTargetPageFreeSpace(relation, defaultff) \
+	(BLCKSZ * (100 - HeapGetFillFactor(relation, defaultff)) / 100)
 
 /*
  * RelationIsUsedAsCatalogTable
@@ -386,19 +394,17 @@ typedef struct StdRdOptions
  *		from the pov of logical decoding.  Note multiple eval of argument!
  */
 #define RelationIsUsedAsCatalogTable(relation)	\
-	((relation)->rd_options && \
-	 ((relation)->rd_rel->relkind == RELKIND_RELATION || \
+	(((relation)->rd_rel->relkind == RELKIND_RELATION || \
 	  (relation)->rd_rel->relkind == RELKIND_MATVIEW) ? \
-	 ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)
+	 (relation)->rd_common_options.user_catalog_table : false)
 
 /*
  * RelationGetParallelWorkers
  *		Returns the relation's parallel_workers reloption setting.
  *		Note multiple eval of argument!
  */
-#define RelationGetParallelWorkers(relation, defaultpw) \
-	((relation)->rd_options ? \
-	 ((StdRdOptions *) (relation)->rd_options)->parallel_workers : (defaultpw))
+#define RelationGetParallelWorkers(relation) \
+	((relation)->rd_common_options.parallel_workers)
 
 /* ViewOptions->check_option values */
 typedef enum ViewOptCheckOption
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 256799f520..8f66152539 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -36,6 +36,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_tam_options \
 		  test_tidstore \
 		  unsafe_tests \
 		  worker_spi \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index d8fe059d23..235e342dfa 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -35,6 +35,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_tam_options')
 subdir('test_tidstore')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_tam_options/.gitignore b/src/test/modules/test_tam_options/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_tam_options/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_tam_options/Makefile b/src/test/modules/test_tam_options/Makefile
new file mode 100644
index 0000000000..bd6d4599a1
--- /dev/null
+++ b/src/test/modules/test_tam_options/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tam_options/Makefile
+
+MODULE_big = test_tam_options
+OBJS = \
+	$(WIN32RES) \
+	test_tam_options.o
+PGFILEDESC = "test_tam_options - test code for table access method reloptions"
+
+EXTENSION = test_tam_options
+DATA = test_tam_options--1.0.sql
+
+REGRESS = test_tam_options
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tam_options
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tam_options/expected/test_tam_options.out b/src/test/modules/test_tam_options/expected/test_tam_options.out
new file mode 100644
index 0000000000..c921afcb27
--- /dev/null
+++ b/src/test/modules/test_tam_options/expected/test_tam_options.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_tam_options;
+-- encourage use of parallel plans
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 4;
+CREATE TABLE test (i int) USING heap_alter_options;
+INSERT INTO test SELECT i FROM generate_series(1, 10000) i;
+VACUUM ANALYZE test;
+EXPLAIN (costs off)
+SELECT * FROM test;
+           QUERY PLAN            
+---------------------------------
+ Gather
+   Workers Planned: 4
+   ->  Parallel Seq Scan on test
+(3 rows)
+
+ALTER TABLE test SET (enable_parallel = OFF);
+EXPLAIN (costs off)
+SELECT * FROM test;
+    QUERY PLAN    
+------------------
+ Seq Scan on test
+(1 row)
+
+ALTER TABLE test SET (enable_parallel = ON);
+EXPLAIN (costs off)
+SELECT * FROM test;
+           QUERY PLAN            
+---------------------------------
+ Gather
+   Workers Planned: 4
+   ->  Parallel Seq Scan on test
+(3 rows)
+
diff --git a/src/test/modules/test_tam_options/meson.build b/src/test/modules/test_tam_options/meson.build
new file mode 100644
index 0000000000..d41a32a680
--- /dev/null
+++ b/src/test/modules/test_tam_options/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+test_tam_options_sources = files(
+  'test_tam_options.c',
+)
+
+if host_system == 'windows'
+  test_tam_options_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_tam_options',
+    '--FILEDESC', 'test_tam_options -  test code for table access method reloptions',])
+endif
+
+test_tam_options = shared_module('test_tam_options',
+  test_tam_options_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_tam_options
+
+test_install_data += files(
+  'test_tam_options.control',
+  'test_tam_options--1.0.sql',
+)
+
+tests += {
+  'name': 'test_tam_options',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'test_tam_options',
+    ],
+  },
+}
diff --git a/src/test/modules/test_tam_options/sql/test_tam_options.sql b/src/test/modules/test_tam_options/sql/test_tam_options.sql
new file mode 100644
index 0000000000..4f97504656
--- /dev/null
+++ b/src/test/modules/test_tam_options/sql/test_tam_options.sql
@@ -0,0 +1,25 @@
+CREATE EXTENSION test_tam_options;
+
+-- encourage use of parallel plans
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 4;
+
+CREATE TABLE test (i int) USING heap_alter_options;
+
+INSERT INTO test SELECT i FROM generate_series(1, 10000) i;
+VACUUM ANALYZE test;
+
+EXPLAIN (costs off)
+SELECT * FROM test;
+
+ALTER TABLE test SET (enable_parallel = OFF);
+
+EXPLAIN (costs off)
+SELECT * FROM test;
+
+ALTER TABLE test SET (enable_parallel = ON);
+
+EXPLAIN (costs off)
+SELECT * FROM test;
diff --git a/src/test/modules/test_tam_options/test_tam_options--1.0.sql b/src/test/modules/test_tam_options/test_tam_options--1.0.sql
new file mode 100644
index 0000000000..07569f7b5f
--- /dev/null
+++ b/src/test/modules/test_tam_options/test_tam_options--1.0.sql
@@ -0,0 +1,12 @@
+/* src/test/modules/test_tam_options/test_tam_options--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tam_options" to load this file. \quit
+
+CREATE FUNCTION heap_alter_options_tam_handler(internal)
+RETURNS table_am_handler
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE ACCESS METHOD heap_alter_options TYPE TABLE
+HANDLER heap_alter_options_tam_handler;
diff --git a/src/test/modules/test_tam_options/test_tam_options.c b/src/test/modules/test_tam_options/test_tam_options.c
new file mode 100644
index 0000000000..2700861637
--- /dev/null
+++ b/src/test/modules/test_tam_options/test_tam_options.c
@@ -0,0 +1,66 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tam_options.c
+ *		Test code for table access method reloptions.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_tam_options/test_tam_options.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/reloptions.h"
+#include "access/tableam.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(heap_alter_options_tam_handler);
+
+/* An alternative set of relation options for heap */
+typedef struct
+{
+	int32		vl_len_;		/* varlena header (do not touch directly!) */
+	bool		enable_parallel; /* enable parallel scans? */
+} HeapAlterRdOptions;
+
+static bytea *
+heap_alter_reloptions(char relkind, Datum reloptions,
+					  CommonRdOptions *common, bool validate)
+{
+	local_relopts	relopts;
+	HeapAlterRdOptions *result;
+
+	Assert(relkind == RELKIND_RELATION ||
+		   relkind == RELKIND_TOASTVALUE ||
+		   relkind == RELKIND_MATVIEW);
+
+	init_local_reloptions(&relopts, sizeof(HeapAlterRdOptions));
+	add_local_bool_reloption(&relopts, "enable_parallel",
+							 "enable parallel scan", true,
+							 offsetof(HeapAlterRdOptions, enable_parallel));
+
+	result = (HeapAlterRdOptions *) build_local_reloptions(&relopts,
+														   reloptions,
+														   validate);
+
+	if (result != NULL && common != NULL)
+	{
+		common->parallel_workers = result->enable_parallel ? -1 : 0;
+	}
+
+	return (bytea *) result;
+}
+
+Datum
+heap_alter_options_tam_handler(PG_FUNCTION_ARGS)
+{
+	static TableAmRoutine tam_routine;
+
+	tam_routine = *GetHeapamTableAmRoutine();
+	tam_routine.reloptions = heap_alter_reloptions;
+
+	PG_RETURN_POINTER(&tam_routine);
+}
diff --git a/src/test/modules/test_tam_options/test_tam_options.control b/src/test/modules/test_tam_options/test_tam_options.control
new file mode 100644
index 0000000000..dd6682edcd
--- /dev/null
+++ b/src/test/modules/test_tam_options/test_tam_options.control
@@ -0,0 +1,4 @@
+comment = 'Test code for table access method reloptions'
+default_version = '1.0'
+module_pathname = '$libdir/test_tam_options'
+relocatable = true
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6e0717c8c4..80f8261b7f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -442,6 +442,7 @@ CommentStmt
 CommitTimestampEntry
 CommitTimestampShared
 CommonEntry
+CommonRdOptions
 CommonTableExpr
 CompareScalarsContext
 CompiledExprState
@@ -1131,6 +1132,7 @@ HeadlineParsedText
 HeadlineWordEntry
 HeapCheckContext
 HeapPageFreeze
+HeapRdOptions
 HeapScanDesc
 HeapTuple
 HeapTupleData
@@ -2719,7 +2721,6 @@ StatsElem
 StatsExtInfo
 StdAnalyzeData
 StdRdOptIndexCleanup
-StdRdOptions
 Step
 StopList
 StrategyNumber
-- 
2.39.2 (Apple Git-143)

#47Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#34)
Re: Table AM Interface Enhancements

Hi,

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state after this commit
makes sense. Before this commit, another block-based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because such AMs can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

And even for non-block-based AMs, the new interface basically requires
reimplementing all of analyze.c.

What am I missing here?

Greetings,

Andres Freund

#48Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#47)
1 attachment(s)
Re: Table AM Interface Enhancements

Hi,

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state after this commit
makes sense. Before this commit, another block-based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because such AMs can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

I was under the impression that there are not many out-of-core table
AMs with non-dummy analyze implementations. And even if there are
some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose keeping both options
open: turn back the block-level API for analyze, but also let a table
AM implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or would probably just set the new API
method to NULL).
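
To illustrate (just a sketch; the my_am_* names are hypothetical and
the rest of the routine is elided), an existing block-based AM would
simply keep the two block-level callbacks filled and leave the new
method unset:

static const TableAmRoutine my_am_methods = {
	.type = T_TableAmRoutine,

	/* ... scan, modify and DDL callbacks as before ... */

	/* block-level analyze callbacks, as provided by the attached patch */
	.scan_analyze_next_block = my_am_scan_analyze_next_block,
	.scan_analyze_next_tuple = my_am_scan_analyze_next_tuple,

	/*
	 * Leaving relation_analyze NULL means ANALYZE falls back to the
	 * block-based acquire_sample_rows() in analyze.c.
	 */
	.relation_analyze = NULL,
};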

And even for non-block-based AMs, the new interface basically requires
reimplementing all of analyze.c.

A non-block-based AM just needs to provide an alternative implementation
of what acquire_sample_rows() does.  This seems like a reasonable effort
to me, and surely not reimplementing all of analyze.c.
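
Something along these lines is what I have in mind (again just a sketch
with hypothetical my_am_* names; the actual sampling loop is elided):

static int
my_am_acquire_sample_rows(Relation onerel, int elevel,
						  HeapTuple *rows, int targrows,
						  double *totalrows, double *totaldeadrows)
{
	int			numrows = 0;

	/* Walk the AM's native storage here, filling rows[0..targrows-1]. */

	*totalrows = 0;
	*totaldeadrows = 0;
	return numrows;
}

static void
my_am_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
{
	*func = my_am_acquire_sample_rows;
	*totalpages = RelationGetNumberOfBlocks(relation);
}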

------
Regards,
Alexander Korotkov

Attachments:

v1-0001-Turn-back-the-block-level-API-for-relation-analyz.patchapplication/octet-stream; name=v1-0001-Turn-back-the-block-level-API-for-relation-analyz.patchDownload
From 010df6f6c3f9ab07061ee13e2126b2aa4ca128bd Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 8 Apr 2024 02:02:18 +0300
Subject: [PATCH v1] Turn back the block-level API for relation analyze

And keep the new API as well.
---
 src/backend/access/heap/heapam_handler.c | 29 +++-----
 src/backend/access/table/tableamapi.c    |  3 +
 src/backend/commands/analyze.c           | 35 +++++----
 src/include/access/heapam.h              |  9 ---
 src/include/access/tableam.h             | 91 +++++++++++++++++++++++-
 src/include/commands/vacuum.h            |  5 +-
 6 files changed, 125 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..088da910a11 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -51,6 +51,7 @@ static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
 								   TM_FailureData *tmfd);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -1054,15 +1055,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-void
+static bool
 heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 							   BufferAccessStrategy bstrategy)
 {
@@ -1082,19 +1075,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
 										blockno, RBM_NORMAL, bstrategy);
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	/* in heap all blocks can contain tuples, so always return true */
+	return true;
 }
 
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
+static bool
 heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
 							   TupleTableSlot *slot)
@@ -2698,9 +2684,10 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
+	.scan_analyze_next_block = heapam_scan_analyze_next_block,
+	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
-	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index 55b8caeadf2..4aefdc3c385 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -81,6 +81,9 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
+	Assert(routine->relation_analyze != NULL ||
+			(routine->scan_analyze_next_block != NULL &&
+			 routine->scan_analyze_next_tuple != NULL));
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..6e3be0d491d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1103,15 +1103,15 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 }
 
 /*
- * acquire_sample_rows -- acquire a random sample of rows from the heap
+ * acquire_sample_rows -- acquire a random sample of rows from the table
  *
  * Selected rows are returned in the caller-allocated array rows[], which
  * must have at least targrows entries.
  * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the heap,
+ * We also estimate the total numbers of live and dead rows in the table,
  * and return them into *totalrows and *totaldeadrows, respectively.
  *
- * The returned list of tuples is in order by physical position in the heap.
+ * The returned list of tuples is in order by physical position in the table.
  * (We will rely on this later to derive correlation estimates.)
  *
  * As of May 2004 we use a new two-stage method:  Stage one selects up
@@ -1133,7 +1133,7 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
  * look at a statistically unbiased set of blocks, we should get
  * unbiased estimates of the average numbers of live and dead rows per
  * block.  The previous sampling method put too much credence in the row
- * density near the start of the heap.
+ * density near the start of the table.
  */
 static int
 acquire_sample_rows(Relation onerel, int elevel,
@@ -1184,7 +1184,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
 #ifdef USE_PREFETCH
@@ -1214,6 +1214,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{
+		bool		block_accepted;
 		BlockNumber targblock = BlockSampler_Next(&bs);
 #ifdef USE_PREFETCH
 		BlockNumber prefetch_targblock = InvalidBlockNumber;
@@ -1229,19 +1230,29 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
+		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
 
 #ifdef USE_PREFETCH
 
 		/*
 		 * When pre-fetching, after we get a block, tell the kernel about the
 		 * next one we will want, if there's any left.
+		 *
+		 * We want to do this even if the table_scan_analyze_next_block() call
+		 * above decides against analyzing the block it picked.
 		 */
 		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
 			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
 #endif
 
-		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		/*
+		 * Don't analyze if table_scan_analyze_next_block() indicated this
+		 * block is unsuitable for analyzing.
+		 */
+		if (!block_accepted)
+			continue;
+
+		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1291,7 +1302,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
-	heap_endscan(scan);
+	table_endscan(scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
@@ -1363,12 +1374,12 @@ compare_rows(const void *a, const void *b, void *arg)
 }
 
 /*
- * heapam_analyze -- implementation of relation_analyze() table access method
- *					 callback for heap
+ * default_analyze -- implementation of relation_analyze() table access method
+ *					  callback for block table access methods
  */
 void
-heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
-			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+default_analyze(Relation relation, AcquireSampleRowsFunc *func,
+				BlockNumber *totalpages, BufferAccessStrategy bstrategy)
 {
 	*func = acquire_sample_rows;
 	*totalpages = RelationGetNumberOfBlocks(relation);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 48936826bcc..be630620d0d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -412,15 +412,6 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
-/* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
-extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
-										   TransactionId OldestXmin,
-										   double *liverows, double *deadrows,
-										   TupleTableSlot *slot);
-
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index be198fa3158..09aa2c119af 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -668,6 +668,41 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
+	/*
+	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
+	 * with table_beginscan_analyze().  See also
+	 * table_scan_analyze_next_block().
+	 *
+	 * The callback may acquire resources like locks that are held until
+	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
+	 * to hold a lock until all tuples on a block have been analyzed by
+	 * scan_analyze_next_tuple.
+	 *
+	 * The callback can return false if the block is not suitable for
+	 * sampling, e.g. because it's a metapage that could never contain tuples.
+	 *
+	 * XXX: This obviously is primarily suited for block-based AMs. It's not
+	 * clear what a good interface for non block based AMs would be, so there
+	 * isn't one yet.
+	 */
+	bool		(*scan_analyze_next_block) (TableScanDesc scan,
+											BlockNumber blockno,
+											BufferAccessStrategy bstrategy);
+
+	/*
+	 * See table_scan_analyze_next_tuple().
+	 *
+	 * Not every AM might have a meaningful concept of dead rows, in which
+	 * case it's OK to not increment *deadrows - but note that that may
+	 * influence autovacuum scheduling (see comment for relation_vacuum
+	 * callback).
+	 */
+	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
+											TransactionId OldestXmin,
+											double *liverows,
+											double *deadrows,
+											TupleTableSlot *slot);
+
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -992,6 +1027,19 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
+/*
+ * table_beginscan_analyze is an alternative entry point for setting up a
+ * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
+ * the same data structure although the behavior is rather different.
+ */
+static inline TableScanDesc
+table_beginscan_analyze(Relation rel)
+{
+	uint32		flags = SO_TYPE_ANALYZE;
+
+	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
+}
+
 /*
  * End relation scan.
  */
@@ -1725,6 +1773,42 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
+/*
+ * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
+ * started with table_beginscan_analyze().  Note that this routine might
+ * acquire resources like locks that are held until
+ * table_scan_analyze_next_tuple() returns false.
+ *
+ * Returns false if block is unsuitable for sampling, true otherwise.
+ */
+static inline bool
+table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
+							  BufferAccessStrategy bstrategy)
+{
+	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
+															bstrategy);
+}
+
+/*
+ * Iterate over tuples in the block selected with
+ * table_scan_analyze_next_block() (which needs to have returned true, and
+ * this routine may not have returned false for the same block before). If a
+ * tuple that's suitable for sampling is found, true is returned and a tuple
+ * is stored in `slot`.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
+static inline bool
+table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
+							  double *liverows, double *deadrows,
+							  TupleTableSlot *slot)
+{
+	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
+															liverows, deadrows,
+															slot);
+}
+
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1842,8 +1926,11 @@ static inline void
 table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
 					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
 {
-	relation->rd_tableam->relation_analyze(relation, func,
-										   totalpages, bstrategy);
+	if (relation->rd_tableam->relation_analyze)
+		relation->rd_tableam->relation_analyze(relation, func,
+											   totalpages, bstrategy);
+	else
+		default_analyze(relation, func, totalpages, bstrategy);
 }
 
 /* ----------------------------------------------------------------------------
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 9514f8b2fd8..d00f3c19abf 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -393,9 +393,8 @@ extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
-extern void heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
-						   BlockNumber *totalpages,
-						   BufferAccessStrategy bstrategy);
+extern void default_analyze(Relation relation, AcquireSampleRowsFunc *func,
+							BlockNumber *totalpages, BufferAccessStrategy bstrategy);
 
 extern bool std_typanalyze(VacAttrStats *stats);
 
-- 
2.39.3 (Apple Git-145)

#49Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#48)
Re: Table AM Interface Enhancements

On Mon, Apr 8, 2024 at 2:25 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state post this commit
makes sense. Before this commit another block based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because they can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

The attached patch was to illustrate the approach. It surely needs
some polishing.
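
To make the intended usage concrete: under this patch a block-based out-of-core
AM would not need to copy anything from analyze.c.  It could leave
relation_analyze unset (falling back to default_analyze()) and only supply the
per-block/per-tuple callbacks.  A rough sketch, with hypothetical my_* names:

#include "access/tableam.h"

static bool my_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
                                       BufferAccessStrategy bstrategy);
static bool my_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                                       double *liverows, double *deadrows,
                                       TupleTableSlot *slot);

static const TableAmRoutine my_am_methods = {
    /* ... the other callbacks ... */
    .relation_analyze = NULL,   /* NULL => table_relation_analyze() uses default_analyze() */
    .scan_analyze_next_block = my_scan_analyze_next_block,
    .scan_analyze_next_tuple = my_scan_analyze_next_tuple,
};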

------
Regards,
Alexander Korotkov

#50Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#48)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-08 02:25:17 +0300, Alexander Korotkov wrote:

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state post this commit
makes sense. Before this commit another block based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because they can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations.

I know of at least 4 that have some production usage.

And even if there are some, duplicating acquire_sample_rows() isn't a big
deal.

I don't agree. The code has evolved a bunch over time; duplicating it into
various AMs is a bad idea.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think this patch simply hasn't been reviewed even close to carefully enough
and should be reverted. It's IMO too late for a redesign. Sorry for not
looking earlier, I was mostly out sick for the last few months.

I think a dedicated tableam callback for sample acquisition probably makes
sense, but if we want that, we need to provide an easy way for AMs that are
sufficiently block-like to reuse the code, not have two different ways to
implement analyze.

ISTM that ->relation_analyze is quite misleading as a name. For one, it
just sets some callbacks, no? But more importantly, it sounds like it'd
actually allow wrapping the whole analyze process, rather than just the
acquisition of samples.
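
For reference, the heap implementation behind that callback (heapam_analyze,
renamed default_analyze in the patch upthread) boils down to just this, per the
hunks quoted above:

void
default_analyze(Relation relation, AcquireSampleRowsFunc *func,
                BlockNumber *totalpages, BufferAccessStrategy bstrategy)
{
    /* report the sampling function and the relation size; nothing more */
    *func = acquire_sample_rows;
    *totalpages = RelationGetNumberOfBlocks(relation);
    vac_strategy = bstrategy;
}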

Greetings,

Andres Freund

#51Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#48)
Re: Table AM Interface Enhancements

Hi, Alexander and Andres!

On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Hi,

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state post this commit
makes sense. Before this commit another block based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because they can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think that providing both new and old interface functions for block-based
and non-block based custom AMs is an excellent compromise.

The patch v1-0001-Turn-back.. is mainly an undo of the part of 27bc1772fc81
that had turned off _analyze_next_tuple..analyze_next_block for external
callers. If some extensions are already adapted to the old interface
functions, they are free to still use them.

And even for non-block based AMs, the new interface basically requires
reimplementing all of analyze.c.

A non-block based AM needs to just provide an alternative implementation
for what acquire_sample_rows() does. This seems like a reasonable effort
to me, and surely not reimplementing all of analyze.c.

I agree.
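
To illustrate that point, here is a very rough sketch of what such an
alternative sampler could look like for a hypothetical non-block AM (my_*
names are made up; the function follows the current AcquireSampleRowsFunc
signature and would be installed by the AM's relation_analyze callback):

static int
my_acquire_sample_rows(Relation onerel, int elevel,
                       HeapTuple *rows, int targrows,
                       double *totalrows, double *totaldeadrows)
{
    int         numrows = 0;

    /*
     * Walk the AM's own storage (say, an index-organized tree), copying up
     * to targrows visible tuples into rows[], applying reservoir sampling
     * as acquire_sample_rows() does, and counting live/dead rows as it goes.
     */

    *totalrows = 0;         /* estimate of live rows in the relation */
    *totaldeadrows = 0;     /* OK to leave at 0 if the AM has no dead rows */
    return numrows;
}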

Regards,
Pavel Borisov
Supabase

#52Alexander Korotkov
aekorotkov@gmail.com
In reply to: Pavel Borisov (#51)
1 attachment(s)
Re: Table AM Interface Enhancements

On Mon, Apr 8, 2024 at 10:18 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi,

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state post this commit
makes sense. Before this commit another block based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because they can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think that providing both new and old interface functions for block-based and non-block based custom am is an excellent compromise.

The patch v1-0001-Turn-back.. is mainly an undo of part of the 27bc1772fc81 that had turned off _analyze_next_tuple..analyze_next_block for external callers. If some extensions are already adapted to the old interface functions, they are free to still use it.

Please check this. Instead of keeping two APIs, it generalizes
acquire_sample_rows(). The downside is a change of the
AcquireSampleRowsFunc signature, which would need some changes in FDWs
too.
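
To show what the generalization means for a block-based AM other than heap,
here is a rough sketch with hypothetical my_* names: relation_analyze still
points ANALYZE at the shared acquire_sample_rows(), but now hands over the
AM's own block/tuple callbacks through the new void *arg, as heapam_analyze
does in the attached patch.

static bool my_scan_analyze_next_block(struct TableScanDescData *scan,
                                       ReadStream *stream);
static bool my_scan_analyze_next_tuple(struct TableScanDescData *scan,
                                       TransactionId OldestXmin,
                                       double *liverows, double *deadrows,
                                       TupleTableSlot *slot);

static void
my_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
                    BlockNumber *totalpages, BufferAccessStrategy bstrategy,
                    void **arg)
{
    static AcquireSampleRowsArg cb = {
        .scan_analyze_next_block = my_scan_analyze_next_block,
        .scan_analyze_next_tuple = my_scan_analyze_next_tuple,
    };

    *func = acquire_sample_rows;    /* exported by the patch */
    *totalpages = RelationGetNumberOfBlocks(relation);
    *arg = &cb;
}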

------
Regards,
Alexander Korotkov

Attachments:

v2-0001-Generalize-acquire_sample_rows.patchapplication/octet-stream; name=v2-0001-Generalize-acquire_sample_rows.patchDownload
From 4f9abf4e69db12addf6b3023de473d6dc86f3f6d Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 8 Apr 2024 12:30:45 +0300
Subject: [PATCH v2] Generalize acquire_sample_rows()

Reported-by:
Bug:
Discussion:
Author:
Reviewed-by:
Tested-by:
Backpatch-through:
---
 contrib/file_fdw/file_fdw.c         | 13 ++++--
 contrib/postgres_fdw/postgres_fdw.c | 13 ++++--
 src/backend/commands/analyze.c      | 66 +++++++++++++++++++----------
 src/include/access/tableam.h        | 21 +++++++--
 src/include/commands/vacuum.h       | 58 ++++++++++++++++++++++++-
 src/include/foreign/fdwapi.h        |  3 +-
 src/tools/pgindent/typedefs.list    |  1 +
 7 files changed, 138 insertions(+), 37 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 249d82d3a05..9c162c099f2 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -139,7 +139,8 @@ static void fileReScanForeignScan(ForeignScanState *node);
 static void fileEndForeignScan(ForeignScanState *node);
 static bool fileAnalyzeForeignTable(Relation relation,
 									AcquireSampleRowsFunc *func,
-									BlockNumber *totalpages);
+									BlockNumber *totalpages,
+									void **arg);
 static bool fileIsForeignScanParallelSafe(PlannerInfo *root, RelOptInfo *rel,
 										  RangeTblEntry *rte);
 
@@ -162,7 +163,8 @@ static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 						   Cost *startup_cost, Cost *total_cost);
 static int	file_acquire_sample_rows(Relation onerel, int elevel,
 									 HeapTuple *rows, int targrows,
-									 double *totalrows, double *totaldeadrows);
+									 double *totalrows, double *totaldeadrows,
+									 void *arg);
 
 
 /*
@@ -806,7 +808,8 @@ fileEndForeignScan(ForeignScanState *node)
 static bool
 fileAnalyzeForeignTable(Relation relation,
 						AcquireSampleRowsFunc *func,
-						BlockNumber *totalpages)
+						BlockNumber *totalpages,
+						void **arg)
 {
 	char	   *filename;
 	bool		is_program;
@@ -845,6 +848,7 @@ fileAnalyzeForeignTable(Relation relation,
 		*totalpages = 1;
 
 	*func = file_acquire_sample_rows;
+	*arg = NULL;
 
 	return true;
 }
@@ -1122,7 +1126,8 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 static int
 file_acquire_sample_rows(Relation onerel, int elevel,
 						 HeapTuple *rows, int targrows,
-						 double *totalrows, double *totaldeadrows)
+						 double *totalrows, double *totaldeadrows,
+						 void *arg)
 {
 	int			numrows = 0;
 	double		rowstoskip = -1;	/* -1 means not set yet */
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 142dcfc9957..08e53aa99d6 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -400,7 +400,8 @@ static void postgresExecForeignTruncate(List *rels,
 										bool restart_seqs);
 static bool postgresAnalyzeForeignTable(Relation relation,
 										AcquireSampleRowsFunc *func,
-										BlockNumber *totalpages);
+										BlockNumber *totalpages,
+										void **arg);
 static List *postgresImportForeignSchema(ImportForeignSchemaStmt *stmt,
 										 Oid serverOid);
 static void postgresGetForeignJoinPaths(PlannerInfo *root,
@@ -501,7 +502,8 @@ static void process_query_params(ExprContext *econtext,
 static int	postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows,
-										  double *totaldeadrows);
+										  double *totaldeadrows,
+										  void *arg);
 static void analyze_row_processor(PGresult *res, int row,
 								  PgFdwAnalyzeState *astate);
 static void produce_tuple_asynchronously(AsyncRequest *areq, bool fetch);
@@ -4921,7 +4923,8 @@ process_query_params(ExprContext *econtext,
 static bool
 postgresAnalyzeForeignTable(Relation relation,
 							AcquireSampleRowsFunc *func,
-							BlockNumber *totalpages)
+							BlockNumber *totalpages,
+							void **arg)
 {
 	ForeignTable *table;
 	UserMapping *user;
@@ -4931,6 +4934,7 @@ postgresAnalyzeForeignTable(Relation relation,
 
 	/* Return the row-analysis function pointer */
 	*func = postgresAcquireSampleRowsFunc;
+	*arg = NULL;
 
 	/*
 	 * Now we have to get the number of pages.  It's annoying that the ANALYZE
@@ -5057,7 +5061,8 @@ static int
 postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 							  HeapTuple *rows, int targrows,
 							  double *totalrows,
-							  double *totaldeadrows)
+							  double *totaldeadrows,
+							  void *arg)
 {
 	PgFdwAnalyzeState astate;
 	ForeignTable *table;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index da27a13a3f0..d5b2e537361 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -80,7 +80,8 @@ static BufferAccessStrategy vac_strategy;
 
 static void do_analyze_rel(Relation onerel,
 						   VacuumParams *params, List *va_cols,
-						   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
+						   AcquireSampleRowsFunc acquirefunc,
+						   void *acquirefuncArg, BlockNumber relpages,
 						   bool inh, bool in_outer_xact, int elevel);
 static void compute_index_stats(Relation onerel, double totalrows,
 								AnlIndexData *indexdata, int nindexes,
@@ -88,9 +89,6 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
-static int	acquire_sample_rows(Relation onerel, int elevel,
-								HeapTuple *rows, int targrows,
-								double *totalrows, double *totaldeadrows);
 static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
@@ -116,6 +114,7 @@ analyze_rel(Oid relid, RangeVar *relation,
 	Relation	onerel;
 	int			elevel;
 	AcquireSampleRowsFunc acquirefunc = NULL;
+	void	   *acquirefuncArg = NULL;
 	BlockNumber relpages = 0;
 
 	/* Select logging level */
@@ -193,7 +192,7 @@ analyze_rel(Oid relid, RangeVar *relation,
 	{
 		/* Use row acquisition function provided by table AM */
 		table_relation_analyze(onerel, &acquirefunc,
-							   &relpages, vac_strategy);
+							   &relpages, vac_strategy, &acquirefuncArg);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -209,7 +208,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 		if (fdwroutine->AnalyzeForeignTable != NULL)
 			ok = fdwroutine->AnalyzeForeignTable(onerel,
 												 &acquirefunc,
-												 &relpages);
+												 &relpages,
+												 &acquirefuncArg);
 
 		if (!ok)
 		{
@@ -248,15 +248,15 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 * tables, which don't contain any rows.
 	 */
 	if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		do_analyze_rel(onerel, params, va_cols, acquirefunc,
+		do_analyze_rel(onerel, params, va_cols, acquirefunc, acquirefuncArg,
 					   relpages, false, in_outer_xact, elevel);
 
 	/*
 	 * If there are child tables, do recursive ANALYZE.
 	 */
 	if (onerel->rd_rel->relhassubclass)
-		do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
-					   true, in_outer_xact, elevel);
+		do_analyze_rel(onerel, params, va_cols, acquirefunc, acquirefuncArg,
+					   relpages, true, in_outer_xact, elevel);
 
 	/*
 	 * Close source relation now, but keep lock so that no one deletes it
@@ -279,8 +279,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 static void
 do_analyze_rel(Relation onerel, VacuumParams *params,
 			   List *va_cols, AcquireSampleRowsFunc acquirefunc,
-			   BlockNumber relpages, bool inh, bool in_outer_xact,
-			   int elevel)
+			   void *acquirefuncArg, BlockNumber relpages, bool inh,
+			   bool in_outer_xact, int elevel)
 {
 	int			attr_cnt,
 				tcnt,
@@ -526,7 +526,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	else
 		numrows = (*acquirefunc) (onerel, elevel,
 								  rows, targrows,
-								  &totalrows, &totaldeadrows);
+								  &totalrows, &totaldeadrows,
+								  acquirefuncArg);
 
 	/*
 	 * Compute the statistics.  Temporary results during the calculations for
@@ -1149,11 +1150,13 @@ block_sampling_read_stream_next(ReadStream *stream,
  * block.  The previous sampling method put too much credence in the row
  * density near the start of the heap.
  */
-static int
+int
 acquire_sample_rows(Relation onerel, int elevel,
 					HeapTuple *rows, int targrows,
-					double *totalrows, double *totaldeadrows)
+					double *totalrows, double *totaldeadrows,
+					void *arg)
 {
+	AcquireSampleRowsArg *cb = (AcquireSampleRowsArg *) arg;
 	int			numrows = 0;	/* # rows now in reservoir */
 	double		samplerows = 0; /* total # rows collected */
 	double		liverows = 0;	/* # live rows seen */
@@ -1188,7 +1191,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
 	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
@@ -1200,11 +1203,11 @@ acquire_sample_rows(Relation onerel, int elevel,
 										0);
 
 	/* Outer loop over blocks to sample */
-	while (heapam_scan_analyze_next_block(scan, stream))
+	while (cb->scan_analyze_next_block(scan, stream))
 	{
 		vacuum_delay_point();
 
-		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (cb->scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1256,7 +1259,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
-	heap_endscan(scan);
+	table_endscan(scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
@@ -1333,10 +1336,18 @@ compare_rows(const void *a, const void *b, void *arg)
  */
 void
 heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
-			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy,
+			   void **arg)
 {
+	static AcquireSampleRowsArg cb =
+	{
+		.scan_analyze_next_block = heapam_scan_analyze_next_block,
+		.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple
+	};
+
 	*func = acquire_sample_rows;
 	*totalpages = RelationGetNumberOfBlocks(relation);
+	*arg = &cb;
 	vac_strategy = bstrategy;
 }
 
@@ -1349,7 +1360,7 @@ heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
  * We fail and return zero if there are no inheritance children, or if all
  * children are foreign tables that don't support ANALYZE.
  */
-static int
+int
 acquire_inherited_sample_rows(Relation onerel, int elevel,
 							  HeapTuple *rows, int targrows,
 							  double *totalrows, double *totaldeadrows)
@@ -1357,6 +1368,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 	List	   *tableOIDs;
 	Relation   *rels;
 	AcquireSampleRowsFunc *acquirefuncs;
+	void	  **acquirefuncArgs;
 	double	   *relblocks;
 	double		totalblocks;
 	int			numrows,
@@ -1402,6 +1414,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 	rels = (Relation *) palloc(list_length(tableOIDs) * sizeof(Relation));
 	acquirefuncs = (AcquireSampleRowsFunc *)
 		palloc(list_length(tableOIDs) * sizeof(AcquireSampleRowsFunc));
+	acquirefuncArgs = (void **)
+		palloc(list_length(tableOIDs) * sizeof(void *));
 	relblocks = (double *) palloc(list_length(tableOIDs) * sizeof(double));
 	totalblocks = 0;
 	nrels = 0;
@@ -1411,6 +1425,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		Oid			childOID = lfirst_oid(lc);
 		Relation	childrel;
 		AcquireSampleRowsFunc acquirefunc = NULL;
+		void	   *acquirefuncArg = NULL;
 		BlockNumber relpages = 0;
 
 		/* We already got the needed lock */
@@ -1431,7 +1446,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		{
 			/* Use row acquisition function provided by table AM */
 			table_relation_analyze(childrel, &acquirefunc,
-								   &relpages, vac_strategy);
+								   &relpages, vac_strategy,
+								   &acquirefuncArg);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
@@ -1447,7 +1463,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 			if (fdwroutine->AnalyzeForeignTable != NULL)
 				ok = fdwroutine->AnalyzeForeignTable(childrel,
 													 &acquirefunc,
-													 &relpages);
+													 &relpages,
+													 &acquirefuncArg);
 
 			if (!ok)
 			{
@@ -1475,6 +1492,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		has_child = true;
 		rels[nrels] = childrel;
 		acquirefuncs[nrels] = acquirefunc;
+		acquirefuncArgs[nrels] = acquirefuncArg;
 		relblocks[nrels] = (double) relpages;
 		totalblocks += (double) relpages;
 		nrels++;
@@ -1506,6 +1524,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 	{
 		Relation	childrel = rels[i];
 		AcquireSampleRowsFunc acquirefunc = acquirefuncs[i];
+		void	   *acquirefuncArg = acquirefuncArgs[i];
 		double		childblocks = relblocks[i];
 
 		/*
@@ -1544,7 +1563,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 				/* Fetch a random sample of the child's rows */
 				childrows = (*acquirefunc) (childrel, elevel,
 											rows + numrows, childtargrows,
-											&trows, &tdrows);
+											&trows, &tdrows,
+											acquirefuncArg);
 
 				/* We may need to convert from child's rowtype to parent's */
 				if (childrows > 0 &&
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec827ac12bf..f28044d8342 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -692,7 +692,8 @@ typedef struct TableAmRoutine
 	void		(*relation_analyze) (Relation relation,
 									 AcquireSampleRowsFunc *func,
 									 BlockNumber *totalpages,
-									 BufferAccessStrategy bstrategy);
+									 BufferAccessStrategy bstrategy,
+									 void **arg);
 
 
 	/* ------------------------------------------------------------------------
@@ -1020,6 +1021,19 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
+/*
+ * table_beginscan_analyze is an alternative entry point for setting up a
+ * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
+ * the same data structure although the behavior is rather different.
+ */
+static inline TableScanDesc
+table_beginscan_analyze(Relation rel)
+{
+	uint32		flags = SO_TYPE_ANALYZE;
+
+	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
+}
+
 /*
  * End relation scan.
  */
@@ -1868,10 +1882,11 @@ table_index_validate_scan(Relation table_rel,
  */
 static inline void
 table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
-					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy,
+					   void **arg)
 {
 	relation->rd_tableam->relation_analyze(relation, func,
-										   totalpages, bstrategy);
+										   totalpages, bstrategy, arg);
 }
 
 /* ----------------------------------------------------------------------------
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 9514f8b2fd8..8cf66a00dc2 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,9 +21,11 @@
 #include "catalog/pg_class.h"
 #include "catalog/pg_statistic.h"
 #include "catalog/pg_type.h"
+#include "executor/tuptable.h"
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "storage/read_stream.h"
 #include "utils/relcache.h"
 
 /*
@@ -189,7 +191,8 @@ typedef struct VacAttrStats
 typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 									  HeapTuple *rows, int targrows,
 									  double *totalrows,
-									  double *totaldeadrows);
+									  double *totaldeadrows,
+									  void *arg);
 
 /* flag bits for VacuumParams->options */
 #define VACOPT_VACUUM 0x01		/* do VACUUM */
@@ -390,12 +393,63 @@ extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
 extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 
 /* in commands/analyze.c */
+
+struct TableScanDescData;
+
+/* The struct to be passed as '*arg' to acquire_sample_rows */
+typedef struct
+{
+	/*
+	 * Prepare to analyze block from `stream` of `scan`. The scan has been
+	 * started with table_beginscan_analyze().
+	 *
+	 * The callback may acquire resources like locks that are held until
+	 * (*scan_analyze_next_tuple)() returns false. It e.g. can make sense to
+	 * hold a lock until all tuples on a block have been analyzed by
+	 * (*scan_analyze_next_tuple)().
+	 *
+	 * The callback can return false if the block is not suitable for
+	 * sampling, e.g. because it's a metapage that could never contain tuples.
+	 *
+	 * XXX: This obviously is primarily suited for block-based AMs. It's not
+	 * clear what a good interface for non block based AMs would be, so there
+	 * isn't one yet.
+	 */
+	bool		(*scan_analyze_next_block) (struct TableScanDescData *scan,
+											ReadStream *stream);
+
+	/*
+	 * Iterate over tuples in the block selected with
+	 * (*scan_analyze_next_block)() (which needs to have returned true, and
+	 * this routine may not have returned false for the same block before). If
+	 * a tuple that's suitable for sampling is found, true is returned and a
+	 * tuple is stored in `slot`.
+	 *
+	 * *liverows and *deadrows are incremented according to the encountered
+	 * tuples.
+	 *
+	 * Not every AM might have a meaningful concept of dead rows, in which
+	 * case it's OK to not increment *deadrows - but note that that may
+	 * influence autovacuum scheduling (see comment for relation_vacuum
+	 * callback).
+	 */
+	bool		(*scan_analyze_next_tuple) (struct TableScanDescData *scan,
+											TransactionId OldestXmin,
+											double *liverows,
+											double *deadrows,
+											TupleTableSlot *slot);
+} AcquireSampleRowsArg;
+
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
+extern int	acquire_sample_rows(Relation onerel, int elevel,
+								HeapTuple *rows, int targrows,
+								double *totalrows, double *totaldeadrows,
+								void *arg);
 extern void heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
 						   BlockNumber *totalpages,
-						   BufferAccessStrategy bstrategy);
+						   BufferAccessStrategy bstrategy, void **arg);
 
 extern bool std_typanalyze(VacAttrStats *stats);
 
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 0968e0a01ec..ebea559e531 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -151,7 +151,8 @@ typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
-											  BlockNumber *totalpages);
+											  BlockNumber *totalpages,
+											  void **arg);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 											   Oid serverOid);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cb78f11119c..0156ef27aba 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -22,6 +22,7 @@ AclItem
 AclMaskHow
 AclMode
 AclResult
+AcquireSampleRowsArg
 AcquireSampleRowsFunc
 ActionList
 ActiveSnapshotElt
-- 
2.39.3 (Apple Git-145)

#53Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#52)
Re: Table AM Interface Enhancements

On Mon, 8 Apr 2024 at 13:34, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Mon, Apr 8, 2024 at 10:18 AM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov <aekorotkov@gmail.com>

wrote:

Hi,

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund <andres@anarazel.de>

wrote:

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state post this commit
makes sense. Before this commit another block based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because they can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think that providing both new and old interface functions for block-based
and non-block based custom AMs is an excellent compromise.

The patch v1-0001-Turn-back.. is mainly an undo of the part of 27bc1772fc81
that had turned off _analyze_next_tuple..analyze_next_block for external
callers. If some extensions are already adapted to the old interface
functions, they are free to still use them.

Please, check this. Instead of keeping two APIs, it generalizes
acquire_sample_rows(). The downside is change of
AcquireSampleRowsFunc signature, which would need some changes in FDWs
too.

To me, both approaches, v1-0001-Turn-back... and v2-0001-Generalize..., look
good, and so does patch v2.

Pavel.

#54Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Pavel Borisov (#53)
1 attachment(s)
Re: Table AM Interface Enhancements

Hi, Alexander

On Mon, 8 Apr 2024 at 13:59, Pavel Borisov <pashkin.elfe@gmail.com> wrote:

On Mon, 8 Apr 2024 at 13:34, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Mon, Apr 8, 2024 at 10:18 AM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov <aekorotkov@gmail.com>

wrote:

Hi,

On Mon, Apr 8, 2024 at 12:40 AM Andres Freund <andres@anarazel.de>

wrote:

On 2024-03-30 23:33:04 +0200, Alexander Korotkov wrote:

I've pushed 0001, 0002 and 0006.

I briefly looked at 27bc1772fc81 and I don't think the state post this commit
makes sense. Before this commit another block based AM could implement analyze
without much code duplication. Now a large portion of analyze.c has to be
copied, because they can't stop acquire_sample_rows() from calling
heapam_scan_analyze_next_block().

I'm quite certain this will break a few out-of-core AMs in a way that can't
easily be fixed.

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think that providing both new and old interface functions for block-based
and non-block based custom AMs is an excellent compromise.

The patch v1-0001-Turn-back.. is mainly an undo of the part of 27bc1772fc81
that had turned off _analyze_next_tuple..analyze_next_block for external
callers. If some extensions are already adapted to the old interface
functions, they are free to still use them.

Please, check this. Instead of keeping two APIs, it generalizes
acquire_sample_rows(). The downside is change of
AcquireSampleRowsFunc signature, which would need some changes in FDWs
too.

To me, both approaches, v1-0001-Turn-back... and v2-0001-Generalize..., look
good, and so does patch v2.

Pavel.

I added some changes to the comments to better reflect the changes in patch v2.
See patch v3 (the code is unchanged from v2).

Regards,
Pavel

Attachments:

v3-0001-Generalize-acquire_sample_rows.patchapplication/octet-stream; name=v3-0001-Generalize-acquire_sample_rows.patchDownload
From 5886698acb05231d0150e779011796c594cb7ed2 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Mon, 8 Apr 2024 12:30:45 +0300
Subject: [PATCH v3] Generalize acquire_sample_rows()

Reported-by:
Bug:
Discussion:
Author:
Reviewed-by:
Tested-by:
Backpatch-through:
---
 contrib/file_fdw/file_fdw.c         | 13 ++--
 contrib/postgres_fdw/postgres_fdw.c | 13 ++--
 src/backend/commands/analyze.c      | 96 +++++++++++++++++++----------
 src/include/access/tableam.h        | 21 ++++++-
 src/include/commands/vacuum.h       | 59 +++++++++++++++++-
 src/include/foreign/fdwapi.h        |  3 +-
 src/tools/pgindent/typedefs.list    |  1 +
 7 files changed, 159 insertions(+), 47 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 249d82d3a0..9c162c099f 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -139,7 +139,8 @@ static void fileReScanForeignScan(ForeignScanState *node);
 static void fileEndForeignScan(ForeignScanState *node);
 static bool fileAnalyzeForeignTable(Relation relation,
 									AcquireSampleRowsFunc *func,
-									BlockNumber *totalpages);
+									BlockNumber *totalpages,
+									void **arg);
 static bool fileIsForeignScanParallelSafe(PlannerInfo *root, RelOptInfo *rel,
 										  RangeTblEntry *rte);
 
@@ -162,7 +163,8 @@ static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 						   Cost *startup_cost, Cost *total_cost);
 static int	file_acquire_sample_rows(Relation onerel, int elevel,
 									 HeapTuple *rows, int targrows,
-									 double *totalrows, double *totaldeadrows);
+									 double *totalrows, double *totaldeadrows,
+									 void *arg);
 
 
 /*
@@ -806,7 +808,8 @@ fileEndForeignScan(ForeignScanState *node)
 static bool
 fileAnalyzeForeignTable(Relation relation,
 						AcquireSampleRowsFunc *func,
-						BlockNumber *totalpages)
+						BlockNumber *totalpages,
+						void **arg)
 {
 	char	   *filename;
 	bool		is_program;
@@ -845,6 +848,7 @@ fileAnalyzeForeignTable(Relation relation,
 		*totalpages = 1;
 
 	*func = file_acquire_sample_rows;
+	*arg = NULL;
 
 	return true;
 }
@@ -1122,7 +1126,8 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 static int
 file_acquire_sample_rows(Relation onerel, int elevel,
 						 HeapTuple *rows, int targrows,
-						 double *totalrows, double *totaldeadrows)
+						 double *totalrows, double *totaldeadrows,
+						 void *arg)
 {
 	int			numrows = 0;
 	double		rowstoskip = -1;	/* -1 means not set yet */
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 142dcfc995..08e53aa99d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -400,7 +400,8 @@ static void postgresExecForeignTruncate(List *rels,
 										bool restart_seqs);
 static bool postgresAnalyzeForeignTable(Relation relation,
 										AcquireSampleRowsFunc *func,
-										BlockNumber *totalpages);
+										BlockNumber *totalpages,
+										void **arg);
 static List *postgresImportForeignSchema(ImportForeignSchemaStmt *stmt,
 										 Oid serverOid);
 static void postgresGetForeignJoinPaths(PlannerInfo *root,
@@ -501,7 +502,8 @@ static void process_query_params(ExprContext *econtext,
 static int	postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 										  HeapTuple *rows, int targrows,
 										  double *totalrows,
-										  double *totaldeadrows);
+										  double *totaldeadrows,
+										  void *arg);
 static void analyze_row_processor(PGresult *res, int row,
 								  PgFdwAnalyzeState *astate);
 static void produce_tuple_asynchronously(AsyncRequest *areq, bool fetch);
@@ -4921,7 +4923,8 @@ process_query_params(ExprContext *econtext,
 static bool
 postgresAnalyzeForeignTable(Relation relation,
 							AcquireSampleRowsFunc *func,
-							BlockNumber *totalpages)
+							BlockNumber *totalpages,
+							void **arg)
 {
 	ForeignTable *table;
 	UserMapping *user;
@@ -4931,6 +4934,7 @@ postgresAnalyzeForeignTable(Relation relation,
 
 	/* Return the row-analysis function pointer */
 	*func = postgresAcquireSampleRowsFunc;
+	*arg = NULL;
 
 	/*
 	 * Now we have to get the number of pages.  It's annoying that the ANALYZE
@@ -5057,7 +5061,8 @@ static int
 postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 							  HeapTuple *rows, int targrows,
 							  double *totalrows,
-							  double *totaldeadrows)
+							  double *totaldeadrows,
+							  void *arg)
 {
 	PgFdwAnalyzeState astate;
 	ForeignTable *table;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index da27a13a3f..b2c4b99357 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -80,7 +80,8 @@ static BufferAccessStrategy vac_strategy;
 
 static void do_analyze_rel(Relation onerel,
 						   VacuumParams *params, List *va_cols,
-						   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
+						   AcquireSampleRowsFunc acquirefunc,
+						   void *acquirefuncArg, BlockNumber relpages,
 						   bool inh, bool in_outer_xact, int elevel);
 static void compute_index_stats(Relation onerel, double totalrows,
 								AnlIndexData *indexdata, int nindexes,
@@ -88,9 +89,6 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
-static int	acquire_sample_rows(Relation onerel, int elevel,
-								HeapTuple *rows, int targrows,
-								double *totalrows, double *totaldeadrows);
 static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
@@ -116,6 +114,7 @@ analyze_rel(Oid relid, RangeVar *relation,
 	Relation	onerel;
 	int			elevel;
 	AcquireSampleRowsFunc acquirefunc = NULL;
+	void	   *acquirefuncArg = NULL;
 	BlockNumber relpages = 0;
 
 	/* Select logging level */
@@ -191,9 +190,12 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/* Use row acquisition function provided by table AM */
+		/*
+		 * Get row acquisition function, blocks and tuples iteration
+		 * callbacks provided by table AM
+		 */
 		table_relation_analyze(onerel, &acquirefunc,
-							   &relpages, vac_strategy);
+							   &relpages, vac_strategy, &acquirefuncArg);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -209,7 +211,8 @@ analyze_rel(Oid relid, RangeVar *relation,
 		if (fdwroutine->AnalyzeForeignTable != NULL)
 			ok = fdwroutine->AnalyzeForeignTable(onerel,
 												 &acquirefunc,
-												 &relpages);
+												 &relpages,
+												 &acquirefuncArg);
 
 		if (!ok)
 		{
@@ -248,15 +251,15 @@ analyze_rel(Oid relid, RangeVar *relation,
 	 * tables, which don't contain any rows.
 	 */
 	if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		do_analyze_rel(onerel, params, va_cols, acquirefunc,
+		do_analyze_rel(onerel, params, va_cols, acquirefunc, acquirefuncArg,
 					   relpages, false, in_outer_xact, elevel);
 
 	/*
 	 * If there are child tables, do recursive ANALYZE.
 	 */
 	if (onerel->rd_rel->relhassubclass)
-		do_analyze_rel(onerel, params, va_cols, acquirefunc, relpages,
-					   true, in_outer_xact, elevel);
+		do_analyze_rel(onerel, params, va_cols, acquirefunc, acquirefuncArg,
+					   relpages, true, in_outer_xact, elevel);
 
 	/*
 	 * Close source relation now, but keep lock so that no one deletes it
@@ -272,15 +275,16 @@ analyze_rel(Oid relid, RangeVar *relation,
 /*
  *	do_analyze_rel() -- analyze one relation, recursively or not
  *
- * Note that "acquirefunc" is only relevant for the non-inherited case.
- * For the inherited case, acquire_inherited_sample_rows() determines the
- * appropriate acquirefunc for each child table.
+ * Note that "acquirefunc" and “acquirefuncArg” are only relevant for the
+ * non-inherited case. For the inherited case, acquire_inherited_sample_rows()
+ * determines the appropriate acquirefunc and acquirefuncArg for each child
+ * table.
  */
 static void
 do_analyze_rel(Relation onerel, VacuumParams *params,
 			   List *va_cols, AcquireSampleRowsFunc acquirefunc,
-			   BlockNumber relpages, bool inh, bool in_outer_xact,
-			   int elevel)
+			   void *acquirefuncArg, BlockNumber relpages, bool inh,
+			   bool in_outer_xact, int elevel)
 {
 	int			attr_cnt,
 				tcnt,
@@ -526,7 +530,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	else
 		numrows = (*acquirefunc) (onerel, elevel,
 								  rows, targrows,
-								  &totalrows, &totaldeadrows);
+								  &totalrows, &totaldeadrows,
+								  acquirefuncArg);
 
 	/*
 	 * Compute the statistics.  Temporary results during the calculations for
@@ -1117,15 +1122,17 @@ block_sampling_read_stream_next(ReadStream *stream,
 }
 
 /*
- * acquire_sample_rows -- acquire a random sample of rows from the heap
+ * acquire_sample_rows -- acquire a random sample of rows from the
+ * block-based relation
  *
  * Selected rows are returned in the caller-allocated array rows[], which
  * must have at least targrows entries.
  * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the heap,
+ * We also estimate the total numbers of live and dead rows in the relation,
  * and return them into *totalrows and *totaldeadrows, respectively.
  *
- * The returned list of tuples is in order by physical position in the heap.
+ * The returned list of tuples is in order by physical position in the
+ * relation.
  * (We will rely on this later to derive correlation estimates.)
  *
  * As of May 2004 we use a new two-stage method:  Stage one selects up
@@ -1147,13 +1154,15 @@ block_sampling_read_stream_next(ReadStream *stream,
  * look at a statistically unbiased set of blocks, we should get
  * unbiased estimates of the average numbers of live and dead rows per
  * block.  The previous sampling method put too much credence in the row
- * density near the start of the heap.
+ * density near the start of the relation.
  */
-static int
+int
 acquire_sample_rows(Relation onerel, int elevel,
 					HeapTuple *rows, int targrows,
-					double *totalrows, double *totaldeadrows)
+					double *totalrows, double *totaldeadrows,
+					void *arg)
 {
+	AcquireSampleRowsArg *cb = (AcquireSampleRowsArg *) arg;
 	int			numrows = 0;	/* # rows now in reservoir */
 	double		samplerows = 0; /* total # rows collected */
 	double		liverows = 0;	/* # live rows seen */
@@ -1188,7 +1197,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
 	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
@@ -1200,11 +1209,11 @@ acquire_sample_rows(Relation onerel, int elevel,
 										0);
 
 	/* Outer loop over blocks to sample */
-	while (heapam_scan_analyze_next_block(scan, stream))
+	while (cb->scan_analyze_next_block(scan, stream))
 	{
 		vacuum_delay_point();
 
-		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (cb->scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1256,7 +1265,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
-	heap_endscan(scan);
+	table_endscan(scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
@@ -1333,10 +1342,18 @@ compare_rows(const void *a, const void *b, void *arg)
  */
 void
 heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
-			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+			   BlockNumber *totalpages, BufferAccessStrategy bstrategy,
+			   void **arg)
 {
+	static AcquireSampleRowsArg cb =
+	{
+		.scan_analyze_next_block = heapam_scan_analyze_next_block,
+		.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple
+	};
+
 	*func = acquire_sample_rows;
 	*totalpages = RelationGetNumberOfBlocks(relation);
+	*arg = &cb;
 	vac_strategy = bstrategy;
 }
 
@@ -1349,7 +1366,7 @@ heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
  * We fail and return zero if there are no inheritance children, or if all
  * children are foreign tables that don't support ANALYZE.
  */
-static int
+int
 acquire_inherited_sample_rows(Relation onerel, int elevel,
 							  HeapTuple *rows, int targrows,
 							  double *totalrows, double *totaldeadrows)
@@ -1357,6 +1374,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 	List	   *tableOIDs;
 	Relation   *rels;
 	AcquireSampleRowsFunc *acquirefuncs;
+	void	  **acquirefuncArgs;
 	double	   *relblocks;
 	double		totalblocks;
 	int			numrows,
@@ -1396,12 +1414,15 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 	}
 
 	/*
-	 * Identify acquirefuncs to use, and count blocks in all the relations.
+	 * Identify acquirefunc-s and acquirefuncArg-s to use, and count blocks
+	 * in all the relations.
 	 * The result could overflow BlockNumber, so we use double arithmetic.
 	 */
 	rels = (Relation *) palloc(list_length(tableOIDs) * sizeof(Relation));
 	acquirefuncs = (AcquireSampleRowsFunc *)
 		palloc(list_length(tableOIDs) * sizeof(AcquireSampleRowsFunc));
+	acquirefuncArgs = (void **)
+		palloc(list_length(tableOIDs) * sizeof(void *));
 	relblocks = (double *) palloc(list_length(tableOIDs) * sizeof(double));
 	totalblocks = 0;
 	nrels = 0;
@@ -1411,6 +1432,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		Oid			childOID = lfirst_oid(lc);
 		Relation	childrel;
 		AcquireSampleRowsFunc acquirefunc = NULL;
+		void	   *acquirefuncArg = NULL;
 		BlockNumber relpages = 0;
 
 		/* We already got the needed lock */
@@ -1429,9 +1451,13 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Use row acquisition function provided by table AM */
+			/*
+			 * Get row acquisition function, blocks and tuples iteration
+			 * callbacks provided by table AM
+			 */
 			table_relation_analyze(childrel, &acquirefunc,
-								   &relpages, vac_strategy);
+								   &relpages, vac_strategy,
+								   &acquirefuncArg);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
@@ -1447,7 +1473,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 			if (fdwroutine->AnalyzeForeignTable != NULL)
 				ok = fdwroutine->AnalyzeForeignTable(childrel,
 													 &acquirefunc,
-													 &relpages);
+													 &relpages,
+													 &acquirefuncArg);
 
 			if (!ok)
 			{
@@ -1475,6 +1502,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		has_child = true;
 		rels[nrels] = childrel;
 		acquirefuncs[nrels] = acquirefunc;
+		acquirefuncArgs[nrels] = acquirefuncArg;
 		relblocks[nrels] = (double) relpages;
 		totalblocks += (double) relpages;
 		nrels++;
@@ -1506,6 +1534,7 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 	{
 		Relation	childrel = rels[i];
 		AcquireSampleRowsFunc acquirefunc = acquirefuncs[i];
+		void	   *acquirefuncArg = acquirefuncArgs[i];
 		double		childblocks = relblocks[i];
 
 		/*
@@ -1544,7 +1573,8 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 				/* Fetch a random sample of the child's rows */
 				childrows = (*acquirefunc) (childrel, elevel,
 											rows + numrows, childtargrows,
-											&trows, &tdrows);
+											&trows, &tdrows,
+											acquirefuncArg);
 
 				/* We may need to convert from child's rowtype to parent's */
 				if (childrows > 0 &&
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index ec827ac12b..f28044d834 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -692,7 +692,8 @@ typedef struct TableAmRoutine
 	void		(*relation_analyze) (Relation relation,
 									 AcquireSampleRowsFunc *func,
 									 BlockNumber *totalpages,
-									 BufferAccessStrategy bstrategy);
+									 BufferAccessStrategy bstrategy,
+									 void **arg);
 
 
 	/* ------------------------------------------------------------------------
@@ -1020,6 +1021,19 @@ table_beginscan_tid(Relation rel, Snapshot snapshot)
 	return rel->rd_tableam->scan_begin(rel, snapshot, 0, NULL, NULL, flags);
 }
 
+/*
+ * table_beginscan_analyze is an alternative entry point for setting up a
+ * TableScanDesc for an ANALYZE scan.  As with bitmap scans, it's worth using
+ * the same data structure although the behavior is rather different.
+ */
+static inline TableScanDesc
+table_beginscan_analyze(Relation rel)
+{
+	uint32		flags = SO_TYPE_ANALYZE;
+
+	return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
+}
+
 /*
  * End relation scan.
  */
@@ -1868,10 +1882,11 @@ table_index_validate_scan(Relation table_rel,
  */
 static inline void
 table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
-					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
+					   BlockNumber *totalpages, BufferAccessStrategy bstrategy,
+					   void **arg)
 {
 	relation->rd_tableam->relation_analyze(relation, func,
-										   totalpages, bstrategy);
+										   totalpages, bstrategy, arg);
 }
 
 /* ----------------------------------------------------------------------------
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 9514f8b2fd..5d06c0a567 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,9 +21,11 @@
 #include "catalog/pg_class.h"
 #include "catalog/pg_statistic.h"
 #include "catalog/pg_type.h"
+#include "executor/tuptable.h"
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
+#include "storage/read_stream.h"
 #include "utils/relcache.h"
 
 /*
@@ -189,7 +191,8 @@ typedef struct VacAttrStats
 typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 									  HeapTuple *rows, int targrows,
 									  double *totalrows,
-									  double *totaldeadrows);
+									  double *totaldeadrows,
+									  void *arg);
 
 /* flag bits for VacuumParams->options */
 #define VACOPT_VACUUM 0x01		/* do VACUUM */
@@ -390,12 +393,64 @@ extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
 extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 
 /* in commands/analyze.c */
+
+struct TableScanDescData;
+
+/* The struct to be passed as '*arg' to acquire_sample_rows */
+typedef struct
+{
+	/*
+	 * Prepare to analyze block from `stream` of `scan`. The scan has been
+	 * started with table_beginscan_analyze().
+	 *
+	 * The callback may acquire resources like locks that are held until
+	 * (*scan_analyze_next_tuple)() returns false. In some cases it could be
+	 * useful to hold a lock until all tuples in a block have been analyzed
+	 * by (*scan_analyze_next_tuple)().
+	 *
+	 * The callback can return false if the block is not suitable for
+	 * sampling, e.g. because it's a metapage that could never contain tuples.
+	 *
+	 * This is primarily suited for block-based AMs. It's not clear what a
+	 * good interface for non block-based AMs would be, so there isn't one
+	 * yet and sampling using a custom implementation of acquire_sample_rows
+	 * may be preferred.
+	 */
+	bool		(*scan_analyze_next_block) (struct TableScanDescData *scan,
+											ReadStream *stream);
+
+	/*
+	 * Iterate over tuples in the block selected with
+	 * (*scan_analyze_next_block)() (which needs to have returned true, and
+	 * this routine may not have returned false for the same block before). If
+	 * a tuple that's suitable for sampling is found, true is returned and a
+	 * tuple is stored in `slot`.
+	 *
+	 * *liverows and *deadrows are incremented according to the encountered
+	 * tuples.
+	 *
+	 * Not every AM might have a meaningful concept of dead rows, in which
+	 * case it's OK to not increment *deadrows - but note that that may
+	 * influence autovacuum scheduling (see comment for relation_vacuum
+	 * callback).
+	 */
+	bool		(*scan_analyze_next_tuple) (struct TableScanDescData *scan,
+											TransactionId OldestXmin,
+											double *liverows,
+											double *deadrows,
+											TupleTableSlot *slot);
+} AcquireSampleRowsArg;
+
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
+extern int	acquire_sample_rows(Relation onerel, int elevel,
+								HeapTuple *rows, int targrows,
+								double *totalrows, double *totaldeadrows,
+								void *arg);
 extern void heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
 						   BlockNumber *totalpages,
-						   BufferAccessStrategy bstrategy);
+						   BufferAccessStrategy bstrategy, void **arg);
 
 extern bool std_typanalyze(VacAttrStats *stats);
 
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 0968e0a01e..ebea559e53 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -151,7 +151,8 @@ typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 											  AcquireSampleRowsFunc *func,
-											  BlockNumber *totalpages);
+											  BlockNumber *totalpages,
+											  void **arg);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 											   Oid serverOid);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cb78f11119..0156ef27ab 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -22,6 +22,7 @@ AclItem
 AclMaskHow
 AclMode
 AclResult
+AcquireSampleRowsArg
 AcquireSampleRowsFunc
 ActionList
 ActiveSnapshotElt
-- 
2.39.2 (Apple Git-143)
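
To make the proposed interface above more concrete, here is a minimal,
hypothetical sketch of how a block-based table AM could plug into it.
Nothing below is part of the patch; the my_am_* names and the placeholder
bodies are illustrative assumptions only. The idea is that the AM's
relation_analyze() points *func at the core acquire_sample_rows() and hands
its block/tuple iteration callbacks through the new void **arg parameter as
an AcquireSampleRowsArg.

#include "postgres.h"

#include "access/tableam.h"
#include "commands/vacuum.h"
#include "storage/bufmgr.h"

/* Hypothetical callbacks; a real AM would iterate its own blocks/tuples. */
static bool
my_am_scan_analyze_next_block(struct TableScanDescData *scan, ReadStream *stream)
{
	/* pin/lock the next sampled block; return false once the stream ends */
	return false;				/* placeholder body */
}

static bool
my_am_scan_analyze_next_tuple(struct TableScanDescData *scan,
							  TransactionId OldestXmin,
							  double *liverows, double *deadrows,
							  TupleTableSlot *slot)
{
	/* store the next sampleable tuple of the current block into 'slot' */
	return false;				/* placeholder body */
}

static const AcquireSampleRowsArg my_am_analyze_arg = {
	.scan_analyze_next_block = my_am_scan_analyze_next_block,
	.scan_analyze_next_tuple = my_am_scan_analyze_next_tuple,
};

/* relation_analyze callback of the hypothetical AM */
static void
my_am_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
					   BlockNumber *totalpages, BufferAccessStrategy bstrategy,
					   void **arg)
{
	*func = acquire_sample_rows;	/* core sampler exported by analyze.c */
	*totalpages = RelationGetNumberOfBlocks(relation);
	*arg = (void *) &my_am_analyze_arg;
	/* bstrategy handling is omitted in this sketch */
}

With this, an out-of-core block-based AM would reuse the core two-stage
sampling while still controlling how blocks and tuples are iterated.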

#55Andres Freund
andres@anarazel.de
In reply to: Pavel Borisov (#51)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-08 11:17:51 +0400, Pavel Borisov wrote:

On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov <aekorotkov@gmail.com>

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think that providing both new and old interface functions for block-based
and non-block based custom am is an excellent compromise.

I don't agree, that way lies an unmanageable API. To me the new API doesn't
look well polished either, so it's not a question of a smoother transition or
something like that.

I don't think redesigning extension APIs at this stage of the release cycle
makes sense.

Greetings,

Andres Freund

#56Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#55)
Re: Table AM Interface Enhancements

On 2024-04-08 08:37:44 -0700, Andres Freund wrote:

On 2024-04-08 11:17:51 +0400, Pavel Borisov wrote:

On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov <aekorotkov@gmail.com>

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think that providing both new and old interface functions for block-based
and non-block based custom am is an excellent compromise.

I don't agree, that way lies an unmanageable API. To me the new API doesn't
look well polished either, so it's not a question of a smoother transition or
something like that.

I don't think redesigning extension APIs at this stage of the release cycle
makes sense.

Wait, you already pushed an API redesign? With a design that hasn't even seen
the list from what I can tell? Without even mentioning that on the list? You
got to be kidding me.

#57Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#56)
Re: Table AM Interface Enhancements

On Mon, Apr 8, 2024, 19:08 Andres Freund <andres@anarazel.de> wrote:

On 2024-04-08 08:37:44 -0700, Andres Freund wrote:

On 2024-04-08 11:17:51 +0400, Pavel Borisov wrote:

On Mon, 8 Apr 2024 at 03:25, Alexander Korotkov <aekorotkov@gmail.com>

I was under the impression there are not so many out-of-core table
AMs, which have non-dummy analysis implementations. And even if there
are some, duplicating acquire_sample_rows() isn't a big deal.

But given your feedback, I'd like to propose to keep both options
open. Turn back the block-level API for analyze, but let table-AM
implement its own analyze function. Then existing out-of-core AMs
wouldn't need to do anything (or probably just set the new API method
to NULL).

I think that providing both new and old interface functions for block-based
and non-block based custom am is an excellent compromise.

I don't agree, that way lies an unmanageable API. To me the new API doesn't
look well polished either, so it's not a question of a smoother transition or
something like that.

I don't think redesigning extension APIs at this stage of the release cycle
makes sense.

Wait, you already pushed an API redesign? With a design that hasn't even seen
the list from what I can tell? Without even mentioning that on the list? You
got to be kidding me.

Yes, it was my mistake. I was rushing to fit this in before FF, even making
significant changes just before commit.
I'll revert this later today.

------
Regards,
Alexander Korotkov

#58Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#57)
Re: Table AM Interface Enhancements

On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, it was my mistake. I got rushing trying to fit this to FF, even doing significant changes just before commit.
I'll revert this later today.

Alexander,

Exactly how much is getting reverted here? I see these, all since March 23rd:

dd1f6b0c17 Provide a way block-level table AMs could re-use
acquire_sample_rows()
9bd99f4c26 Custom reloptions for table AM
97ce821e3e Fix the parameters order for
TableAmRoutine.relation_copy_for_cluster()
867cc7b6dd Revert "Custom reloptions for table AM"
b1484a3f19 Let table AM insertion methods control index insertion
c95c25f9af Custom reloptions for table AM
27bc1772fc Generalize relation analyze in table AM interface
87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
02eb07ea89 Allow table AM to store complex data structures in rd_amcache

I'm not really feeling very good about all of this, because:

- 87985cc925 was previously committed as 11470f544e on March 23, 2023,
and almost immediately reverted. Now you tried again on March 26,
2024. I know there was a bunch of rework in the middle, but there are
times in the year that things can be committed other than right before
the feature freeze. Like, don't wait a whole year for the next attempt
and then again do it right before the cutoff.

- The Discussion links in the commit messages do not seem to stand for
the proposition that these particular patches ought to be committed in
this form. Some of them are just links to the messages where the patch
was originally posted, which is probably not against policy or
anything, but it'd be nicer to see links to versions of the patch with
which people are, in nearby emails, agreeing. Even worse, some of
these are links to emails where somebody said, "hey, some earlier
commit does not look good." In particular,
dd1f6b0c172a643a73d6b71259fa2d10378b39eb has a discussion link where
Andres complains about 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa, but
it's not clear how that justifies the new commit.

- The commit message for 867cc7b6dd says "This reverts commit
c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues
spotted after commit." That's not a very good justification for then
trying again 6 days later with 9bd99f4c26, and it's *definitely* not a
good justification for there being no meaningful discussion links in
the commit message for 9bd99f4c26. They're just the same links you had
in the previous attempt, so it's pretty hard for anybody to understand
what got fixed and whether all of the concerns were really addressed.
Just looking over the commit, it's pretty hard to understand what is
being changed and why: there's not a lot of comment updates, there's
no documentation changes, and there's not a lot of explanation in the
commit message either. Even if this feature is great and all the code
is perfect now, it's going to be hard for anyone to figure out how to
use it.

97ce821e3e looks like a clear bug fix to me, but I wonder if the rest
of this should all just be reverted, with a ban on ever trying it
again after March 1 of any year. I'd like to believe that there are
only bookkeeping problems here, and that there was in fact clear
agreement that all of these changes should be made in this form, and
that the commit messages simply failed to reference the most relevant
emails. But what I fear, especially in view of Andres's remarks, is
that these commits were done in haste without adequate consensus, and
I think that's a serious problem.

--
Robert Haas
EDB: http://www.enterprisedb.com

#59Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#58)
Re: Table AM Interface Enhancements

On Mon, Apr 8, 2024 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, it was my mistake. I got rushing trying to fit this to FF, even doing significant changes just before commit.
I'll revert this later today.

It appears to be a non-trivial revert, because 041b96802e already
revised the relation analyze code after 27bc1772fc. That is, I would need
to "backport" 041b96802e. Sorry, I'm too tired to do this today.
I'll come back to this tomorrow.

Alexander,

Exactly how much is getting reverted here? I see these, all since March 23rd:

dd1f6b0c17 Provide a way block-level table AMs could re-use
acquire_sample_rows()
9bd99f4c26 Custom reloptions for table AM
97ce821e3e Fix the parameters order for
TableAmRoutine.relation_copy_for_cluster()
867cc7b6dd Revert "Custom reloptions for table AM"
b1484a3f19 Let table AM insertion methods control index insertion
c95c25f9af Custom reloptions for table AM
27bc1772fc Generalize relation analyze in table AM interface
87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
02eb07ea89 Allow table AM to store complex data structures in rd_amcache

It would be discouraging to revert all of this. Some items are very
simple; some took a lot of work. I'll come back tomorrow and
answer all your points.

------
Regards,
Alexander Korotkov

#60Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#58)
1 attachment(s)
Re: Table AM Interface Enhancements

On Mon, Apr 8, 2024 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, it was my mistake. I got rushing trying to fit this to FF, even doing significant changes just before commit.
I'll revert this later today.

The patch to revert is attached. Given that the revert touches the work
done in 041b96802e, I think it needs some feedback before I push it.

Alexander,

Exactly how much is getting reverted here? I see these, all since March 23rd:

dd1f6b0c17 Provide a way block-level table AMs could re-use
acquire_sample_rows()
9bd99f4c26 Custom reloptions for table AM
97ce821e3e Fix the parameters order for
TableAmRoutine.relation_copy_for_cluster()
867cc7b6dd Revert "Custom reloptions for table AM"
b1484a3f19 Let table AM insertion methods control index insertion
c95c25f9af Custom reloptions for table AM
27bc1772fc Generalize relation analyze in table AM interface
87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot
02eb07ea89 Allow table AM to store complex data structures in rd_amcache

I'm not really feeling very good about all of this, because:

- 87985cc925 was previously committed as 11470f544e on March 23, 2023,
and almost immediately reverted. Now you tried again on March 26,
2024. I know there was a bunch of rework in the middle, but there are
times in the year that things can be committed other than right before
the feature freeze. Like, don't wait a whole year for the next attempt
and then again do it right before the cutoff.

I agree with the facts, but I have a different interpretation of
them. The patch was committed as 11470f544e on March 23, 2023, then
reverted on April 3. I proposed a revised version, but Andres
complained that it was a new API design just days before FF. Then the
patch with this design sat in the thread for a year with periodic
rebases. So I think I expressed my intention with that design before
the 2023 FF, and nothing prevented anybody from expressing objections
or other feedback during the year. Then I realized that the 2024 FF
was approaching and decided to give this another try for pg18.

But I don't yet see what's wrong with this patch. I waited a year for
feedback. I waited 2 days after saying "I will push this if there are no
objections". Given your feedback now, I understand it would have been
better to attempt to commit this earlier.

I admit my mistake with dd1f6b0c17. I rushed trying to fix things and
actually made them worse. I apologise for this. But if I'm forced to
revert 87985cc925 without even hearing any reasonable criticism besides
the imperfect timing, it feels like punishment for my mistake with
dd1f6b0c17. A pretty unreasonable punishment, in my view.

- The Discussion links in the commit messages do not seem to stand for
the proposition that these particular patches ought to be committed in
this form. Some of them are just links to the messages where the patch
was originally posted, which is probably not against policy or
anything, but it'd be nicer to see links to versions of the patch with
which people are, in nearby emails, agreeing. Even worse, some of
these are links to emails where somebody said, "hey, some earlier
commit does not look good." In particular,
dd1f6b0c172a643a73d6b71259fa2d10378b39eb has a discussion link where
Andres complains about 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa, but
it's not clear how that justifies the new commit.

I have to repeat that I admit my mistake with dd1f6b0c17, apologize
for it, and have drawn my own conclusions so as not to repeat it.
But dd1f6b0c17 seems to be the only one that has a link to a message
with complaints. I went through the list of commits above; it seems
that the others just link to the first message of the thread.
Probably there is a lack of consensus for some of them. But I have
never heard of a policy to link not just the start of the discussion
but also the exact messages expressing agreement. And I haven't seen
others doing that.

- The commit message for 867cc7b6dd says "This reverts commit
c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues
spotted after commit." That's not a very good justification for then
trying again 6 days later with 9bd99f4c26, and it's *definitely* not a
good justification for there being no meaningful discussion links in
the commit message for 9bd99f4c26. They're just the same links you had
in the previous attempt, so it's pretty hard for anybody to understand
what got fixed and whether all of the concerns were really addressed.
Just looking over the commit, it's pretty hard to understand what is
being changed and why: there's not a lot of comment updates, there's
no documentation changes, and there's not a lot of explanation in the
commit message either. Even if this feature is great and all the code
is perfect now, it's going to be hard for anyone to figure out how to
use it.

1) 9bd99f4c26 comprises the patch reworked based on notes from Jeff
Davis. I agree it would have been better to wait for him to express
explicit agreement. Before reverting this, I would prefer to hear his
opinion.
2) One of the issues here is that the table AM API doesn't have real
documentation; it has just a very brief page which doesn't explain the
particular API methods in any depth. I have heard a lot of complaints
about that from users attempting to write table access methods. It's
now too late to complain about that (though with the wisdom I have now,
back during pg12 development I would definitely have objected to the
table AM API being committed in that shape). I understand I could have
been more proactive and proposed a patch with that documentation.

97ce821e3e looks like a clear bug fix to me, but I wonder if the rest
of this should all just be reverted, with a ban on ever trying it
again after March 1 of any year.

Do you propose a ban from March 1 to the end of any year? I don't
think that makes sense, because it leaves only 2 months a year for the
work. That would create a potential rush during those 2 months and
could serve exactly the opposite of the intention. So I guess this
means a ban from March 1 to the FF of any year. The situation now is
quite unpleasant for me, so I'm far from repeating this next year.
However, if there is to be a formal ban, it should be specified. Does
it apply to the patches I've pushed, all patches in this thread, all
similar patches, all table AM patches, or other API patches?

Surely I'm an interested party and can't be impartial. But I think it
would be nice if we introduced some general rules based on this
experience. Could we have an API freeze date some time before the
feature freeze?

I'd like to believe that there are
only bookkeeping problems here, and that there was in fact clear
agreement that all of these changes should be made in this form, and
that the commit messages simply failed to reference the most relevant
emails. But what I fear, especially in view of Andres's remarks, is
that these commits were done in haste without adequate consensus, and
I think that's a serious problem.

This thread has had a lot of patches for the table AM API. My
intention for pg17 was to commit the easiest and least controversial
of them. I understand there should have been more consensus for some
of them, and that committing dd1f6b0c17 instead of reverting
27bc1772fc was a mistake. But I don't feel good about reverting
everything wholesale without clear feedback.

------
Regards,
Alexander Korotkov

Attachments:

v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patchapplication/x-patch; name=v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patchDownload
From ba472259f26ec0894fbe08b40ac7a23cdb3abf61 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorotkov@postgresql.org>
Date: Wed, 10 Apr 2024 11:58:46 +0300
Subject: [PATCH v1] revert: Generalize relation analyze in table AM interface

This commit reverts 27bc1772fc and dd1f6b0c17.  The way a block-level table
AMs could re-use acquire_sample_rows() didn't get enough of review.

Discussion: https://postgr.es/m/20240408160851.drpi33adgxhcb2oa%40awork3.anarazel.de
---
 src/backend/access/heap/heapam_handler.c |  30 ++-----
 src/backend/access/table/tableamapi.c    |   2 +
 src/backend/commands/analyze.c           |  55 ++++--------
 src/include/access/heapam.h              |   8 --
 src/include/access/tableam.h             | 101 ++++++++++++++++++-----
 src/include/commands/vacuum.h            |  69 ----------------
 src/include/foreign/fdwapi.h             |   8 +-
 src/tools/pgindent/typedefs.list         |   3 -
 8 files changed, 108 insertions(+), 168 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 30095d88b09..2485dfa036b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -52,6 +52,7 @@ static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
 								   CommandId cid, LockTupleMode mode,
 								   LockWaitPolicy wait_policy, uint8 flags,
 								   TM_FailureData *tmfd);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -1064,7 +1065,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
-bool
+static bool
 heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
@@ -1088,17 +1089,7 @@ heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 	return true;
 }
 
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
+static bool
 heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
 							   TupleTableSlot *slot)
@@ -2666,18 +2657,6 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 	}
 }
 
-/*
- * heapap_analyze -- implementation of relation_analyze() for heap
- *					 table access method
- */
-static void
-heapam_analyze(Relation relation, AcquireSampleRowsFunc *func,
-			   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
-{
-	block_level_table_analyze(relation, func, totalpages, bstrategy,
-							  heapam_scan_analyze_next_block,
-							  heapam_scan_analyze_next_tuple);
-}
 
 /* ------------------------------------------------------------------------
  * Definition of the heap table access method.
@@ -2725,9 +2704,10 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
+	.scan_analyze_next_block = heapam_scan_analyze_next_block,
+	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
 	.index_validate_scan = heapam_index_validate_scan,
-	.relation_analyze = heapam_analyze,
 
 	.free_rd_amcache = NULL,
 	.relation_size = table_block_relation_size,
diff --git a/src/backend/access/table/tableamapi.c b/src/backend/access/table/tableamapi.c
index d9e23ef3175..c1feef43d8e 100644
--- a/src/backend/access/table/tableamapi.c
+++ b/src/backend/access/table/tableamapi.c
@@ -83,6 +83,8 @@ GetTableAmRoutine(Oid amhandler)
 	Assert(routine->relation_copy_data != NULL);
 	Assert(routine->relation_copy_for_cluster != NULL);
 	Assert(routine->relation_vacuum != NULL);
+	Assert(routine->scan_analyze_next_block != NULL);
+	Assert(routine->scan_analyze_next_tuple != NULL);
 	Assert(routine->index_build_range_scan != NULL);
 	Assert(routine->index_validate_scan != NULL);
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 516b43b0e34..7d2cd249972 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -17,7 +17,6 @@
 #include <math.h>
 
 #include "access/detoast.h"
-#include "access/heapam.h"
 #include "access/genam.h"
 #include "access/multixact.h"
 #include "access/relation.h"
@@ -76,8 +75,6 @@ int			default_statistics_target = 100;
 /* A few variables that don't seem worth passing around as parameters */
 static MemoryContext anl_context = NULL;
 static BufferAccessStrategy vac_strategy;
-static ScanAnalyzeNextBlockFunc scan_analyze_next_block;
-static ScanAnalyzeNextTupleFunc scan_analyze_next_tuple;
 
 
 static void do_analyze_rel(Relation onerel,
@@ -90,6 +87,9 @@ static void compute_index_stats(Relation onerel, double totalrows,
 								MemoryContext col_context);
 static VacAttrStats *examine_attribute(Relation onerel, int attnum,
 									   Node *index_expr);
+static int	acquire_sample_rows(Relation onerel, int elevel,
+								HeapTuple *rows, int targrows,
+								double *totalrows, double *totaldeadrows);
 static int	compare_rows(const void *a, const void *b, void *arg);
 static int	acquire_inherited_sample_rows(Relation onerel, int elevel,
 										  HeapTuple *rows, int targrows,
@@ -190,12 +190,10 @@ analyze_rel(Oid relid, RangeVar *relation,
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
 		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
-		/*
-		 * Get row acquisition function, blocks and tuples iteration callbacks
-		 * provided by table AM
-		 */
-		table_relation_analyze(onerel, &acquirefunc,
-							   &relpages, vac_strategy);
+		/* Regular table, so we'll use the regular row acquisition function */
+		acquirefunc = acquire_sample_rows;
+		/* Also get regular table's size */
+		relpages = RelationGetNumberOfBlocks(onerel);
 	}
 	else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 	{
@@ -1119,17 +1117,15 @@ block_sampling_read_stream_next(ReadStream *stream,
 }
 
 /*
- * acquire_sample_rows -- acquire a random sample of rows from the
- * block-based relation
+ * acquire_sample_rows -- acquire a random sample of rows from the table
  *
  * Selected rows are returned in the caller-allocated array rows[], which
  * must have at least targrows entries.
  * The actual number of rows selected is returned as the function result.
- * We also estimate the total numbers of live and dead rows in the relation,
+ * We also estimate the total numbers of live and dead rows in the table,
  * and return them into *totalrows and *totaldeadrows, respectively.
  *
- * The returned list of tuples is in order by physical position in the
- * relation.
+ * The returned list of tuples is in order by physical position in the table.
  * (We will rely on this later to derive correlation estimates.)
  *
  * As of May 2004 we use a new two-stage method:  Stage one selects up
@@ -1151,7 +1147,7 @@ block_sampling_read_stream_next(ReadStream *stream,
  * look at a statistically unbiased set of blocks, we should get
  * unbiased estimates of the average numbers of live and dead rows per
  * block.  The previous sampling method put too much credence in the row
- * density near the start of the relation.
+ * density near the start of the table.
  */
 static int
 acquire_sample_rows(Relation onerel, int elevel,
@@ -1204,11 +1200,11 @@ acquire_sample_rows(Relation onerel, int elevel,
 										0);
 
 	/* Outer loop over blocks to sample */
-	while (scan_analyze_next_block(scan, stream))
+	while (table_scan_analyze_next_block(scan, stream))
 	{
 		vacuum_delay_point();
 
-		while (scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1331,25 +1327,6 @@ compare_rows(const void *a, const void *b, void *arg)
 	return 0;
 }
 
-/*
- * block_level_table_analyze -- implementation of relation_analyze() for
- *								block-level table access methods
- */
-void
-block_level_table_analyze(Relation relation,
-						  AcquireSampleRowsFunc *func,
-						  BlockNumber *totalpages,
-						  BufferAccessStrategy bstrategy,
-						  ScanAnalyzeNextBlockFunc scan_analyze_next_block_cb,
-						  ScanAnalyzeNextTupleFunc scan_analyze_next_tuple_cb)
-{
-	*func = acquire_sample_rows;
-	*totalpages = RelationGetNumberOfBlocks(relation);
-	vac_strategy = bstrategy;
-	scan_analyze_next_block = scan_analyze_next_block_cb;
-	scan_analyze_next_tuple = scan_analyze_next_tuple_cb;
-}
-
 
 /*
  * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
@@ -1439,9 +1416,9 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
 			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
-			/* Use row acquisition function provided by table AM */
-			table_relation_analyze(childrel, &acquirefunc,
-								   &relpages, vac_strategy);
+			/* Regular table, so use the regular row acquisition function */
+			acquirefunc = acquire_sample_rows;
+			relpages = RelationGetNumberOfBlocks(childrel);
 		}
 		else if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 		{
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f84dbe629fe..be630620d0d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -412,14 +412,6 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
-/* in heap/heapam_handler.c*/
-extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
-										   ReadStream *stream);
-extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
-										   TransactionId OldestXmin,
-										   double *liverows, double *deadrows,
-										   TupleTableSlot *slot);
-
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index d1cd71b7a17..4e963cf41e7 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -20,8 +20,8 @@
 #include "access/relscan.h"
 #include "access/sdir.h"
 #include "access/xact.h"
-#include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -668,6 +668,50 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
+	/*
+	 * Prepare to analyze the next block in the read stream.  Returns false if
+	 * the stream is exhausted and true otherwise. The scan must have been
+	 * started with SO_TYPE_ANALYZE option.
+	 *
+	 * This routine holds a buffer pin and lock on the heap page.  They are
+	 * held until heapam_scan_analyze_next_tuple() returns false.  That is
+	 * until all the items of the heap page are analyzed.
+	 */
+
+	/*
+	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
+	 * with table_beginscan_analyze().  See also
+	 * table_scan_analyze_next_block().
+	 *
+	 * The callback may acquire resources like locks that are held until
+	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
+	 * to hold a lock until all tuples on a block have been analyzed by
+	 * scan_analyze_next_tuple.
+	 *
+	 * The callback can return false if the block is not suitable for
+	 * sampling, e.g. because it's a metapage that could never contain tuples.
+	 *
+	 * XXX: This obviously is primarily suited for block-based AMs. It's not
+	 * clear what a good interface for non block based AMs would be, so there
+	 * isn't one yet.
+	 */
+	bool		(*scan_analyze_next_block) (TableScanDesc scan,
+											ReadStream *stream);
+
+	/*
+	 * See table_scan_analyze_next_tuple().
+	 *
+	 * Not every AM might have a meaningful concept of dead rows, in which
+	 * case it's OK to not increment *deadrows - but note that that may
+	 * influence autovacuum scheduling (see comment for relation_vacuum
+	 * callback).
+	 */
+	bool		(*scan_analyze_next_tuple) (TableScanDesc scan,
+											TransactionId OldestXmin,
+											double *liverows,
+											double *deadrows,
+											TupleTableSlot *slot);
+
 	/* see table_index_build_range_scan for reference about parameters */
 	double		(*index_build_range_scan) (Relation table_rel,
 										   Relation index_rel,
@@ -688,12 +732,6 @@ typedef struct TableAmRoutine
 										Snapshot snapshot,
 										struct ValidateIndexState *state);
 
-	/* See table_relation_analyze() */
-	void		(*relation_analyze) (Relation relation,
-									 AcquireSampleRowsFunc *func,
-									 BlockNumber *totalpages,
-									 BufferAccessStrategy bstrategy);
-
 
 	/* ------------------------------------------------------------------------
 	 * Miscellaneous functions.
@@ -1766,6 +1804,40 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 	rel->rd_tableam->relation_vacuum(rel, params, bstrategy);
 }
 
+/*
+ * Prepare to analyze the next block in the read stream. The scan needs to
+ * have been  started with table_beginscan_analyze().  Note that this routine
+ * might acquire resources like locks that are held until
+ * table_scan_analyze_next_tuple() returns false.
+ *
+ * Returns false if block is unsuitable for sampling, true otherwise.
+ */
+static inline bool
+table_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
+{
+	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, stream);
+}
+
+/*
+ * Iterate over tuples in the block selected with
+ * table_scan_analyze_next_block() (which needs to have returned true, and
+ * this routine may not have returned false for the same block before). If a
+ * tuple that's suitable for sampling is found, true is returned and a tuple
+ * is stored in `slot`.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
+static inline bool
+table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
+							  double *liverows, double *deadrows,
+							  TupleTableSlot *slot)
+{
+	return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
+															liverows, deadrows,
+															slot);
+}
+
 /*
  * table_index_build_scan - scan the table to find tuples to be indexed
  *
@@ -1871,21 +1943,6 @@ table_index_validate_scan(Relation table_rel,
 											   state);
 }
 
-/*
- * table_relation_analyze - fill the infromation for a sampling statistics
- *							acquisition
- *
- * The pointer to a function that will collect sample rows from the table
- * should be stored to `*func`, plus the estimated size of the table in pages
- * should br stored to `*totalpages`.
- */
-static inline void
-table_relation_analyze(Relation relation, AcquireSampleRowsFunc *func,
-					   BlockNumber *totalpages, BufferAccessStrategy bstrategy)
-{
-	relation->rd_tableam->relation_analyze(relation, func,
-										   totalpages, bstrategy);
-}
 
 /* ----------------------------------------------------------------------------
  * Miscellaneous functionality
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 12a03abb75a..759f9a87d38 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -21,11 +21,9 @@
 #include "catalog/pg_class.h"
 #include "catalog/pg_statistic.h"
 #include "catalog/pg_type.h"
-#include "executor/tuptable.h"
 #include "parser/parse_node.h"
 #include "storage/buf.h"
 #include "storage/lock.h"
-#include "storage/read_stream.h"
 #include "utils/relcache.h"
 
 /*
@@ -178,21 +176,6 @@ typedef struct VacAttrStats
 	int			rowstride;
 } VacAttrStats;
 
-/*
- * AcquireSampleRowsFunc - a function for the sampling statistics collection.
- *
- * A random sample of up to `targrows` rows should be collected from the
- * table and stored into the caller-provided `rows` array.  The actual number
- * of rows collected must be returned.  In addition, a function should store
- * estimates of the total numbers of live and dead rows in the table into the
- * output parameters `*totalrows` and `*totaldeadrows1.  (Set `*totaldeadrows`
- * to zero if the storage does not have any concept of dead rows.)
- */
-typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
-									  HeapTuple *rows, int targrows,
-									  double *totalrows,
-									  double *totaldeadrows);
-
 /* flag bits for VacuumParams->options */
 #define VACOPT_VACUUM 0x01		/* do VACUUM */
 #define VACOPT_ANALYZE 0x02		/* do ANALYZE */
@@ -392,61 +375,9 @@ extern void parallel_vacuum_cleanup_all_indexes(ParallelVacuumState *pvs,
 extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 
 /* in commands/analyze.c */
-
-struct TableScanDescData;
-
-
-/*
- * A callback to prepare to analyze block from `stream` of `scan`. The scan
- * has been started with table_beginscan_analyze().
- *
- * The callback may acquire resources like locks that are held until
- * ScanAnalyzeNextTupleFunc returns false. In some cases it could be
- * useful to hold a lock until all tuples in a block have been analyzed by
- * ScanAnalyzeNextTupleFunc.
- *
- * The callback can return false if the block is not suitable for
- * sampling, e.g. because it's a metapage that could never contain tuples.
- *
- * This is primarily suited for block-based AMs. It's not clear what a
- * good interface for non block-based AMs would be, so there isn't one
- * yet and sampling using a custom implementation of acquire_sample_rows
- * may be preferred.
- */
-typedef bool (*ScanAnalyzeNextBlockFunc) (struct TableScanDescData *scan,
-										  ReadStream *stream);
-
-/*
- * A callback to iterate over tuples in the block selected with
- * ScanAnalyzeNextBlockFunc (which needs to have returned true, and
- * this routine may not have returned false for the same block before). If
- * a tuple that's suitable for sampling is found, true is returned and a
- * tuple is stored in `slot`.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- *
- * Not every AM might have a meaningful concept of dead rows, in which
- * case it's OK to not increment *deadrows - but note that that may
- * influence autovacuum scheduling (see comment for relation_vacuum
- * callback).
- */
-typedef bool (*ScanAnalyzeNextTupleFunc) (struct TableScanDescData *scan,
-										  TransactionId OldestXmin,
-										  double *liverows,
-										  double *deadrows,
-										  TupleTableSlot *slot);
-
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
-extern void block_level_table_analyze(Relation relation,
-									  AcquireSampleRowsFunc *func,
-									  BlockNumber *totalpages,
-									  BufferAccessStrategy bstrategy,
-									  ScanAnalyzeNextBlockFunc scan_analyze_next_block_cb,
-									  ScanAnalyzeNextTupleFunc scan_analyze_next_tuple_cb);
-
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 0968e0a01ec..7f0475d2fa7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,7 +13,6 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
-#include "commands/vacuum.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
 
@@ -149,8 +148,13 @@ typedef void (*ExplainForeignModify_function) (ModifyTableState *mtstate,
 typedef void (*ExplainDirectModify_function) (ForeignScanState *node,
 											  struct ExplainState *es);
 
+typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
+									  HeapTuple *rows, int targrows,
+									  double *totalrows,
+									  double *totaldeadrows);
+
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
-											  AcquireSampleRowsFunc *func,
+											  AcquireSampleRowsFunc * func,
 											  BlockNumber *totalpages);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c83417ce9d2..50aad8f39e5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -22,7 +22,6 @@ AclItem
 AclMaskHow
 AclMode
 AclResult
-AcquireSampleRowsFunc
 ActionList
 ActiveSnapshotElt
 AddForeignUpdateTargets_function
@@ -2533,8 +2532,6 @@ ScalarIOData
 ScalarItem
 ScalarMCVItem
 Scan
-ScanAnalyzeNextBlockFunc
-ScanAnalyzeNextTupleFunc
 ScanDirection
 ScanKey
 ScanKeyData
-- 
2.39.3 (Apple Git-145)
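
For contrast, under the interface restored by this revert a block-based
table AM would instead expose the two analyze callbacks directly in its
TableAmRoutine, and the (again static) acquire_sample_rows() in analyze.c
drives them through table_scan_analyze_next_block() and
table_scan_analyze_next_tuple(). A hypothetical sketch follows; the my_am_*
names and placeholder bodies are illustrative only, and the other required
TableAmRoutine members are omitted.

#include "postgres.h"

#include "access/tableam.h"

static bool
my_am_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
{
	/* pin/lock the next sampled block; return false once the stream ends */
	return false;				/* placeholder body */
}

static bool
my_am_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
							  double *liverows, double *deadrows,
							  TupleTableSlot *slot)
{
	/* store the next sampleable tuple of the current block into 'slot' */
	return false;				/* placeholder body */
}

static const TableAmRoutine my_am_methods = {
	.type = T_TableAmRoutine,
	/* ... all other required callbacks omitted in this sketch ... */
	.scan_analyze_next_block = my_am_scan_analyze_next_block,
	.scan_analyze_next_tuple = my_am_scan_analyze_next_tuple,
};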

#61Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#60)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 8:20 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I agree with the facts. But I have a different interpretation on
this. The patch was committed as 11470f544e on March 23, 2023, then
reverted on April 3. I've proposed the revised version, but Andres
complained that this is the new API design days before FF.

Well, his first complaint was that your committed patch was full of bugs:

/messages/by-id/20230323003003.plgaxjqahjgkuxrk@awork3.anarazel.de

When you commit a patch and another committer writes a post-commit
review saying that your patch has so many serious problems that he
gave up on reviewing before enumerating all of them, that's a really
bad sign. That should be a strong signal to you to step back and take
a close look at whether you really understand the area of the code
that you're touching well enough to be doing whatever it is that
you're doing. If I got a review like that, I would have reverted the
patch instantly, given up for the release cycle, possibly given up on
the patch permanently, and most definitely not tried again to commit
unless I was absolutely certain that I'd learned a lot in the meantime
*and* had the agreement of the committer who wrote that review (or
maybe some other committer who was acknowledged as an expert in that
area of the code).

What you did instead is try to do a bunch of post-commit fixup in a
desperate rush right before feature freeze, to which Andres
understandably objected. But that was your second mistake, not your
first one.

Then the
patch with this design was published in the thread for the year with
periodical rebases. So, I think I expressed my intention with that
design before 2023 FF, nobody prevented me from expressing objections
or other feedback during the year. Then I realized that 2024 FF is
approaching and decided to give this another try for pg18.

This doesn't seem to match the facts as I understand them. It appears
to me that there was no activity on the thread from April until
November. The message in November was not written by you. Your first
post to the thread after April of 2023 was on March 19, 2024. Five
days later you said you wanted to commit. That doesn't look to me like
you worked diligently on the patch set throughout the year and other
people had reasonable notice that you planned to get the work done
this cycle. It looks like you ignored the patch for 11 months and then
committed it without any real further feedback from anyone. True,
Pavel did post and say that he thought the patches were in good shape.
But you could hardly take that as evidence that Andres was now content
that the problems he'd raised earlier had been fixed, because (a)
Pavel had also been involved beforehand and had not raised the
concerns that Andres later raised and (b) Pavel wrote nothing in his
email specifically about why he thought your changes or his had
resolved those concerns. I certainly agree that Andres doesn't always
give as much review feedback as I'd like to have from him, and it's
also true that he doesn't always give that feedback as quickly as I'd
like to have it ... but you know what?

It's not Andres's job to make sure my patches are not broken. It's my
job. That applies to the patches I write, and the patches written by
other people that I commit. If I commit something and it turns out
that it is broken, that's my bad. If I commit something and it turns
out that it does not have consensus, that is also my bad. It is not
the fault of the other people for not helping me get my patches to a
state where they are up to project standard. It is my fault, and my
fault alone, for committing something that was not ready. Now that
does not mean that it isn't frustrating when I can't get the help I
need. It is extremely frustrating. But the solution is not to commit
anyway and then blame the other people for not providing feedback.

I mean, committing without explicit agreement from someone else is OK
if you're pretty sure that you've got everything sorted out correctly.
But I don't think that the paper trail here supports the narrative
that you worked on this diligently throughout the year and had every
reason to believe it would be acceptable to the community. If I'd
looked at this thread, I would have concluded that you'd abandoned the
project. I would have expected that, when you picked it up again,
there would be a series of emails over a period of time carefully
working through the various issues that had been raised, inviting
specific commentary on specific discussion points, and generally
refining the work, and then maybe a suggestion of a commit at the end.
I would not have expected an email or two basically saying "well,
seems like it's all fixed now," followed by a commit.

Do you propose a ban from March 1 to the end of any year? I think the
first doesn't make sense, because it leaves only 2 months a year for
the work. That would create a potential rush during these 2 month and
could serve exactly opposite to the intention. So, I guess this means
a ban from March 1 to the FF of any year. The situation now is quite
unpleasant for me. So I'm far from repeating this next year.
However, if there should be a formal ban, it should be specified.
Does it relate to the patches I've pushed, all patches in this thread,
all similar patches, all table AM patches, or other API patches?

I meant from March 1 to feature freeze, but maybe I should have
proposed that you shouldn't ever commit these patches. The more I look
at this, the less happy I am with how you did it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#62Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#60)
Re: Table AM Interface Enhancements

Hi, Alexander!

On Wed, 10 Apr 2024 at 16:20, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Mon, Apr 8, 2024 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, it was my mistake. I got rushing trying to fit this to FF, even
doing significant changes just before commit.

I'll revert this later today.

The patch to revert is attached. Given that revert touches the work
done in 041b96802e, I think it needs some feedback before push.

Alexander,

Exactly how much is getting reverted here? I see these, all since March 23rd:

dd1f6b0c17 Provide a way block-level table AMs could re-use
acquire_sample_rows()
9bd99f4c26 Custom reloptions for table AM
97ce821e3e Fix the parameters order for
TableAmRoutine.relation_copy_for_cluster()
867cc7b6dd Revert "Custom reloptions for table AM"
b1484a3f19 Let table AM insertion methods control index insertion
c95c25f9af Custom reloptions for table AM
27bc1772fc Generalize relation analyze in table AM interface
87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()
c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot

02eb07ea89 Allow table AM to store complex data structures in rd_amcache

I'm not really feeling very good about all of this, because:

- 87985cc925 was previously committed as 11470f544e on March 23, 2023,
and almost immediately reverted. Now you tried again on March 26,
2024. I know there was a bunch of rework in the middle, but there are
times in the year that things can be committed other than right before
the feature freeze. Like, don't wait a whole year for the next attempt
and then again do it right before the cutoff.

I agree with the facts. But I have a different interpretation on
this. The patch was committed as 11470f544e on March 23, 2023, then
reverted on April 3. I've proposed the revised version, but Andres
complained that this is the new API design days before FF. Then the
patch with this design was published in the thread for the year with
periodical rebases. So, I think I expressed my intention with that
design before 2023 FF, nobody prevented me from expressing objections
or other feedback during the year. Then I realized that 2024 FF is
approaching and decided to give this another try for pg18.

But I don't yet see it's wrong with this patch. I waited a year for
feedback. I waited 2 days after saying "I will push this if no
objections". Given your feedback now, I get that it would be better to
do another attempt to commit this earlier.

I admit my mistake with dd1f6b0c17. I get rushed trying to fix the
things actually making things worse. I apologise for this. But if
I'm forced to revert 87985cc925 without even hearing any reasonable
critics besides imperfection of timing, I feel like this is the
punishment for my mistake with dd1f6b0c17. Pretty unreasonable
punishment in my view.

- The Discussion links in the commit messages do not seem to stand for
the proposition that these particular patches ought to be committed in
this form. Some of them are just links to the messages where the patch
was originally posted, which is probably not against policy or
anything, but it'd be nicer to see links to versions of the patch with
which people are, in nearby emails, agreeing. Even worse, some of
these are links to emails where somebody said, "hey, some earlier
commit does not look good." In particular,
dd1f6b0c172a643a73d6b71259fa2d10378b39eb has a discussion link where
Andres complains about 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa, but
it's not clear how that justifies the new commit.

I have to repeat again, that I admit my mistake with dd1f6b0c17,
apologize for that, and make my own conclusions to not repeat this.
But dd1f6b0c17 seems to be the only one that has a link to the message
with complains. I went through the list of commits above, it seems
that others have just linked to the first message of the thread.
Probably, there is a lack of consensus for some of them. But I never
heard about a policy to link not just the discussion start, but also
exact messages expressing agreeing. And I didn't see others doing
that.

- The commit message for 867cc7b6dd says "This reverts commit
c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues
spotted after commit." That's not a very good justification for then
trying again 6 days later with 9bd99f4c26, and it's *definitely* not a
good justification for there being no meaningful discussion links in
the commit message for 9bd99f4c26. They're just the same links you had
in the previous attempt, so it's pretty hard for anybody to understand
what got fixed and whether all of the concerns were really addressed.
Just looking over the commit, it's pretty hard to understand what is
being changed and why: there's not a lot of comment updates, there's
no documentation changes, and there's not a lot of explanation in the
commit message either. Even if this feature is great and all the code
is perfect now, it's going to be hard for anyone to figure out how to
use it.

1) 9bd99f4c26 comprises the reworked patch after working with notes
from Jeff Davis. I agree it would be better to wait for him to
express explicit agreement. Before reverting this, I would prefer to
hear his opinion.
2) One of the issues here is that table AM API doesn't have
documentation, it has just a very brief page which doesn't go deep
explaining particular API methods. I have heard a lot of complains
about that from users attempting to write table access methods. It's
now too late to complain about that (but if I had a wisdom of now back
during pg12 development I would definitely object against table AM API
being committed at that shape). I understand I could be more
proactive and propose a patch with that documentation.

97ce821e3e looks like a clear bug fix to me, but I wonder if the rest
of this should all just be reverted, with a ban on ever trying it
again after March 1 of any year.

Do you propose a ban from March 1 to the end of any year? I think the
former doesn't make sense, because it would leave only two months a
year for this work. That would create a potential rush during those
two months and could work exactly opposite to the intention. So, I
guess this means a ban from March 1 to the FF of any year. The
situation now is quite unpleasant for me, so I'm far from repeating
this next year. However, if there is to be a formal ban, it should be
specified. Does it apply to the patches I've pushed, all patches in
this thread, all similar patches, all table AM patches, or other API
patches?

Surely, I'm an interested party and can't be impartial. But I think
it would be nice if we introduced some general rules based on this
experience. Could we have an API freeze date some time before the
feature freeze?

I'd like to believe that there are
only bookkeeping problems here, and that there was in fact clear
agreement that all of these changes should be made in this form, and
that the commit messages simply failed to reference the most relevant
emails. But what I fear, especially in view of Andres's remarks, is
that these commits were done in haste without adequate consensus, and
I think that's a serious problem.

This thread had a lot of patches for the table AM API. My intention
for pg17 was to commit the easiest and least controversial of them. I
understand that some of them needed more consensus, and that
committing dd1f6b0c17 instead of reverting 27bc1772fc was a mistake.
But I don't feel good about reverting everything wholesale without
clear feedback.

------
Regards,
Alexander Korotkov

#62Pavel Borisov
pashkin.elfe@gmail.com
Re: Table AM Interface Enhancements

Hi, Alexander!

In my view, the actual list of what has raised discussion is:
dd1f6b0c17 Provide a way block-level table AMs could re-use
acquire_sample_rows()
27bc1772fc Generalize relation analyze in table AM interface

Proposals to revert the other patches in a wholesale way look to me like an
ill-performed continuation of a discussion [1]. I can't believe that "Let's
select which commits close to FF look worse than the others", based on
timing rather than patch contents, is a good and productive way for the
community to proceed.

At the same time, if Andres, who is the most experienced person in the scope
of access methods, is willing to give a post-commit re-review of any of
the committed patches and recommends some of them be reverted, that would
be good, sensible input to act on accordingly.

[1]: /messages/by-id/39b1e953-6397-44ba-bb18-d3fdd61839c1@joeconway.com

#63Joe Conway
mail@joeconway.com
In reply to: Robert Haas (#61)
Re: Table AM Interface Enhancements

On 4/10/24 09:19, Robert Haas wrote:

When you commit a patch and another committer writes a post-commit
review saying that your patch has so many serious problems that he
gave up on reviewing before enumerating all of them, that's a really
bad sign. That should be a strong signal to you to step back and take
a close look at whether you really understand the area of the code
that you're touching well enough to be doing whatever it is that
you're doing. If I got a review like that, I would have reverted the
patch instantly, given up for the release cycle, possibly given up on
the patch permanently, and most definitely not tried again to commit
unless I was absolutely certain that I'd learned a lot in the meantime
*and* had the agreement of the committer who wrote that review (or
maybe some other committer who was acknowledged as an expert in that
area of the code).

<snip>

It's not Andres's job to make sure my patches are not broken. It's my
job. That applies to the patches I write, and the patches written by
other people that I commit. If I commit something and it turns out
that it is broken, that's my bad. If I commit something and it turns
out that it does not have consensus, that is also my bad. It is not
the fault of the other people for not helping me get my patches to a
state where they are up to project standard. It is my fault, and my
fault alone, for committing something that was not ready. Now that
does not mean that it isn't frustrating when I can't get the help I
need. It is extremely frustrating. But the solution is not to commit
anyway and then blame the other people for not providing feedback.

+many

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#64Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#61)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 4:19 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 10, 2024 at 8:20 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I agree with the facts. But I have a different interpretation on
this. The patch was committed as 11470f544e on March 23, 2023, then
reverted on April 3. I've proposed the revised version, but Andres
complained that this is the new API design days before FF.

Well, his first complaint that your committed patch was full of bugs:

/messages/by-id/20230323003003.plgaxjqahjgkuxrk@awork3.anarazel.de

When you commit a patch and another committer writes a post-commit
review saying that your patch has so many serious problems that he
gave up on reviewing before enumerating all of them, that's a really
bad sign. That should be a strong signal to you to step back and take
a close look at whether you really understand the area of the code
that you're touching well enough to be doing whatever it is that
you're doing. If I got a review like that, I would have reverted the
patch instantly, given up for the release cycle, possibly given up on
the patch permanently, and most definitely not tried again to commit
unless I was absolutely certain that I'd learned a lot in the meantime
*and* had the agreement of the committer who wrote that review (or
maybe some other committer who was acknowledged as an expert in that
area of the code).

What you did instead is try to do a bunch of post-commit fixup in a
desperate rush right before feature freeze, to which Andres
understandably objected. But that was your second mistake, not your
first one.

Then the
patch with this design was published in the thread for the year with
periodical rebases. So, I think I expressed my intention with that
design before 2023 FF, nobody prevented me from expressing objections
or other feedback during the year. Then I realized that 2024 FF is
approaching and decided to give this another try for pg18.

This doesn't seem to match the facts as I understand them. It appears
to me that there was no activity on the thread from April until
November. The message in November was not written by you. Your first
post to the thread after April of 2023 was on March 19, 2024. Five
days later you said you wanted to commit. That doesn't look to me like
you worked diligently on the patch set throughout the year and other
people had reasonable notice that you planned to get the work done
this cycle. It looks like you ignored the patch for 11 months and then
committed it without any real further feedback from anyone. True,
Pavel did post and say that he thought the patches were in good shape.
But you could hardly take that as evidence that Andres was now content
that the problems he'd raised earlier had been fixed, because (a)
Pavel had also been involved beforehand and had not raised the
concerns that Andres later raised and (b) Pavel wrote nothing in his
email specifically about why he thought your changes or his had
resolved those concerns. I certainly agree that Andres doesn't always
give as much review feedback as I'd like to have from him, and it's
also true that he doesn't always give that feedback as quickly as I'd
like to have it ... but you know what?

It's not Andres's job to make sure my patches are not broken. It's my
job. That applies to the patches I write, and the patches written by
other people that I commit. If I commit something and it turns out
that it is broken, that's my bad. If I commit something and it turns
out that it does not have consensus, that is also my bad. It is not
the fault of the other people for not helping me get my patches to a
state where they are up to project standard. It is my fault, and my
fault alone, for committing something that was not ready. Now that
does not mean that it isn't frustrating when I can't get the help I
need. It is extremely frustrating. But the solution is not to commit
anyway and then blame the other people for not providing feedback.

I mean, committing without explicit agreement from someone else is OK
if you're pretty sure that you've got everything sorted out correctly.
But I don't think that the paper trail here supports the narrative
that you worked on this diligently throughout the year and had every
reason to believe it would be acceptable to the community. If I'd
looked at this thread, I would have concluded that you'd abandoned the
project. I would have expected that, when you picked it up again,
there would be a series of emails over a period of time carefully
working through the various issues that had been raised, inviting
specific commentary on specific discussion points, and generally
refining the work, and then maybe a suggestion of a commit at the end.
I would not have expected an email or two basically saying "well,
seems like it's all fixed now," followed by a commit.

Robert, I appreciate your feedback. I'm not saying I agree with
everything. For example, I definitely wasn't going to place the blame
on others for not giving feedback. My point was to show that it is
not the case that I committed that patch without taking feedback into
account. But arguing over every point doesn't feel reasonable right
now. I would rather share the particular conclusions I've drawn:
1) I shouldn't argue too much about reverting patches, especially with
committers more experienced with the relevant part of the codebase.
2) The fact that previous feedback has been taken into account should
be expressed more explicitly everywhere: in comments, commit messages,
mailing list messages, etc.

But I have to mention that even though I committed the table AM stuff
close to the FF, quite an amount of dependent work has since been
committed. So reverting these patches is not going to be something
immediate and easy that requires just a decision. It would touch
others' work. And the revert patches might also need review. I get
the point that the patches lack consensus. But in terms of effort
(not my effort), it probably makes sense to give them some
post-commit review.

Do you propose a ban from March 1 to the end of any year? I think the
former doesn't make sense, because it would leave only two months a
year for this work. That would create a potential rush during those
two months and could work exactly opposite to the intention. So, I
guess this means a ban from March 1 to the FF of any year. The
situation now is quite unpleasant for me, so I'm far from repeating
this next year. However, if there is to be a formal ban, it should be
specified. Does it apply to the patches I've pushed, all patches in
this thread, all similar patches, all table AM patches, or other API
patches?

I meant from March 1 to feature freeze, but maybe I should have
proposed that you shouldn't ever commit these patches. The more I look
at this, the less happy I am with how you did it.

Robert, look. Last year I went through an arrest for expressing my
opinion. It was not what a normal arrest should look like, but a
period of survival. My family went through a period of fear, struggle
and uncertainty. Now we're healthy and safe, but there is still
uncertainty given my asylum-seeker status. During all that period, I
had to just obey, agree with everything, and lie that I apologized for
things I don't apologize for. I had to do this because the price of
expressing myself was not just my life, but also the health, freedom
and well-being of my family.

I owe you great respect for all your work for PostgreSQL, and
especially for your efforts on getting things organized. But it won't
work if you keep increasing my potential punishment until I simply say
that I obey and that you're right about everything. You may even
initiate the procedure of my exclusion from committers (no idea what
the procedure is), a ban from the list, etc. I see you express many
valuable points, but my view is not exactly the same as yours. And I
would like to come to a conclusion as a result of discussion, not
threats.

I feel a sense of blame and fear in the latest discussions, and I
don't like it. It's OK to place blame from time to time. But I would
like to add more joy and respect here (and I'm sorry I personally
didn't do enough in this matter). It's important to get things right,
etc. But in the long term, relationships may mean more.

------
Regards,
Alexander Korotkov

#65Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#58)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-08 14:54:46 -0400, Robert Haas wrote:

Exactly how much is getting reverted here? I see these, all since March 23rd:

IMO:

dd1f6b0c17 Provide a way block-level table AMs could re-use
acquire_sample_rows()

Should be reverted.

9bd99f4c26 Custom reloptions for table AM

Hm. There are some oddities here:

- It doesn't seem great that relcache.c now needs to know about the default
values for all kinds of reloptions.

- why is there table_reloptions() and tableam_reloptions()?

- Why does extractRelOptions() need a TableAmRoutine parameter, extracted by a
caller, instead of doing that work itself?

97ce821e3e Fix the parameters order for
TableAmRoutine.relation_copy_for_cluster()

Shouldn't be, this is a clear fix.

b1484a3f19 Let table AM insertion methods control index insertion

I'm not sure. I'm not convinced this is right, nor the opposite. If the
tableam takes control of index insertion, shouldn't nodeModifyTable know this
earlier, so it doesn't prepare a bunch of index insertion state? Also,
there's pretty much no motivating explanation in the commit.

27bc1772fc Generalize relation analyze in table AM interface

Should be reverted.

87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()

Strongly suspect this should be reverted. The last time this was committed it
was far from ready. It's very easy to cause corruption due to subtle bugs in
this area.

c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot

If the AM returns a different slot, who is responsible for cleaning it up? And
how is creating a new slot for every insert not going to be a measurable
overhead?

02eb07ea89 Allow table AM to store complex data structures in rd_amcache

I am doubtful this is right. Is it really sufficient to have a callback for
freeing? What happens when relcache entries are swapped as part of a rebuild?
That works for "flat" caches, but I don't immediately see how it works for
more complicated data structures. At least from the commit message it's hard
to evaluate how this is actually intended to be used.

Greetings,

Andres Freund

#66Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#64)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 12:36 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:

But I have to mention that even though I committed the table AM stuff
close to the FF, quite an amount of dependent work has since been
committed. So reverting these patches is not going to be something
immediate and easy that requires just a decision. It would touch
others' work. And the revert patches might also need review. I get
the point that the patches lack consensus. But in terms of effort
(not my effort), it probably makes sense to give them some
post-commit review.

That is somewhat fair, but it is also a lot of work. There are
multiple people asking for you to revert things on multiple threads,
and figuring out all of the revert requests and trying to come to some
consensus about what should be done in each case is going to take an
enormous amount of time. I know you've done lots of good work on
PostgreSQL in the past and I respect that, but I think you also have
to realize that you're asking other people to spend a LOT of time
figuring out what to do about the current situation. I see Andres has
posted more specifically about what he thinks should happen to each of
the table AM patches and I am willing to defer to his opinion, but we
need to make some quick decisions here to either keep things or take
them out. Extensive reworks after feature freeze should not be an
option that is on the table; that's what makes it a freeze.

I also do not think I really believe that there's been so much stuff
committed that a blanket revert would be all that hard to carry off,
if that were the option that the community ended up preferring.

Robert, look. Last year I went through an arrest for expressing my
opinion. It was not what a normal arrest should look like, but a
period of survival. My family went through a period of fear, struggle
and uncertainty. Now we're healthy and safe, but there is still
uncertainty given my asylum-seeker status. During all that period, I
had to just obey, agree with everything, and lie that I apologized for
things I don't apologize for. I had to do this because the price of
expressing myself was not just my life, but also the health, freedom
and well-being of my family.

I owe you great respect for all your work for PostgreSQL, and
especially for your efforts on getting things organized. But it won't
work if you keep increasing my potential punishment until I simply say
that I obey and that you're right about everything. You may even
initiate the procedure of my exclusion from committers (no idea what
the procedure is), a ban from the list, etc. I see you express many
valuable points, but my view is not exactly the same as yours. And I
would like to come to a conclusion as a result of discussion, not
threats.

I feel a sense of blame and fear in the latest discussions, and I
don't like it. It's OK to place blame from time to time. But I would
like to add more joy and respect here (and I'm sorry I personally
didn't do enough in this matter). It's important to get things right,
etc. But in the long term, relationships may mean more.

I am not sure how to respond to this. On a personal level, I am sorry
to hear that you were arrested and, if I can be of some help to you,
we can discuss that off-list. However, if you're suggesting that there
is some kind of equivalence between me criticizing your decisions
about what to commit and someone in a position of authority putting
you in jail, well, I don't think it's remotely fair to compare those
things.

--
Robert Haas
EDB: http://www.enterprisedb.com

#67Peter Geoghegan
In reply to: Robert Haas (#66)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:

That is somewhat fair, but it is also a lot of work. There are
multiple people asking for you to revert things on multiple threads,
and figuring out all of the revert requests and trying to come to some
consensus about what should be done in each case is going to take an
enormous amount of time. I know you've done lots of good work on
PostgreSQL in the past and I respect that, but I think you also have
to realize that you're asking other people to spend a LOT of time
figuring out what to do about the current situation. I see Andres has
posted more specifically about what he thinks should happen to each of
the table AM patches and I am willing to defer to his opinion, but we
need to make some quick decisions here to either keep things or take
them out. Extensive reworks after feature freeze should not be an
option that is on the table; that's what makes it a freeze.

Alexander has been sharply criticized for acting in haste, pushing
work in multiple areas when it was clearly not ready. And that seems
proportionate to me. I agree that he showed poor judgement in the past
few months, and especially in the past few weeks. Not just on one
occasion, but on several. That must have consequences.

I also do not think I really believe that there's been so much stuff
committed that a blanket revert would be all that hard to carry off,
if that were the option that the community ended up preferring.

It seems to me that emotions are running high right now. I think that
it would be a mistake to act in haste when determining next steps.
It's very important, but it's not very urgent.

I've known Alexander for about 15 years. I think that he deserves some
consideration here. Say a week or two, to work through some of the
more complicated issues -- and to take a breather. I just don't see
any upside to rushing through this process, given where we are now.

--
Peter Geoghegan

#68Bruce Momjian
bruce@momjian.us
In reply to: Pavel Borisov (#62)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 05:42:51PM +0400, Pavel Borisov wrote:

Hi, Alexander!
In my view, the actual list of what has raised discussion is:
dd1f6b0c17 Provide a way block-level table AMs could re-use acquire_sample_rows()
27bc1772fc Generalize relation analyze in table AM interface

Proposals to revert the other patches in a wholesale way look to me like an
ill-performed continuation of a discussion [1]. I can't believe that "Let's

For reference, this discussion was:

I don't dispute that we could do better, and this is just a
simplistic look based on "number of commits per day", but the
attached does put it in perspective to some extent.

select which commits close to FF look worse than the others", based on
timing rather than patch contents, is a good and productive way for the
community to proceed.

I don't know how you can say these patches are being questioned just
because they are near the feature freeze (FF). There are clear
concerns, and post-feature freeze is not the time to be evaluating
which of the patches pushed near feature freeze need help.

What is the huge rush for these patches, and if they were so important,
why was this not done earlier? This can all wait until PG 18. If
Supabase or someone else needs these patches for PG 17, they will need
to create a patched version of PG 17 with these patches.

At the same time, if Andres, who is the most experienced person in the scope
of access methods, is willing to give a post-commit re-review of any of
the committed patches and recommends some of them be reverted, that would
be good, sensible input to act on accordingly.

So the patches were rushed, have problems, and now we are requiring
Andres to stop what he is doing to give immediate feedback --- that is
not fair to him.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#69Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#60)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-10 15:19:47 +0300, Alexander Korotkov wrote:

On Mon, Apr 8, 2024 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, it was my mistake. I got rushing trying to fit this to FF, even doing significant changes just before commit.
I'll revert this later today.

The patch to revert is attached. Given that revert touches the work
done in 041b96802e, I think it needs some feedback before push.

Hm. It's a bit annoying to revert it, you're right. I think on its own the
revert looks reasonable from what I've seen so far, will continue looking for
a bit.

I think we'll need to do some cleanup of 041b96802e separately afterwards -
possibly in 17, possibly in 18. Particularly post-27bc1772fc8
acquire_sample_rows() was tied hard to heapam, so it made sense for 041b96802e
to create the stream in acquire_sample_rows() and have
block_sampling_read_stream_next() be in analyze.c. But eventually that should
be in access/heap/. Compared to 16, the state post the revert does tie
analyze.c a bit closer to the internals of the AM than before, but I'm not
sure the increase matters.

Greetings,

Andres Freund

#70Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#69)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 4:03 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2024-04-10 15:19:47 +0300, Alexander Korotkov wrote:

On Mon, Apr 8, 2024 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 8, 2024 at 12:33 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, it was my mistake. I got rushing trying to fit this to FF, even doing significant changes just before commit.
I'll revert this later today.

The patch to revert is attached. Given that revert touches the work
done in 041b96802e, I think it needs some feedback before push.

Hm. It's a bit annoying to revert it, you're right. I think on its own the
revert looks reasonable from what I've seen so far, will continue looking for
a bit.

I think we'll need to do some cleanup of 041b96802e separately afterwards -
possibly in 17, possibly in 18. Particularly post-27bc1772fc8
acquire_sample_rows() was tied hard to heapam, so it made sense for 041b96802e
to create the stream in acquire_sample_rows() and have
block_sampling_read_stream_next() be in analyze.c. But eventually that should
be in access/heap/. Compared to 16, the state post the revert does tie
analyze.c a bit closer to the internals of the AM than before, but I'm not
sure the increase matters.

Yes in an earlier version of 041b96802e, I gave the review feedback
that the read stream should be pushed down into heap-specific code,
but then after 27bc1772fc8, Bilal took the approach of putting the
read stream code in acquire_sample_rows() since that was no longer
table AM-agnostic.

This thread has been moving pretty fast, so could someone point out
which version of the patch has the modifications to
acquire_sample_rows() that would be relevant for Bilal (and others
involved in analyze streaming read) to review? Is it
v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patch?

- Melanie

#71Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#70)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-10 16:24:40 -0400, Melanie Plageman wrote:

This thread has been moving pretty fast, so could someone point out
which version of the patch has the modifications to
acquire_sample_rows() that would be relevant for Bilal (and others
involved in analyze streaming read) to review? Is it
v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patch?

I think so. It's at least what I've been looking at.

Greetings,

Andres Freund

#72Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#71)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 4:33 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2024-04-10 16:24:40 -0400, Melanie Plageman wrote:

This thread has been moving pretty fast, so could someone point out
which version of the patch has the modifications to
acquire_sample_rows() that would be relevant for Bilal (and others
involved in analyze streaming read) to review? Is it
v1-0001-revert-Generalize-relation-analyze-in-table-AM-in.patch?

I think so. It's at least what I've been looking at.

I took a look at this patch, and you're right we will need to do
follow-on work with streaming ANALYZE. The streaming read code will
have to be moved now that acquire_sample_rows() is table-AM agnostic
again.

I don't think there was ever a version that Bilal wrote
where the streaming read code was outside of acquire_sample_rows(). By
the time he got that review feedback, 27bc1772fc8 had gone in.

This brings up a question about the prefetching. We never had to have
this discussion for sequential scan streaming read because it didn't
(and still doesn't) do prefetching. But, if we push the streaming read
code down into the heap AM layer, it will be doing the prefetching.
So, do we remove the prefetching from acquire_sample_rows() and expect
other table AMs to implement it themselves or use the streaming read
API?

- Melanie

#73Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#72)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:

This brings up a question about the prefetching. We never had to have
this discussion for sequential scan streaming read because it didn't
(and still doesn't) do prefetching. But, if we push the streaming read
code down into the heap AM layer, it will be doing the prefetching.
So, do we remove the prefetching from acquire_sample_rows() and expect
other table AMs to implement it themselves or use the streaming read
API?

The prefetching added to acquire_sample_rows was quite narrowly tailored to
something heap-like - it pretty much required block numbers to be 1:1
with the actual physical on-disk location for the specific AM. So I think
it's pretty much required for this to be pushed down.

Using a read stream is a few lines for something like this, so I'm not worried
about it. I guess we could have a default implementation for block based AMs,
similar to what we have around table_block_parallelscan_*, but not sure it's
worth doing that, the complexity is much lower than in the
table_block_parallelscan_ case.
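
For illustration, a rough sketch of what that could look like for a
block-based AM, using read_stream_begin_relation() /
read_stream_next_buffer() (the function and variable names below are
made up for this example, not taken from any committed patch):

/* Hypothetical callback: hand the stream the next sampled block. */
static BlockNumber
block_sampling_stream_next(ReadStream *stream,
                           void *callback_private_data,
                           void *per_buffer_data)
{
    BlockSamplerData *bs = callback_private_data;

    return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
}

/* Hypothetical sampling loop in the AM's analyze code. */
static void
sample_blocks(Relation onerel, BufferAccessStrategy bstrategy,
              BlockSamplerData *bs)
{
    ReadStream *stream;
    Buffer      buf;

    stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
                                        bstrategy,
                                        onerel,
                                        MAIN_FORKNUM,
                                        block_sampling_stream_next,
                                        bs,
                                        0);

    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... collect sample tuples from the pinned buffer ... */
        ReleaseBuffer(buf);
    }

    read_stream_end(stream);
}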

Greetings,

Andres

#74Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#65)
Re: Table AM Interface Enhancements

Hi Andres,

On Wed, Apr 10, 2024 at 7:52 PM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-08 14:54:46 -0400, Robert Haas wrote:

Exactly how much is getting reverted here? I see these, all since March 23rd:

IMO:

dd1f6b0c17 Provide a way block-level table AMs could re-use
acquire_sample_rows()

Should be reverted.

9bd99f4c26 Custom reloptions for table AM

Hm. There are some oddities here:

- It doesn't seem great that relcache.c now needs to know about the default
values for all kinds of reloptions.

- why is there table_reloptions() and tableam_reloptions()?

- Why does extractRelOptions() need a TableAmRoutine parameter, extracted by a
caller, instead of doing that work itself?

97ce821e3e Fix the parameters order for
TableAmRoutine.relation_copy_for_cluster()

Shouldn't be, this is a clear fix.

b1484a3f19 Let table AM insertion methods control index insertion

I'm not sure. I'm not convinced this is right, nor the opposite. If the
tableam takes control of index insertion, shouldn't nodeModifyTable know this
earlier, so it doesn't prepare a bunch of index insertion state? Also,
there's pretty much no motivating explanation in the commit.

27bc1772fc Generalize relation analyze in table AM interface

Should be reverted.

87985cc925 Allow locking updated tuples in tuple_update() and tuple_delete()

Strongly suspect this should be reverted. The last time this was committed it
was far from ready. It's very easy to cause corruption due to subtle bugs in
this area.

c35a3fb5e0 Allow table AM tuple_insert() method to return the different slot

If the AM returns a different slot, who is responsible for cleaning it up? And
how is creating a new slot for every insert not going to be a measurable
overhead?

02eb07ea89 Allow table AM to store complex data structures in rd_amcache

I am doubtful this is right. Is it really sufficient to have a callback for
freeing? What happens when relcache entries are swapped as part of a rebuild?
That works for "flat" caches, but I don't immediately see how it works for
more complicated data structures. At least from the commit message it's hard
to evaluate how this is actually intended to be used.

Thank you for your feedback. I've reverted all of the above.

------
Regards,
Alexander Korotkov

#75Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#73)
Re: Table AM Interface Enhancements

On Wed, Apr 10, 2024 at 5:21 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:

This brings up a question about the prefetching. We never had to have
this discussion for sequential scan streaming read because it didn't
(and still doesn't) do prefetching. But, if we push the streaming read
code down into the heap AM layer, it will be doing the prefetching.
So, do we remove the prefetching from acquire_sample_rows() and expect
other table AMs to implement it themselves or use the streaming read
API?

The prefetching added to acquire_sample_rows was quite narrowly tailored to
something heap-like - it pretty much required block numbers to be 1:1
with the actual physical on-disk location for the specific AM. So I think
it's pretty much required for this to be pushed down.

Using a read stream is a few lines for something like this, so I'm not worried
about it. I guess we could have a default implementation for block based AMs,
similar to what we have around table_block_parallelscan_*, but not sure it's
worth doing that, the complexity is much lower than in the
table_block_parallelscan_ case.

This makes sense.

I am working on pushing streaming ANALYZE into heap AM code, and I ran
into a few roadblocks.

If we want ANALYZE to make the ReadStream object in heap_beginscan()
(like the read stream implementation of heap sequential and TID range
scans do), I don't see any way around changing the scan_begin table AM
callback to take a BufferAccessStrategy at the least (and perhaps also
the BlockSamplerData).

read_stream_begin_relation() doesn't just save the
BufferAccessStrategy in the ReadStream, it uses it to set various
other things in the ReadStream object. callback_private_data (which in
ANALYZE's case is the BlockSamplerData) is simply saved in the
ReadStream, so it could be set later, but that doesn't sound very
clean to me.

As such, it seems like a cleaner alternative would be to add a table
AM callback for creating a read stream object that takes the
parameters of read_stream_begin_relation(). But, perhaps it is a bit
late for such additions.

It also opens us up to the question of whether or not sequential scan
should use such a callback instead of making the read stream object in
heap_beginscan().

I am happy to write a patch that does any of the above. But, I want to
raise these questions, because perhaps I am simply missing an obvious
alternative solution.
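
To make that concrete, a purely hypothetical sketch of such a callback;
the name and signature are invented for this example and nothing like
this exists in the tree:

/* Hypothetical table AM callback type for setting up a read stream. */
typedef ReadStream *(*scan_begin_read_stream_function) (Relation rel,
                                                        int flags,
                                                        BufferAccessStrategy bstrategy,
                                                        ReadStreamBlockNumberCB callback,
                                                        void *callback_private_data,
                                                        size_t per_buffer_data_size);

/* For heap, the implementation could be little more than a wrapper. */
static ReadStream *
heapam_scan_begin_read_stream(Relation rel, int flags,
                              BufferAccessStrategy bstrategy,
                              ReadStreamBlockNumberCB callback,
                              void *callback_private_data,
                              size_t per_buffer_data_size)
{
    return read_stream_begin_relation(flags, bstrategy, rel, MAIN_FORKNUM,
                                      callback, callback_private_data,
                                      per_buffer_data_size);
}

ANALYZE could then pass its block sampler and callback through this and
let the AM pick the fork and flags that make sense for it.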

- Melanie

#76Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Korotkov (#60)
Re: Table AM Interface Enhancements

On Wed, 2024-04-10 at 15:19 +0300, Alexander Korotkov wrote:

1) 9bd99f4c26 comprises the reworked patch after working with notes
from Jeff Davis.  I agree it would be better to wait for him to
express explicit agreement.  Before reverting this, I would prefer to
hear his opinion.

On this particular feature, I had tried it in the past myself, and
there were a number of minor frustrations and I left it unfinished. I
quickly recognized that commit c95c25f9af was too simple to work.

Commit 9bd99f4c26 looked substantially better, but I was surprised to
see it committed so soon after the redesign. I thought a revert was a
likely outcome, but I put it on my list of things to review more deeply
in the next couple weeks so I could give productive feedback.

It would benefit from more discussion in v18, and I apologize for not
getting involved earlier when the patch still could have made it into
v17.

Regards,
Jeff Davis

#77Alexander Korotkov
aekorotkov@gmail.com
In reply to: Jeff Davis (#76)
Re: Table AM Interface Enhancements

On Thu, Apr 11, 2024 at 8:11 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2024-04-10 at 15:19 +0300, Alexander Korotkov wrote:

1) 9bd99f4c26 comprises the reworked patch after working with notes
from Jeff Davis. I agree it would be better to wait for him to
express explicit agreement. Before reverting this, I would prefer to
hear his opinion.

On this particular feature, I had tried it in the past myself, and
there were a number of minor frustrations and I left it unfinished. I
quickly recognized that commit c95c25f9af was too simple to work.

Commit 9bd99f4c26 looked substantially better, but I was surprised to
see it committed so soon after the redesign. I thought a revert was a
likely outcome, but I put it on my list of things to review more deeply
in the next couple weeks so I could give productive feedback.

Thank you for your feedback, Jeff.

It would benefit from more discussion in v18, and I apologize for not
getting involved earlier when the patch still could have made it into
v17.

I believe you don't have to apologize. It's definitely not your fault
that I've committed this patch in this shape.

------
Regards,
Alexander Korotkov

#78Alexander Korotkov
aekorotkov@gmail.com
In reply to: Melanie Plageman (#75)
Re: Table AM Interface Enhancements

Hi!

On Thu, Apr 11, 2024 at 7:19 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Wed, Apr 10, 2024 at 5:21 PM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:

This brings up a question about the prefetching. We never had to have
this discussion for sequential scan streaming read because it didn't
(and still doesn't) do prefetching. But, if we push the streaming read
code down into the heap AM layer, it will be doing the prefetching.
So, do we remove the prefetching from acquire_sample_rows() and expect
other table AMs to implement it themselves or use the streaming read
API?

The prefetching added to acquire_sample_rows was quite narrowly tailored to
something heap-like - it pretty much required block numbers to be 1:1
with the actual physical on-disk location for the specific AM. So I think
it's pretty much required for this to be pushed down.

Using a read stream is a few lines for something like this, so I'm not worried
about it. I guess we could have a default implementation for block based AMs,
similar to what we have around table_block_parallelscan_*, but not sure it's
worth doing that, the complexity is much lower than in the
table_block_parallelscan_ case.

This makes sense.

I am working on pushing streaming ANALYZE into heap AM code, and I ran
into a few roadblocks.

If we want ANALYZE to make the ReadStream object in heap_beginscan()
(like the read stream implementation of heap sequential and TID range
scans do), I don't see any way around changing the scan_begin table AM
callback to take a BufferAccessStrategy at the least (and perhaps also
the BlockSamplerData).

read_stream_begin_relation() doesn't just save the
BufferAccessStrategy in the ReadStream, it uses it to set various
other things in the ReadStream object. callback_private_data (which in
ANALYZE's case is the BlockSamplerData) is simply saved in the
ReadStream, so it could be set later, but that doesn't sound very
clean to me.

As such, it seems like a cleaner alternative would be to add a table
AM callback for creating a read stream object that takes the
parameters of read_stream_begin_relation(). But, perhaps it is a bit
late for such additions.

It also opens us up to the question of whether or not sequential scan
should use such a callback instead of making the read stream object in
heap_beginscan().

I am happy to write a patch that does any of the above. But, I want to
raise these questions, because perhaps I am simply missing an obvious
alternative solution.

I understand that I'm the bad guy of this release, not sure if my
opinion counts.

But what is going on here? I hope this work is targeting pg18.
Otherwise, do I get this right that this post-feature-freeze work is about
designing a new API? Yes, 27bc1772fc masked the problem. But it was
committed on Mar 30. So that couldn't justify why the proper API
wasn't designed in time. Are we judging different commits with the
same criteria?

IMHO, 041b96802e should be just reverted.

------
Regards,
Alexander Korotkov

#79Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#75)
Re: Table AM Interface Enhancements

On Thu, Apr 11, 2024 at 12:19 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Wed, Apr 10, 2024 at 5:21 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2024-04-10 16:50:44 -0400, Melanie Plageman wrote:

This brings up a question about the prefetching. We never had to have
this discussion for sequential scan streaming read because it didn't
(and still doesn't) do prefetching. But, if we push the streaming read
code down into the heap AM layer, it will be doing the prefetching.
So, do we remove the prefetching from acquire_sample_rows() and expect
other table AMs to implement it themselves or use the streaming read
API?

The prefetching added to acquire_sample_rows was quite narrowly tailored to
something heap-like - it pretty much required block numbers to be 1:1
with the actual physical on-disk location for the specific AM. So I think
it's pretty much required for this to be pushed down.

Using a read stream is a few lines for something like this, so I'm not worried
about it. I guess we could have a default implementation for block based AMs,
similar to what we have around table_block_parallelscan_*, but not sure it's
worth doing that, the complexity is much lower than in the
table_block_parallelscan_ case.

This makes sense.

I am working on pushing streaming ANALYZE into heap AM code, and I ran
into a few roadblocks.

If we want ANALYZE to make the ReadStream object in heap_beginscan()
(like the read stream implementation of heap sequential and TID range
scans do), I don't see any way around changing the scan_begin table AM
callback to take a BufferAccessStrategy at the least (and perhaps also
the BlockSamplerData).

I will also say that, had this been 6 months ago, I would probably
suggest we restructure ANALYZE's table AM interface to accommodate
read stream setup and to address a few other things I find odd about
the current code. For example, I think creating a scan descriptor for
the analyze scan in acquire_sample_rows() is quite odd. It seems like
it would be better done in the relation_analyze callback. The
relation_analyze callback saves some state like the callbacks for
acquire_sample_rows() and the Buffer Access Strategy. But at least in
the heap implementation, it just saves them in static variables in
analyze.c. It seems like it would be better to save them in a useful
data structure that could be accessed later. We have access to pretty
much everything we need at that point (in the relation_analyze
callback). I also think heap's implementation of
table_beginscan_analyze() doesn't need most of
heap_beginscan()/initscan(), so doing this instead of something
ANALYZE specific seems more confusing than helpful.
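
As a strawman, the kind of structure I have in mind would be something
like the following (the struct and field names are invented here, this
is not an actual proposal in patch form):

/*
 * Hypothetical: state that relation_analyze could hand back to ANALYZE
 * instead of stashing it in static variables in analyze.c.
 */
typedef struct AnalyzeScanState
{
    AcquireSampleRowsFunc acquire_sample_rows;  /* AM's sampling function */
    BufferAccessStrategy  bstrategy;            /* strategy to use for reads */
    BlockNumber           totalblocks;          /* relation size at ANALYZE time */
    void                 *am_private;           /* room for AM-specific state */
} AnalyzeScanState;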

- Melanie

#80Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#78)
Re: Table AM Interface Enhancements

On Thu, Apr 11, 2024 at 1:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I understand that I'm the bad guy of this release, not sure if my
opinion counts.

But what is going on here? I hope this work is targeting pg18.
Otherwise, do I get this right that this post-feature-freeze work is about
designing a new API? Yes, 27bc1772fc masked the problem. But it was
committed on Mar 30. So that couldn't justify why the proper API
wasn't designed in time. Are we judging different commits with the
same criteria?

I mean, Andres already said that the cleanup was needed possibly in
17, and possibly in 18.

As far as fairness is concerned, you'll get no argument from me if you
say the streaming read stuff was all committed far later than it
should have been. I said that in the very first email I wrote on the
"post-feature freeze cleanup" thread. But if you're going to argue
that there's no opportunity for anyone to adjust patches that were
sideswiped by the reverts of your patches, and that if any such
adjustments seem advisable we should just revert the sideswiped
patches entirely, I don't agree with that, and I don't see why anyone
would agree with that. I think it's fine to have the discussion, and
if the result of that discussion is that somebody says "hey, we want
to do X in 17 for reason Y," then we can discuss that proposal on its
merits, taking into account the answers to questions like "why wasn't
this done before the freeze?" and "is that adjustment more or less
risky than just reverting?" and "how about we just leave it alone for
now and deal with it next release?".

IMHO, 041b96802e should be just reverted.

IMHO, it's too early to decide that, because we don't know what change
concretely is going to be proposed, and there has been no discussion
of why that change, whatever it is, belongs in this release or next
release.

I understand that you're probably not feeling great about being asked
to revert a bunch of stuff here, and I do think it is a fair point to
make that we need to be even-handed and not overreact. Just because
you had some patches that had some problems doesn't mean that
everything that got touched by the reverts can or should be whacked
around a whole bunch more post-freeze, especially since that stuff was
*also* committed very late, in haste, way closer to feature freeze
than it should have been. At the same time, it's also important to
keep in mind that our goal here is not to punish people for being bad,
or to reward them for being good, or really to make any moral
judgements at all, but to produce a quality release. I'm sure that,
where possible, you'd prefer to fix bugs in a patch you committed
rather than revert the whole thing as soon as anyone finds any
problem. I would also prefer that, both for your patches, and for
mine. And everyone else deserves that same consideration.

--
Robert Haas
EDB: http://www.enterprisedb.com

#81Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#78)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:

I hope this work is targeting pg18.

I think anything of the scope discussed by Melanie would be very clearly
targeting 18. For 17, I don't know yet whether we should revert the
ANALYZE streaming read user (041b96802ef), just do a bit of comment polishing,
or some other small change.

One oddity is that before 041b96802ef, the opportunities for making the
interface cleaner were less apparent, because c6fc50cb4028 increased the
coupling between analyze.c and the way the table storage works.

Otherwise, do I get this right that this post feature-freeze works on
designing a new API? Yes, 27bc1772fc masked the problem. But it was
committed on Mar 30.

Note that there were versions of the patch that were targeting the
pre-27bc1772fc interface.

Greetings,

Andres Freund

#82Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#81)
Re: Table AM Interface Enhancements

On Fri, Apr 12, 2024 at 12:04 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:

I hope this work is targeting pg18.

I think anything of the scope discussed by Melanie would be very clearly
targeting 18. For 17, I don't know yet whether we should revert the
ANALYZE streaming read user (041b96802ef), just do a bit of comment polishing,
or some other small change.

One oddity is that before 041b96802ef, the opportunities for making the
interface cleaner were less apparent, because c6fc50cb4028 increased the
coupling between analyze.c and the way the table storage works.

Thank you for pointing this out about c6fc50cb4028, I've missed this.

Otherwise, do I get this right that this post feature-freeze works on
designing a new API? Yes, 27bc1772fc masked the problem. But it was
committed on Mar 30.

Note that there were versions of the patch that were targeting the
pre-27bc1772fc interface.

Sure, I've checked this before writing. It looks quite similar to the
result of applying my revert patch [1] to the head.

Let me describe my view of the current situation.

1) If we just apply my revert patch and leave c6fc50cb4028 and
041b96802ef in the tree, then our table AM API gets narrowed. As
you pointed out, the current API requires block numbers to be 1:1 with
the actual physical on-disk location [2]. It's no secret that I think the
current API is quite restrictive. And we're making the ANALYZE
interface narrower than it has been since 737a292b5de. Frankly speaking, I
don't think this is acceptable.

2) Pushing down the read stream and prefetch to heap AM runs into
difficulties [3], [4]. That's quite a significant piece of work to be
done post FF.

Given all of the above, is the in-tree state really that bad (if we
set aside the way 27bc1772fc and dd1f6b0c17 were committed)?

The in-tree state provides quite a general API for ANALYZE, supporting
even non-block storage. There is a way to reuse the existing
acquire_sample_rows() for table AMs that have block numbers 1:1 with
the actual physical on-disk location. It requires some cleanup of
comments and docs, but does not require us to redesign the API post
FF.

Links.
1. /messages/by-id/CAPpHfdvuT6DnguzaV-M1UQ2whYGDojaNU=-=iHc0A7qo9HBEJw@mail.gmail.com
2. /messages/by-id/20240410212117.mxsldz2w6htrl36v@awork3.anarazel.de
3. /messages/by-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG+ROg@mail.gmail.com
4. /messages/by-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY=0R_oG5LHBH7Gw@mail.gmail.com

------
Regards,
Alexander Korotkov

#83Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#80)
Re: Table AM Interface Enhancements

On Thu, Apr 11, 2024 at 11:27 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 11, 2024 at 1:46 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

I understand that I'm the bad guy of this release, not sure if my
opinion counts.

But what is going on here? I hope this work is targeting pg18.
Otherwise, do I get this right that this post-feature-freeze work is about
designing a new API? Yes, 27bc1772fc masked the problem. But it was
committed on Mar 30. So that couldn't justify why the proper API
wasn't designed in time. Are we judging different commits with the
same criteria?

I mean, Andres already said that the cleanup was needed possibly in
17, and possibly in 18.

As far as fairness is concerned, you'll get no argument from me if you
say the streaming read stuff was all committed far later than it
should have been. I said that in the very first email I wrote on the
"post-feature freeze cleanup" thread. But if you're going to argue
that there's no opportunity for anyone to adjust patches that were
sideswiped by the reverts of your patches, and that if any such
adjustments seem advisable we should just revert the sideswiped
patches entirely, I don't agree with that, and I don't see why anyone
would agree with that. I think it's fine to have the discussion, and
if the result of that discussion is that somebody says "hey, we want
to do X in 17 for reason Y," then we can discuss that proposal on its
merits, taking into account the answers to questions like "why wasn't
this done before the freeze?" and "is that adjustment more or less
risky than just reverting?" and "how about we just leave it alone for
now and deal with it next release?".

I don't think 041b96802e could be sideswiped by 27bc1772fc. The "Use
streaming I/O in ANALYZE" patch had the same issue before 27bc1772fc,
which was committed on Mar 30. So, in the worst case, 27bc1772fc
steals a week of work. I can imagine that without 27bc1772fc a new API
could have been proposed days before FF. This means I saved the patch
authors from what you called, in my case, a "desperate rush". Huh!

IMHO, 041b96802e should be just reverted.

IMHO, it's too early to decide that, because we don't know what change
concretely is going to be proposed, and there has been no discussion
of why that change, whatever it is, belongs in this release or next
release.

I understand that you're probably not feeling great about being asked
to revert a bunch of stuff here, and I do think it is a fair point to
make that we need to be even-handed and not overreact. Just because
you had some patches that had some problems doesn't mean that
everything that got touched by the reverts can or should be whacked
around a whole bunch more post-freeze, especially since that stuff was
*also* committed very late, in haste, way closer to feature freeze
than it should have been. At the same time, it's also important to
keep in mind that our goal here is not to punish people for being bad,
or to reward them for being good, or really to make any moral
judgements at all, but to produce a quality release. I'm sure that,
where possible, you'd prefer to fix bugs in a patch you committed
rather than revert the whole thing as soon as anyone finds any
problem. I would also prefer that, both for your patches, and for
mine. And everyone else deserves that same consideration.

I expressed my thoughts about producing a better release without a
desperate rush post-FF in my reply to Andres [2].

Links.
1. /messages/by-id/CA+TgmobZUnJQaaGkuoeo22Sydf9=mX864W11yZKd6sv-53-aEQ@mail.gmail.com
2. /messages/by-id/CAPpHfdt+cCj6j6cR5AyBThP6SyDf6wxAz4dU-0NdXjfpiFca7Q@mail.gmail.com

------
Regards,
Alexander Korotkov

#84Melanie Plageman
melanieplageman@gmail.com
In reply to: Alexander Korotkov (#82)
Re: Table AM Interface Enhancements

On Thu, Apr 11, 2024 at 6:04 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Apr 12, 2024 at 12:04 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:

I hope this work is targeting pg18.

I think anything of the scope discussed by Melanie would be very clearly
targeting 18. For 17, I don't know yet whether we should revert the
ANALYZE streaming read user (041b96802ef), just do a bit of comment polishing,
or some other small change.

One oddity is that before 041b96802ef, the opportunities for making the
interface cleaner were less apparent, because c6fc50cb4028 increased the
coupling between analyze.c and the way the table storage works.

Thank you for pointing this out about c6fc50cb4028, I've missed this.

Otherwise, do I get this right that this post feature-freeze works on
designing a new API? Yes, 27bc1772fc masked the problem. But it was
committed on Mar 30.

Note that there were versions of the patch that were targeting the
pre-27bc1772fc interface.

Sure, I've checked this before writing. It looks quite similar to the
result of applying my revert patch [1] to the head.

Let me describe my view of the current situation.

1) If we just apply my revert patch and leave c6fc50cb4028 and
041b96802ef in the tree, then our table AM API gets narrowed. As
you pointed out, the current API requires block numbers to be 1:1 with
the actual physical on-disk location [2]. It's no secret that I think the
current API is quite restrictive. And we're making the ANALYZE
interface narrower than it has been since 737a292b5de. Frankly speaking, I
don't think this is acceptable.

2) Pushing down the read stream and prefetch to heap AM runs into
difficulties [3], [4]. That's quite a significant piece of work to be
done post FF.

I had operated under the assumption that we needed to push the
streaming read code into heap AM because that is what we did for
sequential scan, but now that I think about it, I don't see why we
would have to. Bilal's patch pre-27bc1772fc did not do this. But I
think the code in acquire_sample_rows() isn't more tied to heap AM
after 041b96802ef than it was before it. Are you of the opinion that
the code with 041b96802ef ties acquire_sample_rows() more closely to
heap format?

- Melanie

#85Alexander Korotkov
aekorotkov@gmail.com
In reply to: Melanie Plageman (#84)
Re: Table AM Interface Enhancements

Hi, Melanie!

On Fri, Apr 12, 2024 at 8:48 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, Apr 11, 2024 at 6:04 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Apr 12, 2024 at 12:04 AM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-11 20:46:02 +0300, Alexander Korotkov wrote:

I hope this work is targeting pg18.

I think anything of the scope discussed by Melanie would be very clearly
targeting 18. For 17, I don't know yet whether we should revert the
ANALYZE streaming read user (041b96802ef), just do a bit of comment polishing,
or some other small change.

One oddity is that before 041b96802ef, the opportunities for making the
interface cleaner were less apparent, because c6fc50cb4028 increased the
coupling between analyze.c and the way the table storage works.

Thank you for pointing this out about c6fc50cb4028, I've missed this.

Otherwise, do I get this right that this post feature-freeze works on
designing a new API? Yes, 27bc1772fc masked the problem. But it was
committed on Mar 30.

Note that there were versions of the patch that were targeting the
pre-27bc1772fc interface.

Sure, I've checked this before writing. It looks quite similar to the
result of applying my revert patch [1] to the head.

Let me describe my view over the current situation.

1) If we just apply my revert patch and leave c6fc50cb4028 and
041b96802ef in the tree, then we get our table AM API narrowed. As
you expressed, the current API requires block numbers to be 1:1 with
the actual physical on-disk location [2]. It's no secret that I think
the current API is quite restrictive. And we're making the ANALYZE
interface narrower than it has been since 737a292b5de. Frankly
speaking, I don't think this is acceptable.

2) Pushing down the read stream and prefetch to heap AM runs into
difficulties [3], [4]. That's quite a significant piece of work to be
done post FF.

I had operated under the assumption that we needed to push the
streaming read code into heap AM because that is what we did for
sequential scan, but now that I think about it, I don't see why we
would have to. Bilal's patch pre-27bc1772fc did not do this. But I
think the code in acquire_sample_rows() isn't more tied to heap AM
after 041b96802ef than it was before it. Are you of the opinion that
the code with 041b96802ef ties acquire_sample_rows() more closely to
heap format?

Yes, I think so. Table AM API deals with TIDs and block numbers, but
doesn't force on what they actually mean. For example, in ZedStore
[1], data is stored on per-column B-trees, where TID used in table AM
is just a logical key of that B-trees. Similarly, blockNumber is a
range for B-trees.

c6fc50cb4028 and 041b96802ef are putting to acquire_sample_rows() an
assumption that we are sampling physical blocks as they are stored in
data files. That couldn't anymore be some "logical" block numbers
with meaning only table AM implementation knows. That was pointed out
by Andres [2]. I'm not sure if ZedStore is alive, but there could be
other table AM implementations like this, or other implementations in
development, etc. Anyway, I don't feel good about narrowing the API,
which is there from pg12.

Links.
1. https://www.pgcon.org/events/pgcon_2020/sessions/session/44/slides/13/Zedstore-PGCon2020-Virtual.pdf
2. /messages/by-id/20240410212117.mxsldz2w6htrl36v@awork3.anarazel.de

------
Regards,
Alexander Korotkov

#86Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#85)
Re: Table AM Interface Enhancements

On Sat, Apr 13, 2024 at 5:28 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, I think so. Table AM API deals with TIDs and block numbers, but
doesn't force on what they actually mean. For example, in ZedStore
[1], data is stored on per-column B-trees, where TID used in table AM
is just a logical key of that B-trees. Similarly, blockNumber is a
range for B-trees.

c6fc50cb4028 and 041b96802ef are putting to acquire_sample_rows() an
assumption that we are sampling physical blocks as they are stored in
data files. That couldn't anymore be some "logical" block numbers
with meaning only table AM implementation knows. That was pointed out
by Andres [2]. I'm not sure if ZedStore is alive, but there could be
other table AM implementations like this, or other implementations in
development, etc. Anyway, I don't feel good about narrowing the API,
which is there from pg12.

I spent some time looking at this. I think it's valid to complain
about the tighter coupling, but c6fc50cb4028 is there starting in v14,
so I don't think I understand why the situation after 041b96802ef is
materially worse than what we've had for the last few releases. I
think it is worse in the sense that, before, you could dodge the
problem without defining USE_PREFETCH, and now you can't, but I don't
think we can regard nonphysical block numbers as a supported scenario
on that basis.

But maybe I'm not correctly understanding the situation?

--
Robert Haas
EDB: http://www.enterprisedb.com

#87Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Robert Haas (#86)
Re: Table AM Interface Enhancements

On Mon, 15 Apr 2024 at 19:36, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Apr 13, 2024 at 5:28 AM Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Yes, I think so. Table AM API deals with TIDs and block numbers, but
doesn't force on what they actually mean. For example, in ZedStore
[1], data is stored on per-column B-trees, where TID used in table AM
is just a logical key of that B-trees. Similarly, blockNumber is a
range for B-trees.

c6fc50cb4028 and 041b96802ef are putting to acquire_sample_rows() an
assumption that we are sampling physical blocks as they are stored in
data files. That couldn't anymore be some "logical" block numbers
with meaning only table AM implementation knows. That was pointed out
by Andres [2]. I'm not sure if ZedStore is alive, but there could be
other table AM implementations like this, or other implementations in
development, etc. Anyway, I don't feel good about narrowing the API,
which is there from pg12.

I spent some time looking at this. I think it's valid to complain
about the tighter coupling, but c6fc50cb4028 is there starting in v14,
so I don't think I understand why the situation after 041b96802ef is
materially worse than what we've had for the last few releases. I
think it is worse in the sense that, before, you could dodge the
problem without defining USE_PREFETCH, and now you can't, but I don't
think we can regard nonphysical block numbers as a supported scenario
on that basis.

But maybe I'm not correctly understanding the situation?

Hi, Robert!

In my understanding, the downside of 041b96802ef is bringing read_stream*
things from being heap-only-related up to the level
of acquire_sample_rows() that is not supposed to be tied to heap. And
changing *_analyze_next_block() function signature to use ReadStream
explicitly in the signature.

Regards,
Pavel.

#88Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Robert Haas (#86)
Re: Table AM Interface Enhancements

Hi,

On Mon, 15 Apr 2024 at 18:36, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Apr 13, 2024 at 5:28 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Yes, I think so. Table AM API deals with TIDs and block numbers, but
doesn't force on what they actually mean. For example, in ZedStore
[1], data is stored on per-column B-trees, where TID used in table AM
is just a logical key of that B-trees. Similarly, blockNumber is a
range for B-trees.

c6fc50cb4028 and 041b96802ef are putting to acquire_sample_rows() an
assumption that we are sampling physical blocks as they are stored in
data files. That couldn't anymore be some "logical" block numbers
with meaning only table AM implementation knows. That was pointed out
by Andres [2]. I'm not sure if ZedStore is alive, but there could be
other table AM implementations like this, or other implementations in
development, etc. Anyway, I don't feel good about narrowing the API,
which is there from pg12.

I spent some time looking at this. I think it's valid to complain
about the tighter coupling, but c6fc50cb4028 is there starting in v14,
so I don't think I understand why the situation after 041b96802ef is
materially worse than what we've had for the last few releases. I
think it is worse in the sense that, before, you could dodge the
problem without defining USE_PREFETCH, and now you can't, but I don't
think we can regard nonphysical block numbers as a supported scenario
on that basis.

I agree with you but I did not understand one thing. If out-of-core
AMs are used, does not all block sampling logic (BlockSampler_Init(),
BlockSampler_Next() etc.) need to be edited as well since these
functions assume block numbers are actual physical on-disk locations,
right? I mean if the block number is something different than the
actual physical on-disk location, the acquire_sample_rows() function
looks wrong to me before c6fc50cb4028 as well.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#89Robert Haas
robertmhaas@gmail.com
In reply to: Pavel Borisov (#87)
Re: Table AM Interface Enhancements

On Mon, Apr 15, 2024 at 12:37 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote:

In my understanding, the downside of 041b96802ef is bringing read_stream* things from being heap-only-related up to the level of acquire_sample_rows() that is not supposed to be tied to heap. And changing *_analyze_next_block() function signature to use ReadStream explicitly in the signature.

I don't think that really clarifies anything. The ReadStream is
basically just acting as a wrapper for a stream of block numbers, and
the API took a BlockNumber before. So why does it make any difference?

If I understand correctly, Alexander thinks that, before 041b96802ef,
the block number didn't necessarily have to be the physical block
number on disk, but could instead be any 32-bit quantity that the
table AM wanted to pack into the block number. But I don't think
that's true, because acquire_sample_rows() was already passing those
block numbers to PrefetchBuffer(), which already requires physical
block numbers.
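
For reference, the pre-041b96802ef code already did essentially this under
USE_PREFETCH (a simplified sketch; the variable names are illustrative, not
the exact ones from analyze.c):

#ifdef USE_PREFETCH
    /* block numbers handed straight to the buffer manager */
    PrefetchBuffer(onerel, MAIN_FORKNUM, prefetch_targblock);
#endif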

--
Robert Haas
EDB: http://www.enterprisedb.com

#90Robert Haas
robertmhaas@gmail.com
In reply to: Nazir Bilal Yavuz (#88)
Re: Table AM Interface Enhancements

On Mon, Apr 15, 2024 at 12:41 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I agree with you but I did not understand one thing. If out-of-core
AMs are used, does not all block sampling logic (BlockSampler_Init(),
BlockSampler_Next() etc.) need to be edited as well since these
functions assume block numbers are actual physical on-disk locations,
right? I mean if the block number is something different than the
actual physical on-disk location, the acquire_sample_rows() function
looks wrong to me before c6fc50cb4028 as well.

Yes, this is also a problem with trying to use non-physical block
numbers. We can hypothesize an AM where it works out OK in practice,
say because there are always exactly the same number of logical block
numbers as there are physical block numbers. Or, because there are
always more logical block numbers than physical block numbers, but for
some reason the table AM author doesn't care because they believe that
in the target use case for their AM the data distribution will be
sufficiently uniform that sampling only low-numbered blocks won't
really hurt anything.

But that does seem a bit strained. In practice, I suspect that table
AMs that use logical block numbers might want to replace this line
from acquire_sample_rows() with a call to a tableam method that
returns the number of logical blocks:

totalblocks = RelationGetNumberOfBlocks(onerel);

But even that does not seem like enough, because my guess would be
that a lot of table AMs would end up with a sparse logical block
space. For instance, you might create a logical block number sequence
that starts at 0 and just counts up towards 2^32 and eventually either
wraps around or errors out. Each new tuple gets the next TID that
isn't yet used. Well, what's going to happen eventually in a lot of
workloads is that the low-numbered logical blocks are going to be
mostly or entirely empty, and the data is going to be clustered in the
ones that are nearer to the highest logical block number that's so far
been assigned. So, then, as you say, you'd want to replace the whole
BlockSampler thing entirely.
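
(To make that concrete, here's a rough sketch of what such a hypothetical
callback might look like -- there is no relation_nblocks in TableAmRoutine
today, the names are made up:)

/* hypothetical TableAmRoutine member, illustration only */
BlockNumber (*relation_nblocks) (Relation rel);

/* analyze.c could then ask the AM instead of the storage manager: */
if (onerel->rd_tableam->relation_nblocks != NULL)
    totalblocks = onerel->rd_tableam->relation_nblocks(onerel);
else
    totalblocks = RelationGetNumberOfBlocks(onerel);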

That said, I find it a little bit hard to know what people are already
doing or realistically might try to do with table AMs. If somebody
says they have a table AM where the number of logical block numbers
equals the number of physical block numbers (or is somewhat larger but
in a way that doesn't really matter) and the existing block sampling
logic works well enough, I can't really disprove that. It puts awfully
tight limits on what the AM can be doing, but, OK, sometimes people
want to develop AMs for very specific purposes. However, because of
the prefetching thing, I think even that fairly narrow use case was
already broken before 041b96802efa33d2bc9456f2ad946976b92b5ae1. So I
just don't really see how that commit made anything worse in any way
that really matters.

But maybe it did. People often find extremely creative ways of working
around the limitations of the core interfaces. I think it could be the
case that someone found a clever way of dodging all of these problems
and had something that was working well enough that they were happy
with it, and now they can't make it work after the changes for some
reason. If that someone is reading this thread and wants to spell that
out, we can consider whether there's some relief that we could give to
that person, *especially* if they can demonstrate that they raised the
alarm before the commit went in. But in the absence of that, my
current belief is that nonphysical block numbers were never a
supported scenario; hence, the idea that
041b96802efa33d2bc9456f2ad946976b92b5ae1 should be reverted for
de-supporting them ought to be rejected.

--
Robert Haas
EDB: http://www.enterprisedb.com

#91Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Robert Haas (#89)
Re: Table AM Interface Enhancements

On Mon, 15 Apr 2024 at 22:09, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 15, 2024 at 12:37 PM Pavel Borisov <pashkin.elfe@gmail.com>
wrote:

In my understanding, the downside of 041b96802ef is bringing
read_stream* things from being heap-only-related up to the level of
acquire_sample_rows() that is not supposed to be tied to heap. And changing
*_analyze_next_block() function signature to use ReadStream explicitly in
the signature.

I don't think that really clarifies anything. The ReadStream is
basically just acting as a wrapper for a stream of block numbers, and
the API took a BlockNumber before. So why does it make any difference?

If I understand correctly, Alexander thinks that, before 041b96802ef,
the block number didn't necessarily have to be the physical block
number on disk, but could instead be any 32-bit quantity that the
table AM wanted to pack into the block number. But I don't think
that's true, because acquire_sample_rows() was already passing those
block numbers to PrefetchBuffer(), which already requires physical
block numbers.

Hi, Robert!

Why it makes a difference is a little bit unclear to me, so I can't comment
on this. I noticed that before 041b96802ef we had a block number and block
sampler state that tied acquire_sample_rows() to the actual block
structure. After we have the whole struct ReadStream which doesn't comprise
just a wrapper for the same variables, but the state that ties
acquire_sample_rows() to the streaming read algorithm (and heap). Yes, we
don't have other access methods other than heap implemented for analyze
routine, so the patch works anyway, but from the view on
acquire_sample_rows() as a general method that is intended to have
different implementations in the future it doesn't look good.

It's my impression on 041b96802ef, please forgive me if I haven't
understood something.

Regards,
Pavel Borisov
Supabase

#92Andres Freund
andres@anarazel.de
In reply to: Pavel Borisov (#91)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-15 23:14:01 +0400, Pavel Borisov wrote:

Why it makes a difference is a little bit unclear to me, so I can't comment
on this. I noticed that before 041b96802ef we had a block number and block
sampler state that tied acquire_sample_rows() to the actual block
structure.

That, and the prefetch calls actually translating the block numbers 1:1 to
physical locations within the underlying file.

And before 041b96802ef they were tied much more closely by the direct calls to
heapam added in 27bc1772fc81.

After we have the whole struct ReadStream which doesn't comprise just a
wrapper for the same variables, but the state that ties
acquire_sample_rows() to the streaming read algorithm (and heap).

Yes ... ? I don't see how that is a meaningful difference to the state as of
27bc1772fc81. Nor fundamentally worse than the state 27bc1772fc81^, given
that we already issued requests for specific blocks in the file.

That said, I don't like the state after applying
/messages/by-id/CAPpHfdvuT6DnguzaV-M1UQ2whYGDojaNU=-=iHc0A7qo9HBEJw@mail.gmail.com
because there's too much coupling. Hence talking about needing to iterate on
the interface in some form, earlier in the thread.

What are you actually arguing for here?

Greetings,

Andres Freund

#93Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#92)
Re: Table AM Interface Enhancements

On Mon, Apr 15, 2024 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:

That said, I don't like the state after applying
/messages/by-id/CAPpHfdvuT6DnguzaV-M1UQ2whYGDojaNU=-=iHc0A7qo9HBEJw@mail.gmail.com
because there's too much coupling. Hence talking about needing to iterate on
the interface in some form, earlier in the thread.

Mmph, I can't follow what the actual state of things is here. Are we
waiting for Alexander to push that patch? Is he waiting for somebody
to sign off on that patch? Do you want that patch applied, not
applied, or applied with some set of modifications?

I find the discussion of "too much coupling" too abstract. I want to
get down to specific proposals for what we should change, or not
change.

--
Robert Haas
EDB: http://www.enterprisedb.com

#94Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#82)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-12 01:04:03 +0300, Alexander Korotkov wrote:

1) If we just apply my revert patch and leave c6fc50cb4028 and
041b96802ef in the tree, then we get our table AM API narrowed. As
you expressed, the current API requires block numbers to be 1:1 with
the actual physical on-disk location [2]. It's no secret that I think
the current API is quite restrictive. And we're making the ANALYZE
interface narrower than it has been since 737a292b5de. Frankly
speaking, I don't think this is acceptable.

As others already pointed out, c6fc50cb4028 was committed quite a while
ago. I'm fairly unhappy about c6fc50cb4028, fwiw, but didn't realize that
until it was too late.

In light of all of the above, is the in-tree state that bad? (if we
abstract away from the way 27bc1772fc and dd1f6b0c17 were committed).

To me the 27bc1772fc doesn't make much sense on its own. You added calls
directly to heapam internals to a file in src/backend/commands/; that just
doesn't make sense.

Leaving that aside, I think the interface isn't good on its own:
table_relation_analyze() doesn't actually do anything, it just sets callbacks,
that then later are called from analyze.c, which doesn't at all fit to the
name of the callback/function. I realize that this is kinda cribbed from the
FDW code, but I don't think that is a particularly good excuse.

I don't think dd1f6b0c17 improves the situation, at all. It sets global
variables to redirect how an individual acquire_sample_rows invocation
works:
void
block_level_table_analyze(Relation relation,
                          AcquireSampleRowsFunc *func,
                          BlockNumber *totalpages,
                          BufferAccessStrategy bstrategy,
                          ScanAnalyzeNextBlockFunc scan_analyze_next_block_cb,
                          ScanAnalyzeNextTupleFunc scan_analyze_next_tuple_cb)
{
    *func = acquire_sample_rows;
    *totalpages = RelationGetNumberOfBlocks(relation);
    vac_strategy = bstrategy;
    scan_analyze_next_block = scan_analyze_next_block_cb;
    scan_analyze_next_tuple = scan_analyze_next_tuple_cb;
}

Notably it does so within the ->relation_analyze tableam callback, which does
*NOT* actually do anything other than returning a callback. So if
->relation_analyze() for another relation is called, the acquire_sample_rows()
for the earlier relation will do something different. Note that this isn't a
theoretical risk, acquire_inherited_sample_rows() actually collects the
acquirefunc for all the inherited relations before calling acquirefunc.
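
To spell the hazard out, the shape of the code is roughly this (not the
exact analyze.c code, just a sketch):

/* acquire_inherited_sample_rows(), roughly: */
foreach(lc, childOIDs)
{
    Relation    childrel = table_open(lfirst_oid(lc), AccessShareLock);

    /* each call overwrites the file-level scan_analyze_next_* globals */
    table_relation_analyze(childrel, &acquirefuncs[nrels],
                           &relblocks[nrels], vac_strategy);
    nrels++;
}

/*
 * Only afterwards are the collected acquirefuncs actually called, at which
 * point the globals still describe the *last* child's callbacks.
 */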

This is honestly leaving me somewhat speechless.

Greetings,

Andres Freund

#95Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#93)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-15 16:02:00 -0400, Robert Haas wrote:

On Mon, Apr 15, 2024 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:

That said, I don't like the state after applying
/messages/by-id/CAPpHfdvuT6DnguzaV-M1UQ2whYGDojaNU=-=iHc0A7qo9HBEJw@mail.gmail.com
because there's too much coupling. Hence talking about needing to iterate on
the interface in some form, earlier in the thread.

Mmph, I can't follow what the actual state of things is here. Are we
waiting for Alexander to push that patch? Is he waiting for somebody
to sign off on that patch?

I think Alexander is arguing that we shouldn't revert 27bc1772fc & dd1f6b0c17
in 17. I already didn't think that was an option, because I didn't like the
added interfaces, but now am even more certain, given how broken dd1f6b0c17
seems to be:
/messages/by-id/20240415201057.khoyxbwwxfgzomeo@awork3.anarazel.de

Do you want that patch applied, not applied, or applied with some set of
modifications?

I think we should apply Alexander's proposed revert and then separately
discuss what we should do about 041b96802ef.

I find the discussion of "too much coupling" too abstract. I want to
get down to specific proposals for what we should change, or not
change.

I think it's a bit hard to propose something concrete until we've decided
whether we'll revert 27bc1772fc & dd1f6b0c17.

Greetings,

Andres Freund

#96Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#94)
Re: Table AM Interface Enhancements

On Mon, Apr 15, 2024 at 11:11 PM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-12 01:04:03 +0300, Alexander Korotkov wrote:

1) If we just apply my revert patch and leave c6fc50cb4028 and
041b96802ef in the tree, then we get our table AM API narrowed. As
you expressed, the current API requires block numbers to be 1:1 with
the actual physical on-disk location [2]. It's no secret that I think
the current API is quite restrictive. And we're making the ANALYZE
interface narrower than it has been since 737a292b5de. Frankly
speaking, I don't think this is acceptable.

As others already pointed out, c6fc50cb4028 was committed quite a while
ago. I'm fairly unhappy about c6fc50cb4028, fwiw, but didn't realize that
until it was too late.

+1

In light of all of the above, is the in-tree state that bad? (if we
abstract away from the way 27bc1772fc and dd1f6b0c17 were committed).

To me the 27bc1772fc doesn't make much sense on its own. You added calls
directly to heapam internals to a file in src/backend/commands/; that just
doesn't make sense.

Leaving that aside, I think the interface isn't good on its own:
table_relation_analyze() doesn't actually do anything, it just sets callbacks,
that then later are called from analyze.c, which doesn't at all fit to the
name of the callback/function. I realize that this is kinda cribbed from the
FDW code, but I don't think that is a particularly good excuse.

I don't think dd1f6b0c17 improves the situation, at all. It sets global
variables to redirect how an individual acquire_sample_rows invocation
works:
void
block_level_table_analyze(Relation relation,
                          AcquireSampleRowsFunc *func,
                          BlockNumber *totalpages,
                          BufferAccessStrategy bstrategy,
                          ScanAnalyzeNextBlockFunc scan_analyze_next_block_cb,
                          ScanAnalyzeNextTupleFunc scan_analyze_next_tuple_cb)
{
    *func = acquire_sample_rows;
    *totalpages = RelationGetNumberOfBlocks(relation);
    vac_strategy = bstrategy;
    scan_analyze_next_block = scan_analyze_next_block_cb;
    scan_analyze_next_tuple = scan_analyze_next_tuple_cb;
}

Notably it does so within the ->relation_analyze tableam callback, which does
*NOT* actually do anything other than returning a callback. So if
->relation_analyze() for another relation is called, the acquire_sample_rows()
for the earlier relation will do something different. Note that this isn't a
theoretical risk, acquire_inherited_sample_rows() actually collects the
acquirefunc for all the inherited relations before calling acquirefunc.

You're right. No sense trying to fix this. Reverted.

------
Regards,
Alexander Korotkov

#97Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andres Freund (#95)
Re: Table AM Interface Enhancements

On Mon, Apr 15, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-15 16:02:00 -0400, Robert Haas wrote:

Do you want that patch applied, not applied, or applied with some set of
modifications?

I think we should apply Alexander's proposed revert and then separately
discuss what we should do about 041b96802ef.

Taking a closer look at acquire_sample_rows(), I think it would be
good if table AM implementation would care about block-level (or
whatever-level) sampling. So that acquire_sample_rows() just fetches
tuples one-by-one from table AM implementation without any care about
blocks. Possibly table_beginscan_analyze() could take an argument of the
target number of tuples; then those tuples are just fetched with
table_scan_analyze_next_tuple(). What do you think?
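
Roughly something like this (signatures are hypothetical, just to
illustrate the idea; the AM would do any block- or whatever-level sampling
internally):

scan = table_beginscan_analyze(onerel, targrows, vac_strategy);
while (numrows < targrows &&
       table_scan_analyze_next_tuple(scan, OldestXmin,
                                     &liverows, &deadrows, slot))
{
    rows[numrows++] = ExecCopySlotHeapTuple(slot);
    vacuum_delay_point();
}
table_endscan(scan);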

------
Regards,
Alexander Korotkov

#98Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Alexander Korotkov (#97)
Re: Table AM Interface Enhancements

On Tue, 16 Apr 2024 at 14:52, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Mon, Apr 15, 2024 at 11:17 PM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-15 16:02:00 -0400, Robert Haas wrote:

Do you want that patch applied, not applied, or applied with some set of
modifications?

I think we should apply Alexander's proposed revert and then separately
discuss what we should do about 041b96802ef.

Taking a closer look at acquire_sample_rows(), I think it would be
good if table AM implementation would care about block-level (or
whatever-level) sampling. So that acquire_sample_rows() just fetches
tuples one-by-one from table AM implementation without any care about
blocks. Possibly table_beginscan_analyze() could take an argument of the
target number of tuples; then those tuples are just fetched with
table_scan_analyze_next_tuple(). What do you think?

Hi, Alexander!

I like the idea of splitting abstraction levels for:
1. acquirefuncs (FDW or physical table)
2. new specific row fetch functions (similar to the existing
_scan_analyze_next_tuple()), which could be AM-specific.

Then scan_analyze_next_block() or another iteration algorithm would be
contained inside table AM implementation of _scan_analyze_next_tuple().

So, init of scan state would be inside table AM implementation of
_beginscan_analyze(). Scan state (like BlockSamplerData or other state that
could be custom for AM) could be transferred from _beginscan_analyze() to
_scan_analyze_next_tuple() by some opaque AM-specific data structure. If so,
we may also need an AM-specific table_endscan_analyze() to clean it up.
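
For example, a heap implementation might keep something like this behind
that opaque structure (purely a sketch; no such struct exists today):

typedef struct HeapAnalyzeScanDescData
{
    TableScanDescData rs_base;      /* AM-independent part */
    BlockSamplerData bs;            /* heap's block-level sampling state */
    ReadStream *rs_read_stream;     /* heap's streaming-read state */
} HeapAnalyzeScanDescData;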

Regards,
Pavel

#99Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#97)
Re: Table AM Interface Enhancements

On Tue, Apr 16, 2024 at 6:52 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Taking a closer look at acquire_sample_rows(), I think it would be
good if table AM implementation would care about block-level (or
whatever-level) sampling. So that acquire_sample_rows() just fetches
tuples one-by-one from table AM implementation without any care about
blocks. Possibly table_beginscan_analyze() could take an argument of the
target number of tuples; then those tuples are just fetched with
table_scan_analyze_next_tuple(). What do you think?

Andres is the expert here, but FWIW, that plan seems reasonable to me.
One downside is that every block-based tableam is going to end up with
a very similar implementation, which is kind of something I don't like
about the tableam API in general: if you want to make something that
is basically heap plus a little bit of special sauce, you have to copy
a mountain of code. Right now we don't really care about that problem,
because we don't have any other tableams in core, but if we ever do, I
think we're going to find ourselves very unhappy with that aspect of
things. But maybe now is not the time to start worrying. That problem
isn't unique to analyze, and giving out-of-core tableams the
flexibility to do what they want is better than not.

--
Robert Haas
EDB: http://www.enterprisedb.com

#100Andres Freund
andres@anarazel.de
In reply to: Alexander Korotkov (#96)
Re: Table AM Interface Enhancements

On 2024-04-16 13:33:53 +0300, Alexander Korotkov wrote:

Reverted.

Thanks!

#101Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#99)
Re: Table AM Interface Enhancements

Hi,

On 2024-04-16 08:31:24 -0400, Robert Haas wrote:

On Tue, Apr 16, 2024 at 6:52 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:

Taking a closer look at acquire_sample_rows(), I think it would be
good if table AM implementation would care about block-level (or
whatever-level) sampling. So that acquire_sample_rows() just fetches
tuples one-by-one from table AM implementation without any care about
blocks. Possibly table_beginscan_analyze() could take an argument of the
target number of tuples; then those tuples are just fetched with
table_scan_analyze_next_tuple(). What do you think?

Andres is the expert here, but FWIW, that plan seems reasonable to me.
One downside is that every block-based tableam is going to end up with
a very similar implementation, which is kind of something I don't like
about the tableam API in general: if you want to make something that
is basically heap plus a little bit of special sauce, you have to copy
a mountain of code. Right now we don't really care about that problem,
because we don't have any other tableams in core, but if we ever do, I
think we're going to find ourselves very unhappy with that aspect of
things. But maybe now is not the time to start worrying. That problem
isn't unique to analyze, and giving out-of-core tableams the
flexibility to do what they want is better than not.

I think that can partially be addressed by having more "block oriented AM"
helpers in core, like we have for table_block_parallelscan*. Doesn't work for
everything, but should for something like analyze.

Greetings,

Andres Freund

#102Mats Kindahl
mats@timescale.com
In reply to: Alexander Korotkov (#1)
Re: Table AM Interface Enhancements

On Thu, Nov 23, 2023 at 1:43 PM Alexander Korotkov <aekorotkov@gmail.com>
wrote:

Hello PostgreSQL Hackers,

I am pleased to submit a series of patches related to the Table Access
Method (AM) interface, which I initially announced during my talk at
PGCon 2023 [1]. These patches are primarily designed to support the
OrioleDB engine, but I believe they could be beneficial for other
table AM implementations as well.

The focus of these patches is to introduce more flexibility and
capabilities into the Table AM interface. This is particularly
relevant for advanced use cases like index-organized tables,
alternative MVCC implementations, etc.

Hi Alexander, and great to see some action around the table access method
interface.

Sorry for being late to the game, but I'm wondering a few things about the
patches; I'll start with the first one that caught my eye.

0007-Allow-table-AM-tuple_insert-method-to-return-the--v1.patch

This allows table AM to return a native tuple slot, which is aware of
table AM-specific system attributes.

This patch seems straightforward enough, but from reading the surrounding
code and trying to understand the context I am wondering a few things.
Reading the thread, I am unsure if this will go in or not, but just wanted
to point out a concern I had. My apologies if I am raising an issue that is
already resolved.

AFAICT, the general contract for working with table tuple slots is creating
one for a particular purpose, filling it in, and then passing around a
pointer to it. Since the slot is created using a "source" implementation,
the "source" is responsible for memory allocation and also other updates to
the state. Please correct me if I have misunderstood how this is intended
to work, but this seems like a good API since it avoids
unnecessary allocation and, in particular, unbounded creation of new slots
affecting memory usage while a query is executing. For a plan you want to
execute, you just make sure that you have slots of the right kind in each
plan node and there is no need to dynamically allocate more slots. If you
want one for the table access method, just make sure to fetch the slot
callbacks from the table access method and use those correctly. As a result,
the number of slots used during execution is bounded.

Assuming that I've understood it correctly, if a TTS is then created and
passed to tuple_insert, and it needs to return a different slot, this
raises two questions:

- As Andres pointed out: who is responsible for taking care of and
dealing with the cleanup of the returned slot here? Note that this is not
just a matter of releasing memory, there are other stateful things that
they might need to deal with that the TAM has created in the slot. For
this, some sort of callback is needed and the tuple_insert implementation
needs to call that correctly.
- The dual is the cleanup of the "original" slot passed in: a slot of a
particular kind is passed in and you need to deal with this correctly to
release the resources allocated by the original slot, using some sort of
callback.

For both these cases, the question is what cleanup function to call.

In most cases, the slot comes from a subplan and is not dynamically
allocated, i.e., it cannot just use release() since it is reused later. For
example, for ExecScanFetch the slot ss_ScanTupleSlot is returned, which is
then used with tuple_insert (unless I've misread the code), which is
typically cleared, not released.

If clear() is used instead, and you clear this slot as part of inserting a
tuple, you can instead prematurely clear an intermediate result
(ss_ScanTupleSlot, in the example above), which can cause strange issues if
this result is needed later.

So, given that the dynamic allocation of new slots is unbounded within a
query and it is complicated to make sure that slots are
cleared/reset/released correctly depending on context, this seems to be
hard to get to work correctly and not risk introducing bugs. IMHO, it would
be preferable to have a very simple contract where you init, set, clear,
and release the slot to avoid bugs creeping into the code, which is what
the PostgreSQL code mostly has now.
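
Concretely, the lifecycle I have in mind is what the existing executor API
already expresses (a sketch only, assuming "rel" is an open Relation whose
first attribute is an int4):

TupleTableSlot *slot = table_slot_create(rel, NULL);    /* init: AM picks the ops */

ExecClearTuple(slot);
slot->tts_values[0] = Int32GetDatum(42);
slot->tts_isnull[0] = false;
ExecStoreVirtualTuple(slot);                            /* set */

ExecClearTuple(slot);                                   /* clear: slot is reusable */
ExecDropSingleTupleTableSlot(slot);                     /* release: once, at the end */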

So, the question here is why changing the slot implementation is needed. I
do not know the details of OrioleDB, but this slot is immediately used
with ExecInsertIndexTuples() after the call in nodeModifyTable. If the need
is to pass information from the TAM to the IAM then it might be better to
store this information in the execution state. If there is a case where the
correct slot is not created, then fixing that location might be better.
(I've noticed that the copyFrom code has a somewhat naïve assumption of
what slot implementation should be used, but that is a separate discussion.)

Best wishes,
Mats Kindahl

#103Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alexander Korotkov (#96)
Re: Table AM Interface Enhancements

Hi,

On Tue, 16 Apr 2024 at 12:34, Alexander Korotkov <aekorotkov@gmail.com> wrote:

You're right. No sense trying to fix this. Reverted.

I just noticed that this revert (commit 6377e12a) seems to have
introduced two comment blocks atop TableAmRoutine's
scan_analyze_next_block, and I can't find a clear reason why these are
two separate comment blocks.
Furthermore, both comment blocks seemingly talk about different
implementations of a block-based analyze functionality, and I don't
have the time to analyze which of these comments is authoritative and
which are misplaced or obsolete.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#104Alexander Korotkov
aekorotkov@gmail.com
In reply to: Matthias van de Meent (#103)
Re: Table AM Interface Enhancements

On Fri, Jun 21, 2024 at 7:37 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Tue, 16 Apr 2024 at 12:34, Alexander Korotkov <aekorotkov@gmail.com> wrote:

You're right. No sense trying to fix this. Reverted.

I just noticed that this revert (commit 6377e12a) seems to have
introduced two comment blocks atop TableAmRoutine's
scan_analyze_next_block, and I can't find a clear reason why these are
two separate comment blocks.
Furthermore, both comment blocks seemingly talk about different
implementations of a block-based analyze functionality, and I don't
have the time to analyze which of these comments is authoritative and
which are misplaced or obsolete.

Thank you, I've just removed the first comment. It contains
heap-specific information and has been copied here from
heapam_scan_analyze_next_block().

------
Regards,
Alexander Korotkov
Supabase