partial heap only tuples

Started by Bossart, Nathan · almost 5 years ago · 17 messages
#1 Bossart, Nathan
bossartn@amazon.com
1 attachment(s)

Hello,

I'm hoping to gather some early feedback on a heap optimization I've
been working on. In short, I'm hoping to add "partial heap only
tuple" (PHOT) support, which would allow you to skip updating indexes
for unchanged columns even when other indexes require updates. Today,
HOT works wonders when no indexed columns are updated. However, as
soon as you touch one indexed column, you lose that optimization
entirely, as you must update every index on the table. The resulting
performance impact is a pain point for many of our (AWS's) enterprise
customers, so we'd like to lend a hand for some improvements in this
area. For workloads involving a lot of columns and a lot of indexes,
an optimization like PHOT can make a huge difference. I'm aware that
there was a previous attempt a few years ago to add a similar
optimization called WARM [0] [1]. However, I only noticed this
previous effort after coming up with the design for PHOT, so I ended
up taking a slightly different approach. I am also aware of a couple
of recent nbtree improvements that may mitigate some of the impact of
non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice
complement to those. I've attached a very early proof-of-concept
patch with the design described below.

As far as performance is concerned, it is simple enough to show major
benefits from PHOT by tacking on a large number of indexes and columns
to a table. For a short pgbench run where each table had 5 additional
text columns and indexes on every column, I noticed a ~34% bump in
TPS with PHOT [4]. Theoretically, the TPS bump should be even higher
with additional columns with indexes. In addition to showing such
benefits, I have also attempted to show that regular pgbench runs are
not significantly affected. For a short pgbench run with no table
modifications, I noticed a ~2% bump in TPS with PHOT [5].
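
For context, the schema change for the first run was along these lines
(shown for pgbench_accounts only; the exact column and index names are not
in this message, so the ones below are illustrative):

-- Illustrative sketch: five extra text columns plus an index on every
-- column.  The actual test setup may have differed in its details.
ALTER TABLE pgbench_accounts
    ADD COLUMN extra1 text, ADD COLUMN extra2 text, ADD COLUMN extra3 text,
    ADD COLUMN extra4 text, ADD COLUMN extra5 text;
CREATE INDEX ON pgbench_accounts (bid);
CREATE INDEX ON pgbench_accounts (abalance);
CREATE INDEX ON pgbench_accounts (filler);
CREATE INDEX ON pgbench_accounts (extra1);
CREATE INDEX ON pgbench_accounts (extra2);
CREATE INDEX ON pgbench_accounts (extra3);
CREATE INDEX ON pgbench_accounts (extra4);
CREATE INDEX ON pgbench_accounts (extra5);

With indexes like these, the standard pgbench update to abalance modifies
one indexed column while leaving the rest untouched, which is exactly the
case PHOT targets.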

Next, I'll go into the design a bit. I've commandeered the two
remaining bits in t_infomask2 to use as HEAP_PHOT_UPDATED and
HEAP_PHOT_TUPLE. These are analogous to the HEAP_HOT_UPDATED and
HEAP_ONLY_TUPLE bits. (If there are concerns about exhausting the
t_infomask2 bits, I think we could instead use just one of the remaining bits
as a "modifier" bit on the HOT ones. I opted against that for the
proof-of-concept patch to keep things simple.) When creating a PHOT
tuple, we only create new index tuples for updated columns. These new
index tuples point to the PHOT tuple. Following is a simple
demonstration with a table with two integer columns, each with its own
index:

postgres=# SELECT heap_tuple_infomask_flags(t_infomask, t_infomask2), t_data
FROM heap_page_items(get_raw_page('test', 0))
WHERE t_infomask IS NOT NULL
OR t_infomask2 IS NOT NULL;
heap_tuple_infomask_flags | t_data
-----------------------------------------------------------------------------+--------------------
("{HEAP_XMIN_COMMITTED,HEAP_XMAX_COMMITTED,HEAP_PHOT_UPDATED}",{}) | \x0000000000000000
("{HEAP_XMIN_COMMITTED,HEAP_UPDATED,HEAP_PHOT_UPDATED,HEAP_PHOT_TUPLE}",{}) | \x0100000000000000
("{HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_PHOT_TUPLE}",{}) | \x0100000002000000
(3 rows)

postgres=# SELECT itemoffset, ctid, data
FROM bt_page_items(get_raw_page('test_a_idx', 1));
itemoffset | ctid | data
------------+-------+-------------------------
1 | (0,1) | 00 00 00 00 00 00 00 00
2 | (0,2) | 01 00 00 00 00 00 00 00
(2 rows)

postgres=# SELECT itemoffset, ctid, data
FROM bt_page_items(get_raw_page('test_b_idx', 1));
itemoffset | ctid | data
------------+-------+-------------------------
1 | (0,1) | 00 00 00 00 00 00 00 00
2 | (0,3) | 02 00 00 00 00 00 00 00
(2 rows)
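
The setup that produces the page and index contents above is not shown in
the message, but something along these lines should reproduce it
(reconstructed from the output, so treat it as an approximation):

CREATE TABLE test (a int, b int);
CREATE INDEX test_a_idx ON test (a);
CREATE INDEX test_b_idx ON test (b);
INSERT INTO test VALUES (0, 0);
UPDATE test SET a = 1;  -- PHOT: only test_a_idx gets a new entry, pointing at (0,2)
UPDATE test SET b = 2;  -- PHOT: only test_b_idx gets a new entry, pointing at (0,3)

Note that test_a_idx never gains an entry pointing at (0,3): the second
update did not change a, so an index scan on a reaches the final tuple by
entering the chain at (0,2) and following it forward.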

When it is time to scan through a PHOT chain, there are a couple of
things to account for. Sequential scans work out-of-the-box thanks to
the visibility rules, but other types of scans like index scans
require additional checks. If you encounter a PHOT chain when
performing an index scan, you should only continue following the chain
as long as none of the columns covered by the index have been modified. If the
scan does encounter such a modification, we stop following the chain
and continue with the index scan. Even if there is a tuple in that
PHOT chain that should be returned by our index scan, we will still
find it, as there will be another matching index tuple that points to a
later tuple in the PHOT chain. My initial idea for determining which
columns were modified was to add a new bitmap after the "nulls" bitmap
in the tuple header. However, the attached patch simply uses
HeapDetermineModifiedColumns(). I've yet to measure the overhead of
this approach versus the bitmap approach, but I haven't noticed
anything too detrimental in the testing I've done so far.
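
Restating that check in code form, the chain walk in the patched
heap_hot_search_buffer() does roughly the following at each step (a
simplified sketch, not the exact patch code; prev_tuple, cur_tuple, and
ind_attrs, the set of columns covered by the index driving the scan, are
illustrative names):

	/*
	 * Only PHOT members need the extra check; plain HOT members, by
	 * definition, changed no indexed columns.
	 */
	if (prev_tuple != NULL && HeapTupleIsPartialHeapOnly(cur_tuple))
	{
		/* Which of the index's columns changed between the two versions? */
		Bitmapset  *modified = HeapDetermineModifiedColumns(relation,
															bms_copy(ind_attrs),
															prev_tuple,
															cur_tuple);

		/*
		 * If the update that created this tuple touched a column our index
		 * covers, stop here; the scan will find any matching later versions
		 * through the newer index entries that point into the chain.
		 */
		if (!bms_is_empty(modified))
			break;
	}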

In my proof-of-concept patch, I've included a temporary hack to get
some basic bitmap scans working as expected. Since we won't have
followed the PHOT chains in the bitmap index scan, we must know how to
follow them in the bitmap heap scan. Unfortunately, the bitmap heap
scan has no knowledge of what indexed columns to pay attention to when
following the PHOT chains. My temporary hack fixes this by having the
bitmap heap scan pull the set of indexed columns it needs to consider
from the outer plan. I think this is one area of the design that will
require substantially more effort. Following is a demonstration of a
basic sequential scan and bitmap scan:

postgres=# EXPLAIN (COSTS FALSE) SELECT * FROM test;
QUERY PLAN
------------------
Seq Scan on test
(1 row)

postgres=# SELECT * FROM test;
a | b
---+---
1 | 2
(1 row)

postgres=# EXPLAIN (COSTS FALSE) SELECT * FROM test WHERE a >= 0;
QUERY PLAN
---------------------------------------
Bitmap Heap Scan on test
Recheck Cond: (a >= 0)
-> Bitmap Index Scan on test_a_idx
Index Cond: (a >= 0)
(4 rows)

postgres=# SELECT * FROM test WHERE a >= 0;
a | b
---+---
1 | 2
(1 row)

This design allows for "weaving" between HOT and PHOT in a chain.
However, it is still important to treat each consecutive set of HOT
updates or PHOT updates as its own chain for the purposes of pruning
and cleanup. Pruning is heavily restricted for PHOT due to the
presence of corresponding index tuples. I believe we can redirect
line pointers for consecutive sets of PHOT updates that modify the
same set of indexed columns, but this is only possible if no index has
duplicate values in the redirected set. Also, I do not think it is
possible to prune intermediate line pointers in a PHOT chain. While
it may be possible to redirect all line pointers to the final tuple in
a series of updates to the same set of indexed columns, my hunch is
that doing so will add significant complexity for tracking
intermediate updates, and any performance gains will be marginal.
I've created some small diagrams to illustrate my proposed cleanup
strategy.

Here is a hypothetical PHOT chain.

idx1      0          1          2
idx2      0                                1          2
idx3      0
lp        1          2          3          4          5
heap   (0,0,0)    (1,0,0)    (2,0,0)    (2,1,0)    (2,2,0)

Heap tuples may be removed and line pointers may be redirected for
consecutive updates to the same set of indexes (as long as no index
has duplicate values in the redirected set of updates).

idx1      0          1          2
idx2      0                                1          2
idx3      0
lp        1          2 ------>  3          4 ------>  5
heap   (0,0,0)               (2,0,0)               (2,2,0)

When following redirect chains, we must check that the "interesting"
columns for the relevant indexes are not updated whenever a tuple is
found. In order to be eligible for cleanup, the final tuple in the
corresponding PHOT/HOT chain must also be eligible for cleanup, or all
indexes must have been updated later in the chain before any visible
tuples. (I suspect that the former condition may cause significant
bloat for some workloads and the latter condition may be prohibitively
complicated. Perhaps this can be mitigated by limiting how long we
allow PHOT chains to get.) My proof-of-concept patch does not yet
implement line pointer redirecting and cleanup, so it is possible that
I am missing some additional obstacles and optimizations here.

I think PostgreSQL 15 is realistically the earliest target version for
this change. Given that others find this project worthwhile, that's
my goal for this patch. I've CC'd a number of folks who have been
involved in this project already and who I'm hoping will continue to
help me drive this forward.

Nathan

[0]: /messages/by-id/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd+buUkJ4iDPw@mail.gmail.com
[1]: /messages/by-id/CABOikdMNy6yowA+wTGK9RVd8iw+CzqHeQSGpW7Yka_4RSZ_LOQ@mail.gmail.com
[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0d861bbb
[3]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d168b666
[4]: non-PHOT:
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 256
number of threads: 256
duration: 1800 s
number of transactions actually processed: 29759733
latency average = 15.484 ms
latency stddev = 10.102 ms
tps = 16530.552950 (including connections establishing)
tps = 16530.730565 (excluding connections establishing)

PHOT:
...
number of transactions actually processed: 39998968
latency average = 11.520 ms
latency stddev = 8.157 ms
tps = 22220.709117 (including connections establishing)
tps = 22221.182648 (excluding connections establishing)
[5]: non-PHOT:
...
number of transactions actually processed: 151841961
latency average = 3.034 ms
latency stddev = 1.854 ms
tps = 84354.268591 (including connections establishing)
tps = 84355.061353 (excluding connections establishing)

PHOT:
...
number of transactions actually processed: 155225857
latency average = 2.968 ms
latency stddev = 1.264 ms
tps = 86234.044783 (including connections establishing)
tps = 86234.961286 (excluding connections establishing)

Attachments:

0001-Early-prototype-for-partial-heap-only-tuples-PHOT.patch (application/octet-stream)
From b1051fe3224fedb5595e8f8ace1acf8ec04cb30b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <bossartn@amazon.com>
Date: Tue, 9 Feb 2021 18:07:52 +0000
Subject: [PATCH 1/1] Early prototype for partial heap only tuples (PHOT).

This patch demonstrates a few basic pieces of the partial heap only
tuples project.
---
 contrib/pageinspect/expected/page.out     |   4 +-
 contrib/pageinspect/heapfuncs.c           |   5 +
 src/backend/access/heap/heapam.c          | 151 +++++++++++++++++++++++++-----
 src/backend/access/heap/heapam_handler.c  |  22 +++--
 src/backend/access/heap/pruneheap.c       |  19 +++-
 src/backend/access/heap/vacuumlazy.c      |   2 +
 src/backend/access/index/indexam.c        |   3 +-
 src/backend/access/nbtree/nbtinsert.c     |   5 +-
 src/backend/access/rmgrdesc/heapdesc.c    |  38 ++++++++
 src/backend/access/table/tableam.c        |  12 ++-
 src/backend/commands/constraint.c         |   7 +-
 src/backend/commands/copyfrom.c           |   4 +-
 src/backend/executor/execIndexing.c       |  27 +++++-
 src/backend/executor/execReplication.c    |  14 ++-
 src/backend/executor/nodeBitmapHeapscan.c |  26 ++++-
 src/backend/executor/nodeModifyTable.c    |  18 ++--
 src/backend/replication/logical/decode.c  |  36 +++++++
 src/include/access/heapam.h               |   5 +-
 src/include/access/heapam_xlog.h          |  13 +++
 src/include/access/htup_details.h         |  57 ++++++++++-
 src/include/access/rmgrlist.h             |   1 +
 src/include/access/tableam.h              |  29 +++---
 src/include/executor/executor.h           |   3 +-
 23 files changed, 426 insertions(+), 75 deletions(-)

diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 4da28f0a1d..81807c89d8 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -140,6 +140,8 @@ SELECT unnest(raw_flags)
  HEAP_MOVED_IN
  HEAP_MOVED_OFF
  HEAP_ONLY_TUPLE
+ HEAP_PHOT_TUPLE
+ HEAP_PHOT_UPDATED
  HEAP_UPDATED
  HEAP_XMAX_COMMITTED
  HEAP_XMAX_EXCL_LOCK
@@ -149,7 +151,7 @@ SELECT unnest(raw_flags)
  HEAP_XMAX_LOCK_ONLY
  HEAP_XMIN_COMMITTED
  HEAP_XMIN_INVALID
-(19 rows)
+(21 rows)
 
 SELECT unnest(combined_flags)
   FROM heap_tuple_infomask_flags(x'FFFF'::int, x'FFFF'::int) ORDER BY 1;
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 9abcee32af..bee811d5ef 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -587,6 +587,11 @@ heap_tuple_infomask_flags(PG_FUNCTION_ARGS)
 		flags[cnt++] = CStringGetTextDatum("HEAP_HOT_UPDATED");
 	if ((t_infomask2 & HEAP_ONLY_TUPLE) != 0)
 		flags[cnt++] = CStringGetTextDatum("HEAP_ONLY_TUPLE");
+	if ((t_infomask2 & HEAP_PHOT_UPDATED) != 0)
+		flags[cnt++] = CStringGetTextDatum("HEAP_PHOT_UPDATED");
+	if ((t_infomask2 & HEAP_PHOT_TUPLE) != 0)
+		flags[cnt++] = CStringGetTextDatum("HEAP_PHOT_TUPLE");
+
 
 	/* build value */
 	Assert(cnt <= bitcnt);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9926e2bd54..9aa0a50b57 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1538,16 +1538,21 @@ heap_fetch(Relation relation,
 bool
 heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 					   Snapshot snapshot, HeapTuple heapTuple,
-					   bool *all_dead, bool first_call)
+					   bool *all_dead, bool first_call,
+					   Relation indrel, Bitmapset *interesting_attrs)
 {
 	Page		dp = (Page) BufferGetPage(buffer);
 	TransactionId prev_xmax = InvalidTransactionId;
 	BlockNumber blkno;
 	OffsetNumber offnum;
+	OffsetNumber prev_offnum = InvalidOffsetNumber;
 	bool		at_chain_start;
 	bool		valid;
 	bool		skip;
 	GlobalVisState *vistest = NULL;
+	HeapTuple prev_tup = NULL;
+	Bitmapset *ind_attrs = NULL;
+	bool all_attrs = false;
 
 	/* If this is not the first call, previous call returned a (live!) tuple */
 	if (all_dead)
@@ -1562,6 +1567,21 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 	Assert(TransactionIdIsValid(RecentXmin));
 	Assert(BufferGetBlockNumber(buffer) == blkno);
 
+	if (indrel)
+	{
+		Form_pg_index indexStruct = indrel->rd_index;
+
+		for (int i = 0; i < indexStruct->indnatts; i++)
+		{
+			int ind_attr = indexStruct->indkey.values[i] - FirstLowInvalidHeapAttributeNumber;
+			ind_attrs = bms_add_member(ind_attrs, ind_attr);
+		}
+	}
+	else if (interesting_attrs)
+		ind_attrs = interesting_attrs;
+	else
+		all_attrs = true;
+
 	/* Scan through possible multiple members of HOT-chain */
 	for (;;)
 	{
@@ -1599,6 +1619,24 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		heapTuple->t_tableOid = RelationGetRelid(relation);
 		ItemPointerSet(&heapTuple->t_self, blkno, offnum);
 
+		/*
+		 * If this tuple was a PHOT update, make sure that we should keep
+		 * following it based on the index attributes and the modified ones.
+		 */
+		if (OffsetNumberIsValid(prev_offnum) &&
+			HeapTupleIsValid(prev_tup) && HeapTupleIsPartialHeapOnly(heapTuple))
+		{
+			Bitmapset *modified_attrs;
+
+			if (all_attrs)
+				break;
+
+			modified_attrs = HeapDetermineModifiedColumns(relation, bms_copy(ind_attrs),
+																	 prev_tup, heapTuple);
+			if (!bms_is_empty(modified_attrs))
+				break;
+		}
+
 		/*
 		 * Shouldn't see a HEAP_ONLY tuple at chain start.
 		 */
@@ -1661,13 +1699,15 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer,
 		 * Check to see if HOT chain continues past this tuple; if so fetch
 		 * the next offnum and loop around.
 		 */
-		if (HeapTupleIsHotUpdated(heapTuple))
+		if (HeapTupleIsHotUpdated(heapTuple) || HeapTupleIsPartialHotUpdated(heapTuple))
 		{
 			Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
 				   blkno);
+			prev_offnum = offnum;
 			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_data->t_ctid);
 			at_chain_start = false;
 			prev_xmax = HeapTupleHeaderGetUpdateXid(heapTuple->t_data);
+			prev_tup = heap_copytuple(heapTuple);
 		}
 		else
 			break;				/* end of chain */
@@ -3043,7 +3083,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, LockTupleMode *lockmode)
+			TM_FailureData *tmfd, LockTupleMode *lockmode,
+			Bitmapset **modified_attrs)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3051,7 +3092,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	Bitmapset  *key_attrs;
 	Bitmapset  *id_attrs;
 	Bitmapset  *interesting_attrs;
-	Bitmapset  *modified_attrs;
 	ItemId		lp;
 	HeapTupleData oldtup;
 	HeapTuple	heaptup;
@@ -3070,6 +3110,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	bool		have_tuple_lock = false;
 	bool		iscombo;
 	bool		use_hot_update = false;
+	bool		use_phot_update = false;
 	bool		hot_attrs_checked = false;
 	bool		key_intact;
 	bool		all_visible_cleared = false;
@@ -3168,7 +3209,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	newtup->t_tableOid = RelationGetRelid(relation);
 
 	/* Determine columns modified by the update. */
-	modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
+	*modified_attrs = HeapDetermineModifiedColumns(relation, interesting_attrs,
 												  &oldtup, newtup);
 
 	/*
@@ -3182,7 +3223,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	 * is updates that don't manipulate key columns, not those that
 	 * serendipitously arrive at the same key values.
 	 */
-	if (!bms_overlap(modified_attrs, key_attrs))
+	if (!bms_overlap(*modified_attrs, key_attrs))
 	{
 		*lockmode = LockTupleNoKeyExclusive;
 		mxact_status = MultiXactStatusNoKeyUpdate;
@@ -3439,7 +3480,6 @@ l2:
 		bms_free(hot_attrs);
 		bms_free(key_attrs);
 		bms_free(id_attrs);
-		bms_free(modified_attrs);
 		bms_free(interesting_attrs);
 		return result;
 	}
@@ -3747,8 +3787,25 @@ l2:
 		 * changed. If the page was already full, we may have skipped checking
 		 * for index columns, and also can't do a HOT update.
 		 */
-		if (hot_attrs_checked && !bms_overlap(modified_attrs, hot_attrs))
+		if (hot_attrs_checked && !bms_overlap(*modified_attrs, hot_attrs))
 			use_hot_update = true;
+
+		/*
+		 * PHOT only takes effect if HOT updating will not work AND if we are
+		 * not updating all of the indexed columns. Currently this can not be
+		 * done for catalog tables because the necessary changes haven't been
+		 * made to CatalogIndexInsert().
+		 */
+		if (hot_attrs_checked && !use_hot_update && !IsCatalogRelation(relation))
+		{
+			Bitmapset *updated_indexed_attrs = bms_intersect(*modified_attrs, hot_attrs);
+			bool updating_all_indexed_attrs = bms_equal(hot_attrs, updated_indexed_attrs);
+
+			if (!updating_all_indexed_attrs)
+				use_phot_update = true;
+
+			bms_free(updated_indexed_attrs);
+		}
 	}
 	else
 	{
@@ -3763,7 +3820,7 @@ l2:
 	 * logged.
 	 */
 	old_key_tuple = ExtractReplicaIdentity(relation, &oldtup,
-										   bms_overlap(modified_attrs, id_attrs),
+										   bms_overlap(*modified_attrs, id_attrs),
 										   &old_key_copied);
 
 	/* NO EREPORT(ERROR) from here till changes are logged */
@@ -3783,8 +3840,22 @@ l2:
 	 */
 	PageSetPrunable(page, xid);
 
+	/*
+	 * Clear all (P)HOT bits before we set them for the new tuple.  This must
+	 * happen before setting the (P)HOT bits, as some of the macros used for
+	 * setting these bits assert that others are not set.
+	 */
+	HeapTupleClearHotUpdated(&oldtup);
+	HeapTupleClearHeapOnly(heaptup);
+	HeapTupleClearHeapOnly(newtup);
+	HeapTupleClearPartialHotUpdated(&oldtup);
+	HeapTupleClearPartialHeapOnly(heaptup);
+	HeapTupleClearPartialHeapOnly(newtup);
+
 	if (use_hot_update)
 	{
+		Assert(!use_phot_update);
+
 		/* Mark the old tuple as HOT-updated */
 		HeapTupleSetHotUpdated(&oldtup);
 		/* And mark the new tuple as heap-only */
@@ -3792,12 +3863,17 @@ l2:
 		/* Mark the caller's copy too, in case different from heaptup */
 		HeapTupleSetHeapOnly(newtup);
 	}
-	else
+
+	if (use_phot_update)
 	{
-		/* Make sure tuples are correctly marked as not-HOT */
-		HeapTupleClearHotUpdated(&oldtup);
-		HeapTupleClearHeapOnly(heaptup);
-		HeapTupleClearHeapOnly(newtup);
+		Assert(!use_hot_update);
+
+		/* Mark the old tuple as PHOT-updated */
+		HeapTupleSetPartialHotUpdated(&oldtup);
+		/* And mark the new tuple as partial heap-only */
+		HeapTupleSetPartialHeapOnly(heaptup);
+		/* Mark the caller's copy too, in case different from heaptup */
+		HeapTupleSetPartialHeapOnly(newtup);
 	}
 
 	RelationPutHeapTuple(relation, newbuf, heaptup, false); /* insert new tuple */
@@ -3912,7 +3988,6 @@ l2:
 	bms_free(hot_attrs);
 	bms_free(key_attrs);
 	bms_free(id_attrs);
-	bms_free(modified_attrs);
 	bms_free(interesting_attrs);
 
 	return TM_Ok;
@@ -4038,11 +4113,14 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
 	TM_Result	result;
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
+	Bitmapset *modified_attrs; /* unused */
 
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &tmfd, &lockmode);
+						 &tmfd, &lockmode, &modified_attrs);
+	bms_free(modified_attrs);
+
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -6439,6 +6517,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		frz->t_infomask &= ~HEAP_XMAX_BITS;
 		frz->t_infomask |= HEAP_XMAX_INVALID;
 		frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
+		frz->t_infomask2 &= ~HEAP_PHOT_UPDATED;
 		frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
 		changed = true;
 	}
@@ -7334,7 +7413,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate)
 
 			/* Are any tuples from this HOT chain non-vacuumable? */
 			if (heap_hot_search_buffer(&tmp, rel, buf, &SnapshotNonVacuumable,
-									   &heapTuple, NULL, true))
+									   &heapTuple, NULL, true, NULL, NULL))
 				continue;		/* can't delete entry */
 
 			/* Caller will delete, since whole HOT chain is vacuumable */
@@ -7979,6 +8058,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	bool		need_tuple_data = RelationIsLogicallyLogged(reln);
 	bool		init;
 	int			bufflags;
+	bool		use_rm_heap3 = false;
 
 	/* Caller should not call me on a non-WAL-logged relation */
 	Assert(RelationNeedsWAL(reln));
@@ -7987,6 +8067,11 @@ log_heap_update(Relation reln, Buffer oldbuf,
 
 	if (HeapTupleIsHeapOnly(newtup))
 		info = XLOG_HEAP_HOT_UPDATE;
+	else if (HeapTupleIsPartialHeapOnly(newtup))
+	{
+		info = XLOG_HEAP3_PARTIAL_HOT_UPDATE;
+		use_rm_heap3 = true;
+	}
 	else
 		info = XLOG_HEAP_UPDATE;
 
@@ -8172,7 +8257,10 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	/* filtering by origin on a row level is much more efficient */
 	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-	recptr = XLogInsert(RM_HEAP_ID, info);
+	if (use_rm_heap3)
+		recptr = XLogInsert(RM_HEAP3_ID, info);
+	else
+		recptr = XLogInsert(RM_HEAP_ID, info);
 
 	return recptr;
 }
@@ -9035,7 +9123,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
  * Handles UPDATE and HOT_UPDATE
  */
 static void
-heap_xlog_update(XLogReaderState *record, bool hot_update)
+heap_xlog_update(XLogReaderState *record, bool hot_update, bool phot_update)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	xl_heap_update *xlrec = (xl_heap_update *) XLogRecGetData(record);
@@ -9072,7 +9160,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
 	if (XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
 	{
 		/* HOT updates are never done across pages */
-		Assert(!hot_update);
+		Assert(!hot_update && !phot_update);
 	}
 	else
 		oldblk = newblk;
@@ -9126,6 +9214,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
 		htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
 		if (hot_update)
 			HeapTupleHeaderSetHotUpdated(htup);
+		else if (phot_update)
+			HeapTupleHeaderSetPartialHotUpdated(htup);
 		else
 			HeapTupleHeaderClearHotUpdated(htup);
 		fix_infomask_from_infobits(xlrec->old_infobits_set, &htup->t_infomask,
@@ -9302,7 +9392,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
 	 * don't bother to update the FSM in that case, it doesn't need to be
 	 * totally accurate anyway.
 	 */
-	if (newaction == BLK_NEEDS_REDO && !hot_update && freespace < BLCKSZ / 5)
+	if (newaction == BLK_NEEDS_REDO && !(hot_update || phot_update) && freespace < BLCKSZ / 5)
 		XLogRecordPageWithFreeSpace(rnode, newblk, freespace);
 }
 
@@ -9533,7 +9623,7 @@ heap_redo(XLogReaderState *record)
 			heap_xlog_delete(record);
 			break;
 		case XLOG_HEAP_UPDATE:
-			heap_xlog_update(record, false);
+			heap_xlog_update(record, false, false);
 			break;
 		case XLOG_HEAP_TRUNCATE:
 
@@ -9544,7 +9634,7 @@ heap_redo(XLogReaderState *record)
 			 */
 			break;
 		case XLOG_HEAP_HOT_UPDATE:
-			heap_xlog_update(record, true);
+			heap_xlog_update(record, true, false);
 			break;
 		case XLOG_HEAP_CONFIRM:
 			heap_xlog_confirm(record);
@@ -9600,6 +9690,21 @@ heap2_redo(XLogReaderState *record)
 	}
 }
 
+void
+heap3_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info & XLOG_HEAP_OPMASK)
+	{
+		case XLOG_HEAP3_PARTIAL_HOT_UPDATE:
+			heap_xlog_update(record, false, true);
+			break;
+		default:
+			elog(PANIC, "heap3_redo: unknown op code %u", info);
+	}
+}
+
 /*
  * Mask a heap page before performing consistency checks on it.
  */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4a70e20a14..562997a779 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -113,7 +113,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 						 ItemPointer tid,
 						 Snapshot snapshot,
 						 TupleTableSlot *slot,
-						 bool *call_again, bool *all_dead)
+						 bool *call_again, bool *all_dead,
+						 Relation indrel)
 {
 	IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan;
 	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
@@ -146,7 +147,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 											snapshot,
 											&bslot->base.tupdata,
 											all_dead,
-											!*call_again);
+											!*call_again,
+											indrel, NULL);
 	bslot->base.tupdata.t_self = *tid;
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK);
 
@@ -314,7 +316,8 @@ static TM_Result
 heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					bool wait, TM_FailureData *tmfd,
-					LockTupleMode *lockmode, bool *update_indexes)
+					LockTupleMode *lockmode, bool *update_all_indexes,
+					bool *update_modified_indexes, Bitmapset **modified_attrs)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -325,7 +328,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode);
+						 tmfd, lockmode, modified_attrs);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
@@ -335,8 +338,12 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	 * t_self field.
 	 *
 	 * If it's a HOT update, we mustn't insert new index entries.
+	 *
+	 * If it's a PHOT update, we will not be updating all of the indexes, just
+	 * the ones that touch a modified attribute.
 	 */
-	*update_indexes = result == TM_Ok && !HeapTupleIsHeapOnly(tuple);
+	*update_all_indexes = result == TM_Ok && !(HeapTupleIsHeapOnly(tuple) || HeapTupleIsPartialHeapOnly(tuple));
+	*update_modified_indexes = result == TM_Ok && HeapTupleIsPartialHeapOnly(tuple);
 
 	if (shouldFree)
 		pfree(tuple);
@@ -2109,7 +2116,7 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
 
 static bool
 heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+							  TBMIterateResult *tbmres, Bitmapset *interesting_attrs)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 	BlockNumber page = tbmres->blockno;
@@ -2173,7 +2180,8 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 
 			ItemPointerSet(&tid, page, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
-									   &heapTuple, NULL, true))
+									   &heapTuple, NULL, true,
+									   NULL, interesting_attrs))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
 		}
 	}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index e3a716a2a2..5feaa32c84 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -63,7 +63,7 @@ typedef struct
 /* Local functions */
 static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
-							 PruneState *prstate);
+							 PruneState *prstate, bool *phot);
 static void heap_prune_record_prunable(PruneState *prstate, TransactionId xid);
 static void heap_prune_record_redirect(PruneState *prstate,
 									   OffsetNumber offnum, OffsetNumber rdoffnum);
@@ -232,6 +232,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	bool phot = false;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -279,13 +280,16 @@ heap_page_prune(Relation relation, Buffer buffer,
 			continue;
 
 		/* Process this item or chain of items */
-		ndeleted += heap_prune_chain(buffer, offnum, &prstate);
+		ndeleted += heap_prune_chain(buffer, offnum, &prstate, &phot);
 	}
 
 	/* Clear the offset information once we have processed the given page. */
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	if (phot)
+		return ndeleted;
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -489,7 +493,7 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
  * Returns the number of tuples (to be) deleted from the page.
  */
 static int
-heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
+heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate, bool *phot)
 {
 	int			ndeleted = 0;
 	Page		dp = (Page) BufferGetPage(buffer);
@@ -540,7 +544,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 			 * gets there first will mark the tuple unused.
 			 */
 			if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
-				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
+				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup) &&
+				!HeapTupleHeaderIsPartialHotUpdated(htup))
 			{
 				heap_prune_record_unused(prstate, rootoffnum);
 				HeapTupleHeaderAdvanceLatestRemovedXid(htup,
@@ -606,6 +611,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 		tup.t_len = ItemIdGetLength(lp);
 		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
 
+		if (HeapTupleHeaderIsPartialHeapOnly(htup) || HeapTupleHeaderIsPartialHotUpdated(htup))
+		{
+			*phot = true;
+			break;
+		}
+
 		/*
 		 * Check the tuple XMIN against prior XMAX, if any
 		 */
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..10a516bec2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1349,6 +1349,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 					 */
 					if (HeapTupleIsHotUpdated(&tuple) ||
 						HeapTupleIsHeapOnly(&tuple) ||
+						HeapTupleIsPartialHotUpdated(&tuple) ||
+						HeapTupleIsPartialHeapOnly(&tuple) ||
 						params->index_cleanup == VACOPT_TERNARY_DISABLED)
 						nkeep += 1;
 					else
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 3d2dbed708..d08f688bca 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -579,7 +579,8 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
 
 	found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid,
 									scan->xs_snapshot, slot,
-									&scan->xs_heap_continue, &all_dead);
+									&scan->xs_heap_continue, &all_dead,
+									scan->indexRelation);
 
 	if (found)
 		pgstat_count_heap_fetch(scan->indexRelation);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e333603912..78b71ab9bc 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -556,7 +556,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				 */
 				else if (table_index_fetch_tuple_check(heapRel, &htid,
 													   &SnapshotDirty,
-													   &all_dead))
+													   &all_dead,
+													   rel))
 				{
 					TransactionId xwait;
 
@@ -613,7 +614,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 					 */
 					htid = itup->t_tid;
 					if (table_index_fetch_tuple_check(heapRel, &htid,
-													  SnapshotSelf, NULL))
+													  SnapshotSelf, NULL, rel))
 					{
 						/* Normal case --- it's still live */
 					}
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e60e32b935..76a8b865fe 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -176,6 +176,26 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->cmin, xlrec->cmax, xlrec->combocid);
 	}
 }
+void
+heap3_desc(StringInfo buf, XLogReaderState *record)
+{
+	char       *rec = XLogRecGetData(record);
+	uint8       info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_HEAP3_PARTIAL_HOT_UPDATE)
+	{
+		xl_heap_update *xlrec = (xl_heap_update *) rec;
+
+		appendStringInfo(buf, "off %u xmax %u flags 0x%02X ",
+						 xlrec->old_offnum,
+						 xlrec->old_xmax,
+						 xlrec->flags);
+		out_infobits(buf, xlrec->old_infobits_set);
+		appendStringInfo(buf, "; new off %u xmax %u",
+						 xlrec->new_offnum,
+						 xlrec->new_xmax);
+	}
+}
 
 const char *
 heap_identify(uint8 info)
@@ -260,3 +280,21 @@ heap2_identify(uint8 info)
 
 	return id;
 }
+
+const char *
+heap3_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HEAP3_PARTIAL_HOT_UPDATE:
+			id = "PARTIAL_HOT_UPDATE";
+			break;
+		case XLOG_HEAP3_PARTIAL_HOT_UPDATE | XLOG_HEAP_INIT_PAGE:
+			id = "PARTIAL_HOT_UPDATE+INIT";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5ea5bdd810..c88ee5e336 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -219,7 +219,8 @@ bool
 table_index_fetch_tuple_check(Relation rel,
 							  ItemPointer tid,
 							  Snapshot snapshot,
-							  bool *all_dead)
+							  bool *all_dead,
+							  Relation indrel)
 {
 	IndexFetchTableData *scan;
 	TupleTableSlot *slot;
@@ -229,7 +230,7 @@ table_index_fetch_tuple_check(Relation rel,
 	slot = table_slot_create(rel, NULL);
 	scan = table_index_fetch_begin(rel);
 	found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again,
-									all_dead);
+									all_dead, indrel);
 	table_index_fetch_end(scan);
 	ExecDropSingleTupleTableSlot(slot);
 
@@ -346,7 +347,9 @@ void
 simple_table_tuple_update(Relation rel, ItemPointer otid,
 						  TupleTableSlot *slot,
 						  Snapshot snapshot,
-						  bool *update_indexes)
+						  bool *update_all_indexes,
+						  bool *update_modified_indexes,
+						  Bitmapset **modified_attrs)
 {
 	TM_Result	result;
 	TM_FailureData tmfd;
@@ -356,7 +359,8 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								true /* wait for commit */ ,
-								&tmfd, &lockmode, update_indexes);
+								&tmfd, &lockmode, update_all_indexes,
+								update_modified_indexes, modified_attrs);
 
 	switch (result)
 	{
diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c
index d0063164a7..cd18bea224 100644
--- a/src/backend/commands/constraint.c
+++ b/src/backend/commands/constraint.c
@@ -106,18 +106,21 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 	 * it's possible the index entry has also been marked dead, and even
 	 * removed.
 	 */
+	indexRel = index_open(trigdata->tg_trigger->tgconstrindid,
+						  RowExclusiveLock);
 	tmptid = checktid;
 	{
 		IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation);
 		bool		call_again = false;
 
 		if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot,
-									 &call_again, NULL))
+									 &call_again, NULL, indexRel))
 		{
 			/*
 			 * All rows referenced by the index entry are dead, so skip the
 			 * check.
 			 */
+			index_close(indexRel, RowExclusiveLock);
 			ExecDropSingleTupleTableSlot(slot);
 			table_index_fetch_end(scan);
 			return PointerGetDatum(NULL);
@@ -130,8 +133,6 @@ unique_key_recheck(PG_FUNCTION_ARGS)
 	 * to update it.  (This protects against possible changes of the index
 	 * schema, not against concurrent updates.)
 	 */
-	indexRel = index_open(trigdata->tg_trigger->tgconstrindid,
-						  RowExclusiveLock);
 	indexInfo = BuildIndexInfo(indexRel);
 
 	/*
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 796ca7b3f7..f4f97ae0b6 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -343,7 +343,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			recheckIndexes =
 				ExecInsertIndexTuples(resultRelInfo,
 									  buffer->slots[i], estate, false, false,
-									  NULL, NIL);
+									  NULL, NIL, false, NULL);
 			ExecARInsertTriggers(estate, resultRelInfo,
 								 slots[i], recheckIndexes,
 								 cstate->transition_capture);
@@ -1090,7 +1090,7 @@ CopyFrom(CopyFromState cstate)
 																   false,
 																   false,
 																   NULL,
-																   NIL);
+																   NIL, false, NULL);
 					}
 
 					/* AFTER ROW INSERT Triggers */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index afe7ce87d4..46faaf2455 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -287,7 +287,9 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 					  bool update,
 					  bool noDupErr,
 					  bool *specConflict,
-					  List *arbiterIndexes)
+					  List *arbiterIndexes,
+					  bool update_modified_indexes,
+					  Bitmapset *modified_attrs)
 {
 	ItemPointer tupleid = &slot->tts_tid;
 	List	   *result = NIL;
@@ -343,6 +345,29 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		if (!indexInfo->ii_ReadyForInserts)
 			continue;
 
+		/*
+		 * If the indexed attributes were not modified and this is a partial-HOT
+		 * update, skip it.
+		 */
+		if (update_modified_indexes)
+		{
+			bool should_update = false;
+			int j;
+
+			for (j = 0; j < indexInfo->ii_NumIndexAttrs; j++)
+			{
+				if (bms_is_member(indexInfo->ii_IndexAttrNumbers[j] - FirstLowInvalidHeapAttributeNumber,
+							modified_attrs))
+				{
+					should_update = true;
+					break;
+				}
+			}
+
+			if (!should_update)
+				continue;
+		}
+
 		/* Check for partial index */
 		if (indexInfo->ii_Predicate != NIL)
 		{
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 1e285e0349..fdbbb79d42 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -445,7 +445,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 		if (resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, false,
-												   NULL, NIL);
+												   NULL, NIL, false, NULL);
 
 		/* AFTER ROW INSERT Triggers */
 		ExecARInsertTriggers(estate, resultRelInfo, slot,
@@ -493,7 +493,9 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 	if (!skip_tuple)
 	{
 		List	   *recheckIndexes = NIL;
-		bool		update_indexes;
+		bool		update_all_indexes;
+		bool		update_modified_indexes;
+		Bitmapset  *modified_attrs;
 
 		/* Compute stored generated columns */
 		if (rel->rd_att->constr &&
@@ -508,12 +510,13 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 			ExecPartitionCheck(resultRelInfo, slot, estate, true);
 
 		simple_table_tuple_update(rel, tid, slot, estate->es_snapshot,
-								  &update_indexes);
+								  &update_all_indexes, &update_modified_indexes,
+								  &modified_attrs);
 
-		if (resultRelInfo->ri_NumIndices > 0 && update_indexes)
+		if (resultRelInfo->ri_NumIndices > 0 && (update_all_indexes || update_modified_indexes))
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, true, false,
-												   NULL, NIL);
+												   NULL, NIL, update_modified_indexes, modified_attrs);
 
 		/* AFTER ROW UPDATE Triggers */
 		ExecARUpdateTriggers(estate, resultRelInfo,
@@ -521,6 +524,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 							 recheckIndexes, NULL);
 
 		list_free(recheckIndexes);
+		bms_free(modified_attrs);
 	}
 }
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2db1914aff..c74521e1e5 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -37,6 +37,7 @@
 
 #include <math.h>
 
+#include "access/genam.h"
 #include "access/relscan.h"
 #include "access/tableam.h"
 #include "access/transam.h"
@@ -79,8 +80,31 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	TBMIterateResult *tbmres;
 	TupleTableSlot *slot;
 	ParallelBitmapHeapState *pstate = node->pstate;
+	Bitmapset *interesting_attrs = NULL;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
+	/* hack for PHOT */
+	if (outerPlanState(&node->ss.ps))
+	{
+		PlanState *ps = outerPlanState(&node->ss.ps);
+		Plan *plan = ps->plan;
+
+		if (nodeTag(plan) == T_BitmapIndexScan)
+		{
+			BitmapIndexScan *scan = (BitmapIndexScan *) plan;
+			Oid indexid = scan->indexid;
+			Relation indrel = index_open(indexid, AccessShareLock);
+
+			for (int i = 0; i < indrel->rd_index->indnatts; i++)
+			{
+				int ind_attr = indrel->rd_index->indkey.values[i] - FirstLowInvalidHeapAttributeNumber;
+				interesting_attrs = bms_add_member(interesting_attrs, ind_attr);
+			}
+
+			index_close(indrel, AccessShareLock);
+		}
+	}
+
 	/*
 	 * extract necessary information from index scan node
 	 */
@@ -232,7 +256,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				 */
 				node->return_empty_tuples = tbmres->ntuples;
 			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
+			else if (!table_scan_bitmap_next_block(scan, tbmres, interesting_attrs))
 			{
 				/* AM doesn't think this block is valid, skip */
 				continue;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 2993ba43e3..1a36f29153 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -658,7 +658,9 @@ ExecInsert(ModifyTableState *mtstate,
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, true,
 												   &specConflict,
-												   arbiterIndexes);
+												   arbiterIndexes,
+												   false,
+												   NULL);
 
 			/* adjust the tuple's state accordingly */
 			table_tuple_complete_speculative(resultRelationDesc, slot,
@@ -697,7 +699,7 @@ ExecInsert(ModifyTableState *mtstate,
 			if (resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
-													   false, NULL, NIL);
+													   false, NULL, NIL, false, NULL);
 		}
 	}
 
@@ -1395,7 +1397,9 @@ ExecUpdate(ModifyTableState *mtstate,
 	{
 		LockTupleMode lockmode;
 		bool		partition_constraint_failed;
-		bool		update_indexes;
+		bool		update_all_indexes;
+		bool		update_modified_indexes;
+		Bitmapset *modified_attrs;
 
 		/*
 		 * Constraints might reference the tableoid column, so (re-)initialize
@@ -1500,7 +1504,8 @@ lreplace:;
 									estate->es_snapshot,
 									estate->es_crosscheck_snapshot,
 									true /* wait for commit */ ,
-									&tmfd, &lockmode, &update_indexes);
+									&tmfd, &lockmode, &update_all_indexes,
+									&update_modified_indexes, &modified_attrs);
 
 		switch (result)
 		{
@@ -1630,10 +1635,11 @@ lreplace:;
 		}
 
 		/* insert index entries for tuple if necessary */
-		if (resultRelInfo->ri_NumIndices > 0 && update_indexes)
+		if (resultRelInfo->ri_NumIndices > 0 && (update_all_indexes || update_modified_indexes))
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, true, false,
-												   NULL, NIL);
+												   NULL, NIL, update_modified_indexes, modified_attrs);
+		bms_free(modified_attrs);
 	}
 
 	if (canSetTag)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index afa1df00d0..c0c1a74637 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -54,6 +54,7 @@ typedef struct XLogRecordBuffer
 static void DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 static void DecodeHeapOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 static void DecodeHeap2Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
+static void DecodeHeap3Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 static void DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 static void DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
 static void DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf);
@@ -145,6 +146,10 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 			DecodeStandbyOp(ctx, &buf);
 			break;
 
+		case RM_HEAP3_ID:
+			DecodeHeap3Op(ctx, &buf);
+			break;
+
 		case RM_HEAP2_ID:
 			DecodeHeap2Op(ctx, &buf);
 			break;
@@ -416,6 +421,37 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	}
 }
 
+/*
+ * Handle rmgr HEAP3_ID records for DecodeRecordIntoReorderBuffer().
+ */
+static void
+DecodeHeap3Op(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
+{
+	uint8		 info = XLogRecGetInfo(buf->record) & XLOG_HEAP_OPMASK;
+	TransactionId xid = XLogRecGetXid(buf->record);
+	SnapBuild	*builder = ctx->snapshot_builder;
+
+	ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
+
+	/*
+	 * If we don't have snapshot or we are just fast-forwarding, there is no
+	 * point in decoding data changes.
+	 */
+	if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
+		ctx->fast_forward)
+		return;
+
+	switch (info)
+	{
+		case XLOG_HEAP3_PARTIAL_HOT_UPDATE:
+			if (SnapBuildProcessChange(builder, xid, buf->origptr))
+				DecodeUpdate(ctx, buf);
+			break;
+		default:
+			elog(ERROR, "unexpected RM_HEAP3_ID record type: %u", info);
+	}
+}
+
 /*
  * Handle rmgr HEAP2_ID records for DecodeRecordIntoReorderBuffer().
  */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d96a47b1ce..ff54eadcd1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -126,7 +126,7 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot,
 					   HeapTuple tuple, Buffer *userbuf);
 extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation,
 								   Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
-								   bool *all_dead, bool first_call);
+								   bool *all_dead, bool first_call, Relation indrel, Bitmapset *interesting_attrs);
 
 extern void heap_get_latest_tid(TableScanDesc scan, ItemPointer tid);
 
@@ -147,7 +147,8 @@ extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
 							 HeapTuple newtup,
 							 CommandId cid, Snapshot crosscheck, bool wait,
-							 struct TM_FailureData *tmfd, LockTupleMode *lockmode);
+							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
+							 Bitmapset **modified_attrs);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool follow_update,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 178d49710a..8b00bb148b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -59,6 +59,16 @@
 #define XLOG_HEAP2_LOCK_UPDATED 0x60
 #define XLOG_HEAP2_NEW_CID		0x70
 
+/*
+ * We ran out of opcodes again, so heapam.c now has a third RmgrId.  These
+ * opcodes are associated with RM_HEAP3_ID, but are not logically different from
+ * the ones above associated with RM_HEAP_ID.  XLOG_HEAP_OPMASK applies to
+ * these, too.
+ *
+ * 0x10 through 0x70 are available for use.
+ */
+#define XLOG_HEAP3_PARTIAL_HOT_UPDATE	0x00
+
 /*
  * xl_heap_insert/xl_heap_multi_insert flag values, 8 bits are available.
  */
@@ -396,6 +406,9 @@ extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
 extern const char *heap2_identify(uint8 info);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
+extern void heap3_redo(XLogReaderState *record);
+extern void heap3_desc(StringInfo buf, XLogReaderState *record);
+extern const char *heap3_identify(uint8 info);
 
 extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
 										TransactionId latestRemovedXid);
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7c62852e7f..57dcfa6dd8 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -274,7 +274,8 @@ struct HeapTupleHeaderData
  * information stored in t_infomask2:
  */
 #define HEAP_NATTS_MASK			0x07FF	/* 11 bits for number of attributes */
-/* bits 0x1800 are available */
+#define HEAP_PHOT_UPDATED		0x0800	/* tuple was PHOT-updated */
+#define HEAP_PHOT_TUPLE			0x1000	/* this is a partial heap-only tuple */
 #define HEAP_KEYS_UPDATED		0x2000	/* tuple was updated and key cols
 										 * modified, or tuple deleted */
 #define HEAP_HOT_UPDATED		0x4000	/* tuple was HOT-updated */
@@ -490,6 +491,7 @@ do { \
 
 #define HeapTupleHeaderSetHotUpdated(tup) \
 ( \
+	AssertMacro(((tup)->t_infomask2 & HEAP_PHOT_UPDATED) == 0), \
 	(tup)->t_infomask2 |= HEAP_HOT_UPDATED \
 )
 
@@ -505,6 +507,7 @@ do { \
 
 #define HeapTupleHeaderSetHeapOnly(tup) \
 ( \
+  AssertMacro(((tup)->t_infomask2 & HEAP_PHOT_TUPLE) == 0), \
   (tup)->t_infomask2 |= HEAP_ONLY_TUPLE \
 )
 
@@ -513,6 +516,40 @@ do { \
   (tup)->t_infomask2 &= ~HEAP_ONLY_TUPLE \
 )
 
+#define HeapTupleHeaderIsPartialHotUpdated(tup) \
+( \
+	((tup)->t_infomask2 & HEAP_PHOT_UPDATED) != 0 && \
+	((tup)->t_infomask & HEAP_XMAX_INVALID) == 0 && \
+	!HeapTupleHeaderXminInvalid(tup) \
+)
+
+#define HeapTupleHeaderSetPartialHotUpdated(tup) \
+( \
+	AssertMacro(((tup)->t_infomask2 & HEAP_HOT_UPDATED) == 0), \
+	(tup)->t_infomask2 |= HEAP_PHOT_UPDATED \
+)
+
+#define HeapTupleHeaderClearPartialHotUpdated(tup) \
+( \
+	(tup)->t_infomask2 &= ~HEAP_PHOT_UPDATED \
+)
+
+#define HeapTupleHeaderIsPartialHeapOnly(tup) \
+( \
+	((tup)->t_infomask2 & HEAP_PHOT_TUPLE) != 0 \
+)
+
+#define HeapTupleHeaderSetPartialHeapOnly(tup) \
+( \
+	AssertMacro(((tup)->t_infomask2 & HEAP_ONLY_TUPLE) == 0), \
+	(tup)->t_infomask2 |= HEAP_PHOT_TUPLE \
+)
+
+#define HeapTupleHeaderClearPartialHeapOnly(tup) \
+( \
+	(tup)->t_infomask2 &= ~HEAP_PHOT_TUPLE \
+)
+
 #define HeapTupleHeaderHasMatch(tup) \
 ( \
   ((tup)->t_infomask2 & HEAP_TUPLE_HAS_MATCH) != 0 \
@@ -691,6 +728,24 @@ struct MinimalTupleData
 #define HeapTupleClearHeapOnly(tuple) \
 		HeapTupleHeaderClearHeapOnly((tuple)->t_data)
 
+#define HeapTupleIsPartialHotUpdated(tuple) \
+		HeapTupleHeaderIsPartialHotUpdated((tuple)->t_data)
+
+#define HeapTupleSetPartialHotUpdated(tuple) \
+		HeapTupleHeaderSetPartialHotUpdated((tuple)->t_data)
+
+#define HeapTupleClearPartialHotUpdated(tuple) \
+		HeapTupleHeaderClearPartialHotUpdated((tuple)->t_data)
+
+#define HeapTupleIsPartialHeapOnly(tuple) \
+		HeapTupleHeaderIsPartialHeapOnly((tuple)->t_data)
+
+#define HeapTupleSetPartialHeapOnly(tuple) \
+		HeapTupleHeaderSetPartialHeapOnly((tuple)->t_data)
+
+#define HeapTupleClearPartialHeapOnly(tuple) \
+		HeapTupleHeaderClearPartialHeapOnly((tuple)->t_data)
+
 
 /* ----------------
  *		fastgetattr
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index f582cf535f..367b8cebf0 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP3_ID, "Heap3", heap3_redo, heap3_desc, heap3_identify, NULL, NULL, heap_mask)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 33bffb6815..817507111c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -403,7 +403,8 @@ typedef struct TableAmRoutine
 									  ItemPointer tid,
 									  Snapshot snapshot,
 									  TupleTableSlot *slot,
-									  bool *call_again, bool *all_dead);
+									  bool *call_again, bool *all_dead,
+									  Relation indrel);
 
 
 	/* ------------------------------------------------------------------------
@@ -495,7 +496,9 @@ typedef struct TableAmRoutine
 								 bool wait,
 								 TM_FailureData *tmfd,
 								 LockTupleMode *lockmode,
-								 bool *update_indexes);
+								 bool *update_all_indexes,
+								 bool *update_modified_indexes,
+								 Bitmapset **modified_attrs);
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
@@ -757,7 +760,7 @@ typedef struct TableAmRoutine
 	 * scan_bitmap_next_tuple need to exist, or neither.
 	 */
 	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+										   struct TBMIterateResult *tbmres, Bitmapset *interesting_attrs);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
@@ -1124,7 +1127,8 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 						ItemPointer tid,
 						Snapshot snapshot,
 						TupleTableSlot *slot,
-						bool *call_again, bool *all_dead)
+						bool *call_again, bool *all_dead,
+						Relation indrel)
 {
 	/*
 	 * We don't expect direct calls to table_index_fetch_tuple with valid
@@ -1136,7 +1140,7 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 
 	return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
 													slot, call_again,
-													all_dead);
+													all_dead, indrel);
 }
 
 /*
@@ -1148,7 +1152,8 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 extern bool table_index_fetch_tuple_check(Relation rel,
 										  ItemPointer tid,
 										  Snapshot snapshot,
-										  bool *all_dead);
+										  bool *all_dead,
+										  Relation indrel);
 
 
 /* ------------------------------------------------------------------------
@@ -1417,12 +1422,13 @@ static inline TM_Result
 table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
-				   bool *update_indexes)
+				   bool *update_all_indexes, bool *update_modified_indexes, Bitmapset **modified_attrs)
 {
 	return rel->rd_tableam->tuple_update(rel, otid, slot,
 										 cid, snapshot, crosscheck,
 										 wait, tmfd,
-										 lockmode, update_indexes);
+										 lockmode, update_all_indexes,
+										 update_modified_indexes, modified_attrs);
 }
 
 /*
@@ -1842,7 +1848,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
  */
 static inline bool
 table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+							 struct TBMIterateResult *tbmres, Bitmapset *interesting_attrs)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1853,7 +1859,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
 		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
 
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
-														   tbmres);
+														   tbmres, interesting_attrs);
 }
 
 /*
@@ -1940,7 +1946,8 @@ extern void simple_table_tuple_delete(Relation rel, ItemPointer tid,
 									  Snapshot snapshot);
 extern void simple_table_tuple_update(Relation rel, ItemPointer otid,
 									  TupleTableSlot *slot, Snapshot snapshot,
-									  bool *update_indexes);
+									  bool *update_all_indexes, bool *update_modified_indexes,
+									  Bitmapset **modified_attrs);
 
 
 /* ----------------------------------------------------------------------------
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 071e363d54..9390f2dfd5 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -588,7 +588,8 @@ extern List *ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 								   TupleTableSlot *slot, EState *estate,
 								   bool update,
 								   bool noDupErr,
-								   bool *specConflict, List *arbiterIndexes);
+								   bool *specConflict, List *arbiterIndexes,
+								   bool update_modified_indexes, Bitmapset *modified_attrs);
 extern bool ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo,
 									  TupleTableSlot *slot,
 									  EState *estate, ItemPointer conflictTid,
-- 
2.16.6

#2Bruce Momjian
bruce@momjian.us
In reply to: Bossart, Nathan (#1)
Re: partial heap only tuples

On Tue, Feb 9, 2021 at 06:48:21PM +0000, Bossart, Nathan wrote:

Hello,

I'm hoping to gather some early feedback on a heap optimization I've
been working on. In short, I'm hoping to add "partial heap only
tuple" (PHOT) support, which would allow you to skip updating indexes
for unchanged columns even when other indexes require updates. Today,

I think it is great you are working on this. I think it is a major way
to improve performance and I have been disappointed it has not moved
forward since 2016.

HOT works wonders when no indexed columns are updated. However, as
soon as you touch one indexed column, you lose that optimization
entirely, as you must update every index on the table. The resulting
performance impact is a pain point for many of our (AWS's) enterprise
customers, so we'd like to lend a hand for some improvements in this
area. For workloads involving a lot of columns and a lot of indexes,
an optimization like PHOT can make a huge difference. I'm aware that
there was a previous attempt a few years ago to add a similar
optimization called WARM [0] [1]. However, I only noticed this
previous effort after coming up with the design for PHOT, so I ended
up taking a slightly different approach. I am also aware of a couple
of recent nbtree improvements that may mitigate some of the impact of
non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice
complement to those. I've attached a very early proof-of-concept
patch with the design described below.

How is your approach different from those of [0] and [1]? It is
interesting you still see performance benefits even after the btree
duplication improvements. Did you test with those improvements?

As far as performance is concerned, it is simple enough to show major
benefits from PHOT by tacking on a large number of indexes and columns
to a table. For a short pgbench run where each table had 5 additional
text columns and indexes on every column, I noticed a ~34% bump in
TPS with PHOT [4]. Theoretically, the TPS bump should be even higher

That's a big improvement.

Next, I'll go into the design a bit. I've commandeered the two
remaining bits in t_infomask2 to use as HEAP_PHOT_UPDATED and
HEAP_PHOT_TUPLE. These are analogous to the HEAP_HOT_UPDATED and
HEAP_ONLY_TUPLE bits. (If there are concerns about exhausting the
t_infomask2 bits, I think we could only use one of the remaining bits
as a "modifier" bit on the HOT ones. I opted against that for the
proof-of-concept patch to keep things simple.) When creating a PHOT
tuple, we only create new index tuples for updated columns. These new
index tuples point to the PHOT tuple. Following is a simple
demonstration with a table with two integer columns, each with its own
index:

Whatever solution you have, you have to be able to handle
adding/removing columns, and adding/removing indexes.

When it is time to scan through a PHOT chain, there are a couple of
things to account for. Sequential scans work out-of-the-box thanks to
the visibility rules, but other types of scans like index scans
require additional checks. If you encounter a PHOT chain when
performing an index scan, you should only continue following the chain
as long as none of the columns the index indexes are modified. If the
scan does encounter such a modification, we stop following the chain
and continue with the index scan. Even if there is a tuple in that

I think in patch [0] and [1], if an index column changes, all the
indexes had to be inserted into, while you seem to require inserts only
into the index that needs it. Is that correct?

PHOT chain that should be returned by our index scan, we will still
find it, as there will be another matching index tuple that points us
to later in the PHOT chain. My initial idea for determining which
columns were modified was to add a new bitmap after the "nulls" bitmap
in the tuple header. However, the attached patch simply uses
HeapDetermineModifiedColumns(). I've yet to measure the overhead of
this approach versus the bitmap approach, but I haven't noticed
anything too detrimental in the testing I've done so far.

A bitmap is an interesting approach, but you are right it will need
benchmarking.

I wonder if you should create a Postgres wiki page to document all of
this. I agree PG 15 makes sense. I would like to help with this if I
can. I will need to study this email more later.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

#3Bossart, Nathan
bossartn@amazon.com
In reply to: Bruce Momjian (#2)
Re: partial heap only tuples

On 2/10/21, 2:43 PM, "Bruce Momjian" <bruce@momjian.us> wrote:

On Tue, Feb 9, 2021 at 06:48:21PM +0000, Bossart, Nathan wrote:

HOT works wonders when no indexed columns are updated. However, as
soon as you touch one indexed column, you lose that optimization
entirely, as you must update every index on the table. The resulting
performance impact is a pain point for many of our (AWS's) enterprise
customers, so we'd like to lend a hand for some improvements in this
area. For workloads involving a lot of columns and a lot of indexes,
an optimization like PHOT can make a huge difference. I'm aware that
there was a previous attempt a few years ago to add a similar
optimization called WARM [0] [1]. However, I only noticed this
previous effort after coming up with the design for PHOT, so I ended
up taking a slightly different approach. I am also aware of a couple
of recent nbtree improvements that may mitigate some of the impact of
non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice
complement to those. I've attached a very early proof-of-concept
patch with the design described below.

How is your approach different from those of [0] and [1]? It is
interesting you still see performance benefits even after the btree
duplication improvements. Did you test with those improvements?

I believe one of the main differences is that index tuples will point
to the corresponding PHOT tuple instead of the root of the HOT/PHOT
chain. I'm sure there are other differences. I plan on giving those
two long threads another read-through in the near future.

I made sure that the btree duplication improvements were applied for
my benchmarking. IIUC those don't alleviate the requirement that you
insert all index tuples for non-HOT updates, so PHOT can still provide
some added benefits there.

Next, I'll go into the design a bit. I've commandeered the two
remaining bits in t_infomask2 to use as HEAP_PHOT_UPDATED and
HEAP_PHOT_TUPLE. These are analogous to the HEAP_HOT_UPDATED and
HEAP_ONLY_TUPLE bits. (If there are concerns about exhausting the
t_infomask2 bits, I think we could only use one of the remaining bits
as a "modifier" bit on the HOT ones. I opted against that for the
proof-of-concept patch to keep things simple.) When creating a PHOT
tuple, we only create new index tuples for updated columns. These new
index tuples point to the PHOT tuple. Following is a simple
demonstration with a table with two integer columns, each with its own
index:

Whatever solution you have, you have to be able to handle
adding/removing columns, and adding/removing indexes.

I admittedly have not thought too much about the implications of
adding/removing columns and indexes for PHOT yet, but that's
definitely an important part of this project that I need to look into.
I see that HOT has some special handling for commands like CREATE
INDEX that I can reference.

When it is time to scan through a PHOT chain, there are a couple of
things to account for. Sequential scans work out-of-the-box thanks to
the visibility rules, but other types of scans like index scans
require additional checks. If you encounter a PHOT chain when
performing an index scan, you should only continue following the chain
as long as none of the columns the index indexes are modified. If the
scan does encounter such a modification, we stop following the chain
and continue with the index scan. Even if there is a tuple in that

I think in patch [0] and [1], if an index column changes, all the
indexes had to be inserted into, while you seem to require inserts only
into the index that needs it. Is that correct?

Right, PHOT only requires new index tuples for the modified columns.
However, I was under the impression that WARM aimed to do the same
thing. I might be misunderstanding your question.
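
To make that concrete, here is a tiny standalone sketch (ordinary C,
compiled on its own, not code from the patch) of the per-index
decision PHOT makes on update.  It treats each column as a plain int,
assumes one single-column index per column, and uses a bitmask as a
stand-in for what HeapDetermineModifiedColumns() computes over real
tuples:

  /*
   * Toy sketch only, not from the patch: decide which single-column
   * indexes need a new entry for an UPDATE, based on which columns
   * actually changed.
   */
  #include <stdint.h>
  #include <stdio.h>

  #define NCOLS 3
  #define NINDEXES 3

  /* stand-in for HeapDetermineModifiedColumns(): bit i = column i changed */
  static uint32_t
  modified_columns(const int *oldtup, const int *newtup)
  {
      uint32_t mods = 0;

      for (int i = 0; i < NCOLS; i++)
          if (oldtup[i] != newtup[i])
              mods |= 1u << i;
      return mods;
  }

  int
  main(void)
  {
      int oldtup[NCOLS] = {0, 0, 0};
      int newtup[NCOLS] = {0, 1, 1};          /* only columns 2 and 3 change */
      int index_keycol[NINDEXES] = {0, 1, 2}; /* idx1..idx3 key columns */
      uint32_t mods = modified_columns(oldtup, newtup);

      if (mods == 0)
      {
          printf("no columns changed; a plain HOT update suffices\n");
          return 0;
      }

      for (int i = 0; i < NINDEXES; i++)
      {
          if (mods & (1u << index_keycol[i]))
              printf("idx%d: insert new index tuple for the PHOT tuple\n", i + 1);
          else
              printf("idx%d: no new index tuple needed\n", i + 1);
      }
      return 0;
  }

For an update that changes only the second and third columns, this
prints new entries for idx2 and idx3 and skips idx1, which is the
per-index work PHOT avoids repeating.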

I wonder if you should create a Postgres wiki page to document all of
this. I agree PG 15 makes sense. I would like to help with this if I
can. I will need to study this email more later.

Thanks for taking a look. I think a wiki is a good idea for keeping
track of the current state of the design. I'll look into that.

Nathan

#4Andres Freund
andres@anarazel.de
In reply to: Bossart, Nathan (#1)
Re: partial heap only tuples

Hi,

On 2021-02-09 18:48:21 +0000, Bossart, Nathan wrote:

In order to be eligible for cleanup, the final tuple in the
corresponding PHOT/HOT chain must also be eligible for cleanup, or all
indexes must have been updated later in the chain before any visible
tuples.

This sounds like it might be prohibitively painful. Adding effectively
unremovable bloat to remove other bloat is not an uncomplicated
premise. I think you'd really need a way to fully remove this as part of
vacuum for this to be viable.

Greetings,

Andres Freund

#5Bossart, Nathan
bossartn@amazon.com
In reply to: Andres Freund (#4)
Re: partial heap only tuples

On 2/13/21, 8:26 AM, "Andres Freund" <andres@anarazel.de> wrote:

On 2021-02-09 18:48:21 +0000, Bossart, Nathan wrote:

In order to be eligible for cleanup, the final tuple in the
corresponding PHOT/HOT chain must also be eligible for cleanup, or all
indexes must have been updated later in the chain before any visible
tuples.

This sounds like it might be prohibitively painful. Adding effectively
unremovable bloat to remove other bloat is not an uncomplicated
premise. I think you'd really need a way to fully remove this as part of
vacuum for this to be viable.

Yeah, this is something I'm concerned about. I think adding a bitmap
of modified columns to the header of PHOT-updated tuples improves
matters quite a bit, even for single-page vacuuming. Following is a
strategy I've been developing (there may still be some gaps). Here's
a basic PHOT chain where all tuples are visible and the last one has
not been deleted or updated:

   idx1   0         1                   2         3
   idx2   0         1         2
   idx3   0                   2                   3
     lp   1         2         3         4         5
  tuple   (0,0,0)   (0,1,1)   (2,2,1)   (2,2,2)   (3,2,3)
 bitmap             -xx       xx-       --x       x-x

For single-page vacuum, we take the following actions:
1. Starting at the root of the PHOT chain, create an OR'd bitmap
of the chain.
2. Walk backwards, OR-ing the bitmaps. Stop when the bitmap
matches the one from step 1. As we walk backwards, identify
"key" tuples, which are tuples where the OR'd bitmap changes as
you walk backwards. If the OR'd bitmap does not include all
columns for the table, also include the root of the PHOT chain
as a key tuple.
3. Redirect each key tuple to the next key tuple.
4. For all but the first key tuple, OR the bitmaps of all pruned
tuples from each key tuple (exclusive) to the next key tuple
(inclusive) and store the result in the bitmap of the next key
tuple.
5. Mark all line pointers for all non-key tuples as dead. Storage
can be removed for all tuples except the last one, but we must
leave around the bitmap for all key tuples except for the first
one.

After this, our basic PHOT chain looks like this:

   idx1   0         1                   2         3
   idx2   0         1         2
   idx3   0                   2                   3
     lp   X         X         3->5      X         5
  tuple                                           (3,2,3)
 bitmap                                           x-x

Without PHOT, this intermediate state would have 15 index tuples, 5
line pointers, and 1 heap tuple. With PHOT, we have 10 index tuples,
5 line pointers, 1 heap tuple, and 1 bitmap. When we vacuum the
indexes, we can reclaim the dead line pointers and remove the
associated index tuples:

   idx1                                           3
   idx2                       2
   idx3                       2                   3
     lp                       3->5                5
  tuple                                           (3,2,3)
 bitmap                                           x-x

Without PHOT, this final state would have 3 index tuples, 1 line
pointer, and 1 heap tuple. With PHOT, we have 4 index tuples, 2 line
pointers, 1 heap tuple, and 1 bitmap. Overall, we still end up
keeping around more line pointers and tuple headers (for the bitmaps),
but maybe that is good enough. I think the next step here would be to
find a way to remove some of the unnecessary index tuples and adjust
the remaining ones to point to the last line pointer in the PHOT
chain.
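
As a sanity check on that bookkeeping, here is a standalone sketch
(plain C, not server code) of steps 1 and 2, with each
modified-columns bitmap collapsed to a three-bit mask (bit i set
means column i+1 was modified) and the root represented by an empty
mask.  Run against the chain above, it reports lp 3 and lp 5 as the
key tuples, which is what produces the 3->5 redirect in the pruned
diagram:

  /*
   * Toy sketch only, not server code: steps 1 and 2 of the single-page
   * vacuum strategy above.  The chain is just an array of bitmasks;
   * bitmaps[0] is the root, which carries no bitmap of its own.
   */
  #include <stdint.h>
  #include <stdio.h>

  #define ALL_COLS 0x7            /* three-column table */

  int
  main(void)
  {
      /* root, then -xx, xx-, --x, x-x from the chain above */
      uint32_t bitmaps[] = {0x0, 0x6, 0x3, 0x4, 0x5};
      int nchain = 5;
      uint32_t chain_or = 0;
      uint32_t running = 0;

      /* step 1: OR'd bitmap of the whole chain, starting at the root */
      for (int i = 0; i < nchain; i++)
          chain_or |= bitmaps[i];

      /* step 2: walk backwards; members that grow the OR are "key" tuples */
      for (int i = nchain - 1; i >= 0; i--)
      {
          uint32_t next = running | bitmaps[i];

          if (next != running)
              printf("lp %d is a key tuple\n", i + 1);
          running = next;
          if (running == chain_or)
              break;              /* matches the step 1 bitmap, so stop */
      }

      /* the root still carries the live values of never-modified columns */
      if (chain_or != ALL_COLS)
          printf("lp 1 (the root) is also a key tuple\n");

      return 0;
  }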

Nathan

#6Bossart, Nathan
bossartn@amazon.com
In reply to: Andres Freund (#4)
Re: partial heap only tuples

On 2/10/21, 2:43 PM, "Bruce Momjian" <bruce@momjian.us> wrote:

I wonder if you should create a Postgres wiki page to document all of
this. I agree PG 15 makes sense. I would like to help with this if I
can. I will need to study this email more later.

I've started the wiki page for this:

https://wiki.postgresql.org/wiki/Partial_Heap_Only_Tuples

Nathan

#7Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Bossart, Nathan (#6)
Re: partial heap only tuples

On Wed, Feb 24, 2021 at 3:22 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 2/10/21, 2:43 PM, "Bruce Momjian" <bruce@momjian.us> wrote:

I wonder if you should create a Postgres wiki page to document all of
this. I agree PG 15 makes sense. I would like to help with this if I
can. I will need to study this email more later.

I've started the wiki page for this:

https://wiki.postgresql.org/wiki/Partial_Heap_Only_Tuples

Nathan

The regression test case (partial-index) is failing

https://cirrus-ci.com/task/5310522716323840

----
=== ./src/test/isolation/output_iso/regression.diffs ===
diff -U3 /tmp/cirrus-ci-build/src/test/isolation/expected/partial-index.out
/tmp/cirrus-ci-build/src/test/isolation/output_iso/results/partial-index.out
--- /tmp/cirrus-ci-build/src/test/isolation/expected/partial-index.out
2021-03-06 23:11:08.018868833 +0000
+++
/tmp/cirrus-ci-build/src/test/isolation/output_iso/results/partial-index.out
2021-03-06 23:26:15.857027075 +0000
@@ -30,6 +30,8 @@
6 a 1
7 a 1
8 a 1
+9 a 2
+10 a 2
step c2: COMMIT;
starting permutation: rxy1 wx1 wy2 c1 rxy2 c2
@@ -83,6 +85,7 @@
6 a 1
7 a 1
8 a 1
+9 a 2
10 a 1
step c1: COMMIT;
----

Can you please take a look at that?

--
Ibrar Ahmed

#8Bossart, Nathan
bossartn@amazon.com
In reply to: Ibrar Ahmed (#7)
Re: partial heap only tuples

On 3/8/21, 10:16 AM, "Ibrar Ahmed" <ibrar.ahmad@gmail.com> wrote:

On Wed, Feb 24, 2021 at 3:22 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 2/10/21, 2:43 PM, "Bruce Momjian" <bruce@momjian.us> wrote:

I wonder if you should create a Postgres wiki page to document all of
this. I agree PG 15 makes sense. I would like to help with this if I
can. I will need to study this email more later.

I've started the wiki page for this:

https://wiki.postgresql.org/wiki/Partial_Heap_Only_Tuples

Nathan

The regression test case (partial-index) is failing

https://cirrus-ci.com/task/5310522716323840

This patch is intended as a proof-of-concept of some basic pieces of
the project. I'm working on a new patch set that should be more
suitable for community review.

Nathan

#9Bruce Momjian
bruce@momjian.us
In reply to: Bossart, Nathan (#5)
Re: partial heap only tuples

On Mon, Feb 15, 2021 at 08:19:40PM +0000, Bossart, Nathan wrote:

Yeah, this is something I'm concerned about. I think adding a bitmap
of modified columns to the header of PHOT-updated tuples improves
matters quite a bit, even for single-page vacuuming. Following is a
strategy I've been developing (there may still be some gaps). Here's
a basic PHOT chain where all tuples are visible and the last one has
not been deleted or updated:

   idx1   0         1                   2         3
   idx2   0         1         2
   idx3   0                   2                   3
     lp   1         2         3         4         5
  tuple   (0,0,0)   (0,1,1)   (2,2,1)   (2,2,2)   (3,2,3)
 bitmap             -xx       xx-       --x       x-x

First, I want to continue encouraging you to work on this because I
think it can yield big improvements. Second, I like the wiki you
created. Third, the diagram above seems to be more meaningful if read
from the bottom-up. I suggest you reorder it on the wiki so it can be
read top-down, maybe:

     lp   1         2         3         4         5
  tuple   (0,0,0)   (0,1,1)   (2,2,1)   (2,2,2)   (3,2,3)
 bitmap             -xx       xx-       --x       x-x
   idx1   0         1                   2         3
   idx2   0         1         2
   idx3   0                   2                   3

Fourth, I know in the wiki you said create/drop index needs more
research, but I suggest you avoid any design that will be overly complex
for create/drop index. For example, a per-row bitmap that is based on
what indexes exist at time of row creation might cause unacceptable
problems in handling create/drop index. Would you number indexes? I am
not saying you have to solve all the problems now, but you have to keep
your eye on obstacles that might block your progress later.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

#10Bossart, Nathan
bossartn@amazon.com
In reply to: Bruce Momjian (#9)
Re: partial heap only tuples

On 3/9/21, 8:24 AM, "Bruce Momjian" <bruce@momjian.us> wrote:

On Mon, Feb 15, 2021 at 08:19:40PM +0000, Bossart, Nathan wrote:

Yeah, this is something I'm concerned about. I think adding a bitmap
of modified columns to the header of PHOT-updated tuples improves
matters quite a bit, even for single-page vacuuming. Following is a
strategy I've been developing (there may still be some gaps). Here's
a basic PHOT chain where all tuples are visible and the last one has
not been deleted or updated:

   idx1   0         1                   2         3
   idx2   0         1         2
   idx3   0                   2                   3
     lp   1         2         3         4         5
  tuple   (0,0,0)   (0,1,1)   (2,2,1)   (2,2,2)   (3,2,3)
 bitmap             -xx       xx-       --x       x-x

First, I want to continue encouraging you to work on this because I
think it can yield big improvements. Second, I like the wiki you
created. Third, the diagram above seems to be more meaningful if read
from the bottom-up. I suggest you reorder it on the wiki so it can be
read top-down, maybe:

     lp   1         2         3         4         5
  tuple   (0,0,0)   (0,1,1)   (2,2,1)   (2,2,2)   (3,2,3)
 bitmap             -xx       xx-       --x       x-x
   idx1   0         1                   2         3
   idx2   0         1         2
   idx3   0                   2                   3

I appreciate the feedback and the words of encouragement. I'll go
ahead and flip the diagrams like you suggested. I'm planning on
publishing a larger round of edits to the wiki once the patch set is
ready to share. There are a few changes to the design that I've
picked up along the way.

Fourth, I know in the wiki you said create/drop index needs more
research, but I suggest you avoid any design that will be overly complex
for create/drop index. For example, a per-row bitmap that is based on
what indexes exist at time of row creation might cause unacceptable
problems in handling create/drop index. Would you number indexes? I am
not saying you have to solve all the problems now, but you have to keep
your eye on obstacles that might block your progress later.

I am agreed on avoiding an overly complex design. This project
introduces a certain amount of inherent complexity, so one of my main
goals is ensuring that it's easy to reason about each piece.

I'm cautiously optimistic that index creation and deletion will not
require too much extra work. For example, if a new index needs to
point to a partial heap only tuple, it can do so (unlike HOT, which
would require that the new index point to the root of the chain). The
modified-columns bitmaps could include the entire set of modified
columns (not just the indexed ones), so no additional changes would
need to be made there. Furthermore, I'm anticipating that the
modified-columns bitmaps will end up only being used with the
redirected LPs to help reduce heap bloat after single-page vacuuming.
In that case, new indexes would probably avoid the existing bitmaps
anyway.

Nathan

#11Bruce Momjian
bruce@momjian.us
In reply to: Bossart, Nathan (#10)
Re: partial heap only tuples

On Tue, Mar 9, 2021 at 09:33:31PM +0000, Bossart, Nathan wrote:

I'm cautiously optimistic that index creation and deletion will not
require too much extra work. For example, if a new index needs to
point to a partial heap only tuple, it can do so (unlike HOT, which
would require that the new index point to the root of the chain). The
modified-columns bitmaps could include the entire set of modified
columns (not just the indexed ones), so no additional changes would
need to be made there. Furthermore, I'm anticipating that the
modified-columns bitmaps will end up only being used with the
redirected LPs to help reduce heap bloat after single-page vacuuming.
In that case, new indexes would probably avoid the existing bitmaps
anyway.

Yes, that would probably work, sure.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee

#12Peter Geoghegan
pg@bowt.ie
In reply to: Bossart, Nathan (#1)
Re: partial heap only tuples

On Tue, Feb 9, 2021 at 10:48 AM Bossart, Nathan <bossartn@amazon.com> wrote:

I'm hoping to gather some early feedback on a heap optimization I've
been working on. In short, I'm hoping to add "partial heap only
tuple" (PHOT) support, which would allow you to skip updating indexes
for unchanged columns even when other indexes require updates. Today,
HOT works wonders when no indexed columns are updated. However, as
soon as you touch one indexed column, you lose that optimization
entirely, as you must update every index on the table. The resulting
performance impact is a pain point for many of our (AWS's) enterprise
customers, so we'd like to lend a hand for some improvements in this
area. For workloads involving a lot of columns and a lot of indexes,
an optimization like PHOT can make a huge difference. I'm aware that
there was a previous attempt a few years ago to add a similar
optimization called WARM [0] [1]. However, I only noticed this
previous effort after coming up with the design for PHOT, so I ended
up taking a slightly different approach. I am also aware of a couple
of recent nbtree improvements that may mitigate some of the impact of
non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice
complement to those. I've attached a very early proof-of-concept
patch with the design described below.

I would like to share some thoughts that I have about how I think
about optimizations like PHOT, and how they might fit together with my
own work -- particularly the nbtree bottom-up index deletion feature
you referenced. My remarks could equally well apply to WARM.
Ordinarily this is the kind of thing that would be too hand-wavey for
the mailing list, but we don't have the luxury of in-person
communication right now.

Everybody tends to talk about HOT as if it works perfectly once you
make some modest assumptions, such as "there are no long-running
transactions", and "no UPDATEs will logically modify indexed columns".
But I tend to doubt that that's truly the case -- I think that there
are still pathological cases where HOT cannot keep the total table
size stable in the long run due to subtle effects that eventually
aggregate into significant issues, like heap fragmentation. Ask Jan
Wieck about the stability of some of the TPC-C/BenchmarkSQL tables to
get one example of this. There is no reason to believe that PHOT will
help with that. Maybe that's okay, but I would think carefully about
what that means if I were undertaking this work. Ensuring stability in
the on-disk size of tables in cases where the size of the logical
database is stable should be an important goal of a project like PHOT
or HOT.

If you want to get a better sense of how these inefficiencies might
happen, I suggest looking into using recently added autovacuum logging
to analyze how well HOT works today, using the technique that I go
into here:

/messages/by-id/CAH2-WzkjU+NiBskZunBDpz6trSe+aQvuPAe+xgM8ZvoB4wQwhA@mail.gmail.com

Small inefficiencies in the on-disk structure have a tendency to
aggregate over time, at least when there is no possible way to reverse
them. The bottom-up index deletion stuff is very effective as a
backstop against index bloat, because things are generally very
non-linear. The cost of an unnecessary page split is very high, and
permanent. But we can make it cheap to *try* to avoid that using
fairly simple heuristics. We can be reasonably confident that we're
about to split the page unnecessarily, and use cues that ramp up the
number of heap page accesses as needed. We ramp up during a bottom-up
index deletion, as we manage to free some index tuples as a result of
previous heap page accesses.

This works very well because we can intervene very selectively. We
aren't interested in deleting index tuples unless and until we really
have to, and in general there tends to be quite a bit of free space to
temporarily store extra version duplicates -- that's what most index
pages look like, even on the busiest of databases. It's possible for
the bottom-up index deletion mechanism to be invoked very
infrequently, and yet make a huge difference. And when it fails to
free anything, it fails permanently for that particular leaf page
(because it splits) -- so now we have lots of space for future index
tuple insertions that cover the original page's key space. We won't
thrash.

My intuition is that similar principles can be applied inside heapam.
Failing to fit related versions on a heap page (having managed to do
so for hours or days before that point) is more or less the heap page
equivalent of a leaf page split from version churn (this is the
pathology that bottom-up index deletion targets). For example, we
could have a fall back mode that compresses old versions that is used
if and only if heap pruning was attempted but then failed. We should
always try to avoid migrating to a new heap page, because that amounts
to a permanent solution to a temporary problem. We should perhaps make
the updater work to prove that that's truly necessary, rather than
giving up immediately (i.e. assuming that it must be necessary at the
first sign of trouble).

We might have successfully fit the successor heap tuple version a
million times before just by HOT pruning, and yet currently we give up
just because it didn't work on the one millionth and first occasion --
don't you think that's kind of silly? We may be able to afford having
a fallback strategy that is relatively expensive, provided it is
rarely used. And it might be very effective in the aggregate, despite
being rarely used -- it might provide us just what we were missing
before. Just try harder when you run into a problem every once in a
blue moon!

A diversity of strategies with fallback behavior is sometimes the best
strategy. Don't underestimate the contribution of rare and seemingly
insignificant adverse events. Consider the lifecycle of the data over
time. If we quit trying to fit new versions on the same heap page at
the first sign of real trouble, then it's only a matter of time until
widespread heap fragmentation results -- each heap page only has to be
unlucky once, and in the long run it's inevitable that they all will.
We could probably do better at nipping it in the bud at the level of
individual heap pages and opportunistic prune operations.

I'm sure that it would be useful to not have to rely on bottom-up
index deletion in more cases -- I think that the idea of "a better
HOT" might still be very helpful. Bottom-up index deletion is only
supposed to be a backstop against pathological behavior (version churn
page splits), which is probably always going to be possible with a
sufficiently extreme workload. I don't believe that the current levels
of version churn/write amplification that we still see with Postgres
must be addressed through totally eliminating multiple versions of the
same logical row that live together in the same heap page. This idea
is a false dichotomy. And it fails to acknowledge how the current
design often works very well. When and how it fails to work well with
a real workload and real tuning (especially heap fill factor tuning)
is probably not well understood. Why not start with that?

Our default heap fill factor is 100. Maybe that's the right decision,
but it significantly impedes the ability of HOT to keep the size of
tables stable over time. Just because heap fill factor 90 also has
issues today doesn't mean that each pathological behavior cannot be
fixed through targeted intervention. Maybe the myth that HOT works
perfectly once you make some modest assumptions could come true.

--
Peter Geoghegan

#13Bruce Momjian
bruce@momjian.us
In reply to: Peter Geoghegan (#12)
Re: partial heap only tuples

On Sun, Apr 18, 2021 at 04:27:15PM -0700, Peter Geoghegan wrote:

Everybody tends to talk about HOT as if it works perfectly once you
make some modest assumptions, such as "there are no long-running
transactions", and "no UPDATEs will logically modify indexed columns".
But I tend to doubt that that's truly the case -- I think that there
are still pathological cases where HOT cannot keep the total table
size stable in the long run due to subtle effects that eventually
aggregate into significant issues, like heap fragmentation. Ask Jan
Wieck about the stability of some of the TPC-C/BenchmarkSQL tables to

...

We might have successfully fit the successor heap tuple version a
million times before just by HOT pruning, and yet currently we give up
just because it didn't work on the one millionth and first occasion --
don't you think that's kind of silly? We may be able to afford having
a fallback strategy that is relatively expensive, provided it is
rarely used. And it might be very effective in the aggregate, despite
being rarely used -- it might provide us just what we were missing
before. Just try harder when you run into a problem every once in a
blue moon!

A diversity of strategies with fallback behavior is sometimes the best
strategy. Don't underestimate the contribution of rare and seemingly
insignificant adverse events. Consider the lifecycle of the data over

That is an interesting point --- we often focus on optimizing frequent
operations, but preventing rare but expensive-in-aggregate events from
happening is also useful.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.

#14Peter Geoghegan
pg@bowt.ie
In reply to: Bruce Momjian (#13)
Re: partial heap only tuples

On Mon, Apr 19, 2021 at 5:09 PM Bruce Momjian <bruce@momjian.us> wrote:

A diversity of strategies with fallback behavior is sometimes the best
strategy. Don't underestimate the contribution of rare and seemingly
insignificant adverse events. Consider the lifecycle of the data over

That is an interesting point --- we often focus on optimizing frequent
operations, but preventing rare but expensive-in-aggregate events from
happening is also useful.

Right. Similarly, we sometimes focus on adding an improvement,
overlooking more promising opportunities to subtract a disimprovement.
Apparently this is a well known tendency:

https://www.scientificamerican.com/article/our-brain-typically-overlooks-this-brilliant-problem-solving-strategy/

I believe that it's particularly important to consider subtractive
approaches with a complex system. This has sometimes worked well for
me as a conscious and deliberate strategy.

--
Peter Geoghegan

#15vignesh C
vignesh21@gmail.com
In reply to: Bossart, Nathan (#8)
Re: partial heap only tuples

On Tue, Mar 9, 2021 at 12:09 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 3/8/21, 10:16 AM, "Ibrar Ahmed" <ibrar.ahmad@gmail.com> wrote:

On Wed, Feb 24, 2021 at 3:22 AM Bossart, Nathan <bossartn@amazon.com> wrote:

On 2/10/21, 2:43 PM, "Bruce Momjian" <bruce@momjian.us> wrote:

I wonder if you should create a Postgres wiki page to document all of
this. I agree PG 15 makes sense. I would like to help with this if I
can. I will need to study this email more later.

I've started the wiki page for this:

https://wiki.postgresql.org/wiki/Partial_Heap_Only_Tuples

Nathan

The regression test case (partial-index) is failing

https://cirrus-ci.com/task/5310522716323840

This patch is intended as a proof-of-concept of some basic pieces of
the project. I'm working on a new patch set that should be more
suitable for community review.

The patch does not apply on HEAD anymore; could you rebase and post an
updated patch? I'm changing the status to "Waiting for Author".

Regards,
Vignesh

#16Daniel Gustafsson
daniel@yesql.se
In reply to: vignesh C (#15)
Re: partial heap only tuples

On 14 Jul 2021, at 13:34, vignesh C <vignesh21@gmail.com> wrote:

The patch does not apply on Head anymore, could you rebase and post a
patch. I'm changing the status to "Waiting for Author".

As no update has been posted, the patch still doesn't apply. I'm marking this
patch Returned with Feedback; feel free to open a new entry for an updated
patch.

--
Daniel Gustafsson https://vmware.com/

#17Bossart, Nathan
bossartn@amazon.com
In reply to: Daniel Gustafsson (#16)
Re: partial heap only tuples

On 11/4/21, 3:24 AM, "Daniel Gustafsson" <daniel@yesql.se> wrote:

As no update has been posted, the patch still doesn't apply. I'm marking this
patch Returned with Feedback, feel free to open a new entry for an updated
patch.

Thanks. I have been working on this intermittently, and I hope to
post a more complete proof-of-concept in the near future. I'll create
a new commitfest entry once that's done.

Nathan