Improve heapgetpage() performance, overhead from serializable

Started by Andres Freund over 2 years ago · 19 messages

#1 Andres Freund
andres@anarazel.de
1 attachment(s)

Hi,

Several loops which are important for query performance, like heapgetpage()'s
loop over all tuples, have to call functions like
HeapCheckForSerializableConflictOut() and PredicateLockTID() in every
iteration.

When serializable is not in use, all those functions do is return. But
being situated in a different translation unit, the compiler can't inline
(without LTO at least) the check whether serializability is needed. It's not
just the function call overhead that's noticeable; it's also that registers
have to be spilled to the stack / reloaded from memory etc.
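
Schematically (a minimal made-up sketch, not the actual PostgreSQL code):

/* predicate.c -- a different translation unit */
#include <stdbool.h>

bool        serializable_in_use = false;

void
CheckConflictOut(void)
{
    if (!serializable_in_use)
        return;                 /* cheap, but invisible to callers */
    /* ... the actual serializable bookkeeping ... */
}

/* scan.c */
extern void CheckConflictOut(void);

int
scan_page(const int *tuples, int ntuples)
{
    int         nvisible = 0;

    for (int i = 0; i < ntuples; i++)
    {
        /*
         * Without LTO the compiler must emit a real call here, spilling /
         * reloading registers around it, even though the callee nearly
         * always returns immediately.
         */
        CheckConflictOut();

        if (tuples[i] > 0)      /* stand-in for the visibility check */
            nvisible++;
    }
    return nvisible;
}

Hoisting the cheap "is this needed at all?" test out of the loop, into the
caller's translation unit, removes the per-tuple call entirely.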

On a freshly loaded pgbench scale 100, with turbo mode disabled, postgres
pinned to one core. Parallel workers disabled to reduce noise. All times are
the average of 15 executions with pgbench, in a newly started, but prewarmed
postgres.

SELECT * FROM pgbench_accounts OFFSET 10000000;
HEAD:
397.977

removing the HeapCheckForSerializableConflictOut() from heapgetpage()
(incorrect!), to establish the baseline of what serializable costs:
336.695

pulling out CheckForSerializableConflictOutNeeded() from
HeapCheckForSerializableConflictOut() in heapgetpage(), and avoiding calling
HeapCheckForSerializableConflictOut() in the loop:
339.742

moving the loop into a static inline function, marked as pg_always_inline,
called with static arguments for always_visible, check_serializable:
326.546

marking the always_visible, !check_serializable case likely():
322.249

removing TestForOldSnapshot() calls, which we pretty much already decided on:
312.987
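
(For reference, the likely() and pg_always_inline bits above are PostgreSQL's
wrappers around compiler builtins; roughly as follows, simplified from
src/include/c.h for GCC-like compilers, with the non-GCC fallbacks omitted:)

#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)

#define pg_attribute_always_inline __attribute__((always_inline)) inline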

FWIW, there's more we can do, with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is imo worth considering on its own.

Now, this just affects the sequential scan case. heap_hot_search_buffer()
shares many of the same pathologies. I find it a bit harder to improve,
because the compiler's code generation seems to switch between good / bad with
changes that seem unrelated...

I wonder why we haven't used PageIsAllVisible() in heap_hot_search_buffer() so
far?

Greetings,

Andres Freund

Attachments:

v1-0001-optimize-heapgetpage.patch (text/x-diff)
From 92ca2c15ccf2adb9b7f2da9cf86b571534287136 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 15 Jul 2023 15:13:30 -0700
Subject: [PATCH v1] optimize heapgetpage()

---
 src/backend/access/heap/heapam.c | 100 ++++++++++++++++++++++---------
 1 file changed, 72 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7ed72abe597..d662d2898a2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -366,6 +366,56 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 	scan->rs_numblocks = numBlks;
 }
 
+/*
+ * The loop for heapgetpage() in pagemode. Pulled out so it can be called
+ * multiple times, with constant arguments for all_visible,
+ * check_serializable.
+ */
+pg_attribute_always_inline
+static int
+heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot,
+					Page page, Buffer buffer,
+					BlockNumber block, int lines,
+					bool all_visible, bool check_serializable)
+{
+	int			ntup = 0;
+	OffsetNumber lineoff;
+
+	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	{
+		ItemId		lpp = PageGetItemId(page, lineoff);
+		bool		valid;
+		HeapTupleData loctup;
+
+		if (!ItemIdIsNormal(lpp))
+			continue;
+
+		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		loctup.t_len = ItemIdGetLength(lpp);
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		ItemPointerSet(&(loctup.t_self), block, lineoff);
+
+		if (all_visible)
+			valid = true;
+		else
+			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+
+		if (check_serializable)
+			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+												&loctup, buffer, snapshot);
+
+		if (valid)
+		{
+			scan->rs_vistuples[ntup] = lineoff;
+			ntup++;
+		}
+	}
+
+	Assert(ntup <= MaxHeapTuplesPerPage);
+
+	return ntup;
+}
+
 /*
  * heapgetpage - subroutine for heapgettup()
  *
@@ -381,9 +431,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
-	int			ntup;
-	OffsetNumber lineoff;
 	bool		all_visible;
+	bool		check_serializable;
 
 	Assert(block < scan->rs_nblocks);
 
@@ -427,7 +476,6 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	page = BufferGetPage(buffer);
 	TestForOldSnapshot(snapshot, scan->rs_base.rs_rd, page);
 	lines = PageGetMaxOffsetNumber(page);
-	ntup = 0;
 
 	/*
 	 * If the all-visible flag indicates that all tuples on the page are
@@ -450,37 +498,33 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 * tuple for visibility the hard way.
 	 */
 	all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
+	check_serializable =
+		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
-	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	/*
+	 * We call heapgetpage_collect() with constant arguments, for faster code,
+	 * without having to duplicate too much logic.
+	 */
+	if (likely(all_visible))
 	{
-		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTupleData loctup;
-		bool		valid;
-
-		if (!ItemIdIsNormal(lpp))
-			continue;
-
-		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
-		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-		loctup.t_len = ItemIdGetLength(lpp);
-		ItemPointerSet(&(loctup.t_self), block, lineoff);
-
-		if (all_visible)
-			valid = true;
+		if (likely(!check_serializable))
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, 1, 0);
 		else
-			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
-
-		HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
-											&loctup, buffer, snapshot);
-
-		if (valid)
-			scan->rs_vistuples[ntup++] = lineoff;
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, 1, 1);
+	}
+	else
+	{
+		if (likely(!check_serializable))
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, 0, 0);
+		else
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, 0, 1);
 	}
 
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
-
-	Assert(ntup <= MaxHeapTuplesPerPage);
-	scan->rs_ntuples = ntup;
 }
 
 /*
-- 
2.38.0

#2 Zhang Mingli
zmlpostgres@gmail.com
In reply to: Andres Freund (#1)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,


LGTM, and I have a naive question:

    if (likely(all_visible))
    {
        if (likely(!check_serializable))
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 1, 0);
        else
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 1, 1);
    }
    else
    {
        if (likely(!check_serializable))
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 0, 0);
        else
            scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                                   block, lines, 0, 1);
    }
Does it make sense to combine the if/else branches and pass the conditions
directly as the inline function's parameters?

Like:

    scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                           block, lines, all_visible,
                                           check_serializable);

Regards,
Zhang Mingli

#3 Andres Freund
andres@anarazel.de
In reply to: Zhang Mingli (#2)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:

Does it make sense to combine the if/else branches and pass the conditions
directly as the inline function's parameters?

Like:

    scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                           block, lines, all_visible,
                                           check_serializable);

I think that makes it less likely that the compiler actually generates a
constant-folded version for each of the branches. Perhaps worth some
experimentation.
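
As a standalone illustration (collect() and caller() are made-up names, not
the patch itself):

#include <stdbool.h>

static inline __attribute__((always_inline)) int
collect(const int *v, int n, bool all_visible)
{
    int         ntup = 0;

    for (int i = 0; i < n; i++)
    {
        /* with a literal argument, this test constant-folds away */
        if (all_visible || v[i] > 0)
            ntup++;
    }
    return ntup;
}

int
caller(const int *v, int n, bool all_visible)
{
    /*
     * Separate calls with literal arguments let the compiler emit two
     * specialized loop bodies, one without the per-element test. A single
     * "return collect(v, n, all_visible);" typically inlines one body
     * that still branches on all_visible in every iteration, unless the
     * optimizer happens to unswitch the loop.
     */
    if (all_visible)
        return collect(v, n, true);
    else
        return collect(v, n, false);
}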

Greetings,

Andres Freund

#4 Muhammad Malik
muhammad.malik1@hotmail.com
In reply to: Andres Freund (#1)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

Is there a plan to merge this patch in PG16?

Thanks,
Muhammad


#5 tender wang
tndrwang@gmail.com
In reply to: Muhammad Malik (#4)
Re: Improve heapgetpage() performance, overhead from serializable

This thread [1] also improves the heapgetpage() function, and looks
similar to this thread.

[1]: /messages/by-id/a9f40066-3d25-a240-4229-ec2fbe94e7a5@yeah.net


#6 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#3)
Re: Improve heapgetpage() performance, overhead from serializable

On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, there's more we can do, with some hacky changes I got the time down to
273.261, but the tradeoffs start to be a bit more complicated. And 397->320ms
for something as core as this is imo worth considering on its own.

Nice!

On 2023-07-17 09:55:07 +0800, Zhang Mingli wrote:

Does it make sense to combine the if/else branches and pass the conditions
directly as the inline function's parameters?

Like:

    scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
                                           block, lines, all_visible,
                                           check_serializable);

I think that makes it less likely that the compiler actually generates a
constant-folded version for each of the branches. Perhaps worth some
experimentation.

Combining this way doesn't do so for me.

Minor style nit:

+    scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+                                           block, lines, 0, 1);

I believe we prefer true/false rather than numbers.

--
John Naylor
EDB: http://www.enterprisedb.com

#7 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#6)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

On 2023-09-05 14:42:57 +0700, John Naylor wrote:

I think that makes it less likely that the compiler actually generates a
constant-folded version for each of the branches. Perhaps worth some
experimentation.

Combining this way doesn't do so for me.

Are you saying that the desired constant folding happened after combining the
branches, or that it didn't happen?

Greetings,

Andres Freund

#8 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#7)
Re: Improve heapgetpage() performance, overhead from serializable

On Wed, Sep 6, 2023 at 1:38 AM Andres Freund <andres@anarazel.de> wrote:

I think that makes it less likely that the compiler actually generates a
constant-folded version for each of the branches. Perhaps worth some
experimentation.

Combining this way doesn't do so for me.

Are you saying that the desired constant folding happened after combining the
branches, or that it didn't happen?

Constant folding did not happen.

--
John Naylor
EDB: http://www.enterprisedb.com

#9 John Naylor
john.naylor@enterprisedb.com
In reply to: tender wang (#5)
Re: Improve heapgetpage() performance, overhead from serializable

On Fri, Sep 1, 2023 at 1:08 PM tender wang <tndrwang@gmail.com> wrote:

This thread [1] also improves the heapgetpage() function, and looks
similar to this thread.

[1]: /messages/by-id/a9f40066-3d25-a240-4229-ec2fbe94e7a5@yeah.net

Please don't top-post.

For the archives: That CF entry has been withdrawn, after the author looked
at this one and did some testing.

--
John Naylor
EDB: http://www.enterprisedb.com

#10 John Naylor
johncnaylorls@gmail.com
In reply to: John Naylor (#6)
Re: Improve heapgetpage() performance, overhead from serializable

On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:

And 397->320ms
for something as core as this is imo worth considering on its own.

Hi Andres, this interesting work seems to have fallen off the radar --
are you still planning to move forward with this for v17?

#11 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#10)
1 attachment(s)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

On 2024-01-22 13:01:31 +0700, John Naylor wrote:

On Mon, Jul 17, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:

And 397->320ms
for something as core as this, is imo worth considering on its own.

Hi Andres, this interesting work seems to have fallen off the radar --
are you still planning to move forward with this for v17?

I had completely forgotten about this patch, but some discussion around
streaming read reminded me of it. Here's a rebased version, with conflicts
resolved, very light comment polish, and a commit message. Given that there
have been no changes otherwise in the last months, I'm inclined to push in a
few hours.

Greetings,

Andres Freund

Attachments:

v2-0001-Reduce-branches-in-heapgetpage-s-per-tuple-loop.patch (text/x-diff)
From f2158455ac1a10eb393e25f7ddf87433a98e2ab0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 6 Apr 2024 20:51:07 -0700
Subject: [PATCH v2] Reduce branches in heapgetpage()'s per-tuple loop

Until now, heapgetpage()'s loop over all tuples performed some conditional
checks for each tuple, even though the conditions did not change across the
loop.

This commit fixes that by moving the loop into an inline function. By calling
it with different constant arguments, the compiler can generate an optimized
loop for the different conditions, at the price of two per-page checks.

For cases of all-visible tables and an isolation level other than
serializable, speedups of up to 25% have been measured.

Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/20230716015656.xjvemfbp5fysjiea@awork3.anarazel.de
---
 src/backend/access/heap/heapam.c | 102 ++++++++++++++++++++++---------
 1 file changed, 74 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e3..cbbc87dec9a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -364,6 +364,56 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 	scan->rs_numblocks = numBlks;
 }
 
+/*
+ * Per-tuple loop for heapgetpage() in pagemode. Pulled out so it can be
+ * called multiple times, with constant arguments for all_visible,
+ * check_serializable.
+ */
+pg_attribute_always_inline
+static int
+heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot,
+					Page page, Buffer buffer,
+					BlockNumber block, int lines,
+					bool all_visible, bool check_serializable)
+{
+	int			ntup = 0;
+	OffsetNumber lineoff;
+
+	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	{
+		ItemId		lpp = PageGetItemId(page, lineoff);
+		bool		valid;
+		HeapTupleData loctup;
+
+		if (!ItemIdIsNormal(lpp))
+			continue;
+
+		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		loctup.t_len = ItemIdGetLength(lpp);
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		ItemPointerSet(&(loctup.t_self), block, lineoff);
+
+		if (all_visible)
+			valid = true;
+		else
+			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+
+		if (check_serializable)
+			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+												&loctup, buffer, snapshot);
+
+		if (valid)
+		{
+			scan->rs_vistuples[ntup] = lineoff;
+			ntup++;
+		}
+	}
+
+	Assert(ntup <= MaxHeapTuplesPerPage);
+
+	return ntup;
+}
+
 /*
  * heap_prepare_pagescan - Prepare current scan page to be scanned in pagemode
  *
@@ -379,9 +429,8 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
-	int			ntup;
-	OffsetNumber lineoff;
 	bool		all_visible;
+	bool		check_serializable;
 
 	Assert(BufferGetBlockNumber(buffer) == block);
 
@@ -403,7 +452,6 @@ heap_prepare_pagescan(TableScanDesc sscan)
 
 	page = BufferGetPage(buffer);
 	lines = PageGetMaxOffsetNumber(page);
-	ntup = 0;
 
 	/*
 	 * If the all-visible flag indicates that all tuples on the page are
@@ -426,37 +474,35 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	 * tuple for visibility the hard way.
 	 */
 	all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
+	check_serializable =
+		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
-	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	/*
+	 * We call heapgetpage_collect() with constant arguments, for faster code,
+	 * without having to duplicate too much logic. The separate calls with
+	 * constant arguments are needed on several compilers to actually perform
+	 * constant folding.
+	 */
+	if (likely(all_visible))
 	{
-		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTupleData loctup;
-		bool		valid;
-
-		if (!ItemIdIsNormal(lpp))
-			continue;
-
-		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
-		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-		loctup.t_len = ItemIdGetLength(lpp);
-		ItemPointerSet(&(loctup.t_self), block, lineoff);
-
-		if (all_visible)
-			valid = true;
+		if (likely(!check_serializable))
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, true, false);
 		else
-			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
-
-		HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
-											&loctup, buffer, snapshot);
-
-		if (valid)
-			scan->rs_vistuples[ntup++] = lineoff;
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, true, true);
+	}
+	else
+	{
+		if (likely(!check_serializable))
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, false, false);
+		else
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+												   block, lines, false, true);
 	}
 
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
-
-	Assert(ntup <= MaxHeapTuplesPerPage);
-	scan->rs_ntuples = ntup;
 }
 
 /*
-- 
2.44.0.279.g3bd955d269

#12 John Naylor
johncnaylorls@gmail.com
In reply to: Andres Freund (#11)
Re: Improve heapgetpage() performance, overhead from serializable

On Sun, Apr 7, 2024 at 11:49 AM Andres Freund <andres@anarazel.de> wrote:

I had completely forgotten about this patch, but some discussion around
streaming read reminded me of it. Here's a rebased version, with conflicts
resolved, very light comment polish, and a commit message. Given that there
have been no changes otherwise in the last months, I'm inclined to push in a
few hours.

Just in time ;-) The commit message should also have "reviewed by
Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
thread for a withdrawn CF entry with a similar idea but covering fewer
cases:

/messages/by-id/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce@yeah.net

#13 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#12)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

On 2024-04-07 12:07:22 +0700, John Naylor wrote:

Just in time ;-) The commit message should also have "reviewed by
Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
thread for a withdrawn CF entry with a similar idea but covering fewer
cases:

Good call. Added and pushed.

Thanks,

Andres

#14 David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#13)
1 attachment(s)
Re: Improve heapgetpage() performance, overhead from serializable

On Sun, 7 Apr 2024 at 19:30, Andres Freund <andres@anarazel.de> wrote:

Good call. Added and pushed.

I understand you're already aware of the reference in the comment to
heapgetpage(), which no longer exists as of 44086b097.

Melanie and I had discussed the heap_prepare_pagescan() name while I
was reviewing that recent refactor. Aside from fixing the comment, how
about also renaming heapgetpage_collect() to
heap_prepare_pagescan_tuples()?

Patch attached for reference. Not looking for any credit.

I'm also happy to revisit the heap_prepare_pagescan() name and call
heapgetpage_collect() some appropriate derivative of whatever we'd
rename that to.

Copied Melanie as she may want to chime in too.

David

Attachments:

rename_heapgetpage_collect.patch (text/plain)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2663f52d1a..d3c2a60985 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -434,16 +434,15 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * Per-tuple loop for heapgetpage() in pagemode. Pulled out so it can be
- * called multiple times, with constant arguments for all_visible,
+ * Per-tuple loop for heap_prepare_pagescan().  Pulled out so it can be called
+ * multiple times, with constant arguments for all_visible,
  * check_serializable.
  */
 pg_attribute_always_inline
 static int
-heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot,
-					Page page, Buffer buffer,
-					BlockNumber block, int lines,
-					bool all_visible, bool check_serializable)
+heap_prepare_pagescan_tuples(HeapScanDesc scan, Snapshot snapshot, Page page,
+							 Buffer buffer, BlockNumber block, int lines,
+							 bool all_visible, bool check_serializable)
 {
 	int			ntup = 0;
 	OffsetNumber lineoff;
@@ -547,28 +546,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
 	/*
-	 * We call heapgetpage_collect() with constant arguments, to get the
-	 * compiler to constant fold the constant arguments. Separate calls with
-	 * constant arguments, rather than variables, are needed on several
+	 * We call heap_prepare_pagescan_tuples() with constant arguments, to get
+	 * the compiler to constant fold the constant arguments. Separate calls
+	 * with constant arguments, rather than variables, are needed on several
 	 * compilers to actually perform constant folding.
 	 */
 	if (likely(all_visible))
 	{
 		if (likely(!check_serializable))
-			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
-												   block, lines, true, false);
+			scan->rs_ntuples = heap_prepare_pagescan_tuples(scan, snapshot,
+															page, buffer,
+															block, lines,
+															true, false);
 		else
-			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
-												   block, lines, true, true);
+			scan->rs_ntuples = heap_prepare_pagescan_tuples(scan, snapshot,
+															page, buffer,
+															block, lines,
+															true, true);
 	}
 	else
 	{
 		if (likely(!check_serializable))
-			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
-												   block, lines, false, false);
+			scan->rs_ntuples = heap_prepare_pagescan_tuples(scan, snapshot,
+															page, buffer,
+															block, lines,
+															false, false);
 		else
-			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
-												   block, lines, false, true);
+			scan->rs_ntuples = heap_prepare_pagescan_tuples(scan, snapshot,
+															page, buffer,
+															block, lines,
+															false, true);
 	}
 
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

#15 Andres Freund
andres@anarazel.de
In reply to: David Rowley (#14)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

On 2024-04-08 14:43:21 +1200, David Rowley wrote:

On Sun, 7 Apr 2024 at 19:30, Andres Freund <andres@anarazel.de> wrote:

Good call. Added and pushed.

I understand you're already aware of the reference in the comment to
heapgetpage(), which no longer exists as of 44086b097.

Yea, /messages/by-id/20240407172615.cocrsvboqm3ttqe4@awork3.anarazel.de

Melanie and I had discussed the heap_prepare_pagescan() name while I
was reviewing that recent refactor. Aside from fixing the comment, how
about also renaming heapgetpage_collect() to
heap_prepare_pagescan_tuples()?

Patch attached for reference. Not looking for any credit.

I'm also happy to revisit the heap_prepare_pagescan() name and call
heapgetpage_collect() some appropriate derivative of whatever we'd
rename that to.

I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
don't have a great alternative suggestion.

Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
separate callsites (making long names annoying) and the fact that it's really
specific to one caller, I'm somewhat inclined to just go with
collect_visible_tuples() or page_collect_visible(); I think I prefer the
latter.

Greetings,

Andres Freund

#16 David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#15)
Re: Improve heapgetpage() performance, overhead from serializable

On Mon, 8 Apr 2024 at 15:13, Andres Freund <andres@anarazel.de> wrote:

I kinda don't like heap_prepare_pagescan(), but heapgetpage() is worse. And I
don't have a great alternative suggestion.

It came around from having nothing better. I was keen not to have the
name indicate it was only for checking visibility, as we're also
checking for serialization conflicts and pruning the page. The word
"prepare" made it there as it seemed generic enough not to falsely
indicate it was only checking visibility. Also, it seemed good to
keep it generic: if we one day end up with something new that needs
to be done before scanning a page in page mode, then that new code is
more likely to be put in the function with a generic name rather than
a function that appears to be named for some other unrelated task. It
would be nice not to end up with two functions to call before scanning
a page in page mode.

Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
separate callsites (making long names annoying) and the fact that it's really
specific to one caller, I'm somewhat inclined to just go with
collect_visible_tuples() or page_collect_visible(), I think I prefer the
latter.

I understand wanting to avoid the long name. I'd rather stay clear of
"visible", but don't feel as strongly about this as it's static.

David

#17 Andres Freund
andres@anarazel.de
In reply to: David Rowley (#16)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

On 2024-04-08 15:43:12 +1200, David Rowley wrote:

On Mon, 8 Apr 2024 at 15:13, Andres Freund <andres@anarazel.de> wrote:

Off-list Melanie suggested heap_page_collect_visible_tuples(). Given the
separate callsites (making long names annoying) and the fact that it's really
specific to one caller, I'm somewhat inclined to just go with
collect_visible_tuples() or page_collect_visible(), I think I prefer the
latter.

I understand wanting to avoid the long name. I'd rather stay clear of
"visible", but don't feel as strongly about this as it's static.

I think visible would be ok; the serialization checks are IMO about
visibility too. But if you'd prefer, I'd also be ok with something like
page_collect_tuples()?

Greetings,

Andres Freund

#18 David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#17)
Re: Improve heapgetpage() performance, overhead from serializable

On Mon, 8 Apr 2024 at 16:08, Andres Freund <andres@anarazel.de> wrote:

On 2024-04-08 15:43:12 +1200, David Rowley wrote:

I understand wanting to avoid the long name. I'd rather stay clear of
"visible", but don't feel as strongly about this as it's static.

I think visible would be ok; the serialization checks are IMO about
visibility too. But if you'd prefer, I'd also be ok with something like
page_collect_tuples()?

That's ok for me.

David

#19 Andres Freund
andres@anarazel.de
In reply to: David Rowley (#18)
Re: Improve heapgetpage() performance, overhead from serializable

Hi,

On 2024-04-08 16:18:21 +1200, David Rowley wrote:

On Mon, 8 Apr 2024 at 16:08, Andres Freund <andres@anarazel.de> wrote:

I think visible would be ok, the serialization checks are IMO about
visibility too. But if you'd prefer I'd also be ok with something like
page_collect_tuples()?

That's ok for me.

Cool, pushed that way.

Greetings,

Andres Freund