Prefetch the next tuple's memory during seqscans

Started by David Rowley · about 3 years ago · 30 messages
#1 David Rowley
dgrowleyml@gmail.com
2 attachment(s)

As part of the AIO work [1], Andres mentioned to me that he found that
prefetching tuple memory during hot pruning showed significant wins.
I'm not proposing anything to improve HOT pruning here, but as a segue
to get the prefetching infrastructure in so that there are fewer AIO
patches, I'm proposing we prefetch the next tuple during sequential
scans while in page mode.
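
For reference, the heapam.c side of 0003 boils down to asking the CPU
for the next visible tuple's first cache line as soon as we've set up
the current tuple in heapgettup_pagemode() (excerpt from the attached
patch):

    /* start loading the next tuple's memory before we need it */
    if (!backward && linesleft > 1)
    {
        lineoff = scan->rs_vistuples[lineindex + 1];
        lpp = PageGetItemId(dp, lineoff);
        pg_prefetch_mem(PageGetItem((Page) dp, lpp));
    }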

It turns out the gains are pretty good when we apply this:

-- table with 4 bytes of user columns
create table t as select a from generate_series(1,10000000)a;
vacuum freeze t;
select pg_prewarm('t');

Master @ a9f8ca600
# select * from t where a = 0;
Time: 355.001 ms
Time: 354.573 ms
Time: 354.490 ms
Time: 354.556 ms
Time: 354.335 ms

Master + 0001 + 0003:
# select * from t where a = 0;
Time: 328.578 ms
Time: 329.387 ms
Time: 329.349 ms
Time: 329.704 ms
Time: 328.225 ms (avg ~7.7% faster)

-- table with 64 bytes of user columns
create table t2 as
select a,a a2,a a3,a a4,a a5,a a6,a a7,a a8,a a9,a a10,a a11,a a12,a
a13,a a14,a a15,a a16
from generate_series(1,10000000)a;
vacuum freeze t2;
select pg_prewarm('t2');

Master:
# select * from t2 where a = 0;
Time: 501.725 ms
Time: 501.815 ms
Time: 503.225 ms
Time: 501.242 ms
Time: 502.394 ms

Master + 0001 + 0003:
# select * from t2 where a = 0;
Time: 412.076 ms
Time: 410.669 ms
Time: 410.490 ms
Time: 409.782 ms
Time: 410.843 ms (avg ~22% faster)

This was tested on an AMD 3990x CPU. I imagine the CPU matters quite a
bit here. It would be interesting to see if the same or similar gains
can be seen on some modern intel chip too.

I believe Thomas wrote the 0001 patch (same as patch in [2]?). I only
quickly put together the 0003 patch.

I wondered if we might want to add a macro to 0001 that indicates
whether pg_prefetch_mem() expands to nothing, and then use that to
#ifdef out the code I added to heapam.c. Although, perhaps most
compilers will be able to optimise away the extra instructions that
figure out the address of the next tuple.
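
Something like this, just for illustration (pg_prefetch_mem_available
is an invented name, not part of the attached patches):

/*
 * Hypothetical companion to pg_prefetch_mem(): advertise whether the macro
 * expands to a real prefetch instruction, so callers can #ifdef out the
 * surrounding address calculations.
 */
#if defined(HAVE__BUILTIN_PREFETCH) || defined(_MSC_VER)
#define pg_prefetch_mem_available 1
#else
#define pg_prefetch_mem_available 0
#endif

The block I added to heapam.c could then be wrapped in
#if pg_prefetch_mem_available ... #endif.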

My tests above are likely the best case for this. It seems plausible
to me that if there was a much more complex plan that found a
reasonable number of tuples and did something with them, we wouldn't
see the same sort of gains. It also does not seem impossible that the
prefetch just results in evicting some useful-to-some-other-exec-node
cache line, or that the prefetched tuple gets flushed out of the cache
by the time we get around to fetching the next tuple from the scan
again, due to various other node processing that's occurred since the
seq scan was last called. I imagine such things would be
indistinguishable from noise, but I've not tested.

I also tried prefetching out by 2 tuples. It didn't help any further
than prefetching 1 tuple.

I'll add this to the November CF.

David

[1]: /messages/by-id/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
[2]: /messages/by-id/CA+hUKG+pi63ZbcZkYK3XB1pfN=kuaDaeV0Ha9E+X_p6TTbKBYw@mail.gmail.com

Attachments:

0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch (text/plain)
From 2fd10f1266550f26f4395de080bcdcf89b6859b6 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 19 Oct 2022 08:54:01 +1300
Subject: [PATCH 1/3] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 000b075312..582a47501c 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 3966368b8d..c4685b8a1e 100755
--- a/configure
+++ b/configure
@@ -15988,6 +15988,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index f76b7ee31f..2d4938d43d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1802,6 +1802,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index bfacbdc0af..9c35637826 100644
--- a/meson.build
+++ b/meson.build
@@ -1587,10 +1587,11 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
-  'ctz',
   'constant_p',
+  'ctz',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index d70ed84ac5..26a1586dc3 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -361,6 +361,14 @@ typedef void (*pg_funcptr_t) (void);
  */
 #define FLEXIBLE_ARRAY_MEMBER	/* empty */
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c5a80b829e..07a661e288 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -559,6 +559,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index c2acb58df0..95de91890e 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -227,6 +227,7 @@ sub GenerateFiles
 		HAVE_BACKTRACE_SYMBOLS     => undef,
 		HAVE_BIO_GET_DATA          => undef,
 		HAVE_BIO_METH_NEW          => undef,
+		HAVE__BUILTIN_PREFETCH     => undef,
 		HAVE_COMPUTED_GOTO         => undef,
 		HAVE_COPYFILE              => undef,
 		HAVE_COPYFILE_H            => undef,
-- 
2.34.1

0003-Prefetch-tuple-memory-during-forward-seqscans.patch (text/plain)
From 8459bc4bcdf0403f8c9513dd4d1fed0840acafc1 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 31 Oct 2022 10:05:12 +1300
Subject: [PATCH 3/3] Prefetch tuple memory during forward seqscans

---
 src/backend/access/heap/heapam.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 12be87efed..e8f1fc2d71 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1025,6 +1025,17 @@ heapgettup_pagemode(HeapScanDesc scan,
 			tuple->t_len = ItemIdGetLength(lpp);
 			ItemPointerSet(&(tuple->t_self), page, lineoff);
 
+			/*
+			 * Prefetching the memory for the next tuple has shown to improve
+			 * performance on certain hardware.
+			 */
+			if (!backward && linesleft > 1)
+			{
+				lineoff = scan->rs_vistuples[lineindex + 1];
+				lpp = PageGetItemId(dp, lineoff);
+				pg_prefetch_mem(PageGetItem((Page) dp, lpp));
+			}
+
 			/*
 			 * if current tuple qualifies, return it.
 			 */
-- 
2.34.1

#2 Aleksander Alekseev
aleksander@timescale.com
In reply to: David Rowley (#1)
Re: Prefetch the next tuple's memory during seqscans

Hi David,

I'll add this to the November CF.

Thanks for the patch.

I wonder if we can be sure and/or check that there is no performance
degradation under different loads and different platforms...

Also I see 0001 and 0003 but no 0002. Just wanted to double check that
there is no patch missing.

--
Best regards,
Aleksander Alekseev

#3 David Rowley
dgrowleyml@gmail.com
In reply to: Aleksander Alekseev (#2)
1 attachment(s)
Re: Prefetch the next tuple's memory during seqscans

On Tue, 1 Nov 2022 at 03:12, Aleksander Alekseev
<aleksander@timescale.com> wrote:

I wonder if we can be sure and/or check that there is no performance
degradation under different loads and different platforms...

Different platforms would be good. Certainly, 1 platform isn't a good
enough indication that this is going to be useful.

As for different loads, I imagine the worst case for this will be that
the prefetched tuple is flushed from the cache by some other operation
in the plan, making the prefetch useless.

I tried the following so that we read 1 million tuples from a Sort
node before coming back and reading another tuple from the seqscan.

create table a as select 1 as a from generate_series(1,2) a;
create table b as select 1 as a from generate_series(1,10000000) a;
vacuum freeze a,b;

select pg_prewarm('a'),pg_prewarm('b');
set work_mem = '256MB';
select * from a, lateral (select * from b order by a) b offset 20000000;

Master (@ a9f8ca600)
Time: 1414.590 ms (00:01.415)
Time: 1373.584 ms (00:01.374)
Time: 1373.057 ms (00:01.373)
Time: 1383.033 ms (00:01.383)
Time: 1378.865 ms (00:01.379)

Master + 0001 + 0003:
Time: 1352.726 ms (00:01.353)
Time: 1348.306 ms (00:01.348)
Time: 1358.033 ms (00:01.358)
Time: 1354.348 ms (00:01.354)
Time: 1353.971 ms (00:01.354)

As I'd have expected, I see no regression. It's hard to imagine we'd
be able to measure any regression over the overhead of an operation
that evicts everything out of the cache.

FWIW, this CPU has a 256MB L3 cache and the Sort node's EXPLAIN
ANALYZE looks like:

Sort Method: quicksort Memory: 262144kB

Also I see 0001 and 0003 but no 0002. Just wanted to double check that
there is no patch missing.

Perhaps I should resequence the patches to avoid confusion. I
deliberately didn't send 0002; it's Andres' patch to prefetch during
HOT pruning. Here I'm only interested in seeing if we can get the
pg_prefetch_mem macro into core to reduce the number of AIO patches by
1.

Another thing about this is that I'm really only fetching the first
cache line of the tuple. All columns in the t2 table (from the
earlier email) are fixed width, so the a16 column is accessed via a
cached offset. I ran a benchmark using the same t2 table as my earlier
email, i.e.:

-- table with 64 bytes of user columns
create table t2 as
select a,a a2,a a3,a a4,a a5,a a6,a a7,a a8,a a9,a a10,a a11,a a12,a
a13,a a14,a a15,a a16
from generate_series(1,10000000)a;
vacuum freeze t2;

My test is to run 16 queries, changing the WHERE clause each time to
have WHERE a = 0, then WHERE a2 = 0 ... WHERE a16 = 0. I wanted to
know if prefetching only the first cache line of the tuple would be
less useful when we require evaluation of, say, the "a16" column vs
the "a" column.

The times below (in milliseconds) are what I got from a 10-second pgbench run:

column master patched
a 490.571 409.748
a2 428.004 430.927
a3 449.156 453.858
a4 474.945 479.73
a5 514.646 507.809
a6 517.525 519.956
a7 543.587 539.023
a8 562.718 559.387
a9 585.458 584.63
a10 609.143 604.606
a11 645.273 638.535
a12 658.848 657.377
a13 696.395 685.389
a14 702.779 716.722
a15 727.161 723.567
a16 756.186 749.396

I'm not sure how to explain why only the "a" column seems to improve
and the rest seem mostly unaffected.

David

Attachments:

bench_prefetch.sh.txt (text/plain)
#4 Andy Fan
zhihui.fan1213@gmail.com
In reply to: David Rowley (#3)
Re: Prefetch the next tuple's memory during seqscans

Hi:

Different platforms would be good. Certainly, 1 platform isn't a good
enough indication that this is going to be useful.

I happen to have a different platform at hand. Here is my test with an
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.
shared_buffers has been set big enough to hold all the data.

columns Master Patched Improvement (%)
a 310.931 289.251 6.972608071
a2 329.577 299.975 8.981816085
a3 336.887 313.502 6.941496704
a4 352.099 325.345 7.598431123
a5 358.582 336.486 6.162049406
a6 375.004 349.12 6.902326375
a7 379.699 362.998 4.398484062
a8 391.911 371.41 5.231034597
a9 404.3 383.779 5.075686372
a10 425.48 396.114 6.901852026
a11 449.944 431.826 4.026723326
a12 461.876 443.579 3.961452857
a13 470.59 460.237 2.20000425
a14 483.332 467.078 3.362905829
a15 490.798 472.262 3.776706507
a16 503.321 484.322 3.774728255

In theory, why does the prefetch make things better? I am asking this
because I think we need to read the data from the buffer into the cache
once in either case (I'm obviously wrong, given the test results).

Another simple point is the below styles are same. But the format 3 looks
clearer than others for me. It can tell code reader more stuffs. just fyi.

pg_prefetch_mem(PageGetItem((Page) dp, lpp));
pg_prefetch_mem(tuple->t_data);
pg_prefetch_mem(scan->rs_ctup.t_data);

--
Best Regards
Andy Fan

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Andy Fan (#4)
Re: Prefetch the next tuple's memory during seqscans

On Wed, Nov 2, 2022 at 12:09 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

In theory, why does the prefetch make things better? I am asking this
because I think we need to read the data from the buffer into the cache
once in either case (I'm obviously wrong, given the test results).

CPUs have several different kinds of 'hardware prefetchers' (worth
reading about), that look out for sequential and striding patterns and
try to get the cache line ready before you access it. Using the
prefetch instructions explicitly is called 'software prefetching'
(special instructions inserted by programmers or compilers). The
theory here would have to be that the hardware prefetchers couldn't
pick up the pattern, but we know how to do it. The exact details of
the hardware prefetchers vary between chips, and there are even some
parameters you can adjust in BIOS settings. One idea is that the
hardware prefetchers are generally biased towards increasing
addresses, but our tuples tend to go backwards on the page [1]. It's
possible that some other CPUs can detect backwards strides better, but
since real world tuples aren't of equal size anyway, there isn't
really a fixed stride at all, so software prefetching seems quite
promising for this...
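
To make that concrete, here's a contrived standalone sketch (not from
any of the patches; all names invented) of the pattern being described:
an offset array is scanned forwards while the variable-length items it
points at sit at decreasing, irregularly spaced addresses, so there's
no fixed stride for the hardware to latch onto, but software knows
exactly which cache line it will want next:

/*
 * Toy model: the items were written from the end of buf towards the start,
 * so walking offsets[0..nitems-1] touches memory at decreasing addresses.
 * The explicit prefetch requests the next item's first cache line while
 * we're still summing the current one.
 */
long
sum_items(const char *buf, const unsigned short *offsets,
          const unsigned short *lengths, int nitems)
{
    long    total = 0;

    for (int i = 0; i < nitems; i++)
    {
#ifdef __GNUC__
        if (i + 1 < nitems)
            __builtin_prefetch(buf + offsets[i + 1]);
#endif
        for (int j = 0; j < lengths[i]; j++)
            total += buf[offsets[i] + j];
    }
    return total;
}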

[1]: https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-PAGE-LAYOUT-FIGURE

#6 David Rowley
dgrowleyml@gmail.com
In reply to: Andy Fan (#4)
Re: Prefetch the next tuple's memory during seqscans

On Wed, 2 Nov 2022 at 00:09, Andy Fan <zhihui.fan1213@gmail.com> wrote:

I happen to have a different platform at hand. Here is my test with an
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.
shared_buffers has been set big enough to hold all the data.

Many thanks for testing that. Those numbers look much better than the
ones I got from my AMD machine.

In theory, why does the prefetch make things better? I am asking this
because I think we need to read the data from the buffer into the cache
once in either case (I'm obviously wrong, given the test results).

That's a good question. I didn't really explain that in my email.

There's quite a bit of information in [1]. I might not do the
explanation justice, but my basic understanding is that many modern
CPU architectures are ok at "Sequential Prefetching" of cache lines
from main memory when the direction is forward, but I believe that
they're not very good at detecting access patterns that are scanning
memory addresses in a backwards direction.

Because of our page layout, we have the page header followed by item
pointers at the start of the page. These item pointers are fixed width
and point to the tuples, which are variable width. Tuples are written
starting at the end of the page. The page is full when the tuples
would overlap with the item pointers. See the diagrams in [2].

We do our best to keep those tuples in reverse order of the item
pointer array. This means when we're performing a forward sequential
scan, we're (generally) reading tuples starting at the end of the page
and working backwards. Since the CPU is not very good at noticing
this and prefetching the preceding cache line, we can make things go
faster (seemingly) by issuing a manual prefetch operation by way of
pg_prefetch_mem().

The key here is that accessing RAM is far slower than accessing CPU
caches. Modern CPUs can perform multiple operations in parallel and
these can be rearranged by the CPU so they're not in the same order as
the instructions are written in the program. It's possible that
high-latency operations such as accessing RAM could hold up other
operations which depend on the value of what's waiting to come in from
RAM. If the CPU is held up like this, it's called a pipeline stall
[3]. The prefetching is aimed at reducing the time the CPU spends
stalled waiting for memory access.

David

[1]: https://en.wikipedia.org/wiki/Cache_prefetching
[2]: https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/speeding-up-recovery-and-vacuum-in-postgres-14/ba-p/2234071
[3]: https://en.wikipedia.org/wiki/Pipeline_stall

#7 Andres Freund
andres@anarazel.de
In reply to: David Rowley (#1)
Re: Prefetch the next tuple's memory during seqscans

Hi,

On 2022-10-31 16:52:52 +1300, David Rowley wrote:

As part of the AIO work [1], Andres mentioned to me that he found that
prefetching tuple memory during hot pruning showed significant wins.
I'm not proposing anything to improve HOT pruning here

I did try and reproduce my old results, and it does look like we already get
most of the gains from prefetching via 18b87b201f7. I see gains from
prefetching before that patch, but see it hurt after. If I reverse the
iteration order from 18b87b201f7 prefetching helps again.

but as a segue to get the prefetching infrastructure in so that there are
fewer AIO patches, I'm proposing we prefetch the next tuple during
sequential scans while in page mode.

Time: 328.225 ms (avg ~7.7% faster)
...
Time: 410.843 ms (avg ~22% faster)

That's a pretty impressive result.

I suspect that prefetching in heapgetpage() would provide gains as well, at
least for pages that aren't marked all-visible, pretty common in the real
world IME.

Greetings,

Andres Freund

#8 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#7)
1 attachment(s)
Re: Prefetch the next tuple's memory during seqscans

Hi,

On 2022-11-01 20:00:43 -0700, Andres Freund wrote:

I suspect that prefetching in heapgetpage() would provide gains as well, at
least for pages that aren't marked all-visible, pretty common in the real
world IME.

Attached is an experimental patch/hack for that. It ended up being more
beneficial to make the access ordering more optimal than prefetching the tuple
contents, but I'm not at all sure that's the be-all-end-all.

I separately benchmarked pinning the CPU and memory to the same
socket, to a different socket, and with memory interleaved.

I did this for HEAD, your patch, your patch plus mine, and just mine.

BEGIN; DROP TABLE IF EXISTS large; CREATE TABLE large(a int8 not null, b int8 not null default '0', c int8); INSERT INTO large SELECT generate_series(1, 50000000);COMMIT;

server is started with
local: numactl --membind 1 --physcpubind 10
remote: numactl --membind 0 --physcpubind 10
interleave: numactl --interleave=all --physcpubind 10

benchmark started with:
psql -qX -f ~/tmp/prewarm.sql && \
pgbench -n -f ~/tmp/seqbench.sql -t 1 -r > /dev/null && \
perf stat -e task-clock,LLC-loads,LLC-load-misses,cycles,instructions -C 10 \
pgbench -n -f ~/tmp/seqbench.sql -t 3 -r

seqbench.sql:
SELECT count(*) FROM large WHERE c IS NOT NULL;
SELECT sum(a), sum(b), sum(c) FROM large;
SELECT sum(c) FROM large;

branch         memory       time (s)   LLC miss %
head           local        31.612     74.03
david          local        32.034     73.54
david+andres   local        31.644     42.80
andres         local        30.863     48.05

head           remote       33.350     72.12
david          remote       33.425     71.30
david+andres   remote       32.428     49.57
andres         remote       30.907     44.33

head           interleave   32.465     71.33
david          interleave   33.176     72.60
david+andres   interleave   32.590     46.23
andres         interleave   30.440     45.13

It's cool seeing how optimizing heapgetpage() seems to pretty much remove
the performance difference between local and remote memory.

It makes some sense that David's patch doesn't help in this case - without
all-visible being set the tuple headers will have already been pulled in for
the HTSV call.

I've not yet experimented with moving the prefetch for the tuple contents from
David's location to before the HTSV. I suspect that might benefit both
workloads.

Greetings,

Andres Freund

Attachments:

prefetch-heapgetpage.diff (text/x-diff)
diff --git i/src/include/access/heapam.h w/src/include/access/heapam.h
index 9dab35551e1..dff7616abeb 100644
--- i/src/include/access/heapam.h
+++ w/src/include/access/heapam.h
@@ -74,7 +74,8 @@ typedef struct HeapScanDescData
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
-	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	OffsetNumber *rs_vistuples;
+	OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage];	/* their offsets */
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 12be87efed4..632f315f4e1 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -448,30 +448,99 @@ heapgetpage(TableScanDesc sscan, BlockNumber page)
 	 */
 	all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;
 
-	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
-		 lineoff <= lines;
-		 lineoff++, lpp++)
+	if (all_visible)
 	{
-		if (ItemIdIsNormal(lpp))
-		{
-			HeapTupleData loctup;
-			bool		valid;
+		HeapTupleData loctup;
+
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+
+		scan->rs_vistuples = scan->rs_vistuples_d;
+
+		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
+			 lineoff <= lines;
+			 lineoff++, lpp++)
+		{
+			if (!ItemIdIsNormal(lpp))
+				continue;
 
-			loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
 			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
 			loctup.t_len = ItemIdGetLength(lpp);
 			ItemPointerSet(&(loctup.t_self), page, lineoff);
 
-			if (all_visible)
-				valid = true;
-			else
-				valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+			HeapCheckForSerializableConflictOut(true, scan->rs_base.rs_rd,
+												&loctup, buffer, snapshot);
+			scan->rs_vistuples[ntup++] = lineoff;
+		}
+	}
+	else
+	{
+		HeapTupleData loctup;
+		int			normcount = 0;
+		OffsetNumber normoffsets[MaxHeapTuplesPerPage];
+
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+
+		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
+			 lineoff <= lines;
+			 lineoff++, lpp++)
+
+		/*
+		 * Iterate forward over line items, they're laid out in increasing
+		 * order in memory. Doing this separately allows to benefit from
+		 * out-of-order capabilities of the CPU and simplifies the next loop.
+		 *
+		 * FIXME: Worth unrolling so that we don't fetch the same cacheline
+		 * over and over, due to line items being smaller than a cacheline?
+		 */
+		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
+			 lineoff <= lines;
+			 lineoff++, lpp++)
+		{
+			pg_prefetch_mem(PageGetItemId(dp, lineoff+5));
+			if (!ItemIdIsNormal(lpp))
+				continue;
+			normoffsets[normcount++] = lineoff;
+		}
+
+		/*
+		 * Process tuples in reverse order. That'll most often lead to memory
+		 * accesses in increasing order, which typically is more efficient for
+		 * the CPUs prefetcher. To avoid affecting sort order, we store the
+		 * visible tuples in decreasing order in rs_vistuples_d and then set
+		 * rs_vistuple to the last tuple found.
+		 *
+		 * FIXME: We should likely compute rs_cindex in a smarter way, rather
+		 * than changing rs_vistuples.
+		 */
+		scan->rs_vistuples = scan->rs_vistuples_d + (MaxHeapTuplesPerPage);
+		for (int i = normcount - 1; i >= 0; i--)
+		{
+			bool valid;
+
+			/* doesn't appear to be beneficial */
+#if 0
+			if (i > 0)
+				pg_prefetch_mem(PageGetItem(dp, PageGetItemId(dp, normoffsets[i - 1])));
+#endif
+
+			lineoff = normoffsets[i];
+			lpp = PageGetItemId(dp, lineoff);
+
+			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
+			loctup.t_len = ItemIdGetLength(lpp);
+			ItemPointerSet(&(loctup.t_self), page, lineoff);
+			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 
 			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
 												&loctup, buffer, snapshot);
 
 			if (valid)
-				scan->rs_vistuples[ntup++] = lineoff;
+			{
+				scan->rs_vistuples--;
+				*scan->rs_vistuples = lineoff;
+				ntup++;
+			}
+
 		}
 	}
 
diff --git i/src/backend/access/heap/heapam_handler.c w/src/backend/access/heap/heapam_handler.c
index 41f1ca65d01..f2876ecbc60 100644
--- i/src/backend/access/heap/heapam_handler.c
+++ w/src/backend/access/heap/heapam_handler.c
@@ -2162,6 +2162,8 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 		 */
 		int			curslot;
 
+		hscan->rs_vistuples = hscan->rs_vistuples_d;
+
 		for (curslot = 0; curslot < tbmres->ntuples; curslot++)
 		{
 			OffsetNumber offnum = tbmres->offsets[curslot];
@@ -2184,6 +2186,8 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(dp);
 		OffsetNumber offnum;
 
+		hscan->rs_vistuples = hscan->rs_vistuples_d;
+
 		for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
 		{
 			ItemId		lp;
#9 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#8)
Re: Prefetch the next tuple's memory during seqscans

Hi,

On 2022-11-02 10:25:44 -0700, Andres Freund wrote:

server is started with
local: numactl --membind 1 --physcpubind 10
remote: numactl --membind 0 --physcpubind 10
interleave: numactl --interleave=all --physcpubind 10

Argh, forgot to say that this is with max_parallel_workers_per_gather=0,
s_b=8GB, huge_pages=on.

Greetings,

Andres Freund

#10 John Naylor
john.naylor@enterprisedb.com
In reply to: David Rowley (#3)
Re: Prefetch the next tuple's memory during seqscans

On Tue, Nov 1, 2022 at 5:17 AM David Rowley <dgrowleyml@gmail.com> wrote:

My test is to run 16 queries changing the WHERE clause each time to
have WHERE a = 0, then WHERE a2 = 0 ... WHERE a16 = 0. I wanted to
know if prefetching only the first cache line of the tuple would be
less useful when we require evaluation of say, the "a16" column vs the
"a" column.

I tried a similar test, but with text fields of random length, and there is
improvement here:

Intel laptop, turbo boost off
shared_buffers = '4GB'
huge_pages = 'on'
max_parallel_workers_per_gather = '0'

create table text8 as
select
repeat('X', int4(random() * 20)) a1,
repeat('X', int4(random() * 20)) a2,
repeat('X', int4(random() * 20)) a3,
repeat('X', int4(random() * 20)) a4,
repeat('X', int4(random() * 20)) a5,
repeat('X', int4(random() * 20)) a6,
repeat('X', int4(random() * 20)) a7,
repeat('X', int4(random() * 20)) a8
from generate_series(1,10000000) a;
vacuum freeze text8;

psql -c "select pg_prewarm('text8')" && \
for i in a1 a2 a3 a4 a5 a6 a7 a8;
do
echo Testing $i
echo "select * from text8 where $i = 'ZZZ';" > bench.sql
pgbench -f bench.sql -M prepared -n -T 10 postgres | grep latency
done

Master:

Testing a1
latency average = 980.595 ms
Testing a2
latency average = 1045.081 ms
Testing a3
latency average = 1107.736 ms
Testing a4
latency average = 1162.188 ms
Testing a5
latency average = 1213.985 ms
Testing a6
latency average = 1272.156 ms
Testing a7
latency average = 1318.281 ms
Testing a8
latency average = 1363.359 ms

Patch 0001+0003:

Testing a1
latency average = 812.548 ms
Testing a2
latency average = 897.303 ms
Testing a3
latency average = 955.997 ms
Testing a4
latency average = 1023.497 ms
Testing a5
latency average = 1088.494 ms
Testing a6
latency average = 1149.418 ms
Testing a7
latency average = 1213.134 ms
Testing a8
latency average = 1282.760 ms

--
John Naylor
EDB: http://www.enterprisedb.com

#11 David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#8)
5 attachment(s)
Re: Prefetch the next tuple's memory during seqscans

On Thu, 3 Nov 2022 at 06:25, Andres Freund <andres@anarazel.de> wrote:

Attached is an experimental patch/hack for that. It ended up being more
beneficial to make the access ordering more optimal than prefetching the tuple
contents, but I'm not at all sure that's the be-all-end-all.

Thanks for writing that patch. I've been experimenting with it.

I tried unrolling the loop (patch 0003) as you mentioned in:

+ * FIXME: Worth unrolling so that we don't fetch the same cacheline
+ * over and over, due to line items being smaller than a cacheline?

but didn't see any gains from doing that.

I also adjusted your patch a little so that instead of doing:

- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ OffsetNumber *rs_vistuples;
+ OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage]; /* their offsets */

to work around the issue of having to populate rs_vistuples_d in
reverse, I added a new field called rs_startindex to mark where the
first element in the rs_vistuples array is. The way you wrote it seems
to require fewer code changes, but per the FIXME comment you left, I
get the idea you just did it the way you did to make it work enough
for testing.

I'm quite keen to move forward in committing the 0001 patch to add the
pg_prefetch_mem macro. What I'm a little undecided about is what the
best patch is to commit first to make use of the new macro.

I did some tests on the attached set of patches:

alter system set max_parallel_workers_per_gather = 0;
select pg_reload_conf();

create table t as select a from generate_series(1,10000000)a;
alter table t set (autovacuum_enabled=false);

$ cat bench.sql
select * from t where a = 0;

psql -c "select pg_prewarm('t');" postgres

-- Test 1 no frozen tuples in "t"

Master (@9c6ad5eaa):
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 383.332 ms
latency average = 375.747 ms
latency average = 376.090 ms

Master + 0001 + 0002:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 370.133 ms
latency average = 370.149 ms
latency average = 370.157 ms

Master + 0001 + 0005:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 372.662 ms
latency average = 371.034 ms
latency average = 372.709 ms

-- Test 2 "select count(*) from t" with all tuples frozen

$ cat bench1.sql
select count(*) from t;

psql -c "vacuum freeze t;" postgres
psql -c "select pg_prewarm('t');" postgres

Master (@9c6ad5eaa):
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 406.238 ms
latency average = 407.029 ms
latency average = 406.962 ms

Master + 0001 + 0005:
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 345.470 ms
latency average = 345.775 ms
latency average = 345.354 ms

My current thoughts are that it might be best to go with 0005 to start
with. I know Melanie is working on making some changes in this area,
so perhaps it's best to leave 0002 until that work is complete.

David

Attachments:

v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch (text/plain)
From 491df9d6ab87a54bbc76b876484733d02d6c94ea Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 19 Oct 2022 08:54:01 +1300
Subject: [PATCH v2 1/5] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 000b075312..582a47501c 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 3966368b8d..c4685b8a1e 100755
--- a/configure
+++ b/configure
@@ -15988,6 +15988,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index f76b7ee31f..2d4938d43d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1802,6 +1802,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 058382046e..096643703c 100644
--- a/meson.build
+++ b/meson.build
@@ -1587,10 +1587,11 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
-  'ctz',
   'constant_p',
+  'ctz',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index 98cdd285dd..a7f7531450 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -361,6 +361,14 @@ typedef void (*pg_funcptr_t) (void);
  */
 #define FLEXIBLE_ARRAY_MEMBER	/* empty */
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c5a80b829e..07a661e288 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -559,6 +559,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index c2acb58df0..95de91890e 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -227,6 +227,7 @@ sub GenerateFiles
 		HAVE_BACKTRACE_SYMBOLS     => undef,
 		HAVE_BIO_GET_DATA          => undef,
 		HAVE_BIO_METH_NEW          => undef,
+		HAVE__BUILTIN_PREFETCH     => undef,
 		HAVE_COMPUTED_GOTO         => undef,
 		HAVE_COPYFILE              => undef,
 		HAVE_COPYFILE_H            => undef,
-- 
2.35.1.windows.2

v2-0002-Perform-memory-prefetching-in-heapgetpage.patch (text/plain)
From c5ec896a2df02041f08d1e41a982223781137d5f Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Thu, 17 Nov 2022 13:04:49 +1300
Subject: [PATCH v2 2/5] Perform memory prefetching in heapgetpage

---
 src/backend/access/heap/heapam.c         | 87 +++++++++++++++++++-----
 src/backend/access/heap/heapam_handler.c |  7 +-
 src/include/access/heapam.h              |  1 +
 3 files changed, 74 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d18c5ca6f5..81c7f69644 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -451,31 +451,82 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 */
 	all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
 
-	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(page, lineoff);
-		 lineoff <= lines;
-		 lineoff++, lpp++)
+	if (all_visible)
 	{
-		if (ItemIdIsNormal(lpp))
+		HeapTupleData loctup;
+
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+
+		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(page, lineoff);
+			 lineoff <= lines;
+			 lineoff++, lpp++)
 		{
-			HeapTupleData loctup;
-			bool		valid;
+			if (!ItemIdIsNormal(lpp))
+				continue;
 
-			loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
 			loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			loctup.t_len = ItemIdGetLength(lpp);
 			ItemPointerSet(&(loctup.t_self), block, lineoff);
 
-			if (all_visible)
-				valid = true;
-			else
-				valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+			HeapCheckForSerializableConflictOut(true, scan->rs_base.rs_rd,
+												&loctup, buffer, snapshot);
+			scan->rs_vistuples[ntup++] = lineoff;
+		}
+		scan->rs_startindex = 0;
+	}
+	else
+	{
+		HeapTupleData loctup;
+		int			normcount = 0;
+		int			startindex = MaxHeapTuplesPerPage;
+		OffsetNumber normoffsets[MaxHeapTuplesPerPage];
+
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+
+		/*
+		 * Iterate forward over line items, they're laid out in increasing
+		 * order in memory. Doing this separately allows to benefit from
+		 * out-of-order capabilities of the CPU and simplifies the next loop.
+		 */
+		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(page, lineoff);
+			 lineoff <= lines;
+			 lineoff++, lpp++)
+		{
+			pg_prefetch_mem(PageGetItemId(page, lineoff+5));
+			if (ItemIdIsNormal(lpp))
+				normoffsets[normcount++] = lineoff;
+		}
+
+		/*
+		 * Process tuples in reverse order. That'll most often lead to memory
+		 * accesses in increasing order, which typically is more efficient for
+		 * the CPUs prefetcher. To avoid affecting sort order, we populate the
+		 * rs_vistuples[] array backwards and use rs_startindex to mark the
+		 * first used element in the array.
+		 */
+		for (int i = normcount - 1; i >= 0; i--)
+		{
+			bool valid;
+
+			lineoff = normoffsets[i];
+			lpp = PageGetItemId(page, lineoff);
+
+			loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+			loctup.t_len = ItemIdGetLength(lpp);
+			ItemPointerSet(&(loctup.t_self), block, lineoff);
+			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 
 			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
 												&loctup, buffer, snapshot);
 
 			if (valid)
-				scan->rs_vistuples[ntup++] = lineoff;
+			{
+				scan->rs_vistuples[--startindex] = lineoff;
+				ntup++;
+			}
 		}
+		/* record the first used element in rs_vistuples[] */
+		scan->rs_startindex = startindex;
 	}
 
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
@@ -902,7 +953,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 			else
 				block = scan->rs_startblock; /* first page */
 			heapgetpage((TableScanDesc) scan, block);
-			lineindex = 0;
+			lineindex = scan->rs_startindex;
 			scan->rs_inited = true;
 		}
 		else
@@ -917,7 +968,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 		lines = scan->rs_ntuples;
 		/* block and lineindex now reference the next visible tid */
 
-		linesleft = lines - lineindex;
+		linesleft = lines - lineindex + scan->rs_startindex;
 	}
 	else if (backward)
 	{
@@ -968,7 +1019,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 
 		if (!scan->rs_inited)
 		{
-			lineindex = lines - 1;
+			lineindex = scan->rs_startindex + lines - 1;
 			scan->rs_inited = true;
 		}
 		else
@@ -977,7 +1028,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 		}
 		/* block and lineindex now reference the previous visible tid */
 
-		linesleft = lineindex + 1;
+		linesleft = lineindex + 1 - scan->rs_startindex;
 	}
 	else
 	{
@@ -1127,9 +1178,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 		lines = scan->rs_ntuples;
 		linesleft = lines;
 		if (backward)
-			lineindex = lines - 1;
+			lineindex = MaxHeapTuplesPerPage - 1;
 		else
-			lineindex = 0;
+			lineindex = scan->rs_startindex;
 	}
 }
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ab1bcf3522..d39284465b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2490,15 +2490,16 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 	{
 		/*
 		 * In pageatatime mode, heapgetpage() already did visibility checks,
-		 * so just look at the info it left in rs_vistuples[].
+		 * so just look at the info it left in rs_vistuples[] starting at
+		 * rs_startindex.
 		 *
 		 * We use a binary search over the known-sorted array.  Note: we could
 		 * save some effort if we insisted that NextSampleTuple select tuples
 		 * in increasing order, but it's not clear that there would be enough
 		 * gain to justify the restriction.
 		 */
-		int			start = 0,
-					end = hscan->rs_ntuples - 1;
+		int			start = hscan->rs_startindex,
+					end = hscan->rs_startindex + hscan->rs_ntuples - 1;
 
 		while (start <= end)
 		{
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 810baaf9d0..aba0795fc6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -74,6 +74,7 @@ typedef struct HeapScanDescData
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
+	int			rs_startindex;	/* first used element in rs_vistuples[] */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
-- 
2.35.1.windows.2

v2-0005-Prefetch-tuple-memory-during-forward-seqscans.patch (text/plain)
From f0b8920b23b200132a4e92942c4bb281426e4f9c Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 31 Oct 2022 10:05:12 +1300
Subject: [PATCH v2 5/5] Prefetch tuple memory during forward seqscans

---
 src/backend/access/heap/heapam.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8fe4f4c837..6dc53effb4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1115,6 +1115,17 @@ heapgettup_pagemode(HeapScanDesc scan,
 			tuple->t_len = ItemIdGetLength(lpp);
 			ItemPointerSet(&(tuple->t_self), block, lineoff);
 
+			/*
+			 * Prefetching the memory for the next tuple has shown to improve
+			 * performance on certain hardware.
+			 */
+			if (!backward && linesleft > 1)
+			{
+				lineoff = scan->rs_vistuples[lineindex + 1];
+				lpp = PageGetItemId(page, lineoff);
+				pg_prefetch_mem(PageGetItem(page, lpp));
+			}
+
 			/*
 			 * if current tuple qualifies, return it.
 			 */
-- 
2.35.1.windows.2

v2-0004-heapam-WIP-cacheline-prefetching-for-hot-pruning.patch (text/plain)
From 8f47b1e0163028ec8928aaf02ce49d34bb89c647 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 19 Oct 2022 08:55:45 +1300
Subject: [PATCH v2 4/5] heapam: WIP: cacheline prefetching for hot pruning.

Author: Andres Freund, with contributions from Dmitry Dolgov
---
 src/backend/access/heap/pruneheap.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 91c5f5e9ef..a094ac18f5 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -302,6 +302,30 @@ heap_page_prune(Relation relation, Buffer buffer,
 	maxoff = PageGetMaxOffsetNumber(page);
 	tup.t_tableOid = RelationGetRelid(prstate.rel);
 
+#if 1
+	for (char *p = (char *) PageGetItemId(page, FirstOffsetNumber);
+		 p < (char *) PageGetItemId(page, maxoff);
+		 p += 64)
+	{
+		pg_prefetch_mem((ItemId)p);
+	}
+
+
+	for (offnum = FirstOffsetNumber;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid;
+
+		itemid = PageGetItemId(page, offnum);
+		if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid) || !ItemIdHasStorage(itemid))
+			continue;
+
+		pg_prefetch_mem((HeapTupleHeader) PageGetItem(page, itemid));
+		pg_prefetch_mem(PageGetItem(page, itemid) + sizeof(HeapTupleHeaderData) - 1);
+	}
+#endif
+
 	/*
 	 * Determine HTSV for all tuples.
 	 *
-- 
2.35.1.windows.2

v2-0003-Unroll-loop-in-heapgetpage.patch (text/plain)
From f1b183123befc1a3f096ba36b1c3834b4c67d3ec Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Thu, 17 Nov 2022 13:24:40 +1300
Subject: [PATCH v2 3/5] Unroll loop in heapgetpage

---
 src/backend/access/heap/heapam.c | 48 ++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 81c7f69644..8fe4f4c837 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -484,15 +484,51 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
 
 		/*
-		 * Iterate forward over line items, they're laid out in increasing
-		 * order in memory. Doing this separately allows to benefit from
-		 * out-of-order capabilities of the CPU and simplifies the next loop.
+		 * Iterate forward over line items processing 16 at a time (this
+		 * assumes there will be 16 ItemIds per CPU cacheline.
 		 */
 		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(page, lineoff);
-			 lineoff <= lines;
-			 lineoff++, lpp++)
+			 lineoff <= lines - 15;
+			 lineoff += 16, lpp += 16)
+		{
+			pg_prefetch_mem(PageGetItemId(page, lineoff + 16));
+			if (ItemIdIsNormal(lpp))
+				normoffsets[normcount++] = lineoff;
+			if (ItemIdIsNormal(lpp + 1))
+				normoffsets[normcount++] = lineoff + 1;
+			if (ItemIdIsNormal(lpp + 2))
+				normoffsets[normcount++] = lineoff + 2;
+			if (ItemIdIsNormal(lpp + 3))
+				normoffsets[normcount++] = lineoff + 3;
+			if (ItemIdIsNormal(lpp + 4))
+				normoffsets[normcount++] = lineoff + 4;
+			if (ItemIdIsNormal(lpp + 5))
+				normoffsets[normcount++] = lineoff + 5;
+			if (ItemIdIsNormal(lpp + 6))
+				normoffsets[normcount++] = lineoff + 6;
+			if (ItemIdIsNormal(lpp + 7))
+				normoffsets[normcount++] = lineoff + 7;
+			if (ItemIdIsNormal(lpp + 8))
+				normoffsets[normcount++] = lineoff + 8;
+			if (ItemIdIsNormal(lpp + 9))
+				normoffsets[normcount++] = lineoff + 9;
+			if (ItemIdIsNormal(lpp + 10))
+				normoffsets[normcount++] = lineoff + 10;
+			if (ItemIdIsNormal(lpp + 11))
+				normoffsets[normcount++] = lineoff + 11;
+			if (ItemIdIsNormal(lpp + 12))
+				normoffsets[normcount++] = lineoff + 12;
+			if (ItemIdIsNormal(lpp + 13))
+				normoffsets[normcount++] = lineoff + 13;
+			if (ItemIdIsNormal(lpp + 14))
+				normoffsets[normcount++] = lineoff + 14;
+			if (ItemIdIsNormal(lpp + 15))
+				normoffsets[normcount++] = lineoff + 15;
+		}
+
+		/* get remainder */
+		for (; lineoff <= lines; lineoff++, lpp++)
 		{
-			pg_prefetch_mem(PageGetItemId(page, lineoff+5));
 			if (ItemIdIsNormal(lpp))
 				normoffsets[normcount++] = lineoff;
 		}
-- 
2.35.1.windows.2

#12 David Rowley
dgrowleyml@gmail.com
In reply to: John Naylor (#10)
Re: Prefetch the next tuple's memory during seqscans

On Thu, 3 Nov 2022 at 22:09, John Naylor <john.naylor@enterprisedb.com> wrote:

I tried a similar test, but with text fields of random length, and there is improvement here:

Thank you for testing that. Can you share which CPU this was on?

My tests were all on AMD Zen 2. I'm keen to see what the results are
on intel hardware.

David

#13 John Naylor
john.naylor@enterprisedb.com
In reply to: David Rowley (#12)
Re: Prefetch the next tuple's memory during seqscans

On Wed, Nov 23, 2022 at 5:00 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 3 Nov 2022 at 22:09, John Naylor <john.naylor@enterprisedb.com> wrote:

I tried a similar test, but with text fields of random length, and there
is improvement here:

Thank you for testing that. Can you share which CPU this was on?

That was an Intel Core i7-10750H

--
John Naylor
EDB: http://www.enterprisedb.com

#14 sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: David Rowley (#11)
Re: Prefetch the next tuple's memory during seqscans

On Tue, Nov 22, 2022 at 1:58 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 3 Nov 2022 at 06:25, Andres Freund <andres@anarazel.de> wrote:

Attached is an experimental patch/hack for that. It ended up being more
beneficial to make the access ordering more optimal than prefetching the
tuple contents, but I'm not at all sure that's the be-all-end-all.

Thanks for writing that patch. I've been experimenting with it.

I tried unrolling the loop (patch 0003) as you mentioned in:

+ * FIXME: Worth unrolling so that we don't fetch the same cacheline
+ * over and over, due to line items being smaller than a cacheline?

but didn't see any gains from doing that.

I also adjusted your patch a little so that instead of doing:

- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ OffsetNumber *rs_vistuples;
+ OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage]; /* their offsets */

to work around the issue of having to populate rs_vistuples_d in
reverse, I added a new field called rs_startindex to mark where the
first element in the rs_vistuples array is. The way you wrote it seems
to require fewer code changes, but per the FIXME comment you left, I
get the idea you just did it the way you did to make it work enough
for testing.

I'm quite keen to move forward in committing the 0001 patch to add the
pg_prefetch_mem macro. What I'm a little undecided about is what the
best patch is to commit first to make use of the new macro.

I did some tests on the attached set of patches:

alter system set max_parallel_workers_per_gather = 0;
select pg_reload_conf();

create table t as select a from generate_series(1,10000000)a;
alter table t set (autovacuum_enabled=false);

$ cat bench.sql
select * from t where a = 0;

psql -c "select pg_prewarm('t');" postgres

-- Test 1 no frozen tuples in "t"

Master (@9c6ad5eaa):
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 383.332 ms
latency average = 375.747 ms
latency average = 376.090 ms

Master + 0001 + 0002:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 370.133 ms
latency average = 370.149 ms
latency average = 370.157 ms

Master + 0001 + 0005:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 372.662 ms
latency average = 371.034 ms
latency average = 372.709 ms

-- Test 2 "select count(*) from t" with all tuples frozen

$ cat bench1.sql
select count(*) from t;

psql -c "vacuum freeze t;" postgres
psql -c "select pg_prewarm('t');" postgres

Master (@9c6ad5eaa):
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 406.238 ms
latency average = 407.029 ms
latency average = 406.962 ms

Master + 0001 + 0005:
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 345.470 ms
latency average = 345.775 ms
latency average = 345.354 ms

My current thoughts are that it might be best to go with 0005 to start
with. I know Melanie is working on making some changes in this area,
so perhaps it's best to leave 0002 until that work is complete.

I ran your test1 exactly like your setup except the row count is 3000000
(with 13275 blocks). shared_buffers is 128MB and the hardware configuration
details are at the bottom of the mail. It appears Master + 0001 + 0005
regressed slightly compared to master.

Master (@56d0ed3b756b2e3799a7bbc0ac89bc7657ca2c33)

Before vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10
postgres | grep -E "^latency"
latency average = 430.287 ms

After Vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10
postgres | grep -E "^latency"
latency average = 369.046 ms

*Master + 0001 + 0002:*

Before vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10
postgres | grep -E "^latency"
latency average = 427.983 ms

After Vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10
postgres | grep -E "^latency"
latency average = 367.185 ms

*Master + 0001 + 0005:*

Before vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10
postgres | grep -E "^latency"
latency average = 447.045 ms

After Vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10
postgres | grep -E "^latency"
latency average = 374.484 ms

lscpu output

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2397.224
BogoMIPS: 4794.44
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32 KiB
L1i cache: 32 KiB
L2 cache: 256 KiB
L3 cache: 30 MiB
NUMA node0 CPU(s): 0
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host
state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted,
no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and
__user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled,
RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic
sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni
pclmulqdq ssse3 fma cx16 pcid sse4_1
sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt m
d_clear


#15David Rowley
dgrowleyml@gmail.com
In reply to: sirisha chamarthi (#14)
Re: Prefetch the next tuple's memory during seqscans

On Wed, 23 Nov 2022 at 20:29, sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:

I ran your test1 exactly like your setup except the row count is 3000000 (with 13275 blocks). Shared_buffers is 128MB and the hardware configuration details are at the bottom of the mail. It appears Master + 0001 + 0005 regressed slightly compared to master.

Thank you for running these tests.

Can you share whether the plans used for these queries were parallel plans?
I had set max_parallel_workers_per_gather to 0 to remove the
additional variability from parallel query.

Also, 13275 blocks is 104MB; does EXPLAIN (ANALYZE, BUFFERS) indicate
that all pages were in shared buffers? I used pg_prewarm() to ensure
they were so that the runs were consistent.

David

#16sirisha chamarthi
sirichamarthi22@gmail.com
In reply to: David Rowley (#15)
Re: Prefetch the next tuple's memory during seqscans

On Tue, Nov 22, 2022 at 11:44 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 23 Nov 2022 at 20:29, sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:

I ran your test1 exactly like your setup except the row count is 3000000
(with 13275 blocks). Shared_buffers is 128MB and the hardware configuration
details are at the bottom of the mail. It appears Master + 0001 + 0005
regressed slightly compared to master.

Thank you for running these tests.

Can you share whether the plans used for these queries were parallel plans?
I had set max_parallel_workers_per_gather to 0 to remove the
additional variability from parallel query.

Also, 13275 blocks is 104MB; does EXPLAIN (ANALYZE, BUFFERS) indicate
that all pages were in shared buffers? I used pg_prewarm() to ensure
they were so that the runs were consistent.

I reran the test with max_parallel_workers_per_gather set to 0 and with
pg_prewarm. It appears I missed some step while testing on master; thanks
for sharing the details. The new numbers show master has higher latency
than Master + 0001 + 0005.

*Master*

Before vacuum:
latency average = 452.881 ms

After vacuum:
latency average = 393.880 ms

*Master + 0001 + 0005*
Before vacuum:
latency average = 441.832 ms

After vacuum:
latency average = 369.591 ms

#17David Rowley
dgrowleyml@gmail.com
In reply to: sirisha chamarthi (#16)
Re: Prefetch the next tuple's memory during seqscans

On Wed, 23 Nov 2022 at 21:26, sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:

Master
After vacuum:
latency average = 393.880 ms

Master + 0001 + 0005
After vacuum:
latency average = 369.591 ms

Thank you for running those again. Those results make more sense.
Would you mind also testing the count(*) query too?

David

#18Bruce Momjian
bruce@momjian.us
In reply to: Thomas Munro (#5)
Re: Prefetch the next tuple's memory during seqscans

On Wed, Nov 2, 2022 at 12:42:11AM +1300, Thomas Munro wrote:

On Wed, Nov 2, 2022 at 12:09 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

In theory, why does the prefetch make things better? I am asking this
because I think we need to read the data from the buffer into a cache line
once in either case (I'm obviously wrong, given the test results.)

CPUs have several different kinds of 'hardware prefetchers' (worth
reading about), that look out for sequential and striding patterns and
try to get the cache line ready before you access it. Using the
prefetch instructions explicitly is called 'software prefetching'
(special instructions inserted by programmers or compilers). The
theory here would have to be that the hardware prefetchers couldn't
pick up the pattern, but we know how to do it. The exact details of
the hardware prefetchers vary between chips, and there are even some
parameters you can adjust in BIOS settings. One idea is that the
hardware prefetchers are generally biased towards increasing
addresses, but our tuples tend to go backwards on the page[1]. It's
possible that some other CPUs can detect backwards strides better, but
since real world tuples aren't of equal size anyway, there isn't
really a fixed stride at all, so software prefetching seems quite
promising for this...

[1] https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-PAGE-LAYOUT-FIGURE
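
(Concretely, the kind of explicit software prefetch being discussed amounts
to something like the sketch below, using the pg_prefetch_mem() macro from
the 0001 patch. The loop and names are simplified assumptions for
illustration, not the actual heapam.c code.)

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

static void
process_page_tuples(Page page, OffsetNumber *vistuples, int ntuples)
{
    for (int i = 0; i < ntuples; i++)
    {
        ItemId      lpp = PageGetItemId(page, vistuples[i]);
        HeapTupleHeader tup = (HeapTupleHeader) PageGetItem(page, lpp);

        /*
         * Hint the next tuple's memory before working on the current one;
         * the hardware prefetcher may not predict this pattern, since
         * tuples are laid out back-to-front on the page and vary in size.
         */
        if (i + 1 < ntuples)
        {
            ItemId      nextlpp = PageGetItemId(page, vistuples[i + 1]);

            pg_prefetch_mem(PageGetItem(page, nextlpp));
        }

        /* ... deform and evaluate "tup" here ... */
        (void) tup;
    }
}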

I remember someone showing that having our item pointers at the _end_ of
the page and tuples at the start moving toward the end increased
performance significantly.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#19Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#18)
Re: Prefetch the next tuple's memory during seqscans

On Wed, Nov 23, 2022 at 11:03:22AM -0500, Bruce Momjian wrote:

CPUs have several different kinds of 'hardware prefetchers' (worth
reading about), that look out for sequential and striding patterns and
try to get the cache line ready before you access it. Using the
prefetch instructions explicitly is called 'software prefetching'
(special instructions inserted by programmers or compilers). The
theory here would have to be that the hardware prefetchers couldn't
pick up the pattern, but we know how to do it. The exact details of
the hardware prefetchers vary between chips, and there are even some
parameters you can adjust in BIOS settings. One idea is that the
hardware prefetchers are generally biased towards increasing
addresses, but our tuples tend to go backwards on the page[1]. It's
possible that some other CPUs can detect backwards strides better, but
since real world tuples aren't of equal size anyway, there isn't
really a fixed stride at all, so software prefetching seems quite
promising for this...

[1] https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-PAGE-LAYOUT-FIGURE

I remember someone showing that having our item pointers at the _end_ of
the page and tuples at the start moving toward the end increased
performance significantly.

Ah, I found it, from 2017, with a 15-25% slowdown:

/messages/by-id/20171108205943.tps27i2tujsstrg7@alap3.anarazel.de

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

#20David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#11)
1 attachment(s)
Re: Prefetch the next tuple's memory during seqscans

On Wed, 23 Nov 2022 at 10:58, David Rowley <dgrowleyml@gmail.com> wrote:

My current thoughts are that it might be best to go with 0005 to start
with. I know Melanie is working on making some changes in this area,
so perhaps it's best to leave 0002 until that work is complete.

I tried running TPC-H @ scale 5 with master (@d09dbeb9) vs master +
0001 + 0005 patch. The results look quite promising. Query 15 seems
to run 15% faster and overall it's 4.23% faster.

Full results are attached.

David

Attachments:

tpch_results.txttext/plain; charset=US-ASCII; name=tpch_results.txtDownload
#21John Naylor
john.naylor@enterprisedb.com
In reply to: David Rowley (#11)
Re: Prefetch the next tuple's memory during seqscans

On Wed, Nov 23, 2022 at 4:58 AM David Rowley <dgrowleyml@gmail.com> wrote:

My current thoughts are that it might be best to go with 0005 to start
with.

+1

I know Melanie is working on making some changes in this area,
so perhaps it's best to leave 0002 until that work is complete.

There seem to be some open questions about that one as well.

I reran the same test in [1]/messages/by-id/CAFBsxsHqmH_S=4apc5agKsJsF6xZ9f6NaH0Z83jUYv3EgySHfw@mail.gmail.com (except I don't have the ability to lock clock
speed or affect huge pages) on an older CPU from 2014 (Intel(R) Xeon(R) CPU
E5-2695 v3 @ 2.30GHz, kernel 3.10 gcc 4.8) with good results:

HEAD:

Testing a1
latency average = 965.462 ms
Testing a2
latency average = 1054.608 ms
Testing a3
latency average = 1078.263 ms
Testing a4
latency average = 1120.933 ms
Testing a5
latency average = 1162.753 ms
Testing a6
latency average = 1298.876 ms
Testing a7
latency average = 1228.775 ms
Testing a8
latency average = 1293.535 ms

0001+0005:

Testing a1
latency average = 791.224 ms
Testing a2
latency average = 876.421 ms
Testing a3
latency average = 911.039 ms
Testing a4
latency average = 981.693 ms
Testing a5
latency average = 998.176 ms
Testing a6
latency average = 979.954 ms
Testing a7
latency average = 1066.523 ms
Testing a8
latency average = 1030.235 ms

I then tested a Power8 machine (also kernel 3.10 gcc 4.8). Configure
reports "checking for __builtin_prefetch... yes", but I don't think it does
anything here, as the results are within noise level. A quick search didn't
turn up anything informative on this platform, and I'm not motivated to dig
deeper. In any case, it doesn't make things worse.

HEAD:

Testing a1
latency average = 1402.163 ms
Testing a2
latency average = 1442.971 ms
Testing a3
latency average = 1599.188 ms
Testing a4
latency average = 1664.397 ms
Testing a5
latency average = 1782.091 ms
Testing a6
latency average = 1860.655 ms
Testing a7
latency average = 1929.120 ms
Testing a8
latency average = 2021.100 ms

0001+0005:

Testing a1
latency average = 1433.080 ms
Testing a2
latency average = 1428.369 ms
Testing a3
latency average = 1542.406 ms
Testing a4
latency average = 1642.452 ms
Testing a5
latency average = 1737.173 ms
Testing a6
latency average = 1828.239 ms
Testing a7
latency average = 1920.909 ms
Testing a8
latency average = 2036.922 ms

[1]: /messages/by-id/CAFBsxsHqmH_S=4apc5agKsJsF6xZ9f6NaH0Z83jUYv3EgySHfw@mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com

#22David Rowley
dgrowleyml@gmail.com
In reply to: John Naylor (#21)
Re: Prefetch the next tuple's memory during seqscans

On Thu, 1 Dec 2022 at 18:18, John Naylor <john.naylor@enterprisedb.com> wrote:

I then tested a Power8 machine (also kernel 3.10 gcc 4.8). Configure reports "checking for __builtin_prefetch... yes", but I don't think it does anything here, as the results are within noise level. A quick search didn't turn up anything informative on this platform, and I'm not motivated to dig deeper. In any case, it doesn't make things worse.

Thanks for testing the power8 hardware.

Andres just let me test on some Apple M1 hardware (those cores are
insanely fast!)

Using the table and running the script from [1]/messages/by-id/CAApHDvqWexy_6jGmB39Vr3OqxZ_w6stAFkq52hODvwaW-19aiA@mail.gmail.com, with trimmed-down
output, I see:

Master @ edf12e7bbd

Testing a -> 158.037 ms
Testing a2 -> 164.442 ms
Testing a3 -> 171.523 ms
Testing a4 -> 189.892 ms
Testing a5 -> 217.197 ms
Testing a6 -> 186.790 ms
Testing a7 -> 189.491 ms
Testing a8 -> 195.384 ms
Testing a9 -> 200.547 ms
Testing a10 -> 206.149 ms
Testing a11 -> 211.708 ms
Testing a12 -> 217.976 ms
Testing a13 -> 224.565 ms
Testing a14 -> 230.642 ms
Testing a15 -> 237.372 ms
Testing a16 -> 244.110 ms

(checking for __builtin_prefetch... yes)

Master + v2-0001 + v2-0005

Testing a -> 157.477 ms
Testing a2 -> 163.720 ms
Testing a3 -> 171.159 ms
Testing a4 -> 186.837 ms
Testing a5 -> 205.220 ms
Testing a6 -> 184.585 ms
Testing a7 -> 189.879 ms
Testing a8 -> 195.650 ms
Testing a9 -> 201.220 ms
Testing a10 -> 207.162 ms
Testing a11 -> 213.255 ms
Testing a12 -> 219.313 ms
Testing a13 -> 225.763 ms
Testing a14 -> 237.337 ms
Testing a15 -> 239.440 ms
Testing a16 -> 245.740 ms

It does not seem like there's any improvement on this architecture.
There is a very small improvement from "a" to "a6", but a very small
regression from "a7" to "a16". It's likely all within the expected
noise level.

David

[1]: /messages/by-id/CAApHDvqWexy_6jGmB39Vr3OqxZ_w6stAFkq52hODvwaW-19aiA@mail.gmail.com

#23vignesh C
vignesh21@gmail.com
In reply to: David Rowley (#11)
Re: Prefetch the next tuple's memory during seqscans

On Wed, 23 Nov 2022 at 03:28, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 3 Nov 2022 at 06:25, Andres Freund <andres@anarazel.de> wrote:

Attached is an experimental patch/hack for that. It ended up being more
beneficial to make the access ordering more optimal than prefetching the tuple
contents, but I'm not at all sure that's the be-all-end-all.

Thanks for writing that patch. I've been experimenting with it.

I tried unrolling the loop (patch 0003) as you mentioned in:

+ * FIXME: Worth unrolling so that we don't fetch the same cacheline
+ * over and over, due to line items being smaller than a cacheline?

but didn't see any gains from doing that.

I also adjusted your patch a little so that instead of doing:

- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ OffsetNumber *rs_vistuples;
+ OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage]; /* their offsets */

to work around the issue of having to populate rs_vistuples_d in
reverse, I added a new field called rs_startindex to mark where the
first element in the rs_vistuples array is. The way you wrote it seems
to require fewer code changes, but per the FIXME comment you left, I
get the idea you just did it the way you did to make it work enough
for testing.

I'm quite keen to move forward in committing the 0001 patch to add the
pg_prefetch_mem macro. What I'm a little undecided about is what the
best patch is to commit first to make use of the new macro.

I did some tests on the attached set of patches:

alter system set max_parallel_workers_per_gather = 0;
select pg_reload_conf();

create table t as select a from generate_series(1,10000000)a;
alter table t set (autovacuum_enabled=false);

$ cat bench.sql
select * from t where a = 0;

psql -c "select pg_prewarm('t');" postgres

-- Test 1 no frozen tuples in "t"

Master (@9c6ad5eaa):
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 383.332 ms
latency average = 375.747 ms
latency average = 376.090 ms

Master + 0001 + 0002:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 370.133 ms
latency average = 370.149 ms
latency average = 370.157 ms

Master + 0001 + 0005:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 372.662 ms
latency average = 371.034 ms
latency average = 372.709 ms

-- Test 2 "select count(*) from t" with all tuples frozen

$ cat bench1.sql
select count(*) from t;

psql -c "vacuum freeze t;" postgres
psql -c "select pg_prewarm('t');" postgres

Master (@9c6ad5eaa):
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 406.238 ms
latency average = 407.029 ms
latency average = 406.962 ms

Master + 0001 + 0005:
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 345.470 ms
latency average = 345.775 ms
latency average = 345.354 ms

My current thoughts are that it might be best to go with 0005 to start
with. I know Melanie is working on making some changes in this area,
so perhaps it's best to leave 0002 until that work is complete.

The patch does not apply on top of HEAD, as shown in [1]http://cfbot.cputube.org/patch_41_3978.log; please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
5212d447fa53518458cbe609092b347803a667c5 ===
=== applying patch ./v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch
=== applying patch ./v2-0002-Perform-memory-prefetching-in-heapgetpage.patch
patching file src/backend/access/heap/heapam.c
Hunk #1 FAILED at 451.
1 out of 6 hunks FAILED -- saving rejects to file
src/backend/access/heap/heapam.c.rej

[1]: http://cfbot.cputube.org/patch_41_3978.log

Regards,
Vignesh

#24David Rowley
dgrowleyml@gmail.com
In reply to: vignesh C (#23)
Re: Prefetch the next tuple's memory during seqscans

On Wed, 4 Jan 2023 at 23:06, vignesh C <vignesh21@gmail.com> wrote:

patching file src/backend/access/heap/heapam.c
Hunk #1 FAILED at 451.
1 out of 6 hunks FAILED -- saving rejects to file
src/backend/access/heap/heapam.c.rej

I've moved this patch to the next CF. This patch has a dependency on
what's being proposed in [1]https://commitfest.postgresql.org/41/3987/. I'd rather wait until that goes in
before rebasing this. Having this go in first would just make Melanie's
heapam.c refactoring work harder.

David

[1]: https://commitfest.postgresql.org/41/3987/

#25Gregory Stark (as CFM)
stark.cfm@gmail.com
In reply to: David Rowley (#24)
Re: Prefetch the next tuple's memory during seqscans

On Sun, 29 Jan 2023 at 21:24, David Rowley <dgrowleyml@gmail.com> wrote:

I've moved this patch to the next CF. This patch has a dependency on
what's being proposed in [1].

The referenced patch was committed March 19th but there's been no
comment here. Is this patch likely to go ahead this release or should
I move it forward again?

--
Gregory Stark
As Commitfest Manager

#26David Rowley
dgrowleyml@gmail.com
In reply to: Gregory Stark (as CFM) (#25)
2 attachment(s)
Re: Prefetch the next tuple's memory during seqscans

On Tue, 4 Apr 2023 at 07:47, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:

The referenced patch was committed March 19th but there's been no
comment here. Is this patch likely to go ahead this release or should
I move it forward again?

Thanks for the reminder on this.

I have done some work on it but didn't post it here as I didn't
have good news. The problem I'm facing is that after Melanie's recent
refactoring work around heapgettup() [1]/messages/by-id/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com, I can no longer get the
same speedup as before from pg_prefetch_mem(). While testing
Melanie's patches I did run some performance tests and saw a good
performance increase from the prefetching at that time. I really don't
know why the prefetching no longer shows the gains it did before.
Perhaps the rearranged code lets the hardware prefetcher load the
cache lines well enough on its own.

I am, however, inclined not to drop the pg_prefetch_mem() macro
altogether just because I can no longer demonstrate any performance
gains during sequential scans. So I decided to try what Thomas
mentioned in [2]/messages/by-id/CA+hUKGJRtzbbhVmb83vbCiMRZ4piOAi7HWLCqs=GQ74mUPrP_w@mail.gmail.com: use the prefetching macro to prefetch the required
tuples in PageRepairFragmentation() so that they're already in the CPU
cache by the time we get to compactify_tuples().
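
A minimal sketch of the idea follows (the function name and structure here
are assumptions for illustration; the attached
prefetch_in_PageRepairFragmentation.patch contains the actual one-line
change inside the real loop):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/*
 * Sketch: while walking a page's line pointers, as
 * PageRepairFragmentation() does when building its compaction work list,
 * issue a prefetch for each used item's data so that compactify_tuples()
 * later finds those cache lines already loaded.
 */
static void
prefetch_used_items(Page page)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (OffsetNumber off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId      lp = PageGetItemId(page, off);

        if (!ItemIdIsUsed(lp))
            continue;

        /* start pulling this item's tuple data into the CPU cache now */
        pg_prefetch_mem(PageGetItem(page, lp));
    }
}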

I tried this using the same test as I described in [3]/messages/by-id/CAApHDvoKwqAzhiuxEt8jSquPJKDpH8DNUZDFUSX9P7DXrJdc3Q@mail.gmail.com after adjusting
the following line to use PANIC instead of LOG:

ereport(LOG,
        (errmsg("redo done at %X/%X system usage: %s",
                LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
                pg_rusage_show(&ru0))));

Doing that allows me to repeat the test using the same WAL each time.

amd3990x CPU on Ubuntu 22.10 with 64GB RAM.

shared_buffers = 10GB
checkpoint_timeout = '1 h'
max_wal_size = 100GB
max_connections = 300

Master:

2023-04-04 15:54:55.635 NZST [15958] PANIC: redo done at 0/DC447610
system usage: CPU: user: 44.46 s, system: 0.97 s, elapsed: 45.45 s
2023-04-04 15:56:33.380 NZST [16109] PANIC: redo done at 0/DC447610
system usage: CPU: user: 43.80 s, system: 0.86 s, elapsed: 44.69 s
2023-04-04 15:57:25.968 NZST [16134] PANIC: redo done at 0/DC447610
system usage: CPU: user: 44.08 s, system: 0.74 s, elapsed: 44.84 s
2023-04-04 15:58:53.820 NZST [16158] PANIC: redo done at 0/DC447610
system usage: CPU: user: 44.20 s, system: 0.72 s, elapsed: 44.94 s

Prefetch Memory in PageRepairFragmentation():

2023-04-04 16:03:16.296 NZST [25921] PANIC: redo done at 0/DC447610
system usage: CPU: user: 41.73 s, system: 0.77 s, elapsed: 42.52 s
2023-04-04 16:04:07.384 NZST [25945] PANIC: redo done at 0/DC447610
system usage: CPU: user: 40.87 s, system: 0.86 s, elapsed: 41.74 s
2023-04-04 16:05:01.090 NZST [25968] PANIC: redo done at 0/DC447610
system usage: CPU: user: 41.20 s, system: 0.72 s, elapsed: 41.94 s
2023-04-04 16:05:49.235 NZST [25996] PANIC: redo done at 0/DC447610
system usage: CPU: user: 41.56 s, system: 0.66 s, elapsed: 42.24 s

About 6.7% performance increase over master.

Since I really just did the seqscan patch as a means to get the
pg_prefetch_mem() patch in, I wonder if it's ok to scrap that in
favour of the PageRepairFragmentation patch.

Updated patches attached.

David

[1]: /messages/by-id/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
[2]: /messages/by-id/CA+hUKGJRtzbbhVmb83vbCiMRZ4piOAi7HWLCqs=GQ74mUPrP_w@mail.gmail.com
[3]: /messages/by-id/CAApHDvoKwqAzhiuxEt8jSquPJKDpH8DNUZDFUSX9P7DXrJdc3Q@mail.gmail.com

Attachments:

v1-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchapplication/octet-stream; name=v1-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchDownload
From 9d9474b1988ee2d81ea35afdc09028f6c8fee3aa Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 19 Oct 2022 08:54:01 +1300
Subject: [PATCH v1 1/3] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5be8f0f08d..fedfd55cbc 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 905be9568b..9ea528fdb3 100755
--- a/configure
+++ b/configure
@@ -15969,6 +15969,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 8095dfcf1d..b3b3d26806 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1819,6 +1819,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 84b60c8933..a7cb9cac48 100644
--- a/meson.build
+++ b/meson.build
@@ -1646,10 +1646,11 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
-  'ctz',
   'constant_p',
+  'ctz',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index 5fe7a97ff0..1bd5a04fe2 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -409,6 +409,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 3665e799e7..9cef36cbb3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -559,6 +559,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 153be7be11..ff60cc5d51 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -227,6 +227,7 @@ sub GenerateFiles
 		HAVE_BACKTRACE_SYMBOLS     => undef,
 		HAVE_BIO_GET_DATA          => undef,
 		HAVE_BIO_METH_NEW          => undef,
+		HAVE__BUILTIN_PREFETCH     => undef,
 		HAVE_COMPUTED_GOTO         => undef,
 		HAVE_COPYFILE              => undef,
 		HAVE_COPYFILE_H            => undef,
-- 
2.39.0.windows.1

prefetch_in_PageRepairFragmentation.patchapplication/octet-stream; name=prefetch_in_PageRepairFragmentation.patchDownload
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 92994f8f39..501b886181 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -752,6 +752,8 @@ PageRepairFragmentation(Page page)
 				else
 					presorted = false;
 
+				pg_prefetch_mem((HeapTupleHeader) PageGetItem(page, lp));
+
 				if (unlikely(itemidptr->itemoff < (int) pd_upper ||
 							 itemidptr->itemoff >= (int) pd_special))
 					ereport(ERROR,
#27Daniel Gustafsson
daniel@yesql.se
In reply to: David Rowley (#26)
1 attachment(s)
Re: Prefetch the next tuple's memory during seqscans

On 4 Apr 2023, at 06:50, David Rowley <dgrowleyml@gmail.com> wrote:

Updated patches attached.

This patch is marked Waiting on Author, but from the thread it seems Needs
Review is more apt. I've changed status and also attached a new version of the
patch as the posted v1 no longer applied due to changes in formatting for Perl
code.

--
Daniel Gustafsson

Attachments:

v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchapplication/octet-stream; name=v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch; x-unix-mode=0644Download
From 72b3cbd8d53516451ee914f013188cfd67504c7c Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 10 Jul 2023 11:22:34 +0200
Subject: [PATCH v2] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5be8f0f08d..fedfd55cbc 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 4a8f652129..bfe2783bef 100755
--- a/configure
+++ b/configure
@@ -15976,6 +15976,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 0f802f3df4..25217c9af5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1823,6 +1823,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 0c44f19cb9..13c6dbf5a7 100644
--- a/meson.build
+++ b/meson.build
@@ -1691,10 +1691,11 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
-  'ctz',
   'constant_p',
+  'ctz',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index f69d739be5..d6b79ac277 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -409,6 +409,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index d03f6e8de8..c15d272a5e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -559,6 +559,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index f09b3826ff..eccfde51f7 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -227,6 +227,7 @@ sub GenerateFiles
 		HAVE_BACKTRACE_SYMBOLS => undef,
 		HAVE_BIO_GET_DATA => undef,
 		HAVE_BIO_METH_NEW => undef,
+ 		HAVE__BUILTIN_PREFETCH => undef,
 		HAVE_COMPUTED_GOTO => undef,
 		HAVE_COPYFILE => undef,
 		HAVE_COPYFILE_H => undef,
-- 
2.32.1 (Apple Git-133)

#28Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#27)
2 attachment(s)
Re: Prefetch the next tuple's memory during seqscans

On 10 Jul 2023, at 11:32, Daniel Gustafsson <daniel@yesql.se> wrote:

On 4 Apr 2023, at 06:50, David Rowley <dgrowleyml@gmail.com> wrote:

Updated patches attached.

This patch is marked Waiting on Author, but from the thread it seems Needs
Review is more apt. I've changed status and also attached a new version of the
patch as the posted v1 no longer applied due to changes in formatting for Perl
code.

..and again with both patches attached. Doh.

--
Daniel Gustafsson

Attachments:

v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchapplication/octet-stream; name=v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch; x-unix-mode=0644Download
From 72b3cbd8d53516451ee914f013188cfd67504c7c Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 10 Jul 2023 11:22:34 +0200
Subject: [PATCH v2] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5be8f0f08d..fedfd55cbc 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 4a8f652129..bfe2783bef 100755
--- a/configure
+++ b/configure
@@ -15976,6 +15976,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 0f802f3df4..25217c9af5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1823,6 +1823,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 0c44f19cb9..13c6dbf5a7 100644
--- a/meson.build
+++ b/meson.build
@@ -1691,10 +1691,11 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
-  'ctz',
   'constant_p',
+  'ctz',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index f69d739be5..d6b79ac277 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -409,6 +409,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index d03f6e8de8..c15d272a5e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -559,6 +559,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index f09b3826ff..eccfde51f7 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -227,6 +227,7 @@ sub GenerateFiles
 		HAVE_BACKTRACE_SYMBOLS => undef,
 		HAVE_BIO_GET_DATA => undef,
 		HAVE_BIO_METH_NEW => undef,
+ 		HAVE__BUILTIN_PREFETCH => undef,
 		HAVE_COMPUTED_GOTO => undef,
 		HAVE_COPYFILE => undef,
 		HAVE_COPYFILE_H => undef,
-- 
2.32.1 (Apple Git-133)

prefetch_in_PageRepairFragmentation.patchapplication/octet-stream; name=prefetch_in_PageRepairFragmentation.patch; x-unix-mode=0644Download
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 92994f8f39..501b886181 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -752,6 +752,8 @@ PageRepairFragmentation(Page page)
 				else
 					presorted = false;
 
+				pg_prefetch_mem((HeapTupleHeader) PageGetItem(page, lp));
+
 				if (unlikely(itemidptr->itemoff < (int) pd_upper ||
 							 itemidptr->itemoff >= (int) pd_special))
 					ereport(ERROR,
#29vignesh C
vignesh21@gmail.com
In reply to: Daniel Gustafsson (#28)
Re: Prefetch the next tuple's memory during seqscans

On Mon, 10 Jul 2023 at 15:04, Daniel Gustafsson <daniel@yesql.se> wrote:

On 10 Jul 2023, at 11:32, Daniel Gustafsson <daniel@yesql.se> wrote:

On 4 Apr 2023, at 06:50, David Rowley <dgrowleyml@gmail.com> wrote:

Updated patches attached.

This patch is marked Waiting on Author, but from the thread it seems Needs
Review is more apt. I've changed status and also attached a new version of the
patch as the posted v1 no longer applied due to changes in formatting for Perl
code.

..and again with both patches attached. Doh.

I'm seeing that there has been no activity in this thread for more
than 6 months; I'm planning to close this in the current commitfest
unless someone is planning to take it forward.

Regards,
Vignesh

#30David Rowley
dgrowleyml@gmail.com
In reply to: vignesh C (#29)
Re: Prefetch the next tuple's memory during seqscans

On Sat, 20 Jan 2024 at 16:35, vignesh C <vignesh21@gmail.com> wrote:

I'm seeing that there has been no activity in this thread for more
than 6 months; I'm planning to close this in the current commitfest
unless someone is planning to take it forward.

Thanks for the reminder about this. Since the
heapgettup/heapgettup_pagemode refactor, I have been unable to see the
same performance gains as I did before.

Also, since reading "The Art of Writing Efficient Programs" I'm led to
believe that modern processor hardware prefetchers can detect and
prefetch on both forward and backward access patterns. I also saw some
discussion on twitter about this [1]https://twitter.com/ID_AA_Carmack/status/1470832912149135360.

I'm not sure yet how this translates to non-uniform access patterns,
e.g. when tuples are a varying number of cachelines apart and we only
deform the attributes in the first cacheline. Will the prefetcher still
see the pattern in this case? If the stride is non-uniform, how does it
know which cacheline to fetch? If a tuple spans multiple cachelines
and we deform the whole tuple, will accessing the next cacheline in a
forward direction make the hardware prefetcher forget about the more
general backward access that's going on?

These are questions I'll need to answer before I can understand what
the best thing to do in this area is. The only way to tell is to design
a benchmark and see how far we can go before the hardware prefetcher no
longer detects the pattern.
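
For what it's worth, a standalone microbenchmark along those lines could
look roughly like the sketch below. This is purely an illustration, not part
of the thread's patches; the buffer size, stride and jitter choices are
arbitrary assumptions. Building it twice, with and without
-DUSE_SW_PREFETCH, and sweeping the stride/jitter arguments should give a
rough idea of where the hardware prefetcher stops keeping up with a
backward, non-uniform walk.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (256 * 1024 * 1024)   /* well beyond L3 size */
#define CACHELINE  64

int
main(int argc, char **argv)
{
    int     stride = (argc > 1) ? atoi(argv[1]) : 2;    /* base stride, in cache lines */
    int     jitter = (argc > 2) ? atoi(argv[2]) : 0;    /* extra random lines, 0..jitter */
    char   *buf = malloc(BUF_SIZE);
    volatile long sum = 0;
    struct timespec t0, t1;

    if (buf == NULL)
        return 1;
    memset(buf, 1, BUF_SIZE);
    srand(42);

    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* walk the buffer backwards, touching one byte per visited cache line */
    for (long pos = BUF_SIZE - CACHELINE; pos >= 0;)
    {
        long    step = (long) (stride + (jitter ? rand() % (jitter + 1) : 0)) * CACHELINE;
        long    next = pos - step;

#ifdef USE_SW_PREFETCH
        /* explicitly hint the line we will touch on the next iteration */
        if (next >= 0)
            __builtin_prefetch(buf + next);
#endif
        sum += buf[pos];
        pos = next;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("stride=%d jitter=%d sum=%ld elapsed=%.3f ms\n",
           stride, jitter, (long) sum,
           (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
    free(buf);
    return 0;
}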

I've withdrawn the patch. I can resubmit once I've done some more
experimentation if that experimentation yields positive results.

David

[1]: https://twitter.com/ID_AA_Carmack/status/1470832912149135360