Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Started by Cedric Villemain about 2 years ago, 8 messages
#1 Cedric Villemain
Cedric.Villemain+pgsql@abcSQL.com
3 attachment(s)

Hi,

I wonder what you think of making pg_prewarm use the recent additions to
smgrprefetch (range support) and smgrreadv?

In order to try it out, I did it anyway in the attached patches. They contain
no doc updates, but I will add them if there is interest.

In summary:

1. The first one adds a new check on the parameters (verifying that the last
block is indeed not before the first block).
The consequence is that an ERROR is raised instead of silently doing nothing.

2. The second one implements smgrprefetch with a range, looping per segment
by default so there is still a check for interrupts.

3. The third one uses smgrreadv instead of smgrread, by default on a range
of 8 buffers. I am absolutely unsure that I used readv correctly...

Q: posix_fadvise may not work exactly the way you think it does, or does
it?

In details, and for the question:

It's not so obvious that the "feature" is really required or wanted; it
depends on what the expectations are from the user's point of view.

The kernel decides what to do with posix_fadvise calls, and how we pass
parameters does impact that decision.
With the current situation, where prefetch is done step by step, block by
block, most of the time all the blocks probably do get loaded, even though
the ones from the beginning of the relation may already have been discarded
by the end of the prefetch.

However, if instead you provide a real range, or the magic len=0, to
posix_fadvise, then how much gets loaded depends more on effective VM
pressure (which is not the case in the previous example).
As a result only a small part of the relation might be loaded, and this is
probably not what end-users expect, despite probably being a good choice
(you can still free the cache beforehand to help the kernel).
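
To make the two calling patterns concrete, here is a minimal standalone C
sketch (not pg_prewarm code; the file path is a placeholder and the block
size is hardcoded) showing one WILLNEED per 8kB block versus a single
WILLNEED over the whole file:

```c
/*
 * Minimal sketch of the two posix_fadvise calling patterns discussed
 * above.  Not pg_prewarm code; the path is a placeholder.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	const char *path = "/path/to/relation/segment";	/* placeholder */
	int         fd = open(path, O_RDONLY);
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0)
	{
		perror(path);
		return 1;
	}

	/* Pattern 1: one advice call per 8kB block (current pg_prewarm). */
	for (off_t off = 0; off < st.st_size; off += BLCKSZ)
		(void) posix_fadvise(fd, off, BLCKSZ, POSIX_FADV_WILLNEED);

	/*
	 * Pattern 2: one advice call covering the whole file (len = 0 means
	 * "to the end").  The kernel is free to read less of it depending on
	 * memory pressure.
	 */
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

	close(fd);
	return 0;
}
```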

An example: below I'm using vm_relation_cachestat(), which provides Linux
cachestat output, vm_relation_fadvise() to unload the cache, and pg_prewarm
for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |        0
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |        0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
    132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |      320
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |      320  <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |    65536
      32768 |       32768 |    65536 |    65536
      65536 |       32768 |    65536 |    65536
      98304 |       32768 |    65536 |    65536
     131072 |        1672 |     3344 |     3344

```

The duration of the last example is also really significant: the full
relation takes 0.3 ms and block by block takes 1550 ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading durations for similar "async" work;
here each call is from block 0 with a len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8 kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct durations on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

Thank you for taking the time to read this longer-than-expected email.

Comments?

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

Attachments:

0001-Check-last-block-greater-or-equal-to-first-block-in-.patch
From eeb24acd782f1dc2afff80fccfcec442521b9622 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <Cedric.Villemain@Data-Bene.io>
Date: Wed, 3 Jan 2024 19:03:22 +0100
Subject: [PATCH 1/3] Check last block greater or equal to first block in
 pg_prewarm

This prevents a useless call and informs users of an unexpected parameter
value.
---
 contrib/pg_prewarm/pg_prewarm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 01fc2c8ad9..1de1b39482 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -133,10 +133,11 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	else
 	{
 		last_block = PG_GETARG_INT64(4);
-		if (last_block < 0 || last_block >= nblocks)
+		if (last_block < first_block || last_block >= nblocks)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					 errmsg("ending block number must be between 0 and %lld",
+					 errmsg("ending block number must be between %lld and %lld",
+							(long long) (first_block),
 							(long long) (nblocks - 1))));
 	}
 
-- 
2.39.2

0002-Allow-pg_prewarm-to-use-new-smgrprefetch-range.patch
From 8fd2a0611a13ca40fe7d4c28a27257ba0f7e2a7a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <Cedric.Villemain@Data-Bene.io>
Date: Wed, 3 Jan 2024 19:08:37 +0100
Subject: [PATCH 2/3] Allow pg_prewarm to use new smgrprefetch range

This commit replaces PrefetchBuffer with smgrprefetch (with range).
It reduces the number of calls needed to prefetch the data.

I also defined a PREWARM_PREFETCH_RANGE which is by default the size of
a segment, so there is a chance to interrupt the prefetching (on very
large tables it might be desirable).
---
 contrib/pg_prewarm/pg_prewarm.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 1de1b39482..aea07927db 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -25,6 +25,8 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
+#define PREWARM_PREFETCH_RANGE	RELSEG_SIZE
+
 PG_MODULE_MAGIC;
 
 PG_FUNCTION_INFO_V1(pg_prewarm);
@@ -156,11 +158,21 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		 * no practical way to do that at present without a gross modularity
 		 * violation, so we just do this.
 		 */
-		for (block = first_block; block <= last_block; ++block)
+		for (block = first_block; block <= last_block;
+			 block += PREWARM_PREFETCH_RANGE)
 		{
+			int seek = Min(PREWARM_PREFETCH_RANGE, (last_block - block + 1));
+
+			/*
+			 * if handling a multi-TB relation, we need a way to interrupt the
+			 * prefetching: smgrprefetch (mdprefetch) will loop on all segments
+			 * without interruption so we use a range and keep the following
+			 * CHECK in place
+			 */
 			CHECK_FOR_INTERRUPTS();
-			PrefetchBuffer(rel, forkNumber, block);
-			++blocks_done;
+
+			smgrprefetch(RelationGetSmgr(rel), forkNumber, block, seek);
+			blocks_done += seek;
 		}
 #else
 		ereport(ERROR,
-- 
2.39.2

0003-Allow-pg_prewarm-to-use-new-smgrreadv.patch
From cb2fa6f0245b28a1d3f83ac901eb0d8698f22dd3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <Cedric.Villemain@Data-Bene.io>
Date: Wed, 3 Jan 2024 20:16:37 +0100
Subject: [PATCH 3/3] Allow pg_prewarm to use new smgrreadv

This commit replaces smgrread with smgrreadv (with range).

I also defined a PREWARM_READ_RANGE, 8 by default, which is the
number of buffers to read in one call.

I just discovered pg_readv so I hope the initialization I wrote is correct.

I'm not really sure there is a huge win because all blocks are sequential
and readv should merge them rather than use vectorization (if I understood
correctly).
---
 contrib/pg_prewarm/pg_prewarm.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index aea07927db..e11694d707 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -26,6 +26,7 @@
 #include "utils/rel.h"
 
 #define PREWARM_PREFETCH_RANGE	RELSEG_SIZE
+#define PREWARM_READ_RANGE	8
 
 PG_MODULE_MAGIC;
 
@@ -38,7 +39,7 @@ typedef enum
 	PREWARM_BUFFER,
 } PrewarmType;
 
-static PGIOAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffers[PREWARM_READ_RANGE];
 
 /*
  * pg_prewarm(regclass, mode text, fork text,
@@ -187,11 +188,19 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		 * buffers.  This is more portable than prefetch mode (it works
 		 * everywhere) and is synchronous.
 		 */
-		for (block = first_block; block <= last_block; ++block)
+		char	*buffers[PREWARM_READ_RANGE];
+		for (int i=0; i < PREWARM_READ_RANGE; i++)
+			buffers[i] = blockbuffers[i].data;
+		for (block = first_block; block <= last_block;
+			 block += PREWARM_READ_RANGE)
 		{
+			int seek = Min(PREWARM_READ_RANGE, (last_block - block + 1));
+
 			CHECK_FOR_INTERRUPTS();
-			smgrread(RelationGetSmgr(rel), forkNumber, block, blockbuffer.data);
-			++blocks_done;
+
+			smgrreadv(RelationGetSmgr(rel), forkNumber, block, (void *) buffers,
+					  seek);
+			blocks_done += seek;
 		}
 	}
 	else if (ptype == PREWARM_BUFFER)
-- 
2.39.2

#2 Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Cedric Villemain (#1)
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Hi,

Thanks for working on this!

The patches apply cleanly on top of the current master and all tests pass.

On Thu, 4 Jan 2024 at 02:23, Cedric Villemain
<Cedric.Villemain+pgsql@abcsql.com> wrote:

Hi,

I wonder what you think of making pg_prewarm use recent addition on
smgrprefetch and readv ?

In order to try, I did it anyway in the attached patches. They contain
no doc update, but I will proceed if it is of interest.

In summary:

1. The first one adds a new check on parameters (checking last block is
indeed not before first block).
Consequence is an ERROR is raised instead of silently doing nothing.

This is a general improvement and can be committed without other patches.

2. The second one does implement smgrprefetch with range and loops by
default per segment to still have a check for interrupts.

It looks good codewise but RELSEG_SIZE is too big to prefetch. The man
page of posix_fadvise [1] states that: "The amount of data read may be
decreased by the kernel depending on virtual memory load. (A few
megabytes will usually be fully satisfied, and more is rarely
useful.)". It is trying to prefetch 1GB of data now. That could explain
your observation about the differences between nr_cache numbers.
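
For illustration only, capping the advice range at a couple of megabytes
while keeping the interrupt check would mean just shrinking the constant in
the 0002 loop; a rough, untested sketch of that loop (256 blocks, i.e. 2MB,
is an arbitrary value picked for this sketch):

```c
/* Sketch of the 0002 loop with a smaller, arbitrary chunk size. */
#define PREWARM_PREFETCH_RANGE	256		/* 2MB of 8kB blocks, for illustration */

for (block = first_block; block <= last_block;
	 block += PREWARM_PREFETCH_RANGE)
{
	int		nblocks = Min(PREWARM_PREFETCH_RANGE, (last_block - block + 1));

	CHECK_FOR_INTERRUPTS();

	smgrprefetch(RelationGetSmgr(rel), forkNumber, block, nblocks);
	blocks_done += nblocks;
}
```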

3. The third one provides smgrreadv instead of smgrread, by default on
a range of 8 buffers. I am absolutely unsure that I used readv correctly...

Looks good to me.

Q: posix_fadvise may not work exactly the way you think it does, or does
it ?

In details, and for the question:

It's not so obvious that the "feature" is really required or wanted,
depending on what are the expectations from user point of view.

The kernel decides on what to do with posix_fadvise calls, and how we
pass parameters does impact the decision.
With the current situation where prefetch is done step by step, block by
block, they are very probably most of the time all loaded even if those
from the beginning of the relation can be discarded at the end of the
prefetch.

However, if instead you provide a real range, or the magic len=0 to
posix_fadvise, then blocks are "more" loaded according to effective vm
pressure (which is not the case on the previous example).
As a result only a small part of the relation might be loaded, and this
is probably not what end-users expect despite being probably a good
choice (you can still free cache beforehand to help the kernel).

An example, below I'm using vm_relation_cachestat() which provides linux
cachestat output, and vm_relation_fadvise() to unload cache, and
pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 0
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 320
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 320 <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 65536
32768 | 32768 | 65536 | 65536
65536 | 32768 | 65536 | 65536
98304 | 32768 | 65536 | 65536
131072 | 1672 | 3344 | 3344

```

The duration of the last example is also really significant: full
relation is 0.3ms and block by block is 1550ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading duration for similar "async" work,
here each call is from block 0 with len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct duration on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

I confirm that there is a time difference between calling pg_prewarm
on the full relation and block by block, but IMO this is expected. When
pg_prewarm is called on the full relation, it does the initialization part
just once, but when it is called block by block, it does the initialization
for each call, right?

I ran 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively but I
could not see a time difference between the first run (first load) and
the consecutive runs. Am I doing something wrong?

[1]: https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

--
Regards,
Nazir Bilal Yavuz
Microsoft

#3 Cédric Villemain
Cedric.Villemain@abcSQL.com
In reply to: Nazir Bilal Yavuz (#2)
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Hi Nazir,

thank you for your review. I comment below.

On 05/03/2024 12:07, Nazir Bilal Yavuz wrote:

2. The second one does implement smgrprefetch with range and loops by
default per segment to still have a check for interrupts.

It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
page of posix_fadvise [1] states that: "The amount of data read may be
decreased by the kernel depending on virtual memory load. (A few
megabytes will usually be fully satisfied, and more is rarely
useful.)". It is trying to prefetch 1GB data now. That could explain
your observation about differences between nr_cache numbers.

From an "adminsys" point of view I will find beneficial to get a single
syscall per file, respecting the logic and behavior of underlying system
call.

The behavior is 100% OK, and in fact it might be a bad idea to prefetch
block by block, as the result is just to put more pressure on a system
that is already under pressure.

Though there are use cases, and it's nice to be able to do that at this
per-page level too.

About [1], it's a very old statement about resources. And Linux manages
part of the problem for us here, I think [2]:

/*
 * Chunk the readahead into 2 megabyte units, so that we don't pin too much
 * memory at once.
 */
void force_page_cache_ra(....)

Q: posix_fadvise may not work exactly the way you think it does, or does
it ?

In details, and for the question:

However, if instead you provide a real range, or the magic len=0 to
posix_fadvise, then blocks are "more" loaded according to effective vm
pressure (which is not the case on the previous example).
As a result only a small part of the relation might be loaded, and this
is probably not what end-users expect despite being probably a good
choice (you can still free cache beforehand to help the kernel).

I think it's a matter of documenting the feature well and, if at all
possible, as usual, not letting users be negatively impacted by default.

An example, below I'm using vm_relation_cachestat() which provides linux
cachestat output, and vm_relation_fadvise() to unload cache, and
pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 0
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 320
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 320 <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 65536
32768 | 32768 | 65536 | 65536
65536 | 32768 | 65536 | 65536
98304 | 32768 | 65536 | 65536
131072 | 1672 | 3344 | 3344

```

The duration of the last example is also really significant: full
relation is 0.3ms and block by block is 1550ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading duration for similar "async" work,
here each call is from block 0 with len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct duration on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

I confirm that there is a time difference between calling pg_prewarm
by full relation and block by block, but IMO this is expected. When
pg_prewarm is called by full relation, it does the initialization part
just once but when it is called block by block, it does initialization
for each call, right?

Not sure what initialization is here exactly; in my example with
WILLNEED/DONTNEED there are exactly the same code pattern and syscall
request(s), just the flag is distinct, so initialization costs are
expected to be very similar.
I'll try to move forward on those vm_relation functions into pgfincore
so it'll be easier to run similar tests and compare.

I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively but I
could not see the time difference between first run (first load) and
the consecutive runs. Am I doing something wrong?

Maybe the system is overloaded and thus, by the time you're done
prefetching the tail blocks, the head ones have been dropped already. So
looping on that leads to similar durations.
If it's already in cache and not removed from it, execution time is
stable. This point (in cache or not) is hard to guess right unless you
check the status, or you make sure to clean the cache first.

[1]: https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

[2]: https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L303

My apologies about the email address with a sub-address, which leads to
undelivered email. Please update to the current one.

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

#4 Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Cédric Villemain (#3)
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Hi,

On Wed, 6 Mar 2024 at 18:23, Cédric Villemain
<Cedric.Villemain@abcsql.com> wrote:

Hi Nazir,

thank you for your review. I comment below.

On 05/03/2024 12:07, Nazir Bilal Yavuz wrote:

2. The second one does implement smgrprefetch with range and loops by
default per segment to still have a check for interrupts.

It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
page of posix_fadvise [1] states that: "The amount of data read may be
decreased by the kernel depending on virtual memory load. (A few
megabytes will usually be fully satisfied, and more is rarely
useful.)". It is trying to prefetch 1GB data now. That could explain
your observation about differences between nr_cache numbers.

From an "adminsys" point of view I will find beneficial to get a single
syscall per file, respecting the logic and behavior of underlying system
call.

I agree.

The behavior is 100% OK, and in fact it might a bad idea to prefetch
block by block as the result is just to put more pressure on a system if
it is already under pressure.

Though there are use cases and it's nice to be able to do that too at
this per page level.

Yes, I do not know which one is more important: caching more blocks but
creating more pressure, or creating less pressure but caching fewer blocks.
Also, pg_prewarm is designed to be run at startup, so I guess there
will not be much pressure.

About [1], it's very old statement about resources. And Linux manages a
part of the problem for us here I think [2]:

/*
* Chunk the readahead into 2 megabyte units, so that we don't pin too much
* memory at once.
*/
void force_page_cache_ra(....)

Thanks for pointing out the actual code. Yes, it looks like the kernel
is already doing that. I would like to do more testing when you port the
vm_relation functions into pgfincore.

Q: posix_fadvise may not work exactly the way you think it does, or does
it ?

In details, and for the question:

However, if instead you provide a real range, or the magic len=0 to
posix_fadvise, then blocks are "more" loaded according to effective vm
pressure (which is not the case on the previous example).
As a result only a small part of the relation might be loaded, and this
is probably not what end-users expect despite being probably a good
choice (you can still free cache beforehand to help the kernel).

I think it's a matter of documenting well the feature, and if at all
possible, as usual, not let users be negatively impacted by default.

An example, below I'm using vm_relation_cachestat() which provides linux
cachestat output, and vm_relation_fadvise() to unload cache, and
pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 0
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 320
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 320 <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 65536
32768 | 32768 | 65536 | 65536
65536 | 32768 | 65536 | 65536
98304 | 32768 | 65536 | 65536
131072 | 1672 | 3344 | 3344

```

The duration of the last example is also really significant: full
relation is 0.3ms and block by block is 1550ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading duration for similar "async" work,
here each call is from block 0 with len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct duration on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

I confirm that there is a time difference between calling pg_prewarm
by full relation and block by block, but IMO this is expected. When
pg_prewarm is called by full relation, it does the initialization part
just once but when it is called block by block, it does initialization
for each call, right?

Not sure what initialization is here exactly, in my example with
WILLNEED/DONTNEED there are exactly the same code pattern and syscall
request(s), just the flag is distinct, so initialization cost are
expected to be very similar.

Sorry, there was a miscommunication. I was talking about pg_prewarm's
initialization, meaning that if pg_prewarm is called block by block (by
using generate_series), it will do the initialization block_count times,
and if it is called on the full relation it will just do it once, but it
seems that is not the case; see below.

I'll try to move forward on those vm_relation functions into pgfincore
so it'll be easier to run similar tests and compare.

Thanks, that will be helpful for the testing.

I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively but I
could not see the time difference between first run (first load) and
the consecutive runs. Am I doing something wrong?

Maybe the system is overloaded and thus by the time you're done
prefetching tail blocks, the heads ones have been dropped already. So
looping on that leads to similar duration.
If it's already in cache and not removed from it, execution time is
stable. This point (in cache or not) is hard to guess right until you do
check the status, or you ensure to clean it first.

My bad. I was trying to drop buffers from the postgres cache, not from
the kernel cache. See my results now:

patched | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 11.395 ms
$ second_run_full_relation_prefetch [5] -> Time: 0.887 ms

master | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 3208.944 ms
$ second_run_full_relation_prefetch [5] -> Time: 283.905 ms

I did more perf tests about comparison between first and second run
for the prefetch and found this on master:

first run:
- 86.40% generic_fadvise
    - 86.24% force_page_cache_ra
        - 85.99% page_cache_ra_unbounded
            + 37.36% filemap_add_folio
            + 34.14% read_pages
            + 8.31% folio_alloc
            + 4.55% up_read
                0.77% xa_load
second run:
- 20.64% generic_fadvise
    - 18.64% force_page_cache_ra
        - 17.46% page_cache_ra_unbounded
            + 8.54% xa_load
            2.82% down_read
            2.29% read_pages
            1.45% up_read

So, it looks like the difference between the first and the second run
comes from a kernel optimization that does not prefetch if the page is
already in the cache [6]. That said, I do not know about the
WILLNEED/DONTNEED difference and I do not have enough material to test
it, but I guess it is something similar.
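
For what it's worth, the first-run vs second-run difference can also be
reproduced outside postgres with a tiny harness like the sketch below
(untested here, the path is a placeholder); it only times the advice
syscall itself, which is where the perf profile above spends its time:

```c
/*
 * Standalone sketch: time two consecutive POSIX_FADV_WILLNEED passes
 * over the same file.  Run it right after dropping the kernel cache
 * (echo 3 > /proc/sys/vm/drop_caches); the path is a placeholder.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double
elapsed_ms(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int
main(void)
{
	int			fd = open("/path/to/relation/segment", O_RDONLY);
	struct timespec t0, t1;

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	for (int run = 1; run <= 2; run++)
	{
		clock_gettime(CLOCK_MONOTONIC, &t0);
		(void) posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("run %d: %.3f ms\n", run, elapsed_ms(t0, t1));
	}

	close(fd);
	return 0;
}
```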

I did not test read performance but I am planning to do that soon.

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

[2] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L303

[3]:
CREATE EXTENSION pg_prewarm;
drop table if exists foo;
create table foo ( id int, c text) with (autovacuum_enabled=false);
insert into foo select i, repeat('a', 1000) from generate_series(1,10000000)i;

[4]: echo 3 | sudo tee /proc/sys/vm/drop_caches

[5]: select pg_prewarm('foo', 'prefetch', 'main');

[6]: https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L232

--
Regards,
Nazir Bilal Yavuz
Microsoft

#5 Cédric Villemain
Cedric.Villemain@abcSQL.com
In reply to: Nazir Bilal Yavuz (#4)
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Hi Nazir,

On 07/03/2024 12:19, Nazir Bilal Yavuz wrote:

On Wed, 6 Mar 2024 at 18:23, Cédric Villemain
<Cedric.Villemain@abcsql.com> wrote:

The behavior is 100% OK, and in fact it might a bad idea to prefetch
block by block as the result is just to put more pressure on a system if
it is already under pressure.

Though there are use cases and it's nice to be able to do that too at
this per page level.

Yes, I do not know which one is more important, cache more blocks but
create more pressure or create less pressure but cache less blocks.
Also, pg_prewarm is designed to be run at startup so I guess there
will not be much pressure.

autoprewarm is designed for that purpose, but pg_prewarm is free to use
when needed.

About [1], it's very old statement about resources. And Linux manages a
part of the problem for us here I think [2]:

/*
* Chunk the readahead into 2 megabyte units, so that we don't pin too much
* memory at once.
*/
void force_page_cache_ra(....)

Thanks for pointing out the actual code. Yes, it looks like the kernel
is already doing that. I would like to do more testing when you
forward vm_relation functions into pgfincore.

I hope to be able to get back there next week max.

An example, below I'm using vm_relation_cachestat() which provides linux
cachestat output, and vm_relation_fadvise() to unload cache, and
pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 0
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 320
32768 | 32768 | 65536 | 0
65536 | 32768 | 65536 | 0
98304 | 32768 | 65536 | 0
131072 | 1672 | 3344 | 320 <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
0 | 32768 | 65536 | 65536
32768 | 32768 | 65536 | 65536
65536 | 32768 | 65536 | 65536
98304 | 32768 | 65536 | 65536
131072 | 1672 | 3344 | 3344

```

The duration of the last example is also really significant: full
relation is 0.3ms and block by block is 1550ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading duration for similar "async" work,
here each call is from block 0 with len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct duration on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

I confirm that there is a time difference between calling pg_prewarm
by full relation and block by block, but IMO this is expected. When
pg_prewarm is called by full relation, it does the initialization part
just once but when it is called block by block, it does initialization
for each call, right?

Not sure what initialization is here exactly, in my example with
WILLNEED/DONTNEED there are exactly the same code pattern and syscall
request(s), just the flag is distinct, so initialization cost are
expected to be very similar.

Sorry, there was a miscommunication. I was talking about pg_prewarm's
initialization, meaning if the pg_prewarm is called block by block (by
using generate_series); it will make block_count times initialization
and if it is called by full relation it will just do it once but it
seems that is not the case, see below.

OK.

I'll try to move forward on those vm_relation functions into pgfincore
so it'll be easier to run similar tests and compare.

Thanks, that will be helpful for the testing.

I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively but I
could not see the time difference between first run (first load) and
the consecutive runs. Am I doing something wrong?

Maybe the system is overloaded and thus by the time you're done
prefetching tail blocks, the heads ones have been dropped already. So
looping on that leads to similar duration.
If it's already in cache and not removed from it, execution time is
stable. This point (in cache or not) is hard to guess right until you do
check the status, or you ensure to clean it first.

My bad. I was trying to drop buffers from the postgres cache, not from
the kernel cache. See my results now:

patched | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 11.395 ms
$ second_run_full_relation_prefetch [5] -> Time: 0.887 ms

master | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 3208.944 ms
$ second_run_full_relation_prefetch [5] -> Time: 283.905 ms

I did more perf tests about comparison between first and second run
for the prefetch and found this on master:

first run:
- 86.40% generic_fadvise
- 86.24% force_page_cache_ra
- 85.99% page_cache_ra_unbounded
+ 37.36% filemap_add_folio
+ 34.14% read_pages
+ 8.31% folio_alloc
+ 4.55% up_read
0.77% xa_load
second run:
- 20.64% generic_fadvise
- 18.64% force_page_cache_ra
- 17.46% page_cache_ra_unbounded
+ 8.54% xa_load
2.82% down_read
2.29% read_pages
1.45% up_read

So, it looks like the difference between the first and the second run
comes from kernel optimization that does not do prefetch if the page
is already in the cache [6]. Saying that, I do not know the difference
between WILLNEED/DONTNEED and I do not have enough materials to test
it but I guess it is something similar.

Patched: clearly, only a small part has been read and put into VM during
the first pass (but still some pages), and the second pass probably did
nothing at all.
Master: apparently it takes around 3.2 seconds to read everything (which
underlines that the first pass, patched, read little). On the second pass
it's already in cache, so it goes fast; you're correct. But given it
still required 283.905 ms, there is something.
You may want to check the status with vm_relation_cachestat() [7]; it's
in a branch, not main or master. It requires Linux 6.5, but it allows
getting information about memory eviction, which is super handy (and
super fast)!
It returns the following (a minimal sketch of the underlying syscall is
below, after the list):
 - nr_cache: number of cached pages
 - nr_dirty: number of dirty pages
 - nr_writeback: number of pages marked for writeback
 - nr_evicted: number of pages evicted from the cache
 - nr_recently_evicted: number of pages recently evicted from the cache
/*
 * A page is recently evicted if its last eviction was recent enough that
 * its reentry to the cache would indicate that it is actively being used
 * by the system, and that there is memory pressure on the system.
 */
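
Under the hood this is the new Linux cachestat() syscall. A minimal
standalone C sketch of calling it (this is not the pgfincore code; it
assumes kernel headers recent enough to provide struct cachestat and
__NR_cachestat, and the path is a placeholder):

```c
/*
 * Minimal sketch of the Linux 6.5+ cachestat() syscall.  Not pgfincore
 * code; assumes <linux/mman.h> defines struct cachestat(_range) and that
 * __NR_cachestat is available.  The path is a placeholder.
 */
#include <fcntl.h>
#include <linux/mman.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int
main(void)
{
	int			fd = open("/path/to/relation/segment", O_RDONLY);
	struct cachestat_range range = {0, 0};	/* len = 0 means "to end of file" */
	struct cachestat cs;

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	if (syscall(__NR_cachestat, fd, &range, &cs, 0) != 0)
	{
		perror("cachestat");
		return 1;
	}

	printf("cache=%llu dirty=%llu writeback=%llu evicted=%llu recently_evicted=%llu\n",
		   (unsigned long long) cs.nr_cache,
		   (unsigned long long) cs.nr_dirty,
		   (unsigned long long) cs.nr_writeback,
		   (unsigned long long) cs.nr_evicted,
		   (unsigned long long) cs.nr_recently_evicted);

	close(fd);
	return 0;
}
```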

The WILLNEED posix_fadvise flag leads to what used to be called
"prefetch": reading from disk and putting into VM (it's not that simple,
but that's the idea).
DONTNEED flushes from VM.

It might be interesting to compare with prewarm called on each block of
the relation; one way to do it with the current patch is to change the
constant:
#define PREWARM_PREFETCH_RANGE    RELSEG_SIZE

RELSEG_SIZE is 131072 by default.

Here you can set it to 1 and you'll have prewarm working on all pages,
one by one, which should be similar to the current behavior.
In pgfincore I have a "range" parameter for that purpose, so end-users
can adjust it exactly as desired.
I was not looking to change pg_prewarm's function parameters, but if it's
better...

I did not test read performance but I am planning to do that soon.

Nice, thank you for the effort!

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

[2] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L303

[3]
CREATE EXTENSION pg_prewarm;
drop table if exists foo;
create table foo ( id int, c text) with (autovacuum_enabled=false);
insert into foo select i, repeat('a', 1000) from generate_series(1,10000000)i;

[4] echo 3 | sudo tee /proc/sys/vm/drop_caches

[5] select pg_prewarm('foo', 'prefetch', 'main');

[6] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L232

[7]: https://github.com/klando/pgfincore/blob/vm_relation_cachestat/pgfincore--1.3.1--1.4.0.sql#L54

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

#6 Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Cédric Villemain (#5)
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Hi,

On Thu, 7 Mar 2024 at 15:26, Cédric Villemain
<Cedric.Villemain@abcsql.com> wrote:

On 07/03/2024 12:19, Nazir Bilal Yavuz wrote:

I did not test read performance but I am planning to do that soon.

I did not have the time to check the other things you mentioned, but I
tested the read performance. The table size is 5.5 GB; I did 20 runs in
total.

When the OS cache is cleared:

Master -> Median: 2266.293ms, Avg: 2265.5038ms
Patched -> Median: 2166.493ms, Avg: 2183.6208ms

When the buffers are in the OS cache:

Master -> Median: 585.719ms, Avg: 583.5032ms
Patched -> Median: 533.071ms, Avg: 532.7119ms

The patched version is better in both cases: ~4% when the OS cache is
cleared, ~9% when the buffers are in the OS cache.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#7 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Nazir Bilal Yavuz (#6)
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I did not have the time to check other things you mentioned but I
tested the read performance. The table size is 5.5GB, I did 20 runs in
total.

Hi Nazir!

Do you plan to review anything else? Or do you think it is worth a look by someone else? Or is the patch Ready for Committer? If so, please set the CF entry [0] to the corresponding status; currently it's "Waiting on Author".

Best regards, Andrey Borodin.

[0]: https://commitfest.postgresql.org/47/4763/

#8 Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Andrey M. Borodin (#7)
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Hi Andrey,

On Sun, 7 Apr 2024 at 08:29, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I did not have the time to check other things you mentioned but I
tested the read performance. The table size is 5.5GB, I did 20 runs in
total.

Hi Nazir!

Do you plan to review anything else? Or do you think it worth to look at by someone else? Or is the patch Ready for Committer? If so, please swap CF entry [0] to status accordingly, currently it's "Waiting on Author".

Thanks for reminding me! I think this needs review by someone else
(especially the prefetch part) so I changed it to 'Needs review'.

--
Regards,
Nazir Bilal Yavuz
Microsoft