Query regarding pg_prewarm extension

Started by Ayush Vatsaabout 1 year ago7 messages
#1Ayush Vatsa
ayushvatsa1810@gmail.com

Hi PostgreSQL Community,

I have a question regarding the use of the pg_prewarm() function [1]https://www.postgresql.org/docs/current/pgprewarm.html#PGPREWARM-FUNCS
in the pg_prewarm extension. The function requires a relation name
and a range of pages (e.g., pages 10 to 50) to be warmed by shifting
them from disk to the buffer cache.

How can I decide which range of pages to prewarm?
I assume that it is related to hot pages in the relation,
but how can I identify which pages are likely to be hot
before they are even in the buffer cache?
Additionally, since tuples within a page can move to
different pages over time (due to operations like VACUUM FULL or REINDEX),
how should I handle this when selecting the pages to prewarm?

Any insights would be greatly appreciated.

Regards,
Ayush Vatsa

[1]: https://www.postgresql.org/docs/current/pgprewarm.html#PGPREWARM-FUNCS

#2Jeremy Schneider
schneider@ardentperf.com
In reply to: Ayush Vatsa (#1)
Re: Query regarding pg_prewarm extension

On Fri, 13 Dec 2024 16:16:16 +0530
Ayush Vatsa <ayushvatsa1810@gmail.com> wrote:

How can I decide which range of pages to prewarm?
I assume that it is related to hot pages in the relation,
but how can I identify which pages are likely to be hot
before they are even in the buffer cache?
Additionally, since tuples within a page can move to
different pages over time (due to operations like VACUUM FULL or
REINDEX), how should I handle this when selecting the pages to
prewarm?

For my part, I've only used the block offsets when I wanted to fire off
several jobs in parallel, attempting to prewarm a relation faster. I've
never tried to track the location of specific rows for purposes of
prewarming.

You might try the "autoprewarm" feature. After adding pg_prewarm to
your shared_preload_libraries, it will automatically keep track of the
contents of the buffer cache and after a restart it will automatically
prewarm the buffer cache with the blocks that were there before.

https://www.enterprisedb.com/blog/autoprewarm-new-functionality-pgprewarm

Alternatively you could just prewarm a few of your most important hot
tables and indexes with a script after restarts.

For most smaller databases, slightly slower performance for a short
period after startup isn't a problem - while reading blocks from disk
for the first time. After the first read, blocks that are frequently
accessed will remain in the cache. The Postgres cache management
algorithm works well in general.

This is my two cents, anyway

-Jeremy

--
http://about.me/jeremy_schneider

#3Bruce Momjian
bruce@momjian.us
In reply to: Jeremy Schneider (#2)
1 attachment(s)
Re: Query regarding pg_prewarm extension

On Fri, Dec 13, 2024 at 05:20:05PM -0800, Jeremy Schneider wrote:

On Fri, 13 Dec 2024 16:16:16 +0530
Ayush Vatsa <ayushvatsa1810@gmail.com> wrote:

How can I decide which range of pages to prewarm?
I assume that it is related to hot pages in the relation,
but how can I identify which pages are likely to be hot
before they are even in the buffer cache?
Additionally, since tuples within a page can move to
different pages over time (due to operations like VACUUM FULL or
REINDEX), how should I handle this when selecting the pages to
prewarm?

For my part, I've only used the block offsets when I wanted to fire off
several jobs in parallel, attempting to prewarm a relation faster. I've
never tried to track the location of specific rows for purposes of
prewarming.

You might try the "autoprewarm" feature. After adding pg_prewarm to
your shared_preload_libraries, it will automatically keep track of the
contents of the buffer cache and after a restart it will automatically
prewarm the buffer cache with the blocks that were there before.

https://www.enterprisedb.com/blog/autoprewarm-new-functionality-pgprewarm

Alternatively you could just prewarm a few of your most important hot
tables and indexes with a script after restarts.

For most smaller databases, slightly slower performance for a short
period after startup isn't a problem - while reading blocks from disk
for the first time. After the first read, blocks that are frequently
accessed will remain in the cache. The Postgres cache management
algorithm works well in general.

This is my two cents, anyway

It feels like we should document what the block range is used for, so
attached is a doc patch to do that.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

Attachments:

prewarm.difftext/x-diff; charset=us-asciiDownload
diff --git a/doc/src/sgml/pgprewarm.sgml b/doc/src/sgml/pgprewarm.sgml
index 75f45b91b67..6907f214eb8 100644
--- a/doc/src/sgml/pgprewarm.sgml
+++ b/doc/src/sgml/pgprewarm.sgml
@@ -35,8 +35,10 @@ pg_prewarm(regclass, mode text default 'buffer', fork text default 'main',
    The fourth argument is the first block number to prewarm
    (<literal>NULL</literal> is accepted as a synonym for zero).  The fifth
    argument is the last block number to prewarm (<literal>NULL</literal>
-   means prewarm through the last block in the relation).  The return value
-   is the number of blocks prewarmed.
+   means prewarm through the last block in the relation).  The block
+   range allows a single relation to be loaded in parallel using multiple
+   concurent function calls.  The return value is the number of blocks
+   prewarmed.
   </para>
 
   <para>
#4David Rowley
dgrowleyml@gmail.com
In reply to: Bruce Momjian (#3)
Re: Query regarding pg_prewarm extension

On Sun, 29 Dec 2024 at 14:00, Bruce Momjian <bruce@momjian.us> wrote:

It feels like we should document what the block range is used for, so
attached is a doc patch to do that.

-   means prewarm through the last block in the relation).  The return value
-   is the number of blocks prewarmed.
+   means prewarm through the last block in the relation).  The block
+   range allows a single relation to be loaded in parallel using multiple
+   concurent function calls.  The return value is the number of blocks
+   prewarmed.

hmm, do we really need to highlight one specific usage for the range
like this? I think mentioning this could just confuse readers as it
makes it sound like using a range is going to magically run something
in parallel. I was confused to what you were talking about until I
read what Jeremy had written in his email.

Another equally legitimate use case would be if the user only wanted
to prewarm a subset of the relation... Actually, I'd imagine that's
probably more common than someone trying to speed this up by kicking
off multiple queries each with their own range. I imagine there's less
need to use the range to speed this up now that we have read steams
and likely there will be even less need when AIO is in.

I think the current wording is ok as it is. But if I'm outvoted,
"concurent" needs another 'r'.

David

#5Ayush Vatsa
ayushvatsa1810@gmail.com
In reply to: David Rowley (#4)
Re: Query regarding pg_prewarm extension

hmm, do we really need to highlight one specific usage for the range
like this? I think mentioning this could just confuse readers as it
makes it sound like using a range is going to magically run something
in parallel.

I believe highlighting that particular use case would indeed be helpful,
but not directly through a change in the documentation. Instead, it
could be better conveyed through an example.

When I initially read the documentation, I found it unclear how someone
would practically use the range feature. For instance, how would a user
determine the specific range of pages they need in the buffer cache?
Since PostgreSQL doesn’t store data in a fixed order and the order can
change over time due to operations like vacuum or updates, this could
be confusing. Hence, an example illustrating the use case would be
valuable.

Another equally legitimate use case would be if the user only wanted
to prewarm a subset of the relation... Actually, I'd imagine that's
probably more common than someone trying to speed this up by kicking
off multiple queries each with their own range.

To me, using a range of pages to prewarm a relation doesn’t seem like a
common use case. For example, if a user calls prewarm(100, 200),
how would they decide those specific numbers? While it’s possible to
inspect the contents of those pages, as Jeremy noted, users typically
don’t track the location of specific rows for prewarming purposes.

Regards,
Ayush Vatsa
AWS Aurora

#6Bruce Momjian
bruce@momjian.us
In reply to: Ayush Vatsa (#5)
Re: Query regarding pg_prewarm extension

On Mon, Dec 30, 2024 at 10:54:21PM +0530, Ayush Vatsa wrote:

When I initially read the documentation, I found it unclear how someone
would practically use the range feature. For instance, how would a user
determine the specific range of pages they need in the buffer cache?
Since PostgreSQL doesn’t store data in a fixed order and the order can
change over time due to operations like vacuum or updates, this could
be confusing. Hence, an example illustrating the use case would be
valuable.

Another equally legitimate use case would be if the user only wanted
to prewarm a subset of the relation... Actually, I'd imagine that's
probably more common than someone trying to speed this up by kicking
off multiple queries each with their own range.

To me, using a range of pages to prewarm a relation doesn’t seem like a
common use case. For example, if a user calls prewarm(100, 200),
how would they decide those specific numbers? While it’s possible to
inspect the contents of those pages, as Jeremy noted, users typically
don’t track the location of specific rows for prewarming purposes.

Yeah, the ranage-of-blocks case is rare, and I hoped to explain it in
the docs, but it seems like that isn't helping, so I retract my patch.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#7David Rowley
dgrowleyml@gmail.com
In reply to: Ayush Vatsa (#5)
Re: Query regarding pg_prewarm extension

On Tue, 31 Dec 2024 at 06:24, Ayush Vatsa <ayushvatsa1810@gmail.com> wrote:

To me, using a range of pages to prewarm a relation doesn’t seem like a
common use case. For example, if a user calls prewarm(100, 200),
how would they decide those specific numbers? While it’s possible to
inspect the contents of those pages, as Jeremy noted, users typically
don’t track the location of specific rows for prewarming purposes.

It's probably rarely useful for that exact reason, however, for
insert-only tables (where only recently inserted data is queried),
which pages recently inserted rows are in is more predictable. I can
imagine that someone might want to load some recent percentage of the
table using these parameters.

David