autoprewarm_dump_now

Started by Дарья Шанина · 9 months ago · 14 messages
#1 Дарья Шанина
vilensipkdm@gmail.com

Hello everyone!
I have a question.

Which would be better for the function autoprewarm_dump_now in the case
where we need to allocate memory that exceeds 1 GB:
1) allocate enough memory for the entire shared_buffers array (1..NBuffers)
using palloc_extended;
2) allocate the maximum currently possible amount of memory (1 GB) using an
ordinary palloc.

Thank you for your attention!

--
Best regards,
Daria Shanina

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Дарья Шанина (#1)
Re: autoprewarm_dump_now

On 04/04/2025 16:40, Дарья Шанина wrote:

Hello everyone!
I have a question.

Which would be better for the function autoprewarm_dump_now in the case
where we need to allocate memory that exceeds 1 GB:

Hmm, so if I counted right, sizeof(BlockInfoRecord) == 20 bytes, which
means that you can fit about 409 GB worth of buffers in a 1 GB
allocation. So autoprewarm will currently not work with shared_buffers >
409 GB. That's indeed quite unfortunate.
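(For the record, the arithmetic, assuming the ~1 GB MaxAllocSize limit of a
plain palloc, 20 bytes per BlockInfoRecord and the default 8 kB block size:

    1 GB / 20 bytes per record     ≈ 53.7 million BlockInfoRecords
    53.7 million buffers * 8192 B  ≈ 409 GB of shared_buffers)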

1) allocate enough memory for the entire shared_buffers array
(1..NBuffers) using palloc_extended;

That would be a pretty straightforward fix.

2) allocate the maximum currently possible amount of memory (1 GB) using
an ordinary palloc.

That'd put an upper limit on how much is prewarmed. It'd be a weird
limitation. And prewarming matters the most with large shared_buffers.

3) Don't pre-allocate the array, write it out in a streaming fashion.

Unfortunately the file format doesn't make that easy: the number of
entries is at the beginning of the file. You could count the entries
beforehand, but the buffers can change concurrently. You could write a
placeholder first, and seek back to the beginning of the file to fill in
the real number at the end. The problem with that is that the number of
bytes needed for the count itself varies. I suppose we could write some
spaces as placeholders to accommodate the max count.
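
For illustration only, a minimal stdio sketch of that placeholder idea; it
is hypothetical and not wired into the real autoprewarm dump format or its
error handling:

#include <stdio.h>

/*
 * Reserve a fixed-width count field up front, stream the records out,
 * then seek back and fill in the real count at the end.
 */
static void
dump_streaming(FILE *file)
{
	long		nwritten = 0;

	/* placeholder wide enough for any plausible record count */
	fprintf(file, "<<%10s>>\n", "");

	/* ... stream out one record per resident buffer, bumping nwritten ... */

	/* rewind and overwrite the placeholder with the real count */
	fflush(file);
	fseek(file, 0L, SEEK_SET);
	fprintf(file, "<<%10ld>>", nwritten);
}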

In apw_load_buffers(), we also load the file into (DSM) memory. There's
no similar 1 GB limit in dsm_create(), but I think it's a bit
unfortunate that the array needs to be allocated upfront upon loading.

In short, ISTM the easy answer here is "use palloc_extended". But
there's a lot of room for further optimizations.

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 Melanie Plageman
melanieplageman@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: autoprewarm_dump_now

On Fri, Apr 4, 2025 at 10:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

In apw_load_buffers(), we also load the file into (DSM) memory. There's
no similar 1 GB limit in dsm_create(), but I think it's a bit
unfortunate that the array needs to be allocated upfront upon loading.

Unrelated to this problem, but I wondered why autoprewarm doesn't
launch background workers for each database simultaneously instead of
waiting for each one to finish a db before moving onto the next one.
Is it simply to limit the number of bgworkers taking up resources?

- Melanie

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Melanie Plageman (#3)
Re: autoprewarm_dump_now

On Fri, Apr 4, 2025 at 12:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Unrelated to this problem, but I wondered why autoprewarm doesn't
launch background workers for each database simultaneously instead of
waiting for each one to finish a db before moving onto the next one.
Is it simply to limit the number of bgworkers taking up resources?

That's probably part of it, but also (1) a system that allowed for
multiple workers would be somewhat more complex to implement and (2)
I'm not sure how beneficial it would be. We go to some trouble to make
the I/O as sequential as possible, and this would detract from that. I
also don't know how long prewarming normally takes -- if it's fast
enough already, then maybe this doesn't matter. But if somebody is
having a problem with autoprewarm being slow and wants to implement a
multi-worker system to make it faster, cool.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5 Daria Shanina
vilensipkdm@gmail.com
In reply to: Robert Haas (#4)
1 attachment(s)
Re: autoprewarm_dump_now

Hello!

I have made a patch; now we can allocate more than 1 GB of memory for the
autoprewarm_dump_now function.

Best regards,

Daria Shanina

Fri, 4 Apr 2025 at 19:36, Robert Haas <robertmhaas@gmail.com>:


On Fri, Apr 4, 2025 at 12:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Unrelated to this problem, but I wondered why autoprewarm doesn't
launch background workers for each database simultaneously instead of
waiting for each one to finish a db before moving onto the next one.
Is it simply to limit the number of bgworkers taking up resources?

That's probably part of it, but also (1) a system that allowed for
multiple workers would be somewhat more complex to implement and (2)
I'm not sure how beneficial it would be. We go to some trouble to make
the I/O as sequential as possible, and this would detract from that. I
also don't know how long prewarming normally takes -- if it's fast
enough already, then maybe this doesn't matter. But if somebody is
having a problem with autoprewarm being slow and wants to implement a
multi-worker system to make it faster, cool.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-PGPRO-9971-Allocate-enough-memory-with-huge-share.patch (text/x-patch; charset=US-ASCII)
From 7b8af18231b539378ec4fc432186b551c08a7774 Mon Sep 17 00:00:00 2001
From: Darya Shanina <d.shanina@postgrespro.ru>
Date: Tue, 22 Apr 2025 11:27:01 +0300
Subject: [PATCH v1] [PGPRO-9971] Allocate enough memory with huge
 shared_buffers. Tags: commitfest_hotfix

---
 contrib/pg_prewarm/autoprewarm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index d061731706a..ea5f7bc49c9 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -598,7 +598,7 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 	}
 
 	block_info_array =
-		(BlockInfoRecord *) palloc(sizeof(BlockInfoRecord) * NBuffers);
+	(BlockInfoRecord *) palloc_extended((sizeof(BlockInfoRecord) * NBuffers), MCXT_ALLOC_HUGE);
 
 	for (num_blocks = 0, i = 0; i < NBuffers; i++)
 	{
-- 
2.43.0

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daria Shanina (#5)
Re: autoprewarm_dump_now

Daria Shanina <vilensipkdm@gmail.com> writes:

I have made a patch; now we can allocate more than 1 GB of memory for the
autoprewarm_dump_now function.

Is that solving a real-world problem? If it is, shouldn't we be
looking for a different approach that doesn't require such a huge
amount of memory?

regards, tom lane

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: autoprewarm_dump_now

On Thu, May 29, 2025 at 9:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Is that solving a real-world problem? If it is, shouldn't we be
looking for a different approach that doesn't require such a huge
amount of memory?

Upthread, Heikki said that this function currently fails with
shared_buffers>409GB. While I'm not opposed to a more efficient
solution, it seems reasonable to me to suppose that if you've got
shared_buffers>409GB, you're able to afford having this function use
>1GB.

--
Robert Haas
EDB: http://www.enterprisedb.com

#8 Daria Shanina
vilensipkdm@gmail.com
In reply to: Tom Lane (#6)
Re: autoprewarm_dump_now

Some of our clients encountered a problem — they needed to allocate
shared_buffers = 700 GB on a server with 1.5 TB RAM, and the error "invalid
memory alloc request size 1835008000" occurred. That is, these are not
mental exercises.
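
For what it's worth, the request size matches the 20-byte BlockInfoRecord
arithmetic upthread:

    700 GB / 8192 bytes per buffer  = 91,750,400 buffers
    91,750,400 * 20 bytes           = 1,835,008,000 bytes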

Best regards,

Daria Shanina

Thu, 29 May 2025 at 16:21, Tom Lane <tgl@sss.pgh.pa.us>:

Daria Shanina <vilensipkdm@gmail.com> writes:

I have made a patch; now we can allocate more than 1 GB of memory for the
autoprewarm_dump_now function.

Is that solving a real-world problem? If it is, shouldn't we be
looking for a different approach that doesn't require such a huge
amount of memory?

regards, tom lane

--
Best regards,
Шанина Дарья Александровна

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Daria Shanina (#8)
Re: autoprewarm_dump_now

On Fri, May 30, 2025 at 6:07 AM Daria Shanina <vilensipkdm@gmail.com> wrote:

Some of our clients encountered a problem — they needed to allocate shared_buffers = 700 GB on a server with 1.5 TB RAM, and the error "invalid memory alloc request size 1835008000" occurred. That is, these are not mental exercises.

I think the proposed patch should be committed and back-patched, after
fixing it so that it's pgindent-clean and adding a comment. Does
anyone have strong objection to that?

--
Robert Haas
EDB: http://www.enterprisedb.com

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: autoprewarm_dump_now

Robert Haas <robertmhaas@gmail.com> writes:

I think the proposed patch should be committed and back-patched, after
fixing it so that it's pgindent-clean and adding a comment. Does
anyone have strong objection to that?

Not here. I do wonder if we can't find a more memory-efficient way,
but I concur that any such change would likely not be back-patch
material.

regards, tom lane

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: autoprewarm_dump_now

On Tue, Jun 3, 2025 at 1:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I think the proposed patch should be committed and back-patched, after
fixing it so that it's pgindent-clean and adding a comment. Does
anyone have strong objection to that?

Not here. I do wonder if we can't find a more memory-efficient way,
but I concur that any such change would likely not be back-patch
material.

Cool. Regarding memory efficiency, I was just discussing radix trees
this morning with Sawada-san and others. "lib/radixtree.h" seems
unsuitable here because it supposes that each node should cover 1 byte
of key material, which is awkward for a 20-byte key, but I wonder if we
could reuse "common/blkreftable.h", which I created for incremental
backup and which seems like it could probably work here as well.

Implementation sketch: The "limit block" concept would be irrelevant
in this case, so you just wouldn't call BlockRefTableSetLimitBlock.
Instead, you'd call BlockRefTableMarkBlockModified for each
memory-resident block -- the name would be a misnomer, and we might
want to do some renaming. blkreftable.c/h already has infrastructure
to read and write the data structure from and to disk, so we wouldn't
need to reinvent that.
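
To make that concrete, here is a rough sketch of the dump side (illustrative
only; it assumes the existing entry points in common/blkreftable.h and
glosses over the unlogged-relation filtering, sorting, and the write
callback that the real code would need):

	BlockRefTable *brtab = CreateEmptyBlockRefTable();

	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);
		uint32		buf_state;
		BufferTag	tag;
		bool		valid;

		/* copy the buffer tag while holding the header spinlock */
		buf_state = LockBufHdr(bufHdr);
		valid = (buf_state & BM_TAG_VALID) != 0;
		if (valid)
			tag = bufHdr->tag;
		UnlockBufHdr(bufHdr, buf_state);

		/* record the block; do any allocation outside the spinlock */
		if (valid)
		{
			RelFileLocator rlocator = BufTagGetRelFileLocator(&tag);

			BlockRefTableMarkBlockModified(brtab, &rlocator,
										   BufTagGetForkNum(&tag),
										   tag.blockNum);
		}
	}

	/* then serialize it with the existing WriteBlockRefTable() machinery */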

The reason I think this might work out well is that for moderately
sparse block sets, the space usage is about 2 bytes per block, while
for dense block sets, it converges to 1 bit per block. If you
make the set of blocks super-duper sparse then it might use more
memory than what we do today, but I'm not sure that's a very realistic
scenario in practice -- you'd probably need to have a gigantic number
of relations in the database and absolutely no locality of access. See
the comment in blkreftable.c just above "#define BLOCKS_PER_CHUNK" for
what the format actually is.
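
As a rough point of comparison, for the 700 GB case reported upthread
(about 92 million buffers):

    flat array today:     91,750,400 * 20 bytes ≈ 1.8 GB (the failing palloc)
    blkreftable, sparse:  91,750,400 * ~2 bytes ≈ 180 MB
    blkreftable, dense:   91,750,400 * 1 bit    ≈ 11 MB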

--
Robert Haas
EDB: http://www.enterprisedb.com

#12 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#9)
Re: autoprewarm_dump_now

Hi,

On 2025-06-03 13:17:43 -0400, Robert Haas wrote:

On Fri, May 30, 2025 at 6:07 AM Daria Shanina <vilensipkdm@gmail.com> wrote:

Some of our clients encountered a problem — they needed to allocate shared_buffers = 700 GB on a server with 1.5 TB RAM, and the error "invalid memory alloc request size 1835008000" occurred. That is, these are not mental exercises.

I think the proposed patch should be committed and back-patched, after
fixing it so that it's pgindent-clean and adding a comment. Does
anyone have strong objection to that?

No, seems like a thing that pretty obviously should be fixed.

Greetings,

Andres Freund

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: autoprewarm_dump_now

On Tue, Jun 3, 2025 at 2:58 PM Andres Freund <andres@anarazel.de> wrote:

I think the proposed patch should be committed and back-patched, after
fixing it so that it's pgindent-clean and adding a comment. Does
anyone have strong objection to that?

No, seems like a thing that pretty obviously should be fixed.

Done.

*quakes in fear of incoming buildfarm results*

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Daria Shanina
vilensipkdm@gmail.com
In reply to: Robert Haas (#13)
1 attachment(s)
Re: autoprewarm_dump_now

Hi,
I ran pgindent and created a new patch. Here it is.

*quakes in fear of incoming buildfarm results*

I hope the buildfarm results will be good.

Fri, 6 Jun 2025 at 16:19, Robert Haas <robertmhaas@gmail.com>:

On Tue, Jun 3, 2025 at 2:58 PM Andres Freund <andres@anarazel.de> wrote:

I think the proposed patch should be committed and back-patched, after
fixing it so that it's pgindent-clean and adding a comment. Does
anyone have strong objection to that?

No, seems like a thing that pretty obviously should be fixed.

Done.

*quakes in fear of incoming buildfarm results*

--
Robert Haas
EDB: http://www.enterprisedb.com

--
Best regards,
Daria Shanina

Attachments:

v1-0001-PGPRO-9971-Allocate-enough-memory-with-huge-share_buffers.patch (text/x-patch; charset=US-ASCII)
From a4776e4ae0c2ffe3061b64dec7ac60d730db12b0 Mon Sep 17 00:00:00 2001
From: Darya Shanina <d.shanina@postgrespro.ru>
Date: Tue, 22 Apr 2025 11:27:01 +0300
Subject: [PATCH v1] [PGPRO-9971] Allocate enough memory with huge
 shared_buffers. Tags: commitfest_hotfix

---
 contrib/pg_prewarm/autoprewarm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index d061731706a..4dbda58eeae 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -597,8 +597,9 @@ apw_dump_now(bool is_bgworker, bool dump_unlogged)
 		return 0;
 	}
 
-	block_info_array =
-		(BlockInfoRecord *) palloc(sizeof(BlockInfoRecord) * NBuffers);
+	block_info_array = (BlockInfoRecord *)
+		palloc_extended((sizeof(BlockInfoRecord) * NBuffers),
+						 MCXT_ALLOC_HUGE);
 
 	for (num_blocks = 0, i = 0; i < NBuffers; i++)
 	{
-- 
2.43.0