PG18 GIN parallel index build crash - invalid memory alloc request size
Testing PostgreSQL 18.0 on Debian from PGDG repo: 18.0-1.pgdg12+3 with
PostGIS 3.6.0+dfsg-2.pgdg12+1. Running the osm2pgsql workload to load the
entire OSM Planet data set in my home lab system.
I found a weird crash in the recently adjusted parallel GIN index
build code. Two parallel workers spawn, one of them crashes, and then
everything terminates. This is one of the last steps in OSM loading,
and I can reproduce it just by running the one statement again:
gis=# CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags);
ERROR: invalid memory alloc request size 1113001620
I see that this area of the code was being triaged during early beta
back in May; it may need another round.
The table is 215 GB. The server has 128GB and only 1/3 of that is nailed
down, so there's plenty of RAM available.
Settings include:
work_mem=1GB
maintenance_work_mem=20GB
shared_buffers=48GB
max_parallel_workers_per_gather = 8
Log files show a number of similarly big allocations working before then,
here's an example:
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp161831.0.fileset/0.1",
size 1073741824
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING BTREE
(osm_id)
ERROR: invalid memory alloc request size 1137667788
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags)
CONTEXT: parallel worker
And another one to show the size at crash is a little different each time:
ERROR: Database error: ERROR: invalid memory alloc request size 1115943018
Hooked into the error message and it gave this stack trace:
#0 errfinish (filename=0x5646de247420
"./build/../src/backend/utils/mmgr/mcxt.c",
lineno=1174, funcname=0x5646de2477d0 <__func__.3>
"MemoryContextSizeFailure")
at ./build/../src/backend/utils/error/elog.c:476
#1 0x00005646ddb4ae9c in MemoryContextSizeFailure (
context=context@entry=0x56471ce98c90, size=size@entry=1136261136,
flags=flags@entry=0) at ./build/../src/backend/utils/mmgr/mcxt.c:1174
#2 0x00005646de05898d in MemoryContextCheckSize (flags=0, size=1136261136,
context=0x56471ce98c90) at
./build/../src/include/utils/memutils_internal.h:172
#3 MemoryContextCheckSize (flags=0, size=1136261136,
context=0x56471ce98c90)
at ./build/../src/include/utils/memutils_internal.h:167
#4 AllocSetRealloc (pointer=0x7f34f558b040, size=1136261136, flags=0)
at ./build/../src/backend/utils/mmgr/aset.c:1203
#5 0x00005646ddb701c8 in GinBufferStoreTuple (buffer=0x56471cee0d10,
tup=0x7f34dfdd2030) at
./build/../src/backend/access/gin/gininsert.c:1497
#6 0x00005646ddb70503 in _gin_process_worker_data (progress=<optimized
out>,
worker_sort=0x56471cf13638, state=0x7ffc288b0200)
at ./build/../src/backend/access/gin/gininsert.c:1926
#7 _gin_parallel_scan_and_build (state=state@entry=0x7ffc288b0200,
ginshared=ginshared@entry=0x7f4168a5d360,
sharedsort=sharedsort@entry=0x7f4168a5d300, heap=heap@entry
=0x7f41686e5280,
index=index@entry=0x7f41686e4738, sortmem=<optimized out>,
progress=<optimized out>) at
./build/../src/backend/access/gin/gininsert.c:2046
#8 0x00005646ddb71ebf in _gin_parallel_build_main (seg=<optimized out>,
toc=0x7f4168a5d000) at
./build/../src/backend/access/gin/gininsert.c:2159
#9 0x00005646ddbdf882 in ParallelWorkerMain (main_arg=<optimized out>)
at ./build/../src/backend/access/transam/parallel.c:1563
#10 0x00005646dde40670 in BackgroundWorkerMain (startup_data=<optimized
out>,
startup_data_len=<optimized out>)
at ./build/../src/backend/postmaster/bgworker.c:843
#11 0x00005646dde42a45 in postmaster_child_launch (
child_type=child_type@entry=B_BG_WORKER, child_slot=320,
startup_data=startup_data@entry=0x56471cdbc8f8,
startup_data_len=startup_data_len@entry=1472,
client_sock=client_sock@entry=0x0)
at ./build/../src/backend/postmaster/launch_backend.c:290
#12 0x00005646dde44265 in StartBackgroundWorker (rw=0x56471cdbc8f8)
at ./build/../src/backend/postmaster/postmaster.c:4157
#13 maybe_start_bgworkers () at
./build/../src/backend/postmaster/postmaster.c:4323
#14 0x00005646dde45b13 in LaunchMissingBackgroundProcesses ()
at ./build/../src/backend/postmaster/postmaster.c:3397
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1717
#16 0x00005646dde47f6d in PostmasterMain (argc=argc@entry=5,
argv=argv@entry=0x56471cd66dc0)
at ./build/../src/backend/postmaster/postmaster.c:1400
#17 0x00005646ddb4d56c in main (argc=5, argv=0x56471cd66dc0)
at ./build/../src/backend/main/main.c:227
I've frozen my testing at the spot where I can reproduce the problem. I
was going to try dropping m_w_m next and turning off the parallel
execution. I didn't want to touch anything until after asking if there's
more data that should be collected from a crashing instance.
--
Greg Smith, Software Engineering
Snowflake - Where Data Does More
gregory.smith@snowflake.com
Hi,
On 10/24/25 05:03, Gregory Smith wrote:
Testing PostgreSQL 18.0 on Debian from PGDG repo: 18.0-1.pgdg12+3 with
PostGIS 3.6.0+dfsg-2.pgdg12+1. Running the osm2pgsql workload to load
the entire OSM Planet data set in my home lab system.
...
gis=# CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags);
ERROR: invalid memory alloc request size 1113001620
...
Settings include:
work_mem=1GB
maintenance_work_mem=20GB
shared_buffers=48GB
max_parallel_workers_per_gather = 8
Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
makes the issue go away?
Log files show a number of similarly big allocations working before
then, here's an example:

LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp161831.0.fileset/0.1",
size 1073741824
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING BTREE
(osm_id)
ERROR: invalid memory alloc request size 1137667788
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags)
CONTEXT: parallel worker
But that btree allocation is exactly 1GB, which is the palloc limit. And
IIRC the tuplesort code is doing palloc_huge, so that's probably why it
works fine. The GIN code, however, does a plain repalloc(), so it's
subject to the MaxAllocSize limit.
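
As a rough back-of-the-envelope sketch (assuming the stock MaxAllocSize of
0x3fffffff bytes and a 6-byte ItemPointerData, neither of which is spelled
out in the report itself), the failing request is only a few percent past
the 1GB cap:

#include <stdio.h>

/* Illustration only; mirrors the limits, not the actual PostgreSQL code. */
#define MAX_ALLOC_SIZE    ((size_t) 0x3fffffff)  /* MaxAllocSize, just under 1GB */
#define ITEM_POINTER_SIZE ((size_t) 6)           /* sizeof(ItemPointerData) */

int
main(void)
{
    size_t request = 1113001620;                 /* failing request from the report */

    printf("request      = %zu bytes\n", request);
    printf("MaxAllocSize = %zu bytes\n", MAX_ALLOC_SIZE);
    printf("overshoot    = %zu bytes\n", request - MAX_ALLOC_SIZE);
    printf("TIDs         = %zu\n", request / ITEM_POINTER_SIZE);
    return 0;
}

That works out to roughly 185 million TIDs, about 4% over the limit, so a
huge allocation (palloc_extended() with MCXT_ALLOC_HUGE) would accept the
same request without complaint.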
And another one to show the size at crash is a little different each time:

ERROR: Database error: ERROR: invalid memory alloc request size 1115943018

Hooked into the error message and it gave this stack trace:
...
#4 AllocSetRealloc (pointer=0x7f34f558b040, size=1136261136, flags=0)
at ./build/../src/backend/utils/mmgr/aset.c:1203
#5 0x00005646ddb701c8 in GinBufferStoreTuple (buffer=0x56471cee0d10,
tup=0x7f34dfdd2030) at ./build/../src/backend/access/gin/gininsert.c:1497
#6 0x00005646ddb70503 in _gin_process_worker_data (progress=<optimized out>,
worker_sort=0x56471cf13638, state=0x7ffc288b0200)
at ./build/../src/backend/access/gin/gininsert.c:1926
...

I've frozen my testing at the spot where I can reproduce the problem. I
was going to try dropping m_w_m next and turning off the parallel
execution. I didn't want to touch anything until after asking if
there's more data that should be collected from a crashing instance.
Hmm, so it's failing on the repalloc in GinBufferStoreTuple(), which is
merging the "GinTuple" into an in-memory buffer. I'll take a closer look
once I get back from pgconf.eu, but I guess I failed to consider that
the "parts" may be large enough to exceed MaxAlloc.
The code tries to flush the "frozen" part of the TID lists, i.e. the part
that can no longer change, but I think with m_w_m this large it could
happen that the first two buffers are already too large (and the trimming
only happens after the fact).
Can you show the contents of buffer and tup? I'm especially interested
in these fields:
buffer->nitems
buffer->maxitems
buffer->nfrozen
tup->nitems
If I'm right, I think there are two ways to fix this:
(1) apply the trimming earlier, i.e. try to freeze + flush before
actually merging the data (essentially, update nfrozen earlier)
(2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
Or we probably should do both.
regards
--
Tomas Vondra
On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <tomas@vondra.me> wrote:
Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
makes the issue go away?
The index builds at up to 4GB of m_w_m. 5GB and above crashes.
Now that I know roughly where the limits are, that de-escalates things a
bit. The sort of customers deploying a month after release should be OK
with just knowing to be careful about high m_w_m settings on PG18 until a
fix is ready.
Hope everyone is enjoying Latvia. My obscure music collection includes a
band from there I used to see in the NYC area, The Quags;
https://www.youtube.com/watch?v=Bg3P4736CxM
Can you show the contents of buffer and tup? I'm especially interested
in these fields:
buffer->nitems
buffer->maxitems
buffer->nfrozen
tup->nitems
I'll see if I can grab that data at the crash point.
FYI for anyone who wants to replicate this: if you have a system with
128GB+ of RAM you could probably recreate the test case. Just have to
download the Planet file and run osm2pgsql with the overly tweaked settings
I use. I've published all the details of how I run this regression test
now.
Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
Test runner:
https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
Parse results:
https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse
If I'm right, I think there are two ways to fix this:
(1) apply the trimming earlier, i.e. try to freeze + flush before
actually merging the data (essentially, update nfrozen earlier)
(2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
Or we probably should do both.
Sounds like (2) is probably mandatory and (1) is good hygiene.
--
Greg Smith, Software engineering
Snowflake - Where Data Does More
gregory.smith@snowflake.com
On 10/24/25 22:22, Gregory Smith wrote:
On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <tomas@vondra.me> wrote:

Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
makes the issue go away?

The index builds at up to 4GB of m_w_m. 5GB and above crashes.

Now that I know roughly where the limits are, that de-escalates things a
bit. The sort of customers deploying a month after release should be OK
with just knowing to be careful about high m_w_m settings on PG18 until
a fix is ready.

Hope everyone is enjoying Latvia. My obscure music collection includes
a band from there I used to see in the NYC area, The Quags;
https://www.youtube.com/watch?v=Bg3P4736CxM
Nice!
Can you show the contents of buffer and tup? I'm especially interested
in these fields:

buffer->nitems
buffer->maxitems
buffer->nfrozen
tup->nitems

I'll see if I can grab that data at the crash point.

FYI for anyone who wants to replicate this: if you have a system with
128GB+ of RAM you could probably recreate the test case. Just have to
download the Planet file and run osm2pgsql with the overly tweaked
settings I use. I've published all the details of how I run this
regression test now.

Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
Test runner: https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
Parse results: https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse
I did reproduce this using OSM, although I used different settings, but
that only affects loading. Setting maintenance_work_mem=20GB is more
than enough to trigger the error during parallel index build.

So I don't need the data.
If I'm right, I think there are two ways to fix this:
(1) apply the trimming earlier, i.e. try to freeze + flush before
actually merging the data (essentially, update nfrozen earlier)
(2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
Or we probably should do both.

Sounds like (2) is probably mandatory and (1) is good hygiene.
Yes, (2) is mandatory to fix this, and it's also sufficient. See the
attached fix. I'll clean this up and push soon.
AFAICS (1) is not really needed. I was concerned we might end up with
each worker producing a TID buffer close to maintenance_work_mem, and
then the leader would have to use twice as much memory when merging. But
it turns out I already thought about that, and the workers use a fair
share of maintenance_work_mem, not a new limit. So they produce smaller
chunks, and those should not exceed maintenance_work_mem when merging.
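
To make that concrete with the numbers from this report (and the added
assumption that the parallel build gives each participant an even share of
maintenance_work_mem before halving it for the accumulator, per the
sortmem / 2 assignment visible in the patches):

#include <stdio.h>

/* Rough illustration only, not PostgreSQL code. */
int
main(void)
{
    long long m_w_m = 20LL * 1024 * 1024 * 1024;      /* maintenance_work_mem = 20GB */
    int       participants = 2;                       /* two workers in the report */
    long long per_worker = m_w_m / participants / 2;  /* assumed sortmem / 2 share */

    printf("per-worker accumulator budget: %lld MB\n", per_worker >> 20);
    printf("TIDs that budget can hold    : ~%lld million\n",
           per_worker / 6 / 1000000);
    return 0;
}

So the leader never needs more than maintenance_work_mem overall, but a
single key's TID list coming from one worker can still hold far more than
the ~179 million TIDs that fit under MaxAllocSize, which is presumably why
(2) is the part that is mandatory.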
I tried "freezing" the existing buffer more eagerly (before merging the
tuple), but that made no difference. The workers produce data with a lot
of overlaps (simply because that's how the parallel builds divide data),
and the amount of trimmed data is tiny. Something like 10k TIDs from a
buffer of 1M TIDs. So a tiny difference, and it'd still fail.
I'm not against maybe experimenting with this, but it's going to be a
master-only thing, not for backpatching.
Maybe we should split the data into smaller chunks while building tuples
in ginFlushBuildState. That'd probably allow enforcing the memory limit
more strictly, because we sometimes hold multiple copies of the TIDs
arrays. But that's for master too.
regards
--
Tomas Vondra
Attachments:
gin-palloc-fix.patch (text/x-patch)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..3499a49a8f4 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1496,9 +1496,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
* still pass 0 as number of elements in that array though.
*/
if (buffer->items == NULL)
- buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ buffer->items = MemoryContextAllocHuge(CurrentMemoryContext, (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
else
- buffer->items = repalloc(buffer->items,
+ buffer->items = repalloc_huge(buffer->items,
(buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
On 10/26/25 16:16, Tomas Vondra wrote:
On 10/24/25 22:22, Gregory Smith wrote:
...
Sounds like (2) is probably mandatory and (1) is good hygiene.

Yes, (2) is mandatory to fix this, and it's also sufficient. See the
attached fix. I'll clean this up and push soon.
...
Maybe we should split the data into smaller chunks while building tuples
in ginFlushBuildState. That'd probably allow enforcing the memory limit
more strictly, because we sometimes hold multiple copies of the TIDs
arrays. But that's for master too.
I spoke too soon, apparently :-(
(2) is not actually a fix. It does fix some cases of the invalid alloc
size failures, but the following call to ginMergeItemPointers() can hit
that too, because it does a palloc() internally. I didn't notice this
before because of the other experimental changes, and because it seems to
depend on which of the OSM indexes is being built, with how many workers, etc.
I was a bit puzzled why we don't hit this with serial builds too, since
they call ginMergeItemPointers() as well. I guess that's just luck,
because with serial builds we're likely flushing the TID list in smaller
chunks, appending to an existing tuple. And it seems unlikely to cross
the alloc limit for any of those. But for parallel builds we're pretty
much guaranteed to see all TIDs for a key at once.
I see two ways to fix this:
a) Do the (re)palloc_huge change, but then also change the palloc call
in ginMergeItemPointers. I'm not sure if we want to change the existing
function, or create a static copy in gininsert.c with this tweak (it
doesn't need anything else, so it's not that bad).
b) Do the data splitting in ginFlushBuildState, so that workers don't
generate chunks larger than MaxAllocSize/nworkers (for any key). The
leader then merges at most one chunk per worker at a time, so it still
fits into the alloc limit.
Both seem to work. I like (a) more, because it's more consistent with
how I understand m_w_m. It's weird to say "use up to 20GB of memory" and
then the system overrides that with "1GB". I don't think it affects
performance, though.
I'll experiment with this a bit more, I just wanted to mention the fix I
posted earlier does not actually fix the issue.
I also wonder how far we are from hitting the uint32 limits. AFAICS with
m_w_m=24GB we might end up with too many elements, even with serial
index builds. It'd have to be a quite weird data set, though.
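
Rough arithmetic behind that 24GB figure (again assuming 6-byte TIDs and a
uint32 item counter, which the thread itself doesn't spell out):

#include <stdio.h>
#include <stdint.h>

/* Back-of-the-envelope: how much TID data can a uint32 counter cover? */
int
main(void)
{
    unsigned long long max_items = UINT32_MAX;    /* 4,294,967,295 items */
    unsigned long long bytes = max_items * 6ULL;  /* 6-byte ItemPointerData */

    printf("UINT32_MAX TIDs = ~%.1f GiB\n",
           (double) bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

That prints roughly 24.0 GiB, so a uint32 item count tops out right around
a 24GB maintenance_work_mem worth of TIDs.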
regards
--
Tomas Vondra
Attachments:
0001-a-use-huge-allocations.patch (text/x-patch)
From be3417e947a1f17e5fa4137481668779be532bc5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH 1/3] a: use huge allocations
---
src/backend/access/gin/gininsert.c | 81 +++++++++++++++++++++++++++---
1 file changed, 75 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..085c85718cc 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -198,6 +198,10 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
ItemPointerData *items, uint32 nitems,
Size *len);
+static ItemPointer mergeItemPointers(ItemPointerData *a, uint32 na,
+ ItemPointerData *b, uint32 nb,
+ int *nmerged);
+
/*
* Adds array of item pointers to tuple's posting list, or
* creates posting tree and tuple pointing to tree in case
@@ -1496,14 +1500,15 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
* still pass 0 as number of elements in that array though.
*/
if (buffer->items == NULL)
- buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ buffer->items = MemoryContextAllocHuge(CurrentMemoryContext,
+ (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
else
- buffer->items = repalloc(buffer->items,
- (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ buffer->items = repalloc_huge(buffer->items,
+ (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
- new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
- (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
- items, tup->nitems, &nnew);
+ new = mergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+ (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
+ items, tup->nitems, &nnew);
Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
@@ -2441,3 +2446,67 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
return ItemPointerCompare(GinTupleGetFirst(a),
GinTupleGetFirst(b));
}
+
+
+/*
+ * local copy, doing palloc_huge
+ */
+static ItemPointer
+mergeItemPointers(ItemPointerData *a, uint32 na,
+ ItemPointerData *b, uint32 nb,
+ int *nmerged)
+{
+ ItemPointerData *dst;
+
+ dst = (ItemPointer) MemoryContextAllocHuge(CurrentMemoryContext,
+ (na + nb) * sizeof(ItemPointerData));
+
+ /*
+ * If the argument arrays don't overlap, we can just append them to each
+ * other.
+ */
+ if (na == 0 || nb == 0 || ginCompareItemPointers(&a[na - 1], &b[0]) < 0)
+ {
+ memcpy(dst, a, na * sizeof(ItemPointerData));
+ memcpy(&dst[na], b, nb * sizeof(ItemPointerData));
+ *nmerged = na + nb;
+ }
+ else if (ginCompareItemPointers(&b[nb - 1], &a[0]) < 0)
+ {
+ memcpy(dst, b, nb * sizeof(ItemPointerData));
+ memcpy(&dst[nb], a, na * sizeof(ItemPointerData));
+ *nmerged = na + nb;
+ }
+ else
+ {
+ ItemPointerData *dptr = dst;
+ ItemPointerData *aptr = a;
+ ItemPointerData *bptr = b;
+
+ while (aptr - a < na && bptr - b < nb)
+ {
+ int cmp = ginCompareItemPointers(aptr, bptr);
+
+ if (cmp > 0)
+ *dptr++ = *bptr++;
+ else if (cmp == 0)
+ {
+ /* only keep one copy of the identical items */
+ *dptr++ = *bptr++;
+ aptr++;
+ }
+ else
+ *dptr++ = *aptr++;
+ }
+
+ while (aptr - a < na)
+ *dptr++ = *aptr++;
+
+ while (bptr - b < nb)
+ *dptr++ = *bptr++;
+
+ *nmerged = dptr - dst;
+ }
+
+ return dst;
+}
--
2.51.0
0001-b-split-TID-lists-when-flushing.patch (text/x-patch)
From 1c47f44483939a0a7a47073a838164a3296a51c3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH] b: split TID lists when flushing
---
src/backend/access/gin/gininsert.c | 37 +++++++++++++++++++++---------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..eff78cc622d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -152,7 +152,9 @@ typedef struct
* only in the leader process.
*/
GinLeader *bs_leader;
- int bs_worker_id;
+
+ /* number of participating workers (including leader) */
+ int bs_num_workers;
/* used to pass information from workers to leader */
double bs_numtuples;
@@ -494,27 +496,39 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
OffsetNumber attnum;
TupleDesc tdesc = RelationGetDescr(index);
+ /* how many TIDs fit into a regular allocation (50% to keep slack) */
+ uint32 maxsize = (MaxAllocSize / sizeof(ItemPointerData) / 2);
+ maxsize /= buildstate->bs_num_workers; /* fair share per worker */
+
ginBeginBAScan(&buildstate->accum);
while ((list = ginGetBAEntry(&buildstate->accum,
&attnum, &key, &category, &nlist)) != NULL)
{
/* information about the key */
CompactAttribute *attr = TupleDescCompactAttr(tdesc, (attnum - 1));
+ uint32 start = 0;
- /* GIN tuple and tuple length */
- GinTuple *tup;
- Size tuplen;
+ /* split the list into smaller parts with up to maxsize items */
+ while (start < nlist)
+ {
+ /* GIN tuple and tuple length */
+ GinTuple *tup;
+ Size tuplen;
- /* there could be many entries, so be willing to abort here */
- CHECK_FOR_INTERRUPTS();
+ /* there could be many entries, so be willing to abort here */
+ CHECK_FOR_INTERRUPTS();
- tup = _gin_build_tuple(attnum, category,
- key, attr->attlen, attr->attbyval,
- list, nlist, &tuplen);
+ tup = _gin_build_tuple(attnum, category,
+ key, attr->attlen, attr->attbyval,
+ &list[start], Min(maxsize, nlist - start),
+ &tuplen);
- tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+ start += maxsize;
- pfree(tup);
+ tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+ pfree(tup);
+ }
}
MemoryContextReset(buildstate->tmpCtx);
@@ -2016,6 +2030,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
/* remember how much space is allowed for the accumulated entries */
state->work_mem = (sortmem / 2);
+ state->bs_num_workers = ginshared->scantuplesortstates;
/* Begin "partial" tuplesort */
state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
--
2.51.0
On Sun, Oct 26, 2025 at 5:52 PM Tomas Vondra <tomas@vondra.me> wrote:
I like (a) more, because it's more consistent with how I understand m_w_m.
It's weird to say "use up to 20GB of memory" and then the system overrides
that with "1GB". I don't think it affects performance, though.
There wasn't really that much gain from 1GB -> 20GB, I was using that
setting for QA purposes more than measured performance. During the early
parts of an OSM build, you need to have a big Node Cache to hit max speed,
1/2 or more of a ~90GB file. Once that part finishes, the 45GB+ cache
block frees up and index building starts. I just looked at how much was
just freed and thought "ehhh...split it in half and maybe 20GB maintenance
mem?" Results seemed a little better than the 1GB setting I started at, so
I've run with that 20GB setting since.
That was back in PG14 and so many bottlenecks have moved around. Since
reporting this bug I've done a set of PG18 tests with m_w_m=256MB, and one
of them just broke my previous record time running PG17. So even that size
setting seems fine.
I also wonder how far are we from hitting the uint32 limits. FAICS with
m_w_m=24GB we might end up with too many elements, even with serial
index builds. It'd have to be a quite weird data set, though.
Since I'm starting to doubt I ever really needed even 20GB, I wouldn't
stress over how important it is to support that much. I'll see if I can
trigger an overflow with a test case though; maybe it's worth protecting
against even if it's not a functional setting.
--
Greg Smith, Software Engineering
Snowflake - Where Data Does More
gregory.smith@snowflake.com
On 10/28/25 21:54, Gregory Smith wrote:
On Sun, Oct 26, 2025 at 5:52 PM Tomas Vondra <tomas@vondra.me> wrote:

I like (a) more, because it's more consistent with how I understand
m_w_m. It's weird to say "use up to 20GB of memory" and then the system
overrides that with "1GB". I don't think it affects performance, though.

There wasn't really that much gain from 1GB -> 20GB, I was using that
setting for QA purposes more than measured performance.
...
That was back in PG14 and so many bottlenecks have moved around. Since
reporting this bug I've done a set of PG18 tests with m_w_m=256MB, and
one of them just broke my previous record time running PG17. So even
that size setting seems fine.
Right, that matches my observations from testing the fixes.
I'd attribute this to caching effects when the accumulated GIN entries
fit into L3.
I also wonder how far we are from hitting the uint32 limits. AFAICS with
m_w_m=24GB we might end up with too many elements, even with serial
index builds. It'd have to be a quite weird data set, though.

Since I'm starting to doubt I ever really needed even 20GB, I wouldn't
stress over how important it is to support that much. I'll see if I can
trigger an overflow with a test case though; maybe it's worth protecting
against even if it's not a functional setting.
Yeah, I definitely want to protect against this. I believe similar
failures can happen even with much lower m_w_m values (possibly ~2-3GB),
although only with weird/skewed data sets. AFAICS a constant
single-element array would trigger this, but I haven't tested that.
Serial builds can fail with large maintenance_work_mem too, like this:
ERROR: posting list is too long
HINT: Reduce "maintenance_work_mem".
but it's deterministic, and it's actually a proper error message, not
just some weird "invalid alloc size".
Attached is a v3 of the patch series. 0001 and 0002 were already posted,
and I believe either of those would address the issue. 0003 is more of
an optimization, further reducing the memory usage.
I'm putting this through additional testing, which takes time. But it
seems there's still some loose end in 0001, as I just got the "invalid
alloc request" failure with it applied ... I'll take a look tomorrow.
regards
--
Tomas Vondra
Attachments:
v3-0001-Allow-parallel-GIN-builds-to-allocate-large-chunk.patch (text/x-patch)
From a287cd4e711fba07029d873c9049010599616b6d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH v3 1/3] Allow parallel GIN builds to allocate large chunks
The parallel GIN builds used palloc/repalloc to maintain TID lists,
which with high maintenance_work_mem values can lead to failures like
ERROR: invalid memory alloc request size 1113001620
The reason is that while merging intermediate worker data, we call
GinBufferStoreTuple() which coalesces the TID lists, and the result
may not fit into MaxAllocSize.
Fixed by allowing huge allocations when merging TID lists, including
an existing palloc call in ginMergeItemPointers().
Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
Backpatch-through: 18
---
src/backend/access/gin/gininsert.c | 7 ++++---
src/backend/access/gin/ginpostinglist.c | 3 ++-
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..2355b96b351 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1496,10 +1496,11 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
* still pass 0 as number of elements in that array though.
*/
if (buffer->items == NULL)
- buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+ MCXT_ALLOC_HUGE);
else
- buffer->items = repalloc(buffer->items,
- (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ buffer->items = repalloc_huge(buffer->items,
+ (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
(buffer->nitems - buffer->nfrozen), /* num of unfrozen */
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 48eadec87b0..4fe46135238 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -381,7 +381,8 @@ ginMergeItemPointers(ItemPointerData *a, uint32 na,
{
ItemPointerData *dst;
- dst = (ItemPointer) palloc((na + nb) * sizeof(ItemPointerData));
+ dst = (ItemPointer) palloc_extended((na + nb) * sizeof(ItemPointerData),
+ MCXT_ALLOC_HUGE);
/*
* If the argument arrays don't overlap, we can just append them to each
--
2.51.0
v3-0002-Split-TID-lists-during-parallel-GIN-build.patch (text/x-patch)
From dfe964ae6b9d8bbb58143ed3ebd74ccb44cb340e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 Oct 2025 21:23:37 +0100
Subject: [PATCH v3 2/3] Split TID lists during parallel GIN build
When building intermediate TID lists during parallel GIN builds, split
the sorted lists into smaller chunks, to limit the amount of memory
needed when merging the chunks later.
The leader may need to keep in memory up to one chunk per worker, and
possibly one extra chunk (before evicting some of the data). We limit
the chunk size so that memory usage does not exceed MaxAllocSize (1GB).
This is desirable even with huge allocations allowed. Larger chunks do
not improve performance, so the increased memory usage is useless.
Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
Backpatch-through: 18
---
src/backend/access/gin/gininsert.c | 48 +++++++++++++++++++++++-------
1 file changed, 37 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2355b96b351..d15e9a0cb0b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -152,7 +152,9 @@ typedef struct
* only in the leader process.
*/
GinLeader *bs_leader;
- int bs_worker_id;
+
+ /* number of participating workers (including leader) */
+ int bs_num_workers;
/* used to pass information from workers to leader */
double bs_numtuples;
@@ -483,6 +485,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
/*
* ginFlushBuildState
* Write all data from BuildAccumulator into the tuplesort.
+ *
+ * The number of TIDs written to the tuplesort at once is limited, to reduce
+ * the amount of memory needed when merging the intermediate results later.
+ * The leader will see up to two chunks per worker, so calculate the limit to
+ * not need more than MaxAllocSize overall.
*/
static void
ginFlushBuildState(GinBuildState *buildstate, Relation index)
@@ -493,6 +500,11 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
uint32 nlist;
OffsetNumber attnum;
TupleDesc tdesc = RelationGetDescr(index);
+ uint32 maxlen;
+
+ /* maximum number of TIDs per chunk (two chunks per worker) */
+ maxlen = MaxAllocSize / sizeof(ItemPointerData);
+ maxlen /= (2 * buildstate->bs_num_workers);
ginBeginBAScan(&buildstate->accum);
while ((list = ginGetBAEntry(&buildstate->accum,
@@ -501,20 +513,31 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
/* information about the key */
CompactAttribute *attr = TupleDescCompactAttr(tdesc, (attnum - 1));
- /* GIN tuple and tuple length */
- GinTuple *tup;
- Size tuplen;
+ /* start of the chunk */
+ uint32 offset = 0;
- /* there could be many entries, so be willing to abort here */
- CHECK_FOR_INTERRUPTS();
+ /* split the entry into smaller chunk with up to maxlen items */
+ while (offset < nlist)
+ {
+ /* GIN tuple and tuple length */
+ GinTuple *tup;
+ Size tuplen;
+ uint32 len = Min(maxlen, nlist - offset);
- tup = _gin_build_tuple(attnum, category,
- key, attr->attlen, attr->attbyval,
- list, nlist, &tuplen);
+ /* there could be many entries, so be willing to abort here */
+ CHECK_FOR_INTERRUPTS();
+
+ tup = _gin_build_tuple(attnum, category,
+ key, attr->attlen, attr->attbyval,
+ &list[offset], len,
+ &tuplen);
+
+ offset += maxlen;
- tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+ tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
- pfree(tup);
+ pfree(tup);
+ }
}
MemoryContextReset(buildstate->tmpCtx);
@@ -2018,6 +2041,9 @@ _gin_parallel_scan_and_build(GinBuildState *state,
/* remember how much space is allowed for the accumulated entries */
state->work_mem = (sortmem / 2);
+ /* remember how many workers participate in the build */
+ state->bs_num_workers = ginshared->scantuplesortstates;
+
/* Begin "partial" tuplesort */
state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
state->work_mem,
--
2.51.0
v3-0003-Trim-TIDs-during-parallel-GIN-builds-more-eagerly.patch (text/x-patch)
From af9327b95d1a2d016dc207a532e6d668d944b49c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 27 Oct 2025 23:58:10 +0100
Subject: [PATCH v3 3/3] Trim TIDs during parallel GIN builds more eagerly
The code froze the beginning of TID lists only when merging the lists,
which means we'd only trim the list when adding the next chunk. But we
can do better - we can update the number of frozen items earlier.
This is not expected to make a huge difference, but it can't hurt and
it's virtually free.
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
---
src/backend/access/gin/gininsert.c | 129 ++++++++++++++---------------
1 file changed, 64 insertions(+), 65 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d15e9a0cb0b..37d72811b9a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1401,6 +1401,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
static bool
GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
{
+ /*
+ * Try freeze TIDs at the beginning of the list, i.e. exclude them from
+ * the mergesort. We can do that with TIDs before the first TID in the new
+ * tuple we're about to add into the buffer.
+ *
+ * We do this incrementally when adding data into the in-memory buffer,
+ * and not later (e.g. when hitting a memory limit), because it allows us
+ * to skip the frozen data during the mergesort, making it cheaper.
+ */
+
+ /*
+ * Check if the last TID in the current list is frozen. This is the case
+ * when merging non-overlapping lists, e.g. in each parallel worker.
+ */
+ if ((buffer->nitems > 0) &&
+ (ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+ GinTupleGetFirst(tup)) == 0))
+ buffer->nfrozen = buffer->nitems;
+
+ /*
+ * Now find the last TID we know to be frozen, i.e. the last TID right
+ * before the new GIN tuple.
+ *
+ * Start with the first not-yet-frozen tuple, and walk until we find the
+ * first TID that's higher. If we already know the whole list is frozen
+ * (i.e. nfrozen == nitems), this does nothing.
+ *
+ * XXX This might do a binary search for sufficiently long lists, but it
+ * does not seem worth the complexity. Overlapping lists should be rare
+ * common, TID comparisons are cheap, and we should quickly freeze most of
+ * the list.
+ */
+ for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+ {
+ /* Is the TID after the first TID of the new tuple? Can't freeze. */
+ if (ItemPointerCompare(&buffer->items[i],
+ GinTupleGetFirst(tup)) > 0)
+ break;
+
+ buffer->nfrozen++;
+ }
+
/* not enough TIDs to trim (1024 is somewhat arbitrary number) */
if (buffer->nfrozen < 1024)
return false;
@@ -1445,6 +1487,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
ItemPointerData *items;
Datum key;
+ int nnew;
+ ItemPointer new;
+
AssertCheckGinBuffer(buffer);
key = _gin_parse_tuple_key(tup);
@@ -1466,80 +1511,34 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
buffer->key = (Datum) 0;
}
- /*
- * Try freeze TIDs at the beginning of the list, i.e. exclude them from
- * the mergesort. We can do that with TIDs before the first TID in the new
- * tuple we're about to add into the buffer.
- *
- * We do this incrementally when adding data into the in-memory buffer,
- * and not later (e.g. when hitting a memory limit), because it allows us
- * to skip the frozen data during the mergesort, making it cheaper.
- */
-
- /*
- * Check if the last TID in the current list is frozen. This is the case
- * when merging non-overlapping lists, e.g. in each parallel worker.
- */
- if ((buffer->nitems > 0) &&
- (ItemPointerCompare(&buffer->items[buffer->nitems - 1],
- GinTupleGetFirst(tup)) == 0))
- buffer->nfrozen = buffer->nitems;
+ /* add the new TIDs into the buffer, combine using merge-sort */
/*
- * Now find the last TID we know to be frozen, i.e. the last TID right
- * before the new GIN tuple.
- *
- * Start with the first not-yet-frozen tuple, and walk until we find the
- * first TID that's higher. If we already know the whole list is frozen
- * (i.e. nfrozen == nitems), this does nothing.
- *
- * XXX This might do a binary search for sufficiently long lists, but it
- * does not seem worth the complexity. Overlapping lists should be rare
- * common, TID comparisons are cheap, and we should quickly freeze most of
- * the list.
+ * Resize the array - we do this first, because we'll dereference the
+ * first unfrozen TID, which would fail if the array is NULL. We'll still
+ * pass 0 as number of elements in that array though.
*/
- for (int i = buffer->nfrozen; i < buffer->nitems; i++)
- {
- /* Is the TID after the first TID of the new tuple? Can't freeze. */
- if (ItemPointerCompare(&buffer->items[i],
- GinTupleGetFirst(tup)) > 0)
- break;
-
- buffer->nfrozen++;
- }
-
- /* add the new TIDs into the buffer, combine using merge-sort */
- {
- int nnew;
- ItemPointer new;
-
- /*
- * Resize the array - we do this first, because we'll dereference the
- * first unfrozen TID, which would fail if the array is NULL. We'll
- * still pass 0 as number of elements in that array though.
- */
- if (buffer->items == NULL)
- buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
- MCXT_ALLOC_HUGE);
- else
- buffer->items = repalloc_huge(buffer->items,
- (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ if (buffer->items == NULL)
+ buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+ MCXT_ALLOC_HUGE);
+ else
+ buffer->items = repalloc_huge(buffer->items,
+ (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
- new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
- (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
- items, tup->nitems, &nnew);
+ new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+ (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
+ items, tup->nitems, &nnew);
- Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+ Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
- memcpy(&buffer->items[buffer->nfrozen], new,
- nnew * sizeof(ItemPointerData));
+ memcpy(&buffer->items[buffer->nfrozen], new,
+ nnew * sizeof(ItemPointerData));
- pfree(new);
+ pfree(new);
- buffer->nitems += tup->nitems;
+ buffer->nitems += tup->nitems;
- AssertCheckItemPointers(buffer);
- }
+ AssertCheckItemPointers(buffer);
/* free the decompressed TID list */
pfree(items);
--
2.51.0
On 10/29/25 01:05, Tomas Vondra wrote:
...
Attached is a v3 of the patch series. 0001 and 0002 were already posted,
and I believe either of those would address the issue. 0003 is more of
an optimization, further reducing the memory usage.

I'm putting this through additional testing, which takes time. But it
seems there's still some loose end in 0001, as I just got the "invalid
alloc request" failure with it applied ... I'll take a look tomorrow.
Unsurprisingly, there were a couple more palloc/repalloc calls (in
ginPostingListDecodeAllSegments) that could fail with long TID lists
produced when merging worker data. The attached v4 fixes this.
However, I see this as a sign that allowing huge allocations is not the
right way to fix this. The GIN code generally assumes allocations stay
under MaxAllocSize, and reworking that assumption in a bugfix seems a bit
too invasive. And I'm not really certain this is the last place that
could hit this.
Another argument against 0001 is that using more memory does not really
help anything. It's not any faster or simpler. It's more like "let's use
the memory we have" rather than "let's use the memory we need".

So I'm planning to get rid of 0001, and fix this with 0002 or 0002+0003.
That seems like a better and (unexpectedly) less invasive fix.
regards
--
Tomas Vondra
Attachments:
v4-0001-Allow-parallel-GIN-builds-to-allocate-large-chunk.patch (text/x-patch)
From 27a68b43636233952bff8aefce08112194b39d40 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH v4 1/3] Allow parallel GIN builds to allocate large chunks
The parallel GIN builds used palloc/repalloc to maintain TID lists,
which with high maintenance_work_mem values can lead to failures like
ERROR: invalid memory alloc request size 1113001620
The reason is that while merging intermediate worker data, we call
GinBufferStoreTuple() which coalesces the TID lists, and the result
may not fit into MaxAllocSize.
Fixed by allowing huge allocations when merging TID lists, including
an existing palloc call in ginMergeItemPointers().
Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
Backpatch-through: 18
---
src/backend/access/gin/gininsert.c | 7 ++++---
src/backend/access/gin/ginpostinglist.c | 10 ++++++----
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..2355b96b351 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1496,10 +1496,11 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
* still pass 0 as number of elements in that array though.
*/
if (buffer->items == NULL)
- buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+ MCXT_ALLOC_HUGE);
else
- buffer->items = repalloc(buffer->items,
- (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ buffer->items = repalloc_huge(buffer->items,
+ (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
(buffer->nitems - buffer->nfrozen), /* num of unfrozen */
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 48eadec87b0..a60ea46204a 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -308,7 +308,8 @@ ginPostingListDecodeAllSegments(GinPostingList *segment, int len, int *ndecoded_
* Guess an initial size of the array.
*/
nallocated = segment->nbytes * 2 + 1;
- result = palloc(nallocated * sizeof(ItemPointerData));
+ result = palloc_extended(nallocated * sizeof(ItemPointerData),
+ MCXT_ALLOC_HUGE);
ndecoded = 0;
while ((char *) segment < endseg)
@@ -317,7 +318,7 @@ ginPostingListDecodeAllSegments(GinPostingList *segment, int len, int *ndecoded_
if (ndecoded >= nallocated)
{
nallocated *= 2;
- result = repalloc(result, nallocated * sizeof(ItemPointerData));
+ result = repalloc_huge(result, nallocated * sizeof(ItemPointerData));
}
/* copy the first item */
@@ -335,7 +336,7 @@ ginPostingListDecodeAllSegments(GinPostingList *segment, int len, int *ndecoded_
if (ndecoded >= nallocated)
{
nallocated *= 2;
- result = repalloc(result, nallocated * sizeof(ItemPointerData));
+ result = repalloc_huge(result, nallocated * sizeof(ItemPointerData));
}
val += decode_varbyte(&ptr);
@@ -381,7 +382,8 @@ ginMergeItemPointers(ItemPointerData *a, uint32 na,
{
ItemPointerData *dst;
- dst = (ItemPointer) palloc((na + nb) * sizeof(ItemPointerData));
+ dst = (ItemPointer) palloc_extended((na + nb) * sizeof(ItemPointerData),
+ MCXT_ALLOC_HUGE);
/*
* If the argument arrays don't overlap, we can just append them to each
--
2.51.0
v4-0002-Split-TID-lists-during-parallel-GIN-build.patch (text/x-patch)
From 21c64c4e036eea0e2bfa94bc8517de5b2bbed3b3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 Oct 2025 21:23:37 +0100
Subject: [PATCH v4 2/3] Split TID lists during parallel GIN build
When building intermediate TID lists during parallel GIN builds, split
the sorted lists into smaller chunks, to limit the amount of memory
needed when merging the chunks later.
The leader may need to keep in memory up to one chunk per worker, and
possibly one extra chunk (before evicting some of the data). We limit
the chunk size so that memory usage does not exceed MaxAllocSize (1GB).
This is desirable even with huge allocations allowed. Larger chunks do
not improve performance, so the increased memory usage is useless.
Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
Backpatch-through: 18
---
src/backend/access/gin/gininsert.c | 48 +++++++++++++++++++++++-------
1 file changed, 37 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2355b96b351..d15e9a0cb0b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -152,7 +152,9 @@ typedef struct
* only in the leader process.
*/
GinLeader *bs_leader;
- int bs_worker_id;
+
+ /* number of participating workers (including leader) */
+ int bs_num_workers;
/* used to pass information from workers to leader */
double bs_numtuples;
@@ -483,6 +485,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
/*
* ginFlushBuildState
* Write all data from BuildAccumulator into the tuplesort.
+ *
+ * The number of TIDs written to the tuplesort at once is limited, to reduce
+ * the amount of memory needed when merging the intermediate results later.
+ * The leader will see up to two chunks per worker, so calculate the limit to
+ * not need more than MaxAllocSize overall.
*/
static void
ginFlushBuildState(GinBuildState *buildstate, Relation index)
@@ -493,6 +500,11 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
uint32 nlist;
OffsetNumber attnum;
TupleDesc tdesc = RelationGetDescr(index);
+ uint32 maxlen;
+
+ /* maximum number of TIDs per chunk (two chunks per worker) */
+ maxlen = MaxAllocSize / sizeof(ItemPointerData);
+ maxlen /= (2 * buildstate->bs_num_workers);
ginBeginBAScan(&buildstate->accum);
while ((list = ginGetBAEntry(&buildstate->accum,
@@ -501,20 +513,31 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
/* information about the key */
CompactAttribute *attr = TupleDescCompactAttr(tdesc, (attnum - 1));
- /* GIN tuple and tuple length */
- GinTuple *tup;
- Size tuplen;
+ /* start of the chunk */
+ uint32 offset = 0;
- /* there could be many entries, so be willing to abort here */
- CHECK_FOR_INTERRUPTS();
+ /* split the entry into smaller chunk with up to maxlen items */
+ while (offset < nlist)
+ {
+ /* GIN tuple and tuple length */
+ GinTuple *tup;
+ Size tuplen;
+ uint32 len = Min(maxlen, nlist - offset);
- tup = _gin_build_tuple(attnum, category,
- key, attr->attlen, attr->attbyval,
- list, nlist, &tuplen);
+ /* there could be many entries, so be willing to abort here */
+ CHECK_FOR_INTERRUPTS();
+
+ tup = _gin_build_tuple(attnum, category,
+ key, attr->attlen, attr->attbyval,
+ &list[offset], len,
+ &tuplen);
+
+ offset += maxlen;
- tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+ tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
- pfree(tup);
+ pfree(tup);
+ }
}
MemoryContextReset(buildstate->tmpCtx);
@@ -2018,6 +2041,9 @@ _gin_parallel_scan_and_build(GinBuildState *state,
/* remember how much space is allowed for the accumulated entries */
state->work_mem = (sortmem / 2);
+ /* remember how many workers participate in the build */
+ state->bs_num_workers = ginshared->scantuplesortstates;
+
/* Begin "partial" tuplesort */
state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
state->work_mem,
--
2.51.0
v4-0003-Trim-TIDs-during-parallel-GIN-builds-more-eagerly.patch (text/x-patch)
From d14cdb4bf70bc30e6e3757b70ffa23c7d202a443 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 27 Oct 2025 23:58:10 +0100
Subject: [PATCH v4 3/3] Trim TIDs during parallel GIN builds more eagerly
The code froze the beginning of TID lists only when merging the lists,
which means we'd only trim the list when adding the next chunk. But we
can do better - we can update the number of frozen items earlier.
This is not expected to make a huge difference, but it can't hurt and
it's virtually free.
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
---
src/backend/access/gin/gininsert.c | 129 ++++++++++++++---------------
1 file changed, 64 insertions(+), 65 deletions(-)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d15e9a0cb0b..37d72811b9a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1401,6 +1401,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
static bool
GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
{
+ /*
+ * Try freeze TIDs at the beginning of the list, i.e. exclude them from
+ * the mergesort. We can do that with TIDs before the first TID in the new
+ * tuple we're about to add into the buffer.
+ *
+ * We do this incrementally when adding data into the in-memory buffer,
+ * and not later (e.g. when hitting a memory limit), because it allows us
+ * to skip the frozen data during the mergesort, making it cheaper.
+ */
+
+ /*
+ * Check if the last TID in the current list is frozen. This is the case
+ * when merging non-overlapping lists, e.g. in each parallel worker.
+ */
+ if ((buffer->nitems > 0) &&
+ (ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+ GinTupleGetFirst(tup)) == 0))
+ buffer->nfrozen = buffer->nitems;
+
+ /*
+ * Now find the last TID we know to be frozen, i.e. the last TID right
+ * before the new GIN tuple.
+ *
+ * Start with the first not-yet-frozen tuple, and walk until we find the
+ * first TID that's higher. If we already know the whole list is frozen
+ * (i.e. nfrozen == nitems), this does nothing.
+ *
+ * XXX This might do a binary search for sufficiently long lists, but it
+ * does not seem worth the complexity. Overlapping lists should be rare,
+ * TID comparisons are cheap, and we should quickly freeze most of
+ * the list.
+ */
+ for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+ {
+ /* Is the TID after the first TID of the new tuple? Can't freeze. */
+ if (ItemPointerCompare(&buffer->items[i],
+ GinTupleGetFirst(tup)) > 0)
+ break;
+
+ buffer->nfrozen++;
+ }
+
/* not enough TIDs to trim (1024 is somewhat arbitrary number) */
if (buffer->nfrozen < 1024)
return false;
@@ -1445,6 +1487,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
ItemPointerData *items;
Datum key;
+ int nnew;
+ ItemPointer new;
+
AssertCheckGinBuffer(buffer);
key = _gin_parse_tuple_key(tup);
@@ -1466,80 +1511,34 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
buffer->key = (Datum) 0;
}
- /*
- * Try freeze TIDs at the beginning of the list, i.e. exclude them from
- * the mergesort. We can do that with TIDs before the first TID in the new
- * tuple we're about to add into the buffer.
- *
- * We do this incrementally when adding data into the in-memory buffer,
- * and not later (e.g. when hitting a memory limit), because it allows us
- * to skip the frozen data during the mergesort, making it cheaper.
- */
-
- /*
- * Check if the last TID in the current list is frozen. This is the case
- * when merging non-overlapping lists, e.g. in each parallel worker.
- */
- if ((buffer->nitems > 0) &&
- (ItemPointerCompare(&buffer->items[buffer->nitems - 1],
- GinTupleGetFirst(tup)) == 0))
- buffer->nfrozen = buffer->nitems;
+ /* add the new TIDs into the buffer, combine using merge-sort */
/*
- * Now find the last TID we know to be frozen, i.e. the last TID right
- * before the new GIN tuple.
- *
- * Start with the first not-yet-frozen tuple, and walk until we find the
- * first TID that's higher. If we already know the whole list is frozen
- * (i.e. nfrozen == nitems), this does nothing.
- *
- * XXX This might do a binary search for sufficiently long lists, but it
- * does not seem worth the complexity. Overlapping lists should be rare
- * common, TID comparisons are cheap, and we should quickly freeze most of
- * the list.
+ * Resize the array - we do this first, because we'll dereference the
+ * first unfrozen TID, which would fail if the array is NULL. We'll still
+ * pass 0 as number of elements in that array though.
*/
- for (int i = buffer->nfrozen; i < buffer->nitems; i++)
- {
- /* Is the TID after the first TID of the new tuple? Can't freeze. */
- if (ItemPointerCompare(&buffer->items[i],
- GinTupleGetFirst(tup)) > 0)
- break;
-
- buffer->nfrozen++;
- }
-
- /* add the new TIDs into the buffer, combine using merge-sort */
- {
- int nnew;
- ItemPointer new;
-
- /*
- * Resize the array - we do this first, because we'll dereference the
- * first unfrozen TID, which would fail if the array is NULL. We'll
- * still pass 0 as number of elements in that array though.
- */
- if (buffer->items == NULL)
- buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
- MCXT_ALLOC_HUGE);
- else
- buffer->items = repalloc_huge(buffer->items,
- (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+ if (buffer->items == NULL)
+ buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+ MCXT_ALLOC_HUGE);
+ else
+ buffer->items = repalloc_huge(buffer->items,
+ (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
- new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
- (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
- items, tup->nitems, &nnew);
+ new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+ (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
+ items, tup->nitems, &nnew);
- Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+ Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
- memcpy(&buffer->items[buffer->nfrozen], new,
- nnew * sizeof(ItemPointerData));
+ memcpy(&buffer->items[buffer->nfrozen], new,
+ nnew * sizeof(ItemPointerData));
- pfree(new);
+ pfree(new);
- buffer->nitems += tup->nitems;
+ buffer->nitems += tup->nitems;
- AssertCheckItemPointers(buffer);
- }
+ AssertCheckItemPointers(buffer);
/* free the decompressed TID list */
pfree(items);
--
2.51.0
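To illustrate the freezing that 0003 moves into GinBufferShouldTrim, here is
a minimal standalone sketch, not taken from the tree: plain ints stand in
for ItemPointerData, and count_frozen is a made-up name. Because the
buffered TIDs are sorted, any prefix that does not sort after the first TID
of the incoming tuple can be frozen and excluded from every later mergesort
step, which is what the loop in the patch does with ItemPointerCompare.

#include <stdio.h>

/*
 * Count how many leading items of a sorted buffer can be "frozen", i.e. are
 * known to sort at or before the smallest item of the next incoming chunk
 * and therefore never need to participate in a merge again. This mirrors
 * the linear scan in GinBufferShouldTrim, with ints standing in for TIDs.
 */
static int
count_frozen(const int *items, int nitems, int nfrozen, int first_new)
{
	for (int i = nfrozen; i < nitems; i++)
	{
		if (items[i] > first_new)	/* sorts after the new chunk, stop */
			break;
		nfrozen++;
	}
	return nfrozen;
}

int
main(void)
{
	int		items[] = {1, 3, 5, 8, 13, 21};	/* sorted TIDs already buffered */
	int		nfrozen = 0;

	/* the next chunk starts at 9, so 1, 3, 5 and 8 can be frozen */
	nfrozen = count_frozen(items, 6, nfrozen, 9);
	printf("frozen prefix: %d of 6 items\n", nfrozen);

	return 0;
}

When the incoming lists do not overlap (e.g. within each parallel worker),
this freezes the whole buffer, and only the unfrozen tail is handed to
ginMergeItemPointers.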
On 10/29/25 19:47, Tomas Vondra wrote:
> ...
> Unsurprisingly, there were a couple more palloc/repalloc calls (in
> ginPostingListDecodeAllSegments) that could fail with long TID lists
> produced when merging worker data. The attached v4 fixes this.
>
> However, I see this as a sign that allowing huge allocations is not the
> right way to fix this. The GIN code generally assumes regular (not huge)
> allocations, and I don't think we should rework that in a bugfix - it
> seems a bit too invasive. And I'm not really certain this is the last
> place that could hit this.
>
> Another argument against 0001 is that using more memory does not really
> help anything. It's not any faster or simpler. It's more like "let's use
> the memory we have" rather than "let's use the memory we need".
>
> So I'm planning to get rid of 0001, and fix this with 0002 or 0002+0003.
> That seems like a better and (unexpectedly) less invasive fix.

I ended up pushing 0002, for the reasons explained above. This fixes the
allocation issue simply by not needing too much memory. The 0003 is more
of an optimization, so I pushed that only to master.
regards
--
Tomas Vondra