Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

Started by Craig Milhiser · over 1 year ago · 27 messages · bugs
#1Craig Milhiser
craig@milhiser.com

Has the fix for the referenced bug in this discussion [1] been released? The
discussion mentions it was fixed. I read the release notes but do not
recognize this fix as documented. I do not want to hijack that thread.

I am using v16.3 - AWS Aurora. I have opened a support case with AWS also.

I have 4 queries that get this error message. The symptoms are the same -
thousands of temp files are created before the error is returned. Just
because I am getting the same error/symptoms does not mean it is the same
problem. If the fix has been released I can check against the AWS Aurora
version with their support staff.

I investigated one of the queries so far. When I turn off parallel
execution this query completes.

I am working on a reproducible example to send to the community. The query
is sensitive to the values in the predicates; most of the time these
queries work. So far I have not been able to get a minimal reproducible
example, but I am still working on it.

Thanks
Craig

[1]: /messages/by-id/18349-83d33dd3d0c855c3@postgresql.org

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Craig Milhiser (#1)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

Craig Milhiser <craig@milhiser.com> writes:

Has the fix for the referenced bug in this discussion [1] been released?

I don't see any indication, either in that thread or in the commit
log, that anything has been done in this area since about 16.2.
It's not an easy problem in general.

Having said that, Aurora is not Postgres, and I don't know
how closely they track us. Can you reproduce this problem
on a stock build of community Postgres?

regards, tom lane

#3Craig Milhiser
craig@milhiser.com
In reply to: Tom Lane (#2)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Sun, Sep 22, 2024 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Craig Milhiser <craig@milhiser.com> writes:

Has the fix for the referenced bug in this discussion [1] been released?

I don't see any indication, either in that thread or in the commit
log, that anything has been done in this area since about 16.2.
It's not an easy problem in general.

Having said that, Aurora is not Postgres, and I don't know
how closely they track us. Can you reproduce this problem
on a stock build of community Postgres?

regards, tom lane

Thanks. I will work on setting that up. I am also getting the AWS team
involved.

I rewrote the one query I investigated so far. Run without parallel
execution to avoid this issue, it took 15 seconds; after the rewrite it
completes in 0.2 seconds. For this query I can avoid the issue, at least
temporarily, by writing a better query. But we need to solve the real
problem, and I have not looked at the other queries affecting me. I may not
get so lucky again.

I will post when I get the stock Postgres setup and running.

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Craig Milhiser (#3)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Mon, Sep 23, 2024 at 12:52 PM Craig Milhiser <craig@milhiser.com> wrote:

On Sun, Sep 22, 2024 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Craig Milhiser <craig@milhiser.com> writes:

Has the fix for the referenced bug in this discussion [1] been released?

I don't see any indication, either in that thread or in the commit
log, that anything has been done in this area since about 16.2.
It's not an easy problem in general.

Having said that, Aurora is not Postgres, and I don't know
how closely they track us. Can you reproduce this problem
on a stock build of community Postgres?

regards, tom lane

Thanks. I will work on setting that up. Also getting the aws team involved.

The one query I investigated I rewrote. It took 15 seconds without parallel to avoid this issue. I rewrote it and now the query completes in 0.2 seconds. For this query I can avoid the issue, at least temporarily, by making a better query. But we need to solve the real problem. And I have not looked at the other queries affecting me. I may not get so lucky again.

I will post when I get the stock Postgres setup and running.

Hi,

FYI this is on my radar and it would be good to try to make a small
back-patchable improvement. I would need to swap the problem back into
memory to be sure, but from memory the problem is that parallel hash join
partitions take 432 bytes of bookkeeping memory each. (There is also the
problem that they each have output buffers, but those are not allocated in
one big array; and non-parallel hash join also has a per-partition
overhead, but it's smaller due to less bookkeeping state, so we don't hear
about it.) Hash joins use partition files (AKA batches) to try to keep
each hash table under work_mem * hash_mem_multiplier, and if you have 2^22
(~4 million) partitions we therefore try to allocate 1811939328 bytes of
memory (the number in $SUBJECT), exceeding our arbitrary 1GB allocation
limit.

It's possible to turn that arbitrary allocation limit off, but that would
be treating a symptom and certainly wouldn't produce good performance. If
you increased work_mem or hash_mem_multiplier (probably at least by
double), you might have better luck: at some cross-over point, doubling the
size of the array of partitions uses more memory than you can save by
(potentially) halving the size of the hash table! Even aside from that
arithmetic problem, anything more than around 2^10 partitions (~1 thousand)
will start to perform worse unless you also increase max_files_per_process
to match, because every flush of a spill-file page will likely have to
close and open a file (on stock PostgreSQL at least, but Aurora may have a
completely different scheme for temporary spill data for all I know).

So we could simply cap the number of partitions, and start ignoring the
work_mem * hash_mem_multiplier limit beyond that cap, but we haven't done
it yet because it's hard to pick a number and reports are rare (i.e. very
large queries run with low work_mem, if that is indeed the problem here).
2^21 would be the highest plausible candidate (2^21 * 432 = ~900MB), but
it's still very high. There is a related problem of extreme skew
(squillions of tuples in one bucket), which is much harder to treat, but
ideas were mentioned in that and other threads... For the non-skewed
version of the problem, which may be more common, at one level the problem
is that the defaults are set for small-memory machines, people are running
increasingly huge joins on huge-memory machines, and the machinery to make
that work is a bit naive and excessively expensive. We could and should
invent better strategies for coping.
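
As a rough illustration of the arithmetic above (the ~432 bytes per batch
is approximate and varies with the number of workers, and the settings
shown are only examples, not recommendations):

SELECT pow(2, 22)::bigint * 432 AS requested_bytes;  -- 1811939328, the number in $SUBJECT
-- Giving each hash table a bigger memory budget means fewer batches are planned:
SET work_mem = '256MB';          -- default is 4MB
SET hash_mem_multiplier = 4.0;   -- default is 2.0 in recent releases; hash tables may use work_mem * this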

#5Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Mon, Sep 23, 2024 at 1:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:

432 bytes

Oh, as Tomas pointed out in the referenced thread, the actual number
depends on the number of workers because there is some per-worker
state for partitions, but that number does seem consistent with your
reported case. Perhaps the correct answer is simply to give up
partitioning when the partition state size would exceed the potential
hash table savings by further partitioning. Another question is
whether it's arriving at the problematic number by underestimating and
then repeatedly expanding, which is very inefficient, or planning the
high number from the outset, but either way that'd be two different
code paths that would need to respect a new cap. If it's the former,
there may also be ways to improve initial estimates with statistics.

#6Craig Milhiser
craig@milhiser.com
In reply to: Thomas Munro (#5)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Sun, Sep 22, 2024 at 10:23 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

On Mon, Sep 23, 2024 at 1:46 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

432 bytes

Oh, as Tomas pointed out in the referenced thread,

Thanks for working on it and the detailed explanation. I tested setting
max_parallel_workers_per_gather = 0, as suggested in the original thread,
and it works. We are putting that into the application for our largest
customers: set it to 0 before the query, then back to 2 after (a sketch of
that pattern is below).
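
A minimal sketch of that pattern, assuming the application can wrap the
statement in its own transaction (the query itself is a placeholder):

BEGIN;
SET LOCAL max_parallel_workers_per_gather = 0;  -- applies only to this transaction
-- ... run the affected query here ...
COMMIT;  -- the previous setting is restored automatically at transaction end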

Your explanation also shows why rewriting the query works. I reduced the
number of rows being processed much earlier in the query. The query was
written as one set of many joins that worked on millions of rows and then
reduced them to a handful. I broke this into a materialized CTE that forces
Postgres to reduce the rows early and then do the joins. Rewriting the
query is better regardless of this issue.

I am working on getting a stock Postgres in our production protected
enclave with our production database. Probably a full day of work that I
need to splice in. We have a similar mechanism in our development
environment. Once working I can help test and debug any changes. I can also
work on a reproducible example.

#7Craig Milhiser
craig@milhiser.com
In reply to: Craig Milhiser (#6)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

Having said that, Aurora is not Postgres, and I don't know
how closely they track us. Can you reproduce this problem
on a stock build of community Postgres?

I reproduced the issue on v17. I downloaded the source tarball, built it,
ran the tests, loaded my production database, analyzed it, and ran the
query. As you expected, the same issue occurred. I have opened the incident
with the AWS team as well.

select version();
                                             version
--------------------------------------------------------------------------------------------------
 PostgreSQL 17.0 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0, 64-bit

Since I have this saved for building, let me know if you need logs or have
an experiment to try. I tried to reproduce the issue simply with artificial
data, but the query completed; a different plan was chosen since the data
skew was very different.

My workarounds are turning parallel execution off for the known queries
and, when possible, rewriting the queries.

Thanks for the help.
Craig

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Craig Milhiser (#7)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Mon, Sep 30, 2024 at 12:03 PM Craig Milhiser <craig@milhiser.com> wrote:

I reproduced the issue on v17. I downloaded the source tarball, built it, passed tests, put my production database, analyzed and ran the query. As you expected, the same issue occurred. I have opened the incident with the AWS team as well.

Since you're building from source, you could try applying the patch
posted by Andrei Lepikhov:

/messages/by-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb@postgrespro.ru

I suspect we may want to limit it to a smaller number than that, as
mentioned already, and I think we should also apply the same cap to
the initial estimate (Andrei's patch only caps it when it decides to
increase it, not for the initial nbatch number). I can write a patch
like that in a few days when I return from travelling, and we can aim
to get it into the November release, but I suspect Andrei's patch
might already avoid the error for your case.

#9Craig Milhiser
craig@milhiser.com
In reply to: Thomas Munro (#8)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

Since you're building from source, you could try applying the patch
posted by Andrei Lephikov:

/messages/by-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb@postgrespro.ru

This did not work for me. I am running out of memory.

I applied the patch, then make clean, make, make check, sudo make install.
I am running the out-of-the-box Postgres configuration.

Memory figures below are from "free -m".

Before loading Postgres
              total        used        free      shared  buff/cache   available
Mem:          31388         669       30467           2         639       30719
Swap:             0           0           0

After loading
              total        used        free      shared  buff/cache   available
Mem:          31388         672       30464          14         651       30715
Swap:             0           0           0

In psql, I set max_parallel_workers_per_gather = 0 and run the query
multiple times; it takes 9.5 seconds at steady state and returns 20 rows.

Memory is still available

              total        used        free      shared  buff/cache   available
Mem:          31388         921       22547         142        8460       30466
Swap:             0           0           0

In the same psql session, set max_parallel_workers_per_gather = 2; then run
the query again. This runs for 1 minute then:

2024-10-01 18:28:45.883 UTC [2586] LOG:  background worker "parallel worker" (PID 4465) was terminated by signal 9: Killed
2024-10-01 18:28:45.883 UTC [2586] DETAIL:  Failed process was running: SELECT ...
2024-10-01 18:28:45.883 UTC [2586] LOG:  terminating any other active server processes
2024-10-01 18:28:46.620 UTC [2586] LOG:  all server processes terminated; reinitializing

I got this as close to the end as I could
              total        used        free      shared  buff/cache   available
Mem:          31388       31014         535        1955        2156         373
Swap:             0           0           0

Though OOM conditions often mean all bets are off for behavior, I tried
something different. I rebooted, started Postgres, did not set
max_parallel_workers_per_gather = 0, and ran the query, which populated the
cache. The machine exhausts memory again but usually "hangs" and I need to
restart. Below is the frozen screen:
              total        used        free      shared  buff/cache   available
Mem:          31388       31317         240        1955        2140          70
Swap:             0           0           0

I ran these sequences multiple times. I also analyzed the data again just
to make sure.

I reverted the patch to make sure I am reproducing the issue: I get the
same 1.8GB allocation failure with parallel. Without parallel the query
takes ~10 seconds. The patch increased the single-worker performance of
this query by 5% with the out-of-the-box configuration.

Thanks


#10Andrei Lepikhov
lepihov@gmail.com
In reply to: Craig Milhiser (#9)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On 2/10/2024 02:12, Craig Milhiser wrote:

Since you're building from source, you could try applying the patch
posted by Andrei Lepikhov:

/messages/by-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb@postgrespro.ru

This did not work for me. I am running out of memory.

Can you provide an explain of this query? Also, can you remove unnecessary
details from the query text, like temporary views or CASE ... WHEN
constructions, if the OOM still reproduces?

--
regards, Andrei Lepikhov

#11Craig Milhiser
craig@milhiser.com
In reply to: Andrei Lepikhov (#10)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Oct 1, 2024 Andrei Lepikhov wrote

Can you provide an explain of this query?

Apologies for the delay. I have been travelling since Wednesday night.
Thanks for your help and time with this issue.

Below are the query, with specific values redacted, an explain with
max_parallel_workers_per_gather = 2, and an explain analyze with
max_parallel_workers_per_gather = 0.

In this case, the number of rows from the users table based on account_id
is in the 99th percentile for this table and it is a long and sparse right
tail.

This is using v17.0 stock source code and stock configuration on Linux.

The query

SELECT
CF.NUMERIC_VALUE AS CF_COL,
U.USERS_ID,
U.OBJECT_ID,
U.ACCOUNT_ID,
U.EXTERNAL_ID,
U.FIRST_NAME,
U.MIDDLE_NAME,
U.LAST_NAME,
U.DISABLED,
U.DEACTIVATED AS SUSPEND_DATE,
U.CREATED,
U.UPDATED,
U.IS_BLE_TWO_FACTOR_EXEMPT,
U.HAS_THUMBNAIL,
U.USER_TYPE_ID,
UI.USER_IMAGE_ID,
UI.CONTENT_TYPE AS USER_IMAGE_CONTENT_TYPE,
COUNT(*) OVER () AS TOTAL_USERS_COUNT,
STRING_AGG(SG.object_ID::CHARACTER VARYING, ';') AS GROUPS,
STRING_AGG(SG.NAME, ' ') AS GROUPNAMES
FROM
USERS U
LEFT JOIN USER_IMAGE UI ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
LEFT JOIN SECURITY_GROUP_MEMBER SGM ON SGM.OBJECT_ID = U.OBJECT_ID
AND SGM.OBJECT_ID = U.OBJECT_ID
LEFT JOIN SECURITY_GROUP SG
ON SGM.SECURITY_GROUP_ID = SG.SECURITY_GROUP_ID
AND SG.DISABLED = 0
AND SG.ACCOUNT_ID = U.ACCOUNT_ID
AND SG.SECURITY_GROUP_TYPE_ID = 2
LEFT JOIN CUSTOM_FIELD_VALUE CF
ON U.USERS_ID = CF.USER_ID
AND CF.CUSTOM_FIELD_ID = <craig redacted>
WHERE
U.ACCOUNT_ID = <craig redacted>
AND U.USER_TYPE_ID = 1
AND U.DISABLED = 0
GROUP BY
U.USERS_ID,
UI.USER_IMAGE_ID,
CF.NUMERIC_VALUE
ORDER BY
U.LAST_NAME ASC,
U.FIRST_NAME ASC,
U.USERS_ID ASC
LIMIT
20
OFFSET
0;

Explain with the stock configuration, where
max_parallel_workers_per_gather = 2:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2529139.77..2529139.82 rows=20 width=187)
   ->  Sort  (cost=2529139.77..2530484.84 rows=538028 width=187)
         Sort Key: u.last_name, u.first_name, u.users_id
         ->  WindowAgg  (cost=2514822.88..2514823.03 rows=538028 width=187)
               ->  Finalize GroupAggregate  (cost=2432583.40..2508097.68 rows=538028 width=179)
                     Group Key: u.users_id, ui.user_image_id, cf.numeric_value
                     ->  Gather Merge  (cost=2432583.40..2492181.03 rows=448356 width=179)
                           Workers Planned: 2
                           ->  Partial GroupAggregate  (cost=2431583.37..2439429.60 rows=224178 width=179)
                                 Group Key: u.users_id, ui.user_image_id, cf.numeric_value
                                 ->  Sort  (cost=2431583.37..2432143.82 rows=224178 width=140)
                                       Sort Key: u.users_id, ui.user_image_id, cf.numeric_value
                                       ->  Parallel Hash Left Join  (cost=1384936.37..2395567.35 rows=224178 width=140)
                                             Hash Cond: (u.users_id = cf.user_id)
                                             ->  Hash Left Join  (cost=1124308.04..2134350.56 rows=224178 width=134)
                                                   Hash Cond: (sgm.security_group_id = sg.security_group_id)
                                                   ->  Nested Loop Left Join  (cost=1119678.30..2129132.34 rows=224178 width=117)
                                                         ->  Parallel Hash Right Join  (cost=1119677.73..1326436.98 rows=224178 width=109)
                                                               Hash Cond: (ui.user_image_id = u.user_image_id)
                                                               ->  Parallel Seq Scan on user_image ui  (cost=0.00..130846.12 rows=3533412 width=18)
                                                               ->  Parallel Hash  (cost=1113372.50..1113372.50 rows=224178 width=99)
                                                                     ->  Parallel Bitmap Heap Scan on users u  (cost=8824.42..1113372.50 rows=224178 width=99)
                                                                           Recheck Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
                                                                           ->  Bitmap Index Scan on u_act_dis_type  (cost=0.00..8689.92 rows=538028 width=0)
                                                                                 Index Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
                                                         ->  Index Only Scan using security_group_obid_sgid_idx on security_group_member sgm  (cost=0.57..3.57 rows=1 width=16)
                                                               Index Cond: ((object_id = u.object_id) AND (object_id = u.object_id))
                                                   ->  Hash  (cost=4622.16..4622.16 rows=607 width=41)
                                                         ->  Index Scan using account_security_group_fk_ind on security_group sg  (cost=0.43..4622.16 rows=607 width=41)
                                                               Index Cond: (account_id = <craig redacted>)
                                                               Filter: ((disabled = 0) AND (security_group_type_id = 2))
                                             ->  Parallel Hash  (cost=259796.42..259796.42 rows=66553 width=14)
                                                   ->  Parallel Index Scan using date_value_idx on custom_field_value cf  (cost=0.56..259796.42 rows=66553 width=14)
                                                         Index Cond: (custom_field_id = <craig redacted>)
(34 rows)

Explain (analyze) with max_parallel_workers_per_gather = 0:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4152596.58..4152596.63 rows=20 width=187) (actual time=10192.249..10192.258 rows=20 loops=1)
   ->  Sort  (cost=4152596.58..4153941.65 rows=538028 width=187) (actual time=10192.248..10192.255 rows=20 loops=1)
         Sort Key: u.last_name, u.first_name, u.users_id
         Sort Method: top-N heapsort  Memory: 32kB
         ->  WindowAgg  (cost=4138279.81..4138279.85 rows=538028 width=187) (actual time=9748.632..9958.924 rows=904292 loops=1)
               ->  GroupAggregate  (cost=4112723.52..4131554.50 rows=538028 width=179) (actual time=8482.695..9389.560 rows=904292 loops=1)
                     Group Key: u.users_id, ui.user_image_id, cf.numeric_value
                     ->  Sort  (cost=4112723.52..4114068.59 rows=538028 width=140) (actual time=8482.679..8655.695 rows=1720872 loops=1)
                           Sort Key: u.users_id, ui.user_image_id, cf.numeric_value
                           Sort Method: external merge  Disk: 199104kB
                           ->  Hash Left Join  (cost=602312.67..3984272.46 rows=538028 width=140) (actual time=1955.881..7537.783 rows=1720872 loops=1)
                                 Hash Cond: (u.users_id = cf.user_id)
                                 ->  Hash Left Join  (cost=340636.13..3721183.60 rows=538028 width=134) (actual time=1806.879..6920.376 rows=1720872 loops=1)
                                       Hash Cond: (sgm.security_group_id = sg.security_group_id)
                                       ->  Nested Loop Left Join  (cost=336006.39..3715141.53 rows=538028 width=117) (actual time=1804.650..6599.170 rows=1720872 loops=1)
                                             ->  Hash Left Join  (cost=336005.82..1788669.80 rows=538028 width=109) (actual time=1804.623..3537.213 rows=904292 loops=1)
                                                   Hash Cond: (u.user_image_id = ui.user_image_id)
                                                   ->  Index Scan using u_act_dis_type on users u  (cost=0.56..1384749.23 rows=538028 width=99) (actual time=0.033..1133.900 rows=904292 loops=1)
                                                         Index Cond: ((account_id = <craig redacted>) AND (disabled = 0) AND (user_type_id = 1))
                                                   ->  Hash  (cost=180313.89..180313.89 rows=8480189 width=18) (actual time=1804.516..1804.517 rows=8488571 loops=1)
                                                         Buckets: 131072  Batches: 128  Memory Usage: 3986kB
                                                         ->  Seq Scan on user_image ui  (cost=0.00..180313.89 rows=8480189 width=18) (actual time=0.011..753.277 rows=8488571 loops=1)
                                             ->  Index Only Scan using security_group_obid_sgid_idx on security_group_member sgm  (cost=0.57..3.57 rows=1 width=16) (actual time=0.003..0.003 rows=2 loops=904292)
                                                   Index Cond: ((object_id = u.object_id) AND (object_id = u.object_id))
                                                   Heap Fetches: 0
                                       ->  Hash  (cost=4622.16..4622.16 rows=607 width=41) (actual time=2.219..2.220 rows=795 loops=1)
                                             Buckets: 1024  Batches: 1  Memory Usage: 78kB
                                             ->  Index Scan using account_security_group_fk_ind on security_group sg  (cost=0.43..4622.16 rows=607 width=41) (actual time=0.937..2.121 rows=795 loops=1)
                                                   Index Cond: (account_id = <craig redacted>)
                                                   Filter: ((disabled = 0) AND (security_group_type_id = 2))
                                                   Rows Removed by Filter: 764
                                 ->  Hash  (cost=260262.29..260262.29 rows=113140 width=14) (actual time=148.930..148.931 rows=125986 loops=1)
                                       Buckets: 131072  Batches: 1  Memory Usage: 6931kB
                                       ->  Index Scan using date_value_idx on custom_field_value cf  (cost=0.56..260262.29 rows=113140 width=14) (actual time=0.021..132.508 rows=125986 loops=1)
                                             Index Cond: (custom_field_id = <craig redacted>)
 Planning Time: 0.983 ms
 Execution Time: 10233.621 ms
(37 rows)

I asked someone else to try to build artificial data for this query. Maybe
they will have a different take and be successful compared to me.

Thanks
Craig

#12Andrei Lepikhov
lepihov@gmail.com
In reply to: Craig Milhiser (#11)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On 10/7/24 18:42, Craig Milhiser wrote:

On Oct 1, 2024 Andrei Lepikhov wrote

Can you provide an explain of this query?

Apologies for the delay. I have been travelling since Wednesday night.
Thanks for your help and time with this issue.

Below is the query, with specific values redacted. An explain with
max_parallel_workers_per_gather = 2 and explain analyze
max_parallel_workers_per_gather = 0.

I'm a bit confused: the thread subject is 'invalid DSA memory alloc
request size ...', but you write about an issue with the OOM killer. These
are two different issues; which one do you have, exactly?

The OOM killer is easy to explain, because I see a huge string_agg
aggregate - workers can utilise memory more intensively. For now, EXPLAIN
of an Aggregate node doesn't show information about the actual sort
operation of each aggregate or its memory consumption.

--
regards, Andrei Lepikhov

#13Craig Milhiser
craig@milhiser.com
In reply to: Andrei Lepikhov (#12)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Oct 8 Andrei Lepikhov wrote

I'm a bit confused: the thread subject named ' invalid DSA memory alloc
request size ...', but you write about issue with OOM killer. It is two
different issues, which one do you have exactly?

I started with the invalid DSA memory allocation error. I was asked to try
the experimental patch above. Then, with the patch applied, I got an OOM,
but only when running in parallel. As you will see below, there was an OOM,
but I do not believe the query itself is to blame.

Thanks for the push on OOM. I should have run this test earlier.

v17.0 and out of the box Postgres configuration.

I ran a new test on an instance with 512 GiB of memory. After I applied
the patch, the Invalid DSA memory allocation message was not replicated.
Running max_parallel_workers_per_gather = 0, the query took ~9.5 seconds
and used <1 GiB of memory. With max_parallel_workers_per_gather = 2 the
query used ~170 GiB of memory, ~70 GB of temp files were written and the
query ran for more than 1 hour until I ran out of disk space.

I moved from Invalid DSA memory allocation of ~2 GB to using 170 GB of RAM
and 70+GB of temp files with the patch. Only when using 2 parallel workers
per gather.

The new test:

This morning I increased the machine size from 32 GiB to 512 GiB RAM.

With the patch applied and max_parallel_workers_per_gather = 0 the query
worked in ~9.5 seconds at steady state. While it was running I captured
memory. I ran the query a few times earlier to get the buffers loaded.

              total        used        free      shared  buff/cache   available
Mem:          493Gi       3.5Gi       484Gi       142Mi       8.4Gi       489Gi
Swap:            0B          0B          0B

With the patch applied and max_parallel_workers_per_gather = 2, the query
ran for more than 1 hour. During that time memory settled at:
              total        used        free      shared  buff/cache   available
Mem:          493Gi       178Gi       209Gi       1.9Gi       110Gi       314Gi
Swap:            0B          0B          0B

Then the machine ran out of disk space: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space left on device

I captured top as well during the run.
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   4951 postgres  20   0   46.8g  45.5g   1.9g D   5.6  9.2   2:40.40 postgres
   4942 postgres  20   0   68.9g  65.6g   1.9g D   5.3 13.3   3:25.35 postgres
   4952 postgres  20   0   68.4g  65.2g   1.9g D   5.3 13.2   3:07.43 postgres

After rebooting:
df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 266G 197G 70G 74% /

As you mentioned there are string aggregations. I ran with parallel=0 and
did some analysis. The aggregations do not seem to be creating something
that is out of line.

select max(length(groups)), sum(length(groups)), max(length(groupnames)),
sum(length(groupnames)) from milhiser_test;
max | sum | max | sum
-----+---------+-----+----------
143 | 6557620 | 499 | 22790616
(1 row)

Perhaps this is a different problem than the "invalid DSA memory alloc".
The patch might have addressed that problem and this is another issue.
Going from < 1 GiB to ~170 GiB of memory and ~70 GB of temp files when
moving from parallel = 0 to 2 suggests something is off.

Summary before this test:
Before the patch linked above, I was receiving "ERROR: invalid DSA memory
alloc request size 1879048192" when I ran the query with
max_parallel_workers_per_gather = 2.

Before the patch with max_parallel_workers_per_gather = 0 the query worked
in ~10 seconds at steady state.

I applied the patch to v17.0 source, rebuilt, and passed tests.

With max_parallel_workers_per_gather = 0 the query worked in ~9.5 seconds
at steady state and took < 1 GiB of memory.

With max_parallel_workers_per_gather = 2, the machine ran out of memory.
This was a 32 GiB machine; when running without parallel, ~30 GiB was free.

Thanks


#14Thomas Munro
thomas.munro@gmail.com
In reply to: Craig Milhiser (#13)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Thu, Oct 10, 2024 at 5:28 AM Craig Milhiser <craig@milhiser.com> wrote:

Then the machine ran out of disk space: ERROR: could not write to file "base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space left on device

For that, I have a patch in the queue to unlink temporary files incrementally:

/messages/by-id/CA+hUKG+RGdvhAdVu5_LH3Ksee+kW-XkTP_nMxBL+Rmgp3Tjb_w@mail.gmail.com

That's just treating a symptom, though. Things have already gone
quite wrong if we're repeatedly repartitioning our way up to 2 million
batches and only giving up there because of Andrei's patch.

I wonder if something could be wrong with Parallel Hash Right Join, which
we see in your plan. That's new-ish, and I vaguely recall another case
where it seemed to be on the scene in a plan with a high number of
batches... hmm. Definitely keen to see a reproducer with synthetic data if
you can come up with one...

#15Andrei Lepikhov
lepihov@gmail.com
In reply to: Craig Milhiser (#13)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On 10/9/24 23:28, Craig Milhiser wrote:

On Oct 8 Andrei Lepikhov wrote

I'm a bit confused: the thread subject named ' invalid DSA memory alloc
request size ...', but you write about issue with OOM killer. It is two
different issues, which one do you have exactly?

I started with the Invalid DSA memory allocation error. I was asked to
try an experimental patch above. Then I got OOM with the patch only
running parallel. You will see below, there was an OOM but I do not
believe it is the query.

So, I think the patch works, but you have found one more issue with the
same query. Awesome!

I ran a new test on an instance with 512 GiB of memory.  After I applied
the patch, the Invalid DSA memory allocation message was not
replicated.  Running max_parallel_workers_per_gather = 0, the query took
~9.5 seconds and used <1 GiB of memory.  With
max_parallel_workers_per_gather = 2 the query used  ~170 GiB of memory,
~70 GB of temp files were written and the query ran for more than 1 hour
until I ran out of disk space.

It's fascinating. I have one user report like that, but they also didn't
provide any synthetic test. I think it is almost impossible to create such
a reproduction without a minimal understanding of what's happening. I can
imagine only data skew or a logical bug in this part of the code. But
without direct access to perf and gdb, it is hard to resolve the issue just
by gazing at the code.
Additional actions can provide some food for thought:
1. If you remove the aggregates (STRING_AGG, count) from the selection
list, does the problem remain? What about OFFSET 0?
2. Can you build extended statistics on account_id, disabled, user_type_id
(a sketch follows below) and provide an explain (and explain analyse)?
3. Can you use pg_query_state [1] (unfortunately, it needs a patch and
re-compilation) and show us intermediate execution state snapshots?
4. I see a duplicate clause in the query: SGM.OBJECT_ID = U.OBJECT_ID. Why
do you have it there? Can you remove it from the query?
5. One more wild guess: can you analyse how many NULLs column u.users_id
contains at the moment when HashJoin evaluates the clause (u.users_id =
cf.user_id)?

[1]: https://github.com/postgrespro/pg_query_state
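
A rough sketch of the extended statistics suggested in point 2 (the
statistics name and the kinds chosen are only illustrative):

CREATE STATISTICS users_acct_dis_type_stx (dependencies, ndistinct, mcv)
    ON account_id, disabled, user_type_id
    FROM users;
ANALYZE users;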

--
regards, Andrei Lepikhov

#16Craig Milhiser
craig@milhiser.com
In reply to: Andrei Lepikhov (#15)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

Thomas Munro wrote

I wonder if something could be wrong with Parallel Hash Right Join ...
Definitely keen to see a reproducer with synthetic data if you can come up
with one

Andrei Lepikhov wrote

I can imagine only data skew or a logical bug in this part of the code.
But without direct access to perf and gdb, it is hard to resolve the issue
just by gazing at the code.

Both of you are correct.

I have reproduced the problem with synthetic data. The script is below.
Thank you for your patience with me.

There are comments in the script. Please let me know of any questions or if
you cannot reproduce it. If you want me to file a report via the form, let
me know.

Using Postgres v17 with out of the box configuration.

drop table test_users;
create table test_users (
    account_id bigint not null,
    users_id bigint not null constraint test_users_pkey primary key,
    first_name varchar(105),
    last_name varchar(105),
    user_image_id bigint);

-- The account we are interested in; data numbers are negative to eliminate
-- duplicates and help with debugging
insert into test_users (account_id, users_id, first_name, last_name, user_image_id)
SELECT -1, -1 * i, md5(random()::text), md5(random()::text),
       case when random() < 0.95 then null else -1 * i end
FROM generate_series(1, 925_000) AS t(i)
;

-- Make enough other records to get the skew to force a Parallel Hash Right Join and the query breaks
-- Change the "< 0.50" to "< 0.95" to get a skew for a Parallel Hash Left Join and the query works
-- 0.50 makes a right join and breaks; 0.95 makes a left join and works
-- Changes how many users are in user_image which, relative to the number of users and accounts,
-- is the key skew that I found
-- Data numbers are positive
insert into test_users(account_id, users_id, first_name, last_name, user_image_id)
SELECT random(10, 50_000)::bigint, i, md5(random()::text), md5(random()::text),
       case when random() < 0.50 then null else i end
FROM generate_series(1, 50_000_000) AS t(i)
;

create index user_img_fk_idx on test_users using btree (user_image_id);

drop table test_user_image;
create table test_user_image(user_image_id bigint not null constraint test_user_image_pkey primary key);
insert into test_user_image(user_image_id)
    select user_image_id from test_users where user_image_id is not null;

ALTER TABLE test_users ADD CONSTRAINT users_user_image_fk
    FOREIGN KEY (user_image_id) REFERENCES test_user_image(user_image_id);

commit;
analyze test_users;
analyze test_user_image;

-- at 0 workers the query will work
set max_parallel_workers_per_gather = 0;

SELECT U.USERS_ID
, U.FIRST_NAME
, U.LAST_NAME
FROM test_USERS U
LEFT JOIN test_USER_IMAGE UI
ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
WHERE U.ACCOUNT_ID = -1
GROUP BY U.USERS_ID
, UI.USER_IMAGE_ID
ORDER BY U.LAST_NAME ASC
, U.FIRST_NAME ASC
, U.USERS_ID ASC
LIMIT 20
OFFSET 0
;

set max_parallel_workers_per_gather = 2;
-- Explain the above query. For it to break, a Parallel Hash Right Join is executed.
-- When a Parallel Hash Left Join is executed, the query works. Switch between left and right
-- by changing the skew as noted above.
-- When run with stock Postgres 17, the invalid DSA memory allocation occurs, which started this thread.
-- When run with the patch to fix the invalid DSA memory allocation, the OOM occurs.
-- I reproduced the invalid DSA memory allocation with AWS Aurora v16.2. Naturally I cannot try
-- patches there. The above was recreated with stock Postgres v17 on a plain EC2 instance.

Thank you for your time

Craig

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Craig Milhiser (#16)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On Mon, Oct 14, 2024 at 12:23 AM Craig Milhiser <craig@milhiser.com> wrote:

I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.

Thanks, repro'd here. At first glance, it looks like it's trying to
load this distribution into a hash table and failing to handle the
skew as well as non-parallel hash:

postgres=# select user_image_id, count(*) from test_users where
account_id = -1 group by 1 order by 2 desc limit 5;
user_image_id | count
---------------+--------
| 878823 <-- choking on this?
-924960 | 1
-924934 | 1
-924917 | 1
-924971 | 1
(5 rows)

->  Parallel Hash Right Join  (cost=1027177.72..1368758.97 rows=363544 width=82)
      Hash Cond: (ui.user_image_id = u.user_image_id)
      ->  Parallel Seq Scan on test_user_image ui  (cost=0.00..215192.79 rows=10436379 width=8)
      ->  Parallel Hash  (cost=1017662.42..1017662.42 rows=363544 width=82)
            ->  Parallel Seq Scan on test_users u  (cost=0.00..1017662.42 rows=363544 width=82)
                  Filter: (account_id = '-1'::integer)

Getting coffee and looking more closely...

#18Andrei Lepikhov
lepihov@gmail.com
In reply to: Thomas Munro (#17)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On 10/14/24 04:08, Thomas Munro wrote:

On Mon, Oct 14, 2024 at 12:23 AM Craig Milhiser <craig@milhiser.com> wrote:

I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.

Thanks, repro'd here. At first glance, it looks like it's trying to
load this distribution into a hash table and failing to handle the
skew as well as non-parallel hash:
Getting coffee and looking more closely...

Hmm, with a reproduction it is too easy to solve ;)
My explanation (correct me if I'm wrong):
OUTER JOINs allow NULLs to be in a hash table. At the same time, the hash
value for NULL is 0, so those tuples go to batch 0.
If batch number 0 gets overfilled, the
ExecParallelHashIncreaseNumBatches routine attempts to increase the
number of batches - but nothing happens. The initial batch is still too
big, and the number of batches doubles up to the limit.
At the limit, parallel HashJoin stops this growth and (I didn't trace this
part, just a guess) allocates memory for 2097152 batches, which causes the OOM.
To support this chain of thought, you can see the simple example below,
which triggers the issue:

DROP TABLE IF EXISTS test;
CREATE TABLE test (n int);
INSERT INTO test (n) SELECT NULL FROM generate_series(1,1E6);
INSERT INTO test (n) VALUES (1);
ANALYZE test;

SET enable_nestloop = 'off';
SET enable_mergejoin = 'off';

SET max_parallel_workers_per_gather = 2;
SET min_parallel_table_scan_size = 0;
SET min_parallel_index_scan_size = 0;
SET parallel_setup_cost = 0.001;
SET parallel_tuple_cost = 0.0001;

EXPLAIN (ANALYZE, VERBOSE, COSTS OFF)
SELECT t1.n FROM test t1 LEFT JOIN test t2 USING (n);

I think it is now much easier to find a proper solution.

--
regards, Andrei Lepikhov

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#18)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

Andrei Lepikhov <lepihov@gmail.com> writes:

My explanation (correct if I'm wrong):
OUTER JOINs allow NULLs to be in a hash table. At the same time, a hash
value for NULL is 0, and it goes to the batch==0.
If batch number 0 gets overfilled, the
ExecParallelHashIncreaseNumBatches routine attempts to increase the
number of batches - but nothing happens. The initial batch is still too
big, and the number of batches doubles up to the limit.

Interesting point. If memory serves (I'm too tired to actually look)
the planner considers the statistical most-common-value when
estimating whether an unsplittable hash bucket is likely to be too
big. It does *not* think about null values ... but it ought to.

However, this does not explain why PHJ would be more subject to
the problem than non-parallel HJ.
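
For reference, the per-column statistics the planner could consult already
record a null fraction next to the most-common values; a query along these
lines (using the synthetic table names from upthread) shows what it has to
work with:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'test_users' AND attname = 'user_image_id';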

regards, tom lane

#20Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#19)
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

On 10/14/24 13:26, Tom Lane wrote:

Andrei Lepikhov <lepihov@gmail.com> writes:

My explanation (correct if I'm wrong):
OUTER JOINs allow NULLs to be in a hash table. At the same time, a hash
value for NULL is 0, and it goes to the batch==0.
If batch number 0 gets overfilled, the
ExecParallelHashIncreaseNumBatches routine attempts to increase the
number of batches - but nothing happens. The initial batch is still too
big, and the number of batches doubles up to the limit.

Interesting point. If memory serves (I'm too tired to actually look)
the planner considers the statistical most-common-value when
estimating whether an unsplittable hash bucket is likely to be too
big. It does *not* think about null values ... but it ought to.

As I see it, it is just an oversight in the resizing logic: batch 0
doesn't change the estimated_size value at all - I think because it
doesn't matter for this batch - it can't be treated as exhausted by
definition. Because of that, parallel HashJoin doesn't detect the extreme
skew caused by duplicates in this batch. NULLs are just our bad luck - they
correspond to hash value 0 and fall into this batch.
See the attachment for a sketch of the solution.

However, this does not explain why PHJ would be more subject to
the problem than non-parallel HJ.

Good question! I rarely touch this part of the code and maybe don't see the
whole picture. But as I see it, non-parallel HJ is designed differently:
its repartitioning machinery is based on the overall hash table size and
the number of tuples, and it has nothing similar to 'batch 0' or parallel
batches. The hash table size is calculated for each batch and can't cause
this bug.

BTW, can we also resolve here the long-living corner case with "invalid
DSA memory alloc request size" [1]? Just because we have a clear
reproduction ...

[1]: /messages/by-id/7d763a6d-fad7-49b6-beb0-86f99ce4a6eb@postgrespro.ru

--
regards, Andrei Lepikhov

Attachments:

0001-Consider-extreme-skew-in-batch-0-during-Parallel-Has.patch (text/x-patch, +7 -3)
#21Craig Milhiser
craig@milhiser.com
In reply to: Andrei Lepikhov (#20)
#22Thomas Munro
thomas.munro@gmail.com
In reply to: Andrei Lepikhov (#20)
#23Andrei Lepikhov
lepihov@gmail.com
In reply to: Thomas Munro (#22)
#24Thomas Munro
thomas.munro@gmail.com
In reply to: Andrei Lepikhov (#23)
#25Andrei Lepikhov
lepihov@gmail.com
In reply to: Thomas Munro (#24)
#26Thomas Munro
thomas.munro@gmail.com
In reply to: Andrei Lepikhov (#25)
#27Andrei Lepikhov
lepihov@gmail.com
In reply to: Thomas Munro (#26)