BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

Started by PG Bug reporting form · 3 days ago · 10 messages · bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 19449
Logged by: Adrian
Email address: adrian.moennich@cern.ch
PostgreSQL version: 18.3
Operating system: Linux
Description:

In Indico (an open source conference management tool which I maintain and develop)
I noticed that a certain query to gather statistics became extremely slow on newer
Postgres versions on our production database. And by extremely slow I mean 3 hours
instead of a few seconds.

To replicate:

$ podman run -it --rm -p 65432:5432 -e POSTGRES_HOST_AUTH_METHOD=trust
--shm-size 8G docker.io/postgres:XX-alpine
$ createdb -h 127.0.0.1 -p 65432 -U postgres test
$ psql -h 127.0.0.1 -p 65432 -U postgres test -f data.sql
$ psql -h 127.0.0.1 -p 65432 -U postgres test -f stats.sql

Likely works fine with Docker as well, or with a non-containerized setup.
I just used podman/containers because of the convenience to run different
Postgres versions.

XX=14: Works fine, even w/o the increased shm-size of the container
XX=15: Works fine but only with the increased shm-size of the container
XX={16,17,18}: Massive CPU and disk usage (tens of gigabytes)

On these simple reproducers I did not keep the query running on 16+.
However, I ran it on a Postgres 16.11 instance on our production setup
(with our real database), and there the query finished only after over 3 hours(!).

This is extreme both in general and compared to the performance we got on 14/15,
where the same query took just a few seconds.

Here are EXPLAIN ANALYZE outputs from when I tested this a few weeks ago on 14
and 16, using our real production database.
https://explain.depesz.com/s/17Fp
https://explain.depesz.com/s/0dHI

For the reproducer above I created a dumbed-down version of my real data which
basically just has the relevant columns, FKs and indexes but no actual data.
I'm sharing a link to the data.sql file since it's 250 MB uncompressed and
still 50 MB compressed.

Structure + dummy data: https://fd.aeum.net/pgperf/data.sql.bz2
Problematic query: https://fd.aeum.net/pgperf/stats.sql

For the sake of having the query here and not just in an external file:

```
EXPLAIN ANALYZE SELECT count(attachments.attachments.id) AS count_1
FROM attachments.attachments
JOIN attachments.folders
  ON attachments.folders.id = attachments.attachments.folder_id
JOIN events.events
  ON events.events.id = attachments.folders.event_id
LEFT OUTER JOIN events.sessions
  ON events.sessions.id = attachments.folders.session_id
LEFT OUTER JOIN events.contributions
  ON events.contributions.id = attachments.folders.contribution_id
LEFT OUTER JOIN events.subcontributions
  ON events.subcontributions.id = attachments.folders.subcontribution_id
LEFT OUTER JOIN events.contributions AS contributions_1
  ON contributions_1.id = events.subcontributions.contribution_id
WHERE attachments.folders.link_type != 1
  AND NOT attachments.attachments.is_deleted
  AND NOT attachments.folders.is_deleted
  AND NOT events.events.is_deleted
  AND NOT coalesce(events.sessions.is_deleted,
                   events.contributions.is_deleted,
                   events.subcontributions.is_deleted, false)
  AND (contributions_1.is_deleted IS NULL
       OR NOT contributions_1.is_deleted)
```

#2 Andres Freund
andres@anarazel.de
In reply to: PG Bug reporting form (#1)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

Hi,

On 2026-04-02 13:04:46 +0000, PG Bug reporting form wrote:

This is extreme both in general and compared to the performance we got on 14/15,
where the same query took just a few seconds.

Here are EXPLAIN ANALYZE outputs from when I tested this a few weeks ago on 14
and 16, using our real production database.
https://explain.depesz.com/s/17Fp
https://explain.depesz.com/s/0dHI

A lot of time is wasted due to batching in the hash join in 16, seemingly due
to a mis-estimate in how much batching we would need:

-> Parallel Hash (cost=323037.00..323037.00 rows=1075136 width=10) (actual time=3267572.432..3267575.016 rows=1023098 loops=3)
Buckets: 262144 (originally 262144) Batches: 262144 (originally 32) Memory Usage: 18912kB
(note the 262144 batches, when 32 were originally assumed)

I'd suggest trying to run the query with a larger work_mem. Not because
that should be necessary to avoid regressions, but because it will be useful
to narrow down whether that's related to the issue...

However, even on 14, you do look to be losing a fair bit of performance due
to batching, so it might also be worth running the query on 14 with a larger
work_mem, to see what performance you get there.

It also looks like the choice of using memoize might not be working out
entirely here. Although I don't think it's determinative for performance, it
might still be worth checking what plan you get with
SET enable_memoize = 0;
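
Both knobs can be tried in a single session without touching postgresql.conf; a
sketch (the 128MB value below is only an illustrative starting point, not a
recommendation):

```sql
BEGIN;
SET LOCAL work_mem = '128MB';    -- scoped to this transaction only
SET LOCAL enable_memoize = off;  -- to compare the non-memoize plan
-- ... the EXPLAIN ANALYZE query from the first message ...
ROLLBACK;                        -- both settings revert automatically
```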

Greetings,

Andres Freund

#3 Adrian Mönnich
adrian.moennich@cern.ch
In reply to: Andres Freund (#2)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

Hi,

thanks a lot, I just tried with work_mem set to 128MB on PG16 and it worked fine:
https://explain.depesz.com/s/7Zan

Likewise on PG18:
https://explain.depesz.com/s/H15B

And with enable_memoize=0 (PG18, 128MB):
https://explain.depesz.com/s/SaVI

So increasing work_mem seems like a good workaround for when we upgrade
our production DB. But I guess there's still a bug somewhere that results in
the wrong estimate?

Cheers,
Adrian


#4 Andres Freund
andres@anarazel.de
In reply to: Adrian Mönnich (#3)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

Hi,

On 2026-04-02 16:06:27 +0200, Adrian Mönnich wrote:

thanks a lot, I just tried with work_mem set to 128MB on PG16 and it worked fine:
https://explain.depesz.com/s/7Zan

Likewise on PG18:
https://explain.depesz.com/s/H15B

And with enable_memoize=0 (PG18, 128MB):
https://explain.depesz.com/s/SaVI

That's good.

So increasing work_mem seems like a good workaround for when we upgrade
our production DB. But I guess there's still a bug somewhere that results in
the wrong estimate?

I don't even know if it's a misestimate that didn't happen in the earlier
versions - the join order is different in 14 than in the later ones. I
don't know why that is at this point.

This means that we don't know if 14 would have had the same misestimation if
the same join order had been chosen.

There also seem to be some data differences:

14: https://explain.depesz.com/s/17Fp#source
-> Parallel Seq Scan on contributions contributions_1 (cost=0.00..164891.13 rows=2687413 width=5) (actual time=0.013..454.721 rows=2143186 loops=3)

16: https://explain.depesz.com/s/7Zan
-> Parallel Seq Scan on contributions contributions_1 (cost=0.00..37776.28 rows=1643228 width=5) (actual time=0.081..78.499 rows=1314582.00 loops=3)

That's a pretty substantial difference in the number of rows.

Greetings,

Andres Freund

#5 Adrian Mönnich
adrian.moennich@cern.ch
In reply to: Andres Freund (#4)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

Indeed, good catch. I was generating the test data from an older prod data copy
and not a more recent one. In any case, the performance was fine on that same
copy on 14/15 and got bad on 16.

I just re-ran it with a larger database (and also replaced the compressed SQL
file from my initial message with the latest one).

PG14: https://explain.depesz.com/s/ysdJ
PG16, 4M: massive cpu + disk usage and thus aborted after a few seconds
PG16, 32M: https://explain.depesz.com/s/mYiY

Cheers,
Adrian


#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Adrian Mönnich (#5)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

Hi,

I can reproduce the performance getting much worse in 16, using the
provided SQL scripts. This is what I see:

14: 1551.363 ms
15: 1385.414 ms
16: 161571.400 ms
17: 156434.543 ms
18: 159095.001 ms

I'm attaching the explains for 15+16. I don't know what's causing it,
but I have a couple of interesting observations.

1) If I disable parallel query, the timings change to

14: 3990.439 ms
15: 3518.453 ms
16: 3606.460 ms
17: 3591.039 ms
18: 3617.872 ms

So no regression in this case. It seems to be related to parallelism.

2) There seems to be an explosion of temporary files. We don't have that
in explain, but I queried pg_stat_database before/after the query, and
there's huge difference. Both start at

temp_files | 112
temp_bytes | 1942275280

so 112 files, ~2GB disk space. But after the query, 15 says

temp_files | 721
temp_bytes | 2755839184

while 16 has

temp_files | 2078995
temp_bytes | 70607906000

2M files and 70GB? Wow!
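
For reference, those counters can be sampled before and after the query with
something like:

```sql
-- cumulative per-database temp file usage since the last stats reset
SELECT temp_files, temp_bytes
  FROM pg_stat_database
 WHERE datname = current_database();
```

(The values are cumulative, so the interesting number is the delta between the
two samples.)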

3) Indeed, before the query completes the pgsql_tmp directory has this:

63M pgsql_tmp3499395.0.fileset
63G pgsql_tmp3499395.1.fileset
95M pgsql_tmp3499395.2.fileset
95M pgsql_tmp3499395.3.fileset
127M pgsql_tmp3499395.4.fileset

So I guess that's one of the parallel hash joins doing something, and
consuming 63GB of disk space? I don't see anything suspicious in the
plan, but I assume parallel HJ may not report the relevant stats.

FWIW bumping up work_mem (to 64MB) solved this with the sample data.

I suspect this is going to be something like the hash join explosion,
where we just happen to add more and more batches. I don't have time to
investigate this more at the moment.

regards

--
Tomas Vondra

Attachments:

16.log (text/x-log)
15.log (text/x-log)
#7 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#6)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

On 4/2/26 20:12, Tomas Vondra wrote:

Hi,

I can reproduce the performance getting much worse in 16, using the
provided SQL scripts. This is what I see:

14: 1551.363 ms
15: 1385.414 ms
16: 161571.400 ms
17: 156434.543 ms
18: 159095.001 ms

I'm attaching the explains for 15+16. I don't know what's causing it,
but I have a couple interesting observations.

1) If I disable parallel query, the timings change to

14: 3990.439 ms
15: 3518.453 ms
16: 3606.460 ms
17: 3591.039 ms
18: 3617.872 ms

So no regression in this case. It seems to be related to parallelism.

2) There seems to be an explosion of temporary files. We don't have that
in explain, but I queried pg_stat_database before/after the query, and
there's huge difference. Both start at

temp_files | 112
temp_bytes | 1942275280

so 112 files, ~2GB disk space. But after the query, 15 says

temp_files | 721
temp_bytes | 2755839184

while 16 has

temp_files | 2078995
temp_bytes | 70607906000

2M files and 70GB? Wow!

3) Indeed, before the query completes the pgsql_tmp directory has this:

63M pgsql_tmp3499395.0.fileset
63G pgsql_tmp3499395.1.fileset
95M pgsql_tmp3499395.2.fileset
95M pgsql_tmp3499395.3.fileset
127M pgsql_tmp3499395.4.fileset

So I guess that's one of the parallel hash joins doing something, and
consuming 63GB of disk space? I don't see anything suspicious in the
plan, but I assume parallel HJ may not report the relevant stats.

FWIW bumping up work_mem (to 64MB) solved this with the sample data.

I suspect this is going to be something like the hash join explosion,
where we just happen to add more and more batches. I don't have time to
investigate this more at the moment.

FWIW I think that's what's happening. If I add an elog(WARNING) into
ExecParallelHashJoinSetUpBatches, I see this:

WARNING: 0x55dbe375a5e8 initializing 16 batches
WARNING: 0x7f3868a3a978 initializing 32 batches
WARNING: 0x7f3868a3ab80 initializing 4 batches
WARNING: 0x55dbe36148c0 initializing 4 batches
WARNING: 0x7f3868a3b230 initializing 16 batches
WARNING: 0x7f3868a3a978 initializing 64 batches
WARNING: 0x55dbe36144b0 initializing 128 batches
WARNING: 0x55dbe36144b0 initializing 256 batches
WARNING: 0x55dbe36144b0 initializing 512 batches
WARNING: 0x55dbe36144b0 initializing 1024 batches
WARNING: 0x7f3868a3a978 initializing 2048 batches
WARNING: 0x7f3868a3a978 initializing 4096 batches
WARNING: 0x55dbe36144b0 initializing 8192 batches
WARNING: 0x55dbe36144b0 initializing 16384 batches
WARNING: 0x55dbe36144b0 initializing 32768 batches
WARNING: 0x7f3868a3a978 initializing 65536 batches
WARNING: 0x55dbe36144b0 initializing 131072 batches
WARNING: 0x7f3868a3a978 initializing 262144 batches

so we're ending up with 256k batches, for this one join. I'm not sure how
exactly this maps to the 2M files from pg_stat_database, but it means
~0.5M tuplestores and ~10GB virtual memory (at least per top).

I don't know what triggers the batch increase, but I still suspect it's
similar to the explosion we fixed (or mitigated) in PG18, but only for
serial (non-parallel) joins.

regards

--
Tomas Vondra

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#7)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

On 4/2/26 21:00, Tomas Vondra wrote:

...
FWIW I think that's what's happening. If I add an elog(WARNING) into
ExecParallelHashJoinSetUpBatches, I see this:

WARNING: 0x55dbe375a5e8 initializing 16 batches
WARNING: 0x7f3868a3a978 initializing 32 batches
WARNING: 0x7f3868a3ab80 initializing 4 batches
WARNING: 0x55dbe36148c0 initializing 4 batches
WARNING: 0x7f3868a3b230 initializing 16 batches
WARNING: 0x7f3868a3a978 initializing 64 batches
WARNING: 0x55dbe36144b0 initializing 128 batches
WARNING: 0x55dbe36144b0 initializing 256 batches
WARNING: 0x55dbe36144b0 initializing 512 batches
WARNING: 0x55dbe36144b0 initializing 1024 batches
WARNING: 0x7f3868a3a978 initializing 2048 batches
WARNING: 0x7f3868a3a978 initializing 4096 batches
WARNING: 0x55dbe36144b0 initializing 8192 batches
WARNING: 0x55dbe36144b0 initializing 16384 batches
WARNING: 0x55dbe36144b0 initializing 32768 batches
WARNING: 0x7f3868a3a978 initializing 65536 batches
WARNING: 0x55dbe36144b0 initializing 131072 batches
WARNING: 0x7f3868a3a978 initializing 262144 batches

so we're ending up with 256k batches, for this one join. I'm not sure how
exactly this maps to the 2M files from pg_stat_database, but it means
~0.5M tuplestores and ~10GB virtual memory (at least per top).

I don't know what triggers the batch increase, but I still suspect it's
similar to the explosion we fixed (or mitigated) in PG18, but only for
serial (non-parallel) joins.

An interesting question is "What changed in PG16?" that causes the query to
fail, when it worked OK on earlier versions. I guess the main suspect is
this item from the release notes:

Allow parallelization of FULL and internal right OUTER hash joins

So I guess it might be interesting to flip the joins to inner, see if it
still fails like that, and then see if that crashes on PG15 too.

Although the query has only inner and left outer joins, which seems
unrelated to the change. It might be simply a consequence of the planner
picking a different join tree (due to some general optimizer changes).

It might be interesting to try forcing the same join tree (which might
be possible with join_collapse_limit=1) on PG15. Maybe it'll fail the
same way?
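
A sketch of that experiment; with join_collapse_limit=1 the planner keeps the
explicit JOIN order as written, so the query text would need to be rearranged
to spell out the join tree from the 16 plan (the table aliases here are just
illustrative shorthand):

```sql
SET join_collapse_limit = 1;  -- planner follows the textual join order
EXPLAIN ANALYZE
SELECT count(a.id)
FROM attachments.attachments a
JOIN attachments.folders f ON f.id = a.folder_id
-- ... remaining joins, reordered to match the PG16 plan ...
;
```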

Maybe it'd be easier to try reducing the query first, before doing any
of this. Start removing the joins one by one from the "top" (per the
explain), until it stops failing. That might leave a much smaller query.

regards

--
Tomas Vondra

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#8)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

Tomas Vondra <tomas@vondra.me> writes:

An interesting question is "What changed in PG16?" causing the query to
fail, when it worked OK on earlier versions.

"git bisect" could be informative here. I agree with trying to
minimize the query first, though --- else you may waste time
going down blind alleys, as a result of planner changes changing
the join order without affecting the critical executor behavior.

regards, tom lane

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#9)
Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

On 4/3/26 00:43, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

An interesting question is "What changed in PG16?" causing the query to
fail, when it worked OK on earlier versions.

"git bisect" could be informative here. I agree with trying to
minimize the query first, though --- else you may waste time
going down blind alleys, as a result of planner changes changing
the join order without affecting the critical executor behavior.

I did a bit of bisecting today (with the full query), and unsurprisingly
it started failing at:

commit 11c2d6fdf5af1aacec9ca2005543f1b0fc4cc364 (HEAD -> hashjoin-explosion-bisect)
Author: Thomas Munro <tmunro@postgresql.org>
Date: Fri Mar 31 11:01:51 2023 +1300

Parallel Hash Full Join.

Full and right outer joins were not supported in the initial
implementation of Parallel Hash Join because of deadlock hazards
(see discussion). Therefore FULL JOIN inhibited parallelism, as
the other join strategies can't do that in parallel either.

...

Although it's a bit strange: AFAIK the query does not have any full
outer join. Also, for me it now fails like this:

Sat Apr 4 04:00:58 PM CEST 2026
ERROR: invalid DSA memory alloc request size 1811939328
CONTEXT: parallel worker
Sat Apr 4 04:02:04 PM CEST 2026

I believe it's the same issue (I still get the same tempfile explosion).

After a bit of trial-and-error I managed to reduce the query to a single
join:

SET parallel_setup_cost = 0;
SET cpu_tuple_cost = 1;
SET enable_nestloop = off;

EXPLAIN ANALYZE SELECT *
FROM attachments.folders
LEFT OUTER JOIN events.contributions
ON events.contributions.id = attachments.folders.contribution_id;

The trick is to force it to do a parallel hash join by adjusting the CPU
costs. I don't think it can be reduced any further; even just switching
to an inner join makes it work fine.

At this point I was suspecting the data distributions for the join
columns may be somewhat weird, causing issues for the hashjoin batching.
For events.contributions.id it's perfectly fine - it's entirely unique,
with each ID having 1 entry. Unsurprisingly, because it's the PK. But
for attachments.folders.contribution_id I see this:

SELECT contribution_id, count(*) FROM attachments.folders
GROUP BY contribution_id ORDER BY 2 DESC;

contribution_id | count
-----------------+--------
| 464515
5492978 | 67
4117499 | 42
4045045 | 41
...

So there's ~500k entries with NULL, that can't possibly match anything
(right)? I assume we still add them to the hash, though.
Because if I explicitly filter them out, it starts working fine:

EXPLAIN ANALYZE SELECT *
FROM attachments.folders
LEFT OUTER JOIN events.contributions
ON events.contributions.id = attachments.folders.contribution_id
WHERE attachments.folders.contribution_id IS NOT NULL;
...
Planning Time: 0.192 ms
Execution Time: 670.950 ms

and when I invert the condition (to IS NULL), it starts failing pretty
much right away.

regards

--
Tomas Vondra