Should HashSetOp go away

Started by Jeff Janes · 6 months ago · 11 messages · pgsql-hackers
#1Jeff Janes
jeff.janes@gmail.com

Many years ago I ran into some problems doing maintenance tasks checking
for short identifiers which existed in one table but not another. It would
choose HashSetOp, which was memory inefficient, and it was also unaware of
how memory inefficient it was, leading it to blow well past work_mem, by
many fold. So if you set work_mem to a large value in order to get
maintenance operations over with quickly while the system is basically
single-user, it could cause crashes. It certainly isn't the only part of
PostgreSQL which is memory inefficient, but it did seem particularly
egregious.

I noticed some changes in this code in v18, so I wanted to revisit the issue.
Under commit 27627929528e, it looks like it got 25% more memory efficient,
but it thinks it got 40% more efficient, so the memory use got better but
the estimation actually got worse.

Using the data:
create table jj as select lpad(x::text,15,'0') from
generate_series(1,10000000) f(x);

And the dummy query:
select lpad from jj except select lpad from jj;

In v17, it needs a work_mem of at least 270MB to choose HashSetOp, and then
actually uses 1.3GB (as determined by 'max resident size' from
log_executor_stats, which is not perfect but should be pretty close; I
intentionally used a small shared_buffers so that it didn't contribute much
to the memory usage). In 18devel at commit 276279, it needs a work_mem of at
least 160MB while actually using 1.0GB. Under 18.0, the numbers are slightly
but not meaningfully different from commit 276279.

I was thinking of ways to improve the memory usage (or at least its
estimation) but decided maybe it would be better if HashSetOp went away
entirely. As far as I can tell HashSetOp has nothing to recommend it other
than the fact that it already exists. If we instead used an elaboration on
Hash Anti Join, then it would automatically get spilling to disk, parallel
operations, better estimation, and the benefits of whatever micro
optimizations people lavish on the highly used HashJoin machinery but not
the obscure, little-used HashSetOp.

It would need to elaborate the HashAntiJoin so that it can deduplicate one
input (in the case of EXCEPT) or count the other input (in the case of
EXCEPT ALL).

Is there some reason this is not feasible?

Yes, I could (and did) rewrite my query to force it to use the AntiJoin,
but why should people need to do that when the planner can do it instead?
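
For the record, a rewrite along these lines does it (one possible formulation, shown here for reference; the DISTINCT matches EXCEPT's deduplication, and with nullable columns you'd also want IS NOT DISTINCT FROM to match EXCEPT's NULLs-are-equal behavior):

```sql
-- Anti-join formulation of the EXCEPT query above, which lets the
-- planner pick Hash Anti Join (able to spill to disk) over HashSetOp:
select distinct lpad
from jj a
where not exists (select 1 from jj b where b.lpad = a.lpad);
```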

Cheers,

Jeff

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#1)
Re: Should HashSetOp go away

Jeff Janes <jeff.janes@gmail.com> writes:

I noticed some changes in this code v18, so wanted to revisit the issue.
Under commit 27627929528e, it looks like it got 25% more memory efficient,
but it thinks it got 40% more efficient, so the memory use got better but
the estimation actually got worse.

Hmm, so why not fix that estimation?

I was thinking of ways to improve the memory usage (or at least its
estimation) but decided maybe it would be better if HashSetOp went away
entirely. As far as I can tell HashSetOp has nothing to recommend it other
than the fact that it already exists. If we instead used an elaboration on
Hash Anti Join, then it would automatically get spilling to disk, parallel
operations, better estimation, and the benefits of whatever micro
optimizations people lavish on the highly used HashJoin machinery but not
the obscure, little-used HashSetOp.

This seems like a pretty bad solution. It would imply exporting the
complexities of duplicate-counting for EXCEPT ALL and INTERSECT ALL
modes into the hash-join logic. We don't need that extra complexity
there (it's more than enough of a mess already), and we don't need
whatever performance hit ordinary hash joins would take.

Also, I doubt the problem is confined to nodeSetOp. I think this is
fundamentally a complaint about BuildTupleHashTable and friends being
unable to spill to disk. Since we also use that logic for hashed
aggregates, RecursiveUnion, and hashed SubPlans, getting rid of
nodeSetOp isn't going to move the needle very far.

regards, tom lane

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: Should HashSetOp go away

I wrote:

Jeff Janes <jeff.janes@gmail.com> writes:

I noticed some changes in this code v18, so wanted to revisit the issue.
Under commit 27627929528e, it looks like it got 25% more memory efficient,
but it thinks it got 40% more efficient, so the memory use got better but
the estimation actually got worse.

Hmm, so why not fix that estimation?

So I poked at this a little bit, and found a few factors affecting it:

* Tuple hash tables are typically given license to use twice work_mem;
see hash_mem_multiplier. Not sure if you were accounting for that
in your test.

* create_setop_path's required-space estimate of entrysize * numGroups
was rather lame before commit 5dfc19814, and it's even more so
afterwards. It's basically only accounting for the tuples themselves,
and not either the hashtable overhead or the SetOpStatePerGroupData
counter space. With wide tuples that might disappear into the noise,
but with only 16-ish data bytes per tuple it's all about the overhead.
On my machine this example uses 80 bytes per tuple in the "SetOp hash
table" context and another 16 or more in the simplehash hashtable.
So about triple what the planner thought.

* We can buy some of this back nearly for free, by switching the
SetOp hash table context to be a BumpContext not an AllocSetContext.
That doesn't move the needle too much in this example with
MEMORY_CONTEXT_CHECKING on, but with it off, the SetOp hash table
consumption drops to 48 bytes/tuple. (Also, for other tuple widths,
AllocSetContext's round-up-to-power-of-2 behavior could hurt a lot
more than it does here.)

* To do better, we probably need to take this computation out of the
planner and have execGrouping.c expose a function to estimate the
TupleHashTable size for N entries and such-and-such average data
width. That in turn will require simplehash.h to expose a function
for estimating the size of its tables, because AFAICS its callers are
not supposed to know such details.

Attached is a quick-hack patch to use a BumpContext for this purpose
in nodeSetOp.c. We should probably look at whether other users of
BuildTupleHashTable can do similarly. I haven't thought hard about
what better-factorized space estimation would look like.
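
As a back-of-the-envelope check on those numbers (a Python model using the
per-tuple figures above; the exact values vary by platform and build
options):

```python
# Rough model of estimated vs. observed per-tuple memory for the example
# query in this thread (64-bit build, ~16 data bytes per tuple assumed).

def maxalign(n, alignof=8):
    """Round n up to a multiple of alignof, like PostgreSQL's MAXALIGN."""
    return (n + alignof - 1) & ~(alignof - 1)

n_tuples = 10_000_000
data_width = 16                 # "16-ish data bytes per tuple"

# The planner's entrysize * numGroups estimate: just the tuples.
planner_per_tuple = maxalign(16) + maxalign(data_width)   # header + data

# Observed: ~80 bytes/tuple in the "SetOp hash table" context plus
# ~16 more in the simplehash table itself.
observed_per_tuple = 80 + 16

print(f"planner: ~{planner_per_tuple * n_tuples / 1e9:.2f} GB")
print(f"actual:  ~{observed_per_tuple * n_tuples / 1e9:.2f} GB")
print(f"ratio:   ~{observed_per_tuple / planner_per_tuple:.1f}x")
```

which reproduces both the ~1GB actually observed and the roughly 3x
underestimate.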

regards, tom lane

Attachments:

use-bump-context-in-nodeSetOp.patch (text/x-diff, +6/-4)
#4David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#2)
Re: Should HashSetOp go away

On Mon, 27 Oct 2025 at 07:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.janes@gmail.com> writes:

I was thinking of ways to improve the memory usage (or at least its
estimation) but decided maybe it would be better if HashSetOp went away
entirely. As far as I can tell HashSetOp has nothing to recommend it other
than the fact that it already exists. If we instead used an elaboration on
Hash Anti Join, then it would automatically get spilling to disk, parallel
operations, better estimation, and the benefits of whatever micro
optimizations people lavish on the highly used HashJoin machinery but not
the obscure, little-used HashSetOp.

This seems like a pretty bad solution. It would imply exporting the
complexities of duplicate-counting for EXCEPT ALL and INTERSECT ALL
modes into the hash-join logic. We don't need that extra complexity
there (it's more than enough of a mess already), and we don't need
whatever performance hit ordinary hash joins would take.

If Hash Joins did support IS NOT DISTINCT FROM clauses, then at least
the non-ALL cases could be done with Hash Semi Join and Hash Anti Join
for INTERSECT and EXCEPT, respectively, followed by a HashAgg. I doubt
it would be any faster for the general case, but at least it would
allow those setop queries to run when the inputs don't fit in memory.
It's not ideal though, as when the planner underestimates, Hashed
Setops could still blow up.
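
Spelled out, those non-ALL plan shapes would correspond to rewrites like
these (hypothetical tables ta/tb with column x; the IS NOT DISTINCT FROM is
what preserves the setops' NULLs-compare-equal semantics, and the DISTINCT
plays the role of the final HashAgg):

```sql
-- INTERSECT as a semi join plus deduplication:
select distinct a.x
from ta a
where exists (select 1 from tb b where b.x is not distinct from a.x);

-- EXCEPT as an anti join plus deduplication:
select distinct a.x
from ta a
where not exists (select 1 from tb b where b.x is not distinct from a.x);
```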

David

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#4)
Re: Should HashSetOp go away

David Rowley <dgrowleyml@gmail.com> writes:

If Hash Joins did support IS NOT DISTINCT FROM clauses, then at least
the non-ALL cases could be done with Hash Semi Join and Hash Anti Join
for INTERSECT and EXCEPT, respectively, followed by a HashAgg. I doubt
it would be any faster for the general case, but at least it would
allow those setop queries to run when the inputs don't fit in memory.
It's not ideal though, as when the planner underestimates, Hashed
Setops could still blow up.

Yeah. As I hinted before, I think a better answer would be to teach
TupleHashTables to be able to spill to disk at need. No idea how
much work that would be, but it would fix all users of that code
not just one of them.

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: Should HashSetOp go away

I wrote:

* create_setop_path's required-space estimate of entrysize * numGroups
was rather lame before commit 5dfc19814, and it's even more so
afterwards. It's basically only accounting for the tuples themselves,
and not either the hashtable overhead or the SetOpStatePerGroupData
counter space. With wide tuples that might disappear into the noise,
but with only 16-ish data bytes per tuple it's all about the overhead.
On my machine this example uses 80 bytes per tuple in the "SetOp hash
table" context and another 16 or more in the simplehash hashtable.
So about triple what the planner thought.

* To do better, we probably need to take this computation out of the
planner and have execGrouping.c expose a function to estimate the
TupleHashTable size for N entries and such-and-such average data
width. That in turn will require simplehash.h to expose a function
for estimating the size of its tables, because AFAICS its callers are
not supposed to know such details.

Here's a pair of patches to try to do better. The first one
is concerned with getting more realistic size estimates for
TupleHashTables in the planner. The second is some mop-up
that's been pending for a long time in the same area, namely
getting rid of "long int" field types in Plan nodes.

With 0001, the planner's estimate of the amount of space needed
for your example query seems to be pretty dead-on, at least in
non-debug builds.

regards, tom lane

Attachments:

v1-0001-Improve-planner-s-estimates-of-tuple-hash-table-s.patch (text/x-diff, +192/-32)
v1-0002-Change-long-numGroups-fields-to-be-Cardinality-i..patch (text/x-diff, +50/-95)
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: Should HashSetOp go away

I wrote:

Here's a pair of patches to try to do better. The first one
is concerned with getting more realistic size estimates for
TupleHashTables in the planner. The second is some mop-up
that's been pending for a long time in the same area, namely
getting rid of "long int" field types in Plan nodes.

Meh ... cfbot found a compiler warning that I'd not seen locally.
v2 attached silences that, and I twiddled a couple of comments.

regards, tom lane

Attachments:

v2-0001-Improve-planner-s-estimates-of-tuple-hash-table-s.patch (text/x-diff, +203/-32)
v2-0002-Change-long-numGroups-fields-to-be-Cardinality-i..patch (text/x-diff, +50/-95)
#8David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#7)
Re: Should HashSetOp go away

On Sat, 1 Nov 2025 at 08:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Here's a pair of patches to try to do better. The first one
is concerned with getting more realistic size estimates for
TupleHashTables in the planner. The second is some mop-up
that's been pending for a long time in the same area, namely
getting rid of "long int" field types in Plan nodes.

I had a look at the v2 patches. Mostly quibbles, but #4 seems like an oversight.

0001:

1) For the following:

+    tuples_space = nentries * (MAXALIGN(SizeofMinimalTupleHeader) +
+                               MAXALIGN(tupleWidth) +
+                               MAXALIGN(additionalsize));

If I'm not mistaken, technically that should be
MAXALIGN(SizeofMinimalTupleHeader + tupleWidth) +
MAXALIGN(additionalsize), but in reality it should come out the same
since SizeofMinimalTupleHeader is 16. If that were to change then I
believe the extra MAXALIGN would start overestimating the memory.

2) Would it be better to reference the function name
"buildSubPlanHash" instead of "above" in:

+ * Adjust the rowcount estimate in the same way as above, except that we

I think "above" is ok when it's the same function, but when it's
talking about another function, it's a recipe for becoming outdated.
It'd be better to use the function name so we can grep for it when doing
refactoring work; otherwise we end up with commits like e3a0304eb...

3) Quite a collection of naming styles here.

+Size
+EstimateTupleHashTableSpace(double nentries,
+                            Size tupleWidth,
+                            Size additionalsize)
+{
+    Size        sh_space;
+    double        tuples_space;

4) I think this is missing a "/ SH_FILLFACTOR"

+ /* should be safe to convert to uint64 */
+ size = (uint64) nentries;

i.e. do what SH_CREATE does.
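
(For reference, the sizing in SH_CREATE works roughly like this Python
model, with SH_FILLFACTOR = 0.9 per simplehash.h; a sketch, not a copy of
the C code:)

```python
SH_FILLFACTOR = 0.9   # simplehash.h's default fill factor

def next_power_of_2(n):
    p = 1
    while p < n:
        p *= 2
    return p

def simplehash_nbuckets(nentries):
    # SH_CREATE grows the requested entry count by the fill factor so the
    # table keeps headroom, *then* rounds up to a power of two.
    return next_power_of_2(max(int(nentries / SH_FILLFACTOR), 2))

# Skipping the "/ SH_FILLFACTOR" step underestimates whenever the division
# pushes the count past a power-of-two boundary:
print(simplehash_nbuckets(60))            # 60 / 0.9 = 66.7 -> 128 buckets
print(next_power_of_2(max(60, 2)))        # without fill factor: only 64
```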

0002:

5) Is it worth switching "Max(nbuckets, 1);" to "nbuckets" in
hash_choose_num_buckets()? Looks like BuildTupleHashTable() will do
that part for us.

David

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#8)
Re: Should HashSetOp go away

David Rowley <dgrowleyml@gmail.com> writes:

I had a look at the v2 patches. Mostly quibbles, but #4 seems like an oversight.

Thanks for reviewing!

1) For the following:
+    tuples_space = nentries * (MAXALIGN(SizeofMinimalTupleHeader) +
+                               MAXALIGN(tupleWidth) +
+                               MAXALIGN(additionalsize));
If I'm not mistaken, technically that should be
MAXALIGN(SizeofMinimalTupleHeader + tupleWidth) +
MAXALIGN(additionalsize),

No, I think it's correct as written: the data payload of a tuple
must always start on a MAXALIGN boundary. As you say, it doesn't
matter as long as SizeofMinimalTupleHeader is 16, but I think this
way is formally correct. (It would matter more if we were trying
to account for tuples' null bitmaps ...)
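
(A quick Python model of the MAXALIGN arithmetic in question, assuming
8-byte MAXIMUM_ALIGNOF, illustrates this:)

```python
def maxalign(n, alignof=8):
    """Round n up to a multiple of alignof, like PostgreSQL's MAXALIGN."""
    return (n + alignof - 1) & ~(alignof - 1)

HEADER = 16  # SizeofMinimalTupleHeader: already a multiple of 8

# With an already-MAXALIGNed header, the two formulations always agree:
for width in range(64):
    assert maxalign(HEADER) + maxalign(width) == maxalign(HEADER + width)

# If the header size were not a multiple of 8, they could differ, and the
# separate-MAXALIGN form is the larger (formally correct) one:
odd_header, width = 14, 18
assert maxalign(odd_header) + maxalign(width) == 40   # data starts MAXALIGNed
assert maxalign(odd_header + width) == 32             # would pack them together
```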

2) Would it be better to reference the function name
"buildSubPlanHash" instead of "above" in:
+ * Adjust the rowcount estimate in the same way as above, except that we

Done.

3) Quite a collection of naming styles here.

Yeah :-( ... we work in an old and none-too-consistent code base.
Do you have any specific suggestions about which of these functions
might fit its surroundings better with a different caps style?

4) I think this is missing a "/ SH_FILLFACTOR"
+ /* should be safe to convert to uint64 */
+ size = (uint64) nentries;
i.e do what SH_CREATE does.

Oh! I think I got confused because some of that logic is in
SH_CREATE and some in SH_COMPUTE_SIZE :-(. Good catch.

5) Is it worth switching "Max(nbuckets, 1);" to "nbuckets" in
hash_choose_num_buckets()? Looks like BuildTupleHashTable() will do
that part for us.

Yeah, we could do that. I was trying to not touch more of nodeAgg
than I had to, but it seems sensible to not duplicate something
that BuildTupleHashTable will do.

v3 attached.

regards, tom lane

Attachments:

v3-0001-Improve-planner-s-estimates-of-tuple-hash-table-s.patch (text/x-diff, +206/-32)
v3-0002-Change-long-numGroups-fields-to-be-Cardinality-i..patch (text/x-diff, +55/-96)
#10David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#9)
Re: Should HashSetOp go away

On Mon, 3 Nov 2025 at 05:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

v3 attached.

I reviewed the changes. Looks good to me.

David

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#10)
Re: Should HashSetOp go away

David Rowley <dgrowleyml@gmail.com> writes:

On Mon, 3 Nov 2025 at 05:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:

v3 attached.

I reviewed the changes. Looks good to me.

Thanks again! I'll get that pushed shortly.

regards, tom lane