Use BumpContext contexts for TupleHashTables' tablecxt
[ starting a new thread to keep this separate from the estimation
issue ]
I looked at the callers of BuildTupleHashTable, and realized that
(a) every one of them can use a BumpContext for the "tablecxt",
and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
cc721c459. So we have precedent for the idea working. Here's
a fleshed-out patch to fix the remaining callers.
regards, tom lane
Attachments:
v1-0001-Use-BumpContext-contexts-for-TupleHashTables-tabl.patch (text/x-diff)
On Mon, 27 Oct 2025 at 09:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ starting a new thread to keep this separate from the estimation
issue ]
I looked at the callers of BuildTupleHashTable, and realized that
(a) every one of them can use a BumpContext for the "tablecxt",
and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
cc721c459. So we have precedent for the idea working. Here's
a fleshed-out patch to fix the remaining callers.
Yeah, this should give a decent performance improvement for larger workloads.
I can't help wonder if we can improve the memory context parameter
names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
remind myself that it's not for the bucket array, just the stuff we
have the buckets point to. Would "hashedtuplecxt" be better?
If we did that, then the context names could use some of the same
treatment, "RecursiveUnion hashed tuples" and "SetOp hashed tuples" or
something.
The field name for TupleHashTableData.tablecxt and the comment (/*
memory context containing table */) also leads me into thinking it's
for the bucket array.
Maybe I'm unique in having that problem...
David
David Rowley <dgrowleyml@gmail.com> writes:
I can't help wonder if we can improve the memory context parameter
names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
remind myself that it's not for the bucket array, just the stuff we
have the buckets point to. Would "hashedtuplecxt" be better?
I agree these names are not great. I think they might be leftovers
from a time when there actually was a complete hash-table structure
in that context.
Related to this, while I was chasing Jeff's complaint I realized that
the none-too-small simplehash table for this is getting made in the
query's ExecutorState. That's pretty awful from the standpoint of
being able to blame memory consumption on the hash node. I'm not
sure though if we want to go so far as to make another context just
for the simplehash table. We could keep it in that same "tablecxt"
at the price of destroying and rebuilding the simplehash table, not
just resetting it, at each node rescan. But that's not ideal either.
regards, tom lane
On Mon, 27 Oct 2025 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Related to this, while I was chasing Jeff's complaint I realized that
the none-too-small simplehash table for this is getting made in the
query's ExecutorState. That's pretty awful from the standpoint of
being able to blame memory consumption on the hash node. I'm not
sure though if we want to go so far as to make another context just
for the simplehash table. We could keep it in that same "tablecxt"
at the price of destroying and rebuilding the simplehash table, not
just resetting it, at each node rescan. But that's not ideal either.
I don't think you could do that with your patch, as SH_GROW() needs
to pfree the old bucket array after rehashing, which bump won't like.
David
David Rowley <dgrowleyml@gmail.com> writes:
On Mon, 27 Oct 2025 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
... We could keep it in that same "tablecxt"
I don't think you could do that with your patch, as SH_GROW() needs
to pfree the old bucket array after rehashing, which bump won't like.
Ah, good point. Nevermind that idea then ...
regards, tom lane
On Mon, 27 Oct 2025 at 10:46, David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 27 Oct 2025 at 09:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[ starting a new thread to keep this separate from the estimation
issue ]
I looked at the callers of BuildTupleHashTable, and realized that
(a) every one of them can use a BumpContext for the "tablecxt",
and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
cc721c459. So we have precedent for the idea working. Here's
a fleshed-out patch to fix the remaining callers.
Yeah, this should give a decent performance improvement for larger workloads.
I just did a quick test of this with the best-case I could imagine,
where all rows are filtered, thus reducing the additional overhead of
going into other nodes. Patched came out about 9% faster than master
(without MEMORY_CONTEXT_CHECKING).
Test 1)
Setup 1 million rows:
create table t1 as select generate_series(1,1_000_000)a;
vacuum freeze analyze t1;
set work_mem = '1GB';
set jit=0;
\timing on
Master:
select * from t1 except select * from t1;
Time: 343.660 ms
Time: 341.049 ms
Time: 352.762 ms
Patched:
select * from t1 except select * from t1;
Time: 312.576 ms
Time: 317.629 ms
Time: 323.980 ms (+8.73%)
Test 2)
Setup 10 million rows:
create table t2 as select generate_series(1,10_000_000)a;
vacuum freeze analyze t2;
set work_mem = '1GB';
set jit=0;
\timing on
Master:
select * from t2 except select * from t2;
Time: 4737.736 ms
Time: 4660.170 ms
Time: 4749.984 ms
Patched:
select * from t2 except select * from t2;
Time: 4307.161 ms
Time: 4339.134 ms
Time: 4279.902 ms (+9.45%)
David
On Oct 27, 2025, at 04:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked at the callers of BuildTupleHashTable, and realized that
(a) every one of them can use a BumpContext for the "tablecxt",
and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
cc721c459. So we have precedent for the idea working. Here's
a fleshed-out patch to fix the remaining callers.
regards, tom lane
From 15ef2e50085ac83728cb9189e482964ff02e5aae Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun, 26 Oct 2025 16:46:57 -0400
Subject: [PATCH v1] Use BumpContext contexts for TupleHashTables' tablecxt.

execGrouping.c itself does nothing with this context except to
allocate new hash entries in it, and the callers do nothing with it
except to reset the whole context. So this is an ideal use-case for
a BumpContext, and the hash tables are frequently big enough for the
savings to be significant.
I just did a test. My test procedure is as follows:
1. Set the following options in postgresql.conf:
```
work_mem = '512MB'
max_parallel_workers = 0
shared_buffers = '1GB'
log_statement_stats = on
```
2. Prepare some data:
```
CREATE TABLE t1 AS
SELECT g AS id, md5(g::text) AS txt
FROM generate_series(1, 5e6) g;
CREATE TABLE t2 AS
SELECT g AS id, md5(g::text) AS txt
FROM generate_series(1, 5e6) g
WHERE g % 2 = 0;
VACUUM FREEZE t1;
VACUUM FREEZE t2;
```
3. Run
```
\timing on
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t1
INTERSECT
SELECT * FROM t2;
```
Here is the test result:
Without the patch:
- stats:
```
2025-10-27 11:04:43.145 CST [88599] LOG: QUERY STATISTICS
2025-10-27 11:04:43.145 CST [88599] DETAIL: ! system usage stats:
! 1.609178 s user, 0.051652 s system, 1.661308 s elapsed
! [1.612991 s user, 0.056877 s system total]
! 601392 kB max resident size
! 0/0 [0/0] filesystem blocks in/out
! 1/37016 [1/37621] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [2/1] messages rcvd/sent
! 0/18 [0/44] voluntary/involuntary context switches
```
- SQL execution:
```
evantest=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t1 INTERSECT SELECT * FROM t2;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
HashSetOp Intersect (cost=174993.73..181243.73 rows=2500000 width=39) (actual time=1500.129..1565.293 rows=2500000.00 loops=1)
Buffers: shared read=62501
-> Seq Scan on t2 (cost=0.00..45834.00 rows=2500000 width=39) (actual time=0.323..68.214 rows=2500000.00 loops=1)
Buffers: shared read=20834
-> Seq Scan on t1 (cost=0.00..91662.15 rows=4999515 width=39) (actual time=0.077..113.602 rows=5000000.00 loops=1)
Buffers: shared read=41667
Planning:
Buffers: shared hit=75 read=25
Planning Time: 1.800 ms
Execution Time: 1655.722 ms
(10 rows)
Time: 1661.561 ms (00:01.662)
```
With the patch:
- stats:
```
2025-10-27 11:02:14.656 CST [87387] LOG: QUERY STATISTICS
2025-10-27 11:02:14.656 CST [87387] DETAIL: ! system usage stats:
! 1.625830 s user, 0.046199 s system, 1.672351 s elapsed
! [1.630466 s user, 0.052191 s system total]
! 487264 kB max resident size
! 0/0 [0/0] filesystem blocks in/out
! 1/29891 [1/30489] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [2/1] messages rcvd/sent
! 0/11 [0/36] voluntary/involuntary context switches
```
- SQL execution:
```
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
HashSetOp Intersect (cost=174993.73..181243.73 rows=2500000 width=39) (actual time=1471.164..1602.767 rows=2500000.00 loops=1)
Buffers: shared read=62501
-> Seq Scan on t2 (cost=0.00..45834.00 rows=2500000 width=39) (actual time=0.296..67.955 rows=2500000.00 loops=1)
Buffers: shared read=20834
-> Seq Scan on t1 (cost=0.00..91662.15 rows=4999515 width=39) (actual time=0.088..114.052 rows=5000000.00 loops=1)
Buffers: shared read=41667
Planning:
Buffers: shared hit=75 read=25
Planning Time: 1.875 ms
Execution Time: 1666.881 ms
(10 rows)
Time: 1672.574 ms (00:01.673)
```
Looks like:
* No obvious execution time improvement
* Max resident size reduced from 600MB to 487MB, ~19% reduction
* Page reclaims dropped from 37k -> 30k, ~19% reduction
I ran the test several rounds, and the results are consistent: roughly 19% memory usage reduction.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
David Rowley <dgrowleyml@gmail.com> writes:
On Mon, 27 Oct 2025 at 09:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked at the callers of BuildTupleHashTable, and realized that
(a) every one of them can use a BumpContext for the "tablecxt",
and (b) Jeff Davis already noticed that for nodeAgg.c, in commit
cc721c459. So we have precedent for the idea working. Here's
a fleshed-out patch to fix the remaining callers.
I just did a quick test of this with the best-case I could imagine,
where all rows are filtered, thus reducing the additional overhead of
going into other nodes. Patched came out about 9% faster than master
(without MEMORY_CONTEXT_CHECKING).
Hmm, I wasn't really expecting any direct time saving; the point
was about cutting memory consumption. So Chao Li's nearby results
are in line with mine.
regards, tom lane
On Mon, 27 Oct 2025 at 16:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hmm, I wasn't really expecting any direct time saving; the point
was about cutting memory consumption. So Chao Li's nearby results
are in line with mine.
It's for the same reason that Hash Join starts to run more slowly once
the hash table is larger than L3. Because the memory access pattern
when probing the hash table can't be predicted by the CPU, larger
tables will start having to shuffle cachelines in from RAM more often.
The same happens with smaller tables when having to go from L2 out to
L3 (and even L1d out to L2). If you graphed various different table
sizes, you'd see the performance dropping off per hash lookup as the
memory usage crosses cache size boundaries.
What you've done by using bump is made it so that more tuples will fit
in the same amount of memory, therefore increasing the chances that
useful cachelines are found.
If you happened to always probe the hash table in hash key order, then
this probably wouldn't happen (or at least to a lesser extent) as the
hardware prefetcher would see the forward pattern and prefetch the
memory.
David
I wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I can't help wonder if we can improve the memory context parameter
names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
remind myself that it's not for the bucket array, just the stuff we
have the buckets point to. Would "hashedtuplecxt" be better?
I agree these names are not great. I think they might be leftovers
from a time when there actually was a complete hash-table structure
in that context.
Looking closer, it seems like a lot of the confusion here arose when
TupleHashTables were rebased onto simplehash.h. That patch seems to
have paid about zero attention to updating comments that it'd
falsified or at least made misleading. Here's a v2 that tries to
clean up some of the mess.
I'm also proposing to move the MemoryContextReset of that context
into ResetTupleHashTable instead of documenting that we expect the
caller to do that. The old way just seems like a bizarre artifact.
Related to this, while I was chasing Jeff's complaint I realized that
the none-too-small simplehash table for this is getting made in the
query's ExecutorState. That's pretty awful from the standpoint of
being able to blame memory consumption on the hash node.
I didn't do anything about this. I briefly considered having
BuildTupleHashTable construct a per-hashtable context, but backed
off after seeing that nodeAgg.c goes to some effort to use the same
metacxt and tuplescxt for several hashtables. I'm not sure if that
had measured performance benefits behind it, but it well might have.
I now think that if we care, the right thing is to make the remaining
callers construct extra memory contexts for their hash tables. That
could be left for another day.
regards, tom lane
Attachments:
v2-0001-Use-BumpContext-contexts-in-TupleHashTables-and-d.patch (text/x-diff)
On Thu, 30 Oct 2025 at 08:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
David Rowley <dgrowleyml@gmail.com> writes:
I can't help wonder if we can improve the memory context parameter
names in BuildTupleHashTable(). Every time I see "tablecxt" I have to
remind myself that it's not for the bucket array, just the stuff we
have the buckets point to. Would "hashedtuplecxt" be better?
I agree these names are not great. I think they might be leftovers
from a time when there actually was a complete hash-table structure
in that context.
Looking closer, it seems like a lot of the confusion here arose when
TupleHashTables were rebased onto simplehash.h. That patch seems to
have paid about zero attention to updating comments that it'd
falsified or at least made misleading. Here's a v2 that tries to
clean up some of the mess.
Thanks for doing that cleanup. A good improvement. Just a couple of
notes from the review for your consideration:
1) The comment in ExecEndSetOp() says: "/* free subsidiary stuff
including hashtable */". How about adjusting that to "hashtable data"?
Is it worth mentioning there that we leave it up to the destruction of
the hash table itself to when the ExecutorState context is reset?
(Apparently execGrouping.c does not expose any way to do
tuplehash_destroy() anyway)
2) I'm not sure if it makes it neater or not, but did you consider
moving the creation of the setopstate->tuplesContext context into
build_hash_table()? The motivation there is that it keeps the
hash-related initialisations together.
3) Not for this patch, but wanted to note the observation: I see we
always create the hash table during plan startup in nodeSetOp.c. I
suspect this could cause the same issue for Setops as I fixed in
Memoize in 57f59396b, which fixed a slow EXPLAIN in [1]. That was
mostly down to a bug that overestimated the number of buckets, but
there's legitimate reasons to want lots of buckets too... a big table
and lots of work_mem.
create table t (a int);
insert into t values(1);
alter table t alter column a set (n_distinct = -1); -- all values distinct
analyze t;
-- hack to fool the planner into thinking it's a big table.
update pg_class set reltuples = 1e9 where oid = 't'::regclass;
set work_mem = '4GB';
\timing on
explain (summary on) select a from t except select a from t;
HashSetOp Except (cost=25000002.00..27500002.00 rows=1000000000 width=4)
Disabled: true
-> Seq Scan on t (cost=0.00..10000001.00 rows=1000000000 width=4)
-> Seq Scan on t t_1 (cost=0.00..10000001.00 rows=1000000000 width=4)
Planning Time: 0.091 ms <-- not a long time
(5 rows)
Time: 1658.866 ms (00:01.659) <-- a long time.
I'm also proposing to move the MemoryContextReset of that context
into ResetTupleHashTable instead of documenting that we expect the
caller to do that. The old way just seems like a bizarre artifact.
Related to this, while I was chasing Jeff's complaint I realized that
the none-too-small simplehash table for this is getting made in the
query's ExecutorState. That's pretty awful from the standpoint of
being able to blame memory consumption on the hash node.
I didn't do anything about this. I briefly considered having
BuildTupleHashTable construct a per-hashtable context, but backed
off after seeing that nodeAgg.c goes to some effort to use the same
metacxt and tuplescxt for several hashtables. I'm not sure if that
had measured performance benefits behind it, but it well might have.
I now think that if we care, the right thing is to make the remaining
callers construct extra memory contexts for their hash tables. That
could be left for another day.
Reminds me of 590b045c3, but I think maybe not as bad as the bucket
array for the hash table is more likely to be a large allocation,
which aset.c will put on a dedicated AllocBlock rather than sharing an
AllocBlock with other chunks. ResetTupleHashTable() also just results
in a SH_RESET(), which doesn't do any pfreeing work (just memset()).
i.e. shouldn't cause fragmentation due to many rescans. Before
590b045c3, we used to pfree() all tuples at tuplestore_end(), which
happens in a Materialize node's rescan. That resulted in bad
fragmentation of the ExecutorState context. Nothing like that is
happening here, so I'm not that worried about it, though I agree that
it isn't ideal.
David
[1]: /messages/by-id/CAFj8pRAMp=QsMi6sPQJ4W3hczoFJRvyXHJV3AZAZaMyTVM312Q@mail.gmail.com
On Sun, 2025-10-26 at 18:11 -0400, Tom Lane wrote:
Related to this, while I was chasing Jeff's complaint I realized that
the none-too-small simplehash table for this is getting made in the
query's ExecutorState. That's pretty awful from the standpoint of
being able to blame memory consumption on the hash node. I'm not
sure though if we want to go so far as to make another context just
for the simplehash table. We could keep it in that same "tablecxt"
at the price of destroying and rebuilding the simplehash table, not
just resetting it, at each node rescan. But that's not ideal either.
I had investigated the idea of destroying/rebuilding the simplehash
table regardless because, in some cases, it can crowd out space for the
transition states, and we end up with a mostly-empty bucket array that
causes recursive spilling.
I'm not sure if that's a practical problem -- all the heuristics are
designed to avoid that situation -- but I can create the situation with
injection points.
Also, in the case of spilled grouping sets, once one group is
completely done with all spilled data it would be good to destroy that
hash table. By adjusting the order we process the spilled data, we
could minimize the creation/destruction of the bucket array.
However, rebuilding it has a cost, so we should only do that when there
is really a problem. I came to the conclusion it would require
significant refactoring to make that work well, and I didn't get around
to it yet.
Regards,
Jeff Davis
David Rowley <dgrowleyml@gmail.com> writes:
On Thu, 30 Oct 2025 at 08:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Here's a v2 that tries to clean up some of the mess.
Thanks for doing that cleanup. A good improvement.
Thanks for reviewing!
Just a couple of
notes from the review for your consideration:
1) The comment in ExecEndSetOp() says: "/* free subsidiary stuff
including hashtable */". How about adjusting that to "hashtable data"?
Agreed, also in ExecEndRecursiveUnion.
Is it worth mentioning there that we leave it up to the destruction of
the hash table itself to when the ExecutorState context is reset?
(Apparently execGrouping.c does not expose any way to do
tuplehash_destroy() anyway)
Yeah. I think this is better documented in execGrouping.c. I propose
adding this:
@@ -169,6 +169,13 @@ execTuplesHashPrepare(int numCols,
* reasonably often, typically once per tuple. (We do it that way, rather
* than managing an extra context within the hashtable, because in many cases
* the caller can specify a tempcxt that it needs to reset per-tuple anyway.)
+ *
+ * We don't currently provide DestroyTupleHashTable functionality; the hash
+ * table will be cleaned up at destruction of the metacxt. (Some callers
+ * bother to delete the tuplescxt explicitly, though it'd be sufficient to
+ * ensure it's a child of the metacxt.) There's not much point in working
+ * harder than this so long as the expression-evaluation infrastructure
+ * behaves similarly.
*/
TupleHashTable
BuildTupleHashTable(PlanState *parent,
2) I'm not sure if it makes it neater or not, but did you consider
moving the creation of the setopstate->tuplesContext context into
build_hash_table()? The motivation there is that it keeps the
hash-related initialisations together.
Hmm, could do that, but all of these callers currently follow the
pattern of creating these contexts in the plan nodes' Init functions
and destroying them in the End functions. I don't see a big advantage
in changing that.
3) Not for this patch, but wanted to note the observation: I see we
always create the hash table during plan startup in nodeSetOp.c. I
suspect this could cause the same issue for Setops as I fixed in
Memoize in 57f59396b, which fixed a slow EXPLAIN in [1].
Fair point. I'd be inclined to just skip the hashtable creation
if (eflags & EXEC_FLAG_EXPLAIN_ONLY), like nodeAgg.c did.
I wonder though if skipping the initialization of the hashtable
execution expressions has any visible impact on what EXPLAIN can
print. I have a vague recollection that there were some places
where we backed off similar optimizations because of concerns
like that.
I now think that if we care, the right thing is to make the remaining
callers construct extra memory contexts for their hash tables. That
could be left for another day.
Reminds me of 590b045c3, but I think maybe not as bad as the bucket
array for the hash table is more likely to be a large allocation,
which aset.c will put on a dedicated AllocBlock rather than sharing an
AllocBlock with other chunks. ResetTupleHashTable() also just results
in a SH_RESET(), which doesn't do any pfreeing work (just memset()).
i.e. shouldn't cause fragmentation due to many rescans.
My concern here was just whether you could tell by MemoryContextStats
inspection how much to blame on the hash table.
regards, tom lane
On Thu, 30 Oct 2025 at 13:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
Is it worth mentioning there that we leave it up to the destruction of
the hash table itself to when the ExecutorState context is reset?
(Apparently execGrouping.c does not expose any way to do
tuplehash_destroy() anyway)
Yeah. I think this is better documented in execGrouping.c. I propose
adding this:
@@ -169,6 +169,13 @@ execTuplesHashPrepare(int numCols,
* reasonably often, typically once per tuple. (We do it that way, rather
* than managing an extra context within the hashtable, because in many cases
* the caller can specify a tempcxt that it needs to reset per-tuple anyway.)
+ *
+ * We don't currently provide DestroyTupleHashTable functionality; the hash
+ * table will be cleaned up at destruction of the metacxt. (Some callers
+ * bother to delete the tuplescxt explicitly, though it'd be sufficient to
+ * ensure it's a child of the metacxt.) There's not much point in working
+ * harder than this so long as the expression-evaluation infrastructure
+ * behaves similarly.
*/
TupleHashTable
BuildTupleHashTable(PlanState *parent,
Looks good.
2) I'm not sure if it makes it neater or not, but did you consider
moving the creation of the setopstate->tuplesContext context into
build_hash_table()? The motivation there is that it keeps the
hash-related initialisations together.Hmm, could do that, but all of these callers currently follow the
pattern of creating these contexts in the plan nodes' Init functions
and destroying them in the End functions. I don't see a big advantage
in changing that.
All good. I'm fine either way.
3) Not for this patch, but wanted to note the observation: I see we
always create the hash table during plan startup in nodeSetOp.c. I
suspect this could cause the same issue for Setops as I fixed in
Memoize in 57f59396b, which fixed a slow EXPLAIN in [1].
Fair point. I'd be inclined to just skip the hashtable creation
if (eflags & EXEC_FLAG_EXPLAIN_ONLY), like nodeAgg.c did.
I wonder though if skipping the initialization of the hashtable
execution expressions has any visible impact on what EXPLAIN can
print. I have a vague recollection that there were some places
where we backed off similar optimizations because of concerns
like that.
I recall some discussion around JIT and EXPLAIN (ANALYZE OFF). I seem
to have a branch here that disables it completely, but I recall Andres
or you rejecting the idea. I can't seem to find the thread. I agree it
would be quite strange if the JIT function count were to differ
between EXPLAIN and EXPLAIN ANALYZE, but if we have to contort our
code too much to make that work, then I do question the usefulness of
showing the JIT details in EXPLAIN (ANALYZE OFF). Anyway, maybe off
topic for this thread.
I'm fine with the patch, assuming it now includes the new
BuildTupleHashTable() comment and adjusted
ExecEndRecursiveUnion/ExecEndSetOp comments.
David
David Rowley <dgrowleyml@gmail.com> writes:
I'm fine with the patch, assuming it now includes the new
BuildTupleHashTable() comment and adjusted
ExecEndRecursiveUnion/ExecEndSetOp comments.
OK, pushed after a tiny bit more comment-smithing. Thanks
for reviewing!
regards, tom lane