What to call an executor node which lazily caches tuples in a hash table?

Started by David Rowleyalmost 5 years ago9 messages
#1David Rowley
dgrowleyml@gmail.com

Hackers,

Over on [1]/messages/by-id/CAApHDvrPcQyQdWERGYWx8J+2DLUNgXu+fOSbQ1UscxrunyXyrQ@mail.gmail.com I've been working on adding a new type of executor node
which caches tuples in a hash table belonging to a given cache key.

The current sole use of this node type is to go between a
parameterized nested loop and the inner node in order to cache
previously seen sets of parameters so that we can skip scanning the
inner scan for parameter values that we've already cached. The node
could also be used to cache results from correlated subqueries,
although that's not done yet.

The cache limits itself to not use more than hash_mem by evicting the
least recently used entries whenever more space is needed for new
entries.

Currently, in the patch, the node is named "Result Cache". That name
was not carefully thought out. I just needed to pick something when
writing the code.

Here's an EXPLAIN output with the current name:

postgres=# explain (costs off) select relkind,c from pg_class c1,
lateral (select count(*) c from pg_class c2 where c1.relkind =
c2.relkind) c2;
QUERY PLAN
----------------------------------------------------
Nested Loop
-> Seq Scan on pg_class c1
-> Result Cache
Cache Key: c1.relkind
-> Aggregate
-> Seq Scan on pg_class c2
Filter: (c1.relkind = relkind)
(7 rows)

I just got off a team call with Andres, Thomas and Melanie. During the
call I mentioned that I didn't like the name "Result Cache". Many name
suggestions followed:

Here's a list of a few that were mentioned:

Probe Cache
Tuple Cache
Keyed Materialize
Hash Materialize
Result Cache
Cache
Hash Cache
Lazy Hash
Reactive Hash
Parameterized Hash
Parameterized Cache
Keyed Inner Cache
MRU Cache
MRU Hash

I was hoping to commit the final patch pretty soon, but thought I'd
have another go at seeing if we can get some consensus on a name
before doing that. Otherwise, I'd sort of assumed that we'd just reach
some consensus after everyone complained about the current name after
the feature is committed.

My personal preference is "Lazy Hash", but I feel it might be better
to use the word "Reactive" instead of "Lazy".

There was some previous discussion on the name in [2]/messages/by-id/CA+TgmoZMxLeanqrS00_p3xNsU3g1v3EKjNZ4dM02ShRxxLiDBw@mail.gmail.com. I suggested
some other names in [3]/messages/by-id/CAApHDvoj_sH1H3JVXgHuwnxf1FQbjRVOqqgxzOgJX13NiA9-cg@mail.gmail.com. Andy voted for "Tuple Cache" in [4]/messages/by-id/CAKU4AWoshM0JoymwBK6PKOFDMKg-OO6qtSVU_Piqb0dynxeL5w@mail.gmail.com

Votes? Other suggestions?

(I've included all the people who have shown some previous interest in
naming this node.)

David

[1]: /messages/by-id/CAApHDvrPcQyQdWERGYWx8J+2DLUNgXu+fOSbQ1UscxrunyXyrQ@mail.gmail.com
[2]: /messages/by-id/CA+TgmoZMxLeanqrS00_p3xNsU3g1v3EKjNZ4dM02ShRxxLiDBw@mail.gmail.com
[3]: /messages/by-id/CAApHDvoj_sH1H3JVXgHuwnxf1FQbjRVOqqgxzOgJX13NiA9-cg@mail.gmail.com
[4]: /messages/by-id/CAKU4AWoshM0JoymwBK6PKOFDMKg-OO6qtSVU_Piqb0dynxeL5w@mail.gmail.com

#2Zhihong Yu
zyu@yugabyte.com
In reply to: David Rowley (#1)
Re: What to call an executor node which lazily caches tuples in a hash table?

Hi,
I was reading this part of the description:

the Result Cache's
hash table is much smaller than the hash join's due to result cache only
caching useful values rather than all tuples from the inner side of the
join.

I think the word 'Result' should be part of the cache name considering the
above.

Cheers

On Tue, Mar 30, 2021 at 4:30 PM David Rowley <dgrowleyml@gmail.com> wrote:

Show quoted text

Hackers,

Over on [1] I've been working on adding a new type of executor node
which caches tuples in a hash table belonging to a given cache key.

The current sole use of this node type is to go between a
parameterized nested loop and the inner node in order to cache
previously seen sets of parameters so that we can skip scanning the
inner scan for parameter values that we've already cached. The node
could also be used to cache results from correlated subqueries,
although that's not done yet.

The cache limits itself to not use more than hash_mem by evicting the
least recently used entries whenever more space is needed for new
entries.

Currently, in the patch, the node is named "Result Cache". That name
was not carefully thought out. I just needed to pick something when
writing the code.

Here's an EXPLAIN output with the current name:

postgres=# explain (costs off) select relkind,c from pg_class c1,
lateral (select count(*) c from pg_class c2 where c1.relkind =
c2.relkind) c2;
QUERY PLAN
----------------------------------------------------
Nested Loop
-> Seq Scan on pg_class c1
-> Result Cache
Cache Key: c1.relkind
-> Aggregate
-> Seq Scan on pg_class c2
Filter: (c1.relkind = relkind)
(7 rows)

I just got off a team call with Andres, Thomas and Melanie. During the
call I mentioned that I didn't like the name "Result Cache". Many name
suggestions followed:

Here's a list of a few that were mentioned:

Probe Cache
Tuple Cache
Keyed Materialize
Hash Materialize
Result Cache
Cache
Hash Cache
Lazy Hash
Reactive Hash
Parameterized Hash
Parameterized Cache
Keyed Inner Cache
MRU Cache
MRU Hash

I was hoping to commit the final patch pretty soon, but thought I'd
have another go at seeing if we can get some consensus on a name
before doing that. Otherwise, I'd sort of assumed that we'd just reach
some consensus after everyone complained about the current name after
the feature is committed.

My personal preference is "Lazy Hash", but I feel it might be better
to use the word "Reactive" instead of "Lazy".

There was some previous discussion on the name in [2]. I suggested
some other names in [3]. Andy voted for "Tuple Cache" in [4]

Votes? Other suggestions?

(I've included all the people who have shown some previous interest in
naming this node.)

David

[1]
/messages/by-id/CAApHDvrPcQyQdWERGYWx8J+2DLUNgXu+fOSbQ1UscxrunyXyrQ@mail.gmail.com
[2]
/messages/by-id/CA+TgmoZMxLeanqrS00_p3xNsU3g1v3EKjNZ4dM02ShRxxLiDBw@mail.gmail.com
[3]
/messages/by-id/CAApHDvoj_sH1H3JVXgHuwnxf1FQbjRVOqqgxzOgJX13NiA9-cg@mail.gmail.com
[4]
/messages/by-id/CAKU4AWoshM0JoymwBK6PKOFDMKg-OO6qtSVU_Piqb0dynxeL5w@mail.gmail.com

#3Andy Fan
zhihui.fan1213@gmail.com
In reply to: Zhihong Yu (#2)
Re: What to call an executor node which lazily caches tuples in a hash table?

On Wed, Mar 31, 2021 at 7:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Hi,
I was reading this part of the description:

the Result Cache's
hash table is much smaller than the hash join's due to result cache only
caching useful values rather than all tuples from the inner side of the
join.

I think the word 'Result' should be part of the cache name considering the
above.

Cheers

On Tue, Mar 30, 2021 at 4:30 PM David Rowley <dgrowleyml@gmail.com> wrote:

Hackers,

Over on [1] I've been working on adding a new type of executor node
which caches tuples in a hash table belonging to a given cache key.

The current sole use of this node type is to go between a
parameterized nested loop and the inner node in order to cache
previously seen sets of parameters so that we can skip scanning the
inner scan for parameter values that we've already cached. The node
could also be used to cache results from correlated subqueries,
although that's not done yet.

The cache limits itself to not use more than hash_mem by evicting the
least recently used entries whenever more space is needed for new
entries.

Currently, in the patch, the node is named "Result Cache". That name
was not carefully thought out. I just needed to pick something when
writing the code.

Here's an EXPLAIN output with the current name:

postgres=# explain (costs off) select relkind,c from pg_class c1,
lateral (select count(*) c from pg_class c2 where c1.relkind =
c2.relkind) c2;
QUERY PLAN
----------------------------------------------------
Nested Loop
-> Seq Scan on pg_class c1
-> Result Cache
Cache Key: c1.relkind
-> Aggregate
-> Seq Scan on pg_class c2
Filter: (c1.relkind = relkind)
(7 rows)

I just got off a team call with Andres, Thomas and Melanie. During the
call I mentioned that I didn't like the name "Result Cache". Many name
suggestions followed:

Here's a list of a few that were mentioned:

Probe Cache
Tuple Cache
Keyed Materialize
Hash Materialize
Result Cache
Cache
Hash Cache
Lazy Hash
Reactive Hash
Parameterized Hash
Parameterized Cache
Keyed Inner Cache
MRU Cache
MRU Hash

I was hoping to commit the final patch pretty soon, but thought I'd
have another go at seeing if we can get some consensus on a name
before doing that. Otherwise, I'd sort of assumed that we'd just reach
some consensus after everyone complained about the current name after
the feature is committed.

My personal preference is "Lazy Hash", but I feel it might be better
to use the word "Reactive" instead of "Lazy".

There was some previous discussion on the name in [2]. I suggested
some other names in [3]. Andy voted for "Tuple Cache" in [4]

Votes? Other suggestions?

(I've included all the people who have shown some previous interest in
naming this node.)

David

[1]
/messages/by-id/CAApHDvrPcQyQdWERGYWx8J+2DLUNgXu+fOSbQ1UscxrunyXyrQ@mail.gmail.com
[2]
/messages/by-id/CA+TgmoZMxLeanqrS00_p3xNsU3g1v3EKjNZ4dM02ShRxxLiDBw@mail.gmail.com
[3]
/messages/by-id/CAApHDvoj_sH1H3JVXgHuwnxf1FQbjRVOqqgxzOgJX13NiA9-cg@mail.gmail.com
[4]
/messages/by-id/CAKU4AWoshM0JoymwBK6PKOFDMKg-OO6qtSVU_Piqb0dynxeL5w@mail.gmail.com

I want to share some feelings about other keywords. Materialize are used
in
Materialize node in executor node, which would write data to disk when
memory
is not enough, and it is used in "Materialized View", where it stores all
the data to disk
This gives me some feeling that "Materialize" usually has something with
disk,
but our result cache node doesn't.

And I think DBA checks plans more than the PostgreSQL developer. So
some MRU might be too internal for them. As for developers, if they want to
know such details, they can just read the source code.

When naming it, we may also think about some non native English speakers,
so
some too advanced words may make them uncomfortable. Actually when I read
"Reactive", I googled to find what its meaning is. I knew reactive
programming, but I
do not truly understand "reactive hash". And Compared with HashJoin, Hash
may
mislead people the result may be spilled into disk as well. so I prefer
"Cache"
over "Hash".

At last, I still want to vote for "Tuple(s) Cache", which sounds simple
and enough.
I was thinking if we need to put "Lazy" in the node name since we do build
cache
lazily, then I found we didn't call "Materialize" as "Lazy Materialize",
so I think we
can keep consistent.

I was hoping to commit the final patch pretty soon

Very glad to see it, thanks for the great feature.

--
Best Regards
Andy Fan (https://www.aliyun.com/)

#4David Rowley
dgrowleyml@gmail.com
In reply to: Andy Fan (#3)
Re: What to call an executor node which lazily caches tuples in a hash table?

On Wed, 31 Mar 2021 at 14:43, Andy Fan <zhihui.fan1213@gmail.com> wrote:

When naming it, we may also think about some non native English speakers, so
some too advanced words may make them uncomfortable. Actually when I read
"Reactive", I googled to find what its meaning is. I knew reactive programming, but I
do not truly understand "reactive hash".

The origin of that idea came from "reactive" being the opposite of
"proactive". If that's not clear then it's likely a bad choice for a
name.

I had thought proactive would mean "do things beforehand" i.e not on
demand. Basically, just fill the hash table with records that we need
to put in it rather than all records that we might need, the latter
being what Hash Join does, and the former is what the new node does.

David

#5David Rowley
dgrowleyml@gmail.com
In reply to: Andy Fan (#3)
Re: What to call an executor node which lazily caches tuples in a hash table?

On Wed, 31 Mar 2021 at 14:43, Andy Fan <zhihui.fan1213@gmail.com> wrote:

At last, I still want to vote for "Tuple(s) Cache", which sounds simple and enough.
I was thinking if we need to put "Lazy" in the node name since we do build cache
lazily, then I found we didn't call "Materialize" as "Lazy Materialize", so I think we
can keep consistent.

I thought about this a little more and I can see now why I put the
word "Cache" in the original name. I do now agree we really need to
keep the word "Cache" in the name.

The EXPLAIN ANALYZE talks about "hits", "misses" and "evictions", all
of those are things that caches do.

-> Result Cache (actual rows=1 loops=403)
Cache Key: c1.relkind
Hits: 398 Misses: 5 Evictions: 0 Overflows: 0 Memory Usage: 1kB
-> Aggregate (actual rows=1 loops=5)

I don't think there's any need to put the word "Lazy" in the name as
if we're keeping "Cache", then most caches do only cache results of
values that have been looked for.

I'm just not sure if "Tuple" is the best word or not. I primarily
think of "tuple" as the word we use internally, but a quick grep of
the docs reminds me that's not the case. The word is used all over the
documents. We have GUCs like parallel_tuple_cost and cpu_tuple_cost.
So it does seem like the sort of thing anyone who is interested in
looking at the EXPLAIN output should know about. I'm just not
massively keen on using that word in the name. The only other options
that come to mind are "Result" and "Parameterized". However, I think
"Parameterized" does not add much meaning. I think most people would
expect a cache to have a key. I sort of see why I went with "Result
Cache" now.

Does anyone else like the name "Tuple Cache"?

David

#6Andres Freund
andres@anarazel.de
In reply to: David Rowley (#1)
Re: What to call an executor node which lazily caches tuples in a hash table?

Hi,

On 2021-03-31 12:29:36 +1300, David Rowley wrote:

Here's a list of a few that were mentioned:

Probe Cache
Tuple Cache
Keyed Materialize
Hash Materialize
Result Cache
Cache
Hash Cache
Lazy Hash
Reactive Hash
Parameterized Hash
Parameterized Cache
Keyed Inner Cache
MRU Cache
MRU Hash

Suggestion: ParamKeyedCache.

I don't like Parameterized because it makes it sound like the cache is
parameterized (e.g. size), rather than using Param values as the keys
for the cache. ParamKeyed indicates that Params are the key, rather
than configuring the cache...

I don't like "result cache" all that much, because it does sound like
we'd be caching query results etc, or that it might be referring to
Result nodes. Neither of which is the case...

Greetings,

Andres Freund

#7Adam Brusselback
adambrusselback@gmail.com
In reply to: Andres Freund (#6)
Re: What to call an executor node which lazily caches tuples in a hash table?

Does anyone else like the name "Tuple Cache"?

I personally like that name best.

It makes sense to me when thinking about looking at an EXPLAIN and trying
to understand why this node may be there. The way we look up the value
stored in the cache doesn't really matter to me as a user, I'm more
thinking about the reason the node is there in the first place, which Tuple
Cache conveys...at least for me.

#8Justin Pryzby
pryzby@telsasoft.com
In reply to: David Rowley (#1)
Re: What to call an executor node which lazily caches tuples in a hash table? (GUC)

You started the thread about what to call the node, but what about its GUC?

Should it be enable_result_cache instead of enable_resultcache?

See also Robert's opinion last year about enable_incrementalsort and "economizing on underscores".
/messages/by-id/CA+TgmoazhPwebpOwNbHt1vJoos1eLYhJVQPka+pptSLgS685aA@mail.gmail.com

--
Justin

#9David Rowley
dgrowleyml@gmail.com
In reply to: Justin Pryzby (#8)
Re: What to call an executor node which lazily caches tuples in a hash table? (GUC)

On Thu, 3 Jun 2021 at 14:36, Justin Pryzby <pryzby@telsasoft.com> wrote:

You started the thread about what to call the node, but what about its GUC?

Should it be enable_result_cache instead of enable_resultcache?

See also Robert's opinion last year about enable_incrementalsort and "economizing on underscores".
/messages/by-id/CA+TgmoazhPwebpOwNbHt1vJoos1eLYhJVQPka+pptSLgS685aA@mail.gmail.com

If starting fresh, either is ok for me. We already have a mix and we
ended up adding the underscore to enable_incremental_sort as people
thought it was more clear. I did vote to keep it as it was without the
underscore as it seemed to follow the other enable* GUCs more closely.

I get the idea that we changed enable_incrementalsort because the word
incremental was fairly long when you compare it to things like hash,
nest and merge. "result" is not too much longer so I don't really
feel like the name enable_resultcache is wrong. Since it's already
called that, I'd rather not confuse things.

To be honest, I'd rather the discussion was about whether we want to
actually call the entire node "Result Cache" in the first place. I
did do an internet search a few weeks ago on "Result Cache" just to
see if anyone was talking about it anywhere that I might miss. I
discovered that Oracle has something of the same name, but going by
[1]: https://logicalread.com/huge-performance-gains-using-oracle-11g-result-cache-dr01/#.YLhEjLczb9Q
That's obviously not what PostgreSQL's Result Cache does. After
learning that, I dislike the name "Result Cache" even more.

FWIW, I'm still willing to go and do the renaming work for the node,
but the time we have to do that is ticking away fast and so far
there's no consensus [2]/messages/by-id/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com on any other name. Just a few people
suggesting other names.

David

[1]: https://logicalread.com/huge-performance-gains-using-oracle-11g-result-cache-dr01/#.YLhEjLczb9Q
[2]: /messages/by-id/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com