Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

Started by Lukas Fittl almost 3 years ago · 57 messages
#1Lukas Fittl
lukas@fittl.com
1 attachment(s)

Hi,

I was debugging a planner problem on Postgres 14.4 the other day - and the
involved "bad" plan was including Memoize - though I don't necessarily
think that Memoize is to blame (and this isn't any of the problems recently
fixed in Memoize costing).

However, what I noticed whilst trying different ways to fix the plan, is
that the Memoize output was a bit hard to reason about - especially since
the plan involving Memoize was expensive to run, and so I was mostly
running EXPLAIN without ANALYZE to look at the costing.

Here is an example of the output I was looking at:

-> Nested Loop (cost=1.00..971672.56 rows=119623 width=0)
-> Index Only Scan using table1_idx on table1
(cost=0.43..372676.50 rows=23553966 width=8)
-> Memoize (cost=0.57..0.61 rows=1 width=8)
Cache Key: table1.table2_id
Cache Mode: logical
-> Index Scan using table2_idx on table2
(cost=0.56..0.60 rows=1 width=8)
Index Cond: (id = table1.table2_id)

The other plan I was comparing with (that I wanted the planner to choose
instead), had a total cost of 1,451,807.35 -- and so I was trying to figure
out why the Nested Loop was costed as 971,672.56.

Simple math makes me expect the Nested Loop should roughly have a total
cost of 14,740,595.76 here (372,676.50 + 23,553,966 * 0.61), ignoring a lot
of the smaller costs. Thus, in this example, it appears Memoize made the
plan cost significantly cheaper (roughly 6% of the regular cost).

Essentially this comes down to the "cost reduction" performed by Memoize
only being implicitly visible in the Nested Loop's total cost - and with
nothing useful on the Memoize node itself - since the rescan costs are not
shown.
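To make that implicit discount concrete, one can back out the average per-rescan cost the planner must have charged for the inner side from the totals above. This is a rough sketch, not PostgreSQL code: the helper name is made up, and it ignores startup and per-tuple CPU charges.

```c
#include <assert.h>

/*
 * Back out the average cost charged per inner rescan from a nested
 * loop's totals.  Hypothetical helper, not PostgreSQL code; it
 * ignores startup and per-tuple CPU charges.
 */
static double
implied_rescan_cost(double nestloop_total, double outer_total,
					double outer_rows)
{
	return (nestloop_total - outer_total) / outer_rows;
}
```

With the numbers from the plan above, (971672.56 - 372676.50) / 23553966 comes out to roughly 0.025 per rescan, against the 0.61 shown on the Memoize node; that is, the planner effectively charged about 4% of the displayed rescan cost, which is exactly the discount the proposed hit-ratio output would make visible.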

I think explicitly adding the estimated cache hit ratio for Memoize nodes
might make this easier to reason about, like this:

-> Memoize (cost=0.57..0.61 rows=1 width=8)
Cache Key: table1.table2_id
Cache Mode: logical
Cache Hit Ratio Estimated: 0.94

Alternatively (or in addition) we could consider showing the "ndistinct"
value that is calculated in cost_memoize_rescan - since that's the most
significant contributor to the cache hit ratio (and you can influence that
directly by improving the ndistinct statistics).

See attached a patch that implements showing the cache hit ratio as a
discussion starter.

I'll park this in the July commitfest for now.

Thanks,
Lukas

--
Lukas Fittl

Attachments:

v1-0001-Add-Estimated-Cache-Hit-Ratio-for-Memoize-plan-no.patch (application/octet-stream)
From 7eaed8f1a059bed00dfd4625931ece475b9c3451 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sat, 4 Mar 2023 15:44:29 -0800
Subject: [PATCH v1] Add Estimated Cache Hit Ratio for Memoize plan nodes in
 EXPLAIN

Memoize can substantially reduce the costs of certain Nested Loops, but
it's hard to see this clearly, since the often significantly discounted
rescan costs are not shown in EXPLAIN.

To help understand Memoize's impact better, remember the calculated
estimated cache hit ratio, and show it in EXPLAIN output.
---
 src/backend/commands/explain.c          |  7 +++++++
 src/backend/optimizer/path/costsize.c   |  3 +++
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/backend/optimizer/util/pathnode.c   |  5 +++++
 src/include/nodes/pathnodes.h           |  1 +
 src/include/nodes/plannodes.h           |  3 +++
 6 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e57bda7b62..58fda99a69 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3167,6 +3167,8 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	{
 		ExplainPropertyText("Cache Key", keystr.data, es);
 		ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
+		if (es->costs)
+			ExplainPropertyFloat("Cache Hit Ratio Estimated", "", ((Memoize *) plan)->hit_ratio, 2, es);
 	}
 	else
 	{
@@ -3174,6 +3176,11 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 		appendStringInfo(es->str, "Cache Key: %s\n", keystr.data);
 		ExplainIndentText(es);
 		appendStringInfo(es->str, "Cache Mode: %s\n", mstate->binary_mode ? "binary" : "logical");
+		if (es->costs)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Cache Hit Ratio Estimated: %0.2f\n", ((Memoize *) plan)->hit_ratio);
+		}
 	}
 
 	pfree(keystr.data);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 7918bb6f0d..a0fd3f8965 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2600,6 +2600,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	 */
 	startup_cost += cpu_tuple_cost;
 
+	/* Remember cache hit ratio for a potential EXPLAIN later */
+	mpath->hit_ratio = hit_ratio;
+
 	*rescan_startup_cost = startup_cost;
 	*rescan_total_cost = total_cost;
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fa09a6103b..4c403cb081 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -280,7 +280,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							 uint32 est_entries, Bitmapset *keyparamids,
+							 double hit_ratio);
 static WindowAgg *make_windowagg(List *tlist, Index winref,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1691,7 +1692,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->hit_ratio);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6528,7 +6530,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			 uint32 est_entries, Bitmapset *keyparamids,
+			 double hit_ratio)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6546,6 +6549,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->hit_ratio = hit_ratio;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d749b50578..d4186784c5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1624,6 +1624,11 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	 */
 	pathnode->est_entries = 0;
 
+	/*
+	 * The estimated cache hit ratio will be calculated later by cost_memoize_rescan()
+	 */
+	pathnode->hit_ratio = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index d61a62da19..1b7cab3bc5 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1962,6 +1962,7 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		hit_ratio;		/* Estimated cache hit ratio, kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 659bd05c0c..7966062a32 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -923,6 +923,9 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/* Estimated cache hit ratio, kept for EXPLAIN */
+	double		hit_ratio;
 } Memoize;
 
 /* ----------------
-- 
2.34.0

#2David Rowley
dgrowleyml@gmail.com
In reply to: Lukas Fittl (#1)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Sun, 5 Mar 2023 at 13:21, Lukas Fittl <lukas@fittl.com> wrote:

Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in cost_memoize_rescan - since that's the most significant contributor to the cache hit ratio (and you can influence that directly by improving the ndistinct statistics).

I think the ndistinct estimate plus the est_entries together would be
useful. I think showing just the hit ratio number might often just
raise too many questions about how that's calculated. To calculate the
hit ratio we need to estimate the number of entries that can be kept
in the cache at once and also the number of input rows and the number
of distinct values. We can see the input rows by looking at the outer
side of the join in EXPLAIN, but we've no idea about the ndistinct or
how many items the planner thought could be kept in the cache at once.

The plan node already has est_entries, so it should just be a matter
of storing the ndistinct estimate in the Path and putting it into the
Plan node so the executor has access to it during EXPLAIN.
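The way those three quantities combine can be sketched roughly as follows. This is a simplified approximation of the estimate in cost_memoize_rescan(), not the actual PostgreSQL code; among other things, the real costing also models eviction overhead.

```c
#include <assert.h>

/*
 * Simplified sketch of the Memoize hit-ratio estimate: the first
 * lookup of each distinct key is necessarily a miss, and only the
 * fraction of keys that fits in the cache at once can produce hits.
 * Approximation of cost_memoize_rescan(); eviction costs are ignored.
 */
static double
est_hit_ratio(double calls, double ndistinct, double est_cache_entries)
{
	double		max_keys = (ndistinct > est_cache_entries) ?
		ndistinct : est_cache_entries;
	double		hit_ratio;

	hit_ratio = ((calls - ndistinct) / calls) *
		(est_cache_entries / max_keys);

	return (hit_ratio < 0.0) ? 0.0 : hit_ratio;
}
```

For example, 1200 calls with 721 distinct keys and room for 655 entries gives an estimate of about 0.36; when the cache is large enough for every key, the second factor becomes 1 and the estimate reduces to (calls - ndistinct) / calls.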

David

#3Daniel Gustafsson
daniel@yesql.se
In reply to: David Rowley (#2)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 7 Mar 2023, at 10:51, David Rowley <dgrowleyml@gmail.com> wrote:

On Sun, 5 Mar 2023 at 13:21, Lukas Fittl <lukas@fittl.com> wrote:

Alternatively (or in addition) we could consider showing the "ndistinct" value that is calculated in cost_memoize_rescan - since that's the most significant contributor to the cache hit ratio (and you can influence that directly by improving the ndistinct statistics).

I think the ndistinct estimate plus the est_entries together would be
useful. I think showing just the hit ratio number might often just
raise too many questions about how that's calculated. To calculate the
hit ratio we need to estimate the number of entries that can be kept
in the cache at once and also the number of input rows and the number
of distinct values. We can see the input rows by looking at the outer
side of the join in EXPLAIN, but we've no idea about the ndistinct or
how many items the planner thought could be kept in the cache at once.

The plan node already has est_entries, so it should just be a matter
of storing the ndistinct estimate in the Path and putting it into the
Plan node so the executor has access to it during EXPLAIN.

Lukas: do you have an updated patch for this commitfest to address David's
comments?

--
Daniel Gustafsson

#4Lukas Fittl
lukas@fittl.com
In reply to: Daniel Gustafsson (#3)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson <daniel@yesql.se> wrote:

Lukas: do you have an updated patch for this commitfest to address David's
comments?

I have a draft - I should be able to post an updated patch in the next
days. Thanks for checking!

Thanks,
Lukas

--
Lukas Fittl

#5Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Lukas Fittl (#4)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 06.07.2023 11:27, Lukas Fittl wrote:

On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson <daniel@yesql.se> wrote:

Lukas: do you have an updated patch for this commitfest to address
David's
comments?

I have a draft - I should be able to post an updated patch in the next
days. Thanks for checking!

Thanks,
Lukas

--
Lukas Fittl

Hi hackers,

While debugging a query execution plan involving Memoize, it'd be nice
to determine how many unique keys would fit into the cache. The
est_entries value provides some insight, but without knowing ndistinct,
it is unclear whether the cache is large enough to hold all unique keys
or if some will be evicted.

Given its potential usefulness, I would like to work on this. I have
attached a v2 patch with the changes.

Example from memoize.sql

EXPLAIN SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
INNER JOIN tenk1 t2 ON t1.unique1 = t2.thousand
WHERE t2.unique1 < 1200;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Aggregate  (cost=815.12..815.13 rows=1 width=40)
   ->  Nested Loop  (cost=0.30..809.12 rows=1200 width=4)
         ->  Seq Scan on tenk1 t2  (cost=0.00..470.00 rows=1200 width=4)
               Filter: (unique1 < 1200)
         ->  Memoize  (cost=0.30..0.41 rows=1 width=4)
               Cache Key: t2.thousand
               Cache Mode: logical
               Cache Estimated Entries: 655
               Cache Estimated NDistinct: 721
               ->  Index Only Scan using tenk1_unique1 on tenk1 t1 
(cost=0.29..0.40 rows=1 width=4)
                     Index Cond: (unique1 = t2.thousand)
(11 rows)

Additionally, since this information would only be shown in EXPLAIN when
costs are enabled, it should not cause any performance regression in
normal execution. However, reviewers should be especially careful when
verifying test outputs, as this change could affect plan details in
regression tests.

Any thoughts?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v2-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch (text/x-patch)
From 2e7d9bd9bfe58825fbdbb0100b9bd19d5bb7e53b Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Thu, 20 Mar 2025 11:45:02 +0300
Subject: [PATCH v2] Show ndistinct and est_entries in EXPLAIN for Memoize

---
 src/backend/commands/explain.c          | 12 ++++++++++++
 src/backend/optimizer/path/costsize.c   |  3 +++
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/backend/optimizer/util/pathnode.c   |  3 +++
 src/include/nodes/pathnodes.h           |  2 ++
 src/include/nodes/plannodes.h           |  6 ++++++
 6 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 33a16d2d8e2..95110e63c7d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3629,6 +3629,11 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	{
 		ExplainPropertyText("Cache Key", keystr.data, es);
 		ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
+		if (es->costs)
+		{
+			ExplainPropertyFloat("Cache Estimated Entries", "", ((Memoize *) plan)->est_entries, 0, es);
+			ExplainPropertyFloat("Cache Estimated NDistinct", "", ((Memoize *) plan)->ndistinct, 0, es);
+		}
 	}
 	else
 	{
@@ -3636,6 +3641,13 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 		appendStringInfo(es->str, "Cache Key: %s\n", keystr.data);
 		ExplainIndentText(es);
 		appendStringInfo(es->str, "Cache Mode: %s\n", mstate->binary_mode ? "binary" : "logical");
+		if (es->costs)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Cache Estimated Entries: %d\n", ((Memoize *) plan)->est_entries);
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Cache Estimated NDistinct: %0.0f\n", ((Memoize *) plan)->ndistinct);
+		}
 	}
 
 	pfree(keystr.data);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 256568d05a2..9c0738da4fe 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->ndistinct = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 75e2b0b9036..79795a2b5f2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							uint32 est_entries, Bitmapset *keyparamids,
+							double ndistinct);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->ndistinct);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, Bitmapset *keyparamids,
+			double ndistinct)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->ndistinct = ndistinct;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..1676e20879a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,9 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/* Estimated number of distinct memoization keys, computed using estimate_num_groups() */
+	pathnode->ndistinct = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c24a1fc8514..e99dbee2b3f 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,8 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		ndistinct;		/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f78bffd90cf..16b307a2141 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1060,6 +1060,12 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		ndistinct;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#6David Rowley
dgrowleyml@gmail.com
In reply to: Ilia Evdokimov (#5)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Thu, 20 Mar 2025 at 21:48, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

-> Memoize (cost=0.30..0.41 rows=1 width=4)
Cache Key: t2.thousand
Cache Mode: logical
Cache Estimated Entries: 655
Cache Estimated NDistinct: 721
-> Index Only Scan using tenk1_unique1 on tenk1 t1 (cost=0.29..0.40 rows=1 width=4)
Index Cond: (unique1 = t2.thousand)
(11 rows)

Additionally, since this information would only be shown in EXPLAIN when costs are enabled, it should not cause any performance regression in normal execution. However, reviewers should be especially careful when verifying test outputs, as this change could affect plan details in regression tests.

Any thoughts?

+ ExplainIndentText(es);
+ appendStringInfo(es->str, "Cache Estimated Entries: %d\n", ((Memoize *) plan)->est_entries);
+ ExplainIndentText(es);
+ appendStringInfo(es->str, "Cache Estimated NDistinct: %0.0f\n", ((Memoize *) plan)->ndistinct);

est_entries is a uint32, so %u is the correct format character for that type.

I don't think you need to prefix all these properties with "Cache"
just because the other two properties have that prefix. I also don't
think the names you've chosen really reflect the meaning. How about
something like: "Estimated Distinct Lookup Keys: 721 Estimated
Capacity: 655", in that order. I think maybe having that as one line
for format=text is better than 2 lines. The EXPLAIN output is already
often taking up more lines in v18 than in v17, would be good to not
make that even worse unnecessarily.

I see the existing code there could use ExplainPropertyText rather
than have a special case for text and non-text formats. That's likely
my fault. If we're touching this code, then we should probably tidy
that up. Do you want to create a precursor fixup patch for that?

+ double ndistinct; /* Estimated number of distinct memoization keys,
+ * used for cache size evaluation. Kept for EXPLAIN */

Maybe this field in MemoizePath needs a better name. How about
"est_unique_keys"? and also do the rename in struct Memoize.

I'm also slightly concerned about making struct Memoize bigger. I had
issues with a performance regression [1] for 908a96861 when increasing
the WindowAgg struct size last year and the only way I found to make
it go away was to shuffle the fields around so that the struct size
didn't increase. I think we'll need to see a benchmark of a query that
hits Memoize quite hard with a small cache size to see if the
performance decreases as a result of adding the ndistinct field. It's
unfortunate that we'll not have the luxury of squeezing this double
into padding if we do see a slowdown.
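The padding concern can be illustrated with toy structs (these are not the real Memoize fields, just an alignment sketch): a bool or int can often slip into an existing alignment hole, but an 8-byte double normally cannot, so on typical 64-bit ABIs the node grows by a full 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy structs for the alignment argument; not PostgreSQL code. */
typedef struct Before
{
	void	   *ptr;			/* 8 bytes */
	unsigned int entries;		/* 4 bytes */
	_Bool		flag;			/* 1 byte, then 3 bytes of tail padding */
} Before;

typedef struct After
{
	void	   *ptr;
	unsigned int entries;
	_Bool		flag;
	double		extra;			/* needs 8-byte alignment: the tail
								 * padding can't absorb it, so the
								 * struct grows by a full 8 bytes */
} After;
```

A new bool or uint32 could have fit in the 3 bytes of tail padding without growing the struct; the double cannot, which is why field shuffling alone may not help here.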

David

[1]: /messages/by-id/flat/CAHoyFK9n-QCXKTUWT_xxtXninSMEv+gbJN66-y6prM3f4WkEHw@mail.gmail.com

#7Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: David Rowley (#6)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 20.03.2025 13:37, David Rowley wrote:

est_entries is a uint32, so %u is the correct format character for that type.

I don't think you need to prefix all these properties with "Cache"
just because the other two properties have that prefix. I also don't
think the names you've chosen really reflect the meaning. How about
something like: "Estimated Distinct Lookup Keys: 721 Estimated
Capacity: 655", in that order. I think maybe having that as one line
for format=text is better than 2 lines. The EXPLAIN output is already
often taking up more lines in v18 than in v17, would be good to not
make that even worse unnecessarily.

LGTM

I see the existing code there could use ExplainPropertyText rather
than have a special case for text and non-text formats. That's likely
my fault. If we're touching this code, then we should probably tidy
that up. Do you want to create a precursor fixup patch for that?

I fully agree with this suggestion. Then I'll begin with this on another
new thread.

+ double ndistinct; /* Estimated number of distinct memoization keys,
+ * used for cache size evaluation. Kept for EXPLAIN */

Maybe this field in MemoizePath needs a better name. How about
"est_unique_keys"? and also do the rename in struct Memoize.

LGTM

I'm also slightly concerned about making struct Memoize bigger. I had
issues with a performance regression [1] for 908a96861 when increasing
the WindowAgg struct size last year and the only way I found to make
it go away was to shuffle the fields around so that the struct size
didn't increase. I think we'll need to see a benchmark of a query that
hits Memoize quite hard with a small cache size to see if the
performance decreases as a result of adding the ndistinct field. It's
unfortunate that we'll not have the luxury of squeezing this double
into padding if we do see a slowdown.

I tried rearranging the fields in the struct after adding the 'ndistinct'
field, but unfortunately sizeof(Memoize) did not remain the same. Therefore,
benchmarking Memoize is definitely necessary. Thanks for the information!

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#8Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#6)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 20/3/2025 11:37, David Rowley wrote:

I'm also slightly concerned about making struct Memoize bigger. I had
issues with a performance regression [1] for 908a96861 when increasing
the WindowAgg struct size last year and the only way I found to make
it go away was to shuffle the fields around so that the struct size
didn't increase. I think we'll need to see a benchmark of a query that
hits Memoize quite hard with a small cache size to see if the
performance decreases as a result of adding the ndistinct field. It's
unfortunate that we'll not have the luxury of squeezing this double
into padding if we do see a slowdown.

I quite frequently need the number of distinct values (or groups)
predicted during the Memoize node creation to understand why caching is
sometimes employed or not.
But I had been thinking about an alternative way: with extensible EXPLAIN
(thanks to Robert), we could save optimisation-stage data (I have the same
need in the case of IncrementalSort, for example) and put it into the
Plan node on demand. The way I want to go is a Plan::extlist node and a
create_plan hook, which would allow copying best_path data to the final
plan. That way we would add a new parameter without touching the core code.
But I would give +1 to the current approach if it can be done sooner.

--
regards, Andrei Lepikhov

#9Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Andrei Lepikhov (#8)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 20.03.2025 15:32, Andrei Lepikhov wrote:

I quite frequently need the number of distinct values (or groups)
predicted during the Memoize node creation to understand why caching
is sometimes employed or not.
But I had thought about an alternative way: having an extensible
EXPLAIN (thanks to Robert), we may save optimisation-stage data (I
have the same necessity in the case of IncrementalSort, for example)
and put it into the Plan node on-demand. So, the way I want to go is a
Plan::extlist node and create_plan hook, which may allow copying
best_path data to the final plan. So, here, we will add a new
parameter and avoid touching the core code.
But I would give +1 to current approach if it were done in a shorter
time.

Then before proceeding further, I think we need to benchmark this change
to ensure there are no performance regressions. If performance issues
arise, then using extensible EXPLAIN might be the only viable approach.

I have addressed most of the review comments except for the
ExplainPropertyText change, and I am attaching v3 of the patch with these
updates. If anyone notices any performance issues, please let me know.
I'll fix the ExplainPropertyText issue for this thread in a follow-up
patch if it turns out to be necessary.

So far, I have tested it on small machines. There are no performance
issues yet. I'll run benchmarks on larger ones soon.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v3-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch (text/x-patch)
From 02a9b4793deb74f2e67dacabec259c4cd929d995 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Thu, 20 Mar 2025 16:58:48 +0300
Subject: [PATCH v3] Show ndistinct and est_entries in EXPLAIN for Memoize

---
 src/backend/commands/explain.c          | 12 ++++++++++++
 src/backend/optimizer/path/costsize.c   |  3 +++
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/backend/optimizer/util/pathnode.c   |  3 +++
 src/include/nodes/pathnodes.h           |  2 ++
 src/include/nodes/plannodes.h           |  6 ++++++
 6 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 33a16d2d8e2..4d8e94d3feb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3629,6 +3629,11 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	{
 		ExplainPropertyText("Cache Key", keystr.data, es);
 		ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
+		if (es->costs)
+		{
+			ExplainPropertyFloat("Estimated Capacity", "", ((Memoize *) plan)->est_entries, 0, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+		}
 	}
 	else
 	{
@@ -3636,6 +3641,13 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 		appendStringInfo(es->str, "Cache Key: %s\n", keystr.data);
 		ExplainIndentText(es);
 		appendStringInfo(es->str, "Cache Mode: %s\n", mstate->binary_mode ? "binary" : "logical");
+		if (es->costs)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Estimated Capacity: %u, Estimated Distinct Lookup Keys: %0.0f\n",
+							((Memoize *) plan)->est_entries,
+							((Memoize *) plan)->est_unique_keys);
+		}
 	}
 
 	pfree(keystr.data);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 256568d05a2..4e42afbf53a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 75e2b0b9036..596b82c5dc9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							uint32 est_entries, Bitmapset *keyparamids,
+							double est_unique_keys);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->est_unique_keys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, Bitmapset *keyparamids,
+			double est_unique_keys)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..1fbcda99067 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,9 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/* Estimated number of distinct memoization keys, computed using estimate_num_groups() */
+	pathnode->est_unique_keys = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c24a1fc8514..0946ccea994 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,8 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;		/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f78bffd90cf..b38d4d91fb4 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1060,6 +1060,12 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#10Andrei Lepikhov
lepihov@gmail.com
In reply to: Ilia Evdokimov (#9)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 20/3/2025 15:03, Ilia Evdokimov wrote:

On 20.03.2025 15:32, Andrei Lepikhov wrote:

I quite frequently need the number of distinct values (or groups)
predicted during the Memoize node creation to understand why caching
is sometimes employed or not.
But I had thought about an alternative way: having an extensible
EXPLAIN (thanks to Robert), we may save optimisation-stage data (I
have the same necessity in the case of IncrementalSort, for example)
and put it into the Plan node on-demand. So, the way I want to go is a
Plan::extlist node and create_plan hook, which may allow copying
best_path data to the final plan. So, here, we will add a new
parameter and avoid touching the core code.
But I would give +1 to the current approach if it were done in a shorter
time.

Then before proceeding further, I think we need to benchmark this change
to ensure there are no performance regressions. If performance issues
arise, then using extensible EXPLAIN might be the only viable approach.

I have addressed most of the review comments except for the
ExplainPropertyText change. I am attaching v3 of the patch with these
updates. If anyone notices any performance issues, please let me know.
The ExplainPropertyText issue for this thread I'll fix in follow-up
patches if it turns out to be necessary.

So far, I have tested it on small machines. There are no performance
issues yet. I'll run benchmarks on larger ones soon.

I have some doubts here.
The number of distinct values says something only when taken together
with the number of calls.
Frequently, one of the caching keys is from outer table A (10 tuples),
and another is from outer table B (100 tuples). Calculating the number
of successful cache fetches predicted by the planner may not be evident
in the case of a composite cache key.

What I may propose here is:
1. Use a fraction of calls, for example, 50% duplicated key values.
2. Show the calculated hit and eviction ratios.

The second option looks better to me. It is easily understood by a
user and can be compared with the numbers obtained during execution.

--
regards, Andrei Lepikhov

#11David Rowley
dgrowleyml@gmail.com
In reply to: Andrei Lepikhov (#10)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Fri, 21 Mar 2025 at 07:54, Andrei Lepikhov <lepihov@gmail.com> wrote:

I have some doubts here.
The number of distinct values says something only when taken together
with the number of calls.

Couldn't the reader just look at the Nested Loop's outer side row
estimate for that?

Frequently, one of the caching keys is from outer table A (10 tuples),
and another is from outer table B (100 tuples). Calculating the number
of successful cache fetches predicted by the planner may not be evident
in the case of a composite cache key.

What I may propose here is:
1. Use fraction of calls, for example - 50% duplicated key values.
2. Show the calculated hit and eviction ratio.

My concern with showing just the estimated hit and evict ratios is
that it might lead to more questions, like how those were calculated.
However, I can see the argument for having one or both of these in
addition to the expected unique keys and expected cache capacity.

I think the primary factors in how useful Memoize is are: 1) How many
items do we expect to be able to store in the cache concurrently, and;
2) How many unique lookup keys do we expect to be looked up, and; 3)
The total number of expected lookups. #1 is quite difficult to
figure out (maybe by looking at row width and row estimates) and
there's just no information about #2. #3 is already shown, in the
Nested Loop's outer side.

The reason that the hit and evict ratios might also be useful is that
you need a bit of inside knowledge on how the 3 input values are
used to calculate these. For example you might come up with a higher
estimated hit ratio if you assumed the keys arrived in order vs
unordered. If they're unordered and you don't have room for all keys
in the cache at once then that increases the chances that Memoize had
to evict something that will be needed for a future lookup.
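Sketching that relationship in Python (a simplified model of how the three inputs could combine; the real cost_memoize_rescan also derives the cache capacity from work_mem and per-entry overheads and clamps its inputs, so treat the exact formula here as an assumption for illustration):

```python
def memoize_estimates(calls, ndistinct, est_cache_entries):
    """Sketch of planner-style hit and evict ratios from the three inputs."""
    # Share of lookups that repeat an already-seen key, scaled down by the
    # chance that the key is still cached (it may have been evicted when
    # there are more distinct keys than cache slots).
    hit_ratio = ((calls - ndistinct) / calls) * \
        (est_cache_entries / max(ndistinct, est_cache_entries))
    # If distinct keys outnumber the cache slots, the overflow forces evictions.
    evict_ratio = 1.0 - min(est_cache_entries, ndistinct) / ndistinct
    return hit_ratio, evict_ratio

# All 1001 distinct keys fit: no evictions, ~96.7% of 30000 lookups hit.
hit, evict = memoize_estimates(30000, 1001, 1001)
assert evict == 0.0 and 0.96 < hit < 0.97
# Only half the keys fit: hits are halved and half the keys get evicted.
hit, evict = memoize_estimates(2000, 1000, 500)
assert hit == 0.25 and evict == 0.5
```

Note the model assumes unordered key arrival; as discussed above, ordered keys would allow a higher hit ratio for the same three inputs.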

Also, I was just looking back at the concern I had with increasing the
size of struct Memoize. I suspect that might not be that much of a
concern. The WindowAgg problem I mentioned was with the executor
state, not the plan node. I see the only time we access the plan node
for Memoize during execution is when we call build_hash_table().

David

#12Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#11)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 21/3/2025 03:50, David Rowley wrote:

On Fri, 21 Mar 2025 at 07:54, Andrei Lepikhov <lepihov@gmail.com> wrote:

I have some doubts here.
The number of distinct values says something only when taken together
with the number of calls.

Couldn't the reader just look at the Nested Loop's outer side row
estimate for that?

In my cases, key sources are usually spread across dozens of joins, and
it is visually hard to locate the JOIN operator (especially in an
EXPLAIN ANALYSE VERBOSE) to extract the number of calls from. The hit
ratio, meanwhile, can be analysed locally in the Memoize node. For
example, 80% (0.8) is evidently good, 40% is questionable, and 5% is
too low, meaning we should avoid Memoize here.
Could that be addressed by just printing the "calls" number in the Memoize output?

Frequently, one of the caching keys is from outer table A (10 tuples),
and another is from outer table B (100 tuples). Calculating the number
of successful cache fetches predicted by the planner may not be evident
in the case of a composite cache key.

What I may propose here is:
1. Use fraction of calls, for example - 50% duplicated key values.
2. Show the calculated hit and eviction ratio.

I think the primary factors in how useful Memoize is are: 1) How many
items do we expect to be able to store in the cache concurrently, and;
2) How many unique lookups keys do we expect to be looked up, and; 3)
The total number of expected lookups. #1 is quite difficult to
figure out (maybe by looking at row width and row estimates) and
there's just no information about #2. #3 is already shown, in the
Nested Loop's outer side.

It depends on the task. If you are looking for the answer to how precise
the group estimation has been (to check statistics), I agree. In cases I
have seen before, the main question is how effective a Memoize node was
(or may be) == how often the incoming key hits the cache. In that case,
the hit ratio fraction is more understandable for a broad audience.
That's why, in my experience, given a good cache reusability factor,
users are usually okay with increasing the cache size to the necessary
numbers and avoiding evictions at all costs. So, the predicted
evict_ratio also tells us when incrementing work_mem would enhance the
chances of memoisation.
Having written the last sentence, I came back to the question of why
work_mem is so universal and is used at each node as the criterion for
memory allocation size. But that is a different story, I think.

--
regards, Andrei Lepikhov

#13Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Andrei Lepikhov (#12)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

After considering the opinions expressed in this discussion, I tend to
agree more with David. If we add info about estimated unique keys and
estimated capacity, then any additional information - such as
evict_ratio and hit_ratio - can also be calculated, as EXPLAIN ANALYZE
provides all the necessary details to compute these values.

For now, I’m attaching a rebased v4 patch, which includes the fix for
ExplainPropertyText.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v4-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch (text/x-patch)
From b60e508583b1cb7f30814cef6f39c4b570693350 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Fri, 21 Mar 2025 11:52:29 +0300
Subject: [PATCH v4] Show ndistinct and est_entries in EXPLAIN for Memoize

---
 src/backend/commands/explain.c          | 16 ++++++++++++++++
 src/backend/optimizer/path/costsize.c   |  3 +++
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/backend/optimizer/util/pathnode.c   |  3 +++
 src/include/nodes/pathnodes.h           |  2 ++
 src/include/nodes/plannodes.h           |  6 ++++++
 6 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 391b34a2af2..cb7e295ca8a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3628,6 +3628,22 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	ExplainPropertyText("Cache Key", keystr.data, es);
 	ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
 
+	if (es->costs)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Estimated Capacity: %u  Estimated Distinct Lookup Keys: %0.0f\n",
+							((Memoize *) plan)->est_entries,
+							((Memoize *) plan)->est_unique_keys);
+		}
+		else
+		{
+			ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+		}
+	}
+
 	pfree(keystr.data);
 
 	if (!es->analyze)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index f6f77b8fe19..e3abf9ccc26 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 75e2b0b9036..596b82c5dc9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							uint32 est_entries, Bitmapset *keyparamids,
+							double est_unique_keys);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->est_unique_keys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, Bitmapset *keyparamids,
+			double est_unique_keys)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..1fbcda99067 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,9 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/* Estimated number of distinct memoization keys, computed using estimate_num_groups() */
+	pathnode->est_unique_keys = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c24a1fc8514..0946ccea994 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,8 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;		/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f78bffd90cf..b38d4d91fb4 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1060,6 +1060,12 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#14David Rowley
dgrowleyml@gmail.com
In reply to: Andrei Lepikhov (#12)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Fri, 21 Mar 2025 at 22:02, Andrei Lepikhov <lepihov@gmail.com> wrote:

In cases
I have seen before, the main question is how effective was (or maybe) a
Memoize node == how often the incoming key fits the cache. In that case,
the hit ratio fraction is more understandable for a broad audience.
That's why according to my experience in case of a good cache
reusability factor, users are usually okay with increasing the cache
size to the necessary numbers and avoiding evictions at all costs. So,
the predicted evict_ratio also tells us about incrementing work_mem to
enhance the chances of Memoisation.

Can you explain why "Estimated Capacity" and "Estimated Distinct
Lookup Keys" don't answer that? If there are more distinct lookup
keys than there is capacity to store them, then some will be evicted.

Once again, I'm not necessarily objecting to hit and evict ratios
being shown, I just want to know they're actually useful enough to
show and don't just bloat the EXPLAIN output needlessly. So far your
arguments aren't convincing me that they are.

Having written the last sentence I came back to the point why work_mem
is so universal and is used at each node as a criteria of memory
allocation size? But it is a different story, I think.

We have to set the limit somehow. We could have done this by having a
GUC per node type that uses memory, but it looks like something more
universal was decided, perhaps to save on GUCs. I don't know the exact
history, but once upon a time, sort_mem existed. Perhaps that
disappeared because we grew more node types that needed to allocate
large, otherwise unbounded amounts of memory. We did more recently
grow a hash_mem_multiplier GUC, so it's not true to say that work_mem
solely controls the limits of each node's memory allocation sizes.

David

#15Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#14)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 3/23/25 22:16, David Rowley wrote:

On Fri, 21 Mar 2025 at 22:02, Andrei Lepikhov <lepihov@gmail.com> wrote:
Can you explain why "Estimated Capacity" and "Estimated Distinct
Lookup Keys" don't answer that? If there are more distinct lookup
keys than there is capacity to store them, then some will be evicted.

I wouldn't say these parameters don't answer that; I am debating how
usable they are. To be more practical, let me demonstrate with an example:

EXPLAIN SELECT * FROM t1,t2 WHERE t1.x=t2.x;

Nested Loop (cost=0.44..7312.65 rows=211330 width=33)
-> Seq Scan on t1 (cost=0.00..492.00 rows=30000 width=22)
-> Memoize (cost=0.44..3.82 rows=7 width=11)
Cache Key: t1.x
Cache Mode: logical
Estimated Capacity: 1001 Estimated Distinct Lookup Keys: 1001
-> Index Scan using t2_x_idx2 on t2 (cost=0.43..3.81 rows=7)
Index Cond: (x = t1.x)

At first, I began to look for documentation because it was unclear what
both new parameters specifically meant. Okay, there was no documentation,
but the code was trivial, and after a short investigation, I realised the
meaning. The first fact I see from this EXPLAIN is that Postgres
estimates it has enough memory to fit all the entries. Okay, but what
does it give me? I may just increase work_mem and provide the query with
more memory if needed. My main concern is how frequently this cache is
planned to be used. With some mental effort, I found the line
"rows=30000." Calculating a bit more, I may suppose it means that we
have roughly a 97% chance to reuse the cache. Okay, I got it.
Now, see:
1. I needed to discover the meaning of the new parameters because they
were different from the earlier "hit" and "miss."
2. I need to find a common JOIN for keys of this node. Imagine a typical
200-row EXPLAIN with 2-3 Memoization keys from different tables.
3. I need to make calculations

By contrast, the hit ratio, printed instead, is already known by
analogy, already provides the necessary cache-efficacy data with no need
to look outside the node, and can easily be compared with the actual
value. Am I wrong?

Both approaches provide the data, but which one is more usable?
I think we may ask more people, for example, Nikolay Samokhvalov, who,
as I heard, works a lot with EXPLAIN plans.

Once again, I'm not necessarily objecting to hit and evict ratios
being shown, I just want to know they're actually useful enough to
show and don't just bloat the EXPLAIN output needlessly. So far your
arguments aren't convincing me that they are.

I'm -1 for this redundancy.

--
regards, Andrei Lepikhov

#16David Rowley
dgrowleyml@gmail.com
In reply to: Andrei Lepikhov (#15)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, 25 Mar 2025 at 10:23, Andrei Lepikhov <lepihov@gmail.com> wrote:

On 3/23/25 22:16, David Rowley wrote:

Once again, I'm not necessarily objecting to hit and evict ratios
being shown, I just want to know they're actually useful enough to
show and don't just bloat the EXPLAIN output needlessly. So far your
arguments aren't convincing me that they are.

I'm -1 for this redundancy.

I'm not following what the -1 is for. Is it for showing both hit and
evict ratios? And your vote is only for adding hit ratio?

Just to make it clear, the evict ratio isn't redundant because we show
hit ratio. If you have 1000 calls and 1000 distinct values and enough
memory to store those, then that's a 0% hit ratio since the first
lookup is a miss. If you change the calls to 2000 then that's a 50%
hit ratio and still 0% evict.
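That arithmetic, under the assumption that every distinct key fits in the cache (so nothing is ever evicted), is just:

```python
def naive_hit_ratio(calls, ndistinct):
    # First lookup of each distinct key is a miss; every repeat is a hit.
    # Assumes the cache is big enough that no entries are evicted.
    return (calls - ndistinct) / calls

assert naive_hit_ratio(1000, 1000) == 0.0  # every lookup is a first-time miss
assert naive_hit_ratio(2000, 1000) == 0.5  # half the lookups repeat a key
```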

David

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#16)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

David Rowley <dgrowleyml@gmail.com> writes:

I'm not following what the -1 is for. Is it for showing both hit and
evict ratios? And your vote is only for adding hit ratio?

Just to make it clear, the evict ratio isn't redundant because we show
hit ratio. If you have 1000 calls and 1000 distinct values and enough
memory to store those, then that's a 0% hit ratio since the first
lookup is a miss. If you change the calls to 2000 then that's a 50%
hit ratio and still 0% evict.

FWIW, I share these doubts about whether these values are useful
enough to include in the default EXPLAIN output. My main beef
with them though is that they are basically numbers derived along
the way to producing a cost estimate, and I don't think we break
out such intermediate results for other node types.

It's looking like Robert's "pg_overexplain" will hit the tree soon,
so maybe there could be a case for teaching that to emit additional
costing details such as these? I'd kind of like to see a larger
scope than just Memoize for such an addition, though I'm not sure
exactly which other intermediate estimates are worth showing.

regards, tom lane

#18Lukas Fittl
lukas@fittl.com
In reply to: Tom Lane (#17)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Mon, Mar 24, 2025 at 3:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I share these doubts about whether these values are useful
enough to include in the default EXPLAIN output. My main beef
with them though is that they are basically numbers derived along
the way to producing a cost estimate, and I don't think we break
out such intermediate results for other node types.

The main argument I had initially when proposing this is that Memoize is
different from other plan nodes, in that it makes the child node costs
"cheaper". Clearly seeing the expected cache hit ratio (which drives that
costing modification) helps interpret why the planner came up with a given
plan.

Put differently, the reason this should be in the default EXPLAIN output
(or at least the VERBOSE output), is because Memoize's impact on costing is
counterintuitive (in my experience), and breaks the user's understanding of
planner costs you can usually derive by looking at the per-node cost
details, which typically flow up (i.e. get larger as you step higher up
in the tree).

It's looking like Robert's "pg_overexplain" will hit the tree soon,
so maybe there could be a case for teaching that to emit additional
costing details such as these?

I don't think that addresses the end-user usability sufficiently - from
what I gathered "pg_overexplain" was meant for a hacker audience, not
DBAs/etc interpreting Postgres plan choices.

Thanks,
Lukas

--
Lukas Fittl

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Lukas Fittl (#18)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

Lukas Fittl <lukas@fittl.com> writes:

The main argument I had initially when proposing this, is that Memoize is
different from other plan nodes, in that it makes the child node costs
"cheaper". Clearly seeing the expected cache hit/ratio (that drives that
costing modification) helps interpret why the planner came up with a given
plan.

Memoize is hardly unique in that respect. Merge Join sometimes
expects that it won't have to read the inner input to completion,
and reduces its cost estimate accordingly, and that confuses people.
LIMIT also reduces the cost estimate of its input, though perhaps
that doesn't surprise people.

As I said, I'm not necessarily averse to showing these numbers
somehow. But I don't think they belong in the default output,
and I'm not even convinced that VERBOSE is the right place.
pg_overexplain seems like it could be an ideal home for this
sort of detail.

regards, tom lane

#20David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#17)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, 25 Mar 2025 at 11:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I share these doubts about whether these values are useful
enough to include in the default EXPLAIN output. My main beef
with them though is that they are basically numbers derived along
the way to producing a cost estimate, and I don't think we break
out such intermediate results for other node types.

The only likeness I can think of is the "(originally N)" for the
buckets and batches in EXPLAIN ANALYZE for Hash Joins. I see that that
only shows when one of those isn't the same as the planner's
expectations.

Maybe there's room to show additional details for Memoize in some
similar way only during EXPLAIN ANALYZE. EXPLAIN ANALYZE for Memoize
currently shows things like the number of hits, misses and evictions.
We could calculate what the planner expected those values to be and
show those. For this case, they're almost certain not to match what
the planner expected, but maybe that's ok.

David

#21Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#16)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 24/3/2025 23:05, David Rowley wrote:

On Tue, 25 Mar 2025 at 10:23, Andrei Lepikhov <lepihov@gmail.com> wrote:

On 3/23/25 22:16, David Rowley wrote:

Once again, I'm not necessarily objecting to hit and evict ratios
being shown, I just want to know they're actually useful enough to
show and don't just bloat the EXPLAIN output needlessly. So far your
arguments aren't convincing me that they are.

I'm -1 for this redundancy.

I'm not following what the -1 is for. Is it for showing both hit and
evict ratios? And your vote is only for adding hit ratio?

Oh, pardon me. I wanted to say that, IMO, the pair (capacity, distinct
lookup keys) makes (hit ratio, eviction ratio) redundant, and vice versa.

--
regards, Andrei Lepikhov

#22Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#19)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 3/24/25 23:45, Tom Lane wrote:

Lukas Fittl <lukas@fittl.com> writes:

The main argument I had initially when proposing this, is that Memoize is
different from other plan nodes, in that it makes the child node costs
"cheaper". Clearly seeing the expected cache hit/ratio (that drives that
costing modification) helps interpret why the planner came up with a given
plan.

As I said, I'm not necessarily averse to showing these numbers
somehow. But I don't think they belong in the default output,
and I'm not even convinced that VERBOSE is the right place.
pg_overexplain seems like it could be an ideal home for this
sort of detail.

I prefer not to see these numbers in the default EXPLAIN output, not
only because they can fluctuate but mainly because I like to view the
basic query structure without additional details.
As I mentioned earlier, most of the information we typically need to
explore query plans stays in path nodes and does not transfer to the
Plan node. I believe this should stay as is, as we deal with numerous
cases and a vast amount of data.
It would be beneficial to expose extra data in an external extension. By
implementing a `create_plan` hook and an extensible list node in both
Path and Plan structures, we could provide a machinery for writing an
extension that can expose any planning-stage information in EXPLAIN on
demand.
This could encourage developers to create a "pg_extended_explain"
extension, which would address most user needs without requiring
additional changes to the core system.

--
regards, Andrei Lepikhov

#23Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#19)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Mon, Mar 24, 2025 at 6:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As I said, I'm not necessarily averse to showing these numbers
somehow. But I don't think they belong in the default output,
and I'm not even convinced that VERBOSE is the right place.
pg_overexplain seems like it could be an ideal home for this
sort of detail.

I think we're going to be sad if we start shunting things that
non-developers need into pg_overexplain. On the one hand, PostgreSQL
consultants will be annoyed because they'll have to get a contrib
module installed in order to be able to do their jobs, and I don't
think that should be a requirement. On the other hand, PostgreSQL
developers will also be annoyed because once the consultants start
using it they'll complain when we change things, and I think we want
to have the freedom to change things in pg_overexplain. For that
reason, I think that if we choose to display anything here, it should
either be displayed all the time or gated by some in-core option such
as VERBOSE.

I do acknowledge the argument that we don't show details of how costs
are derived in other cases. While I think that point has some
validity, the flip side is that I spend a fairly significant amount of
time attempting to reverse-engineer what the planner did from the
EXPLAIN output, and I find that pretty unenjoyable. The recent change
to show two decimal places on row-count estimation is one that comes
up a heck of a lot, and several people have thanked me for getting
that patch committed because that problem affected them, too. But it's
surely not the only example of a case where it's hard to determine
what happened in the planner from what shows up in EXPLAIN output, and
I think that trying to find ways to improve on that situation is
worthwhile.

I also don't think that we should be too concerned about bloating the
EXPLAIN output in the context of a patch that only affects Memoize.
Memoize nodes are not incredibly common in the query plans that I see,
so even if we added another line or three to the output for each one,
I don't think that would create major problems. On the other hand,
maybe there's an argument that what this patch touches is only the tip
of the iceberg, and that we're eventually going to want the same kinds
of things for Nested Loop and Hash Joins and Merge Joins, and maybe
even more detail that can be displayed in say 3 lines. In that case,
there's a double concern. On the one hand, that really would make the
output a whole lot more verbose, and on the other hand, it might
generate a fair amount of work to maintain it across future planner
changes. I can see deciding to reject changes of that sort on the
grounds that we're not prepared to maintain it, or deciding to gate it
behind a new option on the grounds that it is so verbose that even
people who say EXPLAIN VERBOSE are going to be sad if they get all
that crap by default. I'm not saying that we SHOULD make those
decisions -- I think exposing more detail here could be pretty useful
to people trying to solve query plan problems, including me, so I hope
we don't just kick that idea straight to the curb without due thought
-- but I would understand them.

The part I'm least sure about with respect to the proposed patch is
the actual stuff being displayed. I don't have the experience to know
whether it's useful for tracking down issues. If it's not, then I
agree we shouldn't display it. If it is, then I'm tentatively in favor
of showing it in standard EXPLAIN, possibly only with VERBOSE, with
the caveats from the previous paragraph: if more-common node types are
also going to have a bunch of stuff like this, then we need to think
more carefully. If Memoize is exceptional in needing additional
information displayed, then I think it's fine.

--
Robert Haas
EDB: http://www.enterprisedb.com

#24Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Robert Haas (#23)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 26.03.2025 22:37, Robert Haas wrote:

I also don't think that we should be too concerned about bloating the
EXPLAIN output in the context of a patch that only affects Memoize.
Memoize nodes are not incredibly common in the query plans that I see,
so even if we added another line or three to the output for each one,
I don't think that would create major problems. On the other hand,
maybe there's an argument that what this patch touches is only the tip
of the iceberg, and that we're eventually going to want the same kinds
of things for Nested Loop and Hash Joins and Merge Joins, and maybe
even more detail that can be displayed in say 3 lines. In that case,
there's a double concern. On the one hand, that really would make the
output a whole lot more verbose, and on the other hand, it might
generate a fair amount of work to maintain it across future planner
changes. I can see deciding to reject changes of that sort on the
grounds that we're not prepared to maintain it, or deciding to gate it
behind a new option on the grounds that it is so verbose that even
people who say EXPLAIN VERBOSE are going to be sad if they get all
that crap by default. I'm not saying that we SHOULD make those
decisions -- I think exposing more detail here could be pretty useful
to people trying to solve query plan problems, including me, so I hope
we don't just kick that idea straight to the curb without due thought
-- but I would understand them.

The part I'm least sure about with respect to the proposed patch is
the actual stuff being displayed. I don't have the experience to know
whether it's useful for tracking down issues. If it's not, then I
agree we shouldn't display it. If it is, then I'm tentatively in favor
of showing it in standard EXPLAIN, possibly only with VERBOSE, with
the caveats from the previous paragraph: if more-common node types are
also going to have a bunch of stuff like this, then we need to think
more carefully. If Memoize is exceptional in needing additional
information displayed, then I think it's fine.

--
Robert Haas
EDB:http://www.enterprisedb.com

I understand the concerns raised about the risk of opening the door to
more diagnostic detail across various plan nodes. However, in Hash Join,
Merge Join, and Nested Loop, EXPLAIN typically reveals at least some of
the planner’s expectations. For example, Hash Join shows the number of
batches and originally expected buckets, giving insight into whether the
hash table fit in memory. Merge Join shows unexpected Sort nodes when
presorted inputs were assumed. Nested Loop reflects planner assumptions
via loops and row estimates. In other words, these nodes expose at least
some information about what the planner thought would happen.

Memoize is unique in that it shows runtime statistics (hits, misses,
evictions), but reveals nothing about the planner’s expectations. We
don’t see how many distinct keys were estimated or how many entries the
planner thought would fit in memory. This makes it very difficult to
understand whether Memoize was a good choice or not, or how to fix it
when it performs poorly.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#25Robert Haas
robertmhaas@gmail.com
In reply to: Ilia Evdokimov (#24)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Thu, Mar 27, 2025 at 6:12 AM Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

I understand the concerns raised about the risk of opening the door to more diagnostic detail across various plan nodes. However, in Hash Join, Merge Join, and Nested Loop, EXPLAIN typically reveals at least some of the planner’s expectations. For example, Hash Join shows the number of batches and originally expected buckets, giving insight into whether the hash table fit in memory. Merge Join shows unexpected Sort nodes when presorted inputs were assumed. Nested Loop reflects planner assumptions via loops and row estimates. In other words, these nodes expose at least some information about what the planner thought would happen.

Memoize is unique in that it shows runtime statistics (hits, misses, evictions), but reveals nothing about the planner’s expectations. We don’t see how many distinct keys were estimated or how many entries the planner thought would fit in memory. This makes it very difficult to understand whether Memoize was a good choice or not, or how to fix it when it performs poorly.

Right. Without taking a strong position on this particular patch, I
think it's entirely reasonable, as a concept, to think of making
Memoize work similarly to what other nodes already do.

--
Robert Haas
EDB: http://www.enterprisedb.com

#26Andrei Lepikhov
lepihov@gmail.com
In reply to: Ilia Evdokimov (#24)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 3/27/25 11:12, Ilia Evdokimov wrote:

Memoize is unique in that it shows runtime statistics (hits, misses,
evictions), but reveals nothing about the planner’s expectations. We
don’t see how many distinct keys were estimated or how many entries the
planner thought would fit in memory.

It is acceptable for me to have Memoize estimations covered by the COSTS
option.
This data may help both developers and users. For us, it provides
information on how good the group estimations are and lets us identify
gaps in the prediction code. A user may find insights into ways to
optimise the query, detect stale statistics, or define extended statistics.

This makes it very difficult to

understand whether Memoize was a good choice or not, or how to fix it
when it performs poorly.

I think the user already has enough data to determine whether Memoize was
the right choice - hits/misses/evictions show that.
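The runtime numbers Andrei refers to can be turned into effectiveness ratios directly from the Hits/Misses/Evictions line that EXPLAIN ANALYZE already prints. A minimal sketch (the function name and the assumption that every miss inserts one cache entry are mine, not from PostgreSQL):

```python
def observed_cache_stats(hits, misses, evictions):
    """Derive observed Memoize effectiveness from EXPLAIN ANALYZE counters.

    Assumes every cache miss inserts one entry, so evictions/misses
    approximates how often an inserted entry was later thrown away.
    """
    lookups = hits + misses
    return {
        "hit_ratio": hits / lookups if lookups else 0.0,
        "evict_ratio": evictions / misses if misses else 0.0,
    }
```

For example, a node reporting Hits: 900, Misses: 100, Evictions: 5 caches effectively: a 90% observed hit ratio with few evictions.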

--
regards, Andrei Lepikhov

#27Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Andrei Lepikhov (#26)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

Then we need to decide clearly what exactly to display in EXPLAIN for
the Memoize node: absolute values (estimated distinct keys and estimated
cache capacity) or ratios (hit_ratio and evict_ratio). Ratios have the
advantage of quickly reflecting the overall effectiveness of Memoize.
However, absolute values have a significant advantage as they explicitly
reveal the reason for Memoize's poor performance, making problem
diagnosis simpler.

With absolute values, users can directly understand the underlying
reason for poor performance. For example: insufficient memory (capacity
< distinct keys), inaccurate planner statistics (distinct keys
significantly different from actual values), poorly ordered keys
(capacity ~ distinct keys, but frequent evictions as seen in the
Evictions parameter), or Memoize simply not being beneficial (capacity ~
distinct keys ~ calls). Ratios, by contrast, only reflect the final
outcome without clearly indicating the cause or the specific steps
needed to resolve the issue.
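The diagnostic reasoning above can be sketched as a simple decision helper. This is purely illustrative: the function, its thresholds, and the returned advice strings are invented for this sketch, not part of PostgreSQL.

```python
def diagnose_memoize(capacity, distinct_keys, evictions, calls):
    """Map the absolute Memoize numbers to a likely cause, per the
    cases discussed above. Thresholds are illustrative guesses."""
    if capacity < distinct_keys:
        # insufficient memory: the cache cannot hold all distinct keys
        return "increase work_mem"
    if evictions > 0:
        # capacity looked sufficient, yet entries were still evicted
        return "keys poorly ordered"
    if distinct_keys >= 0.9 * calls:
        # nearly every lookup key is unique, so caching buys little
        return "memoize not beneficial"
    # estimates look consistent; if still slow, revisit ndistinct stats
    return "check statistics"
```

A reading like `diagnose_memoize(capacity=100, distinct_keys=1000, evictions=..., calls=...)` would point straight at work_mem, which is exactly the actionable detail ratios alone cannot give.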

Thus, absolute values do more than just inform users that a problem
exists; they provide actionable details that enable users to directly
address the problem (increase work_mem, refresh statistics, create
extended statistics, or disable Memoize entirely). Additionally, no
other plan nodes in PostgreSQL currently use a similar ratio-based
approach - everywhere else absolute values are consistently shown (e.g.,
number of rows, buckets, batches, memory used, etc.). Using absolute
values in Memoize maintains consistency with existing practice.

I've updated the patch to v5, since the new parameter est_unique_keys in
make_memoize() is now placed near est_entries, which is more logical and
readable than putting it at the end.

Any thoughts?

--
Best Regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v5-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Memoize.patch (text/x-patch)
From 79af33e499730a4bd0553832b23bf68c90817c6e Mon Sep 17 00:00:00 2001
From: Evdokimov Ilia <ilya.evdokimov@tantorlabs.com>
Date: Fri, 28 Mar 2025 15:17:09 +0300
Subject: [PATCH v5] Show ndistinct and est_entries in EXPLAIN for Memoize

---
 src/backend/commands/explain.c          | 16 ++++++++++++++++
 src/backend/optimizer/path/costsize.c   |  3 +++
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/backend/optimizer/util/pathnode.c   |  3 +++
 src/include/nodes/pathnodes.h           |  2 ++
 src/include/nodes/plannodes.h           |  6 ++++++
 6 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 391b34a2af2..2bacf7f85cb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3628,6 +3628,22 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	ExplainPropertyText("Cache Key", keystr.data, es);
 	ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
 
+	if (es->costs)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Estimated Capacity: %u  Estimated Distinct Lookup Keys: %0.0f\n",
+							((Memoize *) plan)->est_entries,
+							((Memoize *) plan)->est_unique_keys);
+		}
+		else
+		{
+			ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+		}
+	}
+
 	pfree(keystr.data);
 
 	if (!es->analyze)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index f6f77b8fe19..e3abf9ccc26 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 75e2b0b9036..38882501484 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							 uint32 est_entries, double est_unique_keys,
+							 Bitmapset *keyparamids);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, best_path->est_unique_keys,
+						keyparamids);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, double est_unique_keys,
+			Bitmapset *keyparamids)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..1fbcda99067 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,9 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/* Estimated number of distinct memoization keys, computed using estimate_num_groups() */
+	pathnode->est_unique_keys = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c24a1fc8514..0946ccea994 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,8 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;		/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 658d76225e4..3d9d3a1159d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1063,6 +1063,12 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#28Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Ilia Evdokimov (#27)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 28.03.2025 15:20, Ilia Evdokimov wrote:

Then we need to decide clearly what exactly to display in EXPLAIN for
the Memoize node: absolute values (estimated distinct keys and
estimated cache capacity) or ratios (hit_ratio and evict_ratio).
Ratios have the advantage of quickly reflecting the overall
effectiveness of Memoize. However, absolute values have a significant
advantage as they explicitly reveal the reason for Memoize's poor
performance, making problem diagnosis simpler.

With absolute values, users can directly understand the underlying
reason for poor performance. For example: insufficient memory
(capacity < distinct keys), inaccurate planner statistics (distinct
keys significantly different from actual values), poorly ordered keys
(capacity ~ distinct keys, but frequent evictions as seen in the
Evictions parameter), or Memoize simply not being beneficial (capacity
~ distinct keys ~ calls). Ratios, by contrast, only reflect the final
outcome without clearly indicating the cause or the specific steps
needed to resolve the issue.

Thus, absolute values do more than just inform users that a problem
exists; they provide actionable details that enable users to directly
address the problem (increase work_mem, refresh statistics, create
extended statistics, or disable Memoize entirely). Additionally, no
other plan nodes in PostgreSQL currently use a similar ratio-based
approach - everywhere else absolute values are consistently shown
(e.g., number of rows, buckets, batches, memory used, etc.). Using
absolute values in Memoize maintains consistency with existing practice.

I've updated the patch to v5, since the new parameter est_unique_keys
in make_memoize() is now placed near est_entries, which is more
logical and readable than putting it at the end.

Any thoughts?

--
Best Regards,
Ilia Evdokimov,
Tantor Labs LLC.

With the feature freeze coming up soon, I’d like to ask: do we plan to
include this patch in v18?

Please let me know if there’s anything I can do to help move it forward.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#29David Rowley
dgrowleyml@gmail.com
In reply to: Ilia Evdokimov (#28)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, 1 Apr 2025 at 20:52, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

With the feature freeze coming up soon, I’d like to ask: do we plan to
include this patch in v18?

I don't think there's a clear enough consensus yet on what EXPLAIN
should display. We'd need that beforehand. There are now fewer than 7
days to get that, so it might be unrealistic.

I tried to move things along to address Tom's concern about not
wanting to show this in EXPLAIN's standard output. I suggested in [1]
that we could use EXPLAIN ANALYZE, but nobody commented on that.
EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
put these in EXPLAIN ANALYZE then the viewer can more easily compare
planned vs actual. I had mentioned that Hash Join does something like
this for buckets.

David

[1]: /messages/by-id/CAApHDvqPkWmz1Lq23c9CD+mAW=hgPrD289tC30f3H1f6Ng59+g@mail.gmail.com

#30Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: David Rowley (#29)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 02.04.2025 00:06, David Rowley wrote:

I tried to move things along to address Tom's concern about not
wanting to show this in EXPLAIN's standard output. I suggested in [1]
that we could use EXPLAIN ANALYZE, but nobody commented on that.
EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
put these in EXPLAIN ANALYZE then the viewer can more easily compare
planned vs actual. I had mentioned that Hash Join does something like
this for buckets.

David

[1]: /messages/by-id/CAApHDvqPkWmz1Lq23c9CD+mAW=hgPrD289tC30f3H1f6Ng59+g@mail.gmail.com

Apologies for not addressing your earlier suggestion properly. After
reconsidering, I agree that showing ndistinct and est_entries in EXPLAIN
ANALYZE when there are actual cache misses is the best approach. This is
typically when users notice performance regressions and start
investigating the cause, so surfacing planner expectations at that point
can be the most useful. I attached the v6 patch with these changes.
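For illustration, the text-format line the v6 patch emits can be previewed with the same printf-style format string it uses in show_memoize_info(); the two values below are invented:

```python
# Format string copied from the v6 patch; values are illustrative only.
est_entries = 1024          # planner's cache-capacity estimate (uint32)
est_unique_keys = 100000.0  # planner's distinct-key estimate (double)

line = ("Estimated Capacity: %u  Estimated Distinct Lookup Keys: %0.0f"
        % (est_entries, est_unique_keys))
print(line)  # → Estimated Capacity: 1024  Estimated Distinct Lookup Keys: 100000
```

The distinct-key estimate is printed with no decimal places because it is carried around as a double but conceptually a count.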

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v6-0001-Display-estimated-distinct-keys-and-cache-capacity.patch (text/x-patch)
From 36ed866e3d28a709fd1133a89ecdc7a2654a657f Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Wed, 2 Apr 2025 10:49:27 +0300
Subject: [PATCH v6] Display estimated distinct keys and cache capacity in
 EXPLAIN ANALYZE for Memoize when cache misses occur

---
 src/backend/commands/explain.c          | 13 +++++++++++++
 src/backend/optimizer/path/costsize.c   |  3 +++
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/backend/optimizer/util/pathnode.c   |  3 +++
 src/include/nodes/pathnodes.h           |  2 ++
 src/include/nodes/plannodes.h           |  6 ++++++
 6 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ef8aa489af8..f21de58e0a0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3635,6 +3635,19 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 
 	if (mstate->stats.cache_misses > 0)
 	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Estimated Capacity: %u  Estimated Distinct Lookup Keys: %0.0f\n",
+							((Memoize *) plan)->est_entries,
+							((Memoize *) plan)->est_unique_keys);
+		}
+		else
+		{
+			ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+		}
+
 		/*
 		 * mem_peak is only set when we freed memory, so we must use mem_used
 		 * when mem_peak is 0.
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index f6f77b8fe19..e3abf9ccc26 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 75e2b0b9036..38882501484 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							 uint32 est_entries, double est_unique_keys,
+							 Bitmapset *keyparamids);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, best_path->est_unique_keys,
+						keyparamids);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, double est_unique_keys,
+			Bitmapset *keyparamids)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..1fbcda99067 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,9 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/* Estimated number of distinct memoization keys, computed using estimate_num_groups() */
+	pathnode->est_unique_keys = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c24a1fc8514..0946ccea994 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2040,6 +2040,8 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;		/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 658d76225e4..3d9d3a1159d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1063,6 +1063,12 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#31Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Ilia Evdokimov (#30)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

There seems to be general support for the idea behind this patch, and
no one has raised strong objections recently, but no clear consensus on
the exact final shape of the output has been reached yet. Since I
understand other priorities might be taking precedence right now, I've
registered the patch in the upcoming CommitFest 2025-07 to ensure it
stays in the review process.

https://commitfest.postgresql.org/patch/5697/

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#32Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#29)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, Apr 1, 2025 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

I tried to move things along to address Tom's concern about not
wanting to show this in EXPLAIN's standard output. I suggested in [1]
that we could use EXPLAIN ANALYZE, but nobody commented on that.
EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
put these in EXPLAIN ANALYZE then the viewer can more easily compare
planned vs actual. I had mentioned that Hash Join does something like
this for buckets.

I don't think we should use ANALYZE for this because, IME, that should
just be about whether the query gets executed. Since this looks like
information that is available at plan time, I think it should be
displayed all the time or made contingent on VERBOSE. It would be sad
if you had to run the query to get information that was available
without needing to run the query. Personally, I think we can probably
just display it all the time. I mean, the typical plan is not going to
contain enough Memoize nodes that a little extra chatter for each one
impedes readability significantly.

As far as what to display, I have sympathy with Lukas's initial
complaint that it was hard to understand the costing of Memoize. I
haven't faced this exact problem, I think because Memoize isn't that
often used in plans I have seen, but I've faced a lot of similar
problems where you have to try to work backwards painfully to figure
out the chain of events that led to the value you're actually seeing,
and I'm +1 for efforts to make it more clear.

Having looked at v6, I think it would help, at least if the reader is
sufficiently knowledgeable. From the values displayed, it looks like
someone could reconstruct the evict_ratio value in
cost_memoize_rescan(), and if they know the loop count, also the
hit_ratio. But that seems hard: if you weren't reading the code, how
would you know how to do it? Even if you are reading the code, are you
sure you'd reconstruct it correctly? I wonder why we think it's better
to display this than a more "cooked" number like the estimated hit
ratio that was proposed in v1?
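For reference, the reconstruction Robert describes boils down to the following arithmetic. This is a simplified sketch of the formulas in cost_memoize_rescan() (the real code also clamps ndistinct to at least 1 and handles other edge cases), assuming calls > 0:

```python
def memoize_ratios(calls, ndistinct, est_cache_entries):
    """Simplified sketch of the hit/evict ratio estimates in
    cost_memoize_rescan(). calls = expected rescans of the node."""
    # If more distinct keys exist than the cache can hold, the excess
    # fraction of entries will eventually be evicted.
    evict_ratio = 1.0 - min(est_cache_entries, ndistinct) / ndistinct
    # Only repeat lookups can hit, and only for keys that fit in cache.
    hit_ratio = ((calls - ndistinct) / calls) * \
        (est_cache_entries / max(ndistinct, est_cache_entries))
    return evict_ratio, max(hit_ratio, 0.0)
```

With the values v6 displays (Estimated Capacity and Estimated Distinct Lookup Keys) plus the loop count, a reader could plug in the numbers, but Robert's point stands: knowing to do this requires reading the source.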

--
Robert Haas
EDB: http://www.enterprisedb.com

#33Andrei Lepikhov
lepihov@gmail.com
In reply to: Robert Haas (#32)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 4/14/25 18:31, Robert Haas wrote:

On Tue, Apr 1, 2025 at 5:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

I tried to move things along to address Tom's concern about not
wanting to show this in EXPLAIN's standard output. I suggested in [1]
that we could use EXPLAIN ANALYZE, but nobody commented on that.
EXPLAIN ANALYZE is much more verbose than EXPLAIN already, and if we
put these in EXPLAIN ANALYZE then the viewer can more easily compare
planned vs actual. I had mentioned that Hash Join does something like
this for buckets.

Having looked at v6, I think it would help, at least if the reader is
sufficiently knowledgeable. From the values displayed, it looks like
someone could reconstruct the evict_ratio value in
cost_memoize_rescan(), and if they know the loop count, also the
hit_ratio. But that seems hard: if you weren't reading the code, how
would you know how to do it? Even if you are reading the code, are you
sure you'd reconstruct it correctly? I wonder why we think it's better
to display this than a more "cooked" number like the estimated hit
ratio that was proposed in v1?

+1.
My reasons: a) it is more "physical"; b) we don't need to hunt for the
corresponding Nested Loop join node - my EXPLAIN outputs usually contain
150+ lines, and it is hard to bring anything outside the current screen
into the analysis.

--
regards, Andrei Lepikhov

#34David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#32)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, 15 Apr 2025 at 04:31, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think we should use ANALYZE for this because, IME, that should
just be about whether the query gets executed. Since this looks like
information that is available at plan time, I think it should be
displayed all the time or made contingent on VERBOSE. It would be sad
if you had to run the query to get information that was available
without needing to run the query. Personally, I think we can probably
just display it all the time. I mean, the typical plan is not going to
contain enough Memoize nodes that a little extra chatter for each one
impedes readability significantly.

You might be right there. I just offered that as a possible solution
to the concern of bloating the EXPLAIN output.

Having looked at v6, I think it would help, at least if the reader is
sufficiently knowledgeable. From the values displayed, it looks like
someone could reconstruct the evict_ratio value in
cost_memoize_rescan(), and if they know the loop count, also the
hit_ratio. But that seems hard: if you weren't reading the code, how
would you know how to do it? Even if you are reading the code, are you
sure you'd reconstruct it correctly? I wonder why we think it's better
to display this than a more "cooked" number like the estimated hit
ratio that was proposed in v1?

I think the problem with only showing the estimated hit ratio is that
it's difficult to then know what you might change to improve that in
cases where you might have other join types disabled and are
experimenting to figure out why Nested Loop / Memoize isn't being
chosen. If you see the estimated capacity isn't big enough to store
all the rows for all distinct keys then you know you can increase
work_mem to assist. If the Estimated Unique Keys is too high, then you
might be able to adjust some ndistinct estimates or increase stats
targets somewhere to improve that. If we only showed the estimated
hit ratio, then you wouldn't know where to start to get the planner to
pick the Memoize plan.

As I mentioned in [1], I wouldn't object to having the estimated hit
ratio shown in addition to the components that are used to calculate
it. Per [2], Andrei didn't seem to like the idea due to the
duplication. FWIW, I do agree with you that the hit ratio isn't
exactly easy to calculate on the fly when looking at the input numbers
that are used to calculate it. I just wasn't 100% certain that it
would be needed in its calculated form. Maybe it depends on the reason
you're looking at EXPLAIN. Maybe the reason I mentioned above for
looking at EXPLAIN is just one of many and I'm not thinking hard
enough to consider the other ones.

If we can't get consensus for everything people want to add at once
then maybe the patch could be broken into two, with 0001 being pretty
much the v4 patch and then have 0002 add the Estimated Hit Ratio.
Having the physical patches and being able to try it out or view the
regression test changes might lower the bar on people chipping in with
their views.

David

[1]: /messages/by-id/CAApHDvoE5S5FkvEq+N3-J9LfaVUpWLOnczOYAOEvBMCY20=pdg@mail.gmail.com
[2]: /messages/by-id/e76f1635-a933-47bf-a826-045e20d5cfff@gmail.com

#35Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: David Rowley (#34)
2 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 14.04.2025 23:53, David Rowley wrote:

If we can't get consensus for everything people want to add at once
then maybe the patch could be broken into two, with 0001 being pretty
much the v4 patch and then have 0002 add the Estimated Hit Ratio.
Having the physical patches and being able to try it out or view the
regression test changes might lower the bar on people chipping in with
their views.

Agreed - I’ll split the patch into two: one adds Estimated Capacity
and Estimated Distinct Keys, and the second adds Estimated Hit Ratio.
I’ll also add regression test differences to make it easier to evaluate
the usefulness of each part.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v7-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Mem.patch (text/x-patch) Download
From 99a23dfae4415a3eb88c3847deefae3c2103077d Mon Sep 17 00:00:00 2001
From: Evdokimov Ilia <ilya.evdokimov@tantorlabs.com>
Date: Tue, 15 Apr 2025 00:47:26 +0300
Subject: [PATCH v7 1/2] Show ndistinct and est_entries in EXPLAIN for Memoize

---
 src/backend/commands/explain.c               | 13 ++++
 src/backend/optimizer/path/costsize.c        |  3 +
 src/backend/optimizer/plan/createplan.c      | 10 ++-
 src/backend/optimizer/util/pathnode.c        |  3 +
 src/include/nodes/pathnodes.h                |  2 +
 src/include/nodes/plannodes.h                |  6 ++
 src/test/regress/expected/create_index.out   |  3 +-
 src/test/regress/expected/join.out           | 64 +++++++++++---------
 src/test/regress/expected/memoize.out        | 54 +++++++++++------
 src/test/regress/expected/partition_join.out | 18 ++++--
 src/test/regress/expected/subselect.out      | 10 +--
 11 files changed, 126 insertions(+), 60 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 786ee865f14..0a4ce5ee7d7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3628,6 +3628,19 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	ExplainPropertyText("Cache Key", keystr.data, es);
 	ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
 
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		ExplainIndentText(es);
+		appendStringInfo(es->str, "Estimated Capacity: %u  Estimated Distinct Lookup Keys: %0.0f\n",
+						((Memoize *) plan)->est_entries,
+						((Memoize *) plan)->est_unique_keys);
+	}
+	else
+	{
+		ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
+		ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+	}
+
 	pfree(keystr.data);
 
 	if (!es->analyze)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 60b0fcfb6be..f72319d903c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a8f22a8c154..a1456c9014d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							uint32 est_entries, Bitmapset *keyparamids,
+							double est_unique_keys);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->est_unique_keys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, Bitmapset *keyparamids,
+			double est_unique_keys)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..1fbcda99067 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,9 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/* Estimated number of distinct memoization keys, computed using estimate_num_groups() */
+	pathnode->est_unique_keys = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index bb678bdcdcd..07d97dc0b5b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2138,6 +2138,8 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;	/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 658d76225e4..3d9d3a1159d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1063,6 +1063,12 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
 } Memoize;
 
 /* ----------------
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..826018a8a9f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2245,10 +2245,11 @@ SELECT count(*) FROM tenk1 LEFT JOIN tenk2 ON
          ->  Memoize
                Cache Key: tenk1.hundred
                Cache Mode: logical
+               Estimated Capacity: 100  Estimated Distinct Lookup Keys: 100
                ->  Index Scan using tenk2_hundred on tenk2
                      Index Cond: (hundred = tenk1.hundred)
                      Filter: ((thousand = 42) OR (thousand = 41) OR (tenthous = 2))
-(10 rows)
+(11 rows)
 
 --
 -- Check behavior with duplicate index column contents
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 14da5708451..eca6ce2c6f0 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2729,8 +2729,8 @@ select * from onek t1
     left join lateral
       (select * from onek t3 where t3.two = t2.two offset 0) s
       on t2.unique1 = 1;
-                    QUERY PLAN                    
---------------------------------------------------
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
  Nested Loop Left Join
    ->  Seq Scan on onek t1
    ->  Materialize
@@ -2740,9 +2740,10 @@ select * from onek t1
                ->  Memoize
                      Cache Key: t2.two
                      Cache Mode: binary
+                     Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                      ->  Seq Scan on onek t3
                            Filter: (two = t2.two)
-(11 rows)
+(12 rows)
 
 --
 -- check a case where we formerly got confused by conflicting sort orders
@@ -5128,8 +5129,8 @@ select * from
   on i8.q2 = 123,
   lateral (select i8.q1, t2.f1 from text_tbl t2 limit 1) as ss
 where t1.f1 = ss.f1;
-                    QUERY PLAN                    
---------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Nested Loop
    Output: t1.f1, i8.q1, i8.q2, (i8.q1), t2.f1
    Join Filter: (t1.f1 = t2.f1)
@@ -5146,11 +5147,12 @@ where t1.f1 = ss.f1;
          Output: (i8.q1), t2.f1
          Cache Key: i8.q1
          Cache Mode: binary
+         Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
          ->  Limit
                Output: (i8.q1), t2.f1
                ->  Seq Scan on public.text_tbl t2
                      Output: i8.q1, t2.f1
-(20 rows)
+(21 rows)
 
 select * from
   text_tbl t1
@@ -5171,8 +5173,8 @@ select * from
   lateral (select i8.q1, t2.f1 from text_tbl t2 limit 1) as ss1,
   lateral (select ss1.* from text_tbl t3 limit 1) as ss2
 where t1.f1 = ss2.f1;
-                            QUERY PLAN                             
--------------------------------------------------------------------
+                               QUERY PLAN                               
+------------------------------------------------------------------------
  Nested Loop
    Output: t1.f1, i8.q1, i8.q2, (i8.q1), t2.f1, ((i8.q1)), (t2.f1)
    Join Filter: (t1.f1 = (t2.f1))
@@ -5191,6 +5193,7 @@ where t1.f1 = ss2.f1;
                Output: (i8.q1), t2.f1
                Cache Key: i8.q1
                Cache Mode: binary
+               Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
                ->  Limit
                      Output: (i8.q1), t2.f1
                      ->  Seq Scan on public.text_tbl t2
@@ -5199,11 +5202,12 @@ where t1.f1 = ss2.f1;
          Output: ((i8.q1)), (t2.f1)
          Cache Key: (i8.q1), t2.f1
          Cache Mode: binary
+         Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
          ->  Limit
                Output: ((i8.q1)), (t2.f1)
                ->  Seq Scan on public.text_tbl t3
                      Output: (i8.q1), t2.f1
-(30 rows)
+(32 rows)
 
 select * from
   text_tbl t1
@@ -5225,8 +5229,8 @@ select 1 from
   left join text_tbl as tt4 on (tt3.f1 = tt4.f1),
   lateral (select tt4.f1 as c0 from text_tbl as tt5 limit 1) as ss1
 where tt1.f1 = ss1.c0;
-                        QUERY PLAN                        
-----------------------------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Nested Loop
    Output: 1
    ->  Nested Loop Left Join
@@ -5252,6 +5256,7 @@ where tt1.f1 = ss1.c0;
          Output: ss1.c0
          Cache Key: tt4.f1
          Cache Mode: binary
+         Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
          ->  Subquery Scan on ss1
                Output: ss1.c0
                Filter: (ss1.c0 = 'foo'::text)
@@ -5259,7 +5264,7 @@ where tt1.f1 = ss1.c0;
                      Output: (tt4.f1)
                      ->  Seq Scan on public.text_tbl tt5
                            Output: tt4.f1
-(32 rows)
+(33 rows)
 
 select 1 from
   text_tbl as tt1
@@ -6462,17 +6467,18 @@ select * from sj t1
     join lateral
       (select * from sj tablesample system(t1.b)) s
     on t1.a = s.a;
-              QUERY PLAN               
----------------------------------------
+                            QUERY PLAN                            
+------------------------------------------------------------------
  Nested Loop
    ->  Seq Scan on sj t1
    ->  Memoize
          Cache Key: t1.a, t1.b
          Cache Mode: binary
+         Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
          ->  Sample Scan on sj
                Sampling: system (t1.b)
                Filter: (t1.a = a)
-(8 rows)
+(9 rows)
 
 -- Ensure that SJE does not form a self-referential lateral dependency
 explain (costs off)
@@ -7614,43 +7620,46 @@ select count(*) from tenk1 a, lateral generate_series(1,two) g;
 
 explain (costs off)
   select count(*) from tenk1 a, lateral generate_series(1,two) g;
-                      QUERY PLAN                      
-------------------------------------------------------
+                               QUERY PLAN                               
+------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on tenk1 a
          ->  Memoize
                Cache Key: a.two
                Cache Mode: binary
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                ->  Function Scan on generate_series g
-(7 rows)
+(8 rows)
 
 explain (costs off)
   select count(*) from tenk1 a cross join lateral generate_series(1,two) g;
-                      QUERY PLAN                      
-------------------------------------------------------
+                               QUERY PLAN                               
+------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on tenk1 a
          ->  Memoize
                Cache Key: a.two
                Cache Mode: binary
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                ->  Function Scan on generate_series g
-(7 rows)
+(8 rows)
 
 -- don't need the explicit LATERAL keyword for functions
 explain (costs off)
   select count(*) from tenk1 a, generate_series(1,two) g;
-                      QUERY PLAN                      
-------------------------------------------------------
+                               QUERY PLAN                               
+------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on tenk1 a
          ->  Memoize
                Cache Key: a.two
                Cache Mode: binary
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                ->  Function Scan on generate_series g
-(7 rows)
+(8 rows)
 
 -- lateral with UNION ALL subselect
 explain (costs off)
@@ -7702,8 +7711,8 @@ select count(*) from tenk1 a,
 explain (costs off)
   select count(*) from tenk1 a,
     tenk1 b join lateral (values(a.unique1),(-1)) ss(x) on b.unique2 = ss.x;
-                            QUERY PLAN                            
-------------------------------------------------------------------
+                               QUERY PLAN                               
+------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Nested Loop
@@ -7712,9 +7721,10 @@ explain (costs off)
          ->  Memoize
                Cache Key: "*VALUES*".column1
                Cache Mode: logical
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                ->  Index Only Scan using tenk1_unique2 on tenk1 b
                      Index Cond: (unique2 = "*VALUES*".column1)
-(10 rows)
+(11 rows)
 
 select count(*) from tenk1 a,
   tenk1 b join lateral (values(a.unique1),(-1)) ss(x) on b.unique2 = ss.x;
diff --git a/src/test/regress/expected/memoize.out b/src/test/regress/expected/memoize.out
index 38dfaf021c9..39c76aaa1a4 100644
--- a/src/test/regress/expected/memoize.out
+++ b/src/test/regress/expected/memoize.out
@@ -46,12 +46,13 @@ WHERE t2.unique1 < 1000;', false);
          ->  Memoize (actual rows=1.00 loops=N)
                Cache Key: t2.twenty
                Cache Mode: logical
+               Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
                Hits: 980  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t1 (actual rows=1.00 loops=N)
                      Index Cond: (unique1 = t2.twenty)
                      Heap Fetches: N
                      Index Searches: N
-(13 rows)
+(14 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
@@ -78,12 +79,13 @@ WHERE t1.unique1 < 1000;', false);
          ->  Memoize (actual rows=1.00 loops=N)
                Cache Key: t1.twenty
                Cache Mode: binary
+               Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
                Hits: 980  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t2 (actual rows=1.00 loops=N)
                      Index Cond: (unique1 = t1.twenty)
                      Heap Fetches: N
                      Index Searches: N
-(13 rows)
+(14 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*),AVG(t2.unique1) FROM tenk1 t1,
@@ -113,6 +115,7 @@ WHERE t1.unique1 < 10;', false);
          ->  Memoize (actual rows=2.00 loops=N)
                Cache Key: t1.two
                Cache Mode: binary
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                Hits: 8  Misses: 2  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Subquery Scan on t2 (actual rows=2.00 loops=N)
                      Filter: (t1.two = t2.two)
@@ -120,7 +123,7 @@ WHERE t1.unique1 < 10;', false);
                      ->  Index Scan using tenk1_unique1 on tenk1 t2_1 (actual rows=4.00 loops=N)
                            Index Cond: (unique1 < 4)
                            Index Searches: N
-(15 rows)
+(16 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*),AVG(t2.t1two) FROM tenk1 t1 LEFT JOIN
@@ -149,13 +152,14 @@ WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false);
          ->  Memoize (actual rows=1.00 loops=N)
                Cache Key: (t1.two + 1)
                Cache Mode: binary
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                Hits: 998  Misses: 2  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t2 (actual rows=1.00 loops=N)
                      Filter: ((t1.two + 1) = unique1)
                      Rows Removed by Filter: 9999
                      Heap Fetches: N
                      Index Searches: N
-(14 rows)
+(15 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN
@@ -182,11 +186,12 @@ WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false);
          ->  Memoize (actual rows=1.00 loops=N)
                Cache Key: t1.two, t1.twenty
                Cache Mode: binary
+               Estimated Capacity: 40  Estimated Distinct Lookup Keys: 40
                Hits: 980  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Seq Scan on tenk1 t2 (actual rows=1.00 loops=N)
                      Filter: ((t1.twenty = unique1) AND (t1.two = two))
                      Rows Removed by Filter: 9999
-(12 rows)
+(13 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN
@@ -220,13 +225,14 @@ ON t1.x = t2.t::numeric AND t1.t::numeric = t2.x;', false);
    ->  Memoize (actual rows=2.00 loops=N)
          Cache Key: t1.x, (t1.t)::numeric
          Cache Mode: logical
+         Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
          Hits: 20  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Only Scan using expr_key_idx_x_t on expr_key t2 (actual rows=2.00 loops=N)
                Index Cond: (x = (t1.t)::numeric)
                Filter: (t1.x = (t)::numeric)
                Heap Fetches: N
                Index Searches: N
-(11 rows)
+(12 rows)
 
 DROP TABLE expr_key;
 -- Reduce work_mem and hash_mem_multiplier so that we see some cache evictions
@@ -249,12 +255,13 @@ WHERE t2.unique1 < 1200;', true);
          ->  Memoize (actual rows=1.00 loops=N)
                Cache Key: t2.thousand
                Cache Mode: logical
+               Estimated Capacity: 655  Estimated Distinct Lookup Keys: 721
                Hits: N  Misses: N  Evictions: N  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t1 (actual rows=1.00 loops=N)
                      Index Cond: (unique1 = t2.thousand)
                      Heap Fetches: N
                      Index Searches: N
-(13 rows)
+(14 rows)
 
 CREATE TABLE flt (f float);
 CREATE INDEX flt_f_idx ON flt (f);
@@ -273,12 +280,13 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f = f2.f;', false);
    ->  Memoize (actual rows=2.00 loops=N)
          Cache Key: f1.f
          Cache Mode: logical
+         Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
          Hits: 1  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Only Scan using flt_f_idx on flt f2 (actual rows=2.00 loops=N)
                Index Cond: (f = f1.f)
                Heap Fetches: N
                Index Searches: N
-(12 rows)
+(13 rows)
 
 -- Ensure memoize operates in binary mode
 SELECT explain_memoize('
@@ -292,12 +300,13 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f >= f2.f;', false);
    ->  Memoize (actual rows=2.00 loops=N)
          Cache Key: f1.f
          Cache Mode: binary
+         Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
          Hits: 0  Misses: 2  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Only Scan using flt_f_idx on flt f2 (actual rows=2.00 loops=N)
                Index Cond: (f <= f1.f)
                Heap Fetches: N
                Index Searches: N
-(12 rows)
+(13 rows)
 
 DROP TABLE flt;
 -- Exercise Memoize in binary mode with a large fixed width type and a
@@ -320,11 +329,12 @@ SELECT * FROM strtest s1 INNER JOIN strtest s2 ON s1.n >= s2.n;', false);
    ->  Memoize (actual rows=4.00 loops=N)
          Cache Key: s1.n
          Cache Mode: binary
+         Estimated Capacity: 3  Estimated Distinct Lookup Keys: 3
          Hits: 3  Misses: 3  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Scan using strtest_n_idx on strtest s2 (actual rows=4.00 loops=N)
                Index Cond: (n <= s1.n)
                Index Searches: N
-(10 rows)
+(11 rows)
 
 -- Ensure we get 3 hits and 3 misses
 SELECT explain_memoize('
@@ -337,11 +347,12 @@ SELECT * FROM strtest s1 INNER JOIN strtest s2 ON s1.t >= s2.t;', false);
    ->  Memoize (actual rows=4.00 loops=N)
          Cache Key: s1.t
          Cache Mode: binary
+         Estimated Capacity: 4  Estimated Distinct Lookup Keys: 4
          Hits: 3  Misses: 3  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Scan using strtest_t_idx on strtest s2 (actual rows=4.00 loops=N)
                Index Cond: (t <= s1.t)
                Index Searches: N
-(10 rows)
+(11 rows)
 
 DROP TABLE strtest;
 -- Ensure memoize works with partitionwise join
@@ -366,6 +377,7 @@ SELECT * FROM prt t1 INNER JOIN prt t2 ON t1.a = t2.a;', false);
          ->  Memoize (actual rows=4.00 loops=N)
                Cache Key: t1_1.a
                Cache Mode: logical
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                Hits: 3  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using iprt_p1_a on prt_p1 t2_1 (actual rows=4.00 loops=N)
                      Index Cond: (a = t1_1.a)
@@ -378,12 +390,13 @@ SELECT * FROM prt t1 INNER JOIN prt t2 ON t1.a = t2.a;', false);
          ->  Memoize (actual rows=4.00 loops=N)
                Cache Key: t1_2.a
                Cache Mode: logical
+               Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
                Hits: 3  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using iprt_p2_a on prt_p2 t2_2 (actual rows=4.00 loops=N)
                      Index Cond: (a = t1_2.a)
                      Heap Fetches: N
                      Index Searches: N
-(25 rows)
+(27 rows)
 
 -- Ensure memoize works with parameterized union-all Append path
 SET enable_partitionwise_join TO off;
@@ -400,6 +413,7 @@ ON t1.a = t2.a;', false);
    ->  Memoize (actual rows=4.00 loops=N)
          Cache Key: t1.a
          Cache Mode: logical
+         Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
          Hits: 3  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Append (actual rows=4.00 loops=N)
                ->  Index Only Scan using iprt_p1_a on prt_p1 (actual rows=4.00 loops=N)
@@ -410,7 +424,7 @@ ON t1.a = t2.a;', false);
                      Index Cond: (a = t1.a)
                      Heap Fetches: N
                      Index Searches: N
-(17 rows)
+(18 rows)
 
 DROP TABLE prt;
 RESET enable_partitionwise_join;
@@ -424,8 +438,8 @@ WHERE unique1 < 3
 	SELECT 1 FROM tenk1 t1
 	INNER JOIN tenk1 t2 ON t1.unique1 = t2.hundred
 	WHERE t0.ten = t1.twenty AND t0.two <> t2.four OFFSET 0);
-                           QUERY PLAN                           
-----------------------------------------------------------------
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
  Index Scan using tenk1_unique1 on tenk1 t0
    Index Cond: (unique1 < 3)
    Filter: EXISTS(SubPlan 1)
@@ -436,10 +450,11 @@ WHERE unique1 < 3
            ->  Memoize
                  Cache Key: t2.hundred
                  Cache Mode: logical
+                 Estimated Capacity: 100  Estimated Distinct Lookup Keys: 100
                  ->  Index Scan using tenk1_unique1 on tenk1 t1
                        Index Cond: (unique1 = t2.hundred)
                        Filter: (t0.ten = twenty)
-(13 rows)
+(14 rows)
 
 -- Ensure the above query returns the correct result
 SELECT unique1 FROM tenk1 t0
@@ -469,8 +484,8 @@ EXPLAIN (COSTS OFF)
 SELECT COUNT(*),AVG(t2.unique1) FROM tenk1 t1,
 LATERAL (SELECT t2.unique1 FROM tenk1 t2 WHERE t1.twenty = t2.unique1) t2
 WHERE t1.unique1 < 1000;
-                                  QUERY PLAN                                   
--------------------------------------------------------------------------------
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
  Finalize Aggregate
    ->  Gather
          Workers Planned: 2
@@ -483,9 +498,10 @@ WHERE t1.unique1 < 1000;
                      ->  Memoize
                            Cache Key: t1.twenty
                            Cache Mode: logical
+                           Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
                            ->  Index Only Scan using tenk1_unique1 on tenk1 t2
                                  Index Cond: (unique1 = t1.twenty)
-(14 rows)
+(15 rows)
 
 -- And ensure the parallel plan gives us the correct results.
 SELECT COUNT(*),AVG(t2.unique1) FROM tenk1 t1,
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 6101c8c7cf1..d3c37ffe22b 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -5289,8 +5289,8 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1;
 
 -- Increase number of tuples requested and an IndexScan will be chosen
 EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
-                               QUERY PLAN                               
-------------------------------------------------------------------------
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
  Limit
    ->  Append
          ->  Nested Loop
@@ -5298,6 +5298,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                ->  Memoize
                      Cache Key: p1_1.c
                      Cache Mode: logical
+                     Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
                      ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
                            Index Cond: (c = p1_1.c)
          ->  Nested Loop
@@ -5305,6 +5306,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                ->  Memoize
                      Cache Key: p1_2.c
                      Cache Mode: logical
+                     Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
                      ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
                            Index Cond: (c = p1_2.c)
          ->  Nested Loop
@@ -5312,9 +5314,10 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                ->  Memoize
                      Cache Key: p1_3.c
                      Cache Mode: logical
+                     Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
                      ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
                            Index Cond: (c = p1_3.c)
-(23 rows)
+(26 rows)
 
 -- If almost all the data should be fetched - prefer SeqScan
 EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
@@ -5343,8 +5346,8 @@ SET max_parallel_workers_per_gather = 1;
 SET debug_parallel_query = on;
 -- Partial paths should also be smart enough to employ limits
 EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
  Gather
    Workers Planned: 1
    Single Copy: true
@@ -5355,6 +5358,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                      ->  Memoize
                            Cache Key: p1_1.c
                            Cache Mode: logical
+                           Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
                            ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
                                  Index Cond: (c = p1_1.c)
                ->  Nested Loop
@@ -5362,6 +5366,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                      ->  Memoize
                            Cache Key: p1_2.c
                            Cache Mode: logical
+                           Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
                            ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
                                  Index Cond: (c = p1_2.c)
                ->  Nested Loop
@@ -5369,9 +5374,10 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                      ->  Memoize
                            Cache Key: p1_3.c
                            Cache Mode: logical
+                           Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
                            ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
                                  Index Cond: (c = p1_3.c)
-(26 rows)
+(29 rows)
 
 RESET debug_parallel_query;
 -- Remove indexes from the partitioned table and its partitions
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index 40d8056fcea..f5575c7d332 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1316,8 +1316,8 @@ select sum(o.four), sum(ss.a) from
     select * from x
   ) ss
 where o.ten = 1;
-                       QUERY PLAN                        
----------------------------------------------------------
+                               QUERY PLAN                               
+------------------------------------------------------------------------
  Aggregate
    ->  Nested Loop
          ->  Seq Scan on onek o
@@ -1325,13 +1325,14 @@ where o.ten = 1;
          ->  Memoize
                Cache Key: o.four
                Cache Mode: binary
+               Estimated Capacity: 4  Estimated Distinct Lookup Keys: 4
                ->  CTE Scan on x
                      CTE x
                        ->  Recursive Union
                              ->  Result
                              ->  WorkTable Scan on x x_1
                                    Filter: (a < 10)
-(13 rows)
+(14 rows)
 
 select sum(o.four), sum(ss.a) from
   onek o cross join lateral (
@@ -2642,6 +2643,7 @@ ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                ->  Memoize
                      Cache Key: b.hundred, b.odd
                      Cache Mode: binary
+                     Estimated Capacity: 1000  Estimated Distinct Lookup Keys: 1000
                      ->  Subquery Scan on "ANY_subquery"
                            Filter: (b.hundred = "ANY_subquery".min)
                            ->  Result
@@ -2650,7 +2652,7 @@ ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                                          ->  Index Scan using tenk2_hundred on tenk2 c
                                                Index Cond: (hundred IS NOT NULL)
                                                Filter: (odd = b.odd)
-(16 rows)
+(17 rows)
 
 --
 -- Test VALUES to ARRAY (VtA) transformation
-- 
2.34.1

Attachment: v7-0002-Add-Estimated-Hit-Ratio-for-Memoize-plan-nodes-in.patch (text/x-patch)
From b2dc58e271075909d5136b09259f2f6b83d4d112 Mon Sep 17 00:00:00 2001
From: Evdokimov Ilia <ilya.evdokimov@tantorlabs.com>
Date: Tue, 15 Apr 2025 01:53:27 +0300
Subject: [PATCH v7 2/2] Add Estimated Hit Ratio for Memoize plan nodes in
 EXPLAIN

---
 src/backend/commands/explain.c               |  4 +-
 src/backend/optimizer/path/costsize.c        |  3 ++
 src/backend/optimizer/plan/createplan.c      |  7 +--
 src/backend/optimizer/util/pathnode.c        |  5 +++
 src/include/nodes/pathnodes.h                |  1 +
 src/include/nodes/plannodes.h                |  3 ++
 src/test/regress/expected/create_index.out   |  3 +-
 src/test/regress/expected/join.out           | 28 ++++++++----
 src/test/regress/expected/memoize.out        | 46 +++++++++++++-------
 src/test/regress/expected/partition_join.out | 10 ++++-
 src/test/regress/expected/subselect.out      |  6 ++-
 11 files changed, 83 insertions(+), 33 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0a4ce5ee7d7..d8526b7d06d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3634,11 +3634,13 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 		appendStringInfo(es->str, "Estimated Capacity: %u  Estimated Distinct Lookup Keys: %0.0f\n",
 						((Memoize *) plan)->est_entries,
 						((Memoize *) plan)->est_unique_keys);
+		ExplainIndentText(es);
+		appendStringInfo(es->str, "Estimated Hit Ratio: %0.2f\n", ((Memoize *) plan)->hit_ratio);
 	}
 	else
 	{
 		ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
 		ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+		ExplainPropertyFloat("Estimated Hit Ratio", "", ((Memoize *) plan)->hit_ratio, 2, es);
 	}
 
 	pfree(keystr.data);
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index f72319d903c..3e99214501b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2624,6 +2624,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	hit_ratio = ((calls - ndistinct) / calls) *
 		(est_cache_entries / Max(ndistinct, est_cache_entries));
 
+	/* Remember cache hit ratio for a potential EXPLAIN later */
+	mpath->hit_ratio = hit_ratio;
+
 	Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
 
 	/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a1456c9014d..ccb880158fe 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -285,7 +285,7 @@ static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
 							uint32 est_entries, Bitmapset *keyparamids,
-							double est_unique_keys);
+							double est_unique_keys, double hit_ratio);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1705,7 +1705,7 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
 						best_path->est_entries, keyparamids,
-						best_path->est_unique_keys);
+						best_path->est_unique_keys, best_path->hit_ratio);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6639,7 +6639,7 @@ static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
 			uint32 est_entries, Bitmapset *keyparamids,
-			double est_unique_keys)
+			double est_unique_keys, double hit_ratio)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6658,6 +6658,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
 	node->est_unique_keys = est_unique_keys;
+	node->hit_ratio = hit_ratio;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1fbcda99067..5971f67b77a 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1704,6 +1704,11 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	/* Estimated number of distinct memoization keys, computed using estimate_num_groups() */
 	pathnode->est_unique_keys = 0;
 
+	/*
+	 * The estimated cache hit ratio will be calculated later by cost_memoize_rescan()
+	 */
+	pathnode->hit_ratio = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 07d97dc0b5b..9bfc35cd4f6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2140,6 +2140,7 @@ typedef struct MemoizePath
 								 * if unknown */
 	double		est_unique_keys;	/* Estimated number of distinct memoization keys,
 								 * used for cache size evaluation. Kept for EXPLAIN */
+	double		hit_ratio;		/* Estimated cache hit ratio, kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 3d9d3a1159d..a7d350ce9ca 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1069,6 +1069,9 @@ typedef struct Memoize
 	 * used for cache size evaluation. Kept for EXPLAIN
 	 */
 	double		est_unique_keys;
+
+	/* Estimated cache hit ratio, kept for EXPLAIN */
+	double		hit_ratio;
 } Memoize;
 
 /* ----------------
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 826018a8a9f..0fa4359e23c 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2246,10 +2246,11 @@ SELECT count(*) FROM tenk1 LEFT JOIN tenk2 ON
                Cache Key: tenk1.hundred
                Cache Mode: logical
                Estimated Capacity: 100  Estimated Distinct Lookup Keys: 100
+               Estimated Hit Ratio: 0.99
                ->  Index Scan using tenk2_hundred on tenk2
                      Index Cond: (hundred = tenk1.hundred)
                      Filter: ((thousand = 42) OR (thousand = 41) OR (tenthous = 2))
-(11 rows)
+(12 rows)
 
 --
 -- Check behavior with duplicate index column contents
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index eca6ce2c6f0..212b7897919 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -2741,9 +2741,10 @@ select * from onek t1
                      Cache Key: t2.two
                      Cache Mode: binary
                      Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+                     Estimated Hit Ratio: 1.00
                      ->  Seq Scan on onek t3
                            Filter: (two = t2.two)
-(12 rows)
+(13 rows)
 
 --
 -- check a case where we formerly got confused by conflicting sort orders
@@ -5148,11 +5149,12 @@ where t1.f1 = ss.f1;
          Cache Key: i8.q1
          Cache Mode: binary
          Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
+         Estimated Hit Ratio: 0.50
          ->  Limit
                Output: (i8.q1), t2.f1
                ->  Seq Scan on public.text_tbl t2
                      Output: i8.q1, t2.f1
-(21 rows)
+(22 rows)
 
 select * from
   text_tbl t1
@@ -5194,6 +5196,7 @@ where t1.f1 = ss2.f1;
                Cache Key: i8.q1
                Cache Mode: binary
                Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
+               Estimated Hit Ratio: 0.50
                ->  Limit
                      Output: (i8.q1), t2.f1
                      ->  Seq Scan on public.text_tbl t2
@@ -5203,11 +5206,12 @@ where t1.f1 = ss2.f1;
          Cache Key: (i8.q1), t2.f1
          Cache Mode: binary
          Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
+         Estimated Hit Ratio: 0.50
          ->  Limit
                Output: ((i8.q1)), (t2.f1)
                ->  Seq Scan on public.text_tbl t3
                      Output: (i8.q1), t2.f1
-(32 rows)
+(34 rows)
 
 select * from
   text_tbl t1
@@ -5257,6 +5261,7 @@ where tt1.f1 = ss1.c0;
          Cache Key: tt4.f1
          Cache Mode: binary
          Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
+         Estimated Hit Ratio: 0.50
          ->  Subquery Scan on ss1
                Output: ss1.c0
                Filter: (ss1.c0 = 'foo'::text)
@@ -5264,7 +5269,7 @@ where tt1.f1 = ss1.c0;
                      Output: (tt4.f1)
                      ->  Seq Scan on public.text_tbl tt5
                            Output: tt4.f1
-(33 rows)
+(34 rows)
 
 select 1 from
   text_tbl as tt1
@@ -6475,10 +6480,11 @@ select * from sj t1
          Cache Key: t1.a, t1.b
          Cache Mode: binary
          Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+         Estimated Hit Ratio: 0.33
          ->  Sample Scan on sj
                Sampling: system (t1.b)
                Filter: (t1.a = a)
-(9 rows)
+(10 rows)
 
 -- Ensure that SJE does not form a self-referential lateral dependency
 explain (costs off)
@@ -7629,8 +7635,9 @@ explain (costs off)
                Cache Key: a.two
                Cache Mode: binary
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 1.00
                ->  Function Scan on generate_series g
-(8 rows)
+(9 rows)
 
 explain (costs off)
   select count(*) from tenk1 a cross join lateral generate_series(1,two) g;
@@ -7643,8 +7650,9 @@ explain (costs off)
                Cache Key: a.two
                Cache Mode: binary
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 1.00
                ->  Function Scan on generate_series g
-(8 rows)
+(9 rows)
 
 -- don't need the explicit LATERAL keyword for functions
 explain (costs off)
@@ -7658,8 +7666,9 @@ explain (costs off)
                Cache Key: a.two
                Cache Mode: binary
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 1.00
                ->  Function Scan on generate_series g
-(8 rows)
+(9 rows)
 
 -- lateral with UNION ALL subselect
 explain (costs off)
@@ -7722,9 +7731,10 @@ explain (costs off)
                Cache Key: "*VALUES*".column1
                Cache Mode: logical
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 1.00
                ->  Index Only Scan using tenk1_unique2 on tenk1 b
                      Index Cond: (unique2 = "*VALUES*".column1)
-(11 rows)
+(12 rows)
 
 select count(*) from tenk1 a,
   tenk1 b join lateral (values(a.unique1),(-1)) ss(x) on b.unique2 = ss.x;
diff --git a/src/test/regress/expected/memoize.out b/src/test/regress/expected/memoize.out
index 39c76aaa1a4..1e1b3419991 100644
--- a/src/test/regress/expected/memoize.out
+++ b/src/test/regress/expected/memoize.out
@@ -47,12 +47,13 @@ WHERE t2.unique1 < 1000;', false);
                Cache Key: t2.twenty
                Cache Mode: logical
                Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
+               Estimated Hit Ratio: 0.98
                Hits: 980  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t1 (actual rows=1.00 loops=N)
                      Index Cond: (unique1 = t2.twenty)
                      Heap Fetches: N
                      Index Searches: N
-(14 rows)
+(15 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1
@@ -80,12 +81,13 @@ WHERE t1.unique1 < 1000;', false);
                Cache Key: t1.twenty
                Cache Mode: binary
                Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
+               Estimated Hit Ratio: 0.98
                Hits: 980  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t2 (actual rows=1.00 loops=N)
                      Index Cond: (unique1 = t1.twenty)
                      Heap Fetches: N
                      Index Searches: N
-(14 rows)
+(15 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*),AVG(t2.unique1) FROM tenk1 t1,
@@ -116,6 +118,7 @@ WHERE t1.unique1 < 10;', false);
                Cache Key: t1.two
                Cache Mode: binary
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 0.80
                Hits: 8  Misses: 2  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Subquery Scan on t2 (actual rows=2.00 loops=N)
                      Filter: (t1.two = t2.two)
@@ -123,7 +126,7 @@ WHERE t1.unique1 < 10;', false);
                      ->  Index Scan using tenk1_unique1 on tenk1 t2_1 (actual rows=4.00 loops=N)
                            Index Cond: (unique1 < 4)
                            Index Searches: N
-(16 rows)
+(17 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*),AVG(t2.t1two) FROM tenk1 t1 LEFT JOIN
@@ -153,13 +156,14 @@ WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false);
                Cache Key: (t1.two + 1)
                Cache Mode: binary
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 1.00
                Hits: 998  Misses: 2  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t2 (actual rows=1.00 loops=N)
                      Filter: ((t1.two + 1) = unique1)
                      Rows Removed by Filter: 9999
                      Heap Fetches: N
                      Index Searches: N
-(15 rows)
+(16 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN
@@ -187,11 +191,12 @@ WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false);
                Cache Key: t1.two, t1.twenty
                Cache Mode: binary
                Estimated Capacity: 40  Estimated Distinct Lookup Keys: 40
+               Estimated Hit Ratio: 0.96
                Hits: 980  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Seq Scan on tenk1 t2 (actual rows=1.00 loops=N)
                      Filter: ((t1.twenty = unique1) AND (t1.two = two))
                      Rows Removed by Filter: 9999
-(13 rows)
+(14 rows)
 
 -- And check we get the expected results.
 SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN
@@ -226,13 +231,14 @@ ON t1.x = t2.t::numeric AND t1.t::numeric = t2.x;', false);
          Cache Key: t1.x, (t1.t)::numeric
          Cache Mode: logical
          Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
+         Estimated Hit Ratio: 0.50
          Hits: 20  Misses: 20  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Only Scan using expr_key_idx_x_t on expr_key t2 (actual rows=2.00 loops=N)
                Index Cond: (x = (t1.t)::numeric)
                Filter: (t1.x = (t)::numeric)
                Heap Fetches: N
                Index Searches: N
-(12 rows)
+(13 rows)
 
 DROP TABLE expr_key;
 -- Reduce work_mem and hash_mem_multiplier so that we see some cache evictions
@@ -256,12 +262,13 @@ WHERE t2.unique1 < 1200;', true);
                Cache Key: t2.thousand
                Cache Mode: logical
                Estimated Capacity: 655  Estimated Distinct Lookup Keys: 721
+               Estimated Hit Ratio: 0.36
                Hits: N  Misses: N  Evictions: N  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using tenk1_unique1 on tenk1 t1 (actual rows=1.00 loops=N)
                      Index Cond: (unique1 = t2.thousand)
                      Heap Fetches: N
                      Index Searches: N
-(14 rows)
+(15 rows)
 
 CREATE TABLE flt (f float);
 CREATE INDEX flt_f_idx ON flt (f);
@@ -281,12 +288,13 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f = f2.f;', false);
          Cache Key: f1.f
          Cache Mode: logical
          Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
+         Estimated Hit Ratio: 0.50
          Hits: 1  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Only Scan using flt_f_idx on flt f2 (actual rows=2.00 loops=N)
                Index Cond: (f = f1.f)
                Heap Fetches: N
                Index Searches: N
-(13 rows)
+(14 rows)
 
 -- Ensure memoize operates in binary mode
 SELECT explain_memoize('
@@ -301,12 +309,13 @@ SELECT * FROM flt f1 INNER JOIN flt f2 ON f1.f >= f2.f;', false);
          Cache Key: f1.f
          Cache Mode: binary
          Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
+         Estimated Hit Ratio: 0.50
          Hits: 0  Misses: 2  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Only Scan using flt_f_idx on flt f2 (actual rows=2.00 loops=N)
                Index Cond: (f <= f1.f)
                Heap Fetches: N
                Index Searches: N
-(13 rows)
+(14 rows)
 
 DROP TABLE flt;
 -- Exercise Memoize in binary mode with a large fixed width type and a
@@ -330,11 +339,12 @@ SELECT * FROM strtest s1 INNER JOIN strtest s2 ON s1.n >= s2.n;', false);
          Cache Key: s1.n
          Cache Mode: binary
          Estimated Capacity: 3  Estimated Distinct Lookup Keys: 3
+         Estimated Hit Ratio: 0.50
          Hits: 3  Misses: 3  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Scan using strtest_n_idx on strtest s2 (actual rows=4.00 loops=N)
                Index Cond: (n <= s1.n)
                Index Searches: N
-(11 rows)
+(12 rows)
 
 -- Ensure we get 3 hits and 3 misses
 SELECT explain_memoize('
@@ -348,11 +358,12 @@ SELECT * FROM strtest s1 INNER JOIN strtest s2 ON s1.t >= s2.t;', false);
          Cache Key: s1.t
          Cache Mode: binary
          Estimated Capacity: 4  Estimated Distinct Lookup Keys: 4
+         Estimated Hit Ratio: 0.33
          Hits: 3  Misses: 3  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Index Scan using strtest_t_idx on strtest s2 (actual rows=4.00 loops=N)
                Index Cond: (t <= s1.t)
                Index Searches: N
-(11 rows)
+(12 rows)
 
 DROP TABLE strtest;
 -- Ensure memoize works with partitionwise join
@@ -378,6 +389,7 @@ SELECT * FROM prt t1 INNER JOIN prt t2 ON t1.a = t2.a;', false);
                Cache Key: t1_1.a
                Cache Mode: logical
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 0.50
                Hits: 3  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using iprt_p1_a on prt_p1 t2_1 (actual rows=4.00 loops=N)
                      Index Cond: (a = t1_1.a)
@@ -391,12 +403,13 @@ SELECT * FROM prt t1 INNER JOIN prt t2 ON t1.a = t2.a;', false);
                Cache Key: t1_2.a
                Cache Mode: logical
                Estimated Capacity: 2  Estimated Distinct Lookup Keys: 2
+               Estimated Hit Ratio: 0.50
                Hits: 3  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
                ->  Index Only Scan using iprt_p2_a on prt_p2 t2_2 (actual rows=4.00 loops=N)
                      Index Cond: (a = t1_2.a)
                      Heap Fetches: N
                      Index Searches: N
-(27 rows)
+(29 rows)
 
 -- Ensure memoize works with parameterized union-all Append path
 SET enable_partitionwise_join TO off;
@@ -414,6 +427,7 @@ ON t1.a = t2.a;', false);
          Cache Key: t1.a
          Cache Mode: logical
          Estimated Capacity: 1  Estimated Distinct Lookup Keys: 1
+         Estimated Hit Ratio: 0.75
          Hits: 3  Misses: 1  Evictions: Zero  Overflows: 0  Memory Usage: NkB
          ->  Append (actual rows=4.00 loops=N)
                ->  Index Only Scan using iprt_p1_a on prt_p1 (actual rows=4.00 loops=N)
@@ -424,7 +438,7 @@ ON t1.a = t2.a;', false);
                      Index Cond: (a = t1.a)
                      Heap Fetches: N
                      Index Searches: N
-(18 rows)
+(19 rows)
 
 DROP TABLE prt;
 RESET enable_partitionwise_join;
@@ -451,10 +465,11 @@ WHERE unique1 < 3
                  Cache Key: t2.hundred
                  Cache Mode: logical
                  Estimated Capacity: 100  Estimated Distinct Lookup Keys: 100
+                 Estimated Hit Ratio: 0.99
                  ->  Index Scan using tenk1_unique1 on tenk1 t1
                        Index Cond: (unique1 = t2.hundred)
                        Filter: (t0.ten = twenty)
-(14 rows)
+(15 rows)
 
 -- Ensure the above query returns the correct result
 SELECT unique1 FROM tenk1 t0
@@ -499,9 +514,10 @@ WHERE t1.unique1 < 1000;
                            Cache Key: t1.twenty
                            Cache Mode: logical
                            Estimated Capacity: 20  Estimated Distinct Lookup Keys: 20
+                           Estimated Hit Ratio: 0.95
                            ->  Index Only Scan using tenk1_unique1 on tenk1 t2
                                  Index Cond: (unique1 = t1.twenty)
-(15 rows)
+(16 rows)
 
 -- And ensure the parallel plan gives us the correct results.
 SELECT COUNT(*),AVG(t2.unique1) FROM tenk1 t1,
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index d3c37ffe22b..15a6d94d7c0 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -5299,6 +5299,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                      Cache Key: p1_1.c
                      Cache Mode: logical
                      Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
+                     Estimated Hit Ratio: 0.76
                      ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
                            Index Cond: (c = p1_1.c)
          ->  Nested Loop
@@ -5307,6 +5308,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                      Cache Key: p1_2.c
                      Cache Mode: logical
                      Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
+                     Estimated Hit Ratio: 0.90
                      ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
                            Index Cond: (c = p1_2.c)
          ->  Nested Loop
@@ -5315,9 +5317,10 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                      Cache Key: p1_3.c
                      Cache Mode: logical
                      Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
+                     Estimated Hit Ratio: 0.90
                      ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
                            Index Cond: (c = p1_3.c)
-(26 rows)
+(29 rows)
 
 -- If almost all the data should be fetched - prefer SeqScan
 EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 1000;
@@ -5359,6 +5362,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                            Cache Key: p1_1.c
                            Cache Mode: logical
                            Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
+                           Estimated Hit Ratio: 0.76
                            ->  Index Scan using pht1_p1_c_idx on pht1_p1 p2_1
                                  Index Cond: (c = p1_1.c)
                ->  Nested Loop
@@ -5367,6 +5371,7 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                            Cache Key: p1_2.c
                            Cache Mode: logical
                            Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
+                           Estimated Hit Ratio: 0.90
                            ->  Index Scan using pht1_p2_c_idx on pht1_p2 p2_2
                                  Index Cond: (c = p1_2.c)
                ->  Nested Loop
@@ -5375,9 +5380,10 @@ EXPLAIN (COSTS OFF) SELECT * FROM pht1 p1 JOIN pht1 p2 USING (c) LIMIT 100;
                            Cache Key: p1_3.c
                            Cache Mode: logical
                            Estimated Capacity: 12  Estimated Distinct Lookup Keys: 12
+                           Estimated Hit Ratio: 0.90
                            ->  Index Scan using pht1_p3_c_idx on pht1_p3 p2_3
                                  Index Cond: (c = p1_3.c)
-(29 rows)
+(32 rows)
 
 RESET debug_parallel_query;
 -- Remove indexes from the partitioned table and its partitions
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index f5575c7d332..81e3e07d968 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -1326,13 +1326,14 @@ where o.ten = 1;
                Cache Key: o.four
                Cache Mode: binary
                Estimated Capacity: 4  Estimated Distinct Lookup Keys: 4
+               Estimated Hit Ratio: 0.96
                ->  CTE Scan on x
                      CTE x
                        ->  Recursive Union
                              ->  Result
                              ->  WorkTable Scan on x x_1
                                    Filter: (a < 10)
-(14 rows)
+(15 rows)
 
 select sum(o.four), sum(ss.a) from
   onek o cross join lateral (
@@ -2644,6 +2645,7 @@ ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                      Cache Key: b.hundred, b.odd
                      Cache Mode: binary
                      Estimated Capacity: 1000  Estimated Distinct Lookup Keys: 1000
+                     Estimated Hit Ratio: 0.90
                      ->  Subquery Scan on "ANY_subquery"
                            Filter: (b.hundred = "ANY_subquery".min)
                            ->  Result
@@ -2652,7 +2654,7 @@ ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
                                          ->  Index Scan using tenk2_hundred on tenk2 c
                                                Index Cond: (hundred IS NOT NULL)
                                                Filter: (odd = b.odd)
-(17 rows)
+(18 rows)
 
 --
 -- Test VALUES to ARRAY (VtA) transformation
-- 
2.34.1

#36David Rowley
dgrowleyml@gmail.com
In reply to: Ilia Evdokimov (#35)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, 15 Apr 2025 at 11:14, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

On 14.04.2025 23:53, David Rowley wrote:
If we can't get consensus for everything people want to add at once
then maybe the patch could be broken into two, with 0001 being pretty
much the v4 patch and then have 0002 add the Estimated Hit Ratio.
Having the physical patches and being able to try it out or view the
regression test changes might lower the bar on people chipping in with
their views.

Agreed - I’ll split the patch into two: one adds Estimated Capacity and Estimated Distinct Keys, and the second adds Estimated Hit Ratio. I’ll also add regression test differences to make it easier to evaluate the usefulness of each part.

Thanks for the patches.

I'm just looking for ways to allow us to group all three of these
together for the TEXT format type. I see the BUFFERS are displayed
quite compactly, e.g. "Buffers: shared hit=17 read=4".
show_wal_usage() does something similar and so does
show_modifytable_info() for MERGE.

Maybe we could compress the text output a bit with:

"Estimates: capacity=N distinct keys=N hit ratio=N.N%"

If we get it that compact, maybe lookups could fit in there too, i.e.:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

I'd also like to vote that you modify explain.c and multiply the
hit_ratio by 100 and use "%" as the unit in ExplainPropertyFloat().
Just looking at the raw number of "1.00" in the expected output, it
isn't obvious if the planner expects every lookup to be a cache hit or
just 1% of them.

Also, to get something commitable, you'll also need to modify
explain_memoize() at the top of memoize.sql and add handling to mask
out the value of these new properties. These are not going to be
anywhere near stable enough across platforms to have these shown in
the expected output files.

David

#37Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: David Rowley (#36)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 15.04.2025 03:23, David Rowley wrote:

I'm just looking for ways to allow us to group all three of these
together for the TEXT format type. I see the BUFFERS are displayed
quite compactly, e.g. "Buffers: shared hit=17 read=4".
show_wal_usage() does something similar and so does
show_modifytable_info() for MERGE.

Maybe we could compress the text output a bit with:

"Estimates: capacity=N distinct keys=N hit ratio=N.N%"

If we get it that compact, maybe lookups could fit in there too, i.e.:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

I'd also like to vote that you modify explain.c and multiply the
hit_ratio by 100 and use "%" as the unit in ExplainPropertyFloat().
Just looking at the raw number of "1.00" in the expected output, it
isn't obvious if the planner expects every lookup to be a cache hit or
just 1% of them.

+1. I'll attach three patches with 'capacity=N distinct keys=N' output,
'lookups=N' and 'ratio=N.N%'

Also, to get something commitable, you'll also need to modify
explain_memoize() at the top of memoize.sql and add handling to mask
out the value of these new properties. These are not going to be
anywhere near stable enough across platforms to have these shown in
the expected output files.

David

There are other regression tests - such as create_index, subselect,
join, and partition_join - that also use Memoize nodes. These tests
currently don't have any masking in place, so the new output fields may
cause instability across platforms.

Wrapping the line in costs or verbose would help with test stability,
but doing that only to simplify test output doesn't look like the right
reason. Maybe it makes more sense to mask these values in other tests
that use Memoize too. What do you think?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#38Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#36)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Mon, Apr 14, 2025 at 8:23 PM David Rowley <dgrowleyml@gmail.com> wrote:

Maybe we could compress the text output a bit with:

"Estimates: capacity=N distinct keys=N hit ratio=N.N%"

Sure, that works for me.

If we get it that compact, maybe lookups could fit in there too, i.e.:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

Is lookups=N here the estimated number of lookups i.e. what we think
nloops will end up being?

--
Robert Haas
EDB: http://www.enterprisedb.com

#39David Rowley
dgrowleyml@gmail.com
In reply to: Ilia Evdokimov (#37)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, 15 Apr 2025 at 21:44, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

Wrapping the line in costs or verbose would help with test stability, but doing that only to simplify test output doesn't look like the right reason. Maybe it makes more sense to mask these values in other tests that use Memoize too. What do you think?

Disregard what I said about the explain_memoize() function. The new
output just shouldn't be shown with EXPLAIN (costs off).

David

#40David Rowley
dgrowleyml@gmail.com
In reply to: Robert Haas (#38)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Wed, 16 Apr 2025 at 04:25, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 14, 2025 at 8:23 PM David Rowley <dgrowleyml@gmail.com> wrote:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

Is lookups=N here the estimated number of lookups i.e. what we think
nloops will end up being?

Yes. The estimate is the "calls" variable in cost_memoize_rescan(),
which is fairly critical in the hit ratio estimate calculation.

Technically this is just the Nested Loop's outer_path->rows. There was
an argument earlier in the thread for putting this in along with the
other fields to make things easier to read. I did argue that it was
redundant due to the fact that the reader can look at the row estimate
for the outer side of the Nested Loop, but maybe it's small enough to go
in using the above format.
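
For reference, the hit-ratio calculation being discussed can be sketched as below. This is a minimal Python illustration of the formula in cost_memoize_rescan(); the input values are made up, and the real code works with planner estimates rather than exact counts.

```python
def est_hit_ratio(calls, ndistinct, est_cache_entries):
    """Sketch of the Memoize hit-ratio estimate: the fraction of lookups
    that repeat an earlier key, scaled by the share of distinct keys
    expected to fit in the cache."""
    return ((calls - ndistinct) / calls) * \
        (est_cache_entries / max(ndistinct, est_cache_entries))

# All keys fit: 100 lookups over 10 distinct keys -> 0.9 expected hit ratio.
print(est_hit_ratio(calls=100, ndistinct=10, est_cache_entries=10))
# Only half the keys fit in the cache, so the expected ratio drops to 0.45.
print(est_hit_ratio(calls=100, ndistinct=10, est_cache_entries=5))
```

Note how "calls" (the outer side's row estimate) appears in the first factor: it is what turns the distinct-key estimate into a repeat-lookup fraction.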

David

#41Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: David Rowley (#39)
3 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

I've prepared the updated patches as discussed, including the addition
of estimated lookups in the EXPLAIN output for Memoize. Please let me
know if you have any feedback or further suggestions.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v8-0001-Show-ndistinct-and-est_entries-in-EXPLAIN-for-Mem.patch (text/x-patch)
From 021cc6f5629a607cab3a0ec1021fc5f102dc0439 Mon Sep 17 00:00:00 2001
From: Evdokimov Ilia <ilya.evdokimov@tantorlabs.com>
Date: Tue, 15 Apr 2025 23:47:57 +0300
Subject: [PATCH v8 1/3] Show ndistinct and est_entries in EXPLAIN for Memoize

Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
---
 src/backend/commands/explain.c          | 16 ++++++++++++++++
 src/backend/optimizer/path/costsize.c   |  3 +++
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/backend/optimizer/util/pathnode.c   |  6 ++++++
 src/include/nodes/pathnodes.h           |  2 ++
 src/include/nodes/plannodes.h           |  6 ++++++
 6 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 786ee865f14..0634d0a982e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3628,6 +3628,22 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	ExplainPropertyText("Cache Key", keystr.data, es);
 	ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
 
+	if (es->costs)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f\n",
+							((Memoize *) plan)->est_entries,
+							((Memoize *) plan)->est_unique_keys);
+		}
+		else
+		{
+			ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+		}
+	}
+
 	pfree(keystr.data);
 
 	if (!es->analyze)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 60b0fcfb6be..f72319d903c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a8f22a8c154..a1456c9014d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							uint32 est_entries, Bitmapset *keyparamids,
+							double est_unique_keys);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->est_unique_keys);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, Bitmapset *keyparamids,
+			double est_unique_keys)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..76bdea52127 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,12 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * computed using estimate_num_groups()
+	 */
+	pathnode->est_unique_keys = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index bb678bdcdcd..07d97dc0b5b 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2138,6 +2138,8 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;	/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 658d76225e4..3d9d3a1159d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1063,6 +1063,12 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

v8-0002-Add-Estimated-Hit-Ratio-for-Memoize-plan-nodes-in.patch (text/x-patch)
From 18052d2d3ac54e12042673e35ca357ae11f99093 Mon Sep 17 00:00:00 2001
From: Evdokimov Ilia <ilya.evdokimov@tantorlabs.com>
Date: Wed, 16 Apr 2025 00:20:49 +0300
Subject: [PATCH v8 2/3] Add Estimated Hit Ratio for Memoize plan nodes in
 EXPLAIN

Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
---
 src/backend/commands/explain.c          | 6 ++++--
 src/backend/optimizer/path/costsize.c   | 3 +++
 src/backend/optimizer/plan/createplan.c | 7 ++++---
 src/backend/optimizer/util/pathnode.c   | 6 ++++++
 src/include/nodes/pathnodes.h           | 1 +
 src/include/nodes/plannodes.h           | 3 +++
 6 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0634d0a982e..85b22561aa6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3633,14 +3633,16 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainIndentText(es);
-			appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f\n",
+			appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f hit ratio=%.2f%%\n",
 							((Memoize *) plan)->est_entries,
-							((Memoize *) plan)->est_unique_keys);
+							((Memoize *) plan)->est_unique_keys,
+							((Memoize *) plan)->hit_ratio * 100.0);
 		}
 		else
 		{
 			ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
 			ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+			ExplainPropertyFloat("Estimated Hit Ratio", "", ((Memoize *) plan)->hit_ratio * 100.0, 2, es);
 		}
 	}
 
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index f72319d903c..3e99214501b 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2624,6 +2624,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	hit_ratio = ((calls - ndistinct) / calls) *
 		(est_cache_entries / Max(ndistinct, est_cache_entries));
 
+	/* Remember cache hit ratio for a potential EXPLAIN later */
+	mpath->hit_ratio = hit_ratio;
+
 	Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
 
 	/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a1456c9014d..ccb880158fe 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -285,7 +285,7 @@ static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
 							uint32 est_entries, Bitmapset *keyparamids,
-							double est_unique_keys);
+							double est_unique_keys, double hit_ratio);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1705,7 +1705,7 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
 						best_path->est_entries, keyparamids,
-						best_path->est_unique_keys);
+						best_path->est_unique_keys, best_path->hit_ratio);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6639,7 +6639,7 @@ static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
 			uint32 est_entries, Bitmapset *keyparamids,
-			double est_unique_keys)
+			double est_unique_keys, double hit_ratio)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6658,6 +6658,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
 	node->est_unique_keys = est_unique_keys;
+	node->hit_ratio = hit_ratio;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 76bdea52127..40674027f9f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1707,6 +1707,12 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	 */
 	pathnode->est_unique_keys = 0;
 
+	/*
+	 * The estimated cache hit ratio will be calculated later
+	 * by cost_memoize_rescan().
+	 */
+	pathnode->hit_ratio = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 07d97dc0b5b..e17da6f8f02 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2140,6 +2140,7 @@ typedef struct MemoizePath
 								 * if unknown */
 	double		est_unique_keys;	/* Estimated number of distinct memoization keys,
 								 * used for cache size evaluation. Kept for EXPLAIN */
+	double		hit_ratio;		/* Estimated cache hit ratio. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 3d9d3a1159d..4354d4f66d3 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1069,6 +1069,9 @@ typedef struct Memoize
 	 * used for cache size evaluation. Kept for EXPLAIN
 	 */
 	double		est_unique_keys;
+
+	/* Estimated cache hit ratio. Kept for EXPLAIN */
+	double		hit_ratio;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

v8-0003-Show-estimated-lookups-in-EXPLAIN-ouput-for-Memoi.patch (text/x-patch)
From bbce295626e45d899e93a572d38074c40e8c9335 Mon Sep 17 00:00:00 2001
From: Evdokimov Ilia <ilya.evdokimov@tantorlabs.com>
Date: Wed, 16 Apr 2025 00:53:50 +0300
Subject: [PATCH v8 3/3] Show estimated lookups in EXPLAIN output for Memoize

Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
---
 src/backend/commands/explain.c          |  4 +++-
 src/backend/optimizer/plan/createplan.c | 10 +++++++---
 src/include/nodes/plannodes.h           |  3 +++
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 85b22561aa6..6fcfa53a166 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3633,15 +3633,17 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainIndentText(es);
-			appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f hit ratio=%.2f%%\n",
+			appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f lookups=%.0f hit ratio=%.2f%%\n",
 							((Memoize *) plan)->est_entries,
 							((Memoize *) plan)->est_unique_keys,
+							((Memoize *) plan)->lookups,
 							((Memoize *) plan)->hit_ratio * 100.0);
 		}
 		else
 		{
 			ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es);
 			ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es);
+			ExplainPropertyFloat("Estimated Lookups", "", ((Memoize *) plan)->lookups, 0, es);
 			ExplainPropertyFloat("Estimated Hit Ratio", "", ((Memoize *) plan)->hit_ratio * 100.0, 2, es);
 		}
 	}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ccb880158fe..47a092747d5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -285,7 +285,8 @@ static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
 							uint32 est_entries, Bitmapset *keyparamids,
-							double est_unique_keys, double hit_ratio);
+							double est_unique_keys, double hit_ratio,
+							double lookups);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1705,7 +1706,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
 						best_path->est_entries, keyparamids,
-						best_path->est_unique_keys, best_path->hit_ratio);
+						best_path->est_unique_keys, best_path->hit_ratio,
+						best_path->calls);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6639,7 +6641,8 @@ static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
 			uint32 est_entries, Bitmapset *keyparamids,
-			double est_unique_keys, double hit_ratio)
+			double est_unique_keys, double hit_ratio,
+			double lookups)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6659,6 +6662,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->keyparamids = keyparamids;
 	node->est_unique_keys = est_unique_keys;
 	node->hit_ratio = hit_ratio;
+	node->lookups = lookups;
 
 	return node;
 }
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4354d4f66d3..6eb08866304 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1072,6 +1072,9 @@ typedef struct Memoize
 
 	/* Estimated cache hit ratio. Kept for EXPLAIN */
 	double		hit_ratio;
+
+	/* Estimated number of lookups. Kept for EXPLAIN */
+	double		lookups;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#42Maciek Sakrejda
maciek@pganalyze.com
In reply to: David Rowley (#40)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Tue, Apr 15, 2025 at 1:50 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 16 Apr 2025 at 04:25, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 14, 2025 at 8:23 PM David Rowley <dgrowleyml@gmail.com> wrote:

"Estimates: capacity=N distinct keys=N lookups=N hit ratio=N.N%"

Is lookups=N here the estimated number of lookups i.e. what we think
nloops will end up being?

Yes. The estimate is the "calls" variable in cost_memoize_rescan(),
which is fairly critical in the hit ratio estimate calculation.

Technically this is just the Nested Loop's outer_path->rows. There was
an argument earlier in the thread for putting this in along with the
other fields to make things easier to read. I did argue that it was
redundant due to the fact that the reader can look at the row estimate
for the outer side of the Nest Loop, but maybe it's small enough to go
in using the above format.

This kind of thing is nice to have in the text format, but it's really
nice to have when working with structured formats. Currently, to get a
loops estimate for the inner node, I need to find the parent, find the
outer node in the parent's children, and get that sibling's Plan Rows
(I think). It'd be nice to make things like that easier.
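
That sibling lookup can be sketched as follows. This is a hedged illustration using a hand-written plan fragment (real EXPLAIN (FORMAT JSON) output carries many more fields); it relies on the outer child of a Nested Loop being listed first in the "Plans" array.

```python
# Workaround described above: derive a Memoize node's expected lookups from
# the outer sibling's "Plan Rows" under the parent Nested Loop node.
def memoize_expected_lookups(nested_loop):
    outer, inner = nested_loop["Plans"]  # outer child is listed first
    if inner.get("Node Type") != "Memoize":
        raise ValueError("inner child is not a Memoize node")
    return outer["Plan Rows"]

# Hand-written fragment mirroring the plan shape from the start of the thread.
plan = {
    "Node Type": "Nested Loop",
    "Plans": [
        {"Node Type": "Index Only Scan", "Plan Rows": 23553966},
        {"Node Type": "Memoize", "Plan Rows": 1},
    ],
}
print(memoize_expected_lookups(plan))  # 23553966
```

With a lookups estimate exposed directly on the Memoize node, this parent-and-sibling traversal would no longer be necessary.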

Thanks,
Maciek

#43David Rowley
dgrowleyml@gmail.com
In reply to: Ilia Evdokimov (#41)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Wed, 16 Apr 2025 at 10:09, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

I've prepared the updated patches as discussed, including the addition
of estimated lookups in the EXPLAIN output for Memoize. Please let me
know if you have any feedback or further suggestions.

While this is fresh, as I might forget before the July CF...

I know I did suggest that the hit_ratio should be a percent and it
should be multiplied by 100.0. This seems fine for the text format as
you can have the % unit there. However, looking at
ExplainPropertyFloat(), we don't print the unit for non-text formats.
I wonder if printing a number between 0 and 100 there will be
confusing. I don't believe we have anything else in EXPLAIN that shows
a percentage. I don't quite know what to do about this. One thought I
had was to only * 100 for text format, but that might be more
confusing.

Aside from that, I'd prefer the new fields in struct Memoize to be
prefixed with "est_" the same as the existing "est_entries" field. I'm
unsure why MemoizePath.calls becomes Memoize.lookups. Seems
unnecessary and just means more brain space being used to maintain a
mental model.

David

#44Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: David Rowley (#43)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

Sorry for the delayed reply.

On 18.04.2025 00:29, David Rowley wrote:

I know I did suggest that the hit_ratio should be a percent and it
should be multiplied by 100.0. This seems fine for the text format as
you can have the % unit there. However, looking at
ExplainPropertyFloat(), we don't print the unit for non-text formats.
I wonder if printing a number between 0 and 100 there will be
confusing. I don't believe we have anything else in EXPLAIN that shows
a percentage. I don't quite know what to do about this. One thought I
had was to only * 100 for text format, but that might be more
confusing.

I agree that displaying the percentage instead of the fraction is a
better choice, because it is much more convenient for analysis. If this
value were intended to be passed into some API for further computations,
it would make sense to keep it as a fraction to avoid unnecessary
multiplications and divisions by 100. However, such use cases seem
extremely unlikely.

Therefore, I think it is better to report percentages directly. Since
non-text EXPLAIN formats do not display units, I propose to rename the
field to "hit_percent" in all formats, including the text format. This
way, the meaning of the value remains clear without needing additional
context. What do you think about this approach?

Aside from that, I'd prefer the new fields in struct Memoize to be
prefixed with "est_" the same as the existing "est_entries" field. I'm
unsure why MemoizePath.calls becomes Memoize.lookups. Seems
unnecessary and just means more brain space being used to maintain a
mental model.

Agree.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#45Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Ilia Evdokimov (#44)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 27.04.2025 22:58, Ilia Evdokimov wrote:

Therefore, I think it is better to report percentages directly. Since
non-text EXPLAIN formats do not display units, I propose to rename the
field to "hit_percent" in all formats, including the text format. This
way, the meaning of the value remains clear without needing additional
context. What do you think about this approach?

I attached updated v9 patch with the suggested changes. The updated line
in the EXPLAIN looks like this:

Text format:
    Estimates: capacity=1 distinct keys=1 lookups=2 hit percent=50.00%

Non-text format:
    Estimated Capacity: 1
    Estimated Distinct Lookup Keys: 1
    Estimated Lookups: 2
    Estimated Hit Percent: 50.00

Any suggestions?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v9-0001-Expose-cache-hit-statistics-in-Memoize-node-EXPLAIN.patch (text/x-patch)
From eff0cb420c922343af9e09e58cec7b5be19c35e4 Mon Sep 17 00:00:00 2001
From: Evdokimov Ilia <ilya.evdokimov@tantorlabs.com>
Date: Thu, 1 May 2025 15:20:07 +0300
Subject: [PATCH v9] Expose cache hit statistics in Memoize node EXPLAIN
 output.

This patch adds additional information to the Memoize node's EXPLAIN output:
estimated cache capacity, number of distinct keys, total lookups, and
cache hit percentage.

Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
---
 src/backend/commands/explain.c          | 20 ++++++++++++++++++++
 src/backend/optimizer/path/costsize.c   |  6 ++++++
 src/backend/optimizer/plan/createplan.c | 15 ++++++++++++---
 src/backend/optimizer/util/pathnode.c   | 12 ++++++++++++
 src/include/nodes/pathnodes.h           |  3 +++
 src/include/nodes/plannodes.h           | 12 ++++++++++++
 6 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 786ee865f14..3a5b436b561 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3628,6 +3628,26 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	ExplainPropertyText("Cache Key", keystr.data, es);
 	ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
 
+	if (es->costs)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+		 	ExplainIndentText(es);
+		 	appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f lookups=%.0f hit percent=%.2f%%\n",
+		 					((Memoize *) plan)->est_entries,
+		 					((Memoize *) plan)->est_unique_keys,
+		 					((Memoize *) plan)->est_calls,
+		 					((Memoize *) plan)->est_hit_ratio * 100.0);
+		}
+		else
+		{
+			ExplainPropertyUInteger("Estimated Capacity", NULL, ((Memoize *) plan)->est_entries, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", NULL, ((Memoize *) plan)->est_unique_keys, 0, es);
+			ExplainPropertyFloat("Estimated Lookups", NULL, ((Memoize *) plan)->est_calls, 0, es);
+			ExplainPropertyFloat("Estimated Hit Percent", NULL, ((Memoize *) plan)->est_hit_ratio * 100.0, 2, es);
+		}
+	}
+
 	pfree(keystr.data);
 
 	if (!es->analyze)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 60b0fcfb6be..226a7861543 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
@@ -2621,6 +2624,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	hit_ratio = ((calls - ndistinct) / calls) *
 		(est_cache_entries / Max(ndistinct, est_cache_entries));
 
+	/* Remember cache hit ratio for a potential EXPLAIN later */
+	mpath->est_hit_ratio = hit_ratio;
+
 	Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
 
 	/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index a8f22a8c154..b3d3fac187f 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,9 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							uint32 est_entries, Bitmapset *keyparamids,
+							double est_unique_keys, double est_hit_ratio,
+							double est_calls);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1705,9 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->est_unique_keys, best_path->est_hit_ratio,
+						best_path->calls);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6636,7 +6640,9 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, Bitmapset *keyparamids,
+			double est_unique_keys, double est_hit_ratio,
+			double est_calls)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6654,6 +6660,9 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
+	node->est_hit_ratio = est_hit_ratio;
+	node->est_calls = est_calls;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 93e73cb44db..7e35421f410 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,18 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * computed using estimate_num_groups()
+	 */
+	pathnode->est_unique_keys = 0;
+
+	/*
+	 * The estimated cache hit ratio will be calculated later
+	 * by cost_memoize_rescan().
+	 */
+	pathnode->est_hit_ratio = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 011e5a811c3..938e4c43a81 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2138,6 +2138,9 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;	/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
+	double		est_hit_ratio;	/* Estimated cache hit ratio. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 658d76225e4..26a4c78afe8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1063,6 +1063,18 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
+
+	/* Estimated cache hit ratio. Kept for EXPLAIN */
+	double		est_hit_ratio;
+
+	/* Estimated number of rescans. Kept for EXPLAIN */
+	double		est_calls;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#46Andrei Lepikhov
lepihov@gmail.com
In reply to: Ilia Evdokimov (#45)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 5/1/25 14:22, Ilia Evdokimov wrote:

    Estimated Hit Percent: 50.00

Any suggestions?

I still think that two meaningful digits are enough for an EXPLAIN. We
usually need to estimate the tuple set's size, not a precise number of
tuples or groups. And definitely, not an arbitrarily chosen two digits
after the point. So, IMO, good examples may look like the following:
50%, 5.0%, 0.00051%.

--
regards, Andrei Lepikhov

#47Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Andrei Lepikhov (#46)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 01.05.2025 16:37, Andrei Lepikhov wrote:

I still think that two meaningful digits are enough for an EXPLAIN. We
usually need to estimate the tuple set's size, not a precise number of
tuples or groups. And definitely, not an arbitrarily chosen two digits
after the point. So, IMO, good examples may look like the following:
50%, 5.0%, 0.00051%.

The idea is not bad. But I think it might be better to raise this in a
new thread, because this is not the only place [0] where we might want
to output numbers like that. We could also discuss there what else might
benefit from such formatting.

[0]: /messages/by-id/3365160.1739385615@sss.pgh.pa.us

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#48Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Ilia Evdokimov (#45)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 01.05.2025 15:22, Ilia Evdokimov wrote:

I attached updated v9 patch with the suggested changes. The updated
line in the EXPLAIN looks like this:

Text format:
    Estimates: capacity=1 distinct keys=1 lookups=2 hit percent=50.00%

Non-text format:
    Estimated Capacity: 1
    Estimated Distinct Lookup Keys: 1
    Estimated Lookups: 2
    Estimated Hit Percent: 50.00

Any suggestions?

I attached rebased v10 patch on 5a6c39b.

--
Best Regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v10-0001-Expose-cache-hit-statistics-in-Memoize-node-EXPLAIN.patch (text/x-patch)
From 48cacc9b59eaa73c69a891e0cf6967f1a3f3c398 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Fri, 4 Jul 2025 11:24:56 +0300
Subject: [PATCH v10] Expose cache hit statistics in Memoize node EXPLAIN
 output.

This patch adds additional information to the Memoize node's EXPLAIN output:
estimated cache capacity, number of distinct keys, total lookups, and
cache hit percentage.
---
 src/backend/commands/explain.c          | 20 ++++++++++++++++++++
 src/backend/optimizer/path/costsize.c   |  6 ++++++
 src/backend/optimizer/plan/createplan.c | 15 ++++++++++++---
 src/backend/optimizer/util/pathnode.c   | 12 ++++++++++++
 src/include/nodes/pathnodes.h           |  3 +++
 src/include/nodes/plannodes.h           | 12 ++++++++++++
 6 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e2792ead71..561eb8957b8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3616,6 +3616,26 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	ExplainPropertyText("Cache Key", keystr.data, es);
 	ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
 
+	if (es->costs)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+		 	ExplainIndentText(es);
+		 	appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f lookups=%.0f hit percent=%.2f%%\n",
+		 					((Memoize *) plan)->est_entries,
+		 					((Memoize *) plan)->est_unique_keys,
+		 					((Memoize *) plan)->est_calls,
+		 					((Memoize *) plan)->est_hit_ratio * 100.0);
+		}
+		else
+		{
+			ExplainPropertyUInteger("Estimated Capacity", NULL, ((Memoize *) plan)->est_entries, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", NULL, ((Memoize *) plan)->est_unique_keys, 0, es);
+			ExplainPropertyFloat("Estimated Lookups", NULL, ((Memoize *) plan)->est_calls, 0, es);
+			ExplainPropertyFloat("Estimated Hit Percent", NULL, ((Memoize *) plan)->est_hit_ratio * 100.0, 2, es);
+		}
+	}
+
 	pfree(keystr.data);
 
 	if (!es->analyze)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 3d44815ed5a..d5f2cc15817 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember ndistinct for a potential EXPLAIN later */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
@@ -2621,6 +2624,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	hit_ratio = ((calls - ndistinct) / calls) *
 		(est_cache_entries / Max(ndistinct, est_cache_entries));
 
+	/* Remember cache hit ratio for a potential EXPLAIN later */
+	mpath->est_hit_ratio = hit_ratio;
+
 	Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
 
 	/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0b61aef962c..4f28a2d906c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,9 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							uint32 est_entries, Bitmapset *keyparamids,
+							double est_unique_keys, double est_hit_ratio,
+							double est_calls);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1703,7 +1705,9 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids,
+						best_path->est_unique_keys, best_path->est_hit_ratio,
+						best_path->calls);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6699,7 +6703,9 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			uint32 est_entries, Bitmapset *keyparamids,
+			double est_unique_keys, double est_hit_ratio,
+			double est_calls)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6717,6 +6723,9 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_unique_keys = est_unique_keys;
+	node->est_hit_ratio = est_hit_ratio;
+	node->est_calls = est_calls;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e0192d4a491..a68a32c89cf 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1701,6 +1701,18 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
 
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * computed using estimate_num_groups()
+	 */
+	pathnode->est_unique_keys = 0;
+
+	/*
+	 * The estimated cache hit ratio will be calculated later
+	 * by cost_memoize_rescan().
+	 */
+	pathnode->est_hit_ratio = 0;
+
 	/*
 	 * Add a small additional charge for caching the first entry.  All the
 	 * harder calculations for rescans are performed in cost_memoize_rescan().
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6567759595d..5bb7451e160 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2135,6 +2135,9 @@ typedef struct MemoizePath
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	double		est_unique_keys;	/* Estimated number of distinct memoization keys,
+								 * used for cache size evaluation. Kept for EXPLAIN */
+	double		est_hit_ratio;	/* Estimated cache hit ratio. Kept for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4f59e30d62d..598c0d5eb1f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1056,6 +1056,18 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/*
+	 * Estimated number of distinct memoization keys,
+	 * used for cache size evaluation. Kept for EXPLAIN
+	 */
+	double		est_unique_keys;
+
+	/* Estimated cache hit ratio. Kept for EXPLAIN */
+	double		est_hit_ratio;
+
+	/* Estimated number of rescans. Kept for EXPLAIN */
+	double		est_calls;
 } Memoize;
 
 /* ----------------
-- 
2.34.1

#49Andrei Lepikhov
lepihov@gmail.com
In reply to: Ilia Evdokimov (#48)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 4/7/2025 10:30, Ilia Evdokimov wrote:

On 01.05.2025 15:22, Ilia Evdokimov wrote:

I attached updated v9 patch with the suggested changes. The updated
line in the EXPLAIN looks like this:

Text format:
    Estimates: capacity=1 distinct keys=1 lookups=2 hit percent=50.00%

Non-text format:
    Estimated Capacity: 1
    Estimated Distinct Lookup Keys: 1
    Estimated Lookups: 2
    Estimated Hit Percent: 50.00

Any suggestions?

I attached rebased v10 patch on 5a6c39b.

Exposing internal information about the estimation of the number of
groups in the Memoize node, shouldn't we do the same in even more vague
cases, such as IncrementalSort? For example, in [1] (see its
attachment), I observe that IncrementalSort is considerably better than
Sort, but has a larger cost. It would be helpful to understand if an
incorrect ngroups estimation causes this.

[1]: /messages/by-id/a74d2648-c93e-4686-a8d4-2843515b8702@gmail.com

--
regards, Andrei Lepikhov

#50Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: Andrei Lepikhov (#49)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 07.07.2025 16:49, Andrei Lepikhov wrote:

Exposing internal information about the estimation of the number of
groups in the Memoize node, shouldn't we do the same in even more
vague cases, such as IncrementalSort? For example, in [1] (see its
attachment), I observe that IncrementalSort is considerably better
than Sort, but has a larger cost. It would be helpful to understand if
an incorrect ngroups estimation causes this.

LGTM.

If we still don't expose any information for nodes like IncrementalSort
that would help explain why the planner chose them, then I fully agree -
we definitely should add it.

I suggest discussing that in a separate thread.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#51David Rowley
dgrowleyml@gmail.com
In reply to: Ilia Evdokimov (#48)
1 attachment(s)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Fri, 4 Jul 2025 at 20:30, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

I attached rebased v10 patch on 5a6c39b.

I've gone over this and made some cosmetic adjustments. A few
adjustments to the comments and used Cardinality rather than double
for some data types. I also moved the MemoizePath.calls field down
below est_entries so that est_entries would take up some padding
space. This saves 8 bytes of struct, and IMO, improves the logical
order of fields. I renamed "calls" to "est_calls" so it's more aligned
with the new fields being added by this patch.

The final thing which I'm not so sure about is the EXPLAIN format.
Currently it looks like:

-> Memoize (cost=0.43..0.46 rows=1 width=4) (actual
time=0.000..0.000 rows=1.00 loops=1000000)
Cache Key: t.a
Cache Mode: logical
Estimates: capacity=10010 distinct keys=10010 lookups=1000000
hit ratio=99.00%
Hits: 989999 Misses: 10001 Evictions: 0 Overflows: 0
Memory Usage: 1016kB

This format for the Estimates copies the "Buffers:" format for putting
multiple sub-values under a single heading. However, we don't seem
very consistent with this as JIT has a similar need but formats things
another way, i.e.:

JIT:
Functions: 6
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.473 ms (Deform 0.112 ms), Inlining 0.000 ms,
Optimization 0.701 ms, Emission 5.778 ms, Total 6.952 ms

We could get rid of the "Estimates" heading and just prefix each value
with "Estimated", as in:

-> Memoize (cost=0.43..0.46 rows=1 width=4) (actual
time=0.000..0.000 rows=1.00 loops=1000000)
Cache Key: t.a
Cache Mode: logical
Estimated capacity: 10010 Estimated distinct keys: 10010
Estimated lookups: 1000000 Estimated hit ratio: 99.00%
Hits: 989999 Misses: 10001 Evictions: 0 Overflows: 0
Memory Usage: 1016kB
Buffers: shared hit=30004

That removes the dilemma about which example to follow, but it's more verbose.

Does anyone have any opinions on this?

v11 patch attached.

David

Attachments:

v11-0001-Expose-cache-hit-statistics-in-Memoize-node-EXPL.patch (application/octet-stream)
From d7e35cfaa7efef534bd81ab21590e243db1aa685 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Fri, 4 Jul 2025 11:24:56 +0300
Subject: [PATCH v11] Expose cache hit statistics in Memoize node EXPLAIN
 output.

This patch adds additional information to the Memoize node's EXPLAIN output:
estimated cache capacity, number of distinct keys, total lookups, and
cache hit percentage.
---
 src/backend/commands/explain.c          | 20 ++++++++++++++++++++
 src/backend/optimizer/path/costsize.c   | 14 ++++++++++----
 src/backend/optimizer/plan/createplan.c | 13 ++++++++++---
 src/backend/optimizer/util/pathnode.c   | 11 ++++++++---
 src/include/nodes/pathnodes.h           |  4 +++-
 src/include/nodes/plannodes.h           | 10 ++++++++++
 src/include/optimizer/pathnode.h        |  2 +-
 7 files changed, 62 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e2792ead71..c3c52fbe190 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3616,6 +3616,26 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es)
 	ExplainPropertyText("Cache Key", keystr.data, es);
 	ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es);
 
+	if (es->costs)
+	{
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "Estimates: capacity=%u distinct keys=%.0f lookups=%.0f hit percent=%.2f%%\n",
+							 ((Memoize *) plan)->est_entries,
+							 ((Memoize *) plan)->est_unique_keys,
+							 ((Memoize *) plan)->est_calls,
+							 ((Memoize *) plan)->est_hit_ratio * 100.0);
+		}
+		else
+		{
+			ExplainPropertyUInteger("Estimated Capacity", NULL, ((Memoize *) plan)->est_entries, es);
+			ExplainPropertyFloat("Estimated Distinct Lookup Keys", NULL, ((Memoize *) plan)->est_unique_keys, 0, es);
+			ExplainPropertyFloat("Estimated Lookups", NULL, ((Memoize *) plan)->est_calls, 0, es);
+			ExplainPropertyFloat("Estimated Hit Percent", NULL, ((Memoize *) plan)->est_hit_ratio * 100.0, 2, es);
+		}
+	}
+
 	pfree(keystr.data);
 
 	if (!es->analyze)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1f04a2c182c..3a60c55803c 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2572,7 +2572,7 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	Cost		input_startup_cost = mpath->subpath->startup_cost;
 	Cost		input_total_cost = mpath->subpath->total_cost;
 	double		tuples = mpath->subpath->rows;
-	double		calls = mpath->calls;
+	double		est_calls = mpath->est_calls;
 	int			width = mpath->subpath->pathtarget->width;
 
 	double		hash_mem_bytes;
@@ -2604,7 +2604,7 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	est_cache_entries = floor(hash_mem_bytes / est_entry_bytes);
 
 	/* estimate on the distinct number of parameter values */
-	ndistinct = estimate_num_groups(root, mpath->param_exprs, calls, NULL,
+	ndistinct = estimate_num_groups(root, mpath->param_exprs, est_calls, NULL,
 									&estinfo);
 
 	/*
@@ -2616,7 +2616,7 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	 * certainly mean a MemoizePath will never survive add_path().
 	 */
 	if ((estinfo.flags & SELFLAG_USED_DEFAULT) != 0)
-		ndistinct = calls;
+		ndistinct = est_calls;
 
 	/*
 	 * Since we've already estimated the maximum number of entries we can
@@ -2630,6 +2630,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	mpath->est_entries = Min(Min(ndistinct, est_cache_entries),
 							 PG_UINT32_MAX);
 
+	/* Remember the ndistinct estimate for EXPLAIN */
+	mpath->est_unique_keys = ndistinct;
+
 	/*
 	 * When the number of distinct parameter values is above the amount we can
 	 * store in the cache, then we'll have to evict some entries from the
@@ -2644,9 +2647,12 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath,
 	 * must look at how many scans are estimated in total for this node and
 	 * how many of those scans we expect to get a cache hit.
 	 */
-	hit_ratio = ((calls - ndistinct) / calls) *
+	hit_ratio = ((est_calls - ndistinct) / est_calls) *
 		(est_cache_entries / Max(ndistinct, est_cache_entries));
 
+	/* Remember the hit ratio estimate for EXPLAIN */
+	mpath->est_hit_ratio = hit_ratio;
+
 	Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
 
 	/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 8a9f1d7a943..331b0b66a69 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -284,7 +284,9 @@ static Material *make_material(Plan *lefttree);
 static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators,
 							 Oid *collations, List *param_exprs,
 							 bool singlerow, bool binary_mode,
-							 uint32 est_entries, Bitmapset *keyparamids);
+							 uint32 est_entries, Bitmapset *keyparamids,
+							 double est_calls, double est_unique_keys,
+							 double est_hit_ratio);
 static WindowAgg *make_windowagg(List *tlist, WindowClause *wc,
 								 int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations,
 								 int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations,
@@ -1753,7 +1755,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags)
 
 	plan = make_memoize(subplan, operators, collations, param_exprs,
 						best_path->singlerow, best_path->binary_mode,
-						best_path->est_entries, keyparamids);
+						best_path->est_entries, keyparamids, best_path->est_calls,
+						best_path->est_unique_keys, best_path->est_hit_ratio);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -6749,7 +6752,8 @@ materialize_finished_plan(Plan *subplan)
 static Memoize *
 make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 			 List *param_exprs, bool singlerow, bool binary_mode,
-			 uint32 est_entries, Bitmapset *keyparamids)
+			 uint32 est_entries, Bitmapset *keyparamids, double est_calls,
+			 double est_unique_keys, double est_hit_ratio)
 {
 	Memoize    *node = makeNode(Memoize);
 	Plan	   *plan = &node->plan;
@@ -6767,6 +6771,9 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations,
 	node->binary_mode = binary_mode;
 	node->est_entries = est_entries;
 	node->keyparamids = keyparamids;
+	node->est_calls = est_calls;
+	node->est_unique_keys = est_unique_keys;
+	node->est_hit_ratio = est_hit_ratio;
 
 	return node;
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 9cc602788ea..1812eb55821 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1689,7 +1689,7 @@ create_material_path(RelOptInfo *rel, Path *subpath)
 MemoizePath *
 create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 					List *param_exprs, List *hash_operators,
-					bool singlerow, bool binary_mode, double calls)
+					bool singlerow, bool binary_mode, double est_calls)
 {
 	MemoizePath *pathnode = makeNode(MemoizePath);
 
@@ -1710,7 +1710,6 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	pathnode->param_exprs = param_exprs;
 	pathnode->singlerow = singlerow;
 	pathnode->binary_mode = binary_mode;
-	pathnode->calls = clamp_row_est(calls);
 
 	/*
 	 * For now we set est_entries to 0.  cost_memoize_rescan() does all the
@@ -1720,6 +1719,12 @@ create_memoize_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 	 */
 	pathnode->est_entries = 0;
 
+	pathnode->est_calls = clamp_row_est(est_calls);
+
+	/* These will also be set later in cost_memoize_rescan() */
+	pathnode->est_unique_keys = 0;
+	pathnode->est_hit_ratio = 0;
+
 	/* we should not generate this path type when enable_memoize=false */
 	Assert(enable_memoize);
 	pathnode->path.disabled_nodes = subpath->disabled_nodes;
@@ -4259,7 +4264,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 													mpath->hash_operators,
 													mpath->singlerow,
 													mpath->binary_mode,
-													mpath->calls);
+													mpath->est_calls);
 			}
 		default:
 			break;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6567759595d..40166e22a49 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -2131,10 +2131,12 @@ typedef struct MemoizePath
 								 * complete after caching the first record. */
 	bool		binary_mode;	/* true when cache key should be compared bit
 								 * by bit, false when using hash equality ops */
-	Cardinality calls;			/* expected number of rescans */
 	uint32		est_entries;	/* The maximum number of entries that the
 								 * planner expects will fit in the cache, or 0
 								 * if unknown */
+	Cardinality est_calls;		/* expected number of rescans */
+	Cardinality est_unique_keys;	/* estimated unique keys, for EXPLAIN */
+	double		est_hit_ratio;	/* estimated cache hit ratio, for EXPLAIN */
 } MemoizePath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4f59e30d62d..008328577cc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -1056,6 +1056,16 @@ typedef struct Memoize
 
 	/* paramids from param_exprs */
 	Bitmapset  *keyparamids;
+
+	/* Estimated number of rescans, for EXPLAIN */
+	Cardinality est_calls;
+
+	/* Estimated number of distinct lookup keys, for EXPLAIN */
+	Cardinality est_unique_keys;
+
+	/* Estimated cache hit ratio, for EXPLAIN */
+	double		est_hit_ratio;
+
 } Memoize;
 
 /* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 60dcdb77e41..95f98b0ba40 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -90,7 +90,7 @@ extern MemoizePath *create_memoize_path(PlannerInfo *root,
 										List *hash_operators,
 										bool singlerow,
 										bool binary_mode,
-										double calls);
+										double est_calls);
 extern UniquePath *create_unique_path(PlannerInfo *root, RelOptInfo *rel,
 									  Path *subpath, SpecialJoinInfo *sjinfo);
 extern GatherPath *create_gather_path(PlannerInfo *root,
-- 
2.43.0

#52Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#51)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 22/7/2025 00:17, David Rowley wrote:

On Fri, 4 Jul 2025 at 20:30, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

I attached rebased v10 patch on 5a6c39b.

I've gone over this and made some cosmetic adjustments. A few
adjustments to the comments and used Cardinality rather than double
for some data types. I also moved the MemoizePath.calls field down
below est_entries so that est_entries would take up some padding
space. This saves 8 bytes of struct, and IMO, improves the logical
order of fields. I renamed "calls" to "est_calls" so it's more aligned
with the new fields being added by this patch.

Looks good

That removes the dilemma about which example to follow, but it's more verbose.

The 'Buffers:' way looks more natural to me. I don't like duplicated
text in the explain format - it is already cluttered by multiple
unnecessary elements that distract attention from the problematic plan
elements, such as unpluggable costs output if we only need row
predictions, '.00' in estimations, etc.
However, at first, I'd consider how it could be added to the
IncrementalSort and HashJoin. The number of estimated groups/buckets may
also provide some insights into the planner's decision.

Will you add the ExplainOpenGroup call to the final version of the patch?

--
regards, Andrei Lepikhov

#53David Rowley
dgrowleyml@gmail.com
In reply to: Andrei Lepikhov (#52)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Wed, 23 Jul 2025 at 02:35, Andrei Lepikhov <lepihov@gmail.com> wrote:

The 'Buffers:' way looks more natural to me. I don't like duplicated
text in the explain format - it is already cluttered by multiple
unnecessary elements that distract attention from the problematic plan
elements, such as unpluggable costs output if we only need row
predictions, '.00' in estimations, etc.

Seems logical.

However, at first, I'd consider how it could be added to the
IncrementalSort and HashJoin. The number of estimated groups/buckets may
also provide some insights into the planner's decision.

Sounds like another patch for another thread.

Will you add the ExplainOpenGroup call to the final version of the patch?

I'm leaning towards not doing that as a reader might expect all the
"Estimates" to be within that group, but the estimated cost and row
counts won't be in there. Maybe it's possible to circumvent that
expectation by naming the group something like "MemoizeEstimates", but
IMO, that seems excessive for 4 properties.

Do you have an argument in mind to support adding the group?

David

#54Andrei Lepikhov
lepihov@gmail.com
In reply to: David Rowley (#53)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 23/7/2025 02:11, David Rowley wrote:

On Wed, 23 Jul 2025 at 02:35, Andrei Lepikhov <lepihov@gmail.com> wrote:

However, at first, I'd consider how it could be added to the
IncrementalSort and HashJoin. The number of estimated groups/buckets may
also provide some insights into the planner's decision.

Sounds like another patch for another thread.

Maybe; it is up to you.

Do you have an argument in mind to support adding the group?

I have not. I just thought that grouping would be helpful for an explain
parser.

--
regards, Andrei Lepikhov

#55Ilia Evdokimov
ilya.evdokimov@tantorlabs.com
In reply to: David Rowley (#53)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On 23.07.2025 03:11, David Rowley wrote:

On Wed, 23 Jul 2025 at 02:35, Andrei Lepikhov <lepihov@gmail.com> wrote:

The 'Buffers:' way looks more natural to me. I don't like duplicated
text in the explain format - it is already cluttered by multiple
unnecessary elements that distract attention from the problematic plan
elements, such as unpluggable costs output if we only need row
predictions, '.00' in estimations, etc.

Seems logical.

+1

Will you add the ExplainOpenGroup call to the final version of the patch?

I'm leaning towards not doing that as a reader might expect all the
"Estimates" to be within that group, but the estimated cost and row
counts won't be in there. Maybe it's possible to circumvent that
expectation by naming the group something like "MemoizeEstimates", but
IMO, that seems excessive for 4 properties.

I agree. I would consider adding a group if we displayed information in
a looped format, like for Workers, or if we had some particularly useful
data for parsers - for example, timings or memory usage. But for four
estimates, grouping seems unnecessary.

Given that, patch v11 still looks like the most appropriate version to me.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#56 David Rowley
dgrowleyml@gmail.com
In reply to: Ilia Evdokimov (#55)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Wed, 23 Jul 2025 at 23:08, Ilia Evdokimov
<ilya.evdokimov@tantorlabs.com> wrote:

Given that, patch v11 still looks like the most appropriate version to me.

I spent a bit more time cleaning a few things up in this. With all the
complexity from the extra fields, I felt that a round of adjusting the
"double" types so they use the "Cardinality" typedef where appropriate
was in order. costsize.c isn't very good at doing that, but I didn't
see that as a reason to make the problem worse. The only other change
was in explain.c where I added a local "Memoize *" variable rather
than casting the "plan" variable to the subclass type in each usage.

I understand not everyone got what they wanted here, but I hope what's
been added is ok for the majority of people.

Thanks for working on this Ilia and Lukas.

David

#57 Lukas Fittl
lukas@fittl.com
In reply to: David Rowley (#56)
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

On Mon, Jul 28, 2025 at 8:26 PM David Rowley <dgrowleyml@gmail.com> wrote:

I understand not everyone got what they wanted here, but I hope what's
been added is ok for the majority of people.

For what it's worth, the last version you posted (v11) looks good to me as
well, and +1 on the "Estimates:" style that puts them all on one line in
text mode.

I think this will help a lot in understanding Memoize's impact on
plans. Thanks, everyone, for pushing this forward!

Thanks,
Lukas

--
Lukas Fittl