Adjust ndistinct for eqjoinsel
Hi,
I run TPC-DS benchmark for Postgres and find the join size estimation has several problems.
For example, Ndistinct is key to join selectivity's estimation, this value does not take restrictions
of the rel, I hit some cases in the function eqjoinsel, nd is much larger than vardata.rel->rows.
Accurate estimation need good math model that considering dependency of join var and vars in restriction.
But at least, indistinct should not be greater than the number of rows.
See the attached patch to adjust nd in eqjoinsel.
Best,
Zhenghua Lyu
Attachments:
0001-Adjust-ndistinct-with-nrows-in-the-rel-when-estimati.patchapplication/octet-stream; name=0001-Adjust-ndistinct-with-nrows-in-the-rel-when-estimati.patchDownload
From 00e5314a357f80a54284b7b8fceed9b7eb9da7ce Mon Sep 17 00:00:00 2001
From: Zhenghua lyu <kainwen@gmail.com>
Date: Fri, 15 Jul 2022 13:51:20 +0000
Subject: [PATCH] Adjust ndistinct with nrows in the rel when estimating join
selectivity.
Ndistinct is key to the accuracy of join selectivity estimation, which
impacts the performance of complex SQLs a lot. Previous code in eqjoinsel
does not take rel's restriction into account. A good math model should
use the dependency of the Vars in rel's restrictions and the join var to
estimate Ndistinct. At least there is a truth that Ndistinct should not
be greater then the number of rows of the rel.
---
src/backend/utils/adt/selfuncs.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fa1f589fad..bf6ef60b56 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2269,6 +2269,15 @@ eqjoinsel(PG_FUNCTION_ARGS)
nd1 = get_variable_numdistinct(&vardata1, &isdefault1);
nd2 = get_variable_numdistinct(&vardata2, &isdefault2);
+ /*
+ * Adjust ndistinct to account for restriction clauses.
+ * nd should not be greater than the number of rows in the relation.
+ */
+ if (vardata1.rel)
+ nd1 = Min(nd1, vardata1.rel->rows);
+ if (vardata2.rel)
+ nd2 = Min(nd2, vardata2.rel->rows);
+
opfuncoid = get_opcode(operator);
memset(&sslot1, 0, sizeof(sslot1));
--
2.25.1
Zhenghua Lyu <zlyu@vmware.com> writes:
I run TPC-DS benchmark for Postgres and find the join size estimation has several problems.
For example, Ndistinct is key to join selectivity's estimation, this value does not take restrictions
of the rel, I hit some cases in the function eqjoinsel, nd is much larger than vardata.rel->rows.
Accurate estimation need good math model that considering dependency of join var and vars in restriction.
But at least, indistinct should not be greater than the number of rows.
See the attached patch to adjust nd in eqjoinsel.
We're very unlikely to accept this with no test case and no explanation
of why it's not an overcorrection. get_variable_numdistinct already
clamps its result to rel->tuples, and I think that by using rel->rows
instead you are probably double-counting the selectivity of the rel's
restriction clauses.
See the sad history of commit 7f3eba30c, which did something
pretty close to this and eventually got almost entirely reverted
(97930cf57, 0d3b231ee). I'd be the first to agree that better
estimates here would be great, but it's not as simple as it looks.
regards, tom lane