PATCH: add support for IN and @> in functional-dependency statistics use

Started by Pierre Ducroquetalmost 6 years ago24 messages

p.psql@pinaraf.info

almost 6 years ago

2 attachment(s)

Hello

At my current job, we have a lot of multi-tenant databases, thus with tables
containing a tenant_id column. Such a column introduces a severe bias in
statistics estimation since any other FK in the next columns is very likely to
have a functional dependency on the tenant id. We found several queries where
this functional dependency messed up the estimations so much the optimizer
chose wrong plans.
When we tried to use extended statistics with CREATE STATISTIC on tenant_id,
other_id, we noticed that the current implementation for detecting functional
dependency lacks two features (at least in our use case):
- support for IN clauses
- support for the array contains operator (that could be considered as a
special case of IN)

After digging in the source code, I think the lack of support for IN clauses
is an oversight and due to the fact that IN clauses are ScalarArrayOpExpr
instead of OpExpr. The attached patch fixes this by simply copying the code-
path for OpExpr and changing the type name. It compiles and the results are
correct, with a dependency being correctly taken into consideration when
estimating rows. If you think such a copy paste is bad and should be factored
in another static bool function, please say so and I will happily provide an
updated patch.
The lack of support for @> operator, on the other hand, seems to be a decision
taken when writing the initial code, but I can not find any mathematical nor
technical reason for it. The current code restricts dependency calculation to
the = operator, obviously because inequality operators are not going to
work... but array contains is just several = operators grouped, thus the same
for the dependency calculation. The second patch refactors the operator check
in order to also include array contains.

I tested the patches on current HEAD, but I can test and provide back-ported
versions of the patch for other versions if needed (this code path hardly
changed since its introduction in 10).

Best regards

Pierre Ducroquet

Attachments:

0001-Add-support-for-IN-clauses-in-dependencies-check.patchtext/x-patch; charset=UTF-8; name=0001-Add-support-for-IN-clauses-in-dependencies-check.patchDownload

From d2b89748e1b520c276875538149eb466e97c21b4 Mon Sep 17 00:00:00 2001
From: Pierre <pierre.ducroquet@people-doc.com>
Date: Fri, 31 Jan 2020 23:58:35 +0100
Subject: [PATCH 1/2] Add support for IN clauses in dependencies check

---
 src/backend/statistics/dependencies.c | 34 +++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index e2f6c5bb97..d4844a9ec6 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -801,6 +801,40 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 
 		/* OK to proceed with checking "var" */
 	}
+	else if (IsA(rinfo->clause, ScalarArrayOpExpr))
+	{
+		/* If it's an opclause, check for Var IN Const. */
+		ScalarArrayOpExpr	   *expr = (ScalarArrayOpExpr *) rinfo->clause;
+
+		/* Only expressions with two arguments are candidates. */
+		if (list_length(expr->args) != 2)
+			return false;
+
+		/* Make sure non-selected argument is a pseudoconstant. */
+		if (is_pseudo_constant_clause(lsecond(expr->args)))
+			var = linitial(expr->args);
+		else if (is_pseudo_constant_clause(linitial(expr->args)))
+			var = lsecond(expr->args);
+		else
+			return false;
+
+		/*
+		 * If it's not an "=" operator, just ignore the clause, as it's not
+		 * compatible with functional dependencies.
+		 *
+		 * This uses the function for estimating selectivity, not the operator
+		 * directly (a bit awkward, but well ...).
+		 *
+		 * XXX this is pretty dubious; probably it'd be better to check btree
+		 * or hash opclass membership, so as not to be fooled by custom
+		 * selectivity functions, and to be more consistent with decisions
+		 * elsewhere in the planner.
+		 */
+		if (get_oprrest(expr->opno) != F_EQSEL)
+			return false;
+
+		/* OK to proceed with checking "var" */
+	}
 	else if (is_notclause(rinfo->clause))
 	{
 		/*
-- 
2.24.1

0002-Add-support-for-array-contains-in-dependency-check.patchtext/x-patch; charset=UTF-8; name=0002-Add-support-for-array-contains-in-dependency-check.patchDownload

From 02363c52d0d03f58b8e79088a7261a9fc1e3bbb7 Mon Sep 17 00:00:00 2001
From: Pierre <pierre.ducroquet@people-doc.com>
Date: Fri, 31 Jan 2020 23:58:55 +0100
Subject: [PATCH 2/2] Add support for array contains in dependency check

---
 src/backend/statistics/dependencies.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index d4844a9ec6..c3f37a474b 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -785,7 +785,7 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 			return false;
 
 		/*
-		 * If it's not an "=" operator, just ignore the clause, as it's not
+		 * If it's not an "=" or "@>" operator, just ignore the clause, as it's not
 		 * compatible with functional dependencies.
 		 *
 		 * This uses the function for estimating selectivity, not the operator
@@ -796,8 +796,13 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 		 * selectivity functions, and to be more consistent with decisions
 		 * elsewhere in the planner.
 		 */
-		if (get_oprrest(expr->opno) != F_EQSEL)
-			return false;
+		switch(get_oprrest(expr->opno)) {
+			case F_EQSEL:
+			case F_ARRAYCONTSEL:
+				break;
+			default:
+				return false;
+		}
 
 		/* OK to proceed with checking "var" */
 	}
-- 
2.24.1

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Pierre Ducroquet (#1)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:

Hello

At my current job, we have a lot of multi-tenant databases, thus with tables
containing a tenant_id column. Such a column introduces a severe bias in
statistics estimation since any other FK in the next columns is very likely to
have a functional dependency on the tenant id. We found several queries where
this functional dependency messed up the estimations so much the optimizer
chose wrong plans.
When we tried to use extended statistics with CREATE STATISTIC on tenant_id,
other_id, we noticed that the current implementation for detecting functional
dependency lacks two features (at least in our use case):
- support for IN clauses
- support for the array contains operator (that could be considered as a
special case of IN)

Thanks for the patch. I don't think the lack of support for these clause
types is an oversight - we haven't done them because we were not quite
sure the functional dependencies can actually apply to them. But maybe
we can support them, I'm not against that in principle.

After digging in the source code, I think the lack of support for IN clauses
is an oversight and due to the fact that IN clauses are ScalarArrayOpExpr
instead of OpExpr. The attached patch fixes this by simply copying the code-
path for OpExpr and changing the type name. It compiles and the results are
correct, with a dependency being correctly taken into consideration when
estimating rows. If you think such a copy paste is bad and should be factored
in another static bool function, please say so and I will happily provide an
updated patch.

Hmmm. Consider a query like this:

... WHERE tenant_id = 1 AND another_column IN (2,3)

which kinda contradicts the idea of functional dependencies that knowing
a value in one column, tells us something about a value in a second
column. But that assumes a single value, which is not quite true here.

The above query is essentially the same thing as

... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))

and also

... WHERE (tenant_id=1 AND another_column=2)
OR (tenant_id=1 AND another_column=3)

at wchich point we could apply functional dependencies - but we'd do it
once for each AND-clause, and then combine the results to compute
selectivity for the OR clause.

But this means that if we apply functional dependencies directly to the
original clause, it'll be inconsistent. Which seems a bit unfortunate.

Or do I get this wrong?

BTW the code added in the 0001 patch is the same as for is_opclause, so
maybe we can simply do

if (is_opclause(rinfo->clause) ||
IsA(rinfo->clause, ScalarArrayOpExpr))
{
...
}

instead of just duplicating the code. We also need some at least some
regression tests, testing functional dependencies with this clause type.

The lack of support for @> operator, on the other hand, seems to be a decision
taken when writing the initial code, but I can not find any mathematical nor
technical reason for it. The current code restricts dependency calculation to
the = operator, obviously because inequality operators are not going to
work... but array contains is just several = operators grouped, thus the same
for the dependency calculation. The second patch refactors the operator check
in order to also include array contains.

No concrete opinion on this yet. I think my concerns are pretty similar
to the IN clause, although I'm not sure what you mean by "this could be
considered as special case of IN".

I tested the patches on current HEAD, but I can test and provide back-ported
versions of the patch for other versions if needed (this code path hardly
changed since its introduction in 10).

I think the chance of this getting backpatched is zero, because it might
easily break existing apps.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Pierre Ducroquet

p.psql@pinaraf.info

almost 6 years ago

In reply to: Tomas Vondra (#2)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote:

On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:

Hello

At my current job, we have a lot of multi-tenant databases, thus with
tables containing a tenant_id column. Such a column introduces a severe
bias in statistics estimation since any other FK in the next columns is
very likely to have a functional dependency on the tenant id. We found
several queries where this functional dependency messed up the estimations
so much the optimizer chose wrong plans.
When we tried to use extended statistics with CREATE STATISTIC on
tenant_id, other_id, we noticed that the current implementation for
detecting functional dependency lacks two features (at least in our use
case):
- support for IN clauses
- support for the array contains operator (that could be considered as a
special case of IN)

Thanks for the patch. I don't think the lack of support for these clause
types is an oversight - we haven't done them because we were not quite
sure the functional dependencies can actually apply to them. But maybe
we can support them, I'm not against that in principle.

After digging in the source code, I think the lack of support for IN
clauses is an oversight and due to the fact that IN clauses are
ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by
simply copying the code- path for OpExpr and changing the type name. It
compiles and the results are correct, with a dependency being correctly
taken into consideration when estimating rows. If you think such a copy
paste is bad and should be factored in another static bool function,
please say so and I will happily provide an updated patch.

Hmmm. Consider a query like this:

... WHERE tenant_id = 1 AND another_column IN (2,3)

which kinda contradicts the idea of functional dependencies that knowing
a value in one column, tells us something about a value in a second
column. But that assumes a single value, which is not quite true here.

The above query is essentially the same thing as

... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))

and also

... WHERE (tenant_id=1 AND another_column=2)
OR (tenant_id=1 AND another_column=3)

at wchich point we could apply functional dependencies - but we'd do it
once for each AND-clause, and then combine the results to compute
selectivity for the OR clause.

But this means that if we apply functional dependencies directly to the
original clause, it'll be inconsistent. Which seems a bit unfortunate.

Or do I get this wrong?

In my tests, I've got a table with two columns a and b, generated this way:
CREATE TABLE test (a INT, b INT)
AS SELECT i, i/10 FROM
generate_series(1, 100000) s(i),
generate_series(1, 5) x;

With statistics defined on the a, b columns

Here are the estimated selectivity results without any patch:

SELECT * FROM test WHERE a = 1 AND b = 1 : 5
SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1
SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1
SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1

With the patch, the estimated rows of the last query goes back to 5, which is
more logical. The other ones don't change.

BTW the code added in the 0001 patch is the same as for is_opclause, so
maybe we can simply do

if (is_opclause(rinfo->clause) ||
IsA(rinfo->clause, ScalarArrayOpExpr))
{
...
}

instead of just duplicating the code.

I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary
compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is
at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work
around with this kind of code, but I don't like it much:
List *args;
Oid opno;
if (IsA(rinfo->clause, OpExpr))
{
/* If it's an opclause, we will check for Var = Const or Const = Var. */
OpExpr *expr = (OpExpr *) rinfo->clause;
args = expr->args;
opno = expr->opno;
}
else if (IsA(rinfo->clause, ScalarArrayOpExpr))
{
/* If it's a ScalarArrayOpExpr, check for Var IN Const. */
ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause;
args = expr->args;
opno = expr->opno;
}

Or I can rewrite it in C++ to play with templates... :)

We also need some at least some
regression tests, testing functional dependencies with this clause type.

Agreed

The lack of support for @> operator, on the other hand, seems to be a
decision taken when writing the initial code, but I can not find any
mathematical nor technical reason for it. The current code restricts
dependency calculation to the = operator, obviously because inequality
operators are not going to work... but array contains is just several =
operators grouped, thus the same for the dependency calculation. The
second patch refactors the operator check in order to also include array
contains.

No concrete opinion on this yet. I think my concerns are pretty similar
to the IN clause, although I'm not sure what you mean by "this could be
considered as special case of IN".

I meant from a mathematical point of view.

I tested the patches on current HEAD, but I can test and provide
back-ported versions of the patch for other versions if needed (this code
path hardly changed since its introduction in 10).

I think the chance of this getting backpatched is zero, because it might
easily break existing apps.

I understand

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Pierre Ducroquet (#3)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Sun, Feb 02, 2020 at 10:59:32AM +0100, Pierre Ducroquet wrote:

On Saturday, February 1, 2020 3:24:46 PM CET Tomas Vondra wrote:

On Sat, Feb 01, 2020 at 08:51:04AM +0100, Pierre Ducroquet wrote:

Hello

At my current job, we have a lot of multi-tenant databases, thus with
tables containing a tenant_id column. Such a column introduces a severe
bias in statistics estimation since any other FK in the next columns is
very likely to have a functional dependency on the tenant id. We found
several queries where this functional dependency messed up the estimations
so much the optimizer chose wrong plans.
When we tried to use extended statistics with CREATE STATISTIC on
tenant_id, other_id, we noticed that the current implementation for
detecting functional dependency lacks two features (at least in our use
case):
- support for IN clauses
- support for the array contains operator (that could be considered as a
special case of IN)

Thanks for the patch. I don't think the lack of support for these clause
types is an oversight - we haven't done them because we were not quite
sure the functional dependencies can actually apply to them. But maybe
we can support them, I'm not against that in principle.

After digging in the source code, I think the lack of support for IN
clauses is an oversight and due to the fact that IN clauses are
ScalarArrayOpExpr instead of OpExpr. The attached patch fixes this by
simply copying the code- path for OpExpr and changing the type name. It
compiles and the results are correct, with a dependency being correctly
taken into consideration when estimating rows. If you think such a copy
paste is bad and should be factored in another static bool function,
please say so and I will happily provide an updated patch.

Hmmm. Consider a query like this:

... WHERE tenant_id = 1 AND another_column IN (2,3)

which kinda contradicts the idea of functional dependencies that knowing
a value in one column, tells us something about a value in a second
column. But that assumes a single value, which is not quite true here.

The above query is essentially the same thing as

... WHERE (tenant_id=1 AND (another_column=2 OR another_column=3))

and also

... WHERE (tenant_id=1 AND another_column=2)
OR (tenant_id=1 AND another_column=3)

at wchich point we could apply functional dependencies - but we'd do it
once for each AND-clause, and then combine the results to compute
selectivity for the OR clause.

But this means that if we apply functional dependencies directly to the
original clause, it'll be inconsistent. Which seems a bit unfortunate.

Or do I get this wrong?

In my tests, I've got a table with two columns a and b, generated this way:
CREATE TABLE test (a INT, b INT)
AS SELECT i, i/10 FROM
generate_series(1, 100000) s(i),
generate_series(1, 5) x;

With statistics defined on the a, b columns

Here are the estimated selectivity results without any patch:

SELECT * FROM test WHERE a = 1 AND b = 1 : 5
SELECT * FROM test WHERE a = 1 AND (b = 1 OR b = 2) : 1
SELECT * FROM test WHERE (a = 1 AND b = 1) OR (a = 1 AND b = 2) : 1
SELECT * FROM test WHERE a = 1 AND b IN (1, 2) : 1

With the patch, the estimated rows of the last query goes back to 5, which is
more logical. The other ones don't change.

Yes, I think you're right. I've been playing with this a bit more, and I
think you're right we can support it the way you propose.

I'm still a bit annoyed by the inconsistency this might/does introduce.
Consider for example these clauses:

a) ... WHERE a = 10 AND b IN (100, 200)
b) ... WHERE a = 10 AND (b = 100 OR b = 200)
c) ... WHERE (a = 10 AND b = 100) OR (a = 10 AND b = 200)

All three cases are logically equivalent and do return the same set of
rows. But we estimate them differently, arriving at different estimates.

Case (a) is the one you improve in your patch. Case (c) is actually not
possible in practice, because we rewrite it as (b) during planning.

But (b) is estimated very differently, because we don't recognize the OR
clause as supported by functional dependencies. On the one hand I'm sure
it's not the first case where we already estimate equivalent clauses
differently. On the other hand I wonder how difficult would it be to
support this specific type of OR clause (with all expressions having the
form "Var = Const" and all Vars referencing the same rel).

I'm not going to block the patch because of this, of course. Similarly,
it'd be nice to add support for ScalarArrayOpExpr to MCV stats, not just
functional dependencies ...

BTW the code added in the 0001 patch is the same as for is_opclause, so
maybe we can simply do

if (is_opclause(rinfo->clause) ||
IsA(rinfo->clause, ScalarArrayOpExpr))
{
...
}

instead of just duplicating the code.

I would love doing that, but the ScalarArrayOpExpr and OpExpr are not binary
compatible for the members used here. In ScalarArrayOpExpr, on AMD64, args is
at offset 24 and opno at 4, while they are at 32 and 4 in OpExpr. I can work
around with this kind of code, but I don't like it much:
List *args;
Oid opno;
if (IsA(rinfo->clause, OpExpr))
{
/* If it's an opclause, we will check for Var = Const or Const = Var. */
OpExpr *expr = (OpExpr *) rinfo->clause;
args = expr->args;
opno = expr->opno;
}
else if (IsA(rinfo->clause, ScalarArrayOpExpr))
{
/* If it's a ScalarArrayOpExpr, check for Var IN Const. */
ScalarArrayOpExpr *expr = (ScalarArrayOpExpr *) rinfo->clause;
args = expr->args;
opno = expr->opno;
}

Oh, right. I'm dumb, I missed this obvious detail. I blame belgian beer.

Or I can rewrite it in C++ to play with templates... :)

Please don't ;-)

We also need some at least some
regression tests, testing functional dependencies with this clause type.

Agreed

The lack of support for @> operator, on the other hand, seems to be a
decision taken when writing the initial code, but I can not find any
mathematical nor technical reason for it. The current code restricts
dependency calculation to the = operator, obviously because inequality
operators are not going to work... but array contains is just several =
operators grouped, thus the same for the dependency calculation. The
second patch refactors the operator check in order to also include array
contains.

No concrete opinion on this yet. I think my concerns are pretty similar
to the IN clause, although I'm not sure what you mean by "this could be
considered as special case of IN".

I meant from a mathematical point of view.

Can you elaborate a bit? I still don't understand how it's just "several
equality operators grouped".

I think the challenge here is in applying the functional dependency
computed for the whole array to individual elements. I'm not sure we can
do that.

For example, with a table like this:

CREATE TABLE t (a int, b int[]);
CREATE STATISTICS s (dependencies) ON a, b FROM t;

Let's say the functional dependency is "perfect" i.e. has strength 1.0.
But that only tells us dependency for complete array values, we don't
know how much information we gain by knowledge of subset of the values.

For example, all the arrays may contain {1, 2, 3} as subset, and then
some "unique" element, like this:

INSERT INTO t SELECT i/1000, ARRAY[1,2,3, 4 + i/100]
FROM generate_series(1,1000000) s(i);

and then do a query like this:

select * from t where a = 10 and b @> ARRAY[1,2];

Without extended stats, it's estimated like this:

QUERY PLAN
---------------------------------------------------------------
Seq Scan on t (cost=0.00..24346.00 rows=997 width=41)
(actual time=1.391..140.261 rows=1000 loops=1)
Filter: ((b @> '{1,2}'::integer[]) AND (a = 10))
Rows Removed by Filter: 999000
Planning Time: 0.052 ms
Execution Time: 140.707 ms
(5 rows)

but the moment you create functional stats, you get this:

QUERY PLAN
---------------------------------------------------------------
Seq Scan on t (cost=0.00..24346.00 rows=1000000 width=41)
(actual time=1.432..143.047 rows=1000 loops=1)
Filter: ((b @> '{1,2}'::integer[]) AND (a = 10))
Rows Removed by Filter: 999000
Planning Time: 0.099 ms
Execution Time: 143.527 ms
(5 rows)

That doesn't seem very useful :-(

I tested the patches on current HEAD, but I can test and provide
back-ported versions of the patch for other versions if needed (this code
path hardly changed since its introduction in 10).

I think the chance of this getting backpatched is zero, because it might
easily break existing apps.

I understand

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#4)

5 attachment(s)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

Hi Pierre,

I've looked at this patch series, hoping to get it close to committable.
Here is a somewhat improved version of the patch series, split into 5
pieces. The first 4 parts are about applying functional dependencies to
ScalarArrayOpExpr clauses. The last part is about doing the same thing
for MCV lists, so it seemed fine to post it here.

0001 is the patch you posted back in October

0002 simplifies the handling logic a bit, because ScalarArrayOpExpr can
only have form (Var op Const) but not (Const op Var).

0003 fixes what I think is a bug - ScalarArrayOpExpr can represent three
different cases:

* Var op ANY ()
* Var IN () -- special case of ANY
* Var op ALL ()

I don't think functional dependencies can handle the ALL case, we need
to reject it by checking the useOr flag.

0004 adds queries to the stats_ext test suite, to test all of this (some
of the cases illustrate the need for 0003, I think)

0005 allows estimation of ScalarArrayOpExpr by MCV lists, including
regression tests etc.

Will you have time to look at this, particularly 0001-0004, but maybe
even the 0005 part?

As for the second part of your patch (the one allowing estimation of
array containment queries), I still think that's not something we can
reasonably do without also building statistics on elements (which is
what we have in pg_stats but not pg_stats_ext).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Apply-functional-dependencies-to-ScalarArrayOpExpr.patchtext/plain; charset=us-asciiDownload

From ca77b087725d1c670250538c017f88150b3df868 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Wed, 4 Mar 2020 15:55:46 +0100
Subject: [PATCH 01/11] Apply functional dependencies to ScalarArrayOpExpr

Until now functional dependencies handled only plain equality clauses.
We can apply them to IN some cases of ANY, which can be translated to
an equality.
---
 src/backend/statistics/dependencies.c | 28 +++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index e2f6c5bb97..36941b8535 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -801,6 +801,34 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 
 		/* OK to proceed with checking "var" */
 	}
+	else if (IsA(rinfo->clause, ScalarArrayOpExpr))
+	{
+		/* If it's an opclause, check for Var IN Const. */
+		ScalarArrayOpExpr	   *expr = (ScalarArrayOpExpr *) rinfo->clause;
+
+		/* Only expressions with two arguments are candidates. */
+		if (list_length(expr->args) != 2)
+			return false;
+
+		/* Make sure non-selected argument is a pseudoconstant. */
+		if (is_pseudo_constant_clause(lsecond(expr->args)))
+			var = linitial(expr->args);
+		else if (is_pseudo_constant_clause(linitial(expr->args)))
+			var = lsecond(expr->args);
+		else
+			return false;
+
+		/*
+		 * If it's not an "=" operator, just ignore the clause, as it's not
+		 * compatible with functional dependencies. The operator is identified
+		 * simply by looking at which function it uses to estimate selectivity.
+		 * That's a bit strange, but it's what other similar places do.
+		 */
+		if (get_oprrest(expr->opno) != F_EQSEL)
+			return false;
+
+		/* OK to proceed with checking "var" */
+	}
 	else if (is_notclause(rinfo->clause))
 	{
 		/*
-- 
2.21.1

0002-Simplify-parsing-of-ScalarArrayOpExpr.patchtext/plain; charset=us-asciiDownload

From 9d1bba7dc28a219d9de47e14bad5888e9a495877 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 4 Mar 2020 22:25:54 +0100
Subject: [PATCH 02/11] Simplify parsing of ScalarArrayOpExpr

We know ScalarArrayOpExpr is always of the form

    Var op Const

in this exact order, so we don't need to check if var is on the left.
---
 src/backend/statistics/dependencies.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 36941b8535..a75b9d73db 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -803,21 +803,22 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 	}
 	else if (IsA(rinfo->clause, ScalarArrayOpExpr))
 	{
-		/* If it's an opclause, check for Var IN Const. */
+		/* If it's an scalar array operator, check for Var IN Const. */
 		ScalarArrayOpExpr	   *expr = (ScalarArrayOpExpr *) rinfo->clause;
 
 		/* Only expressions with two arguments are candidates. */
 		if (list_length(expr->args) != 2)
 			return false;
 
-		/* Make sure non-selected argument is a pseudoconstant. */
-		if (is_pseudo_constant_clause(lsecond(expr->args)))
-			var = linitial(expr->args);
-		else if (is_pseudo_constant_clause(linitial(expr->args)))
-			var = lsecond(expr->args);
-		else
+		/*
+		 * We know it's always (Var IN Const), so we assume the var is the
+		 * first argument, and pseudoconstant is the second one.
+		 */
+		if (!is_pseudo_constant_clause(lsecond(expr->args)))
 			return false;
 
+		var = linitial(expr->args);
+
 		/*
 		 * If it's not an "=" operator, just ignore the clause, as it's not
 		 * compatible with functional dependencies. The operator is identified
-- 
2.21.1

0003-Add-regression-tests-for-ScalarArrayOpExpr-with-depe.patchtext/plain; charset=us-asciiDownload

From 469e758c613ebe3b7d2d22c11fdf6969f55f5de6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 4 Mar 2020 22:26:01 +0100
Subject: [PATCH 03/11] Add regression tests for ScalarArrayOpExpr with
 dependencies

We need to check that IN and ANY (with equality operator) are correctly
estimated, while ALL (all cases) and ANY (with inequalities) are treated
as not compatible with functional dependencies.
---
 src/test/regress/expected/stats_ext.out | 224 ++++++++++++++++++++++++
 src/test/regress/sql/stats_ext.sql      |  80 +++++++++
 2 files changed, 304 insertions(+)

diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 61237dfb11..6a628f5680 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -421,6 +421,118 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
          1 |     50
 (1 row)
 
+-- IN
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+         2 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b IN (''1'', ''2'')');
+ estimated | actual 
+-----------+--------
+         4 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
+ estimated | actual 
+-----------+--------
+         8 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
+ estimated | actual 
+-----------+--------
+         1 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
+ estimated | actual 
+-----------+--------
+         1 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 26, 27, 51, 52, 76, 77) AND b IN (''1'', ''2'', ''26'', ''27'') AND c IN (1, 2)');
+ estimated | actual 
+-----------+--------
+         3 |    400
+(1 row)
+
+-- ANY
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+         2 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+         4 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+         8 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = 1');
+ estimated | actual 
+-----------+--------
+         1 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = ANY (ARRAY[1])');
+ estimated | actual 
+-----------+--------
+         1 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 26, 27, 51, 52, 76, 77]) AND b = ANY (ARRAY[''1'', ''2'', ''26'', ''27'']) AND c = ANY (ARRAY[1, 2])');
+ estimated | actual 
+-----------+--------
+         3 |    400
+(1 row)
+
+-- ANY with inequalities should not benefit from functional dependencies
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a < ANY (ARRAY[1, 51]) AND b > ''1''');
+ estimated | actual 
+-----------+--------
+      2472 |   2400
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a >= ANY (ARRAY[1, 51]) AND b <= ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+      1441 |   1250
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a <= ANY (ARRAY[1, 2, 51, 52]) AND b >= ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+      3909 |   2550
+(1 row)
+
+-- ALL (should not benefit from functional dependencies)
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1''])');
+ estimated | actual 
+-----------+--------
+         2 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ALL (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 -- create statistics
 CREATE STATISTICS func_deps_stat (dependencies) ON a, b, c FROM functional_dependencies;
 ANALYZE functional_dependencies;
@@ -436,6 +548,118 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
         50 |     50
 (1 row)
 
+-- IN
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b IN (''1'', ''2'')');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 26, 27, 51, 52, 76, 77) AND b IN (''1'', ''2'', ''26'', ''27'') AND c IN (1, 2)');
+ estimated | actual 
+-----------+--------
+       400 |    400
+(1 row)
+
+-- ANY
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = 1');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = ANY (ARRAY[1])');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 26, 27, 51, 52, 76, 77]) AND b = ANY (ARRAY[''1'', ''2'', ''26'', ''27'']) AND c = ANY (ARRAY[1, 2])');
+ estimated | actual 
+-----------+--------
+       400 |    400
+(1 row)
+
+-- ANY with inequalities should not benefit from functional dependencies
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a < ANY (ARRAY[1, 51]) AND b > ''1''');
+ estimated | actual 
+-----------+--------
+      2472 |   2400
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a >= ANY (ARRAY[1, 51]) AND b <= ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+      1441 |   1250
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a <= ANY (ARRAY[1, 2, 51, 52]) AND b >= ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+      3909 |   2550
+(1 row)
+
+-- ALL (should not benefit from functional dependencies)
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1''])');
+ estimated | actual 
+-----------+--------
+         2 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ALL (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+         1 |      0
+(1 row)
+
 -- check change of column type doesn't break it
 ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 84f13e8814..3de2be500a 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -273,6 +273,46 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
 
+-- IN
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b IN (''1'', ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 26, 27, 51, 52, 76, 77) AND b IN (''1'', ''2'', ''26'', ''27'') AND c IN (1, 2)');
+
+-- ANY
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = 1');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = ANY (ARRAY[1])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 26, 27, 51, 52, 76, 77]) AND b = ANY (ARRAY[''1'', ''2'', ''26'', ''27'']) AND c = ANY (ARRAY[1, 2])');
+
+-- ANY with inequalities should not benefit from functional dependencies
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a < ANY (ARRAY[1, 51]) AND b > ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a >= ANY (ARRAY[1, 51]) AND b <= ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a <= ANY (ARRAY[1, 2, 51, 52]) AND b >= ANY (ARRAY[''1'', ''2''])');
+
+-- ALL (should not benefit from functional dependencies)
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ALL (ARRAY[''1'', ''2''])');
+
 -- create statistics
 CREATE STATISTICS func_deps_stat (dependencies) ON a, b, c FROM functional_dependencies;
 
@@ -282,6 +322,46 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
 
+-- IN
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b IN (''1'', ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 26, 27, 51, 52, 76, 77) AND b IN (''1'', ''2'', ''26'', ''27'') AND c IN (1, 2)');
+
+-- ANY
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = 1');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 26, 51, 76]) AND b = ANY (ARRAY[''1'', ''26'']) AND c = ANY (ARRAY[1])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 2, 26, 27, 51, 52, 76, 77]) AND b = ANY (ARRAY[''1'', ''2'', ''26'', ''27'']) AND c = ANY (ARRAY[1, 2])');
+
+-- ANY with inequalities should not benefit from functional dependencies
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a < ANY (ARRAY[1, 51]) AND b > ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a >= ANY (ARRAY[1, 51]) AND b <= ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a <= ANY (ARRAY[1, 2, 51, 52]) AND b >= ANY (ARRAY[''1'', ''2''])');
+
+-- ALL (should not benefit from functional dependencies)
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 51) AND b = ALL (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ALL (ARRAY[''1'', ''2''])');
+
 -- check change of column type doesn't break it
 ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
 
-- 
2.21.1

0004-Fix-handling-of-ScalarArrayOpExpr-with-ALL-clause.patchtext/plain; charset=us-asciiDownload

From 4829e3f64b9ad58cc39979e9dd0671f6c3a083df Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Wed, 4 Mar 2020 15:57:09 +0100
Subject: [PATCH 04/11] Fix handling of ScalarArrayOpExpr with ALL clause

Simply reject all ALL cases, irrespectedly of the estimation function.
---
 src/backend/statistics/dependencies.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index a75b9d73db..72dc1cd1bd 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -806,6 +806,15 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 		/* If it's an scalar array operator, check for Var IN Const. */
 		ScalarArrayOpExpr	   *expr = (ScalarArrayOpExpr *) rinfo->clause;
 
+		/*
+		 * Reject ALL() variant, we only care about ANY/IN.
+		 *
+		 * FIXME Maybe we should check if all the values are the same, and
+		 * allow ALL in that case? Doesn't seem very practical, though.
+		 */
+		if (!expr->useOr)
+			return false;
+
 		/* Only expressions with two arguments are candidates. */
 		if (list_length(expr->args) != 2)
 			return false;
-- 
2.21.1

0005-Apply-multi-column-MCV-lists-to-ScalarArrayOpExpr.patchtext/plain; charset=us-asciiDownload

From 6f4096b3a036f2faa9049e910a9a14db12a48753 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Wed, 4 Mar 2020 15:57:25 +0100
Subject: [PATCH 05/11] Apply multi-column MCV lists to ScalarArrayOpExpr

Commit ca77b08772 added handling of ScalarArrayOpExpr to dependencies,
but there's no good reason not to support them in MCV lists too. In
fact, MCV lists can handle all cases, including inequalities and ALL.
---
 src/backend/statistics/extended_stats.c       |  66 ++++++++++-
 src/backend/statistics/mcv.c                  | 111 +++++++++++++++++-
 .../statistics/extended_stats_internal.h      |   4 +-
 src/test/regress/expected/stats_ext.out       |  60 ++++++++++
 src/test/regress/sql/stats_ext.sql            |  20 ++++
 5 files changed, 252 insertions(+), 9 deletions(-)

diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index 03e69d057f..159ec7f178 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -993,7 +993,63 @@ statext_is_compatible_clause_internal(PlannerInfo *root, Node *clause,
 			return false;
 
 		/* Check if the expression the right shape (one Var, one Const) */
-		if (!examine_opclause_expression(expr, &var, NULL, NULL))
+		if (!examine_clause_args(expr->args, &var, NULL, NULL))
+			return false;
+
+		/*
+		 * If it's not one of the supported operators ("=", "<", ">", etc.),
+		 * just ignore the clause, as it's not compatible with MCV lists.
+		 *
+		 * This uses the function for estimating selectivity, not the operator
+		 * directly (a bit awkward, but well ...).
+		 */
+		switch (get_oprrest(expr->opno))
+		{
+			case F_EQSEL:
+			case F_NEQSEL:
+			case F_SCALARLTSEL:
+			case F_SCALARLESEL:
+			case F_SCALARGTSEL:
+			case F_SCALARGESEL:
+				/* supported, will continue with inspection of the Var */
+				break;
+
+			default:
+				/* other estimators are considered unknown/unsupported */
+				return false;
+		}
+
+		/*
+		 * If there are any securityQuals on the RTE from security barrier
+		 * views or RLS policies, then the user may not have access to all the
+		 * table's data, and we must check that the operator is leak-proof.
+		 *
+		 * If the operator is leaky, then we must ignore this clause for the
+		 * purposes of estimating with MCV lists, otherwise the operator might
+		 * reveal values from the MCV list that the user doesn't have
+		 * permission to see.
+		 */
+		if (rte->securityQuals != NIL &&
+			!get_func_leakproof(get_opcode(expr->opno)))
+			return false;
+
+		return statext_is_compatible_clause_internal(root, (Node *) var,
+													 relid, attnums);
+	}
+
+	/* Var IN Array */
+	if (IsA(clause, ScalarArrayOpExpr))
+	{
+		RangeTblEntry	   *rte = root->simple_rte_array[relid];
+		ScalarArrayOpExpr  *expr = (ScalarArrayOpExpr *) clause;
+		Var		   *var;
+
+		/* Only expressions with two arguments are considered compatible. */
+		if (list_length(expr->args) != 2)
+			return false;
+
+		/* Check if the expression the right shape (one Var, one Const) */
+		if (!examine_clause_args(expr->args, &var, NULL, NULL))
 			return false;
 
 		/*
@@ -1395,7 +1451,7 @@ statext_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
  * on which side of the operator we found the Var node.
  */
 bool
-examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonleftp)
+examine_clause_args(List *args, Var **varp, Const **cstp, bool *varonleftp)
 {
 	Var	   *var;
 	Const  *cst;
@@ -1404,10 +1460,10 @@ examine_opclause_expression(OpExpr *expr, Var **varp, Const **cstp, bool *varonl
 		   *rightop;
 
 	/* enforced by statext_is_compatible_clause_internal */
-	Assert(list_length(expr->args) == 2);
+	Assert(list_length(args) == 2);
 
-	leftop = linitial(expr->args);
-	rightop = lsecond(expr->args);
+	leftop = linitial(args);
+	rightop = lsecond(args);
 
 	/* strip RelabelType from either side of the expression */
 	if (IsA(leftop, RelabelType))
diff --git a/src/backend/statistics/mcv.c b/src/backend/statistics/mcv.c
index 87e232fdd4..3147d8fedc 100644
--- a/src/backend/statistics/mcv.c
+++ b/src/backend/statistics/mcv.c
@@ -1579,7 +1579,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 			OpExpr	   *expr = (OpExpr *) clause;
 			FmgrInfo	opproc;
 
-			/* valid only after examine_opclause_expression returns true */
+			/* valid only after examine_clause_args returns true */
 			Var		   *var;
 			Const	   *cst;
 			bool		varonleft;
@@ -1587,7 +1587,7 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 			fmgr_info(get_opcode(expr->opno), &opproc);
 
 			/* extract the var and const from the expression */
-			if (examine_opclause_expression(expr, &var, &cst, &varonleft))
+			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
 			{
 				int			idx;
 
@@ -1652,6 +1652,113 @@ mcv_get_match_bitmap(PlannerInfo *root, List *clauses,
 				}
 			}
 		}
+		else if (IsA(clause, ScalarArrayOpExpr))
+		{
+			ScalarArrayOpExpr   *expr = (ScalarArrayOpExpr *) clause;
+			FmgrInfo	opproc;
+
+			/* valid only after examine_clause_args returns true */
+			Var		   *var;
+			Const	   *cst;
+			bool		varonleft;
+
+			fmgr_info(get_opcode(expr->opno), &opproc);
+
+			/* extract the var and const from the expression */
+			if (examine_clause_args(expr->args, &var, &cst, &varonleft))
+			{
+				int			idx;
+
+				ArrayType  *arrayval;
+				int16		elmlen;
+				bool		elmbyval;
+				char		elmalign;
+				int			num_elems;
+				Datum	   *elem_values;
+				bool	   *elem_nulls;
+
+				/* ScalarArrayOpExpr has the Var always on the left */
+				Assert(varonleft);
+
+				if (!cst->constisnull)
+				{
+					arrayval = DatumGetArrayTypeP(cst->constvalue);
+					get_typlenbyvalalign(ARR_ELEMTYPE(arrayval),
+										 &elmlen, &elmbyval, &elmalign);
+					deconstruct_array(arrayval,
+									  ARR_ELEMTYPE(arrayval),
+									  elmlen, elmbyval, elmalign,
+									  &elem_values, &elem_nulls, &num_elems);
+				}
+
+				/* match the attribute to a dimension of the statistic */
+				idx = bms_member_index(keys, var->varattno);
+
+				/*
+				 * Walk through the MCV items and evaluate the current clause.
+				 * We can skip items that were already ruled out, and
+				 * terminate if there are no remaining MCV items that might
+				 * possibly match.
+				 */
+				for (i = 0; i < mcvlist->nitems; i++)
+				{
+					int			j;
+					bool		match = (expr->useOr ? false : true);
+					MCVItem    *item = &mcvlist->items[i];
+
+					/*
+					 * When the MCV item or the Const value is NULL we can treat
+					 * this as a mismatch. We must not call the operator because
+					 * of strictness.
+					 */
+					if (item->isnull[idx] || cst->constisnull)
+					{
+						matches[i] = RESULT_MERGE(matches[i], is_or, false);
+						continue;
+					}
+
+					/*
+					 * Skip MCV items that can't change result in the bitmap.
+					 * Once the value gets false for AND-lists, or true for
+					 * OR-lists, we don't need to look at more clauses.
+					 */
+					if (RESULT_IS_FINAL(matches[i], is_or))
+						continue;
+
+					for (j = 0; j < num_elems; j++)
+					{
+						Datum	elem_value = elem_values[j];
+						bool	elem_isnull = elem_nulls[j];
+						bool	elem_match;
+
+						/* NULL values always evaluate as not matching. */
+						if (elem_isnull)
+						{
+							match = RESULT_MERGE(match, expr->useOr, false);
+							continue;
+						}
+
+						/*
+						 * Stop evaluating the array elements once we reach
+						 * match value that can't change - ALL() is the same
+						 * as AND-list, ANY() is the same as OR-list.
+						 */
+						if (RESULT_IS_FINAL(match, expr->useOr))
+							break;
+
+						elem_match = DatumGetBool(FunctionCall2Coll(&opproc,
+																	var->varcollid,
+																	item->values[idx],
+																	elem_value));
+
+						match = RESULT_MERGE(match, expr->useOr, elem_match);
+					}
+
+					/* update the match bitmap with the result */
+					matches[i] = RESULT_MERGE(matches[i], is_or, match);
+				}
+			}
+		}
 		else if (IsA(clause, NullTest))
 		{
 			NullTest   *expr = (NullTest *) clause;
diff --git a/src/include/statistics/extended_stats_internal.h b/src/include/statistics/extended_stats_internal.h
index b512ee908a..2b14ab238c 100644
--- a/src/include/statistics/extended_stats_internal.h
+++ b/src/include/statistics/extended_stats_internal.h
@@ -96,8 +96,8 @@ extern SortItem *build_sorted_items(int numrows, int *nitems, HeapTuple *rows,
 									TupleDesc tdesc, MultiSortSupport mss,
 									int numattrs, AttrNumber *attnums);
 
-extern bool examine_opclause_expression(OpExpr *expr, Var **varp,
-										Const **cstp, bool *varonleftp);
+extern bool examine_clause_args(List *args, Var **varp,
+								Const **cstp, bool *varonleftp);
 
 extern Selectivity mcv_clauselist_selectivity(PlannerInfo *root,
 											  StatisticExtInfo *stat,
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 6a628f5680..9fa659c71d 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -827,6 +827,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        343 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
+ estimated | actual 
+-----------+--------
+         8 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+         8 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
+ estimated | actual 
+-----------+--------
+        26 |    150
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
+ estimated | actual 
+-----------+--------
+        10 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
+ estimated | actual 
+-----------+--------
+         1 |    100
+(1 row)
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 ANALYZE mcv_lists;
@@ -872,6 +902,36 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+ estimated | actual 
+-----------+--------
+       200 |    200
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
+ estimated | actual 
+-----------+--------
+       150 |    150
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
 -- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
  estimated | actual 
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 3de2be500a..0ece39a279 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -461,6 +461,16 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = '
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
+
 -- create statistics
 CREATE STATISTICS mcv_lists_stats (mcv) ON a, b, c FROM mcv_lists;
 
@@ -480,6 +490,16 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= 4 AND b <
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a IN (1, 2, 51, 52) AND b IN ( ''1'', ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = ANY (ARRAY[1, 2, 51, 52]) AND b = ANY (ARRAY[''1'', ''2''])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a <= ANY (ARRAY[1, 2, 3]) AND b IN (''1'', ''2'', ''3'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND c > ANY (ARRAY[1, 2, 3])');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a < ALL (ARRAY[4, 5]) AND b IN (''1'', ''2'', ''3'') AND c > ANY (ARRAY[1, 2, 3])');
+
 -- we can't use the statistic for OR clauses that are not fully covered (missing 'd' attribute)
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists WHERE a = 1 OR b = ''1'' OR c = 1 OR d IS NOT NULL');
 
-- 
2.21.1

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#4)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

[ For the sake of the archives, some of the discussion on the other
thread [1-3] should really have been on this thread. ]

On Sun, 2 Feb 2020 at 18:41, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I think the challenge here is in applying the functional dependency
computed for the whole array to individual elements. I'm not sure we can
do that.

For example, with a table like this:

CREATE TABLE t (a int, b int[]);
CREATE STATISTICS s (dependencies) ON a, b FROM t;

Let's say the functional dependency is "perfect" i.e. has strength 1.0.
But that only tells us dependency for complete array values, we don't
know how much information we gain by knowledge of subset of the values.

The more I think about this example, the more I think this is really
just a special case of the more general problem of compatibility of
clauses. Once you add support for IN (...) clauses, any query of the
form

SELECT ... WHERE (any clauses on col a) AND (any clauses on col b)

can be recast as

SELECT ... WHERE a IN (...) AND b IN (...)

so any counter-example with bad estimates produced with a query in the
first form can also be written in the second form.

I think we should really be thinking in terms of making a strong
functional dependency (a => b) applicable generally to queries in the
first form, which will work well if the clauses on b are compatible
with those on b, but not if they're incompatible. However, that's not
so very different from the current state without extended stats, which
assumes independence, and will return poor estimates if the
columns/clauses aren't independent.

So I'd be tempted to apply a tidied up version of the patch from [3]/messages/by-id/CAEZATCUic8PwhTnexC+Ux-Z_e5MhWD-8jk=J1MtnVW8TJD+VHw@mail.gmail.com,
and then lift all restrictions from dependency_is_compatible_clause(),
other than the requirement that the clause refer to a single variable.

Regards,
Dean

[1]: /messages/by-id/CAEZATCXaNFZyOhR4XXAfkvj1tibRBEjje6ZbXwqWUB_tqbH=rw@mail.gmail.com
[2]: /messages/by-id/20200309181915.5lxhuw2qxoihfoqo@development
[3]: /messages/by-id/CAEZATCUic8PwhTnexC+Ux-Z_e5MhWD-8jk=J1MtnVW8TJD+VHw@mail.gmail.com

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#6)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Thu, Mar 12, 2020 at 10:25:41AM +0000, Dean Rasheed wrote:

[ For the sake of the archives, some of the discussion on the other
thread [1-3] should really have been on this thread. ]

On Sun, 2 Feb 2020 at 18:41, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I think the challenge here is in applying the functional dependency
computed for the whole array to individual elements. I'm not sure we can
do that.

For example, with a table like this:

CREATE TABLE t (a int, b int[]);
CREATE STATISTICS s (dependencies) ON a, b FROM t;

Let's say the functional dependency is "perfect" i.e. has strength 1.0.
But that only tells us dependency for complete array values, we don't
know how much information we gain by knowledge of subset of the values.

The more I think about this example, the more I think this is really
just a special case of the more general problem of compatibility of
clauses. Once you add support for IN (...) clauses, any query of the
form

SELECT ... WHERE (any clauses on col a) AND (any clauses on col b)

can be recast as

SELECT ... WHERE a IN (...) AND b IN (...)

so any counter-example with bad estimates produced with a query in the
first form can also be written in the second form.

I think we should really be thinking in terms of making a strong
functional dependency (a => b) applicable generally to queries in the
first form, which will work well if the clauses on b are compatible
with those on b, but not if they're incompatible. However, that's not
so very different from the current state without extended stats, which
assumes independence, and will return poor estimates if the
columns/clauses aren't independent.

I'm sorry, but I don't see how we could do this for arbitrary clauses. I
think we could do that for clauses that have equality semantics and
reference column values as a whole. So I think it's possible to do this
for IN clauses (which is what the first part of the patch does), but I
don't think we can do it for the containment operator.

I.e. we can do that for

WHERE a IN (...) AND b IN (...)

but I don't see how we could do that for

WHERE a @> (...) AND b @> (...)

I don't think the dependency degree gives us any reliable insight into
statistical dependency of elements of the values.

Or maybe we're just talking about different things? You seem to be
talking abotu IN clauses (which I think is doable), but my question was
about using functional dependencies to estimate array containment
clauses (which I think is not really doable).

So I'd be tempted to apply a tidied up version of the patch from [3],
and then lift all restrictions from dependency_is_compatible_clause(),
other than the requirement that the clause refer to a single variable.

I haven't looked at the patch from [3] closely yet, but you're right

P(A & B) <= Min(P(A), P(B))

and the approach you proposed seems reasonable. I don't think how we
can just remove all the restriction on clause type - the restriction
that dependencies only handle equality-like clauses seems pretty much
baked into the dependencies.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#7)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I'm sorry, but I don't see how we could do this for arbitrary clauses. I
think we could do that for clauses that have equality semantics and
reference column values as a whole. So I think it's possible to do this
for IN clauses (which is what the first part of the patch does), but I
don't think we can do it for the containment operator.

I.e. we can do that for

WHERE a IN (...) AND b IN (...)

Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

a | b
-------
1 | 1
2 | 2
3 | 3
: | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.

However, I don't think such artificial examples are that useful. I
think you have to think in terms of real data distributions together
with real queries expected to go with them. For example:

Using the OP's original example of a multi-tenant system, you might
well have a table with columns (product_type, tenant_id) and a
functional dependency product_type => tenant_id. In that case, it
could well be very useful in optimising queries like "WHERE
product_type IN (X,Y,Z) AND tenant_id = 123".

But this isn't necessarily limited to = and IN (...). For example,
consider a table with UK-based geographic data with columns (location
point, postcode text). Then there would be a strong functional
dependency location => postcode (and possibly also the other way
round, depending on how dense the points were). That dependency could
be used to estimate much more general queries like "WHERE location <@
some_area AND postcode ~ '^CB.*'", where there may be no useful stats
on location, but a histogram on postcode might give a reasonable
estimate.

This also extends to inequalities. For example a table with columns
(weight, category) might have a strong functional dependency weight =>
category. Then a query like "WHERE weight > 10 AND weight < 20 AND
category = 'large'" could get a decent estimate from a histogram on
the weight column, and then use the functional dependency to note that
that implies the category. Note that such an example would work with
my patch from the other thread, because it groups clauses by column,
and uses clauselist_selectivity_simple() on them. So in this case, the
clauses "weight > 10 AND weight < 20" would be estimated together, and
would be able to make use of the RangeQueryClause code.

Of course, it's equally easy to come up with counter-example queries
for any of those examples, where using the functional dependency would
produce a poor estimate. Ultimately, it's up to the user to decide
whether or not to build functional dependency statistics, and that
decision needs to be based not just on the data distribution, but also
on the types of queries expected.

Given the timing though, perhaps it is best to limit this to IN (..)
clauses for PG13, and we can consider other possibilities later.

Regards,
Dean

Bruce Momjian

bruce@momjian.us

almost 6 years ago

In reply to: Dean Rasheed (#8)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote:

On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I'm sorry, but I don't see how we could do this for arbitrary clauses. I
think we could do that for clauses that have equality semantics and
reference column values as a whole. So I think it's possible to do this
for IN clauses (which is what the first part of the patch does), but I
don't think we can do it for the containment operator.

I.e. we can do that for

WHERE a IN (...) AND b IN (...)

Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

a | b
-------
1 | 1
2 | 2
3 | 3
: | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.

Wow, that is a very good example --- the arrays do not tie elements in
one array to elements in another array; good point. I get it now!

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

#10

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#8)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Fri, Mar 13, 2020 at 08:42:49AM +0000, Dean Rasheed wrote:

On Thu, 12 Mar 2020 at 17:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I'm sorry, but I don't see how we could do this for arbitrary clauses. I
think we could do that for clauses that have equality semantics and
reference column values as a whole. So I think it's possible to do this
for IN clauses (which is what the first part of the patch does), but I
don't think we can do it for the containment operator.

I.e. we can do that for

WHERE a IN (...) AND b IN (...)

Hmm, the difficulty always comes back to the compatibility of the
clauses though. It's easy to come up with artificial examples for
which functional dependencies come up with bad estimates, even with
just = and IN (...) operators. For example, given a perfect
correlation like

a | b
-------
1 | 1
2 | 2
3 | 3
: | :

you only need to write a query like "WHERE a IN (1,3,5,7,9,...) AND b
IN (2,4,6,8,...)" to get a very bad estimate from functional
dependencies.

However, I don't think such artificial examples are that useful. I
think you have to think in terms of real data distributions together
with real queries expected to go with them. For example:

Using the OP's original example of a multi-tenant system, you might
well have a table with columns (product_type, tenant_id) and a
functional dependency product_type => tenant_id. In that case, it
could well be very useful in optimising queries like "WHERE
product_type IN (X,Y,Z) AND tenant_id = 123".

But this isn't necessarily limited to = and IN (...). For example,
consider a table with UK-based geographic data with columns (location
point, postcode text). Then there would be a strong functional
dependency location => postcode (and possibly also the other way
round, depending on how dense the points were). That dependency could
be used to estimate much more general queries like "WHERE location <@
some_area AND postcode ~ '^CB.*'", where there may be no useful stats
on location, but a histogram on postcode might give a reasonable
estimate.

This also extends to inequalities. For example a table with columns
(weight, category) might have a strong functional dependency weight =>
category. Then a query like "WHERE weight > 10 AND weight < 20 AND
category = 'large'" could get a decent estimate from a histogram on
the weight column, and then use the functional dependency to note that
that implies the category. Note that such an example would work with
my patch from the other thread, because it groups clauses by column,
and uses clauselist_selectivity_simple() on them. So in this case, the
clauses "weight > 10 AND weight < 20" would be estimated together, and
would be able to make use of the RangeQueryClause code.

Of course, it's equally easy to come up with counter-example queries
for any of those examples, where using the functional dependency would
produce a poor estimate. Ultimately, it's up to the user to decide
whether or not to build functional dependency statistics, and that
decision needs to be based not just on the data distribution, but also
on the types of queries expected.

Well, yeah. I'm sure we can produce countless examples where applying
the functional dependencies to additional types of clauses helps a lot.
I'm somewhat hesitant to just drop any restrictions, though, because
it's equally simple to produce examples with poor results.

The main issue I have with just applying dependencies to arbitrary
clauses is it uses "degree" computed for the value as a whole, and
just uses it to estimate dependency between pieces of the values.

The IN() clause does not have this problem, the other cases like @> or
pattern matching do.

Given the timing though, perhaps it is best to limit this to IN (..)
clauses for PG13, and we can consider other possibilities later.

Yeah, I was gonna propose the same thing. I'll get the IN bit committed
shortly (both for dependencies and MCV), improved handling of OR clauses
and some additional regression tests to increase the coverage.

Then we can discuss these improvement in more detail for PG14.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#10)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

Hi,

I've pushed the first part of the patch, adding ScalarArrayOpExpr as
supported clause for functional dependencies, and then also doing the
same for MCV lists.

As discussed, I'm not going to do anything about the array containment
clauses for PG13, that needs more discussion.

I have a bunch of additional improvements for extended stats, discussed
in [1]/messages/by-id/20200113230008.g67iyk4cs3xbnjju@development. I'll wait a bit for buildfarm and maybe some feedback before
pushing those.

[1]: /messages/by-id/20200113230008.g67iyk4cs3xbnjju@development

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#11)

1 attachment(s)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

Hi,

I realized there's one more thing that probably needs discussing.
Essentially, these two clause types are the same:

a IN (1, 2, 3)

(a = 1 OR a = 2 OR a = 3)

but with 8f321bd1 we only recognize the first one as compatible with
functional dependencies. It was always the case that we estimated those
two clauses a bit differently, but the differences were usually small.
But now that we recognize IN as compatible with dependencies, the
difference may be much larger, which bugs me a bit ...

So I wonder if we should recognize the special form of an OR clause,
with all Vars referencing to the same attribute etc. and treat this as
supported by functional dependencies - the attached patch does that.
MCV lists there's already no difference because OR clauses are
supported.

The question is whether we want to do this, and whether we should also
teach the per-column estimates to recognize this special case of IN
clause. That would allow producing exactly the same estimates even with
functional dependencies - recognizing the OR clause as supported gets us
only half-way there, because we still use estimates for each clause (and
those will be slightly different).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

estimate-OR-similar-to-IN-clauses.patchtext/plain; charset=us-asciiDownload

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 72dc1cd1bd..5f9b43bf7f 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -753,24 +753,27 @@ pg_dependencies_send(PG_FUNCTION_ARGS)
 static bool
 dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 {
-	RestrictInfo *rinfo = (RestrictInfo *) clause;
 	Var		   *var;
 
-	if (!IsA(rinfo, RestrictInfo))
-		return false;
+	if (IsA(clause, RestrictInfo))
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) clause;
 
-	/* Pseudoconstants are not interesting (they couldn't contain a Var) */
-	if (rinfo->pseudoconstant)
-		return false;
+		/* Pseudoconstants are not interesting (they couldn't contain a Var) */
+		if (rinfo->pseudoconstant)
+			return false;
 
-	/* Clauses referencing multiple, or no, varnos are incompatible */
-	if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
-		return false;
+		/* Clauses referencing multiple, or no, varnos are incompatible */
+		if (bms_membership(rinfo->clause_relids) != BMS_SINGLETON)
+			return false;
 
-	if (is_opclause(rinfo->clause))
+		clause = (Node *) rinfo->clause;
+	}
+
+	if (is_opclause(clause))
 	{
 		/* If it's an opclause, check for Var = Const or Const = Var. */
-		OpExpr	   *expr = (OpExpr *) rinfo->clause;
+		OpExpr	   *expr = (OpExpr *) clause;
 
 		/* Only expressions with two arguments are candidates. */
 		if (list_length(expr->args) != 2)
@@ -801,10 +804,10 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 
 		/* OK to proceed with checking "var" */
 	}
-	else if (IsA(rinfo->clause, ScalarArrayOpExpr))
+	else if (IsA(clause, ScalarArrayOpExpr))
 	{
 		/* If it's an scalar array operator, check for Var IN Const. */
-		ScalarArrayOpExpr	   *expr = (ScalarArrayOpExpr *) rinfo->clause;
+		ScalarArrayOpExpr	   *expr = (ScalarArrayOpExpr *) clause;
 
 		/*
 		 * Reject ALL() variant, we only care about ANY/IN.
@@ -839,13 +842,43 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 
 		/* OK to proceed with checking "var" */
 	}
-	else if (is_notclause(rinfo->clause))
+	else if (is_orclause(clause))
+	{
+		BoolExpr   *expr = (BoolExpr *) clause;
+		ListCell   *lc;
+
+		/* start with no attribute number */
+		*attnum = InvalidAttrNumber;
+
+		foreach(lc, expr->args)
+		{
+			AttrNumber	clause_attnum;
+
+			/*
+			 * Had we found incompatible clause in the arguments, treat the
+			 * whole clause as incompatible.
+			 */
+			if (!dependency_is_compatible_clause((Node *) lfirst(lc),
+												 relid, &clause_attnum))
+				return false;
+
+			if (*attnum == InvalidAttrNumber)
+				*attnum = clause_attnum;
+
+			if (*attnum != clause_attnum)
+				return false;
+		}
+
+		/* the Var is already checked by the recursive call */
+		return true;
+	}
+	else if (is_notclause(clause))
 	{
 		/*
 		 * "NOT x" can be interpreted as "x = false", so get the argument and
 		 * proceed with seeing if it's a suitable Var.
 		 */
-		var = (Var *) get_notclausearg(rinfo->clause);
+		var = (Var *) get_notclausearg(clause);
 	}
 	else
 	{
@@ -853,7 +886,7 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
 		 * A boolean expression "x" can be interpreted as "x = true", so
 		 * proceed with seeing if it's a suitable Var.
 		 */
-		var = (Var *) rinfo->clause;
+		var = (Var *) clause;
 	}
 
 	/*
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
index 9fa659c71d..74da75c3c6 100644
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -458,6 +458,32 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
          3 |    400
 (1 row)
 
+-- OR clauses referencing the same attribute
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+         2 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND (b = ''1'' OR b = ''2'')');
+ estimated | actual 
+-----------+--------
+         4 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 2 OR a = 51 OR a = 52) AND (b = ''1'' OR b = ''2'')');
+ estimated | actual 
+-----------+--------
+         8 |    200
+(1 row)
+
+-- OR clauses referencing different attributes
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR b = ''1'') AND b = ''1''');
+ estimated | actual 
+-----------+--------
+         3 |    100
+(1 row)
+
 -- ANY
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');
  estimated | actual 
@@ -585,6 +611,32 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
        400 |    400
 (1 row)
 
+-- OR clauses referencing the same attribute
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+        99 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND (b = ''1'' OR b = ''2'')');
+ estimated | actual 
+-----------+--------
+        99 |    100
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 2 OR a = 51 OR a = 52) AND (b = ''1'' OR b = ''2'')');
+ estimated | actual 
+-----------+--------
+       197 |    200
+(1 row)
+
+-- OR clauses referencing different attributes are incompatible
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR b = ''1'') AND b = ''1''');
+ estimated | actual 
+-----------+--------
+         3 |    100
+(1 row)
+
 -- ANY
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');
  estimated | actual 
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 0ece39a279..d9c935896b 100644
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -286,6 +286,16 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 26, 27, 51, 52, 76, 77) AND b IN (''1'', ''2'', ''26'', ''27'') AND c IN (1, 2)');
 
+-- OR clauses referencing the same attribute
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND b = ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND (b = ''1'' OR b = ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 2 OR a = 51 OR a = 52) AND (b = ''1'' OR b = ''2'')');
+
+-- OR clauses referencing different attributes
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR b = ''1'') AND b = ''1''');
+
 -- ANY
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');
 
@@ -335,6 +345,16 @@ SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 26, 27, 51, 52, 76, 77) AND b IN (''1'', ''2'', ''26'', ''27'') AND c IN (1, 2)');
 
+-- OR clauses referencing the same attribute
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND b = ''1''');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 51) AND (b = ''1'' OR b = ''2'')');
+
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR a = 2 OR a = 51 OR a = 52) AND (b = ''1'' OR b = ''2'')');
+
+-- OR clauses referencing different attributes are incompatible
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE (a = 1 OR b = ''1'') AND b = ''1''');
+
 -- ANY
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = ANY (ARRAY[1, 51]) AND b = ''1''');

#13

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#12)

1 attachment(s)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I realized there's one more thing that probably needs discussing.
Essentially, these two clause types are the same:

a IN (1, 2, 3)

(a = 1 OR a = 2 OR a = 3)

but with 8f321bd1 we only recognize the first one as compatible with
functional dependencies. It was always the case that we estimated those
two clauses a bit differently, but the differences were usually small.
But now that we recognize IN as compatible with dependencies, the
difference may be much larger, which bugs me a bit ...

So I wonder if we should recognize the special form of an OR clause,
with all Vars referencing to the same attribute etc. and treat this as
supported by functional dependencies - the attached patch does that.
MCV lists there's already no difference because OR clauses are
supported.

Makes sense, and the patch looks straightforward enough.

The question is whether we want to do this, and whether we should also
teach the per-column estimates to recognize this special case of IN
clause.

I'm not convinced about that second part though. I'd say that
recognising the OR clause for functional dependencies is sufficient to
prevent the large differences in estimates relative to the equivalent
IN clauses. The small differences between the way that OR and IN
clauses are handled have always been there, and I think that changing
that is out of scope for this work.

The other thing that I'm still concerned about is the possibility of
returning estimates with P(a,b) > P(a) or P(b). I think that such a
thing becomes much more likely with the new types of clause supported
here, because they now allow multiple values from each column, where
before we only allowed one. I took another look at the patch I posted
on the other thread, and I've convinced myself that it's correct.
Attached is an updated version, with some cosmetic tidying up and now
with some additional regression tests.

Regards,
Dean

Attachments:

dependencies_clauselist_selectivity-v2.patchtext/x-patch; charset=US-ASCII; name=dependencies_clauselist_selectivity-v2.patchDownload

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index 72dc1cd..e0cc2b9
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -30,6 +30,7 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 #include "utils/typcache.h"
 
@@ -73,8 +74,6 @@ static double dependency_degree(int numr
 								AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs);
 static bool dependency_is_fully_matched(MVDependency *dependency,
 										Bitmapset *attnums);
-static bool dependency_implies_attribute(MVDependency *dependency,
-										 AttrNumber attnum);
 static bool dependency_is_compatible_clause(Node *clause, Index relid,
 											AttrNumber *attnum);
 static MVDependency *find_strongest_dependency(MVDependencies **dependencies,
@@ -614,19 +613,6 @@ dependency_is_fully_matched(MVDependency
 }
 
 /*
- * dependency_implies_attribute
- *		check that the attnum matches is implied by the functional dependency
- */
-static bool
-dependency_implies_attribute(MVDependency *dependency, AttrNumber attnum)
-{
-	if (attnum == dependency->attributes[dependency->nattributes - 1])
-		return true;
-
-	return false;
-}
-
-/*
  * statext_dependencies_load
  *		Load the functional dependencies for the indicated pg_statistic_ext tuple
  */
@@ -966,7 +952,7 @@ find_strongest_dependency(MVDependencies
  * between them, i.e. either (a=>b) or (b=>a). Assuming (a=>b) is the selected
  * dependency, we then combine the per-clause selectivities using the formula
  *
- *	   P(a,b) = P(a) * [f + (1-f)*P(b)]
+ *	   P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
  *
  * where 'f' is the degree of the dependency.
  *
@@ -974,7 +960,7 @@ find_strongest_dependency(MVDependencies
  * recursively, starting with the widest/strongest dependencies. For example
  * P(a,b,c) is first split like this:
  *
- *	   P(a,b,c) = P(a,b) * [f + (1-f)*P(c)]
+ *	   P(a,b,c) = f * Min(P(a,b), P(c)) + (1-f) * P(a,b) * P(c)
  *
  * assuming (a,b=>c) is the strongest dependency.
  */
@@ -987,20 +973,27 @@ dependencies_clauselist_selectivity(Plan
 									RelOptInfo *rel,
 									Bitmapset **estimatedclauses)
 {
-	Selectivity s1 = 1.0;
+	Selectivity s1;
 	ListCell   *l;
-	Bitmapset  *clauses_attnums = NULL;
-	Bitmapset **list_attnums;
+	Bitmapset  *clauses_attnums;
+	AttrNumber *list_attnums;
 	int			listidx;
-	MVDependencies    **dependencies = NULL;
-	int					ndependencies = 0;
+	MVDependencies **dependencies;
+	int			ndependencies;
+	int			total_ndeps;
+	MVDependency **dependencies_used;
+	int			ndependencies_used;
+	Bitmapset  *attrs_used;
+	int			nattrs_used;
+	Selectivity *attr_sel;
 	int			i;
+	int			attidx;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
 		return 1.0;
 
-	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
+	list_attnums = (AttrNumber *) palloc(sizeof(AttrNumber) *
 										 list_length(clauses));
 
 	/*
@@ -1014,6 +1007,7 @@ dependencies_clauselist_selectivity(Plan
 	 * We also skip clauses that we already estimated using different types of
 	 * statistics (we treat them as incompatible).
 	 */
+	clauses_attnums = NULL;
 	listidx = 0;
 	foreach(l, clauses)
 	{
@@ -1023,11 +1017,11 @@ dependencies_clauselist_selectivity(Plan
 		if (!bms_is_member(listidx, *estimatedclauses) &&
 			dependency_is_compatible_clause(clause, rel->relid, &attnum))
 		{
-			list_attnums[listidx] = bms_make_singleton(attnum);
+			list_attnums[listidx] = attnum;
 			clauses_attnums = bms_add_member(clauses_attnums, attnum);
 		}
 		else
-			list_attnums[listidx] = NULL;
+			list_attnums[listidx] = InvalidAttrNumber;
 
 		listidx++;
 	}
@@ -1039,6 +1033,7 @@ dependencies_clauselist_selectivity(Plan
 	 */
 	if (bms_num_members(clauses_attnums) < 2)
 	{
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
@@ -1055,6 +1050,7 @@ dependencies_clauselist_selectivity(Plan
 	 * to moving the freed chunks to freelists etc.
 	 */
 	ndependencies = 0;
+	total_ndeps = 0;
 	dependencies = (MVDependencies **) palloc(sizeof(MVDependencies *) *
 											  list_length(rel->statlist));
 
@@ -1076,104 +1072,185 @@ dependencies_clauselist_selectivity(Plan
 		if (num_matched < 2)
 			continue;
 
-		dependencies[ndependencies++]
+		dependencies[ndependencies]
 			= statext_dependencies_load(stat->statOid);
+
+		total_ndeps += dependencies[ndependencies]->ndeps;
+		ndependencies++;
 	}
 
 	/* if no matching stats could be found then we've nothing to do */
 	if (!ndependencies)
 	{
+		pfree(dependencies);
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
 
 	/*
-	 * Apply the dependencies recursively, starting with the widest/strongest
-	 * ones, and proceeding to the smaller/weaker ones. At the end of each
-	 * round we factor in the selectivity of clauses on the implied attribute,
-	 * and remove the clauses from the list.
+	 * Work out which dependencies we can apply, starting with the
+	 * widest/stongest ones, and proceeding to smaller/weaker ones.
 	 */
+	ndependencies_used = 0;
+	dependencies_used = (MVDependency **) palloc(sizeof(MVDependency *) *
+												 total_ndeps);
+
+	attrs_used = NULL;
+
 	while (true)
 	{
-		Selectivity s2 = 1.0;
 		MVDependency *dependency;
+		AttrNumber	attnum;
 
-		/* the widest/strongest dependency, fully matched by clauses */
+		/* The widest/strongest dependency, fully matched by clauses */
 		dependency = find_strongest_dependency(dependencies, ndependencies,
 											   clauses_attnums);
-
-		/* if no suitable dependency was found, we're done */
 		if (!dependency)
 			break;
 
+		dependencies_used[ndependencies_used++] = dependency;
+
+		/* Track all attributes used */
+		for (i = 0; i < dependency->nattributes; i++)
+		{
+			attnum = dependency->attributes[i];
+			attrs_used = bms_add_member(attrs_used, attnum);
+		}
+
 		/*
-		 * We found an applicable dependency, so find all the clauses on the
-		 * implied attribute - with dependency (a,b => c) we look for clauses
-		 * on 'c'.
+		 * Ignore dependencies involving the implied attribute in later loops.
 		 */
+		clauses_attnums = bms_del_member(clauses_attnums, attnum);
+	}
+
+	if (!ndependencies_used)
+	{
+		bms_free(attrs_used);
+		pfree(dependencies_used);
+		pfree(dependencies);
+		bms_free(clauses_attnums);
+		pfree(list_attnums);
+		return 1.0;
+	}
+
+	/*
+	 * For each attribute used in the dependencies to be applied, compute the
+	 * single-column selectivity of all the clauses using that attribute, and
+	 * mark all those clauses as having been estimated.
+	 */
+	nattrs_used = bms_num_members(attrs_used);
+	attr_sel = (Selectivity *) palloc(sizeof(Selectivity) * nattrs_used);
+
+	attidx = 0;
+	i = -1;
+	while ((i = bms_next_member(attrs_used, i)) >= 0)
+	{
+		List	   *attr_clauses = NIL;
+		Selectivity	simple_sel;
+
 		listidx = -1;
 		foreach(l, clauses)
 		{
-			Node	   *clause;
-			AttrNumber	attnum;
+			Node	   *clause = (Node *) lfirst(l);
 
 			listidx++;
-
-			/*
-			 * Skip incompatible clauses, and ones we've already estimated on.
-			 */
-			if (!list_attnums[listidx])
-				continue;
-
-			/*
-			 * We expect the bitmaps ton contain a single attribute number.
-			 */
-			attnum = bms_singleton_member(list_attnums[listidx]);
-
-			/*
-			 * Technically we could find more than one clause for a given
-			 * attnum. Since these clauses must be equality clauses, we choose
-			 * to only take the selectivity estimate from the final clause in
-			 * the list for this attnum. If the attnum happens to be compared
-			 * to a different Const in another clause then no rows will match
-			 * anyway. If it happens to be compared to the same Const, then
-			 * ignoring the additional clause is just the thing to do.
-			 */
-			if (dependency_implies_attribute(dependency, attnum))
+			if (list_attnums[listidx] == i)
 			{
-				clause = (Node *) lfirst(l);
+				attr_clauses = lappend(attr_clauses, clause);
+				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+			}
+		}
 
-				s2 = clause_selectivity(root, clause, varRelid, jointype,
-										sjinfo);
+		simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
+												   jointype, sjinfo, NULL);
+		attr_sel[attidx++] = simple_sel;
+	}
 
-				/* mark this one as done, so we don't touch it again. */
-				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+	/*
+	 * Now combine these selectivities using the dependency information.  For
+	 * chains of dependencies such as a -> b -> c, the b -> c dependency will
+	 * come before the a -> b dependency in the array, so we traverse the
+	 * array backwards to ensure such chains are computed in the right order.
+	 *
+	 * Pairs of selectivities for attributes with dependencies are combined
+	 * using the formula
+	 *
+	 * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
+	 *
+	 * where 'f' is the degree of validity of the dependency.  This ensures
+	 * that the combined selectivity is never greater than either individual
+	 * selectivity.
+	 *
+	 * Where multiple dependencies apply (e.g., a -> b -> c), we use
+	 * conditional probabilities to compute the overall result as follows:
+	 *
+	 * P(a,b,c) = P(c|a,b) * P(a,b) = P(c|a,b) * P(b|a) * P(a)
+	 *
+	 * so we replace the probabilities of all implied attributes with
+	 * conditional probabilities, conditional on all their implying
+	 * attributes.  The probabilities of all other non-implied attributes are
+	 * left as they are.
+	 */
+	for (i = ndependencies_used-1; i >= 0; i--)
+	{
+		MVDependency *dependency = dependencies_used[i];
+		int			j;
+		AttrNumber	attnum;
+		Selectivity	s2;
+		double		f;
 
-				/*
-				 * Mark that we've got and used the dependency on this clause.
-				 * We'll want to ignore this when looking for the next
-				 * strongest dependency above.
-				 */
-				clauses_attnums = bms_del_member(clauses_attnums, attnum);
-			}
+		/* Selectivity of all the implying attributes */
+		s1 = 1.0;
+		for (j = 0; j < dependency->nattributes - 1; j++)
+		{
+			attnum = dependency->attributes[j];
+			attidx = bms_member_index(attrs_used, attnum);
+			s1 *= attr_sel[attidx];
 		}
 
+		/* Original selectivity of the implied attribute */
+		attnum = dependency->attributes[j];
+		attidx = bms_member_index(attrs_used, attnum);
+		s2 = attr_sel[attidx];
+
 		/*
-		 * Now factor in the selectivity for all the "implied" clauses into
-		 * the final one, using this formula:
+		 * Replace s2 with the conditional probability s2 given s1, computed
+		 * using the formula P(b|a) = P(a,b) / P(a), which simplifies to
 		 *
-		 * P(a,b) = P(a) * (f + (1-f) * P(b))
+		 * P(b|a) = f * Min(P(a), P(b)) / P(a) + (1-f) * P(b)
 		 *
-		 * where 'f' is the degree of validity of the dependency.
+		 * where P(a) = s1, the selectivity of the implying attributes, and
+		 * P(b) = s2, the selectivity of the implied attribute.
 		 */
-		s1 *= (dependency->degree + (1 - dependency->degree) * s2);
+		f = dependency->degree;
+
+		if (s1 <= s2)
+			attr_sel[attidx] = f + (1 - f) * s2;
+		else
+			attr_sel[attidx] = f * s2 / s1 + (1 -f) * s2;
 	}
 
+	/*
+	 * The overall selectivity of all the clauses on all these attributes is
+	 * then the product of all the original (non-implied) and conditional
+	 * (implied) probabilities.
+	 */
+	s1 = 1.0;
+	for (i = 0; i < nattrs_used; i++)
+		s1 *= attr_sel[i];
+
+	CLAMP_PROBABILITY(s1);
+
 	/* free deserialized functional dependencies (and then the array) */
 	for (i = 0; i < ndependencies; i++)
 		pfree(dependencies[i]);
 
+	pfree(attr_sel);
+	bms_free(attrs_used);
+	pfree(dependencies_used);
 	pfree(dependencies);
+	bms_free(clauses_attnums);
 	pfree(list_attnums);
 
 	return s1;
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
new file mode 100644
index dbc2532..f96e651
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -440,6 +440,12 @@ SELECT * FROM check_estimated_rows('SELE
          8 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+         4 |    100
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
  estimated | actual 
 -----------+--------
@@ -574,6 +580,12 @@ SELECT * FROM check_estimated_rows('SELE
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
  estimated | actual 
 -----------+--------
@@ -667,12 +679,14 @@ SELECT * FROM check_estimated_rows('SELE
          1 |      0
 (1 row)
 
--- check change of column type doesn't break it
+-- changing the type of column c causes its single-column stats to be dropped,
+-- giving a default estimate of 0.005 * 5000 = 25 for (c = 1); check multiple
+-- clauses estimated with functional dependencies does not exceed this
 ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
  estimated | actual 
 -----------+--------
-        50 |     50
+        25 |     50
 (1 row)
 
 ANALYZE functional_dependencies;
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
new file mode 100644
index 8beb042..203d38d
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -280,6 +280,8 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
@@ -332,6 +334,8 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
@@ -365,7 +369,9 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ALL (ARRAY[''1'', ''2''])');
 
--- check change of column type doesn't break it
+-- changing the type of column c causes its single-column stats to be dropped,
+-- giving a default estimate of 0.005 * 5000 = 25 for (c = 1); check multiple
+-- clauses estimated with functional dependencies does not exceed this
 ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');

#14

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#13)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:

On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I realized there's one more thing that probably needs discussing.
Essentially, these two clause types are the same:

a IN (1, 2, 3)

(a = 1 OR a = 2 OR a = 3)

but with 8f321bd1 we only recognize the first one as compatible with
functional dependencies. It was always the case that we estimated those
two clauses a bit differently, but the differences were usually small.
But now that we recognize IN as compatible with dependencies, the
difference may be much larger, which bugs me a bit ...

So I wonder if we should recognize the special form of an OR clause,
with all Vars referencing to the same attribute etc. and treat this as
supported by functional dependencies - the attached patch does that.
MCV lists there's already no difference because OR clauses are
supported.

Makes sense, and the patch looks straightforward enough.

The question is whether we want to do this, and whether we should also
teach the per-column estimates to recognize this special case of IN
clause.

I'm not convinced about that second part though. I'd say that
recognising the OR clause for functional dependencies is sufficient to
prevent the large differences in estimates relative to the equivalent
IN clauses. The small differences between the way that OR and IN
clauses are handled have always been there, and I think that changing
that is out of scope for this work.

Not sure. I think the inconsistency between plan and extended stats may
be a bit surprising, but I agree that issue may be negligible.

The other thing that I'm still concerned about is the possibility of
returning estimates with P(a,b) > P(a) or P(b). I think that such a
thing becomes much more likely with the new types of clause supported
here, because they now allow multiple values from each column, where
before we only allowed one. I took another look at the patch I posted
on the other thread, and I've convinced myself that it's correct.
Attached is an updated version, with some cosmetic tidying up and now
with some additional regression tests.

Yeah, I agree that's something we need to fix. Do you plan to push the
fix, or do you want me to do it?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#14)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:

The other thing that I'm still concerned about is the possibility of
returning estimates with P(a,b) > P(a) or P(b). I think that such a
thing becomes much more likely with the new types of clause supported
here, because they now allow multiple values from each column, where
before we only allowed one. I took another look at the patch I posted
on the other thread, and I've convinced myself that it's correct.
Attached is an updated version, with some cosmetic tidying up and now
with some additional regression tests.

Yeah, I agree that's something we need to fix. Do you plan to push the
fix, or do you want me to do it?

I can push it. Have you had a chance to review it?

Regards,
Dean

#16

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#15)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Tue, Mar 17, 2020 at 04:14:26PM +0000, Dean Rasheed wrote:

On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:

The other thing that I'm still concerned about is the possibility of
returning estimates with P(a,b) > P(a) or P(b). I think that such a
thing becomes much more likely with the new types of clause supported
here, because they now allow multiple values from each column, where
before we only allowed one. I took another look at the patch I posted
on the other thread, and I've convinced myself that it's correct.
Attached is an updated version, with some cosmetic tidying up and now
with some additional regression tests.

Yeah, I agree that's something we need to fix. Do you plan to push the
fix, or do you want me to do it?

I can push it. Have you had a chance to review it?

Not yet, I'll take a look today.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#17

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#16)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Tue, Mar 17, 2020 at 06:05:17PM +0100, Tomas Vondra wrote:

On Tue, Mar 17, 2020 at 04:14:26PM +0000, Dean Rasheed wrote:

On Tue, 17 Mar 2020 at 15:37, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:

The other thing that I'm still concerned about is the possibility of
returning estimates with P(a,b) > P(a) or P(b). I think that such a
thing becomes much more likely with the new types of clause supported
here, because they now allow multiple values from each column, where
before we only allowed one. I took another look at the patch I posted
on the other thread, and I've convinced myself that it's correct.
Attached is an updated version, with some cosmetic tidying up and now
with some additional regression tests.

Yeah, I agree that's something we need to fix. Do you plan to push the
fix, or do you want me to do it?

I can push it. Have you had a chance to review it?

Not yet, I'll take a look today.

OK, I took a look. I think from the correctness POV the patch is OK, but
I think the dependencies_clauselist_selectivity() function now got a bit
too complex. I've been able to parse it now, but I'm sure I'll have
trouble in the future :-(

Can we refactor / split it somehow and move bits of the logic to smaller
functions, or something like that?

Another thing I'd like to suggest is keeping the "old" formula, and
instead of just replacing it with

P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

but explaining how the old formula may produce nonsensical selectivity,
and how the new formula addresses that issue.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Tomas Vondra (#14)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Tue, Mar 17, 2020 at 04:37:06PM +0100, Tomas Vondra wrote:

On Tue, Mar 17, 2020 at 12:42:52PM +0000, Dean Rasheed wrote:

On Sat, 14 Mar 2020 at 18:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

I realized there's one more thing that probably needs discussing.
Essentially, these two clause types are the same:

a IN (1, 2, 3)

(a = 1 OR a = 2 OR a = 3)

but with 8f321bd1 we only recognize the first one as compatible with
functional dependencies. It was always the case that we estimated those
two clauses a bit differently, but the differences were usually small.
But now that we recognize IN as compatible with dependencies, the
difference may be much larger, which bugs me a bit ...

So I wonder if we should recognize the special form of an OR clause,
with all Vars referencing to the same attribute etc. and treat this as
supported by functional dependencies - the attached patch does that.
MCV lists there's already no difference because OR clauses are
supported.

Makes sense, and the patch looks straightforward enough.

The question is whether we want to do this, and whether we should also
teach the per-column estimates to recognize this special case of IN
clause.

I'm not convinced about that second part though. I'd say that
recognising the OR clause for functional dependencies is sufficient to
prevent the large differences in estimates relative to the equivalent
IN clauses. The small differences between the way that OR and IN
clauses are handled have always been there, and I think that changing
that is out of scope for this work.

Not sure. I think the inconsistency between plan and extended stats may
be a bit surprising, but I agree that issue may be negligible.

OK, I've pushed the change recognizing the special case of OR clauses as
supported by functional dependencies. I've left the estimation of the
clause itself as it's, we can address that in the future if needed.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#19

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#17)

1 attachment(s)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Wed, 18 Mar 2020 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

OK, I took a look. I think from the correctness POV the patch is OK, but
I think the dependencies_clauselist_selectivity() function now got a bit
too complex. I've been able to parse it now, but I'm sure I'll have
trouble in the future :-(

Can we refactor / split it somehow and move bits of the logic to smaller
functions, or something like that?

Yeah, it has gotten a bit long. It's somewhat tricky splitting it up,
because of the number of shared variables used throughout the
function, but here's an updated patch splitting it into what seemed
like the 2 most logical pieces. The first piece (still in
dependencies_clauselist_selectivity()) works out what dependencies
can/should be applied, and the second piece in a new function does the
actual work of applying the list of functional dependencies to the
clause list.

I think that has made it easier to follow, and it has also reduced the
complexity of the final "no applicable stats" branch.

Another thing I'd like to suggest is keeping the "old" formula, and
instead of just replacing it with

P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

but explaining how the old formula may produce nonsensical selectivity,
and how the new formula addresses that issue.

I think this is purely a comment issue? I've added some more extensive
comments attempting to justify the formulae.

Regards,
Dean

Attachments:

dependencies_clauselist_selectivity-v3.patchtext/x-patch; charset=US-ASCII; name=dependencies_clauselist_selectivity-v3.patchDownload

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
new file mode 100644
index 5f9b43b..18cce60
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -30,6 +30,7 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 #include "utils/lsyscache.h"
+#include "utils/selfuncs.h"
 #include "utils/syscache.h"
 #include "utils/typcache.h"
 
@@ -73,13 +74,18 @@ static double dependency_degree(int numr
 								AttrNumber *dependency, VacAttrStats **stats, Bitmapset *attrs);
 static bool dependency_is_fully_matched(MVDependency *dependency,
 										Bitmapset *attnums);
-static bool dependency_implies_attribute(MVDependency *dependency,
-										 AttrNumber attnum);
 static bool dependency_is_compatible_clause(Node *clause, Index relid,
 											AttrNumber *attnum);
 static MVDependency *find_strongest_dependency(MVDependencies **dependencies,
 											   int ndependencies,
 											   Bitmapset *attnums);
+static Selectivity deps_clauselist_selectivity(PlannerInfo *root, List *clauses,
+											   int varRelid, JoinType jointype,
+											   SpecialJoinInfo *sjinfo,
+											   MVDependency **dependencies,
+											   int ndependencies,
+											   AttrNumber *list_attnums,
+											   Bitmapset **estimatedclauses);
 
 static void
 generate_dependencies_recurse(DependencyGenerator state, int index,
@@ -614,19 +620,6 @@ dependency_is_fully_matched(MVDependency
 }
 
 /*
- * dependency_implies_attribute
- *		check that the attnum matches is implied by the functional dependency
- */
-static bool
-dependency_implies_attribute(MVDependency *dependency, AttrNumber attnum)
-{
-	if (attnum == dependency->attributes[dependency->nattributes - 1])
-		return true;
-
-	return false;
-}
-
-/*
  * statext_dependencies_load
  *		Load the functional dependencies for the indicated pg_statistic_ext tuple
  */
@@ -986,6 +979,182 @@ find_strongest_dependency(MVDependencies
 }
 
 /*
+ * deps_clauselist_selectivity
+ *		Return the estimated selectivity of (a subset of) the given clauses
+ *		using the given functional dependencies.  This is the guts of
+ *		dependencies_clauselist_selectivity().
+ *
+ * This will estimate all not-already-estimated clauses that are compatible
+ * with functional dependencies, and which have an attribute mentioned by any
+ * of the given dependencies (either as an implying or implied attribute).
+ *
+ * Given (lists of) clauses on attributes (a,b) and a functional dependency
+ * (a=>b), the per-column selectivities P(a) and P(b) are notionally combined
+ * using the formula
+ *
+ *		P(a,b) = f * P(a) + (1-f) * P(a) * P(b)
+ *
+ * where 'f' is the degree of dependency.  This reflects the fact that we
+ * expect a fraction f of all rows to be consistent with the dependency
+ * (a=>b), and so have a selectivity of P(a), while the remaining rows are
+ * treated as independent.
+ *
+ * In practice, we use a slightly modified version of this formula, which uses
+ * a selectivity of Min(P(a), P(b)) for the dependent rows, since the result
+ * should obviously not exceed either column's individual selectivity.  I.e.,
+ * we actually combine selectivities using the formula
+ *
+ *		P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
+ *
+ * This can make quite a difference if the specific values matching the
+ * clauses are not consistent with the functional dependency.
+ */
+static Selectivity
+deps_clauselist_selectivity(PlannerInfo *root, List *clauses,
+							int varRelid, JoinType jointype,
+							SpecialJoinInfo *sjinfo,
+							MVDependency **dependencies, int ndependencies,
+							AttrNumber *list_attnums,
+							Bitmapset **estimatedclauses)
+{
+	Bitmapset  *attnums;
+	int			i;
+	int			j;
+	int			nattrs;
+	Selectivity *attr_sel;
+	int			attidx;
+	int			listidx;
+	ListCell   *l;
+	Selectivity s1;
+
+	/*
+	 * Extract the attnums of all implying and implied attributes from all the
+	 * given dependencies.  Each of these attributes is expected to have at
+	 * least 1 not-already-estimated compatible clause that we will estimate
+	 * here.
+	 */
+	attnums = NULL;
+	for (i = 0; i < ndependencies; i++)
+	{
+		for (j = 0; j < dependencies[i]->nattributes; j++)
+		{
+			AttrNumber	attnum = dependencies[i]->attributes[j];
+			attnums = bms_add_member(attnums, attnum);
+		}
+	}
+
+	/*
+	 * Compute per-column selectivity estimates for each of these attributes,
+	 * and mark all the corresponding clauses as estimated.
+	 */
+	nattrs = bms_num_members(attnums);
+	attr_sel = (Selectivity *) palloc(sizeof(Selectivity) * nattrs);
+
+	attidx = 0;
+	i = -1;
+	while ((i = bms_next_member(attnums, i)) >= 0)
+	{
+		List	   *attr_clauses = NIL;
+		Selectivity	simple_sel;
+
+		listidx = -1;
+		foreach(l, clauses)
+		{
+			Node	   *clause = (Node *) lfirst(l);
+
+			listidx++;
+			if (list_attnums[listidx] == i)
+			{
+				attr_clauses = lappend(attr_clauses, clause);
+				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+			}
+		}
+
+		simple_sel = clauselist_selectivity_simple(root, attr_clauses, varRelid,
+												   jointype, sjinfo, NULL);
+		attr_sel[attidx++] = simple_sel;
+	}
+
+	/*
+	 * Now combine these selectivities using the dependency information.  For
+	 * chains of dependencies such as a -> b -> c, the b -> c dependency will
+	 * come before the a -> b dependency in the array, so we traverse the
+	 * array backwards to ensure such chains are computed in the right order.
+	 *
+	 * As explained above, pairs of selectivities are combined using the
+	 * formula
+	 *
+	 * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
+	 *
+	 * to ensure that the combined selectivity is never greater than either
+	 * individual selectivity.
+	 *
+	 * Where multiple dependencies apply (e.g., a -> b -> c), we use
+	 * conditional probabilities to compute the overall result as follows:
+	 *
+	 * P(a,b,c) = P(c|a,b) * P(a,b) = P(c|a,b) * P(b|a) * P(a)
+	 *
+	 * so we replace the selectivities of all implied attributes with
+	 * conditional probabilities, that are conditional on all their implying
+	 * attributes.  The selectivities of all other non-implied attributes are
+	 * left as they are.
+	 */
+	for (i = ndependencies - 1; i >= 0; i--)
+	{
+		MVDependency *dependency = dependencies[i];
+		AttrNumber	attnum;
+		Selectivity	s2;
+		double		f;
+
+		/* Selectivity of all the implying attributes */
+		s1 = 1.0;
+		for (j = 0; j < dependency->nattributes - 1; j++)
+		{
+			attnum = dependency->attributes[j];
+			attidx = bms_member_index(attnums, attnum);
+			s1 *= attr_sel[attidx];
+		}
+
+		/* Original selectivity of the implied attribute */
+		attnum = dependency->attributes[j];
+		attidx = bms_member_index(attnums, attnum);
+		s2 = attr_sel[attidx];
+
+		/*
+		 * Replace s2 with the conditional probability s2 given s1, computed
+		 * using the formula P(b|a) = P(a,b) / P(a), which simplifies to
+		 *
+		 * P(b|a) = f * Min(P(a), P(b)) / P(a) + (1-f) * P(b)
+		 *
+		 * where P(a) = s1, the selectivity of the implying attributes, and
+		 * P(b) = s2, the selectivity of the implied attribute.
+		 */
+		f = dependency->degree;
+
+		if (s1 <= s2)
+			attr_sel[attidx] = f + (1 - f) * s2;
+		else
+			attr_sel[attidx] = f * s2 / s1 + (1 -f) * s2;
+	}
+
+	/*
+	 * The overall selectivity of all the clauses on all these attributes is
+	 * then the product of all the original (non-implied) probabilities and
+	 * the new conditional (implied) probabilities.
+	 */
+	s1 = 1.0;
+	for (i = 0; i < nattrs; i++)
+		s1 *= attr_sel[i];
+
+	CLAMP_PROBABILITY(s1);
+
+	pfree(attr_sel);
+	bms_free(attnums);
+
+	return s1;
+}
+
+/*
  * dependencies_clauselist_selectivity
  *		Return the estimated selectivity of (a subset of) the given clauses
  *		using functional dependency statistics, or 1.0 if no useful functional
@@ -999,15 +1168,16 @@ find_strongest_dependency(MVDependencies
  * between them, i.e. either (a=>b) or (b=>a). Assuming (a=>b) is the selected
  * dependency, we then combine the per-clause selectivities using the formula
  *
- *	   P(a,b) = P(a) * [f + (1-f)*P(b)]
+ *	   P(a,b) = f * P(a) + (1-f) * P(a) * P(b)
  *
- * where 'f' is the degree of the dependency.
+ * where 'f' is the degree of the dependency.  (Actually we use a slightly
+ * modified version of this formula -- see deps_clauselist_selectivity()).
  *
  * With clauses on more than two attributes, the dependencies are applied
  * recursively, starting with the widest/strongest dependencies. For example
  * P(a,b,c) is first split like this:
  *
- *	   P(a,b,c) = P(a,b) * [f + (1-f)*P(c)]
+ *	   P(a,b,c) = f * P(a,b) + (1-f) * P(a,b) * P(c)
  *
  * assuming (a,b=>c) is the strongest dependency.
  */
@@ -1023,17 +1193,20 @@ dependencies_clauselist_selectivity(Plan
 	Selectivity s1 = 1.0;
 	ListCell   *l;
 	Bitmapset  *clauses_attnums = NULL;
-	Bitmapset **list_attnums;
+	AttrNumber *list_attnums;
 	int			listidx;
-	MVDependencies    **dependencies = NULL;
-	int					ndependencies = 0;
+	MVDependencies **func_dependencies;
+	int			nfunc_dependencies;
+	int			total_ndeps;
+	MVDependency **dependencies;
+	int			ndependencies;
 	int			i;
 
 	/* check if there's any stats that might be useful for us. */
 	if (!has_stats_of_kind(rel->statlist, STATS_EXT_DEPENDENCIES))
 		return 1.0;
 
-	list_attnums = (Bitmapset **) palloc(sizeof(Bitmapset *) *
+	list_attnums = (AttrNumber *) palloc(sizeof(AttrNumber) *
 										 list_length(clauses));
 
 	/*
@@ -1056,11 +1229,11 @@ dependencies_clauselist_selectivity(Plan
 		if (!bms_is_member(listidx, *estimatedclauses) &&
 			dependency_is_compatible_clause(clause, rel->relid, &attnum))
 		{
-			list_attnums[listidx] = bms_make_singleton(attnum);
+			list_attnums[listidx] = attnum;
 			clauses_attnums = bms_add_member(clauses_attnums, attnum);
 		}
 		else
-			list_attnums[listidx] = NULL;
+			list_attnums[listidx] = InvalidAttrNumber;
 
 		listidx++;
 	}
@@ -1072,6 +1245,7 @@ dependencies_clauselist_selectivity(Plan
 	 */
 	if (bms_num_members(clauses_attnums) < 2)
 	{
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
@@ -1087,9 +1261,10 @@ dependencies_clauselist_selectivity(Plan
 	 * to make it just the right size, but it's likely wasteful anyway thanks
 	 * to moving the freed chunks to freelists etc.
 	 */
-	ndependencies = 0;
-	dependencies = (MVDependencies **) palloc(sizeof(MVDependencies *) *
-											  list_length(rel->statlist));
+	func_dependencies = (MVDependencies **) palloc(sizeof(MVDependencies *) *
+												   list_length(rel->statlist));
+	nfunc_dependencies = 0;
+	total_ndeps = 0;
 
 	foreach(l,rel->statlist)
 	{
@@ -1109,104 +1284,65 @@ dependencies_clauselist_selectivity(Plan
 		if (num_matched < 2)
 			continue;
 
-		dependencies[ndependencies++]
+		func_dependencies[nfunc_dependencies]
 			= statext_dependencies_load(stat->statOid);
+
+		total_ndeps += func_dependencies[nfunc_dependencies]->ndeps;
+		nfunc_dependencies++;
 	}
 
 	/* if no matching stats could be found then we've nothing to do */
-	if (!ndependencies)
+	if (nfunc_dependencies == 0)
 	{
+		pfree(func_dependencies);
+		bms_free(clauses_attnums);
 		pfree(list_attnums);
 		return 1.0;
 	}
 
 	/*
-	 * Apply the dependencies recursively, starting with the widest/strongest
-	 * ones, and proceeding to the smaller/weaker ones. At the end of each
-	 * round we factor in the selectivity of clauses on the implied attribute,
-	 * and remove the clauses from the list.
+	 * Work out which dependencies we can apply, starting with the
+	 * widest/stongest ones, and proceeding to smaller/weaker ones.
 	 */
+	dependencies = (MVDependency **) palloc(sizeof(MVDependency *) *
+											total_ndeps);
+	ndependencies = 0;
+
 	while (true)
 	{
-		Selectivity s2 = 1.0;
 		MVDependency *dependency;
+		AttrNumber	attnum;
 
 		/* the widest/strongest dependency, fully matched by clauses */
-		dependency = find_strongest_dependency(dependencies, ndependencies,
+		dependency = find_strongest_dependency(func_dependencies,
+											   nfunc_dependencies,
 											   clauses_attnums);
-
-		/* if no suitable dependency was found, we're done */
 		if (!dependency)
 			break;
 
-		/*
-		 * We found an applicable dependency, so find all the clauses on the
-		 * implied attribute - with dependency (a,b => c) we look for clauses
-		 * on 'c'.
-		 */
-		listidx = -1;
-		foreach(l, clauses)
-		{
-			Node	   *clause;
-			AttrNumber	attnum;
-
-			listidx++;
-
-			/*
-			 * Skip incompatible clauses, and ones we've already estimated on.
-			 */
-			if (!list_attnums[listidx])
-				continue;
-
-			/*
-			 * We expect the bitmaps ton contain a single attribute number.
-			 */
-			attnum = bms_singleton_member(list_attnums[listidx]);
-
-			/*
-			 * Technically we could find more than one clause for a given
-			 * attnum. Since these clauses must be equality clauses, we choose
-			 * to only take the selectivity estimate from the final clause in
-			 * the list for this attnum. If the attnum happens to be compared
-			 * to a different Const in another clause then no rows will match
-			 * anyway. If it happens to be compared to the same Const, then
-			 * ignoring the additional clause is just the thing to do.
-			 */
-			if (dependency_implies_attribute(dependency, attnum))
-			{
-				clause = (Node *) lfirst(l);
-
-				s2 = clause_selectivity(root, clause, varRelid, jointype,
-										sjinfo);
-
-				/* mark this one as done, so we don't touch it again. */
-				*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
-
-				/*
-				 * Mark that we've got and used the dependency on this clause.
-				 * We'll want to ignore this when looking for the next
-				 * strongest dependency above.
-				 */
-				clauses_attnums = bms_del_member(clauses_attnums, attnum);
-			}
-		}
+		dependencies[ndependencies++] = dependency;
 
-		/*
-		 * Now factor in the selectivity for all the "implied" clauses into
-		 * the final one, using this formula:
-		 *
-		 * P(a,b) = P(a) * (f + (1-f) * P(b))
-		 *
-		 * where 'f' is the degree of validity of the dependency.
-		 */
-		s1 *= (dependency->degree + (1 - dependency->degree) * s2);
+		/* Ignore dependencies using this implied attribute in later loops */
+		attnum = dependency->attributes[dependency->nattributes - 1];
+		clauses_attnums = bms_del_member(clauses_attnums, attnum);
 	}
 
+	/*
+	 * If we found applicable dependencies, use them to estimate all
+	 * compatible clauses on attributes that they refer to.
+	 */
+	if (ndependencies != 0)
+		s1 = deps_clauselist_selectivity(root, clauses, varRelid, jointype,
+										 sjinfo, dependencies, ndependencies,
+										 list_attnums, estimatedclauses);
+
 	/* free deserialized functional dependencies (and then the array) */
-	for (i = 0; i < ndependencies; i++)
-		pfree(dependencies[i]);
+	for (i = 0; i < nfunc_dependencies; i++)
+		pfree(func_dependencies[i]);
 
 	pfree(dependencies);
+	pfree(func_dependencies);
+	bms_free(clauses_attnums);
 	pfree(list_attnums);
 
 	return s1;
diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out
new file mode 100644
index bd6d8be..f2f4008
--- a/src/test/regress/expected/stats_ext.out
+++ b/src/test/regress/expected/stats_ext.out
@@ -440,6 +440,12 @@ SELECT * FROM check_estimated_rows('SELE
          8 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+         4 |    100
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
  estimated | actual 
 -----------+--------
@@ -600,6 +606,12 @@ SELECT * FROM check_estimated_rows('SELE
        200 |    200
 (1 row)
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+ estimated | actual 
+-----------+--------
+       100 |    100
+(1 row)
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
  estimated | actual 
 -----------+--------
@@ -719,12 +731,14 @@ SELECT * FROM check_estimated_rows('SELE
          1 |      0
 (1 row)
 
--- check change of column type doesn't break it
+-- changing the type of column c causes its single-column stats to be dropped,
+-- giving a default estimate of 0.005 * 5000 = 25 for (c = 1); check multiple
+-- clauses estimated with functional dependencies does not exceed this
 ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');
  estimated | actual 
 -----------+--------
-        50 |     50
+        25 |     50
 (1 row)
 
 ANALYZE functional_dependencies;
diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
new file mode 100644
index 897ed30..f3b652d
--- a/src/test/regress/sql/stats_ext.sql
+++ b/src/test/regress/sql/stats_ext.sql
@@ -280,6 +280,8 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
@@ -342,6 +344,8 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b IN (''1'', ''2'')');
 
+SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ''1''');
+
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c = 1');
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 26, 51, 76) AND b IN (''1'', ''26'') AND c IN (1)');
@@ -385,7 +389,9 @@ SELECT * FROM check_estimated_rows('SELE
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a IN (1, 2, 51, 52) AND b = ALL (ARRAY[''1'', ''2''])');
 
--- check change of column type doesn't break it
+-- changing the type of column c causes its single-column stats to be dropped,
+-- giving a default estimate of 0.005 * 5000 = 25 for (c = 1); check multiple
+-- clauses estimated with functional dependencies does not exceed this
 ALTER TABLE functional_dependencies ALTER COLUMN c TYPE numeric;
 
 SELECT * FROM check_estimated_rows('SELECT * FROM functional_dependencies WHERE a = 1 AND b = ''1'' AND c = 1');

#20

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#19)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Thu, Mar 19, 2020 at 07:53:39PM +0000, Dean Rasheed wrote:

On Wed, 18 Mar 2020 at 00:29, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

OK, I took a look. I think from the correctness POV the patch is OK, but
I think the dependencies_clauselist_selectivity() function now got a bit
too complex. I've been able to parse it now, but I'm sure I'll have
trouble in the future :-(

Can we refactor / split it somehow and move bits of the logic to smaller
functions, or something like that?

Yeah, it has gotten a bit long. It's somewhat tricky splitting it up,
because of the number of shared variables used throughout the
function, but here's an updated patch splitting it into what seemed
like the 2 most logical pieces. The first piece (still in
dependencies_clauselist_selectivity()) works out what dependencies
can/should be applied, and the second piece in a new function does the
actual work of applying the list of functional dependencies to the
clause list.

I think that has made it easier to follow, and it has also reduced the
complexity of the final "no applicable stats" branch.

Seems OK to me.

I'd perhaps name deps_clauselist_selectivity differently, it's a bit too
similar to dependencies_clauselist_selectivity. Perhaps something like
clauselist_apply_dependencies? But that's a minor detail.

Another thing I'd like to suggest is keeping the "old" formula, and
instead of just replacing it with

P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

but explaining how the old formula may produce nonsensical selectivity,
and how the new formula addresses that issue.

I think this is purely a comment issue? I've added some more extensive
comments attempting to justify the formulae.

Yes, it was purely a comment issue. Seems fine now.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Tomas Vondra (#20)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Wed, 25 Mar 2020 at 00:28, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Seems OK to me.

I'd perhaps name deps_clauselist_selectivity differently, it's a bit too
similar to dependencies_clauselist_selectivity. Perhaps something like
clauselist_apply_dependencies? But that's a minor detail.

OK, I've pushed that with your recommendation for that function name.

Regards,
Dean

#22

Dean Rasheed

dean.a.rasheed@gmail.com

almost 6 years ago

In reply to: Dean Rasheed (#21)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

OK, I've pushed that with your recommendation for that function name.

Does this now complete everything that you wanted to do for functional
dependency stats for PG13? Re-reading the thread, I couldn't see
anything else that needed looking at. If that's the case, the CF entry
can be closed.

Regards,
Dean

#23

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 6 years ago

In reply to: Dean Rasheed (#22)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On Sun, Mar 29, 2020 at 10:22:25AM +0100, Dean Rasheed wrote:

On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

OK, I've pushed that with your recommendation for that function name.

Does this now complete everything that you wanted to do for functional
dependency stats for PG13? Re-reading the thread, I couldn't see
anything else that needed looking at. If that's the case, the CF entry
can be closed.

Yes. There were two improvements proposed, we've committed one of them
(the IN/ANY operator handling) and the other (containment) needs more
discussion. So I think it's OK to mark this either as committed or maybe
returned with feedback.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#24

Daniel Gustafsson

daniel@yesql.se

over 5 years ago

In reply to: Tomas Vondra (#23)

Re: PATCH: add support for IN and @> in functional-dependency statistics use

On 29 Mar 2020, at 17:27, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Sun, Mar 29, 2020 at 10:22:25AM +0100, Dean Rasheed wrote:

On Sat, 28 Mar 2020 at 13:18, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

OK, I've pushed that with your recommendation for that function name.

Does this now complete everything that you wanted to do for functional
dependency stats for PG13? Re-reading the thread, I couldn't see
anything else that needed looking at. If that's the case, the CF entry
can be closed.

Yes. There were two improvements proposed, we've committed one of them
(the IN/ANY operator handling) and the other (containment) needs more
discussion. So I think it's OK to mark this either as committed or maybe
returned with feedback.

Since there hasn't been more discussion on the second item I've closed this
item as committed. The containment part can be opened as a new CF entry.

cheers ./daniel