[PATCH] Erase the distinctClause if the result is unique by definition
Hi:
I wrote a patch to erase the distinctClause if the result is unique by
definition. I noticed this because a user switched some code from Oracle
to PG and found the performance was bad because of this, so I adapted PG
for this case as well.
This patch doesn't help a well-written SQL query, but such a drawback in
a query may not be very obvious, and since the cost of the check is
pretty low, I think it would be OK to add.
Please see the patch for details.
Thank you.
Attachments:
0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch (application/octet-stream, +428-9)
Updated the patch to take semi/anti joins into account.
Can anyone help to review this patch?
Thanks
Hi Andy,
What might help is to add more description to your email message like
giving examples to explain your idea.
Anyway, I looked at the testcases you added for examples.
+create table select_distinct_a(a int, b char(20), c char(20) not null, d
int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+ QUERY PLAN
+-------------------------------
+ Seq Scan on select_distinct_a
+(1 row)
From this example, it seems that the distinct operation can be dropped
because (a, b) is a primary key. Is my understanding correct?
I like the idea since it eliminates one expensive operation.
However, the patch as presented has some problems:
1. What happens if the primary key constraint or NOT NULL constraint gets
dropped between a prepare and an execute? The plan will no longer be valid,
and thus execution may produce non-distinct results. PostgreSQL has a
similar concept of allowing non-grouping expressions in the targetlist when
those expressions can be proved to be functionally dependent on the GROUP
BY clause. See check_functional_grouping() and its caller. I think
DISTINCT elimination should work on similar lines.
2. For the same reason described in check_functional_grouping(), using
unique indexes for eliminating DISTINCT should be discouraged.
3. If you could eliminate DISTINCT, you could similarly eliminate GROUP BY
as well.
4. The patch works only at the query level, but that functionality can be
expanded generally to other places which add Unique/HashAggregate/Group
nodes if the underlying relation can be proved to produce distinct rows.
But that's probably more work since we will have to label paths with unique
keys similar to pathkeys.
5. Have you tested this with OUTER joins, which can render the inner side nullable?
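To make point 5 concrete, here is a hypothetical session (table names invented for illustration, not taken from the patch's regression tests) showing that uniqueness of the inner side does not survive a LEFT JOIN:

```sql
-- t2.pk is a primary key, hence unique and NOT NULL within t2 ...
CREATE TABLE t1 (id int PRIMARY KEY);
CREATE TABLE t2 (pk int PRIMARY KEY, t1_id int);
INSERT INTO t1 VALUES (1), (2), (3);
INSERT INTO t2 VALUES (10, 1);

-- ... but after a LEFT JOIN the column becomes nullable, and the
-- NULL-extended rows can repeat, so DISTINCT still has work to do:
SELECT DISTINCT t2.pk
FROM t1 LEFT JOIN t2 ON t2.t1_id = t1.id;
-- without DISTINCT: 10, NULL, NULL; with DISTINCT: 10, NULL
```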
--
Best Wishes,
Ashutosh Bapat
Hi Ashutosh:
Thanks for your time.
On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
wrote:
Hi Andy,
What might help is to add more description to your email message, like
giving examples to explain your idea. Anyway, I looked at the testcases
you added for examples.

+create table select_distinct_a(a int, b char(20), c char(20) not null, d int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+ QUERY PLAN
+-------------------------------
+ Seq Scan on select_distinct_a
+(1 row)

From this example, it seems that the distinct operation can be dropped
because (a, b) is a primary key. Is my understanding correct?
Yes, you are correct. Actually, I added this explanation to the commit
message, but it's true that I should have copied it into the email body as
well, so here it is:
[PATCH] Erase the distinctClause if the result is unique by
definition
For a single relation, we can tell the result is unique when any one of
the following is true:
1. The primary key is in the target list.
2. A unique key is in the target list and its columns are NOT NULL.
3. The columns in the GROUP BY clause are also in the target list.

For a relation join, we can tell it as follows: if every relation in the
join tree yields a unique result set, then the final result is unique as
well, regardless of the join method.

For semi/anti joins, we ignore the right table.
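A hypothetical session sketching the three single-relation rules (the table is invented for illustration; whether the planner actually drops the Unique node depends on the patch):

```sql
CREATE TABLE demo (a int, b int NOT NULL, c int,
                   PRIMARY KEY (a),
                   UNIQUE (b));

-- Rule 1: the primary key (a) is in the target list,
-- so the rows are unique by definition.
SELECT DISTINCT a, c FROM demo;

-- Rule 2: the unique key (b) is in the target list and b is NOT NULL;
-- a nullable unique column would not qualify, because NULLs may repeat.
SELECT DISTINCT b, c FROM demo;

-- Rule 3: the GROUP BY columns are all in the target list,
-- and grouping has already de-duplicated them.
SELECT DISTINCT c FROM demo GROUP BY c;
```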
I like the idea since it eliminates one expensive operation. However, the
patch as presented has some problems:
1. What happens if the primary key constraint or NOT NULL constraint gets
dropped between a prepare and an execute? The plan will no longer be valid,
and thus execution may produce non-distinct results.
Will this still be an issue if the user doesn't use the "read uncommitted"
isolation level? I suppose it should be OK in that case. Even so, I should
add an isolation-level check for this; I just added that in the patch to
continue the discussion of this issue.
PostgreSQL has similar concept of allowing non-grouping expression as part
of targetlist when those expressions can be proved to be functionally
dependent on the GROUP BY clause. See check_functional_grouping() and its
caller. I think, DISTINCT elimination should work on similar lines.
2. For the same reason described in check_functional_grouping(), using
unique indexes for eliminating DISTINCT should be discouraged.
I checked the comments of check_functional_grouping(); the reason is:

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.

Actually, I am doubtful about the pg_constraint reasoning, since we are
still able to get the not-null information from
relation->rd_attr->attrs[n].attnotnull, which is just what this patch does.
3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
as well
This is a good point. The rules may be somewhat different for joins, so I
prefer to focus on the current case for now.
4. The patch works only at the query level, but that functionality can be
expanded generally to other places which add Unique/HashAggregate/Group
nodes if the underlying relation can be proved to produce distinct rows.
But that's probably more work since we will have to label paths with unique
keys similar to pathkeys.
Do you mean adding some information into PlannerInfo, and when we create
a node for Unique/HashAggregate/Group, we can just create a dummy node?
5. Have you tested this OUTER joins, which can render inner side nullable?
Yes, that part was missed in the test cases; I have just added them.
Attachments:
0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch (application/octet-stream, +565-9)
On Sat, Feb 8, 2020 at 12:53 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Hi Ashutosh:
Thanks for your time.

[PATCH] Erase the distinctClause if the result is unique by definition
I forgot to mention this in the last round of comments. Your patch was
actually removing distinctClause from the Query structure. Please avoid
doing that. If you remove it, you are also removing the evidence that this
Query had a DISTINCT clause in it.
However the patch as presented has some problems
1. What happens if the primary key constraint or NOT NULL constraint gets
dropped between a prepare and execute? The plan will no more be valid and
thus execution may produce non-distinct results.

Will this still be an issue if the user doesn't use the "read uncommitted"
isolation level? I suppose it should be OK in that case. Even so, I should
add an isolation-level check for this; I just added that in the patch to
continue the discussion of this issue.
In PostgreSQL there's no "read uncommitted". But that doesn't matter since
a query can be prepared outside a transaction and executed within one or
more subsequent transactions.
PostgreSQL has a similar concept of allowing non-grouping expressions as
part of the targetlist when those expressions can be proved to be
functionally dependent on the GROUP BY clause. See
check_functional_grouping() and its caller. I think DISTINCT elimination
should work on similar lines.

2. For the same reason described in check_functional_grouping(), using
unique indexes for eliminating DISTINCT should be discouraged.

I checked the comments of check_functional_grouping(); the reason is:

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.

Actually, I am doubtful about the pg_constraint reasoning, since we are
still able to get the not-null information from
relation->rd_attr->attrs[n].attnotnull, which is just what this patch does.
The problem isn't whether not-null-ness can be inferred or not; the problem
is whether that can be guaranteed across planning and execution of a query
(prepare and execute, for example). The constraintDep machinery registers
the constraints used for preparing a plan and invalidates the plan if any of
those constraints change after the plan is created.
3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
as well
This is a good point. The rules may be somewhat different for joins, so I
prefer to focus on the current case for now.
I doubt that, since DISTINCT is ultimately carried out as a grouping
operation. But anyway, I won't hang upon that.
4. The patch works only at the query level, but that functionality can be
expanded generally to other places which add Unique/HashAggregate/Group
nodes if the underlying relation can be proved to produce distinct rows.
But that's probably more work since we will have to label paths with unique
keys similar to pathkeys.

Do you mean adding some information into PlannerInfo, and when we create
a node for Unique/HashAggregate/Group, we can just create a dummy node?
Not so much PlannerInfo as something on the lines of PathKey. See the
PathKey structure and related code. What I envision is that the PathKey
class is also annotated with the information whether that PathKey implies
uniqueness. E.g. a PathKey derived from a primary key index would imply
uniqueness; a PathKey derived from, say, a Group operation also implies
uniqueness. Then just by looking at the underlying Path we would be able to
say whether we need a Group/Unique node on top of it or not. I think that
would make it a much wider use case and a very useful optimization.
--
Best Wishes,
Ashutosh Bapat
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
On Sat, Feb 8, 2020 at 12:53 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Do you mean adding some information into PlannerInfo, and when we create
a node for Unique/HashAggregate/Group, we can just create a dummy node?
Not so much as PlannerInfo but something on lines of PathKey. See PathKey
structure and related code. What I envision is PathKey class is also
annotated with the information whether that PathKey implies uniqueness.
E.g. a PathKey derived from a Primary index would imply uniqueness also. A
PathKey derived from say Group operation also implies uniqueness. Then just
by looking at the underlying Path we would be able to say whether we need
Group/Unique node on top of it or not. I think that would make it much
wider usecase and a very useful optimization.
FWIW, that doesn't seem like a very prudent approach to me, because it
confuses sorted-ness with unique-ness. PathKeys are about sorting,
but it's possible to have uniqueness guarantees without having sorted
anything, for instance via hashed grouping.
I haven't looked at this patch, but I'd expect it to use infrastructure
related to query_is_distinct_for(), and that doesn't deal in PathKeys.
regards, tom lane
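Tom's distinction between sorted-ness and unique-ness shows up directly in plan shapes. A hypothetical session (assuming some table tbl(c int); the planner's actual choice depends on data and settings):

```sql
-- Hashed grouping de-duplicates without producing any ordering:
SET enable_sort = off;
EXPLAIN (costs off) SELECT c FROM tbl GROUP BY c;
-- Typically:
--   HashAggregate
--     Group Key: c
--     ->  Seq Scan on tbl
-- The HashAggregate output is unique on c, yet there is no sort
-- order, hence no PathKey, to attach that uniqueness to.
```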
On Tue, Feb 11, 2020 at 12:22 AM Ashutosh Bapat <
ashutosh.bapat.oss@gmail.com> wrote:
[PATCH] Erase the distinctClause if the result is unique by definition

I forgot to mention this in the last round of comments. Your patch was
actually removing distinctClause from the Query structure. Please avoid
doing that. If you remove it, you are also removing the evidence that this
Query had a DISTINCT clause in it.
Yes, I removed it because that was the easiest way to do it. What is the
purpose of keeping the evidence?
In PostgreSQL there's no "read uncommitted".
Thanks for the hint; I just noticed that read uncommitted is treated as
read committed in PostgreSQL.
But that doesn't matter since a query can be prepared outside a
transaction and executed within one or more subsequent transactions.
Suppose that after a DDL the prepared statement needs to be
re-parsed/re-planned if it is not currently executing; otherwise it will
prevent the DDL from happening. The following is my test.
postgres=# create table t (a int primary key, b int not null, c int);
CREATE TABLE
postgres=# insert into t values(1, 1, 1), (2, 2, 2);
INSERT 0 2
postgres=# create unique index t_idx1 on t(b);
CREATE INDEX
postgres=# prepare st as select distinct b from t where c = $1;
PREPARE
postgres=# explain execute st(1);
QUERY PLAN
-------------------------------------------------
Seq Scan on t (cost=0.00..1.02 rows=1 width=4)
Filter: (c = 1)
(2 rows)
...
postgres=# explain execute st(1);
QUERY PLAN
-------------------------------------------------
Seq Scan on t (cost=0.00..1.02 rows=1 width=4)
Filter: (c = $1)
(2 rows)
-- session 2
postgres=# alter table t alter column b drop not null;
ALTER TABLE
-- session 1:
postgres=# explain execute st(1);
QUERY PLAN
-------------------------------------------------------------
Unique (cost=1.03..1.04 rows=1 width=4)
-> Sort (cost=1.03..1.04 rows=1 width=4)
Sort Key: b
-> Seq Scan on t (cost=0.00..1.02 rows=1 width=4)
Filter: (c = $1)
(5 rows)
-- session 2
postgres=# insert into t values (3, null, 3), (4, null, 3);
INSERT 0 2
-- session 1
postgres=# execute st(3);
b
---
(1 row)
And if we prepare the SQL outside a transaction and execute it inside one,
the other session can't drop the constraint until the transaction ends.
On Tue, Feb 11, 2020 at 10:57:26AM +0800, Andy Fan wrote:
and if we prepare sql outside a transaction, and execute it in the
transaction, the other session can't drop the constraint until the
transaction is ended.
And what if you create a view on top of a query containing a distinct
clause rather than using prepared statements? FWIW your patch doesn't
handle such a case at all, without even needing to drop constraints:
CREATE TABLE t (a int primary key, b int not null, c int);
INSERT INTO t VALUES(1, 1, 1), (2, 2, 2);
CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
EXPLAIN SELECT * FROM v1;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
I also think this is not the right way to handle this optimization.
On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
And what if you create a view on top of a query containing a distinct
clause rather than using prepared statements? FWIW your patch doesn't
handle such a case at all, without even needing to drop constraints:

CREATE TABLE t (a int primary key, b int not null, c int);
INSERT INTO t VALUES(1, 1, 1), (2, 2, 2);
CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
EXPLAIN SELECT * FROM v1;

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Thanks for pointing it out. This is unexpected based on my current
knowledge; I will check it.
I also think this is not the right way to handle this optimization.
I started to check query_is_distinct_for() when Tom pointed it out, but I
still don't understand the context fully. I will look into your finding as
well.
On Tue, Feb 11, 2020 at 3:56 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
And what if you create a view on top of a query containing a distinct
clause rather than using prepared statements? FWIW your patch doesn't
handle such a case at all, without even needing to drop constraints:

CREATE TABLE t (a int primary key, b int not null, c int);
INSERT INTO t VALUES(1, 1, 1), (2, 2, 2);
CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
EXPLAIN SELECT * FROM v1;

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
This error can be fixed with:

- num_of_rtables = bms_num_members(non_semi_anti_relids);
+ num_of_rtables = list_length(query->rtable);

This test case has also been added to the patch.
I also think this is not the right way to handle this optimization.
Do you have any other concerns?
Attachments:
0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch (application/octet-stream, +709-9)
On Tue, Feb 11, 2020 at 08:14:14PM +0800, Andy Fan wrote:
This error can be fixed with:

- num_of_rtables = bms_num_members(non_semi_anti_relids);
+ num_of_rtables = list_length(query->rtable);

This test case has also been added to the patch.

Do you have any other concerns?
Yes, it seems to be broken as soon as you alter the view's underlying table:
=# CREATE TABLE t (a int primary key, b int not null, c int);
CREATE TABLE
=# INSERT INTO t VALUEs(1, 1, 1), (2, 2, 2);
INSERT 0 2
=# CREATE UNIQUE INDEX t_idx1 on t(b);
CREATE INDEX
=# CREATE VIEW v1 AS SELECT DISTINCT b FROM t;
CREATE VIEW
=# EXPLAIN SELECT * FROM v1;
QUERY PLAN
-------------------------------------------------
Seq Scan on t (cost=0.00..1.02 rows=2 width=4)
(1 row)
=# EXPLAIN SELECT DISTINCT b FROM t;
QUERY PLAN
-------------------------------------------------
Seq Scan on t (cost=0.00..1.02 rows=2 width=4)
(1 row)
=# ALTER TABLE t ALTER COLUMN b DROP NOT NULL;
ALTER TABLE
=# EXPLAIN SELECT * FROM v1;
QUERY PLAN
-------------------------------------------------
Seq Scan on t (cost=0.00..1.02 rows=2 width=4)
(1 row)
=# EXPLAIN SELECT DISTINCT b FROM t;
QUERY PLAN
-------------------------------------------------------------
Unique (cost=1.03..1.04 rows=2 width=4)
-> Sort (cost=1.03..1.03 rows=2 width=4)
Sort Key: b
-> Seq Scan on t (cost=0.00..1.02 rows=2 width=4)
(4 rows)
On Mon, Feb 10, 2020 at 10:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, that doesn't seem like a very prudent approach to me, because it
confuses sorted-ness with unique-ness. PathKeys are about sorting,
but it's possible to have uniqueness guarantees without having sorted
anything, for instance via hashed grouping.
I haven't looked at this patch, but I'd expect it to use infrastructure
related to query_is_distinct_for(), and that doesn't deal in PathKeys.

Thanks for the pointer. I think there's another problem with my approach.
PathKeys are specific to paths, since the order of the result depends upon
the Path. But uniqueness is a property of the result, i.e. the relation,
and thus should be attached to RelOptInfo, as query_is_distinct_for() does.
I think uniqueness should bubble up the RelOptInfo tree, annotating each
RelOptInfo with the minimum set of TLEs which make the result from that
relation unique. Thus we could eliminate an extra Group/Unique node if the
underlying RelOptInfo's unique column set is a subset of the required
uniqueness.
--
Best Wishes,
Ashutosh Bapat
On Tue, Feb 11, 2020 at 8:27 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Yes, I removed it because it is the easiest way to do it. what is the
purpose of keeping the evidence?
Julien's example provides an explanation for this. The Query structure is
serialised into a view definition. Removing distinctClause from there means
that the view will never try to produce unique results.
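One way to see this (a hypothetical session using Julien's v1 view): the stored rule is deparsed from the saved Query tree, so the DISTINCT keyword only survives if distinctClause does.

```sql
-- Deparse the view's stored Query tree back into SQL:
SELECT pg_get_viewdef('v1'::regclass, true);
-- The result still contains DISTINCT.  If the optimization deleted
-- distinctClause from the stored Query itself, the deparsed definition,
-- and hence the view's semantics, would silently lose the DISTINCT.
```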
Suppose after a DDL, the prepared statement need to be re-parsed/planned
if it is not executed or it will prevent the DDL to happen.
The query will be replanned. I am not sure about reparsed though.
-- session 2
postgres=# alter table t alter column b drop not null;
ALTER TABLE-- session 1:
postgres=# explain execute st(1);
QUERY PLAN
-------------------------------------------------------------
Unique (cost=1.03..1.04 rows=1 width=4)
-> Sort (cost=1.03..1.04 rows=1 width=4)
Sort Key: b
-> Seq Scan on t (cost=0.00..1.02 rows=1 width=4)
Filter: (c = $1)
(5 rows)
Since this prepared statement is parameterised, PostgreSQL is replanning it
every time it gets executed; it's not using a stored prepared plan. Try it
without parameters. Also make sure that a prepared plan is used for
execution and not a new plan.
--
Best Wishes,
Ashutosh Bapat
On Tue, Feb 11, 2020 at 10:06:17PM +0530, Ashutosh Bapat wrote:
Julien's example provides an explanation for this. The Query structure is
serialised into a view definition. Removing distinctClause from there means
that the view will never try to produce unique results.
And I also think that this approach will have a lot of other unexpected
side effects. Isn't changing the Query going to affect pg_stat_statements
queryid computation, for instance?
On Thu, Feb 13, 2020 at 5:39 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Thanks, the two factors above are quite valuable, so erasing the
distinctClause is not reasonable; I will try another way.
On Wed, Feb 12, 2020 at 12:36 AM Ashutosh Bapat <
ashutosh.bapat.oss@gmail.com> wrote:
Julien's example provides an explanation for this. The Query structure is
serialised into a view definition. Removing distinctClause from there means
that the view will never try to produce unique results.
Actually that is not true. If a view is used in a query, its definition is
*copied* into the query tree, so modifying the query tree never touches
the view's definition. The issue Julien reported was caused by a typo.
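A simple way to see the copy-on-use behaviour (a sketch; t1 and v1 are made-up names):

```sql
CREATE TABLE t1 (a int PRIMARY KEY, b int);
CREATE VIEW v1 AS SELECT DISTINCT a, b FROM t1;
-- The stored definition keeps the DISTINCT regardless of what the planner
-- does with its private copy of the query tree:
SELECT pg_get_viewdef('v1'::regclass);
```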
Since this prepared statement is parameterised, PostgreSQL is replanning it
every time it gets executed. It's not using a stored prepared plan.
Even for a parameterised prepared statement, it is still possible to
generate a generic plan, so it will not be replanned every time. But
generic plan or not, after a DDL such as dropping a NOT NULL constraint,
PostgreSQL generates a new plan based on the stored query tree. However,
the query tree is *copied* again to generate the new plan, so even if I
modified the query tree, everything would still be OK.
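This can be sketched with a forced generic plan (assuming the same table t as in the earlier session; plan_cache_mode is available in PG 12+):

```sql
-- Force a stored generic plan instead of per-execution custom plans.
SET plan_cache_mode = force_generic_plan;
PREPARE st(int) AS SELECT DISTINCT b FROM t WHERE c = $1;
EXPLAIN EXECUTE st(1);   -- generic plan; the parameter appears as $1
ALTER TABLE t ALTER COLUMN b DROP NOT NULL;
EXPLAIN EXECUTE st(1);   -- invalidation forces replanning from a fresh
                         -- copy of the stored query tree
```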
In the end, I agree that modifying the query tree is not a good idea,
so my updated patch no longer does it.
Hi all:
Here is the updated patch. It reuses some functions from
query_is_distinct_for. I check the query's distinctness in
create_distinct_paths; if the result is already distinct, no distinct
paths are generated for it, so the query tree itself is left untouched.
Please see if you have any comments. Thanks
On Mon, Feb 24, 2020 at 8:38 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Attachments:
0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch (application/octet-stream) +679-20
Andy Fan <zhihui.fan1213@gmail.com> writes:
Please see if you have any comments. Thanks
The cfbot isn't at all happy with this. Its linux build is complaining
about a possibly-uninitialized variable, and then giving up:
https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993
The Windows build isn't using -Werror, but it is crashing in at least
two different spots in the regression tests:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778
I've not attempted to identify the cause of that.
At a high level, I'm a bit disturbed that this focuses only on DISTINCT
and doesn't (appear to) have any equivalent intelligence for GROUP BY,
though surely that offers much the same opportunities for optimization.
It seems like it'd be worthwhile to take a couple steps back and see
if we couldn't recast the logic to work for both.
Some other random comments:
* Don't especially like the way you broke query_is_distinct_for()
into two functions, especially when you then introduced a whole
lot of other code in between. That's just making reviewer's job
harder to see what changed. It makes the comments a bit disjointed
too, that is where you even had any. (Zero introductory comment
for query_is_distinct_agg is *not* up to project coding standards.
There are a lot of other undercommented places in this patch, too.)
* Definitely don't like having query_distinct_through_join re-open
all the relations. The data needed for this should get absorbed
while plancat.c has the relations open at the beginning. (Non-nullness
of columns, in particular, seems like it'll be useful for other
purposes; I'm a bit surprised the planner isn't using that already.)
* In general, query_distinct_through_join seems hugely messy, expensive,
and not clearly correct. If it is correct, the existing comments sure
aren't enough to explain what it is doing or why.
* Not entirely convinced that a new GUC is needed for this, but if
it is, you have to document it.
* I wouldn't bother with bms_array_free(), nor with any of the other
cleanup you've got at the bottom of query_distinct_through_join.
The planner leaks *lots* of memory, and this function isn't going to
be called so many times that it'll move the needle.
* There seem to be some pointless #include additions, eg in planner.c
the added code doesn't look to justify any of them. Please also
avoid unnecessary whitespace changes, those also slow down reviewing.
* I see you decided to add a new regression test file select_distinct_2.
That's a poor choice of name because it conflicts with our rules for the
naming of alternative output files. Besides which, you forgot to plug
it into the test schedule files, so it isn't actually getting run.
Is there a reason not to just add the new test cases to select_distinct?
* There are some changes in existing regression cases that aren't
visibly related to the stated purpose of the patch, eg it now
notices that "select distinct max(unique2) from tenk1" doesn't
require an explicit DISTINCT step. That's not wrong, but I wonder
if maybe you should subdivide this patch into more than one patch,
because that must be coming from some separate change. I'm also
wondering what caused the plan change in expected/join.out.
regards, tom lane
Thank you Tom for the review!
On Mon, Mar 2, 2020 at 4:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andy Fan <zhihui.fan1213@gmail.com> writes:
Please see if you have any comments. Thanks
The cfbot isn't at all happy with this. Its linux build is complaining
about a possibly-uninitialized variable, and then giving up:
https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993
The Windows build isn't using -Werror, but it is crashing in at least
two different spots in the regression tests:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778
I've not attempted to identify the cause of that.
Before submitting the patch I made sure that "make check-world" passed,
but since the compile options are not the same, I didn't catch the
possibly-uninitialized variable. As for the crash on Windows, I don't
have enough information yet; I will find a Windows server and reproduce
the cases.
I just found the link http://commitfest.cputube.org/ this morning; I will
make sure the next patch passes that test.
At a high level, I'm a bit disturbed that this focuses only on DISTINCT
and doesn't (appear to) have any equivalent intelligence for GROUP BY,
though surely that offers much the same opportunities for optimization.
It seems like it'd be worthwhile to take a couple steps back and see
if we couldn't recast the logic to work for both.
OK. GROUP BY looks a bit harder than DISTINCT because of the aggregate
functions. I will go through the code to see where to add this logic.
Some other random comments:
* Don't especially like the way you broke query_is_distinct_for()
into two functions, especially when you then introduced a whole
lot of other code in between.
I didn't notice this until you pointed it out. In this case I had to
break query_is_distinct_for into two functions, but it is true that the
two functions should be kept together.
That's just making reviewer's job
harder to see what changed. It makes the comments a bit disjointed
too, that is where you even had any. (Zero introductory comment
for query_is_distinct_agg is *not* up to project coding standards.
There are a lot of other undercommented places in this patch, too.)* Definitely don't like having query_distinct_through_join re-open
all the relations. The data needed for this should get absorbed
while plancat.c has the relations open at the beginning. (Non-nullness
of columns, in particular, seems like it'll be useful for other
purposes; I'm a bit surprised the planner isn't using that already.)
I can add new attributes to RelOptInfo and fill in the values in the
get_relation_info call.
* In general, query_distinct_through_join seems hugely messy, expensive,
and not clearly correct. If it is correct, the existing comments sure
aren't enough to explain what it is doing or why.
Removing the relation_open calls will make it a bit simpler; I will add
more comments to make it clearer in the next patch.
* There seem to be some pointless #include additions, eg in planner.c
the added code doesn't look to justify any of them. Please also
avoid unnecessary whitespace changes, those also slow down reviewing.
That may be because I added the header file at one point and then
refactored the code later, forgetting to remove the header file
accordingly. Do we need to rely on experience to tell whether a header
file is needed, or do we have any tool to tell it automatically?
* I see you decided to add a new regression test file select_distinct_2.
That's a poor choice of name because it conflicts with our rules for the
naming of alternative output files. Besides which, you forgot to plug
it into the test schedule files, so it isn't actually getting run.
Is there a reason not to just add the new test cases to select_distinct?
Adding them to select_distinct.sql is OK with me as well; actually I had
no particular reason to add a new file.
* There are some changes in existing regression cases that aren't
visibly related to the stated purpose of the patch, eg it now
notices that "select distinct max(unique2) from tenk1" doesn't
require an explicit DISTINCT step. That's not wrong, but I wonder
if maybe you should subdivide this patch into more than one patch,
because that must be coming from some separate change. I'm also
wondering what caused the plan change in expected/join.out.
For my purposes it should be in the same patch. The logic here is that
the SQL contains a DISTINCT and the query is already distinct because of
the max() function (the rule is defined in query_is_distinct_agg, which
is split out of the original query_is_distinct_for).
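As a sketch of that rule (tenk1 is the regression-test table Tom mentioned; the exact plan shape is not asserted here):

```sql
EXPLAIN (costs off) SELECT DISTINCT max(unique2) FROM tenk1;
-- An aggregate without GROUP BY returns a single row, so the result is
-- unique by definition and the patch can skip the explicit Unique step.
```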
On Tue, Mar 3, 2020 at 1:24 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
OK. GROUP BY looks a bit harder than DISTINCT because of the aggregate
functions. I will go through the code to see where to add this logic.
Can we guarantee that any_aggr_func(a) == a if only one row is returned?
If so, we could do some work on the pathtarget/reltarget by transforming
the Aggref into a raw expression. I checked the execution path of the
aggregate call; it looks like it depends on the Agg node, which is the
very thing we want to remove.
* There seem to be some pointless #include additions, eg in planner.c
the added code doesn't look to justify any of them. Please also
avoid unnecessary whitespace changes, those also slow down reviewing.
fixed some typo errors.
That may be because I added the header file at one point and then
refactored the code but forgot to remove the header file when it was no
longer necessary. Do we need to rely on experience to tell whether a
header file is needed, or do we have any tool to tell it automatically?
regards, Andy Fan