Avoid parallel full and right join paths.

Started by Mithun Cyover 9 years ago6 messages
#1Mithun Cy
mithun.cy@enterprisedb.com
1 attachment(s)

Tests:
create table mytab(x int,x1 char(9),x2 varchar(9));
create table mytab1(y int,y1 char(9),y2 varchar(9));
insert into mytab values (generate_series(1,50000),'aa','aaa');
insert into mytab1 values (generate_series(1,10000),'aa','aaa');
insert into mytab values (generate_series(1,500000),'aa','aaa');
insert into mytab values (generate_series(1,500000),'aa','aaa');
analyze mytab;
analyze mytab1;
vacuum mytab;
vacuum mytab1;

set max_parallel_degree=0;
SET
df=# SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;
count
-------
30000
(1 row)

# set max_parallel_degree=5;
SET
df=# SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;
count
-------
39089
(1 row)

Casue:
======
Normal plan
==========
explain SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;postgres-#
QUERY PLAN
------------------------------------------------------------------------------
Aggregate (cost=21682.71..21682.72 rows=1 width=8)
-> Hash Right Join (cost=289.00..21629.07 rows=21457 width=0)
Hash Cond: (mytab.x = mytab1.y)
-> Seq Scan on mytab (cost=0.00..17188.00 rows=1050000 width=4)
-> Hash (cost=164.00..164.00 rows=10000 width=4)
-> Seq Scan on mytab1 (cost=0.00..164.00 rows=10000 width=4)
=================================================================

Parallel plan.
==========
explain SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;postgres-#
QUERY PLAN
-----------------------------------------------------------------------------------------------
Finalize Aggregate (cost=14135.88..14135.89 rows=1 width=8)
-> Gather (cost=14135.67..14135.88 rows=2 width=8)
Number of Workers: 2
-> Partial Aggregate (cost=13135.67..13135.68 rows=1 width=8)
-> Hash Right Join (cost=289.00..13082.02 rows=21457 width=0)
Hash Cond: (mytab.x = mytab1.y)
-> Parallel Seq Scan on mytab (cost=0.00..11063.00 rows=437500 width=4)
-> Hash (cost=164.00..164.00 rows=10000 width=4)
-> Seq Scan on mytab1 (cost=0.00..164.00 rows=10000 width=4)

As above Right and Full join paths cannot be parallel as they can produce
false null extended rows because outer table is partial path and not
completely visible.
Adding a patch to fix same.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachments:

avoid_parallel_full_right_join.patchapplication/octet-stream; name=avoid_parallel_full_right_join.patchDownload
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 41b60d0..58078b5 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -1450,10 +1450,15 @@ hash_inner_and_outer(PlannerInfo *root,
 		 * If the joinrel is parallel-safe, we may be able to consider a
 		 * partial hash join.  However, we can't handle JOIN_UNIQUE_OUTER,
 		 * because the outer path will be partial, and therefore we won't be
-		 * able to properly guarantee uniqueness.  Also, the resulting path
+		 * able to properly guarantee uniqueness. Similarly we can't handle
+		 * JOIN_FULL and JOIN_RIGHT because they can produce false null
+		 * extended rows as outer path is partial. Also, the resulting path
 		 * must not be parameterized.
 		 */
-		if (joinrel->consider_parallel && jointype != JOIN_UNIQUE_OUTER &&
+		if (joinrel->consider_parallel &&
+			jointype != JOIN_UNIQUE_OUTER &&
+			jointype != JOIN_FULL &&
+			jointype != JOIN_RIGHT &&
 			outerrel->partial_pathlist != NIL &&
 			bms_is_empty(joinrel->lateral_relids))
 		{
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index cafbc5e..e2028d4 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5260,3 +5260,30 @@ ERROR:  invalid reference to FROM-clause entry for table "xx1"
 LINE 1: ...xx1 using lateral (select * from int4_tbl where f1 = x1) ss;
                                                                 ^
 HINT:  There is an entry for table "xx1", but it cannot be referenced from this part of the query.
+-- avoid parallel full and right join path.
+create table mytab(x int,x1 char(20),x2 varchar(9));
+create table mytab1(y int,y1 char(20),y2 varchar(9));
+insert into mytab values (generate_series(1,200000),'aaaaaaaaaaaaaaaa','aaaaaaaaa');
+insert into mytab1 values (generate_series(1,100000),'aaaaaaaaaaaaaaaa','aaaaaaaaa');
+analyze mytab;
+analyze mytab1;
+select count(*) FROM mytab right outer join mytab1 ON mytab.x = mytab1.y;
+ count  
+--------
+ 100000
+(1 row)
+
+select count(*) FROM mytab full outer join mytab1 ON mytab.x = mytab1.y;
+ count  
+--------
+ 200000
+(1 row)
+
+select count(*) FROM mytab left outer join mytab1 ON mytab.x = mytab1.y;
+ count  
+--------
+ 200000
+(1 row)
+
+drop table mytab;
+drop table mytab1;
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 3430f91..52d29c4 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1696,3 +1696,17 @@ update xx1 set x2 = f1 from xx1, lateral (select * from int4_tbl where f1 = x1)
 delete from xx1 using (select * from int4_tbl where f1 = x1) ss;
 delete from xx1 using (select * from int4_tbl where f1 = xx1.x1) ss;
 delete from xx1 using lateral (select * from int4_tbl where f1 = x1) ss;
+
+-- avoid parallel full and right join path.
+create table mytab(x int,x1 char(20),x2 varchar(9));
+create table mytab1(y int,y1 char(20),y2 varchar(9));
+insert into mytab values (generate_series(1,200000),'aaaaaaaaaaaaaaaa','aaaaaaaaa');
+insert into mytab1 values (generate_series(1,100000),'aaaaaaaaaaaaaaaa','aaaaaaaaa');
+analyze mytab;
+analyze mytab1;
+select count(*) FROM mytab right outer join mytab1 ON mytab.x = mytab1.y;
+select count(*) FROM mytab full outer join mytab1 ON mytab.x = mytab1.y;
+select count(*) FROM mytab left outer join mytab1 ON mytab.x = mytab1.y;
+
+drop table mytab;
+drop table mytab1;
#2Robert Haas
robertmhaas@gmail.com
In reply to: Mithun Cy (#1)
Re: Avoid parallel full and right join paths.

On Tue, Apr 19, 2016 at 10:21 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:

Tests:
create table mytab(x int,x1 char(9),x2 varchar(9));
create table mytab1(y int,y1 char(9),y2 varchar(9));
insert into mytab values (generate_series(1,50000),'aa','aaa');
insert into mytab1 values (generate_series(1,10000),'aa','aaa');
insert into mytab values (generate_series(1,500000),'aa','aaa');
insert into mytab values (generate_series(1,500000),'aa','aaa');
analyze mytab;
analyze mytab1;
vacuum mytab;
vacuum mytab1;

set max_parallel_degree=0;
SET
df=# SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;
count
-------
30000
(1 row)

# set max_parallel_degree=5;
SET
df=# SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;
count
-------
39089
(1 row)

Casue:
======
Normal plan
==========
explain SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;postgres-#
QUERY PLAN
------------------------------------------------------------------------------
Aggregate (cost=21682.71..21682.72 rows=1 width=8)
-> Hash Right Join (cost=289.00..21629.07 rows=21457 width=0)
Hash Cond: (mytab.x = mytab1.y)
-> Seq Scan on mytab (cost=0.00..17188.00 rows=1050000 width=4)
-> Hash (cost=164.00..164.00 rows=10000 width=4)
-> Seq Scan on mytab1 (cost=0.00..164.00 rows=10000 width=4)
=================================================================

Parallel plan.
==========
explain SELECT count(*) FROM mytab RIGHT OUTER JOIN mytab1
ON mytab.x = mytab1.y;postgres-#
QUERY PLAN
-----------------------------------------------------------------------------------------------
Finalize Aggregate (cost=14135.88..14135.89 rows=1 width=8)
-> Gather (cost=14135.67..14135.88 rows=2 width=8)
Number of Workers: 2
-> Partial Aggregate (cost=13135.67..13135.68 rows=1 width=8)
-> Hash Right Join (cost=289.00..13082.02 rows=21457 width=0)
Hash Cond: (mytab.x = mytab1.y)
-> Parallel Seq Scan on mytab (cost=0.00..11063.00 rows=437500 width=4)
-> Hash (cost=164.00..164.00 rows=10000 width=4)
-> Seq Scan on mytab1 (cost=0.00..164.00 rows=10000 width=4)

As above Right and Full join paths cannot be parallel as they can produce
false null extended rows because outer table is partial path and not
completely visible.
Adding a patch to fix same.

Committed. But I think the regression test needs more thought, so I
left that out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#2)
Re: Avoid parallel full and right join paths.

On Wed, Apr 20, 2016 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed. But I think the regression test needs more thought, so I
left that out.

It would be nice if there was a fuzz testing infrastructure that
verified that parallel plans produce the same answer as serial plans.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael.paquier@gmail.com
In reply to: Peter Geoghegan (#3)
Re: Avoid parallel full and right join paths.

On Thu, Apr 21, 2016 at 7:13 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Apr 20, 2016 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed. But I think the regression test needs more thought, so I
left that out.

It would be nice if there was a fuzz testing infrastructure that
verified that parallel plans produce the same answer as serial plans.

Results of parallel plans and serial plans could be stored in
temporary tables in the test, then that's a matter of comparing them I
guess. That's largely doable.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#4)
Re: Avoid parallel full and right join paths.

Michael Paquier wrote:

On Thu, Apr 21, 2016 at 7:13 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Apr 20, 2016 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed. But I think the regression test needs more thought, so I
left that out.

It would be nice if there was a fuzz testing infrastructure that
verified that parallel plans produce the same answer as serial plans.

Results of parallel plans and serial plans could be stored in
temporary tables in the test, then that's a matter of comparing them I
guess. That's largely doable.

The brin.sql test does that ...

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Peter Geoghegan
pg@heroku.com
In reply to: Alvaro Herrera (#5)
Re: Avoid parallel full and right join paths.

On Wed, Apr 20, 2016 at 6:43 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

The brin.sql test does that ...

I actually copied brin.sql when creating regression tests for external
sorting, primarily because I wanted to test a variety of collations,
without having any control of what they happen to be on the target.
Those went into amcheck's regression tests, and so have yet to be
committed.

I think that your approach there has plenty to recommend it, at least
where requirements are more complicated.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers