parallel sequential scan returns extraneous rows
I have found a nasty bug when using parallel sequential scans with an exists clause on postgresql 9.6.1. I have found that the rows returned using parallel sequential scan plans are incorrect (though I haven’t dug sufficiently to know in what ways). See below for an example of the issue.
denver=# set max_parallel_workers_per_gather = 0;
SET
denver=# select count(*)
denver-# from users u
denver-# join address a on (a.users_id = u.id)
denver-# where exists (select 1 from address where users_id = u.id)
count
---------
9486910
(1 row)
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Aggregate (cost=2117655.96..2117655.97 rows=1 width=8)
-> Merge Join (cost=1257.34..2094242.48 rows=9365393 width=0)
Merge Cond: (u.id = a.users_id)
-> Merge Semi Join (cost=495.43..1607025.52 rows=2824687 width=8)
Merge Cond: (u.id = address.users_id)
-> Index Only Scan using users_pk on users u (cost=0.43..1018930.31 rows=11648927 width=4)
-> Index Only Scan using address_idx01 on address (cost=0.43..456495.80 rows=9365393 width=4)
-> Index Only Scan using address_idx01 on address a (cost=0.43..456495.80 rows=9365393 width=4)
(8 rows)
denver=# set max_parallel_workers_per_gather = 1;
SET
denver=# select count(*)
denver-# from users u
denver-# join address a on (a.users_id = u.id)
denver-# where exists (select 1 from address where users_id = u.id);
count
----------
29288954
(1 row)
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Aggregate (cost=1889898.47..1889898.48 rows=1 width=8)
-> Hash Join (cost=1401575.70..1866484.99 rows=9365393 width=0)
Hash Cond: (a.users_id = u.id)
-> Seq Scan on address a (cost=0.00..299463.93 rows=9365393 width=4)
-> Hash (cost=1355233.12..1355233.12 rows=2824687 width=8)
-> Gather (cost=571820.86..1355233.12 rows=2824687 width=8)
Workers Planned: 1
-> Hash Join (cost=570820.86..1071764.42 rows=2824687 width=8)
Hash Cond: (address.users_id = u.id)
-> Parallel Seq Scan on address (cost=0.00..260900.55 rows=5509055 width=4)
-> Hash (cost=379705.27..379705.27 rows=11648927 width=4)
-> Seq Scan on users u (cost=0.00..379705.27 rows=11648927 width=4)
(12 rows)
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Michael Day <blake@rcmail.com> writes:
I have found a nasty bug when using parallel sequential scans with an exists clause on postgresql 9.6.1. I have found that the rows returned using parallel sequential scan plans are incorrect (though I haven’t dug sufficiently to know in what ways). See below for an example of the issue.
Hm, looks like a planner error: it seems to be forgetting that the join
to "address" should be a semijoin. "address" should either be on the
inside of a "Semi" join (as in your first, correct-looking plan) or be
passed through a unique-ification stage such as a HashAgg. Clearly,
neither thing is happening in the second plan.
I couldn't reproduce this in a bit of trying, however. Can you come
up with a self-contained test case?
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
I was able to reproduce with this set of data.
create table users (id integer);
create table address (id integer, users_id integer);
insert into users select s from generate_series(1,1000000) s;
insert into address select s, s/2 from generate_series(1,2000000) s;
analyze users;
analyze address;
set max_parallel_workers_per_gather = 0;
select count(*)
from users u
join address a on (a.users_id = u.id)
where exists (select 1 from address where users_id = u.id);
set max_parallel_workers_per_gather = 1;
select count(*)
from users u
join address a on (a.users_id = u.id)
where exists (select 1 from address where users_id = u.id);
On 11/29/16, 11:19 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
Michael Day <blake@rcmail.com> writes:
I have found a nasty bug when using parallel sequential scans with an exists clause on postgresql 9.6.1. I have found that the rows returned using parallel sequential scan plans are incorrect (though I haven’t dug sufficiently to know in what ways). See below for an example of the issue.
Hm, looks like a planner error: it seems to be forgetting that the join
to "address" should be a semijoin. "address" should either be on the
inside of a "Semi" join (as in your first, correct-looking plan) or be
passed through a unique-ification stage such as a HashAgg. Clearly,
neither thing is happening in the second plan.
I couldn't reproduce this in a bit of trying, however. Can you come
up with a self-contained test case?
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Michael Day <blake@rcmail.com> writes:
I was able to reproduce with this set of data.
Ah, thanks for the test case! Looks like the handling of
JOIN_UNIQUE_INNER cases for parallel plans is completely confused ---
it forgot about needing to unique-ify the inner rel, and I'm rather
surprised it didn't trip over some Asserts while at it.
Will fix, thanks.
regards, tom lane
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs