parallel sequential scan returns extraneous rows

Started by Michael Dayover 9 years ago4 messagesbugs
Jump to latest
#1Michael Day
blake@rcmail.com

I have found a nasty bug when using parallel sequential scans with an exists clause on postgresql 9.6.1. I have found that the rows returned using parallel sequential scan plans are incorrect (though I haven’t dug sufficiently to know in what ways). See below for an example of the issue.

denver=# set max_parallel_workers_per_gather = 0;
SET
denver=# select count(*)
denver-# from users u
denver-# join address a on (a.users_id = u.id)
denver-# where exists (select 1 from address where users_id = u.id)
count
---------
9486910
(1 row)

QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Aggregate (cost=2117655.96..2117655.97 rows=1 width=8)
-> Merge Join (cost=1257.34..2094242.48 rows=9365393 width=0)
Merge Cond: (u.id = a.users_id)
-> Merge Semi Join (cost=495.43..1607025.52 rows=2824687 width=8)
Merge Cond: (u.id = address.users_id)
-> Index Only Scan using users_pk on users u (cost=0.43..1018930.31 rows=11648927 width=4)
-> Index Only Scan using address_idx01 on address (cost=0.43..456495.80 rows=9365393 width=4)
-> Index Only Scan using address_idx01 on address a (cost=0.43..456495.80 rows=9365393 width=4)
(8 rows)

denver=# set max_parallel_workers_per_gather = 1;
SET
denver=# select count(*)
denver-# from users u
denver-# join address a on (a.users_id = u.id)
denver-# where exists (select 1 from address where users_id = u.id);
count
----------
29288954
(1 row)

QUERY PLAN
---------------------------------------------------------------------------------------------------------
Aggregate (cost=1889898.47..1889898.48 rows=1 width=8)
-> Hash Join (cost=1401575.70..1866484.99 rows=9365393 width=0)
Hash Cond: (a.users_id = u.id)
-> Seq Scan on address a (cost=0.00..299463.93 rows=9365393 width=4)
-> Hash (cost=1355233.12..1355233.12 rows=2824687 width=8)
-> Gather (cost=571820.86..1355233.12 rows=2824687 width=8)
Workers Planned: 1
-> Hash Join (cost=570820.86..1071764.42 rows=2824687 width=8)
Hash Cond: (address.users_id = u.id)
-> Parallel Seq Scan on address (cost=0.00..260900.55 rows=5509055 width=4)
-> Hash (cost=379705.27..379705.27 rows=11648927 width=4)
-> Seq Scan on users u (cost=0.00..379705.27 rows=11648927 width=4)
(12 rows)

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Day (#1)
Re: parallel sequential scan returns extraneous rows

Michael Day <blake@rcmail.com> writes:

I have found a nasty bug when using parallel sequential scans with an exists clause on postgresql 9.6.1. I have found that the rows returned using parallel sequential scan plans are incorrect (though I haven’t dug sufficiently to know in what ways). See below for an example of the issue.

Hm, looks like a planner error: it seems to be forgetting that the join
to "address" should be a semijoin. "address" should either be on the
inside of a "Semi" join (as in your first, correct-looking plan) or be
passed through a unique-ification stage such as a HashAgg. Clearly,
neither thing is happening in the second plan.

I couldn't reproduce this in a bit of trying, however. Can you come
up with a self-contained test case?

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Michael Day
blake@rcmail.com
In reply to: Tom Lane (#2)
Re: parallel sequential scan returns extraneous rows

I was able to reproduce with this set of data.

create table users (id integer);
create table address (id integer, users_id integer);

insert into users select s from generate_series(1,1000000) s;
insert into address select s, s/2 from generate_series(1,2000000) s;

analyze users;
analyze address;

set max_parallel_workers_per_gather = 0;

select count(*)
from users u
join address a on (a.users_id = u.id)
where exists (select 1 from address where users_id = u.id);

set max_parallel_workers_per_gather = 1;

select count(*)
from users u
join address a on (a.users_id = u.id)
where exists (select 1 from address where users_id = u.id);

On 11/29/16, 11:19 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

Michael Day <blake@rcmail.com> writes:

I have found a nasty bug when using parallel sequential scans with an exists clause on postgresql 9.6.1. I have found that the rows returned using parallel sequential scan plans are incorrect (though I haven’t dug sufficiently to know in what ways). See below for an example of the issue.

Hm, looks like a planner error: it seems to be forgetting that the join
to "address" should be a semijoin. "address" should either be on the
inside of a "Semi" join (as in your first, correct-looking plan) or be
passed through a unique-ification stage such as a HashAgg. Clearly,
neither thing is happening in the second plan.

I couldn't reproduce this in a bit of trying, however. Can you come
up with a self-contained test case?

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Day (#3)
Re: parallel sequential scan returns extraneous rows

Michael Day <blake@rcmail.com> writes:

I was able to reproduce with this set of data.

Ah, thanks for the test case! Looks like the handling of
JOIN_UNIQUE_INNER cases for parallel plans is completely confused ---
it forgot about needing to unique-ify the inner rel, and I'm rather
surprised it didn't trip over some Asserts while at it.

Will fix, thanks.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs