Index is not used for "IN (non-correlated subquery)"

Started by Georgeover 9 years ago11 messagesgeneral

pinkisntwell@gmail.com

over 9 years ago

My use case:

I have a table which I expect to reach a size of more than 10M rows.
This table will have a column "partner_uuid" which will have a maximum
envisioned cardinality of 10.

I want different users of my web application to see different subsets
of that table. I am using row-level security for this, with a USING
clause similar to this:

partner_uuid in (
select p.uuid
from userpartner up
join partner p on p.id = up.partner_id
where up.user_id::varchar = get_parameter('WEB_LOGGED_IN_USER_ID')
)

I want to make sure that when running SELECTs the index of the
partner_uuid column will be used. It appears though that it is not
being used. Is there some way to make the query planner use the index
for this case or will I always have to run a query to load the allowed
partner_uuids, add them to my query so that they are hardcoded, and
then finally run the query so that it uses the index?

For example, compare the following simplified and similar two cases,
one of which uses the index and one which does not:

explain select * from wg3ppbm_transaction where partner_uuid in
('0f50ce66-6dcf-11e6-8b77-86f30ca893d3');

"Index Scan using wg3ppbm_transaction_f9b3d985 on wg3ppbm_transaction
(cost=0.28..227.67 rows=323 width=482)"
" Index Cond: ((partner_uuid)::text =
'0f50ce66-6dcf-11e6-8b77-86f30ca893d3'::text)"

explain select * from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..425.72 rows=2960 width=482)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..375.19 rows=5919 width=482)"
" -> Hash (cost=2.06..2.06 rows=1 width=37)"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37)"
" Join Filter: (up.partner_id = p.id)"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41)"

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: George (#1)

Re: Index is not used for "IN (non-correlated subquery)"

George <pinkisntwell@gmail.com> writes:

explain select * from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..425.72 rows=2960 width=482)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..375.19 rows=5919 width=482)"
" -> Hash (cost=2.06..2.06 rows=1 width=37)"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37)"
" Join Filter: (up.partner_id = p.id)"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41)"

This plan is expecting to have to return about half of the rows in
wg3ppbm_transaction, a situation for which an indexscan would NOT
be a better choice. The usual rule of thumb is that you need to be
retrieving at most one or two percent of a table's rows for an indexscan
on it to be faster than a seqscan.

I think however that the "half" may be a default estimate occasioned
by the other tables being empty and therefore not having any statistics.
Another rule of thumb is that the plans you get for tiny tables have
little to do with what happens once there's lots of data.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Merlin Moncure

mmoncure@gmail.com

over 9 years ago

In reply to: Tom Lane (#2)

Re: Index is not used for "IN (non-correlated subquery)"

On Wed, Nov 30, 2016 at 10:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

George <pinkisntwell@gmail.com> writes:

explain select * from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..425.72 rows=2960 width=482)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..375.19 rows=5919 width=482)"
" -> Hash (cost=2.06..2.06 rows=1 width=37)"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37)"
" Join Filter: (up.partner_id = p.id)"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41)"

This plan is expecting to have to return about half of the rows in
wg3ppbm_transaction, a situation for which an indexscan would NOT
be a better choice. The usual rule of thumb is that you need to be
retrieving at most one or two percent of a table's rows for an indexscan
on it to be faster than a seqscan.

I think however that the "half" may be a default estimate occasioned
by the other tables being empty and therefore not having any statistics.
Another rule of thumb is that the plans you get for tiny tables have
little to do with what happens once there's lots of data.

Yeah, don't make query plan assumptions against empty or nearly empty
tables. As the data grows, the plans will suitably change. Perhaps
OP just recently loaded a bunch of data and the tables haven't been
analyzed yet?

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

George

pinkisntwell@gmail.com

over 9 years ago

In reply to: Merlin Moncure (#3)

Re: Index is not used for "IN (non-correlated subquery)"

On Wed, Nov 30, 2016 at 6:45 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Nov 30, 2016 at 10:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

George <pinkisntwell@gmail.com> writes:

explain select * from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..425.72 rows=2960 width=482)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..375.19 rows=5919 width=482)"
" -> Hash (cost=2.06..2.06 rows=1 width=37)"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37)"
" Join Filter: (up.partner_id = p.id)"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41)"

This plan is expecting to have to return about half of the rows in
wg3ppbm_transaction, a situation for which an indexscan would NOT
be a better choice. The usual rule of thumb is that you need to be
retrieving at most one or two percent of a table's rows for an indexscan
on it to be faster than a seqscan.

I think however that the "half" may be a default estimate occasioned
by the other tables being empty and therefore not having any statistics.
Another rule of thumb is that the plans you get for tiny tables have
little to do with what happens once there's lots of data.

Yeah, don't make query plan assumptions against empty or nearly empty
tables. As the data grows, the plans will suitably change. Perhaps
OP just recently loaded a bunch of data and the tables haven't been
analyzed yet?

I just added a significant number of rows to the table. I now have
1.3M rows in total but only 8K rows that contain the value I am
seeking. I also ran ANALYZE after loading the data. The query plans
for the two queries did not change. Also, the simple query returns in
45 ms while the one with the subquery needs 1.5 s, i.e. it is about
30x slower.

So there is definitely something wrong here. This situation makes many
row-level security use cases cumbersome since you need to have
almost the same WHERE clause both in the row-level security policy and
in every SELECT query in order for the index to be used.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Merlin Moncure

mmoncure@gmail.com

over 9 years ago

In reply to: George (#4)

Re: Index is not used for "IN (non-correlated subquery)"

On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:

On Wed, Nov 30, 2016 at 6:45 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Nov 30, 2016 at 10:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

George <pinkisntwell@gmail.com> writes:

explain select * from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..425.72 rows=2960 width=482)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..375.19 rows=5919 width=482)"
" -> Hash (cost=2.06..2.06 rows=1 width=37)"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37)"
" Join Filter: (up.partner_id = p.id)"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41)"

This plan is expecting to have to return about half of the rows in
wg3ppbm_transaction, a situation for which an indexscan would NOT
be a better choice. The usual rule of thumb is that you need to be
retrieving at most one or two percent of a table's rows for an indexscan
on it to be faster than a seqscan.

I think however that the "half" may be a default estimate occasioned
by the other tables being empty and therefore not having any statistics.
Another rule of thumb is that the plans you get for tiny tables have
little to do with what happens once there's lots of data.

Yeah, don't make query plan assumptions against empty or nearly empty
tables. As the data grows, the plans will suitably change. Perhaps
OP just recently loaded a bunch of data and the tables haven't been
analyzed yet?

I just added a significant number of rows to the table. I now have
1.3M rows in total but only 8K rows that contain the value I am
seeking. I also ran ANALYZE after loading the data. The query plans
for the two queries did not change. Also, the simple query returns in
45 ms while the one with the subquery needs 1.5 s, i.e. it is about
30x slower.

So there is definitely something wrong here. This situation makes many
row-level security use cases cumbersome since you need to have
almost the same WHERE clause both in the row-level security policy and
in every SELECT query in order for the index to be used.

can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?

merlin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Merlin Moncure (#5)

Re: Index is not used for "IN (non-correlated subquery)"

Merlin Moncure <mmoncure@gmail.com> writes:

On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:

So there is definitely something wrong here. This situation makes many
row-level security use cases cumbersome since you need to have
almost the same WHERE clause both in the row-level security policy and
in every SELECT query in order for the index to be used.

can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?

Planning for queries affected by RLS is definitely an area where we need
to improve (I'm working on a patch for that). Whether the OP's particular
query is being hit by that is impossible to tell, though, since there
isn't any actual RLS usage in the doubtless-oversimplified example.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

George

pinkisntwell@gmail.com

over 9 years ago

In reply to: Tom Lane (#6)

Re: Index is not used for "IN (non-correlated subquery)"

On Wed, Nov 30, 2016 at 8:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:

So there is definitely something wrong here. This situation makes many
row-level security use cases cumbersome since you need to have
almost the same WHERE clause both in the row-level security policy and
in every SELECT query in order for the index to be used.

can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?

Planning for queries affected by RLS is definitely an area where we need
to improve (I'm working on a patch for that). Whether the OP's particular
query is being hit by that is impossible to tell, though, since there
isn't any actual RLS usage in the doubtless-oversimplified example.

The example is not over-simplified, I basically just took the clause
that the RLS would have to add and stuck it in the WHERE. Thus I
verified that even the normal, non-RLS planner is affected.

When I get to work tomorrow morning (Europe) I will post the EXPLAIN
ANALYZE output.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

George

pinkisntwell@gmail.com

over 9 years ago

In reply to: George (#7)

Re: Index is not used for "IN (non-correlated subquery)"

On Wed, Nov 30, 2016 at 10:08 PM, George <pinkisntwell@gmail.com> wrote:

On Wed, Nov 30, 2016 at 8:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:

So there is definitely something wrong here. This situation makes many
row-level security use cases cumbersome since you need to have
almost the same WHERE clause both in the row-level security policy and
in every SELECT query in order for the index to be used.

can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?

Planning for queries affected by RLS is definitely an area where we need
to improve (I'm working on a patch for that). Whether the OP's particular
query is being hit by that is impossible to tell, though, since there
isn't any actual RLS usage in the doubtless-oversimplified example.

The example is not over-simplified, I basically just took the clause
that the RLS would have to add and stuck it in the WHERE. Thus I
verified that even the normal, non-RLS planner is affected.

When I get to work tomorrow morning (Europe) I will post the EXPLAIN
ANALYZE output.

Here are the EXPLAIN ANALYZE results:

explain analyze
select *
from wg3ppbm_transaction
where partner_uuid in ('80228212-2247-4bdd-a130-80239cb33c5c');

"Index Scan using wg3ppbm_transaction_f9b3d985 on wg3ppbm_transaction
(cost=0.43..2838.57 rows=8186 width=380) (actual time=0.458..5.265
rows=7827 loops=1)"
" Index Cond: ((partner_uuid)::text =
'80228212-2247-4bdd-a130-80239cb33c5c'::text)"
"Planning time: 0.155 ms"
"Execution time: 6.992 ms"

explain analyze select *
from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..65628.14 rows=663727 width=380) (actual
time=0.346..1542.730 rows=1 loops=1)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..54757.54
rows=1327454 width=380) (actual time=0.004..878.568 rows=1327587
loops=1)"
" -> Hash (cost=2.06..2.06 rows=1 width=37) (actual
time=0.017..0.017 rows=1 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 5kB"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37) (actual
time=0.011..0.012 rows=1 loops=1)"
" Join Filter: (up.partner_id = p.id)"
" Rows Removed by Join Filter: 1"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1
loops=1)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41) (actual time=0.001..0.001 rows=2 loops=1)"
"Planning time: 1.484 ms"
"Execution time: 1542.799 ms"

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: George (#8)

Re: Index is not used for "IN (non-correlated subquery)"

George <pinkisntwell@gmail.com> writes:

explain analyze select *
from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..65628.14 rows=663727 width=380) (actual
time=0.346..1542.730 rows=1 loops=1)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..54757.54
rows=1327454 width=380) (actual time=0.004..878.568 rows=1327587
loops=1)"

So you're still getting the 50% default estimate, which is why it doesn't
want to use the index ...

" -> Hash (cost=2.06..2.06 rows=1 width=37) (actual
time=0.017..0.017 rows=1 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 5kB"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37) (actual
time=0.011..0.012 rows=1 loops=1)"
" Join Filter: (up.partner_id = p.id)"
" Rows Removed by Join Filter: 1"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1
loops=1)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41) (actual time=0.001..0.001 rows=2 loops=1)"

... and you still don't have any meaningful number of rows in
wg3ppbm_userpartner or wg3ppbm_partner. However, I don't understand how
it knows that there's only one or two rows in those tables and yet is
producing the stupid default estimate for the semijoin. I spent some time
trying to duplicate that behavior, without success. What PG version is
that, exactly? Have you vacuumed and/or analyzed those two tables? What
do you get for

select * from pg_stats where tablename = 'wg3ppbm_userpartner';

and likewise for wg3ppbm_partner?

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10

George

pinkisntwell@gmail.com

over 9 years ago

In reply to: Tom Lane (#9)

Re: Index is not used for "IN (non-correlated subquery)"

On Thu, Dec 1, 2016 at 6:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

George <pinkisntwell@gmail.com> writes:

explain analyze select *
from wg3ppbm_transaction where partner_uuid in (
select p.uuid
from wg3ppbm_userpartner up
join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join (cost=2.07..65628.14 rows=663727 width=380) (actual
time=0.346..1542.730 rows=1 loops=1)"
" Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
" -> Seq Scan on wg3ppbm_transaction (cost=0.00..54757.54
rows=1327454 width=380) (actual time=0.004..878.568 rows=1327587
loops=1)"

So you're still getting the 50% default estimate, which is why it doesn't
want to use the index ...

" -> Hash (cost=2.06..2.06 rows=1 width=37) (actual
time=0.017..0.017 rows=1 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 5kB"
" -> Nested Loop (cost=0.00..2.06 rows=1 width=37) (actual
time=0.011..0.012 rows=1 loops=1)"
" Join Filter: (up.partner_id = p.id)"
" Rows Removed by Join Filter: 1"
" -> Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1
loops=1)"
" -> Seq Scan on wg3ppbm_partner p (cost=0.00..1.02
rows=2 width=41) (actual time=0.001..0.001 rows=2 loops=1)"

... and you still don't have any meaningful number of rows in
wg3ppbm_userpartner or wg3ppbm_partner. However, I don't understand how
it knows that there's only one or two rows in those tables and yet is
producing the stupid default estimate for the semijoin. I spent some time
trying to duplicate that behavior, without success. What PG version is
that, exactly?

"PostgreSQL 9.5.5 on i686-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-17), 32-bit"

Have you vacuumed and/or analyzed those two tables?

Yes.

What
do you get for

select * from pg_stats where tablename = 'wg3ppbm_userpartner';
and likewise for wg3ppbm_partner?

It is a wide table. Do you want me to dump csv here?

In the meantime, with the help of the folks at #postgresql I was able
to wisen up the query planner by using either one of the following two
settings:

SET enable_seqscan = false

SET cpu_tuple_cost = 0.1

I think this should be helpful.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: George (#10)

Re: Index is not used for "IN (non-correlated subquery)"

George <pinkisntwell@gmail.com> writes:

On Thu, Dec 1, 2016 at 6:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What
do you get for
select * from pg_stats where tablename = 'wg3ppbm_userpartner';
and likewise for wg3ppbm_partner?

It is a wide table. Do you want me to dump csv here?

Shouldn't be *that* wide, with only one row in the underlying table ;-)
Maybe psql \x format would be suitable.

In the meantime, with the help of the folks at #postgresql I was able
to wisen up the query planner by using either one of the following two
settings:
SET enable_seqscan = false
SET cpu_tuple_cost = 0.1

Well, that proves it is considering the indexscan option. But there is
something funny going on, if you have stats for these tables and yet
you're getting a default rowcount estimate.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general