Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

Started by otar shavadzeover 9 years ago10 messagesgeneral

Jump to latest

otar shavadze

oshavadze@gmail.com

over 9 years ago

I increased rows limit from 50 to 500, because now, difference visible much
better, so query is:

explain analyze *SELECT * FROM table_name WHERE my_array @>
'{x}'::integer[] ORDER BY id desc LIMIT 500*

with GIN index:

"Limit (cost=107.83..109.08 rows=500 width=905) (actual
time=978.256..978.293 rows=500 loops=1)"
" -> Sort (cost=107.83..109.16 rows=533 width=905) (actual
time=978.254..978.272 rows=500 loops=1)"
" Sort Key: id DESC"
" Sort Method: top-N heapsort Memory: 589kB"
" -> Bitmap Heap Scan on table_name (cost=23.93..83.69 rows=533
width=905) (actual time=50.612..917.422 rows=90049 loops=1)"
" Recheck Cond: (my_array @> '{8}'::integer[])"
" Heap Blocks: exact=46525"
" -> Bitmap Index Scan on idx (cost=0.00..23.80 rows=533
width=0) (actual time=35.054..35.054 rows=90052 loops=1)"
" Index Cond: (my_array @> '{8}'::integer[])"
"Planning time: 0.202 ms"
"Execution time: 978.718 ms"

Without index:

"Limit (cost=7723.12..7724.37 rows=500 width=122) (actual
time=184.041..184.102 rows=500 loops=1)"
" -> Sort (cost=7723.12..7724.45 rows=534 width=122) (actual
time=184.039..184.052 rows=500 loops=1)"
" Sort Key: id DESC"
" Sort Method: top-N heapsort Memory: 157kB"
" -> Seq Scan on table_name (cost=0.00..7698.93 rows=534 width=122)
(actual time=0.020..176.079 rows=84006 loops=1)"
" Filter: (my_array @> '{14}'::integer[])"
" Rows Removed by Filter: 450230"
"Planning time: 0.165 ms"
"Execution time: 184.155 ms"

Postgres version: 9.5; OS: Windows 7; RAM: 8GB

In picture is some config current values.

p.s. In "pg_stats" really many values (long lists in "most_common_vals",
"most_common_freqs") and in another columns
Which one columns should I show you? All?

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: otar shavadze (#1)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

otar shavadze <oshavadze@gmail.com> writes:

" -> Bitmap Index Scan on idx (cost=0.00..23.80 rows=533
width=0) (actual time=35.054..35.054 rows=90052 loops=1)"
" Index Cond: (my_array @> '{8}'::integer[])"

Seems like your problem here is that the planner has no idea about the
selectivity of this condition --- if it did, I think it would have
made the right choice, because it would have made a much higher estimate
for the cost of the indexscan.

AFAICT, Postgres 9.5 does make a reasonably correct guess when given
up-to-date stats. I speculate that you need to ANALYZE this table.
If there are a lot of distinct possible values in the arrays, increasing
the statistics target for the column might be needed.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jeff Janes

jeff.janes@gmail.com

over 9 years ago

In reply to: otar shavadze (#1)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

On Tue, Nov 8, 2016 at 12:27 PM, otar shavadze <oshavadze@gmail.com> wrote:

p.s. In "pg_stats" really many values (long lists in "most_common_vals",
"most_common_freqs") and in another columns
Which one columns should I show you? All?

most_common_elems. Is it empty, or is it not empty? If not empty, does it
contain the specific values you used in your queries?

Cheers,

Jeff

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Tom Lane (#2)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

I wrote:

Seems like your problem here is that the planner has no idea about the
selectivity of this condition --- if it did, I think it would have
made the right choice, because it would have made a much higher estimate
for the cost of the indexscan.

AFAICT, Postgres 9.5 does make a reasonably correct guess when given
up-to-date stats. I speculate that you need to ANALYZE this table.

Hmmm ... actually, I wonder if maybe '@>' here is the contrib/intarray
operator not the core operator? The intarray operator didn't get plugged
into any real estimation logic until 9.6.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

otar shavadze

oshavadze@gmail.com

over 9 years ago

In reply to: Tom Lane (#4)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

@Jeff

most_common_elems. Is it empty, or is it not empty? If not empty, does
it contain the specific values you used in your queries?

No, most_common_elems is not empty. it contain the specific values I used
in queries.

@Tom

Hmmm ... actually, I wonder if maybe '@>' here is the contrib/intarray
operator not the core operator? The intarray operator didn't get plugged
into any real estimation logic until 9.6.

So, you mean that better would be go to version 9.6 ?

On Wed, Nov 9, 2016 at 8:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

I wrote:

Seems like your problem here is that the planner has no idea about the
selectivity of this condition --- if it did, I think it would have
made the right choice, because it would have made a much higher estimate
for the cost of the indexscan.

AFAICT, Postgres 9.5 does make a reasonably correct guess when given
up-to-date stats. I speculate that you need to ANALYZE this table.

Hmmm ... actually, I wonder if maybe '@>' here is the contrib/intarray
operator not the core operator? The intarray operator didn't get plugged
into any real estimation logic until 9.6.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: otar shavadze (#5)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

otar shavadze <oshavadze@gmail.com> writes:

Hmmm ... actually, I wonder if maybe '@>' here is the contrib/intarray
operator not the core operator? The intarray operator didn't get plugged
into any real estimation logic until 9.6.

So, you mean that better would be go to version 9.6 ?

If you are using that contrib module, and it's capturing this operator
reference, that would probably explain the bad estimate. You could
drop the extension if you're not depending on its other features, or you
could explicitly qualify the operator name ("operator(pg_catalog.@>)"),
or you could upgrade to 9.6 (don't forget to do ALTER EXTENSION ... UPDATE
afterwards).

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Jeff Janes

jeff.janes@gmail.com

over 9 years ago

In reply to: Tom Lane (#6)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

On Thu, Nov 10, 2016 at 7:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

otar shavadze <oshavadze@gmail.com> writes:

Hmmm ... actually, I wonder if maybe '@>' here is the contrib/intarray
operator not the core operator? The intarray operator didn't get

plugged

into any real estimation logic until 9.6.

So, you mean that better would be go to version 9.6 ?

If you are using that contrib module, and it's capturing this operator
reference, that would probably explain the bad estimate. You could
drop the extension if you're not depending on its other features, or you
could explicitly qualify the operator name ("operator(pg_catalog.@>)"),
or you could upgrade to 9.6 (don't forget to do ALTER EXTENSION ... UPDATE
afterwards).

Isn't the operator determined at index build time? If he doesn't want to
update to 9.6, I think he would need to rebuild the index, removing
the "gin__int_ops" specification.

Cheers,

Jeff

Tom Lane

tgl@sss.pgh.pa.us

over 9 years ago

In reply to: Jeff Janes (#7)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

Jeff Janes <jeff.janes@gmail.com> writes:

On Thu, Nov 10, 2016 at 7:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you are using that contrib module, and it's capturing this operator
reference, that would probably explain the bad estimate. You could
drop the extension if you're not depending on its other features, or you
could explicitly qualify the operator name ("operator(pg_catalog.@>)"),
or you could upgrade to 9.6 (don't forget to do ALTER EXTENSION ... UPDATE
afterwards).

Isn't the operator determined at index build time? If he doesn't want to
update to 9.6, I think he would need to rebuild the index, removing
the "gin__int_ops" specification.

The operator in the query isn't. But yes, if he's using an index that's
built on the extension's opclass, he'd need to rebuild that too in order
to still use the index with the core @> operator.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

otar shavadze

oshavadze@gmail.com

over 9 years ago

In reply to: Tom Lane (#8)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

Tried

OPERATOR(pg_catalog.@>)

as Tom mentioned, but still, don't get fast performance when value does
not existed in any array.

Also "played" with many different ways, gin, gist indexes (gin with and
without *gin__int_ops*) but, always, there was some situation, where
search in array was slow.

I don't know exactly, may be I am wrong, but what I understood after
several day "trying", is that, I never will use arrays, with tables more
than 500 000-1000 000 rows, because then searching in this array is
somehow problematic.

I rebuild my structure and added another table (instead of using array) and
then used join's instead of searching in array.

That's works perfectly, joining works fast as hell, even for several
millions rows in each table.

On Fri, Nov 11, 2016 at 12:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Jeff Janes <jeff.janes@gmail.com> writes:

On Thu, Nov 10, 2016 at 7:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If you are using that contrib module, and it's capturing this operator
reference, that would probably explain the bad estimate. You could
drop the extension if you're not depending on its other features, or you
could explicitly qualify the operator name ("operator(pg_catalog.@>)"),
or you could upgrade to 9.6 (don't forget to do ALTER EXTENSION ...

UPDATE

afterwards).

Isn't the operator determined at index build time? If he doesn't want to
update to 9.6, I think he would need to rebuild the index, removing
the "gin__int_ops" specification.

The operator in the query isn't. But yes, if he's using an index that's
built on the extension's opclass, he'd need to rebuild that too in order
to still use the index with the core @> operator.

regards, tom lane

#10

Jeff Janes

jeff.janes@gmail.com

over 9 years ago

In reply to: otar shavadze (#9)

Re: Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

On Sat, Nov 12, 2016 at 5:33 PM, otar shavadze <oshavadze@gmail.com> wrote:

Tried

OPERATOR(pg_catalog.@>)

as Tom mentioned, but still, don't get fast performance when value does
not existed in any array.

Did you build the correct index?

Also "played" with many different ways, gin, gist indexes (gin with and
without *gin__int_ops*) but, always, there was some situation, where
search in array was slow.

Yes. There will always be some situation when the array search is slow.
Is that situation one that a specific person cares about? Hard to tell,
since you have not given us any additional useful information.

I don't know exactly, may be I am wrong, but what I understood after

several day "trying", is that, I never will use arrays, with tables more
than 500 000-1000 000 rows, because then searching in this array is
somehow problematic.

I rebuild my structure and added another table (instead of using array)
and then used join's instead of searching in array.

That's works perfectly, joining works fast as hell, even for several
millions rows in each table.

"Properly" normalizing your data is a wonderful thing, no doubt about it,
if you are prepared to deal with the consequences of doing so. But not
everyone has that luxury. Which is why there is more than one way of doing
things.

Cheers,

Jeff

Gin indexes on intarray is fast when value in array does not exists, and slow, when value exists

Attachments: