Querying a table with jaccard similarity with 1.6 million records take 12 seconds

Started by balasubramanian c rover 4 years ago8 messagesgeneral
Jump to latest
#1balasubramanian c r
crbs.siebel@gmail.com

Hi Team

We have encountered a problem in our testing environment.
I have a scenario where I am running a similarity match for an address
I have created a table with following number of records
1603423

We are using pg_similarity extension in postgresql version is 13.

And I have created GIN index (since i am using pg_similarity) library
jaccard similarity method
when I run the Explain analyze
EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj
nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address
where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad
201017' order by qsim DESC;
QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual
time=12101.194..12101.197 rows=6 loops=1)

Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity
up ghaziabad 201017'::text)) DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93)
(actual time=3516.233..12101.172 rows=6 loops=1)
Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up
ghaziabad 201017'::text)
Rows Removed by Index Recheck: 1039186
Heap Blocks: exact=58018
-> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02
rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1)
Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up
ghaziabad 201017'::text)
Planning Time: 0.141 ms
Execution Time: 12101.245 ms
(11 rows)
it took 12 seconds
following are my pgconf file settings
shared buffer as 4GB
work_mem 256 MB
maintenence_work_mem 512MB
autovacuum_work_mem 20MB
My index definition is this "address_complete_address_idx1" gin
(complete_address gin_similarity_ops)

It is taking the index correctly. But why it took 12 seconds to process I
really don't understand.

Please help.

Thanks
C.R.Bala

#2Ninad Shah
nshah.postgres@gmail.com
In reply to: balasubramanian c r (#1)
Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

Hi Bala,

Are your statistics updated? Also, have you used the gin
operator(gin_similarity_ops) correctly?

It is fetching just 6 records out of a million, hence, it should not go for
bitmap index scan. As bitmap index scan loads a complete index, and access
relevant pages from the table later by bitmap heap scan.

Regards,
Ninad Shah

On Thu, 2 Sept 2021 at 16:39, balasubramanian c r <crbs.siebel@gmail.com>
wrote:

Show quoted text

Hi Team

We have encountered a problem in our testing environment.
I have a scenario where I am running a similarity match for an address
I have created a table with following number of records
1603423

We are using pg_similarity extension in postgresql version is 13.

And I have created GIN index (since i am using pg_similarity) library
jaccard similarity method
when I run the Explain analyze
EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj
nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address
where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad
201017' order by qsim DESC;
QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual
time=12101.194..12101.197 rows=6 loops=1)

Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity
up ghaziabad 201017'::text)) DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93)
(actual time=3516.233..12101.172 rows=6 loops=1)
Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity
up ghaziabad 201017'::text)
Rows Removed by Index Recheck: 1039186
Heap Blocks: exact=58018
-> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02
rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1)
Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up
ghaziabad 201017'::text)
Planning Time: 0.141 ms
Execution Time: 12101.245 ms
(11 rows)
it took 12 seconds
following are my pgconf file settings
shared buffer as 4GB
work_mem 256 MB
maintenence_work_mem 512MB
autovacuum_work_mem 20MB
My index definition is this "address_complete_address_idx1" gin
(complete_address gin_similarity_ops)

It is taking the index correctly. But why it took 12 seconds to process I
really don't understand.

Please help.

Thanks
C.R.Bala

#3Michael Lewis
mlewis@entrata.com
In reply to: Ninad Shah (#2)
Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

This is showing many false positives from the index scan that get removed
when the actual values are examined. With such a long search parameter,
that does not seem surprising. I would expect a search on "raj nagar
ghaziabad 201017" or something like that to yield far fewer results from
the index scan. I don't know GIN indexes super well, but I would guess that
including words that are very common will yield false positives that get
filtered out later.

Show quoted text
#4balasubramanian c r
crbs.siebel@gmail.com
In reply to: Ninad Shah (#2)
Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

HI Ninad

Thanks for your reply.
If bitmap index should not be used. Do i need to disable it for the time
being and carry out the test.

The documentation in pg_similarity shows that index can be created on text
column using gin with gin_similarity_ops.
The same way the index is created like
CREATE INDEX on address using GIN(complete_address gin_similarity_ops);

AFAIK I have not seen any other operators other than gin in the
pg_smilarity extension.

Thanks
C.R.Bala

On Thu, Sep 2, 2021 at 8:17 PM Ninad Shah <nshah.postgres@gmail.com> wrote:

Show quoted text

Hi Bala,

Are your statistics updated? Also, have you used the gin
operator(gin_similarity_ops) correctly?

It is fetching just 6 records out of a million, hence, it should not go
for bitmap index scan. As bitmap index scan loads a complete index, and
access relevant pages from the table later by bitmap heap scan.

Regards,
Ninad Shah

On Thu, 2 Sept 2021 at 16:39, balasubramanian c r <crbs.siebel@gmail.com>
wrote:

Hi Team

We have encountered a problem in our testing environment.
I have a scenario where I am running a similarity match for an address
I have created a table with following number of records
1603423

We are using pg_similarity extension in postgresql version is 13.

And I have created GIN index (since i am using pg_similarity) library
jaccard similarity method
when I run the Explain analyze
EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj
nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address
where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad
201017' order by qsim DESC;
QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual
time=12101.194..12101.197 rows=6 loops=1)

Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity
up ghaziabad 201017'::text)) DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93)
(actual time=3516.233..12101.172 rows=6 loops=1)
Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity
up ghaziabad 201017'::text)
Rows Removed by Index Recheck: 1039186
Heap Blocks: exact=58018
-> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02
rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1)
Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up
ghaziabad 201017'::text)
Planning Time: 0.141 ms
Execution Time: 12101.245 ms
(11 rows)
it took 12 seconds
following are my pgconf file settings
shared buffer as 4GB
work_mem 256 MB
maintenence_work_mem 512MB
autovacuum_work_mem 20MB
My index definition is this "address_complete_address_idx1" gin
(complete_address gin_similarity_ops)

It is taking the index correctly. But why it took 12 seconds to process I
really don't understand.

Please help.

Thanks
C.R.Bala

#5Ninad Shah
nshah.postgres@gmail.com
In reply to: balasubramanian c r (#4)
Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

Try the pg_tgrm extension. It is a rich set of operators.

Regards,
Ninad Shah

On Thu, 2 Sept 2021 at 23:39, balasubramanian c r <crbs.siebel@gmail.com>
wrote:

Show quoted text

HI Ninad

Thanks for your reply.
If bitmap index should not be used. Do i need to disable it for the time
being and carry out the test.

The documentation in pg_similarity shows that index can be created on text
column using gin with gin_similarity_ops.
The same way the index is created like
CREATE INDEX on address using GIN(complete_address gin_similarity_ops);

AFAIK I have not seen any other operators other than gin in the
pg_smilarity extension.

Thanks
C.R.Bala

On Thu, Sep 2, 2021 at 8:17 PM Ninad Shah <nshah.postgres@gmail.com>
wrote:

Hi Bala,

Are your statistics updated? Also, have you used the gin
operator(gin_similarity_ops) correctly?

It is fetching just 6 records out of a million, hence, it should not go
for bitmap index scan. As bitmap index scan loads a complete index, and
access relevant pages from the table later by bitmap heap scan.

Regards,
Ninad Shah

On Thu, 2 Sept 2021 at 16:39, balasubramanian c r <crbs.siebel@gmail.com>
wrote:

Hi Team

We have encountered a problem in our testing environment.
I have a scenario where I am running a similarity match for an address
I have created a table with following number of records
1603423

We are using pg_similarity extension in postgresql version is 13.

And I have created GIN index (since i am using pg_similarity) library
jaccard similarity method
when I run the Explain analyze
EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj
nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address
where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad
201017' order by qsim DESC;
QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual
time=12101.194..12101.197 rows=6 loops=1)

Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara
integrity up ghaziabad 201017'::text)) DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93)
(actual time=3516.233..12101.172 rows=6 loops=1)
Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity
up ghaziabad 201017'::text)
Rows Removed by Index Recheck: 1039186
Heap Blocks: exact=58018
-> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02
rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1)
Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity
up ghaziabad 201017'::text)
Planning Time: 0.141 ms
Execution Time: 12101.245 ms
(11 rows)
it took 12 seconds
following are my pgconf file settings
shared buffer as 4GB
work_mem 256 MB
maintenence_work_mem 512MB
autovacuum_work_mem 20MB
My index definition is this "address_complete_address_idx1" gin
(complete_address gin_similarity_ops)

It is taking the index correctly. But why it took 12 seconds to process
I really don't understand.

Please help.

Thanks
C.R.Bala

#6Ninad Shah
nshah.postgres@gmail.com
In reply to: Michael Lewis (#3)
Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

I see an issue with the operator. GIN index is capable of performing index
scan and index-only scan.

Regards,
Ninad Shah

On Thu, 2 Sept 2021 at 20:57, Michael Lewis <mlewis@entrata.com> wrote:

Show quoted text

This is showing many false positives from the index scan that get removed
when the actual values are examined. With such a long search parameter,
that does not seem surprising. I would expect a search on "raj nagar
ghaziabad 201017" or something like that to yield far fewer results from
the index scan. I don't know GIN indexes super well, but I would guess that
including words that are very common will yield false positives that get
filtered out later.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Lewis (#3)
Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

Michael Lewis <mlewis@entrata.com> writes:

This is showing many false positives from the index scan that get removed
when the actual values are examined. With such a long search parameter,
that does not seem surprising. I would expect a search on "raj nagar
ghaziabad 201017" or something like that to yield far fewer results from
the index scan. I don't know GIN indexes super well, but I would guess that
including words that are very common will yield false positives that get
filtered out later.

Yeah, the huge "Rows Removed" number shows that this index is very
poorly adapted to the query. I don't think the problem is with GIN
per se, but with a poor choice of how to use it. The given example
looks like what the OP really wants to do is full text search.
If so, a GIN index should be fine as long as you put tsvector/tsquery
filtering in front of it. If that's not a good characterization of
the goal, it'd help to tell us what the goal is. (Just saying "I
want to use jaccard similarity" sounds a lot like a man whose only
tool is a hammer, therefore his problem must be a nail, despite
evidence to the contrary.)

regards, tom lane

#8balasubramanian c r
crbs.siebel@gmail.com
In reply to: Tom Lane (#7)
Re: Querying a table with jaccard similarity with 1.6 million records take 12 seconds

HI Tom/Ninad
My bad I didn't explain my use case properly.
The use case is to find the best string similarity for a given address
against the list of addresses in the table.
Initially I tried a similarity function provided by the pg_trgm extension.
But the similarity scores were not satisfactory.
Later I explored the pg_similarity extension which had an exhaustive list
of functions which supported indexes primarily on GIN.
After trying multiple functions like levenshtein, cosine, qgram, jaro, jaro
winkler and jaccard similarity functions
jaccard similarity functions were providing better accurate results.

Hence we decided to use that for our analysis.

Now with a small amount of data the query response time is better. But with
such 1.6 million it is taking a long time.
Even the documentation owner of pg_similarity provided an example on how to
create an index using GIN. We followed exactly the same process.
But still when the data is huge we don't know why it is taking time to scan
through the records.

Thanks
C.R.Bala

On Fri, Sep 3, 2021 at 1:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Michael Lewis <mlewis@entrata.com> writes:

This is showing many false positives from the index scan that get removed
when the actual values are examined. With such a long search parameter,
that does not seem surprising. I would expect a search on "raj nagar
ghaziabad 201017" or something like that to yield far fewer results from
the index scan. I don't know GIN indexes super well, but I would guess

that

including words that are very common will yield false positives that get
filtered out later.

Yeah, the huge "Rows Removed" number shows that this index is very
poorly adapted to the query. I don't think the problem is with GIN
per se, but with a poor choice of how to use it. The given example
looks like what the OP really wants to do is full text search.
If so, a GIN index should be fine as long as you put tsvector/tsquery
filtering in front of it. If that's not a good characterization of
the goal, it'd help to tell us what the goal is. (Just saying "I
want to use jaccard similarity" sounds a lot like a man whose only
tool is a hammer, therefore his problem must be a nail, despite
evidence to the contrary.)

regards, tom lane