Relative performance of prefix and suffix string matching

Started by Andrew Roseover 14 years ago4 messagesgeneral

andrew.rose@metaswitch.com

over 14 years ago

Basic Question: In text fields, is prefix matching significantly faster than suffix matching?

Background:

I'm designing a database schema where a common operation will be "search for substring x either at the beginning or end of column 'str'".

1. I could have the client issue...

SELECT * FROM tbl WHERE str LIKE 'x%' OR str LIKE '%x'

2. Alternatively, I could store column 'rev_str' as a reversed version of column 'str' and have the client produce a reversed version of x on each query (call it r). Then the client would issue...

SELECT * FROM tbl WHERE str LIKE 'x%' OR rev_str LIKE 'r%'

...which would use prefix matches only instead of requiring suffix matches. Since I've seen this form used by others, I was wondering if it's necessary - i.e. if databases really do perform prefix matching faster?

3. Is there a solution I'm unaware of with even better performance?

Thanks,

Andrew

Tore Halvorsen

tore.halvorsen@gmail.com

over 14 years ago

In reply to: Andrew Rose (#1)

Re: Relative performance of prefix and suffix string matching

On Fri, Sep 23, 2011 at 11:47 AM, Andrew Rose <andrew.rose@metaswitch.com>wrote:

Basic Question: In text fields, is prefix matching significantly faster
than suffix matching?

If you are using text_pattern_ops, then yes.

2. Alternatively, I could store column 'rev_str' as a reversed version of
column 'str' and have the client produce a reversed version of x on each
query (call it r). Then the client would issue...

... or use an index on the reversed string.

create table foo (text text not null);
insert into foo select md5(generate_series(1, 1000000, 1)::text);
create index on foo(text text_pattern_ops);
create index on foo(reverse(text) text_pattern_ops);
explain select * from foo where text like 'f000' || '%' or reverse(text)
like reverse('f000') || '%'

Bitmap Heap Scan on foo (cost=9.20..13.22 rows=200 width=33)
Recheck Cond: ((text ~~ 'f000%'::text) OR (reverse(text) ~~
'000f%'::text))
Filter: ((text ~~ 'f000%'::text) OR (reverse(text) ~~ '000f%'::text))
-> BitmapOr (cost=9.20..9.20 rows=1 width=0)
-> Bitmap Index Scan on foo_text_idx (cost=0.00..4.55 rows=1
width=0)
Index Cond: ((text ~>=~ 'f000'::text) AND (text ~<~
'f001'::text))
-> Bitmap Index Scan on foo_reverse_idx (cost=0.00..4.55 rows=1
width=0)
Index Cond: ((reverse(text) ~>=~ '000f'::text) AND
(reverse(text) ~<~ '000g'::text))

... at least this works for me :)

--
Eld på åren og sol på eng gjer mannen fegen og fjåg. [Jøtul]
<demo> 2011 Tore Halvorsen || +052 0553034554

Alban Hertroys

haramrae@gmail.com

over 14 years ago

In reply to: Andrew Rose (#1)

Re: Relative performance of prefix and suffix string matching

On 23 September 2011 11:47, Andrew Rose <andrew.rose@metaswitch.com> wrote:

Basic Question: In text fields, is prefix matching significantly faster
than suffix matching?

It does depend on what type of index you use. BTrees split off text strings,
from left to right, halving the number of records you need to scan at every
branch. For a suffix match, that's exactly the wrong way around.
Hash indexes probably don't fare any better.
I don't know enough about GIST or GIN indexes to comment on their
suitability for suffix matches, but presumably they're better at those.

I recall doing suffix matches used to be a problem in at least earlier
versions of Postgres, but it's quite possible that the query planner is
smart enough to do the reverse match by itself nowadays (I doubt it, seeing
that it would also need to reverse the way the index is organised).

2. Alternatively, I could store column 'rev_str' as a reversed version of
column 'str' and have the client produce a reversed version of x on each
query (call it r). Then the client would issue...

SELECT * FROM tbl WHERE str LIKE 'x%' OR rev_str LIKE 'r%'

...which would use prefix matches only instead of requiring suffix matches.
Since I've seen this form used by others, I was wondering if it's necessary
- i.e. if databases really do perform prefix matching faster?

3. Is there a solution I'm unaware of with even better performance?

You can create a functional index on the reverse of the string, that way
omitting the need for an extra column (that needs updating as well).

CREATE INDEX tbl_str_rev ON tbl (reverse(str));
SELECT * FROM tbl WHERE str LIKE 'x%' OR reverse(str) LIKE 'x%';

See: http://www.postgresql.org/docs/9.0/static/indexes-expressional.html

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

"Stéphane A. Schildknecht"

stephane.schildknecht@postgresql.fr

over 14 years ago

In reply to: Alban Hertroys (#3)

Re: Relative performance of prefix and suffix string matching

Le 23/09/2011 12:30, Alban Hertroys a ï¿œcrit :

On 23 September 2011 11:47, Andrew Rose <andrew.rose@metaswitch.com
<mailto:andrew.rose@metaswitch.com>> wrote:

Basic Question: In text fields, is prefix matching significantly faster
than suffix matching?

It does depend on what type of index you use. BTrees split off text strings,
from left to right, halving the number of records you need to scan at every
branch. For a suffix match, that's exactly the wrong way around.
Hash indexes probably don't fare any better.
I don't know enough about GIST or GIN indexes to comment on their suitability
for suffix matches, but presumably they're better at those.

I recall doing suffix matches used to be a problem in at least earlier versions
of Postgres, but it's quite possible that the query planner is smart enough to
do the reverse match by itself nowadays (I doubt it, seeing that it would also
need to reverse the way the index is organised).

2. Alternatively, I could store column 'rev_str' as a reversed version of
column 'str' and have the client produce a reversed version of x on each
query (call it r). Then the client would issue...

SELECT * FROM tbl WHERE str LIKE 'x%' OR rev_str LIKE 'r%'

...which would use prefix matches only instead of requiring suffix matches.
Since I've seen this form used by others, I was wondering if it's
necessary - i.e. if databases really do perform prefix matching faster?

3. Is there a solution I'm unaware of with even better performance?

You can create a functional index on the reverse of the string, that way
omitting the need for an extra column (that needs updating as well).

CREATE INDEX tbl_str_rev ON tbl (reverse(str));
SELECT * FROM tbl WHERE str LIKE 'x%' OR reverse(str) LIKE 'x%';

See: http://www.postgresql.org/docs/9.0/static/indexes-expressional.html

You can use the pg_trgm extension which lets you create gin or gist indexes on
your text field.

There is also wildspeed (see http://www.sai.msu.su/~megera/wiki/wildspeed).

Didn't try the latter solution, but the first one gives really great result for
searching partial strings.

a propos, there's one thing I'd like to know, is how to set the similarity
limit within pg_trgm on a server side (I'd like to have it settled to 0.2 for
every new session, for instance).

Regards,
--
Stï¿œphane Schildknecht
http://www.loxodata.com
Contact rï¿œgional PostgreSQL