Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

Started by Amit Langoteover 12 years ago11 messages
#1Amit Langote
amitlangote09@gmail.com

Hello,

I have been trying to understand how pg_trgm works. As part of that, I
was looking at gin_extract_query_trgm(), which I think, extracts
trigrams from a search query string. So, I debugged for 3 cases:

1) column_name LIKE '%緊急%'

in this case, inside gin_extract_query_trgm(), after a call to
generate_wildcard_trgm(), returned trglen is 0, hence
GIN_SEARCH_MODE_ALL search mode is used.

2) column_name LIKE '%os%'

same as in case (1)

3) column_name LIKE '%ost%'

returned trglen is > 0, things proceed differently. May be, trigrams
have been generated and cane be used for index search.

I later commented out #define KEEPONLYALNUM from
contrib/pg_trgm/trgm.h (following from a related discussion on
-hackers viz. /messages/by-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com
but things didn't change.

So, it appears, for search strings consisting of 2 (or < 3)
characters, trigrams can not be utilized. No?

NOTE: Using the master branch. The indexed column is a text field and
data consists of mix of Japanese, alphanumeric characters.

--
Amit Langote

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#1)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote:

So, it appears, for search strings consisting of 2 (or < 3)
characters, trigrams can not be utilized. No?

I think that's right. "trigram" means a sequence of three characters,
and what's stored in the indexes are three-character sequences from
the original text.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Amit Langote
amitlangote09@gmail.com
In reply to: Robert Haas (#2)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Thu, May 30, 2013 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote:

So, it appears, for search strings consisting of 2 (or < 3)
characters, trigrams can not be utilized. No?

I think that's right. "trigram" means a sequence of three characters,
and what's stored in the indexes are three-character sequences from
the original text.

Was there any improvement to pg_trgm in recent past that could make it
better for partial matching (the case in question I suppose) or is
partial-matching a different thing altogether?

--
Amit Langote

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#3)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Thu, May 30, 2013 at 11:00 AM, Amit Langote <amitlangote09@gmail.com> wrote:

Was there any improvement to pg_trgm in recent past that could make it
better for partial matching (the case in question I suppose) or is
partial-matching a different thing altogether?

Sorry, I am not sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Sawada Masahiko
sawada.mshk@gmail.com
In reply to: Amit Langote (#3)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

2013/5/31 Amit Langote <amitlangote09@gmail.com>

On Thu, May 30, 2013 at 11:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, May 29, 2013 at 10:51 PM, Amit Langote <amitlangote09@gmail.com> wrote:

So, it appears, for search strings consisting of 2 (or < 3)
characters, trigrams can not be utilized. No?

I think that's right. "trigram" means a sequence of three characters,
and what's stored in the indexes are three-character sequences from
the original text.

Was there any improvement to pg_trgm in recent past that could make it
better for partial matching (the case in question I suppose) or is
partial-matching a different thing altogether?

Hi Amit,

following emails are discussed about partial match of pg_trgm. I hope
will this help.
</messages/by-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com&gt;
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Regards,

--------
Masahiko Sawada

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Alexander Korotkov
aekorotkov@gmail.com
In reply to: Sawada Masahiko (#5)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:

following emails are discussed about partial match of pg_trgm. I hope
will this help.
<
/messages/by-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com

as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Please, read the further discussion on that thread. We can't do partial
matching because of CRC independently of KEEPONLYALNUM.

------
With best regards,
Alexander Korotkov.

#7Amit Langote
amitlangote09@gmail.com
In reply to: Alexander Korotkov (#6)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:

following emails are discussed about partial match of pg_trgm. I hope
will this help.

</messages/by-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com&gt;
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Please, read the further discussion on that thread. We can't do partial
matching because of CRC independently of KEEPONLYALNUM.

Thank you Sawada-san and Alexander.

I think the idea of using trigram "text" itself rather than its CRC
(due to its problems in partial matching) as GIN key (?) has not been
implemented into pg_trgm yet, right? And even though, such a facility
would be added, we would still need to handle multibyte characters
case differently (even for partial matching), is that right?

--
Amit Langote

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Amit Langote
amitlangote09@gmail.com
In reply to: Alexander Korotkov (#6)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:

following emails are discussed about partial match of pg_trgm. I hope
will this help.

</messages/by-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com&gt;
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Please, read the further discussion on that thread. We can't do partial
matching because of CRC independently of KEEPONLYALNUM.

Also, a few more questions:

1) When building a trgm index, are there any differences for
multi-byte character strings. For example, would a 2 character
Japanese string (multi-byte offcourse) produce exactly 3 trigrams to
be stored in the index which would later be used while look-up?

2) And if that is so, is there problem in gin_extract_query_trgm(),
that is while generating trigrams from a query search term that causes
trigrams (stored in the index if answer to (1) is yes) NOT to be used
in such a partial matching case?

--
Amit Langote

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Sawada Masahiko
sawada.mshk@gmail.com
In reply to: Amit Langote (#8)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote:

On Fri, May 31, 2013 at 4:25 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Thu, May 30, 2013 at 12:49 PM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:

following emails are discussed about partial match of pg_trgm. I hope
will this help.

</messages/by-id/CAHGQGwFJshvV2nGME19wdTW9teFw_w7h2ns4E+YYsjkB9WdWDQ@mail.gmail.com&gt;
as you may know, if search string contains multibyte characters
trigram key is converted to CRC of 4 byte and it is used as key.
(but only use upper 3 byte from CRC)
so we can do partial matching if KEEPONLYALNUM is enabled.

Please, read the further discussion on that thread. We can't do partial
matching because of CRC independently of KEEPONLYALNUM.

Also, a few more questions:

1) When building a trgm index, are there any differences for
multi-byte character strings. For example, would a 2 character
Japanese string (multi-byte offcourse) produce exactly 3 trigrams to
be stored in the index which would later be used while look-up?

in above case a 2 character multibyte string produce 3 trigrams of
CRC. (because these was larger than 3 byte)
and these are used while look-up.

2) And if that is so, is there problem in gin_extract_query_trgm(),
that is while generating trigrams from a query search term that causes
trigrams (stored in the index if answer to (1) is yes) NOT to be used
in such a partial matching case?

it means that we can't use trigrams in case of partial matching
because trigrams (stored in index) are converted to different
value(CRC).
right?

Regards,
--
-------
Sawada Masahiko

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Amit Langote
amitlangote09@gmail.com
In reply to: Sawada Masahiko (#9)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Sat, Jun 1, 2013 at 1:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com> wrote:

2) And if that is so, is there problem in gin_extract_query_trgm(),
that is while generating trigrams from a query search term that causes
trigrams (stored in the index if answer to (1) is yes) NOT to be used
in such a partial matching case?

it means that we can't use trigrams in case of partial matching
because trigrams (stored in index) are converted to different
value(CRC).
right?

When I debugged a partial match case such as " column like '%st%'
", it appears that get_wildcard_trigrams return no trigrams for
wildcard part 'st' since charlen < 3. Hence, GIN_SEARCH_MODE_ALL mode
is used and results in full index scan instead of trigrams being
used. This happens for multibyte case too. The problem is that for
wildcard part consisting of less than 3 characters,
get_wildcard_trigrams return nothing.

--
Amit Langote

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Amit Langote (#10)
Re: Behavior of a pg_trgm index for 2 (or < 3) character LIKE queries

On Fri, May 31, 2013 at 9:41 PM, Amit Langote <amitlangote09@gmail.com>wrote:

On Sat, Jun 1, 2013 at 1:48 AM, Sawada Masahiko <sawada.mshk@gmail.com>
wrote:

On Fri, May 31, 2013 at 11:16 AM, Amit Langote <amitlangote09@gmail.com>

wrote:

2) And if that is so, is there problem in gin_extract_query_trgm(),
that is while generating trigrams from a query search term that causes
trigrams (stored in the index if answer to (1) is yes) NOT to be used
in such a partial matching case?

it means that we can't use trigrams in case of partial matching
because trigrams (stored in index) are converted to different
value(CRC).
right?

When I debugged a partial match case such as " column like '%st%'
", it appears that get_wildcard_trigrams return no trigrams for
wildcard part 'st' since charlen < 3. Hence, GIN_SEARCH_MODE_ALL mode
is used and results in full index scan instead of trigrams being
used. This happens for multibyte case too. The problem is that for
wildcard part consisting of less than 3 characters,
get_wildcard_trigrams return nothing.

Partial match for LIKE queries is not implemented at all. With current
index structure it could be possible to implement it with guarantee that
all indexed strings don't contain multibyte characters (i.e. single-byte
encoding).
Also, at it was mentioned, it's possible to implement operator class with
text storage type which would support partial match in all the cases.
However, I doubt it's reasonable to implement it based on pg_trgm, because
of many hard-wired assumptions that trigram is fixed length and
compatibility.

------
With best regards,
Alexander Korotkov.