pg_trgm partial-match

Started by Fujii Masaoabout 13 years ago7 messages
#1Fujii Masao
masao.fujii@gmail.com
1 attachment(s)

Hi,

I'd like to propose to extend pg_trgm so that it can compare a partial-match
query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial'
GIN method for pg_trgm.

Currently, when the query key is less than three characters, we cannot use
a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a
partial-match method. In this case, seq scan or index full scan would be
executed, and its response time would be very slow. I'd like to alleviate this
problem.

Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.

Attached patch is WIP yet. What I should do next is:

* version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc.
* write the regression test

Comments? Review? Objection?

Regards,

--
Fujii Masao

Attachments:

trgm_compare_partial_v0.patchapplication/octet-stream; name=trgm_compare_partial_v0.patchDownload
*** a/contrib/pg_trgm/pg_trgm--1.0.sql
--- b/contrib/pg_trgm/pg_trgm--1.0.sql
***************
*** 148,153 **** RETURNS bool
--- 148,158 ----
  AS 'MODULE_PATHNAME'
  LANGUAGE C IMMUTABLE STRICT;
  
+ CREATE FUNCTION gin_trgm_compare_partial(int4, int4, int2, internal)
+ RETURNS int
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
  -- create the operator class for gin
  CREATE OPERATOR CLASS gin_trgm_ops
  FOR TYPE text USING gin
***************
*** 157,162 **** AS
--- 162,168 ----
          FUNCTION        2       gin_extract_value_trgm (text, internal),
          FUNCTION        3       gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
          FUNCTION        4       gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
+ 				FUNCTION        5       gin_trgm_compare_partial (int4, int4, int2, internal),
          STORAGE         int4;
  
  -- Add operators that are new in 9.1.
*** a/contrib/pg_trgm/trgm.h
--- b/contrib/pg_trgm/trgm.h
***************
*** 42,47 **** typedef char trgm[3];
--- 42,49 ----
  	*(((char*)(a))+2) = *(((char*)(b))+2);	\
  } while(0);
  
+ #define TRGMCHARLEN(t) ( (*(((char*)(t))+2)=='\0') ? ( (*(((char*)(t))+1)=='\0') ? 1 : 2) : 3 )
+ 
  uint32		trgm2int(trgm *ptr);
  
  #ifdef KEEPONLYALNUM
***************
*** 100,105 **** typedef char *BITVECP;
--- 102,108 ----
  #define ARRNELEM(x) ( ( VARSIZE(x) - TRGMHDRSIZE )/sizeof(trgm) )
  
  extern float4 trgm_limit;
+ extern bool	trgm_pmatch;
  
  TRGM	   *generate_trgm(char *str, int slen);
  TRGM	   *generate_wildcard_trgm(const char *str, int slen);
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
***************
*** 21,26 **** Datum		gin_extract_query_trgm(PG_FUNCTION_ARGS);
--- 21,29 ----
  PG_FUNCTION_INFO_V1(gin_trgm_consistent);
  Datum		gin_trgm_consistent(PG_FUNCTION_ARGS);
  
+ PG_FUNCTION_INFO_V1(gin_trgm_compare_partial);
+ Datum		gin_trgm_compare_partial(PG_FUNCTION_ARGS);
+ 
  /*
   * This function can only be called if a pre-9.1 version of the GIN operator
   * class definition is present in the catalogs (probably as a consequence
***************
*** 46,51 **** gin_extract_value_trgm(PG_FUNCTION_ARGS)
--- 49,57 ----
  	TRGM	   *trg;
  	int32		trglen;
  
+ #ifdef KEEPONLYALNUM
+ 	trgm_pmatch = true;
+ #endif
  	*nentries = 0;
  
  	trg = generate_trgm(VARDATA(val), VARSIZE(val) - VARHDRSZ);
***************
*** 79,86 **** gin_extract_query_trgm(PG_FUNCTION_ARGS)
  	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
  	StrategyNumber strategy = PG_GETARG_UINT16(2);
  
! 	/* bool   **pmatch = (bool **) PG_GETARG_POINTER(3); */
! 	/* Pointer	  *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
  	/* bool   **nullFlags = (bool **) PG_GETARG_POINTER(5); */
  	int32	   *searchMode = (int32 *) PG_GETARG_POINTER(6);
  	Datum	   *entries = NULL;
--- 85,92 ----
  	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
  	StrategyNumber strategy = PG_GETARG_UINT16(2);
  
! 	bool   **pmatch = (bool **) PG_GETARG_POINTER(3);
! 	Pointer	  **extra_data = (Pointer **) PG_GETARG_POINTER(4);
  	/* bool   **nullFlags = (bool **) PG_GETARG_POINTER(5); */
  	int32	   *searchMode = (int32 *) PG_GETARG_POINTER(6);
  	Datum	   *entries = NULL;
***************
*** 89,94 **** gin_extract_query_trgm(PG_FUNCTION_ARGS)
--- 95,104 ----
  	trgm	   *ptr;
  	int32		i;
  
+ #ifdef KEEPONLYALNUM
+ 	trgm_pmatch = true;
+ #endif
+ 
  	switch (strategy)
  	{
  		case SimilarityStrategyNumber:
***************
*** 115,120 **** gin_extract_query_trgm(PG_FUNCTION_ARGS)
--- 125,132 ----
  
  	trglen = ARRNELEM(trg);
  	*nentries = trglen;
+ 	*pmatch = NULL;
+ 	*extra_data = NULL;
  
  	if (trglen > 0)
  	{
***************
*** 124,129 **** gin_extract_query_trgm(PG_FUNCTION_ARGS)
--- 136,159 ----
  		{
  			int32		item = trgm2int(ptr);
  
+ #ifdef KEEPONLYALNUM
+ 			if (trgm_pmatch)
+ 			{
+ 				int		trgmcharlen = TRGMCHARLEN(ptr);
+ 
+ 				if (trgmcharlen < 3)
+ 				{
+ 					if (*pmatch == NULL)
+ 					{
+ 						*pmatch = (bool *) palloc0(sizeof(bool) * trglen);
+ 						*extra_data = (Pointer *) palloc0(sizeof(int) * trglen);
+ 					}
+ 					*(*pmatch + i) = true;
+ 					*((int *)(*extra_data + i)) = trgmcharlen;
+ 				}
+ 			}
+ #endif
+ 
  			entries[i] = Int32GetDatum(item);
  			ptr++;
  		}
***************
*** 197,199 **** gin_trgm_consistent(PG_FUNCTION_ARGS)
--- 227,245 ----
  
  	PG_RETURN_BOOL(res);
  }
+ 
+ Datum
+ gin_trgm_compare_partial(PG_FUNCTION_ARGS)
+ {
+ 	int32		a = PG_GETARG_INT32(0);
+ 	int32		b = PG_GETARG_INT32(1);
+ 	int32		shiftlen = (3 - PG_GETARG_INT32(3)) * 8;
+ 
+ 	a >>= shiftlen;
+ 	b >>= shiftlen;
+ 
+ 	if (a == b)
+ 		PG_RETURN_INT32(0);
+ 	else
+ 		PG_RETURN_INT32(1);
+ }
*** a/contrib/pg_trgm/trgm_op.c
--- b/contrib/pg_trgm/trgm_op.c
***************
*** 15,20 **** PG_MODULE_MAGIC;
--- 15,22 ----
  
  float4		trgm_limit = 0.3f;
  
+ bool		trgm_pmatch = false;
+ 
  PG_FUNCTION_INFO_V1(set_limit);
  Datum		set_limit(PG_FUNCTION_ARGS);
  
***************
*** 142,148 **** make_trigrams(trgm *tptr, char *str, int bytelen, int charlen)
--- 144,163 ----
  	char	   *ptr = str;
  
  	if (charlen < 3)
+ 	{
+ #ifdef KEEPONLYALNUM
+ 		if (trgm_pmatch)
+ 		{
+ 			int		i;
+ 
+ 			Assert(bytelen == charlen);
+ 			for (i = 0; i < 3; i++)
+ 				*(((char *)tptr) + i) = (i < bytelen) ? *(ptr + i) : '\0';
+ 			tptr++;
+ 		}
+ #endif
  		return tptr;
+ 	}
  
  #ifdef USE_WIDE_UPPER_LOWER
  	if (pg_database_encoding_max_length() > 1)
#2Tomas Vondra
tv@fuzzy.cz
In reply to: Fujii Masao (#1)
Re: pg_trgm partial-match

On 15.11.2012 20:39, Fujii Masao wrote:

Hi,

I'd like to propose to extend pg_trgm so that it can compare a partial-match
query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial'
GIN method for pg_trgm.

Currently, when the query key is less than three characters, we cannot use
a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a
partial-match method. In this case, seq scan or index full scan would be
executed, and its response time would be very slow. I'd like to alleviate this
problem.

Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.

Attached patch is WIP yet. What I should do next is:

* version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc.
* write the regression test

Comments? Review? Objection?

Hi,

I've done a quick review of the current patch:

(1) It applies cleanly on the current master and builds fine.

(2) In pg_trgm--1.0.sql the gin_trgm_compare_partial is indented
differently (using tabs instead of spaces).

(3) In trgm_gin.c, function gin_extract_value_trgm contains #ifdef
KEEPONLYALNUM, although trgm_pmatch is not used at all.

(4) The patch removes some commented-out variables, but there still
remain various commented-out variables. Will this be cleaned too?

(5) I've done basic functionality of the patch, it really seems to work:

CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;

EXPLAIN SELECT * FROM test WHERE val LIKE '%aa%';
QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=1655.96..11757.63 rows=141414 width=33)
Recheck Cond: (val ~~ '%aa%'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..1620.61 rows=141414
width=0)
Index Cond: (val ~~ '%aa%'::text)
(4 rows)

Without the patch, this gives a seq scan (as expected).

Do you expect to update the docs too? IMHO it's worth mentioning that
the pg_trgm can handle even patterns shorter than 2 chars ...

regards
Tomas Vondra

#3Alexander Korotkov
aekorotkov@gmail.com
In reply to: Fujii Masao (#1)
Re: pg_trgm partial-match

Hi!

On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte
length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.

Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
character is singlebyte?

CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
------
aa
aaa
шaaш
(3 rows)
test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
-----
aa
aaa
(2 rows)

I think we can use partial match only for singlebyte encodings. Or, at
most, in cases when all alpha-numeric characters are singlebyte (have no
idea how to check this).

------
With best regards,
Alexander Korotkov.

#4Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#3)
Re: pg_trgm partial-match

On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:

On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao.fujii@gmail.com>wrote:

Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte
length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC, we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.

Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
character is singlebyte?

CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
------
aa
aaa
шaaш
(3 rows)
test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
-----
aa
aaa
(2 rows)

I think we can use partial match only for singlebyte encodings. Or, at
most, in cases when all alpha-numeric characters are singlebyte (have no
idea how to check this).

Actually, I also was fiddling around idea of partial match on trigrams when
I was working on initial LIKE patch. But, I concluded that we would need a
separate opclass which always keeps full trigram in entry.

------
With best regards,
Alexander Korotkov.

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Alexander Korotkov (#4)
Re: pg_trgm partial-match

On Mon, Nov 19, 2012 at 7:55 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Mon, Nov 19, 2012 at 10:05 AM, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Thu, Nov 15, 2012 at 11:39 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
i.e., if query key contains multibyte characters. In this case, byte
length of
the trigram string might be larger than three, and its CRC is used as a
trigram key instead of the trigram string itself. Because of using CRC,
we
cannot do a partial-match. Attached patch extends pg_trgm so that it
compares a partial-match query key only when KEEPONLYALNUM is
enabled.

Didn't get this point. How does KEEPONLYALNUM guarantee that each trigram
character is singlebyte?

CREATE TABLE test (val TEXT);
INSERT INTO test VALUES ('aa'), ('aaa'), ('шaaш');
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
------
aa
aaa
шaaш
(3 rows)
test=# set enable_seqscan = off;
SET
test=# SELECT * FROM test WHERE val LIKE '%aa%';
val
-----
aa
aaa
(2 rows)

I think we can use partial match only for singlebyte encodings. Or, at
most, in cases when all alpha-numeric characters are singlebyte (have no
idea how to check this).

Good catch! You're right.

Actually, I also was fiddling around idea of partial match on trigrams when
I was working on initial LIKE patch. But, I concluded that we would need a
separate opclass which always keeps full trigram in entry.

Agreed.

My goal is to enable pg_trgm's full-text search using the keyword which
consists of one or two Japanese characters to use an index efficiently.
I first implemented the partial-match patch, and was planning to introduce
new opclass next CommitFest to store the multibyte characters to
the text data instead of int4. But obviously the order of development
should have been the opposite. I will work on the latter development first,
and add new opclass like gin_trgm_mb_ops (mb means multibyte) which
ignores KEEPONLYALNUM and stores the GIN index key as text value.

Regards,

--
Fujii Masao

#6Fujii Masao
masao.fujii@gmail.com
In reply to: Tomas Vondra (#2)
Re: pg_trgm partial-match

On Mon, Nov 19, 2012 at 10:56 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

I've done a quick review of the current patch:

Thanks for the commit!

As Alexander pointed out upthread, another infrastructure patch is required
before applying this patch. So I will implement the infra patch first.

Regards,

--
Fujii Masao

#7Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#6)
Re: pg_trgm partial-match

On Fri, Nov 23, 2012 at 2:11 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Mon, Nov 19, 2012 at 10:56 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

I've done a quick review of the current patch:

Thanks for the commit!

As Alexander pointed out upthread, another infrastructure patch is required
before applying this patch. So I will implement the infra patch first.

I marked this patch as "Returned with Feedback" because unfortunately
I don't have enough time to revise the patch. I will retry this maybe in 9.4dev
phase.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers