pg_trgm version 1.2

Started by Jeff Janesover 10 years ago9 messages

jeff.janes@gmail.com

over 10 years ago

1 attachment(s)

This patch implements version 1.2 of contrib module pg_trgm.

This supports the triconsistent function, introduced in version 9.4 of the
server, to make it faster to implement indexed queries where some keys are
common and some are rare.

I've included the paths to both upgrade and downgrade between 1.1 and 1.2,
although after doing so you must close and restart the session before you
can be sure the change has taken effect. There is no change to the on-disk
index structure

This shows the difference it can make in some cases:

create extension pg_trgm version "1.1";

create table foo as select

md5(random()::text)|| case when random()<0.000005 then 'lmnop' else '123'
end ||

md5(random()::text) as bar

from generate_series(1,10000000);

create index on foo using gin (bar gin_trgm_ops);

--some queries

alter extension pg_trgm update to "1.2";

--close, reopen, more queries

select count(*) from foo where bar like '%12344321lmnabcddd%';

V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache

V1.2: Time: 2.839 ms --- after repeated execution to warm the cache

You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
core) from 4 to some higher value (which it probably should be anyway, but
there will always be a case where it needs to be higher than you can afford
it to be, so a real solution is needed).

I wasn't sure if this should be a new version of pg_trgm or not, because
there is no user visible change other than to performance. But there may
be some cases where it results in performance reduction and so it is nice
to provide options. Also, I'd like to use it in a back-branch, so versions
seems to be the right way to go there.

There is a lot of code duplication between the binary consistent function
and the ternary one. I thought it the duplication was necessary in order
to support both 1.1 and 1.2 from the same code base.

There may also be some gains in the similarity and regex cases, but I
didn't really analyze those for performance.

I've thought about how to document this change. Looking to other example
of other contrib modules with multiple versions, I decided that we don't
document them, other than in the release notes.

The same patch applies to 9.4 code with a minor conflict in the Makefile,
and gives benefits there as well.

Cheers,

Jeff

Attachments:

pg_trgm_1_2_v001.patchapplication/octet-stream; name=pg_trgm_1_2_v001.patchDownload

diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
new file mode 100644
index e081a1e..68db0bc
*** a/contrib/pg_trgm/Makefile
--- b/contrib/pg_trgm/Makefile
*************** MODULE_big = pg_trgm
*** 4,10 ****
  OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o $(WIN32RES)
  
  EXTENSION = pg_trgm
! DATA = pg_trgm--1.1.sql pg_trgm--1.0--1.1.sql pg_trgm--unpackaged--1.0.sql
  PGFILEDESC = "pg_trgm - trigram matching"
  
  REGRESS = pg_trgm
--- 4,10 ----
  OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o $(WIN32RES)
  
  EXTENSION = pg_trgm
! DATA = pg_trgm--1.1.sql pg_trgm--1.2.sql pg_trgm--1.0--1.1.sql pg_trgm--1.1--1.2.sql pg_trgm--1.2--1.1.sql pg_trgm--unpackaged--1.0.sql
  PGFILEDESC = "pg_trgm - trigram matching"
  
  REGRESS = pg_trgm
diff --git a/contrib/pg_trgm/pg_trgm--1.1--1.2.sql b/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
new file mode 100644
index ...c101f21
*** a/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
***************
*** 0 ****
--- 1,12 ----
+ /* contrib/pg_trgm/pg_trgm--1.1--1.2.sql */
+ 
+ -- complain if script is sourced in psql, rather than via ALTER EXTENSION
+ \echo Use "ALTER EXTENSION pg_trgm UPDATE TO '1.2'" to load this file. \quit
+ 
+ CREATE FUNCTION gin_trgm_triconsistent(internal, int2, text, int4, internal, internal, internal)
+ RETURNS "char"
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         FUNCTION        6    (text, text)   gin_trgm_triconsistent (internal, int2, text, int4, internal, internal, internal);
diff --git a/contrib/pg_trgm/pg_trgm--1.2--1.1.sql b/contrib/pg_trgm/pg_trgm--1.2--1.1.sql
new file mode 100644
index ...2862cba
*** a/contrib/pg_trgm/pg_trgm--1.2--1.1.sql
--- b/contrib/pg_trgm/pg_trgm--1.2--1.1.sql
***************
*** 0 ****
--- 1,9 ----
+ /* contrib/pg_trgm/pg_trgm--1.1--1.2.sql */
+ 
+ -- complain if script is sourced in psql, rather than via ALTER EXTENSION
+ \echo Use "ALTER EXTENSION pg_trgm UPDATE TO '1.1'" to load this file. \quit
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin DROP
+         FUNCTION  6 (text, text);
+ 
+ DROP FUNCTION gin_trgm_triconsistent ( internal, smallint, text, integer, internal, internal, internal) ;
diff --git a/contrib/pg_trgm/pg_trgm--1.2.sql b/contrib/pg_trgm/pg_trgm--1.2.sql
new file mode 100644
index ...7800105
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***************
*** 0 ****
--- 1,188 ----
+ /* contrib/pg_trgm/pg_trgm--1.1.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
+ 
+ CREATE FUNCTION set_limit(float4)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT VOLATILE;
+ 
+ CREATE FUNCTION show_limit()
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT STABLE;
+ 
+ CREATE FUNCTION show_trgm(text)
+ RETURNS _text
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION similarity(text,text)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION similarity_op(text,text)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
+ 
+ CREATE OPERATOR % (
+         LEFTARG = text,
+         RIGHTARG = text,
+         PROCEDURE = similarity_op,
+         COMMUTATOR = '%',
+         RESTRICT = contsel,
+         JOIN = contjoinsel
+ );
+ 
+ CREATE FUNCTION similarity_dist(text,text)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE OPERATOR <-> (
+         LEFTARG = text,
+         RIGHTARG = text,
+         PROCEDURE = similarity_dist,
+         COMMUTATOR = '<->'
+ );
+ 
+ -- gist key
+ CREATE FUNCTION gtrgm_in(cstring)
+ RETURNS gtrgm
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION gtrgm_out(gtrgm)
+ RETURNS cstring
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE TYPE gtrgm (
+         INTERNALLENGTH = -1,
+         INPUT = gtrgm_in,
+         OUTPUT = gtrgm_out
+ );
+ 
+ -- support functions for gist
+ CREATE FUNCTION gtrgm_consistent(internal,text,int,oid,internal)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_distance(internal,text,int,oid)
+ RETURNS float8
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_compress(internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_decompress(internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_penalty(internal,internal,internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_picksplit(internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_union(bytea, internal)
+ RETURNS _int4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_same(gtrgm, gtrgm, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ -- create the operator class for gist
+ CREATE OPERATOR CLASS gist_trgm_ops
+ FOR TYPE text USING gist
+ AS
+         OPERATOR        1       % (text, text),
+         FUNCTION        1       gtrgm_consistent (internal, text, int, oid, internal),
+         FUNCTION        2       gtrgm_union (bytea, internal),
+         FUNCTION        3       gtrgm_compress (internal),
+         FUNCTION        4       gtrgm_decompress (internal),
+         FUNCTION        5       gtrgm_penalty (internal, internal, internal),
+         FUNCTION        6       gtrgm_picksplit (internal, internal),
+         FUNCTION        7       gtrgm_same (gtrgm, gtrgm, internal),
+         STORAGE         gtrgm;
+ 
+ -- Add operators and support functions that are new in 9.1.  We do it like
+ -- this, leaving them "loose" in the operator family rather than bound into
+ -- the gist_trgm_ops opclass, because that's the only state that can be
+ -- reproduced during an upgrade from 9.0 (see pg_trgm--unpackaged--1.0.sql).
+ 
+ ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
+         OPERATOR        2       <-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+         OPERATOR        3       pg_catalog.~~ (text, text),
+         OPERATOR        4       pg_catalog.~~* (text, text),
+         FUNCTION        8 (text, text)  gtrgm_distance (internal, text, int, oid);
+ 
+ -- Add operators that are new in 9.3.
+ 
+ ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
+         OPERATOR        5       pg_catalog.~ (text, text),
+         OPERATOR        6       pg_catalog.~* (text, text);
+ 
+ -- support functions for gin
+ CREATE FUNCTION gin_extract_value_trgm(text, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ -- create the operator class for gin
+ CREATE OPERATOR CLASS gin_trgm_ops
+ FOR TYPE text USING gin
+ AS
+         OPERATOR        1       % (text, text),
+         FUNCTION        1       btint4cmp (int4, int4),
+         FUNCTION        2       gin_extract_value_trgm (text, internal),
+         FUNCTION        3       gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
+         FUNCTION        4       gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
+         STORAGE         int4;
+ 
+ -- Add operators that are new in 9.1.
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         OPERATOR        3       pg_catalog.~~ (text, text),
+         OPERATOR        4       pg_catalog.~~* (text, text);
+ 
+ -- Add operators that are new in 9.3.
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         OPERATOR        5       pg_catalog.~ (text, text),
+         OPERATOR        6       pg_catalog.~* (text, text);
+ 
+ -- Add functions that are new in 9.6 (pg_trgm 1.2).
+ 
+ CREATE FUNCTION gin_trgm_triconsistent(internal, int2, text, int4, internal, internal, internal)
+ RETURNS "char"
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         FUNCTION        6      (text,text) gin_trgm_triconsistent (internal, int2, text, int4, internal, internal, internal);
diff --git a/contrib/pg_trgm/pg_trgm.control b/contrib/pg_trgm/pg_trgm.control
new file mode 100644
index 2ac51e6..cbf5a18
*** a/contrib/pg_trgm/pg_trgm.control
--- b/contrib/pg_trgm/pg_trgm.control
***************
*** 1,5 ****
  # pg_trgm extension
  comment = 'text similarity measurement and index searching based on trigrams'
! default_version = '1.1'
  module_pathname = '$libdir/pg_trgm'
  relocatable = true
--- 1,5 ----
  # pg_trgm extension
  comment = 'text similarity measurement and index searching based on trigrams'
! default_version = '1.2'
  module_pathname = '$libdir/pg_trgm'
  relocatable = true
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
new file mode 100644
index d524cea..d02a2f5
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
*************** PG_FUNCTION_INFO_V1(gin_extract_trgm);
*** 14,19 ****
--- 14,20 ----
  PG_FUNCTION_INFO_V1(gin_extract_value_trgm);
  PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
  PG_FUNCTION_INFO_V1(gin_trgm_consistent);
+ PG_FUNCTION_INFO_V1(gin_trgm_triconsistent);
  
  /*
   * This function can only be called if a pre-9.1 version of the GIN operator
*************** gin_trgm_consistent(PG_FUNCTION_ARGS)
*** 235,237 ****
--- 236,324 ----
  
  	PG_RETURN_BOOL(res);
  }
+ 
+ /*
+  * In all cases, GIN_TRUE is at least as favorable to inclusion as GIN_MAYBE.
+  * If no better option is available, simply treat GIN_MAYBE as if it were 
+  * GIN_TRUE a apply the same test as the binary consistent function.
+  */
+ Datum
+ gin_trgm_triconsistent(PG_FUNCTION_ARGS)
+ {
+ 	GinTernaryValue  *check = (bool *) PG_GETARG_POINTER(0);
+ 	StrategyNumber strategy = PG_GETARG_UINT16(1);
+ 
+ 	/* text    *query = PG_GETARG_TEXT_P(2); */
+ 	int32		nkeys = PG_GETARG_INT32(3);
+ 	Pointer    *extra_data = (Pointer *) PG_GETARG_POINTER(4);
+ 	GinTernaryValue	res = GIN_MAYBE;
+ 	int32		i,
+ 				ntrue;
+ 	bool	   *boolcheck;
+ 
+ 	switch (strategy)
+ 	{
+ 		case SimilarityStrategyNumber:
+ 			/* Count the matches */
+ 			ntrue = 0;
+ 			for (i = 0; i < nkeys; i++)
+ 			{
+ 				if (check[i]==GIN_TRUE)
+ 					ntrue++;
+ 				if (check[i]==GIN_MAYBE)
+ 					ntrue++;
+ 			}
+ #ifdef DIVUNION
+ 			res = (nkeys == ntrue) ? GIN_MAYBE : ((((((float4) ntrue) / ((float4) (nkeys - ntrue)))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
+ #else
+ 			res = (nkeys == 0) ? GIN_FALSE : ((((((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
+ #endif
+ 			break;
+ 		case ILikeStrategyNumber:
+ #ifndef IGNORECASE
+ 			elog(ERROR, "cannot handle ~~* with case-sensitive trigrams");
+ #endif
+ 			/* FALL THRU */
+ 		case LikeStrategyNumber:
+ 			/* Check if all extracted trigrams are presented. */
+ 			res = GIN_MAYBE;
+ 			for (i = 0; i < nkeys; i++)
+ 			{
+ 				if (check[i]==GIN_FALSE)
+ 				{
+ 					res = GIN_FALSE;
+ 					break;
+ 				}
+ 			}
+ 			break;
+ 		case RegExpICaseStrategyNumber:
+ #ifndef IGNORECASE
+ 			elog(ERROR, "cannot handle ~* with case-sensitive trigrams");
+ #endif
+ 			/* FALL THRU */
+ 		case RegExpStrategyNumber:
+ 			if (nkeys < 1)
+ 			{
+ 				/* Regex processing gave no result: do full index scan */
+ 				res = GIN_MAYBE;
+ 			}
+ 			else
+ 			{
+ 				boolcheck = (bool *) palloc(sizeof(bool) * nkeys);
+ 				for (i = 0; i < nkeys; i++)
+ 					boolcheck[i] = (check[i] == GIN_TRUE || check[i] == GIN_MAYBE);
+ 				if (!trigramsMatchGraph((TrgmPackedGraph *) extra_data[0], check))
+ 					res = GIN_FALSE;
+ 				pfree(boolcheck);
+ 			}
+ 			break;
+ 		default:
+ 			elog(ERROR, "unrecognized strategy number: %d", strategy);
+ 			res = GIN_FALSE;		/* keep compiler quiet */
+ 			break;
+ 	}
+ 
+ 	/* All cases served by this function are inexact */
+     Assert(res != GIN_TRUE);
+ 	PG_RETURN_GIN_TERNARY_VALUE(res);
+ }

Merlin Moncure

mmoncure@gmail.com

over 10 years ago

In reply to: Jeff Janes (#1)

Re: pg_trgm version 1.2

On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

This patch implements version 1.2 of contrib module pg_trgm.

This supports the triconsistent function, introduced in version 9.4 of the
server, to make it faster to implement indexed queries where some keys are
common and some are rare.

I've included the paths to both upgrade and downgrade between 1.1 and 1.2,
although after doing so you must close and restart the session before you
can be sure the change has taken effect. There is no change to the on-disk
index structure

This shows the difference it can make in some cases:

create extension pg_trgm version "1.1";

create table foo as select

md5(random()::text)|| case when random()<0.000005 then 'lmnop' else '123'
end ||

md5(random()::text) as bar

from generate_series(1,10000000);

create index on foo using gin (bar gin_trgm_ops);

--some queries

alter extension pg_trgm update to "1.2";

--close, reopen, more queries

select count(*) from foo where bar like '%12344321lmnabcddd%';

V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache

V1.2: Time: 2.839 ms --- after repeated execution to warm the cache

Wow! I'm going to test this. I have some data sets for which trigram
searching isn't really practical...if the search string touches
trigrams with a lot of duplication the algorithm can have trouble
beating brute force searches.

trigram searching is important: it's the only way currently to search
string encoded structures for partial strings quickly.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Merlin Moncure

mmoncure@gmail.com

over 10 years ago

In reply to: Merlin Moncure (#2)

Re: pg_trgm version 1.2

On Mon, Jun 29, 2015 at 7:23 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache

V1.2: Time: 2.839 ms --- after repeated execution to warm the cache

Wow! I'm going to test this. I have some data sets for which trigram
searching isn't really practical...if the search string touches
trigrams with a lot of duplication the algorithm can have trouble
beating brute force searches.

trigram searching is important: it's the only way currently to search
string encoded structures for partial strings quickly.

I ran your patch against stock 9.4 and am happy to confirm massive
speedups of pg_trgm; results of 90% reduction in runtime are common.
Also, with the new changes it's hard to get the indexed search to
significantly underperform brute force searching which is a huge
improvement vs the stock behavior, something that made me very wary of
using these kinds of searches in the past.

datatable: 'test2'
rows: ~ 2 million
heap size: 3.3GB (includes several unrelated fields)
index size: 1GB
9.4: stock
9.5: patched

match 50% rows, brute force seq scan
9.4: 11.5s
9.5: 9.1s

match 50% rows, indexed (time is quite variable with 9.4 giving > 40 sec times)
9.4: 21.0s
9.5: 11.8s

match 1% rows, indexed (>90% time reduction!)
9.4: .566s
9.5: .046s

match .1% rows, one selective one non-selective search term, selective
term first
9.4: .563s
9.5: .028s

match .1% rows, one selective one non-selective search term, selective term last
9.4: 1.014s
9.5: 0.093s

very nice! Recently, I examined pg_tgrm for an attribute searching
system -- it failed due to response time variability and lack of tools
to control that. Were your patch in place, I would have passed it. I
had a 'real world' data set though. With this, pg_trgm is basically
outperforming SOLR search engine for all cases we're interested in
whereas before low selectivity cases where having all kinds of
trouble.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Alexander Korotkov

a.korotkov@postgrespro.ru

over 10 years ago

In reply to: Jeff Janes (#1)

Re: pg_trgm version 1.2

On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

This patch implements version 1.2 of contrib module pg_trgm.

This supports the triconsistent function, introduced in version 9.4 of the
server, to make it faster to implement indexed queries where some keys are
common and some are rare.

Thank you for the patch! Lack of pg_trgm triconsistent support was
significant miss after "fast scan" implementation.

I've included the paths to both upgrade and downgrade between 1.1 and 1.2,
although after doing so you must close and restart the session before you
can be sure the change has taken effect. There is no change to the on-disk
index structure

pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do you
expect them in final commit? As I can see in other contribs we have only
last version and upgrade scripts.

This shows the difference it can make in some cases:

create extension pg_trgm version "1.1";

create table foo as select

md5(random()::text)|| case when random()<0.000005 then 'lmnop' else
'123' end ||

md5(random()::text) as bar

from generate_series(1,10000000);

create index on foo using gin (bar gin_trgm_ops);

--some queries

alter extension pg_trgm update to "1.2";

--close, reopen, more queries

select count(*) from foo where bar like '%12344321lmnabcddd%';

V1.1: Time: 1743.691 ms --- after repeated execution to warm the cache

V1.2: Time: 2.839 ms --- after repeated execution to warm the cache

Nice!

You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
core) from 4 to some higher value (which it probably should be anyway, but
there will always be a case where it needs to be higher than you can afford
it to be, so a real solution is needed).

Actually, it depends on how long it takes to calculate consistent function.
The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES could be.
Since all functions in PostgreSQL may define its cost, GIN could calculate
MAX_MAYBE_ENTRIES from the cost of consistent function.

I wasn't sure if this should be a new version of pg_trgm or not, because

there is no user visible change other than to performance. But there may
be some cases where it results in performance reduction and so it is nice
to provide options. Also, I'd like to use it in a back-branch, so versions
seems to be the right way to go there.

We definitely have to stamp a new version of extension every time when its
SQL-definition changes. By the current design of GIN, providing both
consistent and triconsistent function should be optimal for performance. If
it's not so then it's a problem of GIN, not opclass.

There is a lot of code duplication between the binary consistent function

and the ternary one. I thought it the duplication was necessary in order
to support both 1.1 and 1.2 from the same code base.

For me current code duplication is OK.

There may also be some gains in the similarity and regex cases, but I

didn't really analyze those for performance.

I've thought about how to document this change. Looking to other example
of other contrib modules with multiple versions, I decided that we don't
document them, other than in the release notes.

The same patch applies to 9.4 code with a minor conflict in the Makefile,
and gives benefits there as well.

Some other notes about the patch:
* You allocate boolcheck array and don't use it.
* Check coding style and formatting, in particular "check[i]==GIN_TRUE"
should be "check[i] == GIN_TRUE".
* I think some comments needed in gin_trgm_triconsistent() about
trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
that way because trigramsMatchGraph() implements monotonous boolean
function.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Jeff Janes

jeff.janes@gmail.com

over 10 years ago

In reply to: Alexander Korotkov (#4)

1 attachment(s)

Re: pg_trgm version 1.2

On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

This patch implements version 1.2 of contrib module pg_trgm.

This supports the triconsistent function, introduced in version 9.4 of
the server, to make it faster to implement indexed queries where some keys
are common and some are rare.

Thank you for the patch! Lack of pg_trgm triconsistent support was
significant miss after "fast scan" implementation.

I've included the paths to both upgrade and downgrade between 1.1 and
1.2, although after doing so you must close and restart the session before
you can be sure the change has taken effect. There is no change to the
on-disk index structure

pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do you
expect them in final commit? As I can see in other contribs we have only
last version and upgrade scripts.

I had thought that pg_trgm--1.1.sql was needed for pg_upgrade to work, but
I see that that is not the case.

I did see another downgrade path for different module, but on closer
inspection it was one that I wrote while trying to analyze that module, and
then never removed. I have no objection to removing pg_trgm--1.2--1.1.sql
before the commit, but I don't see what we gain by doing so. If a
downgrade is feasible and has been tested, why not include it?

...

You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
core) from 4 to some higher value (which it probably should be anyway, but
there will always be a case where it needs to be higher than you can afford
it to be, so a real solution is needed).

Actually, it depends on how long it takes to calculate consistent
function. The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES
could be. Since all functions in PostgreSQL may define its cost, GIN could
calculate MAX_MAYBE_ENTRIES from the cost of consistent function.

The consistent function gets called on every candidate tuple, so if it is
expensive then it is also all the more worthwhile to reduce the set of
candidate tuples. Perhaps MAX_MAYBE_ENTRIES could be calculated from the
log of the maximum of the predictNumberResult entries? Anyway, a subject
for a different day....

There may also be some gains in the similarity and regex cases, but I

didn't really analyze those for performance.

I've thought about how to document this change. Looking to other example
of other contrib modules with multiple versions, I decided that we don't
document them, other than in the release notes.

The same patch applies to 9.4 code with a minor conflict in the Makefile,
and gives benefits there as well.

Some other notes about the patch:
* You allocate boolcheck array and don't use it.

That was a bug (at least notionally). trigramsMatchGraph was supposed to
be getting boolcheck, not check, passed in to it.

It may not have been a bug in practise, because GIN_MAYBE and GIN_TRUE both
test as true when cast to booleans. But it seems wrong to rely on that. Or
was it intended to work this way?

I'm surprised the compiler suite doesn't emit some kind of warning on that.

* Check coding style and formatting, in particular "check[i]==GIN_TRUE"
should be "check[i] == GIN_TRUE".

Sorry about that, fixed. I also changed it in other places to check[i] !=
GIN_FALSE, rather than checking against both GIN_TRUE and GIN_MAYBE. At
first I was concerned we might decide to add a 4th option to the type which
would render != GIN_FALSE the wrong way to test for it. But since it is
called GinTernaryValue, I think we wouldn't add a fourth thing to it. But
perhaps the more verbose form is clearer to some people.

* I think some comments needed in gin_trgm_triconsistent() about
trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
that way because trigramsMatchGraph() implements monotonous boolean
function.

I have a function-level comment that in all cases, GIN_TRUE is at least as
favorable to inclusion of a tuple as GIN_MAYBE. Should I reiterate that
comment before trigramsMatchGraph() as well? Calling it a monotonic
boolean function is precise and concise, but probably less understandable
to people reading the code. At least, if I saw that, I would need to go do
some reading before I knew what it meant.

Update attached, with the downgrade path still included for now.

Thanks,

Jeff

Attachments:

pg_trgm_1_2_v002.patchapplication/octet-stream; name=pg_trgm_1_2_v002.patchDownload

diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
new file mode 100644
index e081a1e..68db0bc
*** a/contrib/pg_trgm/Makefile
--- b/contrib/pg_trgm/Makefile
*************** MODULE_big = pg_trgm
*** 4,10 ****
  OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o $(WIN32RES)
  
  EXTENSION = pg_trgm
! DATA = pg_trgm--1.1.sql pg_trgm--1.0--1.1.sql pg_trgm--unpackaged--1.0.sql
  PGFILEDESC = "pg_trgm - trigram matching"
  
  REGRESS = pg_trgm
--- 4,10 ----
  OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o $(WIN32RES)
  
  EXTENSION = pg_trgm
! DATA = pg_trgm--1.1.sql pg_trgm--1.2.sql pg_trgm--1.0--1.1.sql pg_trgm--1.1--1.2.sql pg_trgm--1.2--1.1.sql pg_trgm--unpackaged--1.0.sql
  PGFILEDESC = "pg_trgm - trigram matching"
  
  REGRESS = pg_trgm
diff --git a/contrib/pg_trgm/pg_trgm--1.1--1.2.sql b/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
new file mode 100644
index ...c101f21
*** a/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
***************
*** 0 ****
--- 1,12 ----
+ /* contrib/pg_trgm/pg_trgm--1.1--1.2.sql */
+ 
+ -- complain if script is sourced in psql, rather than via ALTER EXTENSION
+ \echo Use "ALTER EXTENSION pg_trgm UPDATE TO '1.2'" to load this file. \quit
+ 
+ CREATE FUNCTION gin_trgm_triconsistent(internal, int2, text, int4, internal, internal, internal)
+ RETURNS "char"
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         FUNCTION        6    (text, text)   gin_trgm_triconsistent (internal, int2, text, int4, internal, internal, internal);
diff --git a/contrib/pg_trgm/pg_trgm--1.2--1.1.sql b/contrib/pg_trgm/pg_trgm--1.2--1.1.sql
new file mode 100644
index ...0e450c5
*** a/contrib/pg_trgm/pg_trgm--1.2--1.1.sql
--- b/contrib/pg_trgm/pg_trgm--1.2--1.1.sql
***************
*** 0 ****
--- 1,9 ----
+ /* contrib/pg_trgm/pg_trgm--1.2--1.1.sql */
+ 
+ -- complain if script is sourced in psql, rather than via ALTER EXTENSION
+ \echo Use "ALTER EXTENSION pg_trgm UPDATE TO '1.1'" to load this file. \quit
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin DROP
+         FUNCTION  6 (text, text);
+ 
+ DROP FUNCTION gin_trgm_triconsistent ( internal, smallint, text, integer, internal, internal, internal) ;
diff --git a/contrib/pg_trgm/pg_trgm--1.2.sql b/contrib/pg_trgm/pg_trgm--1.2.sql
new file mode 100644
index ...03d46d0
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***************
*** 0 ****
--- 1,188 ----
+ /* contrib/pg_trgm/pg_trgm--1.2.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
+ 
+ CREATE FUNCTION set_limit(float4)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT VOLATILE;
+ 
+ CREATE FUNCTION show_limit()
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT STABLE;
+ 
+ CREATE FUNCTION show_trgm(text)
+ RETURNS _text
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION similarity(text,text)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION similarity_op(text,text)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
+ 
+ CREATE OPERATOR % (
+         LEFTARG = text,
+         RIGHTARG = text,
+         PROCEDURE = similarity_op,
+         COMMUTATOR = '%',
+         RESTRICT = contsel,
+         JOIN = contjoinsel
+ );
+ 
+ CREATE FUNCTION similarity_dist(text,text)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE OPERATOR <-> (
+         LEFTARG = text,
+         RIGHTARG = text,
+         PROCEDURE = similarity_dist,
+         COMMUTATOR = '<->'
+ );
+ 
+ -- gist key
+ CREATE FUNCTION gtrgm_in(cstring)
+ RETURNS gtrgm
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION gtrgm_out(gtrgm)
+ RETURNS cstring
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE TYPE gtrgm (
+         INTERNALLENGTH = -1,
+         INPUT = gtrgm_in,
+         OUTPUT = gtrgm_out
+ );
+ 
+ -- support functions for gist
+ CREATE FUNCTION gtrgm_consistent(internal,text,int,oid,internal)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_distance(internal,text,int,oid)
+ RETURNS float8
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_compress(internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_decompress(internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_penalty(internal,internal,internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_picksplit(internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_union(bytea, internal)
+ RETURNS _int4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_same(gtrgm, gtrgm, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ -- create the operator class for gist
+ CREATE OPERATOR CLASS gist_trgm_ops
+ FOR TYPE text USING gist
+ AS
+         OPERATOR        1       % (text, text),
+         FUNCTION        1       gtrgm_consistent (internal, text, int, oid, internal),
+         FUNCTION        2       gtrgm_union (bytea, internal),
+         FUNCTION        3       gtrgm_compress (internal),
+         FUNCTION        4       gtrgm_decompress (internal),
+         FUNCTION        5       gtrgm_penalty (internal, internal, internal),
+         FUNCTION        6       gtrgm_picksplit (internal, internal),
+         FUNCTION        7       gtrgm_same (gtrgm, gtrgm, internal),
+         STORAGE         gtrgm;
+ 
+ -- Add operators and support functions that are new in 9.1.  We do it like
+ -- this, leaving them "loose" in the operator family rather than bound into
+ -- the gist_trgm_ops opclass, because that's the only state that can be
+ -- reproduced during an upgrade from 9.0 (see pg_trgm--unpackaged--1.0.sql).
+ 
+ ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
+         OPERATOR        2       <-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+         OPERATOR        3       pg_catalog.~~ (text, text),
+         OPERATOR        4       pg_catalog.~~* (text, text),
+         FUNCTION        8 (text, text)  gtrgm_distance (internal, text, int, oid);
+ 
+ -- Add operators that are new in 9.3.
+ 
+ ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
+         OPERATOR        5       pg_catalog.~ (text, text),
+         OPERATOR        6       pg_catalog.~* (text, text);
+ 
+ -- support functions for gin
+ CREATE FUNCTION gin_extract_value_trgm(text, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ -- create the operator class for gin
+ CREATE OPERATOR CLASS gin_trgm_ops
+ FOR TYPE text USING gin
+ AS
+         OPERATOR        1       % (text, text),
+         FUNCTION        1       btint4cmp (int4, int4),
+         FUNCTION        2       gin_extract_value_trgm (text, internal),
+         FUNCTION        3       gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
+         FUNCTION        4       gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
+         STORAGE         int4;
+ 
+ -- Add operators that are new in 9.1.
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         OPERATOR        3       pg_catalog.~~ (text, text),
+         OPERATOR        4       pg_catalog.~~* (text, text);
+ 
+ -- Add operators that are new in 9.3.
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         OPERATOR        5       pg_catalog.~ (text, text),
+         OPERATOR        6       pg_catalog.~* (text, text);
+ 
+ -- Add functions that are new in 9.6 (pg_trgm 1.2).
+ 
+ CREATE FUNCTION gin_trgm_triconsistent(internal, int2, text, int4, internal, internal, internal)
+ RETURNS "char"
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         FUNCTION        6      (text,text) gin_trgm_triconsistent (internal, int2, text, int4, internal, internal, internal);
diff --git a/contrib/pg_trgm/pg_trgm.control b/contrib/pg_trgm/pg_trgm.control
new file mode 100644
index 2ac51e6..cbf5a18
*** a/contrib/pg_trgm/pg_trgm.control
--- b/contrib/pg_trgm/pg_trgm.control
***************
*** 1,5 ****
  # pg_trgm extension
  comment = 'text similarity measurement and index searching based on trigrams'
! default_version = '1.1'
  module_pathname = '$libdir/pg_trgm'
  relocatable = true
--- 1,5 ----
  # pg_trgm extension
  comment = 'text similarity measurement and index searching based on trigrams'
! default_version = '1.2'
  module_pathname = '$libdir/pg_trgm'
  relocatable = true
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
new file mode 100644
index d524cea..6e2d46f
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
*************** PG_FUNCTION_INFO_V1(gin_extract_trgm);
*** 14,19 ****
--- 14,20 ----
  PG_FUNCTION_INFO_V1(gin_extract_value_trgm);
  PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
  PG_FUNCTION_INFO_V1(gin_trgm_consistent);
+ PG_FUNCTION_INFO_V1(gin_trgm_triconsistent);
  
  /*
   * This function can only be called if a pre-9.1 version of the GIN operator
*************** gin_trgm_consistent(PG_FUNCTION_ARGS)
*** 235,237 ****
--- 236,323 ----
  
  	PG_RETURN_BOOL(res);
  }
+ 
+ /*
+  * In all cases, GIN_TRUE is at least as favorable to inclusion as 
+  * GIN_MAYBE. If no better option is available, simply treat 
+  * GIN_MAYBE as if it were GIN_TRUE and apply the same test as the binary 
+  * consistent function.
+  */
+ Datum
+ gin_trgm_triconsistent(PG_FUNCTION_ARGS)
+ {
+ 	GinTernaryValue  *check = (GinTernaryValue *) PG_GETARG_POINTER(0);
+ 	StrategyNumber strategy = PG_GETARG_UINT16(1);
+ 
+ 	/* text    *query = PG_GETARG_TEXT_P(2); */
+ 	int32		nkeys = PG_GETARG_INT32(3);
+ 	Pointer    *extra_data = (Pointer *) PG_GETARG_POINTER(4);
+ 	GinTernaryValue	res = GIN_MAYBE;
+ 	int32		i,
+ 				ntrue;
+ 	bool	   *boolcheck;
+ 
+ 	switch (strategy)
+ 	{
+ 		case SimilarityStrategyNumber:
+ 			/* Count the matches */
+ 			ntrue = 0;
+ 			for (i = 0; i < nkeys; i++)
+ 			{
+ 				if (check[i] != GIN_FALSE)
+ 					ntrue++;
+ 			}
+ #ifdef DIVUNION
+ 			res = (nkeys == ntrue) ? GIN_MAYBE : ((((((float4) ntrue) / ((float4) (nkeys - ntrue)))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
+ #else
+ 			res = (nkeys == 0) ? GIN_FALSE : ((((((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
+ #endif
+ 			break;
+ 		case ILikeStrategyNumber:
+ #ifndef IGNORECASE
+ 			elog(ERROR, "cannot handle ~~* with case-sensitive trigrams");
+ #endif
+ 			/* FALL THRU */
+ 		case LikeStrategyNumber:
+ 			/* Check if all extracted trigrams are presented. */
+ 			res = GIN_MAYBE;
+ 			for (i = 0; i < nkeys; i++)
+ 			{
+ 				if (check[i] == GIN_FALSE)
+ 				{
+ 					res = GIN_FALSE;
+ 					break;
+ 				}
+ 			}
+ 			break;
+ 		case RegExpICaseStrategyNumber:
+ #ifndef IGNORECASE
+ 			elog(ERROR, "cannot handle ~* with case-sensitive trigrams");
+ #endif
+ 			/* FALL THRU */
+ 		case RegExpStrategyNumber:
+ 			if (nkeys < 1)
+ 			{
+ 				/* Regex processing gave no result: do full index scan */
+ 				res = GIN_MAYBE;
+ 			}
+ 			else
+ 			{
+ 				boolcheck = (bool *) palloc(sizeof(bool) * nkeys);
+ 				for (i = 0; i < nkeys; i++)
+ 					boolcheck[i] = (check[i] != GIN_FALSE);
+ 				if (!trigramsMatchGraph((TrgmPackedGraph *) extra_data[0], boolcheck))
+ 					res = GIN_FALSE;
+ 				pfree(boolcheck);
+ 			}
+ 			break;
+ 		default:
+ 			elog(ERROR, "unrecognized strategy number: %d", strategy);
+ 			res = GIN_FALSE;		/* keep compiler quiet */
+ 			break;
+ 	}
+ 
+ 	/* All cases served by this function are inexact */
+     Assert(res != GIN_TRUE);
+ 	PG_RETURN_GIN_TERNARY_VALUE(res);
+ }

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Jeff Janes (#5)

Re: pg_trgm version 1.2

Jeff Janes <jeff.janes@gmail.com> writes:

On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do you
expect them in final commit? As I can see in other contribs we have only
last version and upgrade scripts.

I did see another downgrade path for different module, but on closer
inspection it was one that I wrote while trying to analyze that module, and
then never removed. I have no objection to removing pg_trgm--1.2--1.1.sql
before the commit, but I don't see what we gain by doing so. If a
downgrade is feasible and has been tested, why not include it?

Because we don't want to support 1.1 anymore once 1.2 exists. You're
supposing that just because you wrote the downgrade script and think
it works, there's no further costs associated with having that.
Personally, I don't even want to review such a script, let alone document
its existence and why someone might want to use it, let alone support 1.1
into the far future.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Alexander Korotkov

a.korotkov@postgrespro.ru

over 10 years ago

In reply to: Jeff Janes (#5)

Re: pg_trgm version 1.2

On Tue, Jun 30, 2015 at 11:28 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

This patch implements version 1.2 of contrib module pg_trgm.

This supports the triconsistent function, introduced in version 9.4 of
the server, to make it faster to implement indexed queries where some keys
are common and some are rare.

Thank you for the patch! Lack of pg_trgm triconsistent support was
significant miss after "fast scan" implementation.

I've included the paths to both upgrade and downgrade between 1.1 and
1.2, although after doing so you must close and restart the session before
you can be sure the change has taken effect. There is no change to the
on-disk index structure

pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do
you expect them in final commit? As I can see in other contribs we have
only last version and upgrade scripts.

I had thought that pg_trgm--1.1.sql was needed for pg_upgrade to work, but
I see that that is not the case.

I did see another downgrade path for different module, but on closer
inspection it was one that I wrote while trying to analyze that module, and
then never removed. I have no objection to removing pg_trgm--1.2--1.1.sql
before the commit, but I don't see what we gain by doing so. If a
downgrade is feasible and has been tested, why not include it?

See Tom Lane's comment about downgrade scripts. I think just remove it is a
right solution.

You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in

core) from 4 to some higher value (which it probably should be anyway, but
there will always be a case where it needs to be higher than you can afford
it to be, so a real solution is needed).

Actually, it depends on how long it takes to calculate consistent
function. The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES
could be. Since all functions in PostgreSQL may define its cost, GIN could
calculate MAX_MAYBE_ENTRIES from the cost of consistent function.

The consistent function gets called on every candidate tuple, so if it is
expensive then it is also all the more worthwhile to reduce the set of
candidate tuples. Perhaps MAX_MAYBE_ENTRIES could be calculated from the
log of the maximum of the predictNumberResult entries? Anyway, a subject
for a different day....

Yes, that also a point. However, we optimize not only costs of consistent
calls but also costs of getting candidate item pointers which are both CPU
and IO.

There may also be some gains in the similarity and regex cases, but I

didn't really analyze those for performance.

I've thought about how to document this change. Looking to other
example of other contrib modules with multiple versions, I decided that we
don't document them, other than in the release notes.

The same patch applies to 9.4 code with a minor conflict in the
Makefile, and gives benefits there as well.

Some other notes about the patch:
* You allocate boolcheck array and don't use it.

That was a bug (at least notionally). trigramsMatchGraph was supposed to
be getting boolcheck, not check, passed in to it.

It may not have been a bug in practise, because GIN_MAYBE and GIN_TRUE
both test as true when cast to booleans. But it seems wrong to rely on
that. Or was it intended to work this way?

I'm surprised the compiler suite doesn't emit some kind of warning on that.

I'm not sure. It's attractive to get rid of useless memory allocation and
copying.
You can also change trigramsMatchGraph() to take GinTernaryValue *
argument. Probably casting bool * as GinTernaryValue * looks better than
inverse casting.

* Check coding style and formatting, in particular "check[i]==GIN_TRUE"
should be "check[i] == GIN_TRUE".

Sorry about that, fixed. I also changed it in other places to check[i] !=
GIN_FALSE, rather than checking against both GIN_TRUE and GIN_MAYBE. At
first I was concerned we might decide to add a 4th option to the type which
would render != GIN_FALSE the wrong way to test for it. But since it is
called GinTernaryValue, I think we wouldn't add a fourth thing to it. But
perhaps the more verbose form is clearer to some people.

Thanks!

* I think some comments needed in gin_trgm_triconsistent() about

trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
that way because trigramsMatchGraph() implements monotonous boolean
function.

I have a function-level comment that in all cases, GIN_TRUE is at least as
favorable to inclusion of a tuple as GIN_MAYBE. Should I reiterate that
comment before trigramsMatchGraph() as well? Calling it a monotonic
boolean function is precise and concise, but probably less understandable
to people reading the code. At least, if I saw that, I would need to go do
some reading before I knew what it meant.

Let's consider '^(?!.*def).*abc' regular expression as an example. It
matches strings which contains 'abc' and don't contains 'def'.

# SELECT 'abc' ~ '^(?!.*def).*abc';
?column?
----------
t
(1 row)

# SELECT 'def abc' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

# SELECT 'abc def' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

Theoretically, our trigram regex processing could support negative matching
of 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false) = true
but trigramsMatchGraph(abc = true, def = true) = false. Actually, it
doesn't because trigramsMatchGraph() implements a monotonic function. I
just think it should be stated explicitly.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Jeff Janes

jeff.janes@gmail.com

over 10 years ago

In reply to: Alexander Korotkov (#7)

1 attachment(s)

Re: pg_trgm version 1.2

On Tue, Jul 7, 2015 at 6:33 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

See Tom Lane's comment about downgrade scripts. I think just remove it is
a right solution.

The new patch removes the downgrade path and the ability to install the old
version.

(If anyone wants an easy downgrade path for testing, they can keep using
the prior patch--no functional changes)

It also added a comment before the trigramsMatchGraph call.

I retained the palloc and the loop to promote the ternary array to a binary
array. While I also think it is tempting to get rid of that by abusing the
type system and would do it that way in my own standalone code, it seems
contrary to way the project usually does things. And I couldn't measure a
difference in performance.

....

Let's consider '^(?!.*def).*abc' regular expression as an example. It
matches strings which contains 'abc' and don't contains 'def'.

# SELECT 'abc' ~ '^(?!.*def).*abc';
?column?
----------
t
(1 row)

# SELECT 'def abc' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

# SELECT 'abc def' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

Theoretically, our trigram regex processing could support negative
matching of 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false)
= true but trigramsMatchGraph(abc = true, def = true) = false. Actually, it
doesn't because trigramsMatchGraph() implements a monotonic function. I
just think it should be stated explicitly.

Do you think it is likely to change to stop being monotonic and so support
the (def=GIN_TRUE) => false case?

^(?!.*def) seems like a profoundly niche situation. (Although one that I
might actually start using myself now that I know it isn't just a Perl-ism).

It doesn't make any difference to this patch, other than perhaps how to
word the comments.

Cheers,

Jeff

Attachments:

pg_trgm_1_2_v003.patchapplication/octet-stream; name=pg_trgm_1_2_v003.patchDownload

diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
new file mode 100644
index e081a1e..1e38753
*** a/contrib/pg_trgm/Makefile
--- b/contrib/pg_trgm/Makefile
*************** MODULE_big = pg_trgm
*** 4,10 ****
  OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o $(WIN32RES)
  
  EXTENSION = pg_trgm
! DATA = pg_trgm--1.1.sql pg_trgm--1.0--1.1.sql pg_trgm--unpackaged--1.0.sql
  PGFILEDESC = "pg_trgm - trigram matching"
  
  REGRESS = pg_trgm
--- 4,10 ----
  OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o $(WIN32RES)
  
  EXTENSION = pg_trgm
! DATA = pg_trgm--1.2.sql pg_trgm--1.0--1.1.sql pg_trgm--1.1--1.2.sql pg_trgm--unpackaged--1.0.sql
  PGFILEDESC = "pg_trgm - trigram matching"
  
  REGRESS = pg_trgm
diff --git a/contrib/pg_trgm/pg_trgm--1.1--1.2.sql b/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
new file mode 100644
index ...c101f21
*** a/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.1--1.2.sql
***************
*** 0 ****
--- 1,12 ----
+ /* contrib/pg_trgm/pg_trgm--1.1--1.2.sql */
+ 
+ -- complain if script is sourced in psql, rather than via ALTER EXTENSION
+ \echo Use "ALTER EXTENSION pg_trgm UPDATE TO '1.2'" to load this file. \quit
+ 
+ CREATE FUNCTION gin_trgm_triconsistent(internal, int2, text, int4, internal, internal, internal)
+ RETURNS "char"
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         FUNCTION        6    (text, text)   gin_trgm_triconsistent (internal, int2, text, int4, internal, internal, internal);
diff --git a/contrib/pg_trgm/pg_trgm--1.1.sql b/contrib/pg_trgm/pg_trgm--1.1.sql
new file mode .
index 34b37e4..e69de29
*** a/contrib/pg_trgm/pg_trgm--1.1.sql
--- b/contrib/pg_trgm/pg_trgm--1.1.sql
***************
*** 1,178 ****
- /* contrib/pg_trgm/pg_trgm--1.1.sql */
- 
- -- complain if script is sourced in psql, rather than via CREATE EXTENSION
- \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
- 
- CREATE FUNCTION set_limit(float4)
- RETURNS float4
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT VOLATILE;
- 
- CREATE FUNCTION show_limit()
- RETURNS float4
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT STABLE;
- 
- CREATE FUNCTION show_trgm(text)
- RETURNS _text
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT IMMUTABLE;
- 
- CREATE FUNCTION similarity(text,text)
- RETURNS float4
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT IMMUTABLE;
- 
- CREATE FUNCTION similarity_op(text,text)
- RETURNS bool
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
- 
- CREATE OPERATOR % (
-         LEFTARG = text,
-         RIGHTARG = text,
-         PROCEDURE = similarity_op,
-         COMMUTATOR = '%',
-         RESTRICT = contsel,
-         JOIN = contjoinsel
- );
- 
- CREATE FUNCTION similarity_dist(text,text)
- RETURNS float4
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT IMMUTABLE;
- 
- CREATE OPERATOR <-> (
-         LEFTARG = text,
-         RIGHTARG = text,
-         PROCEDURE = similarity_dist,
-         COMMUTATOR = '<->'
- );
- 
- -- gist key
- CREATE FUNCTION gtrgm_in(cstring)
- RETURNS gtrgm
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT IMMUTABLE;
- 
- CREATE FUNCTION gtrgm_out(gtrgm)
- RETURNS cstring
- AS 'MODULE_PATHNAME'
- LANGUAGE C STRICT IMMUTABLE;
- 
- CREATE TYPE gtrgm (
-         INTERNALLENGTH = -1,
-         INPUT = gtrgm_in,
-         OUTPUT = gtrgm_out
- );
- 
- -- support functions for gist
- CREATE FUNCTION gtrgm_consistent(internal,text,int,oid,internal)
- RETURNS bool
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gtrgm_distance(internal,text,int,oid)
- RETURNS float8
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gtrgm_compress(internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gtrgm_decompress(internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gtrgm_penalty(internal,internal,internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gtrgm_picksplit(internal, internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gtrgm_union(bytea, internal)
- RETURNS _int4
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gtrgm_same(gtrgm, gtrgm, internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- -- create the operator class for gist
- CREATE OPERATOR CLASS gist_trgm_ops
- FOR TYPE text USING gist
- AS
-         OPERATOR        1       % (text, text),
-         FUNCTION        1       gtrgm_consistent (internal, text, int, oid, internal),
-         FUNCTION        2       gtrgm_union (bytea, internal),
-         FUNCTION        3       gtrgm_compress (internal),
-         FUNCTION        4       gtrgm_decompress (internal),
-         FUNCTION        5       gtrgm_penalty (internal, internal, internal),
-         FUNCTION        6       gtrgm_picksplit (internal, internal),
-         FUNCTION        7       gtrgm_same (gtrgm, gtrgm, internal),
-         STORAGE         gtrgm;
- 
- -- Add operators and support functions that are new in 9.1.  We do it like
- -- this, leaving them "loose" in the operator family rather than bound into
- -- the gist_trgm_ops opclass, because that's the only state that can be
- -- reproduced during an upgrade from 9.0 (see pg_trgm--unpackaged--1.0.sql).
- 
- ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
-         OPERATOR        2       <-> (text, text) FOR ORDER BY pg_catalog.float_ops,
-         OPERATOR        3       pg_catalog.~~ (text, text),
-         OPERATOR        4       pg_catalog.~~* (text, text),
-         FUNCTION        8 (text, text)  gtrgm_distance (internal, text, int, oid);
- 
- -- Add operators that are new in 9.3.
- 
- ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
-         OPERATOR        5       pg_catalog.~ (text, text),
-         OPERATOR        6       pg_catalog.~* (text, text);
- 
- -- support functions for gin
- CREATE FUNCTION gin_extract_value_trgm(text, internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
- RETURNS internal
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- CREATE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
- RETURNS bool
- AS 'MODULE_PATHNAME'
- LANGUAGE C IMMUTABLE STRICT;
- 
- -- create the operator class for gin
- CREATE OPERATOR CLASS gin_trgm_ops
- FOR TYPE text USING gin
- AS
-         OPERATOR        1       % (text, text),
-         FUNCTION        1       btint4cmp (int4, int4),
-         FUNCTION        2       gin_extract_value_trgm (text, internal),
-         FUNCTION        3       gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
-         FUNCTION        4       gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
-         STORAGE         int4;
- 
- -- Add operators that are new in 9.1.
- 
- ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
-         OPERATOR        3       pg_catalog.~~ (text, text),
-         OPERATOR        4       pg_catalog.~~* (text, text);
- 
- -- Add operators that are new in 9.3.
- 
- ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
-         OPERATOR        5       pg_catalog.~ (text, text),
-         OPERATOR        6       pg_catalog.~* (text, text);
--- 0 ----
diff --git a/contrib/pg_trgm/pg_trgm--1.2.sql b/contrib/pg_trgm/pg_trgm--1.2.sql
new file mode 100644
index ...03d46d0
*** a/contrib/pg_trgm/pg_trgm--1.2.sql
--- b/contrib/pg_trgm/pg_trgm--1.2.sql
***************
*** 0 ****
--- 1,188 ----
+ /* contrib/pg_trgm/pg_trgm--1.2.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "CREATE EXTENSION pg_trgm" to load this file. \quit
+ 
+ CREATE FUNCTION set_limit(float4)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT VOLATILE;
+ 
+ CREATE FUNCTION show_limit()
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT STABLE;
+ 
+ CREATE FUNCTION show_trgm(text)
+ RETURNS _text
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION similarity(text,text)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION similarity_op(text,text)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT STABLE;  -- stable because depends on trgm_limit
+ 
+ CREATE OPERATOR % (
+         LEFTARG = text,
+         RIGHTARG = text,
+         PROCEDURE = similarity_op,
+         COMMUTATOR = '%',
+         RESTRICT = contsel,
+         JOIN = contjoinsel
+ );
+ 
+ CREATE FUNCTION similarity_dist(text,text)
+ RETURNS float4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE OPERATOR <-> (
+         LEFTARG = text,
+         RIGHTARG = text,
+         PROCEDURE = similarity_dist,
+         COMMUTATOR = '<->'
+ );
+ 
+ -- gist key
+ CREATE FUNCTION gtrgm_in(cstring)
+ RETURNS gtrgm
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE FUNCTION gtrgm_out(gtrgm)
+ RETURNS cstring
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C STRICT IMMUTABLE;
+ 
+ CREATE TYPE gtrgm (
+         INTERNALLENGTH = -1,
+         INPUT = gtrgm_in,
+         OUTPUT = gtrgm_out
+ );
+ 
+ -- support functions for gist
+ CREATE FUNCTION gtrgm_consistent(internal,text,int,oid,internal)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_distance(internal,text,int,oid)
+ RETURNS float8
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_compress(internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_decompress(internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_penalty(internal,internal,internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_picksplit(internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_union(bytea, internal)
+ RETURNS _int4
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gtrgm_same(gtrgm, gtrgm, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ -- create the operator class for gist
+ CREATE OPERATOR CLASS gist_trgm_ops
+ FOR TYPE text USING gist
+ AS
+         OPERATOR        1       % (text, text),
+         FUNCTION        1       gtrgm_consistent (internal, text, int, oid, internal),
+         FUNCTION        2       gtrgm_union (bytea, internal),
+         FUNCTION        3       gtrgm_compress (internal),
+         FUNCTION        4       gtrgm_decompress (internal),
+         FUNCTION        5       gtrgm_penalty (internal, internal, internal),
+         FUNCTION        6       gtrgm_picksplit (internal, internal),
+         FUNCTION        7       gtrgm_same (gtrgm, gtrgm, internal),
+         STORAGE         gtrgm;
+ 
+ -- Add operators and support functions that are new in 9.1.  We do it like
+ -- this, leaving them "loose" in the operator family rather than bound into
+ -- the gist_trgm_ops opclass, because that's the only state that can be
+ -- reproduced during an upgrade from 9.0 (see pg_trgm--unpackaged--1.0.sql).
+ 
+ ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
+         OPERATOR        2       <-> (text, text) FOR ORDER BY pg_catalog.float_ops,
+         OPERATOR        3       pg_catalog.~~ (text, text),
+         OPERATOR        4       pg_catalog.~~* (text, text),
+         FUNCTION        8 (text, text)  gtrgm_distance (internal, text, int, oid);
+ 
+ -- Add operators that are new in 9.3.
+ 
+ ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
+         OPERATOR        5       pg_catalog.~ (text, text),
+         OPERATOR        6       pg_catalog.~* (text, text);
+ 
+ -- support functions for gin
+ CREATE FUNCTION gin_extract_value_trgm(text, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gin_extract_query_trgm(text, internal, int2, internal, internal, internal, internal)
+ RETURNS internal
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ CREATE FUNCTION gin_trgm_consistent(internal, int2, text, int4, internal, internal, internal, internal)
+ RETURNS bool
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ -- create the operator class for gin
+ CREATE OPERATOR CLASS gin_trgm_ops
+ FOR TYPE text USING gin
+ AS
+         OPERATOR        1       % (text, text),
+         FUNCTION        1       btint4cmp (int4, int4),
+         FUNCTION        2       gin_extract_value_trgm (text, internal),
+         FUNCTION        3       gin_extract_query_trgm (text, internal, int2, internal, internal, internal, internal),
+         FUNCTION        4       gin_trgm_consistent (internal, int2, text, int4, internal, internal, internal, internal),
+         STORAGE         int4;
+ 
+ -- Add operators that are new in 9.1.
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         OPERATOR        3       pg_catalog.~~ (text, text),
+         OPERATOR        4       pg_catalog.~~* (text, text);
+ 
+ -- Add operators that are new in 9.3.
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         OPERATOR        5       pg_catalog.~ (text, text),
+         OPERATOR        6       pg_catalog.~* (text, text);
+ 
+ -- Add functions that are new in 9.6 (pg_trgm 1.2).
+ 
+ CREATE FUNCTION gin_trgm_triconsistent(internal, int2, text, int4, internal, internal, internal)
+ RETURNS "char"
+ AS 'MODULE_PATHNAME'
+ LANGUAGE C IMMUTABLE STRICT;
+ 
+ ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
+         FUNCTION        6      (text,text) gin_trgm_triconsistent (internal, int2, text, int4, internal, internal, internal);
diff --git a/contrib/pg_trgm/pg_trgm.control b/contrib/pg_trgm/pg_trgm.control
new file mode 100644
index 2ac51e6..cbf5a18
*** a/contrib/pg_trgm/pg_trgm.control
--- b/contrib/pg_trgm/pg_trgm.control
***************
*** 1,5 ****
  # pg_trgm extension
  comment = 'text similarity measurement and index searching based on trigrams'
! default_version = '1.1'
  module_pathname = '$libdir/pg_trgm'
  relocatable = true
--- 1,5 ----
  # pg_trgm extension
  comment = 'text similarity measurement and index searching based on trigrams'
! default_version = '1.2'
  module_pathname = '$libdir/pg_trgm'
  relocatable = true
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
new file mode 100644
index d524cea..63526c0
*** a/contrib/pg_trgm/trgm_gin.c
--- b/contrib/pg_trgm/trgm_gin.c
*************** PG_FUNCTION_INFO_V1(gin_extract_trgm);
*** 14,19 ****
--- 14,20 ----
  PG_FUNCTION_INFO_V1(gin_extract_value_trgm);
  PG_FUNCTION_INFO_V1(gin_extract_query_trgm);
  PG_FUNCTION_INFO_V1(gin_trgm_consistent);
+ PG_FUNCTION_INFO_V1(gin_trgm_triconsistent);
  
  /*
   * This function can only be called if a pre-9.1 version of the GIN operator
*************** gin_trgm_consistent(PG_FUNCTION_ARGS)
*** 235,237 ****
--- 236,329 ----
  
  	PG_RETURN_BOOL(res);
  }
+ 
+ /*
+  * In all cases, GIN_TRUE is at least as favorable to inclusion as 
+  * GIN_MAYBE. If no better option is available, simply treat 
+  * GIN_MAYBE as if it were GIN_TRUE and apply the same test as the binary 
+  * consistent function.
+  */
+ Datum
+ gin_trgm_triconsistent(PG_FUNCTION_ARGS)
+ {
+ 	GinTernaryValue  *check = (GinTernaryValue *) PG_GETARG_POINTER(0);
+ 	StrategyNumber strategy = PG_GETARG_UINT16(1);
+ 
+ 	/* text    *query = PG_GETARG_TEXT_P(2); */
+ 	int32		nkeys = PG_GETARG_INT32(3);
+ 	Pointer    *extra_data = (Pointer *) PG_GETARG_POINTER(4);
+ 	GinTernaryValue	res = GIN_MAYBE;
+ 	int32		i,
+ 				ntrue;
+ 	bool	   *boolcheck;
+ 
+ 	switch (strategy)
+ 	{
+ 		case SimilarityStrategyNumber:
+ 			/* Count the matches */
+ 			ntrue = 0;
+ 			for (i = 0; i < nkeys; i++)
+ 			{
+ 				if (check[i] != GIN_FALSE)
+ 					ntrue++;
+ 			}
+ #ifdef DIVUNION
+ 			res = (nkeys == ntrue) ? GIN_MAYBE : ((((((float4) ntrue) / ((float4) (nkeys - ntrue)))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
+ #else
+ 			res = (nkeys == 0) ? GIN_FALSE : ((((((float4) ntrue) / ((float4) nkeys))) >= trgm_limit) ? GIN_MAYBE : GIN_FALSE);
+ #endif
+ 			break;
+ 		case ILikeStrategyNumber:
+ #ifndef IGNORECASE
+ 			elog(ERROR, "cannot handle ~~* with case-sensitive trigrams");
+ #endif
+ 			/* FALL THRU */
+ 		case LikeStrategyNumber:
+ 			/* Check if all extracted trigrams are presented. */
+ 			res = GIN_MAYBE;
+ 			for (i = 0; i < nkeys; i++)
+ 			{
+ 				if (check[i] == GIN_FALSE)
+ 				{
+ 					res = GIN_FALSE;
+ 					break;
+ 				}
+ 			}
+ 			break;
+ 		case RegExpICaseStrategyNumber:
+ #ifndef IGNORECASE
+ 			elog(ERROR, "cannot handle ~* with case-sensitive trigrams");
+ #endif
+ 			/* FALL THRU */
+ 		case RegExpStrategyNumber:
+ 			if (nkeys < 1)
+ 			{
+ 				/* Regex processing gave no result: do full index scan */
+ 				res = GIN_MAYBE;
+ 			}
+ 			else
+ 			{
+ 				/* 
+ 				 * As trigramsMatchGraph implements a montonic boolean function,
+ 				 * promoting all GIN_MAYBE keys to GIN_TRUE will give a conservative
+ 				 * result.
+ 				 */
+ 				boolcheck = (bool *) palloc(sizeof(bool) * nkeys);
+ 				for (i = 0; i < nkeys; i++)
+ 					boolcheck[i] = (check[i] != GIN_FALSE);
+ 				if (!trigramsMatchGraph((TrgmPackedGraph *) extra_data[0],
+ 										boolcheck))
+ 					res = GIN_FALSE;
+ 				pfree(boolcheck);
+ 			}
+ 			break;
+ 		default:
+ 			elog(ERROR, "unrecognized strategy number: %d", strategy);
+ 			res = GIN_FALSE;		/* keep compiler quiet */
+ 			break;
+ 	}
+ 
+ 	/* All cases served by this function are inexact */
+ 	Assert(res != GIN_TRUE);
+ 	PG_RETURN_GIN_TERNARY_VALUE(res);
+ }

Alexander Korotkov

a.korotkov@postgrespro.ru

over 10 years ago

In reply to: Jeff Janes (#8)

Re: pg_trgm version 1.2

On Wed, Jul 15, 2015 at 12:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Jul 7, 2015 at 6:33 AM, Alexander Korotkov <
a.korotkov@postgrespro.ru> wrote:

See Tom Lane's comment about downgrade scripts. I think just remove it is
a right solution.

The new patch removes the downgrade path and the ability to install the
old version.

(If anyone wants an easy downgrade path for testing, they can keep using
the prior patch--no functional changes)

It also added a comment before the trigramsMatchGraph call.

I retained the palloc and the loop to promote the ternary array to a
binary array. While I also think it is tempting to get rid of that by
abusing the type system and would do it that way in my own standalone code,
it seems contrary to way the project usually does things. And I couldn't
measure a difference in performance.

Ok!

....

Let's consider '^(?!.*def).*abc' regular expression as an example. It
matches strings which contains 'abc' and don't contains 'def'.

# SELECT 'abc' ~ '^(?!.*def).*abc';
?column?
----------
t
(1 row)

# SELECT 'def abc' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

# SELECT 'abc def' ~ '^(?!.*def).*abc';
?column?
----------
f
(1 row)

Theoretically, our trigram regex processing could support negative
matching of 'def' trigram, i.e. trigramsMatchGraph(abc = true, def = false)
= true but trigramsMatchGraph(abc = true, def = true) = false. Actually, it
doesn't because trigramsMatchGraph() implements a monotonic function. I
just think it should be stated explicitly.

Do you think it is likely to change to stop being monotonic and so support
the (def=GIN_TRUE) => false case?

^(?!.*def) seems like a profoundly niche situation. (Although one that
I might actually start using myself now that I know it isn't just a
Perl-ism).

It doesn't make any difference to this patch, other than perhaps how to
word the comments.

Yes, it was just about the comments.

I also run few tests on real-life dataset: set of dblp paper titles. As it
was expected, there is huge speedup when pattern is long enough.

# SELECT count(*) FROM dblp_titles WHERE s ILIKE '%Multidimensional Data%';
count
-------
318
(1 row)

Time: 29,114 ms

# SELECT count(*) FROM dblp_titles WHERE s ~* '.*Multidimensional Data.*';
count
-------
318
(1 row)

Time: 26,273 ms

# SELECT count(*) FROM dblp_titles WHERE s ILIKE '%Multidimensional Data%';
count
-------
318
(1 row)

Time: 249,725 ms

# SELECT count(*) FROM dblp_titles WHERE s ~* '.*Multidimensional Data.*';
count
-------
318
(1 row)

Time: 218,627 ms

For me, patch is in the pretty good shape. I'm going to mark it "Ready for
committer".

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company