trgm regex index peculiarity
9.4devel (but same in 9.3)
In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these timings:
select txt from azjunk6 where txt ~ '^abcd';
130 ms
select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 ms
(a similar performance difference occurs when using a regex, i.e. 'abc[de]' )
This difference is so large that I wonder if there is not something wrong in the first case. (The returned results are
correct though)
Here are the two explains:
explain analyze select txt from azjunk6 where txt ~ '^abcd';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=108.78..484.93 rows=100 width=81) (actual time=129.557..129.742 rows=1 loops=1)
Recheck Cond: (txt ~ '^abcd'::text)
Rows Removed by Index Recheck: 17
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..108.75 rows=100 width=0) (actual time=129.503..129.503 rows=18
loops=1)
Index Cond: (txt ~ '^abcd'::text)
Total runtime: 130.008 ms
(6 rows)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=56.75..433.40 rows=1 width=81) (actual time=2.064..3.379 rows=1 loops=1)
Recheck Cond: (txt ~ 'abcd'::text)
Rows Removed by Index Recheck: 14
Filter: (substr(txt, 1, 4) = 'abcd'::text)
Rows Removed by Filter: 112
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..56.75 rows=100 width=0) (actual time=1.911..1.911 rows=127
loops=1)
Index Cond: (txt ~ 'abcd'::text)
Total runtime: 3.409 ms
(8 rows)
The results in both cases are correct, but does this difference not almost amount to a bug?
( Interestingly, the variant WHERE txt ~ 'abcd$'
is as fast as the non-anchored variant )
Thanks,
Erik Rijkers
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
"Erik Rijkers" <er@xs4all.nl> writes:
In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these timings:
select txt from azjunk6 where txt ~ '^abcd';
130 ms
select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 ms
Hm, could you provide a self-contained test case?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, June 21, 2013 05:25, Tom Lane wrote:
"Erik Rijkers" <er@xs4all.nl> writes:
In a 112 MB test table (containing random generated text) with a trgm index (gin_trgm_ops), I consistently get these
timings:
select txt from azjunk6 where txt ~ '^abcd';
130 ms
select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 msHm, could you provide a self-contained test case?
yes, sorry. I tested on a 1M row table:
#!/bin/sh
# create table:
for power in 6;
do
table=azjunk${power}
index=${table}_trgm_re_idx
perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 .. 1e'"${power};" \
| psql -aqXc "
drop table if exists $table;
create table $table(txt text);
copy $table from stdin;";
echo "set session maintenance_work_mem='1GB';
create index $index on $table using gin (txt gin_trgm_ops);
analyze $table;" | psql -qtAX;
done
# test:
echo "
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
" | psql -Xqa
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, June 21, 2013 05:25, Tom Lane wrote:
"Erik Rijkers" <er@xs4all.nl> writes:
In a 112 MB test table (containing random generated text) with a trgm
index (gin_trgm_ops), I consistently get these
timings:
select txt from azjunk6 where txt ~ '^abcd';
130 ms
select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 msHm, could you provide a self-contained test case?
yes, sorry. I tested on a 1M row table:
#!/bin/sh
# create table:
for power in 6;
do
table=azjunk${power}
index=${table}_trgm_re_idx
perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 ..
1e'"${power};" \
| psql -aqXc "
drop table if exists $table;
create table $table(txt text);
copy $table from stdin;";
echo "set session maintenance_work_mem='1GB';
create index $index on $table using gin (txt gin_trgm_ops);
analyze $table;" | psql -qtAX;
done# test:
echo "
\\timing on
explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140
ms)
explain analyze select txt from azjunk6 where txt ~ 'abcd' and
substr(txt,1,4) = 'abcd'; -- fast (5 ms)
" | psql -Xqa
Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
'__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab'
is more frequent than another trigrams because we don't know their
distribution. However, simple assumption that blank character (in pg_trgm
meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE
macro which is used in selectColorTrigrams like count of characters in
COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into
WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove
color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we
are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.
------
With best regards,
Alexander Korotkov.
Attachments:
trgm_regex_optimize.1.patchapplication/octet-stream; name=trgm_regex_optimize.1.patchDownload
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
new file mode 100644
index 772fc44..2632a03
*** a/contrib/pg_trgm/trgm_regexp.c
--- b/contrib/pg_trgm/trgm_regexp.c
***************
*** 203,210 ****
*/
#define MAX_EXPANDED_STATES 128
#define MAX_EXPANDED_ARCS 1024
! #define MAX_TRGM_COUNT 256
#define COLOR_COUNT_LIMIT 256
/* Struct representing a single pg_wchar, converted back to multibyte form */
typedef struct
--- 203,212 ----
*/
#define MAX_EXPANDED_STATES 128
#define MAX_EXPANDED_ARCS 1024
! #define MAX_TRGM_SIZE 256
! #define WISH_TRGM_SIZE 16
#define COLOR_COUNT_LIMIT 256
+ #define BLANK_COLOR_SIZE 32
/* Struct representing a single pg_wchar, converted back to multibyte form */
typedef struct
*************** static void fillTrgm(trgm *ptrgm, trgm_m
*** 460,465 ****
--- 462,468 ----
static void mergeStates(TrgmState *state1, TrgmState *state2);
static int colorTrgmInfoCmp(const void *p1, const void *p2);
static int colorTrgmInfoCountCmp(const void *p1, const void *p2);
+ static int getColorTrigramSize(const ColorTrgmInfo *colorTrgm);
static TrgmPackedGraph *packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext);
static int packArcInfoCmp(const void *a1, const void *a2);
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1423,1429 ****
i;
TrgmState *state;
ColorTrgmInfo *colorTrgms;
! int64 totalTrgmCount;
int number;
/* Collect color trigrams from all arcs */
--- 1426,1432 ----
i;
TrgmState *state;
ColorTrgmInfo *colorTrgms;
! int64 totalTrgmSize;
int number;
/* Collect color trigrams from all arcs */
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1489,1495 ****
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.
*/
! totalTrgmCount = 0;
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
--- 1492,1498 ----
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.
*/
! totalTrgmSize = 0;
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1504,1510 ****
count *= trgmNFA->colorInfo[c].wordCharsCount;
}
trgmInfo->count = count;
! totalTrgmCount += count;
}
/* Sort color trigrams in descending order of simple trigram counts */
--- 1507,1513 ----
count *= trgmNFA->colorInfo[c].wordCharsCount;
}
trgmInfo->count = count;
! totalTrgmSize += getColorTrigramSize(trgmInfo);
}
/* Sort color trigrams in descending order of simple trigram counts */
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1522,1528 ****
* cannot always remove the trigram we'd prefer to.
*/
for (i = 0;
! (i < trgmNFA->colorTrgmsCount) && (totalTrgmCount > MAX_TRGM_COUNT);
i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
--- 1525,1531 ----
* cannot always remove the trigram we'd prefer to.
*/
for (i = 0;
! (i < trgmNFA->colorTrgmsCount) && (totalTrgmSize > WISH_TRGM_SIZE);
i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1572,1585 ****
/* Mark trigram unexpanded, and update totalTrgmCount */
trgmInfo->expanded = false;
! totalTrgmCount -= trgmInfo->count;
}
/* Did we succeed in fitting into MAX_TRGM_COUNT? */
! if (totalTrgmCount > MAX_TRGM_COUNT)
return false;
! trgmNFA->totalTrgmCount = (int) totalTrgmCount;
/*
* Sort color trigrams by colors (will be useful for bsearch in packGraph)
--- 1575,1588 ----
/* Mark trigram unexpanded, and update totalTrgmCount */
trgmInfo->expanded = false;
! totalTrgmSize -= getColorTrigramSize(trgmInfo);
}
/* Did we succeed in fitting into MAX_TRGM_COUNT? */
! if (totalTrgmSize > MAX_TRGM_SIZE)
return false;
! trgmNFA->totalTrgmCount = (int) totalTrgmSize;
/*
* Sort color trigrams by colors (will be useful for bsearch in packGraph)
*************** colorTrgmInfoCmp(const void *p1, const v
*** 1751,1767 ****
static int
colorTrgmInfoCountCmp(const void *p1, const void *p2)
{
! const ColorTrgmInfo *c1 = (const ColorTrgmInfo *) p1;
! const ColorTrgmInfo *c2 = (const ColorTrgmInfo *) p2;
! if (c1->count < c2->count)
return 1;
! else if (c1->count == c2->count)
return 0;
else
return -1;
}
/*---------------------
* Subroutines for packing the graph into final representation (stage 4).
--- 1754,1783 ----
static int
colorTrgmInfoCountCmp(const void *p1, const void *p2)
{
! int count1 = getColorTrigramSize((const ColorTrgmInfo *) p1);
! int count2 = getColorTrigramSize((const ColorTrgmInfo *) p2);
! if (count1 < count2)
return 1;
! else if (count1 == count2)
return 0;
else
return -1;
}
+ static int
+ getColorTrigramSize(const ColorTrgmInfo *colorTrgm)
+ {
+ int count = colorTrgm->count, i;
+
+ for (i = 0; i < 3; i++)
+ {
+ if (colorTrgm->ctrgm.colors[i] == COLOR_BLANK)
+ count *= BLANK_COLOR_SIZE;
+ }
+ return count;
+ }
+
/*---------------------
* Subroutines for packing the graph into final representation (stage 4).
On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, June 21, 2013 05:25, Tom Lane wrote:
"Erik Rijkers" <er@xs4all.nl> writes:
In a 112 MB test table (containing random generated text) with a trgm
index (gin_trgm_ops), I consistently get these
timings:
select txt from azjunk6 where txt ~ '^abcd';
130 ms
select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 msRegex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
'__a' and '_ab' and that gives so significant speedup.
[trgm_regex_optimize.1.patch ]
Yes, that fixes the problem, thanks.
Erik Rijkers
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jun 21, 2013 at 5:39 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, June 21, 2013 15:11, Alexander Korotkov wrote:
On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, June 21, 2013 05:25, Tom Lane wrote:
"Erik Rijkers" <er@xs4all.nl> writes:
In a 112 MB test table (containing random generated text) with a trgm
index (gin_trgm_ops), I consistently get these
timings:
select txt from azjunk6 where txt ~ '^abcd';
130 ms
select txt from azjunk6
where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
3 msRegex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and
'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommitingof
'__a' and '_ab' and that gives so significant speedup.
[trgm_regex_optimize.1.patch ]
Yes, that fixes the problem, thanks.
Revised version of patch with necessary comments.
------
With best regards,
Alexander Korotkov.
Attachments:
trgm-regex-optimize.2.patchapplication/octet-stream; name=trgm-regex-optimize.2.patchDownload
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
new file mode 100644
index c4dfdf2..f5eea0f
*** a/contrib/pg_trgm/trgm_regexp.c
--- b/contrib/pg_trgm/trgm_regexp.c
***************
*** 122,130 ****
* thousands of trigrams would be slow, and would likely produce so many
* false positives that we would have to traverse a large fraction of the
* index, the graph is simplified further in a lossy fashion by removing
! * color trigrams until the number of trigrams after expansion is below
! * the MAX_TRGM_COUNT threshold. When a color trigram is removed, the states
! * connected by any arcs labelled with that trigram are merged.
*
* 4) Pack the graph into a compact representation
* -----------------------------------------------
--- 122,139 ----
* thousands of trigrams would be slow, and would likely produce so many
* false positives that we would have to traverse a large fraction of the
* index, the graph is simplified further in a lossy fashion by removing
! * color trigrams. Also, trigrams itself have not equivalent value: some of
! * them are more frequent and some of them are less frequent. Wishfully, we
! * would like to know the distribution of trigrams, but we don't. But because
! * of padding we for sure know that empty character is much more frequent than
! * others. We assume it to be in BLANK_COLOR_SIZE more frequent than average
! * frequency of other characters. Assuming that we intruduce "size of color
! * trigram" which is like number of trigrams in color trigram except that
! * empty color counts as BLANK_COLOR_SIZE. While removing color trigrams we're
! * trying to minimize it's summary size. We would like to achieve
! * WISH_TRGM_SIZE and assume fail if we don't reach MAX_TRGM_SIZE threshold.
! * When a color trigram is removed, the states connected by any arcs labelled
! * with that trigram are merged.
*
* 4) Pack the graph into a compact representation
* -----------------------------------------------
***************
*** 198,210 ****
*
* MAX_EXPANDED_STATES - How many states we allow in expanded graph
* MAX_EXPANDED_ARCS - How many arcs we allow in expanded graph
! * MAX_TRGM_COUNT - How many simple trigrams we allow to be extracted
* COLOR_COUNT_LIMIT - Maximum number of characters per color
*/
#define MAX_EXPANDED_STATES 128
#define MAX_EXPANDED_ARCS 1024
! #define MAX_TRGM_COUNT 256
#define COLOR_COUNT_LIMIT 256
/* Struct representing a single pg_wchar, converted back to multibyte form */
typedef struct
--- 207,224 ----
*
* MAX_EXPANDED_STATES - How many states we allow in expanded graph
* MAX_EXPANDED_ARCS - How many arcs we allow in expanded graph
! * MAX_TRGM_SIZE - What largest size of color trigrams we allow
! * WISH_TRGM_SIZE - Desired size of color trigrams
* COLOR_COUNT_LIMIT - Maximum number of characters per color
+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
*/
#define MAX_EXPANDED_STATES 128
#define MAX_EXPANDED_ARCS 1024
! #define MAX_TRGM_SIZE 256
! #define WISH_TRGM_SIZE 16
#define COLOR_COUNT_LIMIT 256
+ #define BLANK_COLOR_SIZE 32
/* Struct representing a single pg_wchar, converted back to multibyte form */
typedef struct
*************** static void fillTrgm(trgm *ptrgm, trgm_m
*** 460,465 ****
--- 474,480 ----
static void mergeStates(TrgmState *state1, TrgmState *state2);
static int colorTrgmInfoCmp(const void *p1, const void *p2);
static int colorTrgmInfoCountCmp(const void *p1, const void *p2);
+ static int getColorTrigramSize(const ColorTrgmInfo *colorTrgm);
static TrgmPackedGraph *packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext);
static int packArcInfoCmp(const void *a1, const void *a2);
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1423,1429 ****
i;
TrgmState *state;
ColorTrgmInfo *colorTrgms;
! int64 totalTrgmCount;
int number;
/* Collect color trigrams from all arcs */
--- 1438,1444 ----
i;
TrgmState *state;
ColorTrgmInfo *colorTrgms;
! int64 totalTrgmSize;
int number;
/* Collect color trigrams from all arcs */
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1489,1495 ****
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.
*/
! totalTrgmCount = 0;
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
--- 1504,1510 ----
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.
*/
! totalTrgmSize = 0;
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1504,1528 ****
count *= trgmNFA->colorInfo[c].wordCharsCount;
}
trgmInfo->count = count;
! totalTrgmCount += count;
}
! /* Sort color trigrams in descending order of simple trigram counts */
qsort(colorTrgms, trgmNFA->colorTrgmsCount, sizeof(ColorTrgmInfo),
colorTrgmInfoCountCmp);
/*
! * Remove color trigrams from the graph so long as total number of simple
! * trigrams exceeds MAX_TRGM_COUNT. We prefer to remove color trigrams
! * with the most associated simple trigrams, since those are the most
! * promising for reducing the total number of simple trigrams. When
! * removing a color trigram we have to merge states connected by arcs
* labeled with that trigram. It's necessary to not merge initial and
* final states, because our graph becomes useless if that happens; so we
* cannot always remove the trigram we'd prefer to.
*/
for (i = 0;
! (i < trgmNFA->colorTrgmsCount) && (totalTrgmCount > MAX_TRGM_COUNT);
i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
--- 1519,1543 ----
count *= trgmNFA->colorInfo[c].wordCharsCount;
}
trgmInfo->count = count;
! totalTrgmSize += getColorTrigramSize(trgmInfo);
}
! /* Sort color trigrams in descending order of their "sizes" */
qsort(colorTrgms, trgmNFA->colorTrgmsCount, sizeof(ColorTrgmInfo),
colorTrgmInfoCountCmp);
/*
! * Remove color trigrams from the graph so long as total size of color
! * trigrams exceeds WISH_TRGM_SIZE. But if we can't reach WISH_TRGM_SIZE
! * it's OK to be in MAX_TRGM_SIZE. We prefer to remove largest color
! * trigrams, since those are the most promising for reducing the total size.
! * When removing a color trigram we have to merge states connected by arcs
* labeled with that trigram. It's necessary to not merge initial and
* final states, because our graph becomes useless if that happens; so we
* cannot always remove the trigram we'd prefer to.
*/
for (i = 0;
! (i < trgmNFA->colorTrgmsCount) && (totalTrgmSize > WISH_TRGM_SIZE);
i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1572,1585 ****
/* Mark trigram unexpanded, and update totalTrgmCount */
trgmInfo->expanded = false;
! totalTrgmCount -= trgmInfo->count;
}
! /* Did we succeed in fitting into MAX_TRGM_COUNT? */
! if (totalTrgmCount > MAX_TRGM_COUNT)
return false;
! trgmNFA->totalTrgmCount = (int) totalTrgmCount;
/*
* Sort color trigrams by colors (will be useful for bsearch in packGraph)
--- 1587,1600 ----
/* Mark trigram unexpanded, and update totalTrgmCount */
trgmInfo->expanded = false;
! totalTrgmSize -= getColorTrigramSize(trgmInfo);
}
! /* Did we succeed in fitting into MAX_TRGM_SIZE? */
! if (totalTrgmSize > MAX_TRGM_SIZE)
return false;
! trgmNFA->totalTrgmCount = (int) totalTrgmSize;
/*
* Sort color trigrams by colors (will be useful for bsearch in packGraph)
*************** colorTrgmInfoCmp(const void *p1, const v
*** 1751,1767 ****
static int
colorTrgmInfoCountCmp(const void *p1, const void *p2)
{
! const ColorTrgmInfo *c1 = (const ColorTrgmInfo *) p1;
! const ColorTrgmInfo *c2 = (const ColorTrgmInfo *) p2;
! if (c1->count < c2->count)
return 1;
! else if (c1->count == c2->count)
return 0;
else
return -1;
}
/*---------------------
* Subroutines for packing the graph into final representation (stage 4).
--- 1766,1799 ----
static int
colorTrgmInfoCountCmp(const void *p1, const void *p2)
{
! int count1 = getColorTrigramSize((const ColorTrgmInfo *) p1);
! int count2 = getColorTrigramSize((const ColorTrgmInfo *) p2);
! if (count1 < count2)
return 1;
! else if (count1 == count2)
return 0;
else
return -1;
}
+ /*
+ * Return size of color trigrams assuming count containing of trigrams
+ * is already calculated.
+ */
+ static int
+ getColorTrigramSize(const ColorTrgmInfo *colorTrgm)
+ {
+ int count = colorTrgm->count, i;
+
+ for (i = 0; i < 3; i++)
+ {
+ if (colorTrgm->ctrgm.colors[i] == COLOR_BLANK)
+ count *= BLANK_COLOR_SIZE;
+ }
+ return count;
+ }
+
/*---------------------
* Subroutines for packing the graph into final representation (stage 4).
Alexander Korotkov <aekorotkov@gmail.com> writes:
Revised version of patch with necessary comments.
I looked at this patch a bit. It seems like this:
+ * BLANK_COLOR_SIZE - How much blank character is more frequent than
+ * other character in average
+ #define BLANK_COLOR_SIZE 32
is a drastic overestimate. Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters. Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...
Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared. Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter. I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.
Also, surely this bit:
! trgmNFA->totalTrgmCount = (int) totalTrgmSize;
is flat out wrong? expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction. The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.
I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.
Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions. Because of the padding
settings, any one word will normally lead to padded strings " a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions. (It's a little less
than that because single-letter words produce only " a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.) So I'm thinking the penalty ought to take that
into account.
I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate. We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.
Another point is that this comment:
* Note: per-color-trigram counts cannot overflow an int so long as
* COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.
is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.
There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it. That's about it for substantive comments though.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
Revised version of patch with necessary comments.
I looked at this patch a bit. It seems like this:
+ * BLANK_COLOR_SIZE - How much blank character is more frequent than + * other character in average + #define BLANK_COLOR_SIZE 32is a drastic overestimate. Using the program Erik provided to generate
some sample data, I find that blanks in trigrams are maybe about three
times more common than other characters. Possibly Erik's data isn't
very representative of real text, but if that's the case we'd better
get a more representative sample case before we start optimizing ...Now maybe the above number is okay for this purpose anyway, but if so
it needs some different justification; maybe call it a "penalty"?
But nonetheless it seems like a darn large penalty, particularly for " x"
type trigrams where the value would effectively be squared. Compared to
the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like
you might as well forget the "sizing" approach and just flat out reject
trigrams containing COLOR_BLANK, because they'll never get through the
size filter.
It seems to be right decision to me when we have other trigrams can reject
them. But there are cases when we can't.
I'm inclined to think you need a larger MAX_TRGM_SIZE;
and WISH_TRGM_SIZE seems mighty small as well, compared to what the
code would have done before.
OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams
Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)
The ratio of this frequencies with blanks to frequency of xxx is:
__x m^2/(n-2)
_xx and xx_ m/(n-2)
In order to estimate n I've analyzed some classics:
Ernest Hemingway "A farewell to arms" - 3.8
Leo Tolstoy "War and Peace" - 5.0
In english with alphabet size = 26 we have
__x m^2/(n-2) = 375
_xx and xx_ m/(n-2) = 14.4
In russian with alphabet size = 33 we have
__x m^2/(n-2) = 363
_xx and xx_ m/(n-2) = 11
Assuming these calculations is approximate, we can safely use same values
for both languages.
Does such reasoning make sense?
Also, surely this bit:
! trgmNFA->totalTrgmCount = (int) totalTrgmSize;
is flat out wrong? expandColorTrigrams() expects that
trgmNFA->totalTrgmCount is the truth, not some penalty-bloated
abstraction. The fact that the patch hasn't failed on you completely
demonstrates that you've not tested any cases where blank-containing
trigrams got through the filter, reinforcing my feeling that it's
probably too strict now.
Oh, I wonder how can I leave such weird bug in patch :-(
I wonder if it would be more appropriate to continue to measure the limit
MAX_TRGM_COUNT in terms of actual trigram counts, and just use the
penalized "size" as the sort key while determining which color trigrams
to discard first.
Agree. But I would like to save some equivalent of WISH_TRGM_SIZE.
Another thought is that it's not clear that you should apply the same
penalty to blanks in all three positions. Because of the padding
settings, any one word will normally lead to padded strings " a",
" ab", "yz ", so that spaces in the first position are about twice
as common as spaces in the other positions. (It's a little less
than that because single-letter words produce only " a" and " a ";
but I'd think that such words aren't very common in text that trigrams
are effective for.) So I'm thinking the penalty ought to take that
into account.I'm also inclined to think that it might be worth adding a size
field to ColorTrgmInfo rather than repeatedly calculating the
size estimate. We only allow a thousand and some ColorTrgmInfos
at most, so the extra space wouldn't be that large, and I'm concerned
by the expense the current patch adds to the sort comparator.
Agree.
Another point is that this comment:
* Note: per-color-trigram counts cannot overflow an int so long as
* COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie
about
* 1290. However, the grand total totalTrgmCount might conceivably
* overflow an int, so we use int64 for that within this routine.is no longer valid, or at least it fails to demonstrate that the result
of getColorTrigramSize() can't overflow an int; indeed I rather fear it
can.
The patch failed to even update the variable name in this comment, let
alone address the substantive question.There are some other cosmetic things I didn't like, for instance
colorTrgmInfoCountCmp() is no longer comparing counts but you didn't
rename it. That's about it for substantive comments though.
Thanks, will be fixed.
------
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes:
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked at this patch a bit. It seems like this: + * BLANK_COLOR_SIZE - How much blank character is more frequent than + * other character in average + #define BLANK_COLOR_SIZE 32 is a drastic overestimate.
OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigrams
Assume alphabet of size m those trigrams have following average frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)
Hmm, but you're assuming that the m letters are all equally frequent,
which is surely not true in normal text.
I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that. I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one (' t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form ' x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.
So, we've rediscovered the fact that 'the' is the most common word in
English text ;-). But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.
So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams). Even using your numbers,
it shouldn't be 32.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked at this patch a bit. It seems like this:
+ * BLANK_COLOR_SIZE - How much blank character is more frequentthan
+ * other character in average + #define BLANK_COLOR_SIZE 32 is a drastic overestimate.OK, I would like to make more reasoning for penalty.
Let's consider word of length n.
It contains n+1 trigrams including:
1 __x trigram
1 _xx trigram
1 xx_ trigram
n - 2 xxx trigramsAssume alphabet of size m those trigrams have following average
frequencies:
__x 1/((n+1)*m)
_xx 1/((n+1)*m^2)
xx_ 1/((n+1)*m^2)
xxx (n-2)/((n+1)*m^3)Hmm, but you're assuming that the m letters are all equally frequent,
which is surely not true in normal text.I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
but I do have a text file with the collected works of H.P. Lovecraft,
so I tried analyzing the trigrams in that. I find that
* The average word length is 4.78 characters.
* There are 5652 distinct trigrams (discounting some containing digits),
the most common one (' t') occurring 81222 times; the average
occurrence count is 500.0.
* Considering only trigrams not containing any blanks, there are
4937 of them, the most common one ('the') occurring 45832 times,
with an average count of 266.9.
* There are (unsurprisingly) exactly 26 trigrams of the form ' x',
with an average count of 19598.5.
* There are 689 trigrams of the form ' xx' or 'xx ', the most common one
(' th') occurring 58200 times, with an average count of 1450.1.So, we've rediscovered the fact that 'the' is the most common word in
English text ;-). But I think the significant conclusion for our purposes
here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
times more common than the average space-free trigram, and two-space
trigrams about 19598.5/266.9 = 73.4 times more common.So this leads me to the conclusion that we should be using a
BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
a square-law rule for two-space trigrams). Even using your numbers,
it shouldn't be 32.
Next revision of patch is attached. Changes are so:
1) Notion "penalty" is used instead of "size".
2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
MAX_TRGM_COUNT total trigrams count.
3) Penalties are assigned to particular color trigram classes. I.e.
separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
trigram frequencies in Oscar Wilde writings. We can end up with different
numbers, but I don't think they will be dramatically different.
------
With best regards,
Alexander Korotkov.
Attachments:
trgm-regex-optimize.3.patchapplication/octet-stream; name=trgm-regex-optimize.3.patchDownload
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
new file mode 100644
index c4dfdf2..72949e4
*** a/contrib/pg_trgm/trgm_regexp.c
--- b/contrib/pg_trgm/trgm_regexp.c
***************
*** 122,130 ****
* thousands of trigrams would be slow, and would likely produce so many
* false positives that we would have to traverse a large fraction of the
* index, the graph is simplified further in a lossy fashion by removing
! * color trigrams until the number of trigrams after expansion is below
! * the MAX_TRGM_COUNT threshold. When a color trigram is removed, the states
! * connected by any arcs labelled with that trigram are merged.
*
* 4) Pack the graph into a compact representation
* -----------------------------------------------
--- 122,142 ----
* thousands of trigrams would be slow, and would likely produce so many
* false positives that we would have to traverse a large fraction of the
* index, the graph is simplified further in a lossy fashion by removing
! * color trigrams. Also, trigrams itself have not equivalent value: some of
! * them are more frequent and some of them are less frequent. Wishfully, we
! * would like to know the distribution of trigrams, but we don't. But because
! * of padding we for sure know that empty character is more frequent than
! * others. Assuming that we divide trigrams into classes according to presence
! * of whitespace. Their average frequencies relative to average frequency
! * of trigrams without whitespace are calculated by real-life texts, called
! * classes penalties and stored into "penalties" array. Then we introduce
! * penalty of color trigram which is multiplication of number of its trigrams
! * count and its class penalty. While removing color trigrams we're trying to
! * minimize it's summary penalty, but the restriction is total number of
! * trigrams. We would like to achieve WISH_TRGM_PENALTY summary penalty and
! * assume fail if we don't reach MAX_TRGM_COUNT threshold for total number of
! * trigrams. When a color trigram is removed, the states connected by any arcs
! * labelled with that trigram are merged.
*
* 4) Pack the graph into a compact representation
* -----------------------------------------------
***************
*** 199,211 ****
--- 211,240 ----
* MAX_EXPANDED_STATES - How many states we allow in expanded graph
* MAX_EXPANDED_ARCS - How many arcs we allow in expanded graph
* MAX_TRGM_COUNT - How many simple trigrams we allow to be extracted
+ * WISH_TRGM_PENALTY - Desired sum of color trigrams penalties
* COLOR_COUNT_LIMIT - Maximum number of characters per color
*/
#define MAX_EXPANDED_STATES 128
#define MAX_EXPANDED_ARCS 1024
#define MAX_TRGM_COUNT 256
+ #define WISH_TRGM_PENALTY 16
#define COLOR_COUNT_LIMIT 256
+ /*
+ * Penalty of trigrams depending on whitespace presence. Based on analysis of
+ * real-life texts.
+ */
+ const float4 penalties[8] = {
+ 1.0, /* "aaa" */
+ 3.5, /* "aa " */
+ 0.0, /* "a a" (impossible) */
+ 0.0, /* "a " (impossible) */
+ 4.2, /* " aa" */
+ 2.1, /* " a " */
+ 25.0, /* " a" */
+ 1.0 /* " " (impossible) */
+ };
+
/* Struct representing a single pg_wchar, converted back to multibyte form */
typedef struct
{
*************** typedef struct
*** 339,344 ****
--- 368,374 ----
ColorTrgm ctrgm;
int number;
int count;
+ float4 penalty;
bool expanded;
List *arcs;
} ColorTrgmInfo;
*************** static TRGM *expandColorTrigrams(TrgmNFA
*** 459,465 ****
static void fillTrgm(trgm *ptrgm, trgm_mb_char s[3]);
static void mergeStates(TrgmState *state1, TrgmState *state2);
static int colorTrgmInfoCmp(const void *p1, const void *p2);
! static int colorTrgmInfoCountCmp(const void *p1, const void *p2);
static TrgmPackedGraph *packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext);
static int packArcInfoCmp(const void *a1, const void *a2);
--- 489,495 ----
static void fillTrgm(trgm *ptrgm, trgm_mb_char s[3]);
static void mergeStates(TrgmState *state1, TrgmState *state2);
static int colorTrgmInfoCmp(const void *p1, const void *p2);
! static int colorTrgmInfoPenaltyCmp(const void *p1, const void *p2);
static TrgmPackedGraph *packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext);
static int packArcInfoCmp(const void *a1, const void *a2);
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1424,1429 ****
--- 1454,1460 ----
TrgmState *state;
ColorTrgmInfo *colorTrgms;
int64 totalTrgmCount;
+ float4 totalTrgmPenalty;
int number;
/* Collect color trigrams from all arcs */
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1490,1528 ****
* overflow an int, so we use int64 for that within this routine.
*/
totalTrgmCount = 0;
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
int j,
! count = 1;
for (j = 0; j < 3; j++)
{
TrgmColor c = trgmInfo->ctrgm.colors[j];
! if (c != COLOR_BLANK)
count *= trgmNFA->colorInfo[c].wordCharsCount;
}
trgmInfo->count = count;
totalTrgmCount += count;
}
! /* Sort color trigrams in descending order of simple trigram counts */
qsort(colorTrgms, trgmNFA->colorTrgmsCount, sizeof(ColorTrgmInfo),
! colorTrgmInfoCountCmp);
/*
! * Remove color trigrams from the graph so long as total number of simple
! * trigrams exceeds MAX_TRGM_COUNT. We prefer to remove color trigrams
! * with the most associated simple trigrams, since those are the most
! * promising for reducing the total number of simple trigrams. When
! * removing a color trigram we have to merge states connected by arcs
! * labeled with that trigram. It's necessary to not merge initial and
! * final states, because our graph becomes useless if that happens; so we
! * cannot always remove the trigram we'd prefer to.
*/
for (i = 0;
! (i < trgmNFA->colorTrgmsCount) && (totalTrgmCount > MAX_TRGM_COUNT);
i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
--- 1521,1568 ----
* overflow an int, so we use int64 for that within this routine.
*/
totalTrgmCount = 0;
+ totalTrgmPenalty = 0.0f;
for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
int j,
! count = 1,
! typeIndex = 0;
for (j = 0; j < 3; j++)
{
TrgmColor c = trgmInfo->ctrgm.colors[j];
! typeIndex *= 2;
! if (c == COLOR_BLANK)
! typeIndex++;
! else
count *= trgmNFA->colorInfo[c].wordCharsCount;
}
trgmInfo->count = count;
+ trgmInfo->penalty = (float4)count * penalties[typeIndex];
totalTrgmCount += count;
+ totalTrgmPenalty += trgmInfo->penalty;
+
}
! /* Sort color trigrams in descending order of their penalties */
qsort(colorTrgms, trgmNFA->colorTrgmsCount, sizeof(ColorTrgmInfo),
! colorTrgmInfoPenaltyCmp);
/*
! * Remove color trigrams from the graph so long as total penalty of color
! * trigrams exceeds WISH_TRGM_PENALTY. But if we can't reach
! * WISH_TRGM_PENALTY it's OK to be in MAX_TRGM_COUNT. We prefer to remove
! * color trigrams with higher penalty, since those are the most promising
! * for reducing the total penalty. When removing a color trigram we have
! * to merge states connected by arcs labeled with that trigram. It's
! * necessary to not merge initial and final states, because our graph
! * becomes useless if that happens; so we cannot always remove the trigram
! * we'd prefer to.
*/
for (i = 0;
! (i < trgmNFA->colorTrgmsCount) && (totalTrgmPenalty > WISH_TRGM_PENALTY);
i++)
{
ColorTrgmInfo *trgmInfo = &colorTrgms[i];
*************** selectColorTrigrams(TrgmNFA *trgmNFA)
*** 1572,1585 ****
/* Mark trigram unexpanded, and update totalTrgmCount */
trgmInfo->expanded = false;
totalTrgmCount -= trgmInfo->count;
}
! /* Did we succeed in fitting into MAX_TRGM_COUNT? */
if (totalTrgmCount > MAX_TRGM_COUNT)
return false;
! trgmNFA->totalTrgmCount = (int) totalTrgmCount;
/*
* Sort color trigrams by colors (will be useful for bsearch in packGraph)
--- 1612,1626 ----
/* Mark trigram unexpanded, and update totalTrgmCount */
trgmInfo->expanded = false;
+ totalTrgmPenalty -= trgmInfo->penalty;
totalTrgmCount -= trgmInfo->count;
}
! /* Did we succeed in fitting into MAX_TRGM_SIZE? */
if (totalTrgmCount > MAX_TRGM_COUNT)
return false;
! trgmNFA->totalTrgmCount = totalTrgmCount;
/*
* Sort color trigrams by colors (will be useful for bsearch in packGraph)
*************** colorTrgmInfoCmp(const void *p1, const v
*** 1749,1768 ****
* their simple trigrams counts.
*/
static int
! colorTrgmInfoCountCmp(const void *p1, const void *p2)
{
! const ColorTrgmInfo *c1 = (const ColorTrgmInfo *) p1;
! const ColorTrgmInfo *c2 = (const ColorTrgmInfo *) p2;
! if (c1->count < c2->count)
return 1;
! else if (c1->count == c2->count)
return 0;
else
return -1;
}
-
/*---------------------
* Subroutines for packing the graph into final representation (stage 4).
*---------------------
--- 1790,1808 ----
* their simple trigrams counts.
*/
static int
! colorTrgmInfoPenaltyCmp(const void *p1, const void *p2)
{
! float4 penalty1 = ((const ColorTrgmInfo *) p1)->penalty;
! float4 penalty2 = ((const ColorTrgmInfo *) p2)->penalty;
! if (penalty1 < penalty2)
return 1;
! else if (penalty1 == penalty2)
return 0;
else
return -1;
}
/*---------------------
* Subroutines for packing the graph into final representation (stage 4).
*---------------------
I went back and tried Erik's original test
(/messages/by-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl).
With a fresh checkout from master, the difference between the slow and
fast queries is much less dramatic than Erik reported. The reason is
that Alexander's GIN "fast scan" patch is very effective on that query.
Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On
my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms,
and with fast scan disabled (by modifying the source code), 40ms vs 1ms.
So thanks to the fast scan patch, I don't think this patch is worth
pursuing anymore. Unless there are some other test case where this patch
helps, but the fast scan patch doesn't.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
I went back and tried Erik's original test
(/messages/by-id/dafad644f268ce1503e1b8b682aae38a.squirrel@webmail.xs4all.nl).
With a fresh checkout from master, the difference between the slow and
fast queries is much less dramatic than Erik reported. The reason is
that Alexander's GIN "fast scan" patch is very effective on that query.
Erik reported that the '^abcd' query took 140ms, vs 5ms for 'abcd'. On
my laptop, the numbers with a fresh checkout are about 2.5 ms vs. 1 ms,
and with fast scan disabled (by modifying the source code), 40ms vs 1ms.So thanks to the fast scan patch, I don't think this patch is worth
pursuing anymore. Unless there are some other test case where this patch
helps, but the fast scan patch doesn't.
for the same 2 statements of my original test:
explain (analyze,buffers) select txt from azjunk6 where txt ~ '^abcd'; -- slow (140 ms)
explain (analyze,buffers) select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
You mention (from HEAD, I suppose?) a difference of 2.5 ms vs. 1 ms.
FWIW, for me the difference (from HEAD) remains quite a bit larger:
for n in `seq 1 10`; do ./trgm_peculiarity.sh ; done | grep runtime
Total runtime: 16.167 ms
Total runtime: 2.188 ms
Total runtime: 16.902 ms
Total runtime: 2.203 ms
Total runtime: 17.486 ms
Total runtime: 2.201 ms
Total runtime: 17.663 ms
Total runtime: 2.441 ms
Total runtime: 13.555 ms
Total runtime: 2.204 ms
Total runtime: 16.862 ms
Total runtime: 2.225 ms
Total runtime: 13.207 ms
Total runtime: 2.550 ms
Total runtime: 16.768 ms
Total runtime: 2.172 ms
Total runtime: 19.259 ms
Total runtime: 2.180 ms
Total runtime: 12.934 ms
Total runtime: 2.198 ms
That's a lot better than the original 140ms vs 5ms but your laptop's 2.5 ms vs. 1 ms is perhaps not representative.
(for the full plans see below)
Erik Rijkers
----------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=56.77..432.93 rows=100 width=81) (actual time=15.898..15.925 rows=2 loops=1)
Recheck Cond: (txt ~ '^abcd'::text)
Rows Removed by Index Recheck: 11
Heap Blocks: exact=13
Buffers: shared hit=105
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..56.75 rows=100 width=0) (actual time=15.834..15.834 rows=13
loops=1)
Index Cond: (txt ~ '^abcd'::text)
Buffers: shared hit=92
Planning time: 3.304 ms
Total runtime: 16.179 ms
(10 rows)
Time: 21.103 ms
explain (analyze,buffers) select txt from azjunk6 where txt ~ 'abcd' and substr(txt,1,4) = 'abcd'; -- fast (5 ms)
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk6 (cost=28.75..405.40 rows=1 width=81) (actual time=1.681..2.164 rows=2 loops=1)
Recheck Cond: (txt ~ 'abcd'::text)
Rows Removed by Index Recheck: 11
Filter: (substr(txt, 1, 4) = 'abcd'::text)
Rows Removed by Filter: 101
Heap Blocks: exact=113
Buffers: shared hit=120
-> Bitmap Index Scan on azjunk6_trgm_re_idx (cost=0.00..28.75 rows=100 width=0) (actual time=1.171..1.171 rows=114
loops=1)
Index Cond: (txt ~ 'abcd'::text)
Buffers: shared hit=7
Planning time: 0.516 ms
Total runtime: 2.183 ms
(12 rows)
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
"Erik Rijkers" <er@xs4all.nl> writes:
On Fri, March 28, 2014 09:31, Heikki Linnakangas wrote:
So thanks to the fast scan patch, I don't think this patch is worth
pursuing anymore. Unless there are some other test case where this patch
helps, but the fast scan patch doesn't.
FWIW, for me the difference (from HEAD) remains quite a bit larger:
...
That's a lot better than the original 140ms vs 5ms but your laptop's 2.5 ms vs. 1 ms is perhaps not representative.
I also feel that Alexander's patch is still worth pursuing. What I see
(in an assert-disabled build of HEAD) is that Erik's "slow" query takes
about 2 ms to plan and 5.5 ms to execute, while the "fast" query is about
0.7 ms to plan and 1.3 ms to execute. With the patch, the "slow" query is
still about 2 ms to plan but only 0.3 ms to execute, while the "fast"
query doesn't change much. There's a lot of noise in these numbers
(successive executions seem to be bouncing around more than usual today),
but still it looks like the runtime gain is impressive percentagewise.
One thing that's interesting is that the planning time is so much more for
the "slow" query, even though the "fast" one has an additional WHERE
clause that the planner has to think about. I haven't tried to do perf
measurements to be sure, but my guess is that this has nothing to do with
GIN or pg_trgm directly, but represents the planner's efforts to get a
selectivity estimate for the ~ operator --- that code works much
differently for patterns that define fixed prefixes vs those that don't.
Anyway, the important point here is that I think the planning time
is helping to mask the fact that there's a pretty useful runtime
improvement.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Alexander Korotkov <aekorotkov@gmail.com> writes:
Next revision of patch is attached. Changes are so:
1) Notion "penalty" is used instead of "size".
2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
MAX_TRGM_COUNT total trigrams count.
3) Penalties are assigned to particular color trigram classes. I.e.
separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
trigram frequencies in Oscar Wilde writings. We can end up with different
numbers, but I don't think they will be dramatically different.
Committed with cosmetic improvements (adjusting the comments mostly).
The new whitespace penalties look reasonably sane to me. I wonder though
if WISH_TRGM_PENALTY is too small --- it seems like this code will tend to
select many fewer trigrams than the old code did. What testing did you do
that led you to select the specific value of 16?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers