WIP: index support for regexp search
Hackers,
WIP patch with index support for regexp search for pg_trgm contrib is
attached.
In spite of techniques which extracts continuous text parts from regexp,
this patch presents technique of automatum transformation. That allows more
comprehensive trigrams extraction.
A little example of possible perfomance benefit.
test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
QUERY PLAN
----------------------------------------------------------------------------------------------------
---
Seq Scan on words (cost=0.00..1703.11 rows=10 width=9) (actual
time=3.092..242.303 rows=662 loops=1)
Filter: (s ~ 'a[bc]+[de]'::text)
Rows Removed by Filter: 97907
Total runtime: 243.213 ms
(4 rows)
test=# explain analyze select * from words where s ~ 'a[bc]+[de]';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on words (cost=260.08..295.83 rows=10 width=9) (actual
time=4.166..7.506 rows=662 loops=1)
Recheck Cond: (s ~ 'a[bc]+[de]'::text)
Rows Removed by Index Recheck: 18
-> Bitmap Index Scan on words_trgm_idx (cost=0.00..260.07 rows=10
width=0) (actual time=4.076..4.076 rows=680 loops=1)
Index Cond: (s ~ 'a[bc]+[de]'::text)
Total runtime: 8.424 ms
(6 rows)
Current version of patch have some limitations:
1) Algorithm of logical expression extraction on trigrams have high
computational complexity. So, it can become really slow on regexp with many
branches. Probably, improvements of this algorithm is possible.
2) Surely, no perfomance benefit if no trigrams can be extracted from
regexp. It's inevitably.
3) Currently, only GIN index is supported. There are no serious problems,
GiST code for it just not written yet.
4) It appear to be some kind of problem to extract multibyte encoded
character from pg_wchar. I've posted question about it here:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
While I've hardcoded some dirty solution. So
PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
supported yet.
------
With best regards,
Alexander Korotkov.
Attachments:
On Tue, Nov 22, 2011 at 2:38 PM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
WIP patch with index support for regexp search for pg_trgm contrib is
attached.
In spite of techniques which extracts continuous text parts from regexp,
this patch presents technique of automatum transformation. That allows more
comprehensive trigrams extraction.
Please add this patch here so it does not get lost in the shuffle:
https://commitfest.postgresql.org/action/commitfest_view/open
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Dec 1, 2011 at 12:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Please add this patch here so it does not get lost in the shuffle:
https://commitfest.postgresql.org/action/commitfest_view/open
Done.
------
With best regards,
Alexander Korotkov.
On 22.11.2011 21:38, Alexander Korotkov wrote:
WIP patch with index support for regexp search for pg_trgm contrib is
attached.
In spite of techniques which extracts continuous text parts from regexp,
this patch presents technique of automatum transformation. That allows more
comprehensive trigrams extraction.
Nice!
Current version of patch have some limitations:
1) Algorithm of logical expression extraction on trigrams have high
computational complexity. So, it can become really slow on regexp with many
branches. Probably, improvements of this algorithm is possible.
2) Surely, no perfomance benefit if no trigrams can be extracted from
regexp. It's inevitably.
3) Currently, only GIN index is supported. There are no serious problems,
GiST code for it just not written yet.
4) It appear to be some kind of problem to extract multibyte encoded
character from pg_wchar. I've posted question about it here:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
While I've hardcoded some dirty solution. So
PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
supported yet.
This is pretty far from being in committable state, so I'm going to mark
this as "returned with feedback" in the commitfest app. The feedback:
The code badly needs comments. There is no explanation of how the
trigram extraction code in trgm_regexp.c works. Guessing from the
variable names, it seems to be some sort of a coloring algorithm that
works on a graph, but that all needs to be explained. Can this algorithm
be found somewhere in literature, perhaps? A link to a paper would be nice.
Apart from that, the multibyte issue seems like the big one. Any way
around that?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:
The code badly needs comments. There is no explanation of how the trigram
extraction code in trgm_regexp.c works.
Sure. I hoped to find a time for comments before commitfest starts.
Unfortunately I didn't, sorry.
Guessing from the variable names, it seems to be some sort of a coloring
algorithm that works on a graph, but that all needs to be explained. Can
this algorithm be found somewhere in literature, perhaps? A link to a paper
would be nice.
I hope it's truly novel. At least application to regular expressions. I'm
going to write a paper about it.
Apart from that, the multibyte issue seems like the big one. Any way
around that?
Conversion of pg_wchar to multibyte character is the only way I found to
avoid serious hacking of existing regexp code. Do you think additional
function in pg_wchar_tbl which converts pg_wchar back to multibyte
character is possible solution?
------
With best regards,
Alexander Korotkov.
I also have a question about pg_wchar.
/*
*-------------------------------------------------------------------
* encoding info table
* XXX must be sorted by the same order as enum pg_enc (in mb/pg_wchar.h)
*-------------------------------------------------------------------
*/
pg_wchar_tbl pg_wchar_table[] = {
{pg_ascii2wchar_with_len, pg_ascii_mblen, pg_ascii_dsplen,
pg_ascii_verifier, 1}, /* PG_SQL_ASCII */
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JP */
{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen,
pg_euccn_verifier, 2}, /* PG_EUC_CN */
{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen,
pg_euckr_verifier, 3}, /* PG_EUC_KR */
{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen,
pg_euctw_verifier, 4}, /* PG_EUC_TW */
{pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen,
pg_eucjp_verifier, 3}, /* PG_EUC_JIS_2004 */
{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /*
PG_UTF8 */
{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier,
4}, /* PG_MULE_INTERNAL */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN1 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN2 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN3 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN4 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN5 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN6 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN7 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN8 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN9 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_LATIN10 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1256 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1258 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN866 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN874 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8R */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1251 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1252 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-5 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-6 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-7 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* ISO-8859-8 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1250 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1253 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1254 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1255 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_WIN1257 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen,
pg_latin1_verifier, 1}, /* PG_KOI8U */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* PG_SJIS */
{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* PG_BIG5 */
{0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* PG_GBK */
{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* PG_UHC */
{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /*
PG_GB18030 */
{0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* PG_JOHAB */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /*
PG_SHIFT_JIS_2004 */
};
What does last 7 zeros in the first column means? No conversion to pg_wchar
is possible from these encodings?
------
With best regards,
Alexander Korotkov.
On Fri, Jan 20, 2012 at 1:07 AM, Alexander Korotkov <aekorotkov@gmail.com>wrote:
What does last 7 zeros in the first column means? No conversion to
pg_wchar is possible from these encodings?
Uh, I see. These encodings is not supported as server encodings.
------
With best regards,
Alexander Korotkov.
On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
On 22.11.2011 21:38, Alexander Korotkov wrote:
WIP patch with index support for regexp search for pg_trgm contrib is
attached.
In spite of techniques which extracts continuous text parts from regexp,
this patch presents technique of automatum transformation. That allows more
comprehensive trigrams extraction.Nice!
Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I found.
The patch yields spectacular speedups with small, simple-enough regexen. But it does not do a
good enough job when guessing where to use the index and where fall back to Seq Scan. This can
lead to (also spectacular) slow-downs, compared to Seq Scan.
I used the following to generate 3 test tables with lines of 80 random chars (just in case it's
handy for others to use):
$ cat create_data.sh
#!/bin/sh
for power in 4 5 6
do
table=azjunk${power}
index=${table}_trgmrgx_txt_01_idx
echo "-- generating table $table with index $index";
time perl -E'
sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z"))
for 1 .. 1e'"${power};" \
| psql -aqXc "drop table if exists $table;
create table $table(txt text); copy $table from stdin;
-- set session maintenance_work_mem='20GB';
create index $index on $table using gin(txt gin_trgm_ops);
analyze $table;";
done
\dt+ public.azjunk*
List of relations
Schema | Name | Type | Owner | Size | Description
--------+---------+-------+-----------+---------+-------------
public | azjunk4 | table | breinbaas | 1152 kB |
public | azjunk5 | table | breinbaas | 11 MB |
public | azjunk6 | table | breinbaas | 112 MB |
(3 rows)
I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your patch), is that
true? I can understand you want that value to be low to limit the above risk, but now it reduces
the usability of the feature a bit: one has to split up larger char-classes into several smaller
ones to make a statement use the index: i.e.:
txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of
txt ~ 'f[aeiouy]n'
I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and sure enough they
used the index for larger classes such as the above, but of course also got into problems easier
when quantifiers are added (*, +, {n,m}).
A better calculation to decide index-use seems necessary, and ideally one that allows for a larger
MAX_COLOR_CHARS than 4. Especially quantifiers could perhaps be inspected wrt that decision.
IMHO, the functionality would still be very useful when only very simple regexen were considered.
Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it. I don't know how hard that would be.
There is also a minor bug, I think, when running with 'set enable_seqscan=off' in combination
with a too-large regex:
$ cat fail.sql:
set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- OK
set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n'; -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n'; -- crashes
$ psql -f fail.sql
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=1.011..5.291
rows=131 loops=1)
Recheck Cond: (txt ~ 'f[aeio]n'::text)
-> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual
time=0.880..0.880 rows=131 loops=1)
Index Cond: (txt ~ 'f[aeio]n'::text)
Total runtime: 5.700 ms
(5 rows)
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Seq Scan on azjunk4 (cost=0.00..268.00 rows=1 width=81) (actual time=1.491..36.049 rows=164
loops=1)
Filter: (txt ~ 'f[aeiou]n'::text)
Rows Removed by Filter: 9836
Total runtime: 36.112 ms
(4 rows)
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on azjunk4 (cost=52.01..56.02 rows=1 width=81) (actual time=0.346..0.927
rows=131 loops=1)
Recheck Cond: (txt ~ 'f[aeio]n'::text)
-> Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx (cost=0.00..52.01 rows=1 width=0) (actual
time=0.316..0.316 rows=131 loops=1)
Index Cond: (txt ~ 'f[aeio]n'::text)
Total runtime: 0.996 ms
(5 rows)
psql:fail.sql:24: connection to server was lost
Sorry: I found the code, esp. the regex engine hard to understand sofar, so all the above are
somewhat inconclusive remarks; I hope nonetheless they are useful.
Thanks,
Erik Rijkers
On Fri, Jan 20, 2012 at 01:33, Erik Rijkers <er@xs4all.nl> wrote:
Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to compile a test
version that allows it.
I believe being interruptible is a requirement for the patch to be accepted.
CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.
Regards,
Marti
Hi!
Thank you for your feedback!
On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers <er@xs4all.nl> wrote:
The patch yields spectacular speedups with small, simple-enough regexen.
But it does not do a
good enough job when guessing where to use the index and where fall back
to Seq Scan. This can
lead to (also spectacular) slow-downs, compared to Seq Scan.
Could you give some examples of regexes where index scan becomes slower
than seq scan?
I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in
your patch), is that
true? I can understand you want that value to be low to limit the above
risk, but now it reduces
the usability of the feature a bit: one has to split up larger
char-classes into several smaller
ones to make a statement use the index: i.e.:
Yes, MAX_COLOR_CHARS is number of maximum character in automata color when
that color is divided to a separated characters. And it's likely there
could be better solution than just have this hard limit.
Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I
suppose this is
normally necessary for perfomance reasons, but it would be useful te be
able to compile a test
version that allows it. I don't know how hard that would be.
I seems that Ctrl-C was impossible because procedure of trigrams
exctraction becomes so long while it is not breakable. It's not difficult
to make this procedure breakable, but actually it just shouldn't take so
long.
There is also a minor bug, I think, when running with 'set
enable_seqscan=off' in combination
with a too-large regex:
Thanks for pointing. Will be fixed.
------
With best regards,
Alexander Korotkov.
On Fri, Jan 20, 2012 at 8:45 PM, Marti Raudsepp <marti@juffo.org> wrote:
On Fri, Jan 20, 2012 at 01:33, Erik Rijkers <er@xs4all.nl> wrote:
Btw, it seems impossible to Ctrl-C out of a search once it is submitted;
I suppose this is
normally necessary for perfomance reasons, but it would be useful te be
able to compile a test
version that allows it.
I believe being interruptible is a requirement for the patch to be
accepted.CHECK_FOR_INTERRUPTS(); should be added to the indeterminate loops.
Sure. It's easy to fix. But it seems that in this case gin extract_query
method becomes slow (because index scan itself is breakable). So, it just
shouldn't work so long.
------
With best regards,
Alexander Korotkov.
On Fri, Jan 20, 2012 at 12:54 AM, Alexander Korotkov
<aekorotkov@gmail.com>wrote:
On Fri, Jan 20, 2012 at 12:30 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:Apart from that, the multibyte issue seems like the big one. Any way
around that?
Conversion of pg_wchar to multibyte character is the only way I found to
avoid serious hacking of existing regexp code. Do you think additional
function in pg_wchar_tbl which converts pg_wchar back to multibyte
character is possible solution?
Do you have any notes on it? I could make the patch which adds such
function into core.
------
With best regards,
Alexander Korotkov.
On Sat, January 21, 2012 06:26, Alexander Korotkov wrote:
Hi!
Thank you for your feedback!
On Fri, Jan 20, 2012 at 3:33 AM, Erik Rijkers <er@xs4all.nl> wrote:
The patch yields spectacular speedups with small, simple-enough regexen.
But it does not do a
good enough job when guessing where to use the index and where fall back
to Seq Scan. This can
lead to (also spectacular) slow-downs, compared to Seq Scan.Could you give some examples of regexes where index scan becomes slower
than seq scan?
x[aeio]+q takes many minutes, uninterruptible. It's now running for almost 30 minutes, I'm sure
it will come back eventually, but I think I'll kill it later on; I suppose you get the point ;-)
Even with {n,m} quantifiers it's easy to hit slowdowns:
(table azjunk6 is 112 MB, the index 693 MB.)
(MAX_COLOR_CHARS=4 <= your original patch)
instance table regex plan time
HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3566.088 ms
HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3540.606 ms
HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3495.034 ms
HEAD azjunk6 x[aeio]{1,3}q Seq Scan 3510.403 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3724.131 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3844.999 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3835.190 ms
trgm_regex azjunk6 x[aeio]{1,3}q Bitmap Heap Scan 3724.016 ms
HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3617.997 ms
HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3644.215 ms
HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3636.976 ms
HEAD azjunk6 x[aeio]{1,4}q Seq Scan 3625.493 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7885.247 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 8799.082 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7754.152 ms
trgm_regex azjunk6 x[aeio]{1,4}q Bitmap Heap Scan 7721.332 ms
This is with your patch as is; in instances compiled with higher MAX_COLOR_CHARS (I did 6 and 9),
it is of course even easier to dream up a slow regex...
Erik Rijkers
Hi!
New version of patch is attached. Changes are following:
1) Right way to convert from pg_wchar to multibyte.
2) Optimization of producing CFNA-like graph on trigrams (produce smaller,
but equivalent, graphs in less time).
3) Comments and refactoring.
------
With best regards,
Alexander Korotkov.
Attachments:
On Mon, November 19, 2012 22:58, Alexander Korotkov wrote:
New version of patch is attached.
Hi Alexander,
I get some compile-errors:
(Centos 6.3, Linux 2.6.32-279.14.1.el6.x86_64 GNU/Linux, gcc (GCC) 4.7.2)
make contrib
trgm_regexp.c:73:2: error: unknown type name TrgmStateKey
make[1]: *** [trgm_regexp.o] Error 1
make: *** [all-pg_trgm-recurse] Error 2
trgm_regexp.c:73:2: error: unknown type name TrgmStateKey
make[1]: *** [trgm_regexp.o] Error 1
make: *** [install-pg_trgm-recurse] Error 2
Did I forget something?
Thanks,
Erik Rijkers
On 19.11.2012 22:58, Alexander Korotkov wrote:
Hi!
New version of patch is attached. Changes are following:
1) Right way to convert from pg_wchar to multibyte.
2) Optimization of producing CFNA-like graph on trigrams (produce
smaller, but equivalent, graphs in less time).
3) Comments and refactoring.
Hi,
thanks for the updated message-id. I've done the initial review:
1) Patch applies fine to the master.
2) It's common to use upper-case names for macros, but trgm.h defines
macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
could make it a bit more correct?
3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
other in trgm.h - maybe it'd be a good idea to join them?
4) The two new method prototypes at the end of trgm.h use different
indendation than the rest (spaces only instead of tabs).
5) There are no regression tests / updated docs (yet).
6) It does not compile - I do get a bunch of errors like this
trgm_regexp.c:73:2: error: expected specifier-qualifier-list before
‘TrgmStateKey’
trgm_regexp.c: In function ‘addKeys’:
trgm_regexp.c:291:24: error: ‘TrgmState’ has no member named ‘keys’
trgm_regexp.c:304:10: error: ‘TrgmState’ has no member named ‘keys’
...
It seems this is cause by the order of typedefs in trgm_regexp.c, namely
TrgmState referencing TrgmStateKey before it's defined. Moving the
TrgmStateKey before TrgmState fixed the issue (I'm using gcc-4.5.4 but
I think it's not compiler-dependent.)
7) Once fixed, it seems to work
CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';
QUERY PLAN
---------------------------------------------------------------------
Bitmap Heap Scan on test (cost=16.77..385.16 rows=100 width=33)
Recheck Cond: (val ~ '.*qqq.*'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..16.75 rows=100
width=0)
Index Cond: (val ~ '.*qqq.*'::text)
(4 rows)
but I do get a bunch of NOTICE messages with debugging info (no matter
if the GIN index is used or not, so it's somewhere in the common regexp
code). But I guess that's due to WIP status.
regards
Tomas
Some quick comments.
On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
6) It does not compile - I do get a bunch of errors like this
Fixed.
7) Once fixed, it seems to work
CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;EXPLAIN SELECT * FROM test WHERE val ~ '.*qqq.*';
QUERY PLAN
---------------------------------------------------------------------
Bitmap Heap Scan on test (cost=16.77..385.16 rows=100 width=33)
Recheck Cond: (val ~ '.*qqq.*'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..16.75 rows=100
width=0)
Index Cond: (val ~ '.*qqq.*'::text)
(4 rows)but I do get a bunch of NOTICE messages with debugging info (no matter
if the GIN index is used or not, so it's somewhere in the common regexp
code). But I guess that's due to WIP status.
It's due to TRGM_REGEXP_DEBUG macro. I disabled it by default. But I think
pieces of code hidden by that macro could be useful for debug even after
WIP status.
------
With best regards,
Alexander Korotkov.
Attachments:
Glad to see this patch hasn't been totally forgotten. Being able to use
indexes for regular expressions would be really cool!
Back in January, I asked for some high-level description of how the
algorithm works
(http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com).
That's still sorely needed. Googling around, I found the slides for your
presentation on this from PGConf.EU - it would be great to have the
information from that presentation included in the patch.
- Heikki
Hello
I tested it now and it working very well!
tested utf8, case sensitive, case insensitive
Regards
Pavel Stehule
2012/11/20 Heikki Linnakangas <hlinnakangas@vmware.com>:
Show quoted text
Glad to see this patch hasn't been totally forgotten. Being able to use
indexes for regular expressions would be really cool!Back in January, I asked for some high-level description of how the
algorithm works
(http://archives.postgresql.org/message-id/4F187D5C.30701@enterprisedb.com).
That's still sorely needed. Googling around, I found the slides for your
presentation on this from PGConf.EU - it would be great to have the
information from that presentation included in the patch.- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Nov 20, 2012 at 3:02 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
2) It's common to use upper-case names for macros, but trgm.h defines
macro "iswordchr" - I see it's moved from trgm_op.c but maybe we
could make it a bit more correct?3) I see there are two '#ifdef KEEPONLYALNUM" blocks right next to each
other in trgm.h - maybe it'd be a good idea to join them?4) The two new method prototypes at the end of trgm.h use different
indendation than the rest (spaces only instead of tabs).
These issues are fixed in attached patch.
Additionally 3 new macros are
introduced: MAX_RESULT_STATES, MAX_RESULT_ARCS, MAX_RESULT_PATHS. They are
limiting resources usage during regex processing.
------
With best regards,
Alexander Korotkov.