Prefix support for synonym dictionary
Hi there,
attached is our patch for CVS HEAD, which adds prefix support for the synonym
dictionary.
Quick example:
cat $SHAREDIR/tsearch_data/synonym_sample.syn
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
=# create text search dictionary syn( template=synonym,synonyms='synonym_sample');
=# select ts_lexize('syn','indices');
ts_lexize
-----------
{index}
(1 row)
=# create text search configuration tst ( copy=simple);
=# alter text search configuration tst alter mapping for asciiword with syn;
=# select to_tsquery('tst','indices');
to_tsquery
------------
'index':*
(1 row)
=# select 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
?column?
----------
t
(1 row)
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Hi,
The patch looks good.
Comments:
1. The docs should be clarified a little. For instance, it should have a
link back to the definition of a prefix search (12.3.2). I included my
doc suggestions as an attachment.
2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
fragile) way. After calling findwrd(), the "end" pointer is pointing at
either the end of the string, or the *; depending on whether the string
ends in * and whether flags is NULL. I only mention this because I had
to take a more careful look to see what was happening. Perhaps add a
comment to make it more clear?
3. The patch looks for the special byte '*'. I think that's fine,
because we depend on the files being in UTF-8 encoding, where it's the
same byte. However, I thought it was worth mentioning in case we want to
support other encodings for text search files later.
Regards,
Jeff Davis
Attachments:
prefix-synonym-review.diff (text/x-patch; charset=UTF-8)
*** textsearch.sgml 2009-08-02 11:22:38.000000000 -0700
--- textsearch.sgml.new 2009-08-02 11:22:27.000000000 -0700
***************
*** 2290,2315 ****
</para>
<para>
! Star sign <literal>*</literal> at the end of definition word indicates,
! that definition word is a prefix and <function>to_tsquery()</function>
! function will transform that definition to the prefix search format.
! Notice, it is ignored in <function>to_tsvector()</function>.
</para>
<programlisting>
- > cat $SHAREDIR/tsearch_data/synonym_sample.syn
- postgres pgsql
- postgresql pgsql
- postgre pgsql
- gogle googl
- indices index*
- > cat $SHAREDIR/tsearch_data/synonym_sample.syn
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
=# create text search dictionary syn( template=synonym,synonyms='synonym_sample');
=# select ts_lexize('syn','indices');
ts_lexize
--- 2290,2317 ----
</para>
<para>
! An asterisk (<literal>*</literal>) at the end of definition word indicates
! that definition word is a prefix, and <function>to_tsquery()</function>
! function will transform that definition to the prefix search format (see
! <xref linkend="textsearch-parsing-queries">).
! Notice that it is ignored in <function>to_tsvector()</function>.
</para>
+ <para>
+ Contents of <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</>:
+ </para>
<programlisting>
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
+ </programlisting>
+ <para>
+ Results:
+ </para>
+ <programlisting>
=# create text search dictionary syn( template=synonym,synonyms='synonym_sample');
=# select ts_lexize('syn','indices');
ts_lexize
***************
*** 2324,2329 ****
--- 2326,2338 ----
------------
'index':*
(1 row)
+
+ =# select 'indexes are very useful'::tsvector;
+ tsvector
+ ---------------------------------
+ 'are' 'indexes' 'useful' 'very'
+ (1 row)
+
=# select 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
?column?
----------
On Sun, Aug 2, 2009 at 3:05 PM, Jeff Davis<pgsql@j-davis.com> wrote:
The patch looks good.
Comments:
1. The docs should be clarified a little. For instance, it should have a
link back to the definition of a prefix search (12.3.2). I included my
doc suggestions as an attachment.
2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
fragile) way. After calling findwrd(), the "end" pointer is pointing at
either the end of the string, or the *; depending on whether the string
ends in * and whether flags is NULL. I only mention this because I had
to take a more careful look to see what was happening. Perhaps add a
comment to make it more clear?
3. The patch looks for the special byte '*'. I think that's fine,
because we depend on the files being in UTF-8 encoding, where it's the
same byte. However, I thought it was worth mentioning in case we want to
support other encodings for text search files later.
Oleg,
Are you planning to update this patch this week? If not I will set it
to "Returned with Feedback".
Thanks,
...Robert
On Wed, 2009-08-05 at 12:34 -0400, Robert Haas wrote:
Oleg,
Are you planning to update this patch this week? If not I will set it
to "Returned with Feedback".
My only comments were related to docs and comments, and I supplied a
patch as a suggested fix for the docs. Also, the patch is very small.
I'd hate to hold it up over such a minor issue, and it seems like a
useful feature. If Oleg is unavailable, would you mind just having a
second review of the patch to see if they agree with my suggestions, and
then mark "ready for committer review"?
Regards,
Jeff Davis
1. The docs should be clarified a little. For instance, it should have a
link back to the definition of a prefix search (12.3.2). I included my
doc suggestions as an attachment.
Thank you, merged
2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
fragile) way. After calling findwrd(), the "end" pointer is pointing at
either the end of the string, or the *; depending on whether the string
ends in * and whether flags is NULL. I only mention this because I had
to take a more careful look to see what was happening. Perhaps add a
comment to make it more clear?
Added a comment:
/*
* Finds the next whitespace-delimited word within the 'in' string.
* Returns a pointer to the first character of the word, and a pointer
* to the next byte after the last character in the word (in *end).
* A '*' character at the end of the word will not be treated as a word
* character if flags is not NULL.
*/
static char *
findwrd(char *in, char **end, uint16 *flags)
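To illustrate the behaviour that comment describes, here is a minimal standalone sketch. It is not the code from the patch: the name find_word_sketch and the flag value SKETCH_PREFIX_FLAG are hypothetical, and it uses plain isspace() on single bytes, whereas the real function presumably has to be multibyte-aware. It only shows where *end lands in the two cases Jeff asked about.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value for this sketch; the patch defines its own. */
#define SKETCH_PREFIX_FLAG 0x01

/*
 * Find the next whitespace-delimited word in 'in'.
 * Returns a pointer to the first byte of the word, or NULL if none is left.
 * *end is set to the byte after the word, except that a trailing '*' is
 * left out of the word when 'flags' is non-NULL; *flags then reports it.
 */
static char *
find_word_sketch(char *in, char **end, uint16_t *flags)
{
    char *start;
    char *last;

    assert(in != NULL && end != NULL);

    /* Skip leading whitespace (single-byte only: an assumption). */
    while (*in && isspace((unsigned char) *in))
        in++;

    /* Nothing left on the line. */
    if (*in == '\0')
    {
        *end = NULL;
        if (flags != NULL)
            *flags = 0;
        return NULL;
    }

    /* Walk to the end of the word, remembering its last byte. */
    start = last = in;
    while (*in && !isspace((unsigned char) *in))
    {
        last = in;
        in++;
    }

    if (flags != NULL && *last == '*')
    {
        /* Caller wants prefix detection: stop before the '*'. */
        *flags = SKETCH_PREFIX_FLAG;
        *end = last;
    }
    else
    {
        /* Either no '*' or the caller passed NULL: '*' stays in the word. */
        if (flags != NULL)
            *flags = 0;
        *end = in;
    }

    return start;
}
```

With the line "indices index*", a first call with flags set to NULL yields the 7-byte word "indices". A second call starting at the returned end pointer, this time with a flags pointer, yields the 5-byte word "index" and sets the flag. Passing NULL for flags on that second word would instead give the 6-byte word "index*".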
3. The patch looks for the special byte '*'. I think that's fine,
because we depend on the files being in UTF-8 encoding, where it's the
same byte. However, I thought it was worth mentioning in case we want to
support other encodings for text search files later.
tsearch_readline() converts the file's UTF-8 encoding into the server encoding.
pgsql supports only encodings that are supersets of ASCII, so it's safe to use
the asterisk with any encoding.
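As a small illustration of why a byte-wise test for '*' is safe, here is a sketch for the UTF-8 case only. The helper names are hypothetical, and it does not call tsearch_readline() or any other PostgreSQL function. It relies on the fact that every byte of a multibyte UTF-8 character has its high bit set, so the byte 0x2A can only ever be a literal asterisk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A byte with the high bit clear is a plain ASCII character in UTF-8. */
static int
is_ascii_byte(unsigned char b)
{
    return (b & 0x80) == 0;
}

/* Count the ASCII bytes in a NUL-terminated buffer. */
static size_t
count_ascii_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
    {
        if (is_ascii_byte((unsigned char) *s))
            n++;
    }
    return n;
}

/* Byte-wise check for a trailing '*' marker; no decoding is needed. */
static int
ends_with_star(const char *s)
{
    size_t len;

    assert(s != NULL);
    len = strlen(s);
    return len > 0 && s[len - 1] == '*';
}
```

For a word made of three two-byte Cyrillic letters followed by '*', count_ascii_bytes() sees exactly one ASCII byte, the asterisk itself, and ends_with_star() finds it without decoding any characters.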
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
2009/8/6 Teodor Sigaev <teodor@sigaev.ru>:
1. The docs should be clarified a little. For instance, it should have a
link back to the definition of a prefix search (12.3.2). I included my
doc suggestions as an attachment.
Thank you, merged
2. dsynonym_init() uses findwrd() in a slightly confusing (and perhaps
fragile) way. After calling findwrd(), the "end" pointer is pointing at
either the end of the string, or the *; depending on whether the string
ends in * and whether flags is NULL. I only mention this because I had
to take a more careful look to see what was happening. Perhaps add a
comment to make it more clear?
Added a comment:
/*
* Finds the next whitespace-delimited word within the 'in' string.
* Returns a pointer to the first character of the word, and a pointer
* to the next byte after the last character in the word (in *end).
* A '*' character at the end of the word will not be treated as a word
* character if flags is not NULL.
*/
static char *
findwrd(char *in, char **end, uint16 *flags)
3. The patch looks for the special byte '*'. I think that's fine,
because we depend on the files being in UTF-8 encoding, where it's the
same byte. However, I thought it was worth mentioning in case we want to
support other encodings for text search files later.
tsearch_readline() converts the file's UTF-8 encoding into the server encoding.
pgsql supports only encodings that are supersets of ASCII, so it's safe to use
the asterisk with any encoding.
Jeff,
Based on these comments, do you want to go ahead and mark this "Ready
for Committer"?
https://commitfest.postgresql.org/action/patch_view?id=133
...Robert
On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
Based on these comments, do you want to go ahead and mark this "Ready
for Committer"?
Done, thanks Teodor.
However, on the commitfest page, the patches got updated in the wrong
places: "prefix support" and "filtering dictionary support" are pointing
at each others' patches.
Regards,
Jeff Davis
On Thu, Aug 6, 2009 at 12:53 PM, Jeff Davis<pgsql@j-davis.com> wrote:
On Thu, 2009-08-06 at 12:19 -0400, Robert Haas wrote:
Based on these comments, do you want to go ahead and mark this "Ready
for Committer"?
Done, thanks Teodor.
However, on the commitfest page, the patches got updated in the wrong
places: "prefix support" and "filtering dictionary support" are pointing
at each others' patches.
Fixed.
...Robert