Flexible configuration for full-text search

Started by Aleksandr Parfenov, over 8 years ago, 38 messages, hackers

a.parfenov@postgrespro.ru

over 8 years ago

Hello hackers,

Arthur Zakirov and I are working on a patch to introduce a more flexible
way to configure full-text search in PostgreSQL, because the current
syntax doesn't allow a variety of scenarios to be handled. Additionally,
some parts of the processing logic are implicit, such as filtering
dictionaries with the TSL_FILTER flag, so the configuration is partially
moved into the dictionary itself and in most cases is hardcoded there.
One more drawback of the current FTS configuration is that we can't
separate dictionary selection from output production, so we can't
configure FTS to use one dictionary if another one recognized a token
(e.g. use hunspell if a dictionary of nouns recognized a token).

Basically, the key goal of the patch is to give the user more
control over the processing of the text.

The patch introduces a way to configure FTS based on a
CASE/WHEN/THEN/ELSE construction. The current comma-separated list is
also available for compatibility. The basic form of the new syntax is
as follows:

ALTER TEXT SEARCH CONFIGURATION <fts_conf>
ALTER MAPPING FOR <token_types> WITH
CASE
WHEN <condition> THEN <command>
....
[ ELSE <command> ]
END;

A condition is a logical expression on dictionaries. You can specify how
to interpret dictionary output with
dictionary IS [ NOT ] NULL - for NULL-result
dictionary IS [ NOT ] STOPWORD - for empty (stopword) result

If no interpretation marker is given, the condition is interpreted as:
dictionary IS NOT NULL AND dictionary IS NOT STOPWORD

A command is an expression on dictionary output sets with the operators
UNION, EXCEPT and INTERSECT. Additionally, there is a special operator
MAP BY which allows us to create the same behavior as with filtering
dictionaries. The MAP BY operator takes the output of the right
subexpression and sends it to the left subexpression as an input token
(if there is more than one lexeme, each one is sent separately).
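
The semantics described above can be modelled outside the server. The
following is a minimal Python sketch, not code from the patch: it assumes a
dictionary is a function returning None (the NULL result), an empty list
(a stopword) or a list of lexemes. The toy dictionaries toy_unaccent and
toy_stem and all helper names are hypothetical.

```python
from typing import Callable, List, Optional

# Toy model (assumption): a dictionary maps a token to None (NULL, not
# recognized), [] (recognized as a stopword), or a list of lexemes.
Lexemes = Optional[List[str]]
Dictionary = Callable[[str], Lexemes]


def matches(d: Dictionary, token: str) -> bool:
    """Default condition: d IS NOT NULL AND d IS NOT STOPWORD."""
    out = d(token)
    return out is not None and out != []


def union(a: List[str], b: List[str]) -> List[str]:
    return a + [x for x in b if x not in a]


def intersect(a: List[str], b: List[str]) -> List[str]:
    return [x for x in a if x in b]


def except_(a: List[str], b: List[str]) -> List[str]:
    return [x for x in a if x not in b]


def map_by(left: Dictionary, right: Dictionary, token: str) -> List[str]:
    """left MAP BY right: feed each lexeme produced by right into left."""
    result: List[str] = []
    for lexeme in right(token) or []:
        result = union(result, left(lexeme) or [])
    return result


def toy_unaccent(token: str) -> Lexemes:
    # Hypothetical filtering dictionary: strips one kind of accent.
    return [token.replace("é", "e")]


def toy_stem(token: str) -> Lexemes:
    # Hypothetical stemmer with two stopwords and a naive plural rule.
    if token in ("a", "the"):
        return []
    if token.endswith("s"):
        return [token[:-1]]
    return None
```

In this model, map_by(toy_stem, toy_unaccent, "cafés") first normalizes
the token and then stems each resulting lexeme, which is the behavior a
TSL_FILTER dictionary provides implicitly today.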

Here are a few examples of usage of the new configuration, compared
with solutions using the current syntax.

1) Multilingual search. Can be used for FTS on a set of documents in
different languages (example for German and English languages).
ALTER TEXT SEARCH CONFIGURATION multi
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_hunspell AND german_hunspell THEN
english_hunspell UNION german_hunspell
WHEN english_hunspell THEN english_hunspell
WHEN german_hunspell THEN german_hunspell
ELSE german_stem UNION english_stem
END;

With the old configuration we have to use a separate vector and index
for each required language, and the query has to combine the search
results for each language:
SELECT * FROM en_de_documents WHERE
to_tsvector('english', text) @@ to_tsquery('english', 'query')
OR
to_tsvector('german', text) @@ to_tsquery('german', 'query');

The new multilingual search configuration itself looks more complex, but
it allows us to avoid splitting the index and vectors. Additionally, for
similar languages or configurations with simple or *_stem dictionaries
in the list we can reduce the total size of the index, since in the
current-state example the index for the English configuration will also
keep data about documents written in German and vice versa.
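
To make the branch order of the multilingual CASE concrete, here is a
small Python sketch of how a single token would flow through it. It is
only an illustration under assumptions: the lookup tables TOY_EN and
TOY_DE stand in for english_hunspell and german_hunspell, and the two
stemmers are deliberately naive stand-ins for english_stem and
german_stem.

```python
from typing import List, Optional

# Hypothetical stand-ins for english_hunspell and german_hunspell.
TOY_EN = {"running": ["run"], "hand": ["hand"]}
TOY_DE = {"häuser": ["haus"], "hand": ["hand"]}


def en_stem(token: str) -> List[str]:
    # Naive stand-in for english_stem: strip a trailing "s".
    return [token[:-1]] if token.endswith("s") else [token]


def de_stem(token: str) -> List[str]:
    # Naive stand-in for german_stem: strip a trailing "en".
    return [token[:-2]] if token.endswith("en") else [token]


def union(a: List[str], b: List[str]) -> List[str]:
    return a + [x for x in b if x not in a]


def multi(token: str) -> List[str]:
    """Evaluate the CASE for one token; the first matching WHEN wins."""
    en: Optional[List[str]] = TOY_EN.get(token)
    de: Optional[List[str]] = TOY_DE.get(token)
    if en and de:      # WHEN english_hunspell AND german_hunspell
        return union(en, de)
    if en:             # WHEN english_hunspell
        return en
    if de:             # WHEN german_hunspell
        return de
    # ELSE german_stem UNION english_stem
    return union(de_stem(token), en_stem(token))
```

A token known to neither toy dictionary falls through to the ELSE branch
and is indexed under both stems, so one vector and one index serve both
languages.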

2) Combination of exact search with morphological search. This patch
does not fully solve the problem, but it is a step toward a solution.
Currently, we have to split exact and morphological search in the query
manually and use a separate index for each part. With the new way to
configure FTS we can use the following configuration:
ALTER TEXT SEARCH CONFIGURATION exact_and_morph
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_hunspell THEN english_hunspell UNION simple
ELSE english_stem UNION simple
END;

Some queries, like "'looking' <1> through" where 'looking' is a search
for the exact form of the word, don't work in current-state FTS: we can
guarantee that the document contains both 'looking' and 'through', but
we can't be sure about the distance between them.

Unfortunately, we can't fully support such queries with the current
format of tsvector, because after processing we can't distinguish
whether a word appeared in its normal form in the text or was produced
by some dictionary. This leads to false positive hits if the user
searches for the normal form of the word. I think we should give the
user the ability to mark a dictionary as something like an "exact form
producer". But without tsvector modification this mark is useless,
since we can't mark the output of this dictionary in tsvector.

There is a patch on commitfest which removes the 1MB limit on tsvector [1].
There are a few free bits available in each lexeme in the vector, so
one of the bits may be used for an "exact" flag.

3) Using different dictionaries for recognition and output generation.
As I mentioned before, in the new syntax condition and command are
separate, and we can use this for some more complex text processing.
Here is an example for processing only nouns:
ALTER TEXT SEARCH CONFIGURATION nouns_only
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_noun THEN english_hunspell
END;

This behavior couldn't be reached with the current state of FTS.

4) Special stopword processing allows us to discard stopwords even if
the main dictionary doesn't support such a feature (in the example, the
pl_ispell dictionary keeps stopwords in the text):
ALTER TEXT SEARCH CONFIGURATION pl_without_stops
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
END;

The patch is attached. I will be glad to hear hackers' opinions
about it.

There are several cases discussed in hackers earlier:

Check for stopwords using non-target dictionary.
/messages/by-id/4733B65A.9030707@students.mimuw.edu.pl

Support union of outputs of several dictionaries.
/messages/by-id/c6851b7e-da25-3d8e-a5df-022c395a11b4@postgrespro.ru

Support of chain of dictionaries using MAP BY operator.
/messages/by-id/46D57E6F.8020009@enterprisedb.com

[1]: Remove 1MB size limit in tsvector https://commitfest.postgresql.org/15/1221/

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Aleksandr Parfenov

a.parfenov@postgrespro.ru

over 8 years ago

In reply to: Aleksandr Parfenov (#1)

Re: Flexible configuration for full-text search

Attached is an updated patch that fixes the empty XML tags in the
documentation.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Emre Hasegeli

emre@hasegeli.com

over 8 years ago

In reply to: Aleksandr Parfenov (#1)

Re: Flexible configuration for full-text search

The patch introduces way to configure FTS based on CASE/WHEN/THEN/ELSE
construction.

Interesting feature. I needed this flexibility before when I was
implementing text search for a Turkish private listing application.
Aleksandr and Arthur were kind enough to discuss it with me off-list
today.

1) Multilingual search. Can be used for FTS on a set of documents in
different languages (example for German and English languages).

ALTER TEXT SEARCH CONFIGURATION multi
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_hunspell AND german_hunspell THEN
english_hunspell UNION german_hunspell
WHEN english_hunspell THEN english_hunspell
WHEN german_hunspell THEN german_hunspell
ELSE german_stem UNION english_stem
END;

I understand the need to support branching, but this syntax is overly
complicated. I don't think there is any need to support different sets
of dictionaries as condition and action. Something like this might
work better:

ALTER TEXT SEARCH CONFIGURATION multi
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH
CASE english_hunspell UNION german_hunspell
WHEN MATCH THEN KEEP
ELSE german_stem UNION english_stem
END;

To put it formally:

ALTER TEXT SEARCH CONFIGURATION name
ADD MAPPING FOR token_type [, ... ] WITH config

where config is one of:

dictionary_name
config { UNION | INTERSECT | EXCEPT } config
CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

2) Combination of exact search with morphological one. This patch not
fully solve the problem but it is a step toward solution. Currently, we
should split exact and morphological search in query manually and use
separate index for each part. With new way to configure FTS we can use
following configuration:

ALTER TEXT SEARCH CONFIGURATION exact_and_morph
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_hunspell THEN english_hunspell UNION simple
ELSE english_stem UNION simple
END

This could be:

CASE english_hunspell
WHEN MATCH THEN KEEP
ELSE english_stem
END
UNION
simple

3) Using different dictionaries for recognizing and output generation.
As I mentioned before, in new syntax condition and command are separate
and we can use it for some more complex text processing. Here an
example for processing only nouns:

ALTER TEXT SEARCH CONFIGURATION nouns_only
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_noun THEN english_hunspell
END

This would also still work with the simpler syntax because
"english_noun", still being a dictionary, would pass the tokens to the
next one.

4) Special stopword processing allows us to discard stopwords even if
the main dictionary doesn't support such feature (in example pl_ispell
dictionary keeps stopwords in text):

ALTER TEXT SEARCH CONFIGURATION pl_without_stops
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
END

Instead of supporting old way of putting stopwords on dictionaries, we
can make them dictionaries on their own. This would then become
something like:

CASE polish_stopword
WHEN NO MATCH THEN polish_isspell
END

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Aleksandr Parfenov

a.parfenov@postgrespro.ru

over 8 years ago

In reply to: Emre Hasegeli (#3)

Re: Flexible configuration for full-text search

I'm mostly happy with the mentioned modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or two.

On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli <emre@hasegeli.com> wrote:

To put it formally:

ALTER TEXT SEARCH CONFIGURATION name
ADD MAPPING FOR token_type [, ... ] WITH config

where config is one of:

dictionary_name
config { UNION | INTERSECT | EXCEPT } config
CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are valid:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END

But configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand ELSE can be used only with KEEP).

I think we should decide whether to allow or disallow the use of
different dictionaries for match checking (between CASE and WHEN) and
for the result (after THEN). If the answer is 'allow', maybe we should
allow the third example too, for consistency in configurations.

3) Using different dictionaries for recognizing and output
generation. As I mentioned before, in new syntax condition and
command are separate and we can use it for some more complex text
processing. Here an example for processing only nouns:

ALTER TEXT SEARCH CONFIGURATION nouns_only
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part WITH CASE
WHEN english_noun THEN english_hunspell
END

This would also still work with the simpler syntax because
"english_noun", still being a dictionary, would pass the tokens to the
next one.

Based on the formal definition, it is possible to describe this example
in the following manner:
CASE english_noun WHEN MATCH THEN english_hunspell END

The question is same as in the previous example.

Instead of supporting old way of putting stopwords on dictionaries, we
can make them dictionaries on their own. This would then become
something like:

CASE polish_stopword
WHEN NO MATCH THEN polish_isspell
END

Currently, stopwords increment position, for example:
SELECT to_tsvector('english','a test message');
---------------------
'messag':3 'test':2

The stopword 'a' has position 1, but it is not in the vector.

If we want to preserve this behavior, we should somehow pass a stopword
to the tsvector composition function (parsetext in ts_parse.c) so that
the counter is incremented, or increment it in another way. Currently,
an empty lexeme array is passed as the result of LexizeExec.
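
The position counting in the to_tsvector example can be reproduced with
a toy model. This is a sketch under assumptions, not the code of
parsetext: toy_english is a hypothetical dictionary that returns an
empty list for a stopword, and the choice to skip unrecognized tokens
without counting them is an assumption of the model.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the 'english' configuration.
TOY_STOPWORDS = {"a"}
TOY_STEMS = {"test": ["test"], "message": ["messag"]}


def toy_english(token: str) -> Optional[List[str]]:
    if token in TOY_STOPWORDS:
        return []                 # recognized, but produces no lexeme
    return TOY_STEMS.get(token)   # None when the token is unknown


def build_vector(tokens: List[str]) -> Dict[str, List[int]]:
    vector: Dict[str, List[int]] = {}
    position = 0
    for token in tokens:
        lexemes = toy_english(token)
        if lexemes is None:
            continue              # assumption: unknown tokens are skipped
        position += 1             # an empty array still consumes a position
        for lexeme in lexemes:
            vector.setdefault(lexeme, []).append(position)
    return vector
```

With the tokens 'a', 'test', 'message' the model gives 'test' position 2
and 'messag' position 3, matching the output shown above.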

One possible way to do so is something like:
CASE polish_stopword
WHEN MATCH THEN KEEP -- stopword counting
ELSE polish_isspell
END

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Emre Hasegeli

emre@hasegeli.com

over 8 years ago

In reply to: Aleksandr Parfenov (#4)

Re: Flexible configuration for full-text search

I'm mostly happy with mentioned modifications, but I have few questions
to clarify some points. I will send new patch in week or two.

I am glad you liked it. Though, I think we should get approval from
more senior community members or committers about the syntax, before
we put more effort into the code.

But configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand ELSE can be used only with KEEP).

I think we should decide to allow or disallow usage of different
dictionaries for match checking (between CASE and WHEN) and a result
(after THEN). If answer is 'allow', maybe we should allow the
third example too for consistency in configurations.

I think you are right. We better allow this too. Then the CASE syntax becomes:

CASE config
WHEN [ NO ] MATCH THEN { KEEP | config }
[ ELSE config ]
END
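
As a way to check that the revised grammar covers the examples discussed
so far, here is a minimal Python sketch of it as an expression tree with
an evaluator. It is only a model under assumptions: a non-empty lexeme
list counts as a match, KEEP is represented by then=None, and the toy
dictionaries at the bottom are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union

Lexemes = Optional[List[str]]     # None stands for "no match"


@dataclass
class Dictionary:
    name: str
    lexize: Callable[[str], Lexemes]


@dataclass
class SetOp:
    op: str                       # "UNION", "INTERSECT" or "EXCEPT"
    left: "Config"
    right: "Config"


@dataclass
class Case:
    cond: "Config"
    negate: bool                  # True for WHEN NO MATCH
    then: Optional["Config"]      # None stands for KEEP
    else_: Optional["Config"] = None


Config = Union[Dictionary, SetOp, Case]


def evaluate(cfg: Config, token: str) -> Lexemes:
    if isinstance(cfg, Dictionary):
        return cfg.lexize(token)
    if isinstance(cfg, SetOp):
        left = evaluate(cfg.left, token) or []
        right = evaluate(cfg.right, token) or []
        if cfg.op == "UNION":
            out = left + [x for x in right if x not in left]
        elif cfg.op == "INTERSECT":
            out = [x for x in left if x in right]
        else:
            out = [x for x in left if x not in right]
        return out or None
    cond = evaluate(cfg.cond, token)
    if bool(cond) != cfg.negate:  # the WHEN branch applies
        return cond if cfg.then is None else evaluate(cfg.then, token)
    return None if cfg.else_ is None else evaluate(cfg.else_, token)


# Hypothetical toy dictionaries.
english_noun = Dictionary("english_noun", {"cats": ["cat"]}.get)
english_hunspell = Dictionary(
    "english_hunspell", {"cats": ["cat"], "running": ["run"]}.get)
simple = Dictionary("simple", lambda token: [token])

# CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
third_example = Case(english_noun, False, english_hunspell, simple)
# CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END UNION simple
exact_and_morph = SetOp(
    "UNION", Case(english_hunspell, False, None, simple), simple)
```

Because CASE is itself a config, it can appear as an operand of UNION,
which is what the exact_and_morph example relies on.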

Based on formal definition it is possible to describe this example in
following manner:
CASE english_noun WHEN MATCH THEN english_hunspell END

The question is same as in the previous example.

I couldn't understand the question.

Currently, stopwords increment position, for example:
SELECT to_tsvector('english','a test message');
---------------------
'messag':3 'test':2

A stopword 'a' has a position 1 but it is not in the vector.

Does this problem apply only to stopwords, or to the whole thing we are
inventing? Shouldn't we preserve the positions through the pipeline?

If we want to save this behavior, we should somehow pass a stopword to
tsvector composition function (parsetext in ts_parse.c) for counter
increment or increment it in another way. Currently, an empty lexemes
array is passed as a result of LexizeExec.

One of possible way to do so is something like:
CASE polish_stopword
WHEN MATCH THEN KEEP -- stopword counting
ELSE polish_isspell
END

This would mean keeping the stopwords. What we want is

CASE polish_stopword -- stopword counting
WHEN NO MATCH THEN polish_isspell
END

Do you think it is possible?

Thomas Munro

thomas.munro@gmail.com

over 8 years ago

In reply to: Aleksandr Parfenov (#2)

Re: Flexible configuration for full-text search

On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
<a.parfenov@postgrespro.ru> wrote:

In attachment updated patch with fixes of empty XML tags in
documentation.

Hi Aleksandr,

I'm not sure if this is expected at this stage, but just in case you
aren't aware, with this version of the patch the binary upgrade test in
src/bin/pg_dump/t/002_pg_dump.pl fails for me:

# Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
dump_test.alt_ts_conf1 ...'
# at t/002_pg_dump.pl line 6715.

--
Thomas Munro
http://www.enterprisedb.com

Aleksandr Parfenov

a.parfenov@postgrespro.ru

over 8 years ago

In reply to: Thomas Munro (#6)

Re: Flexible configuration for full-text search

On Mon, 6 Nov 2017 18:05:23 +1300
Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
<a.parfenov@postgrespro.ru> wrote:

In attachment updated patch with fixes of empty XML tags in
documentation.

Hi Aleksandr,

I'm not sure if this is expected at this stage, but just in case you
aren't aware, with this version of the patch the binary upgrade test
in
src/bin/pg_dump/t/002_pg_dump.pl fails for me:

# Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
dump_test.alt_ts_conf1 ...'
# at t/002_pg_dump.pl line 6715.

Hi Thomas,

Thank you for noticing it. I will investigate it while working on the
next version of the patch.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Aleksandr Parfenov

a.parfenov@postgrespro.ru

over 8 years ago

In reply to: Emre Hasegeli (#5)

Re: Flexible configuration for full-text search

On Tue, 31 Oct 2017 09:47:57 +0100
Emre Hasegeli <emre@hasegeli.com> wrote:

If we want to save this behavior, we should somehow pass a stopword
to tsvector composition function (parsetext in ts_parse.c) for
counter increment or increment it in another way. Currently, an
empty lexemes array is passed as a result of LexizeExec.

One of possible way to do so is something like:
CASE polish_stopword
WHEN MATCH THEN KEEP -- stopword counting
ELSE polish_isspell
END

This would mean keeping the stopwords. What we want is

CASE polish_stopword -- stopword counting
WHEN NO MATCH THEN polish_isspell
END

Do you think it is possible?

Hi Emre,

I thought about how it can be implemented. The way I see it is to
increment the word counter if any checked dictionary matched the word,
even without returning a lexeme. The main drawback is that the counter
increment is implicit.
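
To show what the implicit increment would change, here is a small Python
sketch of "CASE checked WHEN NO MATCH THEN output END" with the counting
rule as a switch. It is an illustration only: toy_stopword and toy_ispell
are hypothetical stand-ins for polish_stopword and polish_isspell, and
the count_checked flag does not exist in the patch.

```python
from typing import Callable, Dict, List, Optional


def index_tokens(
    tokens: List[str],
    checked: Callable[[str], bool],
    output: Callable[[str], Optional[List[str]]],
    count_checked: bool,
) -> Dict[str, List[int]]:
    """Model of CASE checked WHEN NO MATCH THEN output END."""
    vector: Dict[str, List[int]] = {}
    position = 0
    for token in tokens:
        if checked(token):
            # The checked dictionary matched: nothing is emitted, but the
            # position may still advance (the implicit increment).
            if count_checked:
                position += 1
            continue
        lexemes = output(token)
        if lexemes is None:
            continue
        position += 1
        for lexeme in lexemes:
            vector.setdefault(lexeme, []).append(position)
    return vector


# Hypothetical toy dictionaries.
def toy_stopword(token: str) -> bool:
    return token in {"a"}


def toy_ispell(token: str) -> Optional[List[str]]:
    return {"test": ["test"], "message": ["messag"]}.get(token)
```

With count_checked=True the positions match the current stopword
behavior; with count_checked=False every later lexeme shifts by one,
which changes the distances that phrase queries depend on.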

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Michael Paquier

michael@paquier.xyz

over 8 years ago

In reply to: Aleksandr Parfenov (#7)

Re: [HACKERS] Flexible configuration for full-text search

On Tue, Nov 7, 2017 at 3:18 PM, Aleksandr Parfenov
<a.parfenov@postgrespro.ru> wrote:

On Mon, 6 Nov 2017 18:05:23 +1300
Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
<a.parfenov@postgrespro.ru> wrote:

In attachment updated patch with fixes of empty XML tags in
documentation.

Hi Aleksandr,

I'm not sure if this is expected at this stage, but just in case you
aren't aware, with this version of the patch the binary upgrade test
in
src/bin/pg_dump/t/002_pg_dump.pl fails for me:

# Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
dump_test.alt_ts_conf1 ...'
# at t/002_pg_dump.pl line 6715.

Hi Thomas,

Thank you for noticing it. I will investigate it during work on next
version of patch.

With the next version still pending after three weeks, I am marking the
patch as returned with feedback for now.
--
Michael

#10

Aleksandr Parfenov

a.parfenov@postgrespro.ru

over 8 years ago

In reply to: Emre Hasegeli (#5)

Re: [HACKERS] Flexible configuration for full-text search

Hi,

On Tue, 31 Oct 2017 09:47:57 +0100
Emre Hasegeli <emre@hasegeli.com> wrote:

I am glad you liked it. Though, I think we should get approval from
more senior community members or committers about the syntax, before
we put more effort to the code.

I postponed a new version of the patch in order to wait for more
feedback, but I think now is the time to send it to push the discussion
further.

I kept all the feedback in mind while reworking the patch, so the FTS
syntax and behavior have changed since the previous one. But I'm not
sure about one last thing:

CASE polish_stopword -- stopword counting
WHEN NO MATCH THEN polish_isspell
END

Do you think it is possible?

If we count tokens in such a case, any dropped words will be
counted too. For example:

CASE banned_words
WHEN NO MATCH THEN some_dictionary
END

And I'm not sure about that behavior, because of the implicit use of the
token produced by the 'banned_words' dictionary in the example. In the
third version of the patch I keep the behavior without an implicit use
of tokens for counting.

The new version of the patch is attached, along with a little README
file describing the changes in each file. Any feedback is welcome.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#11

Arthur Zakirov

a.zakirov@postgrespro.ru

over 8 years ago

In reply to: Aleksandr Parfenov (#10)

Re: [HACKERS] Flexible configuration for full-text search

Hello,

On Tue, Dec 19, 2017 at 05:31:09PM +0300, Aleksandr Parfenov wrote:

The new version of the patch is in attachment as well as a
little README file with a description of changes in each file. Any
feedback is welcome.

I looked at the patch a little bit. The patch applies and the tests pass.

I noticed that there are typos in the documentation. I also think it is necessary to keep the information about the previous syntax, since that syntax will be supported anyway. For example, the information about TSL_FILTER was removed. It would also be good to add more examples of the new syntax.

The result of the ts_debug() function was changed. Is it possible to keep the old ts_debug() result? To be specific, the 'dictionaries' column is text now, not an array, as I understand it. It would be good to keep the result for the sake of backward compatibility.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#12

Aleksandr Parfenov

a.parfenov@postgrespro.ru

over 8 years ago

In reply to: Arthur Zakirov (#11)

Re: [HACKERS] Flexible configuration for full-text search

Hi Arthur,

Thank you for the review.

On Thu, 21 Dec 2017 17:46:42 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

I noticed that there are typos in the documentation. And I think it
is necessary to keep information about previous sintax. The syntax
will be supported anyway. For example, information about TSL_FILTER
was removed. And it will be good to add more examples of the new
syntax.

In the current version of the patch, configurations written in the old
syntax are rewritten into the same configuration in the new syntax.
Since the new syntax doesn't support TSL_FILTER, it was removed from the
documentation. It is possible to store configurations written in the old
syntax in a special way and simulate the TSL_FILTER behavior for them.
But that would mean maintaining two different FTS behaviors, depending
on the version of the syntax used for the configuration. Do you think we
should keep both behaviors?

The result of ts_debug() function was changed. Is it possible to keep
the old ts_debug() result? To be specific, 'dictionaries' column is
text now, not array, as I understood. It will be good to keep the
result for the sake of backward compatibility.

The types of the 'dictionaries' and 'dictionary' columns were changed to
text because after the patch the configuration may be not a plain array
of dictionaries but a complex expression tree. In the 'dictionaries'
column the result is a textual representation of the configuration, the
same as the \dF+ description of the configuration.

I decided to rename the newly added column to 'configuration' and keep
the 'dictionaries' column with an array of all dictionaries used in the
configuration (no matter how). Also, I fixed a bug in the 'command'
output of ts_debug in some cases.
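
The relation between the two columns can be pictured with a short Python
sketch. It is a model under assumptions, not the patch's catalog format:
a configuration is a dictionary name or a tuple (operator, left, right),
render() plays the role of the textual 'configuration' column, and
dictionaries_used() plays the role of the 'dictionaries' array.

```python
from typing import List, Tuple, Union

# Assumed toy representation of a configuration expression tree.
Expr = Union[str, Tuple[str, "Expr", "Expr"]]


def dictionaries_used(expr: Expr) -> List[str]:
    """Collect every dictionary name in the tree, in order, without repeats."""
    if isinstance(expr, str):
        return [expr]
    names: List[str] = []
    for operand in expr[1:]:
        for name in dictionaries_used(operand):
            if name not in names:
                names.append(name)
    return names


def render(expr: Expr) -> str:
    """Textual form of the tree; nested operands are parenthesized."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    parts = [x if isinstance(x, str) else f"({render(x)})" for x in (left, right)]
    return f"{parts[0]} {op} {parts[1]}"


flat = ("UNION", "english_stem", "simple")
nested = ("UNION", ("UNION", "english_hunspell", "simple"), "simple")
```

The array keeps old clients working because it is still a plain list of
dictionary names, while the text form carries the operators.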

Additionally, I added some examples to documentation regarding
multilingual search and combination of exact and linguistic-aware
search and fixed typos.

#13

Arthur Zakirov

a.zakirov@postgrespro.ru

over 8 years ago

In reply to: Aleksandr Parfenov (#12)

Re: [HACKERS] Flexible configuration for full-text search

On Mon, Dec 25, 2017 at 05:15:07PM +0300, Aleksandr Parfenov wrote:

In the current version of the patch, configurations written in old
syntax are rewritten into the same configuration in the new syntax.
Since new syntax doesn't support a TSL_FILTER, it was removed from the
documentation. It is possible to store configurations written in old
syntax in a special way and simulate a TSL_FILTER behavior for them.
But it will lead to maintenance of two different behavior of the FTS
depends on a version of the syntax used during configuration. Do you
think we should keep both behaviors?

As I understand it, users would need to rewrite their configurations if they use the unaccent dictionary, for example.
That is not good, I think. Users will be upset if they only use the old configuration and don't need the new one.

From my point of view it is necessary to keep the old configuration syntax.

Columns' 'dictionaries' and 'dictionary' type were changed to text
because after the patch the configuration may be not a plain array of
dictionaries but a complex expression tree. In the column
'dictionaries' the result is textual representation of configuration
and it is the same as a result of \dF+ description of the configuration.

Oh, I understand.

I decide to rename newly added column to 'configuration' and keep
column 'dictionaries' with an array of all dictionaries used in
configuration (no matter how). Also, I fixed a bug in 'command' output
of the ts_debug in some cases.

Maybe it would be better to keep the 'dictionary' column name? Is there a reason why it was renamed to 'command'?

Additionally, I added some examples to documentation regarding
multilingual search and combination of exact and linguistic-aware
search and fixed typos.

Great!

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#14

Aleksandr Parfenov

a.parfenov@postgrespro.ru

over 8 years ago

In reply to: Arthur Zakirov (#13)

Re: [HACKERS] Flexible configuration for full-text search

On Tue, 26 Dec 2017 13:51:03 +0300
Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:

On Mon, Dec 25, 2017 at 05:15:07PM +0300, Aleksandr Parfenov wrote:

Is I understood users need to rewrite their configurations if they
use unaccent dictionary, for example. It is not good I think. Users
will be upset about that if they use only old configuration and they
don't need new configuration.

From my point of view it is necessary to keep old configuration
syntax.

I see your point. I will rework the patch to keep backward
compatibility and restore the TSL_FILTER entry in the documentation.

I decide to rename newly added column to 'configuration' and keep
column 'dictionaries' with an array of all dictionaries used in
configuration (no matter how). Also, I fixed a bug in 'command'
output of the ts_debug in some cases.

Maybe it would be better to keep the 'dictionary' column name? Is
there a reason why it was renamed to 'command'?

I changed the name because it may contain more than one dictionary
interconnected via operators (e.g. 'english_stem UNION simple') and
the word 'dictionary' doesn't fully describe the content of the column
now. Also, the type of the column was changed from regdictionary to
text in order to put operators into the output.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#15

Aleksandr Parfenov

a.parfenov@postgrespro.ru

about 8 years ago

In reply to: Aleksandr Parfenov (#1)

Re: [HACKERS] Flexible configuration for full-text search

Greetings,

According to http://commitfest.cputube.org/ the patch no longer applies.

An updated version of the patch is attached.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#16

Aleksander Alekseev

aleksander@timescale.com

about 8 years ago

In reply to: Aleksandr Parfenov (#15)

Re: Flexible configuration for full-text search

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

This patch seems to be in pretty good shape. There is room for improvement,
though.

1. There are no comments for some procedures, their arguments and return
values. Ditto regarding structures and their fields.

2. Please, fix the year in the copyright messages from 2017 to 2018.

3. Somehow I doubt that current amount of tests covers most of the
functionality. Are you sure that if we run lcov, it will not show that most of
the new code is never executed during make installcheck-world?

4. I'm a bit concerned regarding change of the catalog in the
src/include/catalog/indexing.h file. Are you sure it will not break if I
migrate from PostgreSQL 10 to PostgreSQL 11?

5. There are typos, e.g "Last result retued...", "...thesaurus pharse
processing...".

I'm going to run a few more tests a bit later. I'll let you know if I find
anything.

The new status of this patch is: Waiting on Author

#17

Aleksandr Parfenov

a.parfenov@postgrespro.ru

about 8 years ago

In reply to: Aleksander Alekseev (#16)

Re: Flexible configuration for full-text search

Hi Aleksander,

Thank you for the review!

1. There are no comments for some procedures, their arguments and
return values. Ditto regarding structures and their fields.

2. Please, fix the year in the copyright messages from 2017 to 2018.

Both issues are fixed.

3. Somehow I doubt that current amount of tests covers most of the
functionality. Are you sure that if we run lcov, it will not show
that most of the new code is never executed during make
installcheck-world?

I checked it and most of the new code is executed (I mostly checked
ts_parse.c and ts_configmap.c because those files contain most of the
potentially unchecked code). I have added some tests to check the output
of the configurations. Also, I have added a test of TEXT CONFIGURATION
to the unaccent contrib module to test the MAP operator.

4. I'm a bit concerned regarding change of the catalog in the
src/include/catalog/indexing.h file. Are you sure it will not break
if I migrate from PostgreSQL 10 to PostgreSQL 11?

I have tested the upgrade process via pg_upgrade from PostgreSQL 10 and
PostgreSQL 9.6 and found a bug in the way pg_dump generates the schema
dump for old databases. I fixed it and now pg_upgrade works fine.

The new version of the patch is in an attachment.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#18

Aleksandr Parfenov

a.parfenov@postgrespro.ru

about 8 years ago

In reply to: Aleksandr Parfenov (#17)

Re: Flexible configuration for full-text search

I have found an issue in the grammar which doesn't allow constructing
complex expressions (usage of CASEs as operands) without parentheses.

I fixed and simplified the grammar a little bit. The rest of the patch
is the same.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#19

Aleksander Alekseev

aleksander@timescale.com

about 8 years ago

In reply to: Aleksandr Parfenov (#18)

Re: Flexible configuration for full-text search

The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Unfortunately this patch doesn't apply anymore: http://commitfest.cputube.org/

#20

Aleksandr Parfenov

a.parfenov@postgrespro.ru

about 8 years ago

In reply to: Aleksander Alekseev (#19)

Re: Flexible configuration for full-text search

On Mon, 05 Mar 2018 12:59:21 +0000
Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:

The following review has been posted through the commitfest
application: make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested

Unfortunately this patch doesn't apply anymore:
http://commitfest.cputube.org/

Thank you for noticing that. A refreshed version of the patch is
attached.

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company