Full text search bug ('russian' regconfig)

Started by egocenter about 6 years ago · 3 messages · bugs
#1 egocenter
egocenter@yandex.ru

Hello!

Text search doesn't work correctly when the text and the query are the SAME string (Russian dictionary config).
As you can see in the example, to_tsvector produces different lexemes from the tsquery for identical text:

tsv = 'дан':1 'магазин':2 'нужн':3 'посеща':4 'точн':5
tsq = 'нужн' & 'точн' & 'дан' & 'посещаем' & 'магазин'

SELECT
(web_query_and @@ ts_title)::INTEGER AS full_title_entries, -- returns 0, expected 1
(web_query_and @@ 'зачем нужны точные данные о посещаемости магазинов?')::INTEGER AS full_title_entries2,
*
FROM
(SELECT
to_tsvector('russian', STRIP(to_tsvector('russian', 'зачем нужны точные данные о посещаемости магазинов?'))::TEXT ) AS ts_title,
websearch_to_tsquery('russian', REPLACE('зачем нужны точные данные о посещаемости магазинов?', '- ' , '')) AS web_query_and

) AS main

--
Best regards,
Roman

#2 Artur Zakirov
zaartur@gmail.com
In reply to: egocenter (#1)
Re: Full text search bug ('russian' regconfig)

Hello

On 2/19/2020 5:35 PM, egocenter wrote:

Text search doesn't work correct with the EQUAL string in text and query (russian dictionary config),
as you can see in example ts_vector receives different from ts_query lexemes for identical text:

tsv = 'дан':1 'магазин':2 'нужн':3 'посеща':4 'точн':5
tsq = 'нужн' & 'точн' & 'дан' & 'посещаем' & 'магазин'

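This happens because you call to_tsvector() twice. The 'russian' configuration uses a
Snowball dictionary, which applies a stemming algorithm to cut word endings, so stemming
an already stemmed lexeme can shorten it again: here 'посещаем' becomes 'посеща' on the
second pass. Your query works if to_tsvector() isn't called twice on the same text: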

=# SELECT
web_query_and @@ ts_title,
web_query_and @@ 'зачем нужны точные данные о посещаемости магазинов',
*
FROM
(SELECT
to_tsvector('russian', 'зачем нужны точные данные о посещаемости магазинов') AS ts_title,
websearch_to_tsquery('russian', 'зачем нужны точные данные о посещаемости магазинов?') AS web_query_and
) AS main;

It gives 'true' for the first column.
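
To see the repeated stemming in isolation, the affected word can be passed through the stemmer directly. This is only a minimal sketch: it assumes the stock russian_stem Snowball dictionary that the built-in 'russian' configuration normally uses for Cyrillic words. Judging by the lexemes quoted above, the first call should yield 'посещаем' and the second 'посеща', but the exact output depends on the installed Snowball version.

```sql
-- Sketch: feed one word through the Snowball stemmer, then feed the
-- resulting stem through it again (assumes the default russian_stem dictionary).
SELECT
    ts_lexize('russian_stem', 'посещаемости') AS first_pass,   -- stem of the original word
    ts_lexize('russian_stem', 'посещаем')     AS second_pass;  -- stem of the already stemmed lexeme
```

If the two columns differ, a tsvector built from the text form of another tsvector will not match a tsquery built from the original text.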

--
Artur

#3 egocenter
egocenter@yandex.ru
In reply to: Artur Zakirov (#2)
Re: Full text search bug ('russian' regconfig)

Hello, Artur!

Thanks for the answer.
OK, but it's strange that only one word is affected this way (as if two lexemes exist for one word)...

I call to_tsvector twice to eliminate duplicate words.
In the example below, ts_title = 'histori':2 'watcom':1,3,
so ts_rank_cd counts 2 entries for the query 'город - watcom'.

I need to count UNIQUE word entries, but there seems to be no way to do that with the standard functionality.
(I see 2 ways: a custom ts_rank function, OR editing the tsvector to leave only the first position for 'watcom':
ts_title = 'histori':2 'watcom':1).

If you have any ideas about this situation, I would highly appreciate them! Thanks in advance.

---------
SELECT
round((ts_rank_cd(ts_title, web_query_or)/0.1)::NUMERIC, 0) AS title_entries_count, -- 2, but should be 1
*
FROM
(SELECT
to_tsvector('russian', 'watcom history | watcom') AS ts_title,
websearch_to_tsquery('russian', REPLACE('город - watcom', '- ' , '')) AS web_query_and, -- the dash is removed so that it is not converted into negation (!)
REPLACE(websearch_to_tsquery('russian', REPLACE('город - watcom', '- ' , ''))::TEXT, '&', '|')::tsquery AS web_query_or

) AS main;
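
One possible way to get the unique count without stemming twice is sketched below. It is not a confirmed recommendation from this thread, only an illustration under stated assumptions: PostgreSQL 9.6 or later (for tsvector_to_array and unnest over a tsvector), and a query text that is normalized with to_tsvector so its lexemes can be compared as plain strings. The column names unique_title_entries and first_pos_title are made up for this example.

```sql
-- Sketch 1: count the distinct query lexemes that occur in the title.
-- A tsvector stores each lexeme once, so repeated words are not double counted.
SELECT count(*) AS unique_title_entries
FROM unnest(tsvector_to_array(
         to_tsvector('russian', 'watcom history | watcom'))) AS doc(lexeme)
WHERE doc.lexeme = ANY (tsvector_to_array(
         to_tsvector('russian', 'город watcom')));

-- Sketch 2: rebuild the tsvector keeping only the first position of each lexeme.
-- Assumes every lexeme has at least one position (the vector was not stripped)
-- and that lexemes contain no quotes or backslashes.
SELECT string_agg(quote_literal(u.lexeme) || ':' || u.positions[1], ' ')::tsvector
           AS first_pos_title
FROM unnest(to_tsvector('russian', 'watcom history | watcom')) AS u;
```

The first query avoids ranking altogether. The second produces a vector of the form 'histori':2 'watcom':1 that could then be passed to ts_rank_cd, though whether the resulting rank maps cleanly to an entry count is untested here.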

--
