BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words
The following bug has been logged on the website:
Bug reference: 17556
Logged by: Alex Malek
Email address: magicagent@gmail.com
PostgreSQL version: 14.4
Operating system: Red Hat
Description:
Correct results when 4,998 words separate search terms:
# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || '
labor',
$$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
---------------------
>ipsum< ... >labor<
(1 row)
Add one more word between terms being searched for, to total 4,999, and
terms are not found:
# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || '
labor',
$$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
-------------
baz baz baz
(1 row)
It works correctly if "&" (AND) is replaced by "|" (OR):
# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || '
labor',
$$'ipsum' | 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
---------------------
>ipsum< ... >labor<
(1 row)
The "MinWords" argument and the number of words before the first search
term both alter the results.
After removing one word before the first search term, ts_headline matches
the first term:
# select ts_headline('baz baz ipsum ' || repeat(' foo ',4999) || ' labor',
$$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
-----------------
baz baz >ipsum<
(1 row)
Reducing MinWords from 3 to 2 then causes the terms to be missed again:
# select ts_headline('baz baz ipsum ' || repeat(' foo ',4999) || ' labor',
$$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=2') ;
ts_headline
-------------
baz baz
(1 row)
At Fri, 22 Jul 2022 14:06:43 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
Description:
Correct results when 4,998 words separate search terms:
# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || '
labor',
$$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
---------------------
>ipsum< ... >labor<
(1 row)
Add one more word between terms being searched for, to total 4,999, and
terms are not found:
# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || '
labor',
$$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
-------------
baz baz baz
(1 row)
When ts_headline searches the document, it splits the document into
segments of a length known internally as max_cover, which is not
configurable for now [1]. In the latter case above, it is
MaxFragments * (max(MaxWords * 10, 100)) = 10000 "words", where
whitespace is counted as words too. The document has 10007 "words",
where 'ipsum' is the 7th word and 'labor' is the 10007th word. The two
words aren't within a 10000-word segment, so the match is missed.
ts_headline instead returns the first MinWords words, as you see.
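The arithmetic above can be checked with a small model. The sketch below is Python, not PostgreSQL code. It assumes a simplified tokenizer in which every run of non-whitespace and every run of whitespace counts as one "word", and it assumes a cover fits when its length is at most max_cover. The helper names (tokenize, max_cover, span, document) are made up for illustration.

```python
import re


def tokenize(text):
    # Simplified stand-in for the default parser: each run of whitespace
    # and each run of non-whitespace is one "word".
    return re.findall(r"\s+|\S+", text)


def max_cover(max_fragments, max_words):
    # Formula as described above: MaxFragments * max(MaxWords * 10, 100).
    return max_fragments * max(max_words * 10, 100)


def span(text, first, last):
    # Number of "words" from `first` through `last`, inclusive.
    tokens = tokenize(text)
    return tokens.index(last) - tokens.index(first) + 1


def document(n_foo):
    # Same document shape as in the report.
    return "baz baz baz ipsum " + " foo " * n_foo + "\nlabor"


limit = max_cover(max_fragments=100, max_words=7)
for n in (4998, 4999):
    s = span(document(n), "ipsum", "labor")
    print(n, s, "fits" if s <= limit else "too long")
```

Under these assumptions the span from 'ipsum' to 'labor' is 9999 "words" with 4,998 repetitions and 10001 with 4,999, so only the first case stays within the limit of 10000.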
This is not a bug but designed behavior. However, we might want to
document that behavior.
This could be "improved" as suggested in [1], but in this specific case
I doubt the usefulness of ts_headline picking it up when the two words
are that far apart, in exchange for possible degradation.
[1]: For developers, wparser_def.c:2582
* We might eventually make max_cover a user-settable parameter, but for
* now, just compute a reasonable value based on max_words and
* max_fragments.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Jul 24, 2022 at 10:36 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
When ts_headline searches the document, it splits the document into
segments of a length known internally as max_cover, which is not
configurable for now [1]. In the latter case above, it is
MaxFragments * (max(MaxWords * 10, 100)) = 10000 "words", where
whitespace is counted as words too. The document has 10007 "words",
where 'ipsum' is the 7th word and 'labor' is the 10007th word. The two
words aren't within a 10000-word segment, so the match is missed.
ts_headline instead returns the first MinWords words, as you see.
This is not a bug but designed behavior. However, we might want to
document that behavior.
This could be "improved" as suggested in [1], but in this specific case
I doubt the usefulness of ts_headline picking it up when the two words
are that far apart, in exchange for possible degradation.
[1]: For developers, wparser_def.c:2582
* We might eventually make max_cover a user-settable parameter, but for
* now, just compute a reasonable value based on max_words and
* max_fragments.
Since the expected output is produced for much larger documents when OR
('|') replaces AND ('&'), what if the code, when no match is found,
tries again with such a replacement?
Alternatively, since the "highlighting" of terms is the same for '|' vs
'&', maybe always do the replacement?
Note: I have no idea how the parsing, max_cover, etc. actually work; I am
suggesting "high level" ideas that I realize may or may not make sense
for that code base.
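To make the retry idea concrete, here is a toy model in Python. It is not how ts_headline is implemented: and_cover, or_hits and highlight_positions are hypothetical helpers, "words" are plain whitespace-separated tokens, and the only thing borrowed from the thread is the notion of a maximum cover length.

```python
def and_cover(words, terms, max_cover):
    # Return (start, end) of the first window of at most max_cover words
    # that contains every term, or None if no such window exists.
    wanted = set(terms)
    for start, word in enumerate(words):
        if word not in wanted:
            continue  # a cover can only begin on a query term
        seen = set()
        for end in range(start, min(start + max_cover, len(words))):
            if words[end] in wanted:
                seen.add(words[end])
                if seen == wanted:
                    return start, end
    return None


def or_hits(words, terms):
    # With OR, every single occurrence of any term is a match by itself.
    wanted = set(terms)
    return [i for i, word in enumerate(words) if word in wanted]


def highlight_positions(words, terms, max_cover):
    # Try the AND query first; if no cover fits, retry as OR.
    cover = and_cover(words, terms, max_cover)
    if cover is None:
        return or_hits(words, terms)
    start, end = cover
    return [i for i in range(start, end + 1) if words[i] in terms]


words = "baz ipsum foo foo foo foo foo labor".split()
terms = ["ipsum", "labor"]
print(and_cover(words, terms, max_cover=5))
print(highlight_positions(words, terms, max_cover=5))
```

With a limit of 5 words the AND search finds no cover and prints None, while the fallback reports positions 1 and 7. One consequence of such a fallback in this model is that a document containing only one of the terms would also get a highlight.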
Correct highlighting for 100,000+ "words" using OR ('|'):
# select ts_headline('baz baz baz ipsum ' || repeat(' foo ',100000) || '
labor',
$$'ipsum' | 'labor'$$::tsquery, 'StartSel=>, StopSel=<,
MaxFragments=100, MaxWords=7, MinWords=3') ;
ts_headline
---------------------
>ipsum< ... >labor<
(1 row)
Highlighting is the same for OR vs AND:
# select ts_headline('baz baz baz ipsum labor foo foo foo', $$'ipsum' &
'labor'$$::tsquery, 'StartSel=>, StopSel=<');
ts_headline
-----------------------------------------
baz baz baz >ipsum< >labor< foo foo foo
(1 row)
# select ts_headline('baz baz baz ipsum labor foo foo foo', $$'ipsum' |
'labor'$$::tsquery, 'StartSel=>, StopSel=<');
ts_headline
-----------------------------------------
baz baz baz >ipsum< >labor< foo foo foo
(1 row)
Best,
Alex