BUG #16744: ts_headline behaves incorrectly with <-> and proximity operators

Started by PG Bug reporting form, over 5 years ago, 4 messages, bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16744
Logged by: Stas Obydionnov
Email address: stas@hellofyllo.com
PostgreSQL version: 12.3
Operating system: runs on AWS RDS
Description:

When running the following code
select ts_headline('Alpha Beta Gama', phraseto_tsquery ('alpha gama'))

or
select ts_headline('Alpha Beta Gama', to_tsquery ('alpha <-> gama'))
I would expect the result not to be highlighted; however, the result looks
like:
<b>Alpha</b> Beta <b>Gama</b>

The same behavior is found for the following operator:
select ts_headline('Alpha Beta Gama Delta', phraseto_tsquery ('alpha <3> gama'))
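
To see which query each call actually builds, the tsquery values can be printed on their own. This is a minimal sketch, assuming the 'english' configuration (the report does not say which configuration was in effect). Per the PostgreSQL documentation, phraseto_tsquery does not recognize tsquery operators in its input, so a distance operator such as <3> presumably has to go through to_tsquery to take effect:

```sql
-- Assumption: 'english' configuration; the original report relied on the default.
SELECT phraseto_tsquery('english', 'alpha gama') AS phrase_query,
       to_tsquery('english', 'alpha <-> gama')   AS adjacent_query,
       to_tsquery('english', 'alpha <3> gama')   AS distance_query;

-- Whether the sample document matches at all is a separate question from
-- what ts_headline highlights; @@ answers the former.
SELECT to_tsvector('english', 'Alpha Beta Gama')
       @@ to_tsquery('english', 'alpha <-> gama') AS matches;
```

The first two columns should show the same phrase query, and the match test should come back false because "Beta" sits between the two words.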

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #16744: ts_headline behaves incorrectly with <-> and proximity operators

PG Bug reporting form <noreply@postgresql.org> writes:

> When running the following code
> select ts_headline('Alpha Beta Gama', phraseto_tsquery ('alpha gama'))
> or
> select ts_headline('Alpha Beta Gama', to_tsquery ('alpha <-> gama'))
> I would expect the result not to be highlighted,

That's operating as designed, I think. Per the code comment:

* If we found nothing acceptable, select min_words words starting at
* the beginning.

The expectation really is that it's on you to not select documents that
don't match your search query. Once you've selected a document to
display, ts_headline() is just going to do the best it can to produce
something useful. "Not highlight anything" wasn't deemed particularly
useful, and I agree with that judgment.
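
In practice, this means filtering with @@ before calling ts_headline. The following is a minimal sketch, assuming the 'english' configuration and an inline VALUES list standing in for a real table (both are illustrative and not taken from the thread):

```sql
-- Hypothetical inline documents; in practice this would be a table column.
SELECT doc,
       ts_headline('english', doc, q) AS headline
FROM (VALUES ('Alpha Beta Gama'),
             ('Alpha Gama Delta')) AS docs(doc),
     phraseto_tsquery('english', 'alpha gama') AS q
WHERE to_tsvector('english', doc) @@ q;  -- keep only documents that match the phrase
```

Only 'Alpha Gama Delta' should come back, since 'Alpha Beta Gama' has a word between the two query terms and so never reaches ts_headline.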

Also, once it's selected a document fragment to display, it will highlight
all words within that fragment that appear in the search query, whether or
not the particular occurrence is part of the match-if-any. Thus

regression=# select ts_headline('Alpha Beta Gama foo bar alpha gama', phraseto_tsquery ('alpha gama'));
ts_headline
----------------------------------------------------------------
<b>Alpha</b> Beta <b>Gama</b> foo bar <b>alpha</b> <b>gama</b>
(1 row)

Again, this is a value judgment about what's useful.

regards, tom lane

#3 Stas Obydionnov
stas@hellofyllo.com
In reply to: Tom Lane (#2)
Re: BUG #16744: ts_headline behaves incorrectly with <-> and proximity operators

Thanks Tom,

Probably I provided a bad example.
Here is another one from a similar bug that was opened a couple of years
ago and was not answered.

Assuming the following query:

SELECT ts_headline('English',
'This Commercial Bank does not have any Equity in Europe but European
Commercial Bank does',
to_tsquery('English','European <-> Commercial <-> Bank')::tsquery);

The returned result is:
This <b>Commercial</b> <b>Bank</b> does not have any Equity in Europe but
<b>European</b> <b>Commercial</b> <b>Bank</b> does

This highlights the words Commercial & Bank separately in addition
to European Commercial Bank.

However, the correct output expected should be:
This Commercial Bank does not have any Equity in Europe but <b>European</b>
<b>Commercial</b> <b>Bank</b> does

Which only highlights *European Commercial Bank* due to the <-> operator in
to_tsquery.

Regards,
Stas.

On Tue, Nov 24, 2020 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [...]

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stas Obydionnov (#3)
Re: BUG #16744: ts_headline behaves incorrectly with <-> and proximity operators

Stas Obydionnov <stas@hellofyllo.com> writes:

> Probably I provided a bad example.
> Here is another one from a similar bug that was opened a couple of years
> ago and was not answered.
>
> Assuming the following query:
>
> SELECT ts_headline('English',
> 'This Commercial Bank does not have any Equity in Europe but European
> Commercial Bank does',
> to_tsquery('English','European <-> Commercial <-> Bank')::tsquery);
>
> The returned result is:
> This <b>Commercial</b> <b>Bank</b> does not have any Equity in Europe but
> <b>European</b> <b>Commercial</b> <b>Bank</b> does
>
> This highlights the words Commercial & Bank separately in addition
> to European Commercial Bank.
>
> However, the correct output expected should be:
> This Commercial Bank does not have any Equity in Europe but <b>European</b>
> <b>Commercial</b> <b>Bank</b> does

[ shrug... ] Whether that's more correct than the current behavior
is a matter of opinion. As I said, the ts_headline code highlights
all matching words within whatever fragment it selects. It does
make an effort to locate a fragment that satisfies the query as
written, but that doesn't mean there won't be additional word
matches within the fragment. (In fact, if I'm reading the code
correctly, it actually gives preference to fragments having more
matching words, which is why you don't just get "<b>European</b>
<b>Commercial</b> <b>Bank</b>" here.) I think it's reasonable to
consider that highlighting the additional matches is a useful thing
to do, so I'm disinclined to change this longstanding behavior.

regards, tom lane