BUG #4562: ts_headline() adds space when parsing url

Started by Denis Monsieur, over 17 years ago, 6 messages, bugs
#1 Denis Monsieur
dmonsieur@gmail.com

The following bug has been logged online:

Bug reference: 4562
Logged by: Denis Monsieur
Email address: dmonsieur@gmail.com
PostgreSQL version: 8.3.4
Operating system: Debian etch
Description: ts_headline() adds space when parsing url
Details:

My system is 8.3.4, but people in #postgresql with 8.3.5 have confirmed the
issue.

The problem is a space being added to text in the form of
http://some.url/path
Compare the output:

shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
ts_headline
-----------------
http://some.url
(1 row)

shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
ts_headline
-----------------------
http:// some.url/path
(1 row)

#2 gildas prime
g.prime@aeschemunex.com
In reply to: Denis Monsieur (#1)
Re: BUG #4562: ts_headline() adds space when parsing url

Same thing on 8.3.5 Win32

ester=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
ts_headline
-----------------------
http:// some.url/path
(1 row)

ester=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
ts_headline
-----------------
http://some.url
(1 row)


Gildas

-----Original Message-----
From: pgsql-bugs-owner@postgresql.org [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Denis Monsieur
Sent: Thursday, 4 December 2008 00:33
To: pgsql-bugs@postgresql.org
Subject: [BUGS] BUG #4562: ts_headline() adds space when parsing url


#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Denis Monsieur (#1)
Re: BUG #4562: ts_headline() adds space when parsing url

"Denis Monsieur" <dmonsieur@gmail.com> writes:

The problem is a space being added to text in the form of
http://some.url/path
Compare the output:

shs=# SELECT ts_headline('http://some.url', to_tsquery('sometext'));
ts_headline
-----------------
http://some.url
(1 row)

shs=# SELECT ts_headline('http://some.url/path', to_tsquery('sometext'));
ts_headline
-----------------------
http:// some.url/path
(1 row)

I looked into this, and it seems that the problem is that
generateHeadline() emits a space for any token marked as replace = 1.
I think it probably shouldn't emit anything at all. AFAICS the cases
where replace will get set are token types URL, TAG, NUMHWORD,
ASCIIHWORD, HWORD. For URL and the HWORD variants the space is
certainly undesirable, because these token types are just respecifying
text that is also covered by their component tokens. The only case
where you could make an argument that the space is useful is TAG,
as in

regression=# SELECT ts_headline('http<foo>blah', to_tsquery('sometext'));
ts_headline
-------------
http blah
(1 row)

But it seems to me to be at least as plausible that you should get
nothing as that you should get a space for a removed tag.

Comments?

regards, tom lane
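
The behaviour described above can be illustrated with a minimal sketch. This is not the PostgreSQL source: the Token struct, the url_tokens array and render_headline() are hypothetical names, and the token breakdown of 'http://some.url/path' (protocol, url, host, url_path, with only the url token marked replace) is an assumption chosen to be consistent with the reported output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a headline token;
 * not the struct used by PostgreSQL. */
typedef struct {
    const char *text;
    int replace;   /* 1 = composite token whose text is repeated by later tokens */
} Token;

/* Assumed token stream for 'http://some.url/path':
 * protocol, url (composite), host, url_path. */
static const Token url_tokens[] = {
    {"http://", 0},
    {"some.url/path", 1},
    {"some.url", 0},
    {"/path", 0},
};

/* Concatenate tokens into out. If space_for_replaced is nonzero, each
 * replace token becomes a single space (the reported behaviour);
 * otherwise a replace token contributes nothing (the proposed behaviour). */
static void render_headline(const Token *toks, size_t n, int space_for_replaced,
                            char *out, size_t outsz)
{
    assert(outsz > 0);
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        const char *piece = toks[i].text;
        if (toks[i].replace)
            piece = space_for_replaced ? " " : "";
        size_t len = strlen(piece);
        if (used + len >= outsz)   /* stop rather than overflow the buffer */
            break;
        memcpy(out + used, piece, len + 1);   /* copies the terminating NUL too */
        used += len;
    }
}
```

Under these assumptions, render_headline(url_tokens, 4, 1, buf, sizeof buf) produces the string with a space after "http://", as in the report, while passing 0 for the third argument produces the URL unchanged. A URL without a path would have no composite token in this model, which would explain why 'http://some.url' is unaffected.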

#4 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#3)
Re: BUG #4562: ts_headline() adds space when parsing url

This bug still exists in my testing.


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5 Oleg Bartunov
oleg@sai.msu.su
In reply to: Bruce Momjian (#4)
Re: BUG #4562: ts_headline() adds space when parsing url

On Wed, 14 Jan 2009, Bruce Momjian wrote:

This bug still exists in my testing.

We fixed all issues with ts_headline and will submit soon.


Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#6 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#3)
Re: BUG #4562: ts_headline() adds space when parsing url

This has been fixed and will be in the next 8.3 minor release.
