BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore

Started by Marek Lewczuk over 16 years ago, 6 messages, bugs
#1 Marek Lewczuk
marek@lewczuk.com

The following bug has been logged online:

Bug reference: 5075
Logged by: Marek Lewczuk
Email address: marek@lewczuk.com
PostgreSQL version: 8.4.0
Operating system: All
Description: Text Search parser does not identify xml tag when
attribute name's contains underscore
Details:

Please execute the following example:
select * from ts_debug('english', '<img width="182" height="120"
align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')

As a result you will see that <img/> is not identified as an XML tag, but
is instead split into words, blank spaces, etc. The reason is that the last
attribute, "test_aa", contains an underscore in its name; when the
underscore is removed, the img tag is properly identified as an XML tag.

The XML definition allows underscores in tag and attribute names.

#2 Euler Taveira de Oliveira
euler@timbira.com
In reply to: Marek Lewczuk (#1)
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore

Marek Lewczuk wrote:

> Please execute the following example:
> select * from ts_debug('english', '<img width="182" height="120"
> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
>
> As a result you will see that <img/> is not identified as an XML tag, but
> is instead split into words, blank spaces, etc. The reason is that the last
> attribute, "test_aa", contains an underscore in its name; when the
> underscore is removed, the img tag is properly identified as an XML tag.
>
> The XML definition allows underscores in tag and attribute names.

The problem is that we already allow it in tag names but not in attribute
names. So the proper fix is to allow the underscore when the state is
TPS_InTag; according to the XML spec [1], the underscore is a valid character
in attribute names.

A possible downside is that we don't have underscores in HTML attribute names.
In this case, should it fail? I don't think so but...

The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix, so I don't think
there is a problem with back-patching it.

[1]: http://www.w3.org/TR/REC-xml/#sec-common-syn
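
To illustrate the point above, here is a minimal Python sketch of a tag recognizer whose set of accepted name characters can be varied. It is not the PostgreSQL parser (which is written in C and is not reproduced here); the names `NAME_CHARS` and `is_tag` are hypothetical, and the grammar is deliberately simplified. It only shows why rejecting "_" in attribute names makes the whole tag fail to match, so that the input would fall back to being treated as ordinary text.

```python
import string

# Characters accepted in tag and attribute names in this toy model.
# Assumption: a simplified ASCII subset of the XML name rules.
NAME_CHARS = set(string.ascii_letters + string.digits + "-_:.")


def is_tag(text, name_chars=NAME_CHARS):
    """Return True if text is one <name attr="value" ...> or <.../> tag."""
    n = len(text)
    if n < 3 or text[0] != "<" or text[-1] != ">":
        return False
    i = 1
    # Tag name: one or more name characters.
    start = i
    while i < n and text[i] in name_chars:
        i += 1
    if i == start:
        return False
    while True:
        # Skip whitespace (including newlines) before the next attribute.
        while i < n and text[i].isspace():
            i += 1
        # End of tag reached.
        if text[i:] in (">", "/>"):
            return True
        # Attribute name: one or more name characters.
        start = i
        while i < n and text[i] in name_chars:
            i += 1
        if i == start:
            return False
        # The name must be followed by = and a quoted value. If "_" is not
        # a name character, the scan stops at "_" and this check fails.
        if text[i:i + 1] != "=" or text[i + 1:i + 2] not in ('"', "'"):
            return False
        quote = text[i + 1]
        end = text.find(quote, i + 2)
        if end == -1:
            return False
        i = end + 1


sample = ('<img width="182" height="120"\n'
          'align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
no_underscore = NAME_CHARS - {"_"}

print(is_tag(sample))                 # underscore allowed in names
print(is_tag(sample, no_underscore))  # underscore rejected in names
```

With the underscore in the accepted set the sample from the report is recognized as a tag; with it removed, the scan stops inside "test_aa" and the tag is rejected, which mirrors the reported symptom.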

--
Euler Taveira de Oliveira
http://www.timbira.com/

Attachments:

ts.diff (text/plain), +1 -0
#3 Robert Haas
robertmhaas@gmail.com
In reply to: Euler Taveira de Oliveira (#2)
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore

On Wed, Sep 23, 2009 at 7:31 PM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:

> Marek Lewczuk wrote:
>
>> Please execute the following example:
>> select * from ts_debug('english', '<img width="182" height="120"
>> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
>>
>> As a result you will see that <img/> is not identified as an XML tag, but
>> is instead split into words, blank spaces, etc. The reason is that the last
>> attribute, "test_aa", contains an underscore in its name; when the
>> underscore is removed, the img tag is properly identified as an XML tag.
>>
>> The XML definition allows underscores in tag and attribute names.
>
> The problem is that we already allow it in tag names but not in attribute
> names. So the proper fix is to allow the underscore when the state is
> TPS_InTag; according to the XML spec [1], the underscore is a valid character
> in attribute names.
>
> A possible downside is that we don't have underscores in HTML attribute names.
> In this case, should it fail? I don't think so but...
>
> The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix, so I don't think
> there is a problem with back-patching it.

This patch should probably be added to
https://commitfest.postgresql.org/action/commitfest_view/open so that
we don't lose track of it.

...Robert

#4 Selena Deckelmann
selenamarie@gmail.com
In reply to: Robert Haas (#3)
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore

On Sun, Sep 27, 2009 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Wed, Sep 23, 2009 at 7:31 PM, Euler Taveira de Oliveira
> <euler@timbira.com> wrote:
>
>> The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix, so I don't think
>> there is a problem with back-patching it.
>
> This patch should probably be added to
> https://commitfest.postgresql.org/action/commitfest_view/open so that
> we don't lose track of it.

Done.

--
http://chesnok.com/daily - me
http://endpoint.com - work

#5 Peter Eisentraut
peter_e@gmx.net
In reply to: Euler Taveira de Oliveira (#2)
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore

On Wed, 2009-09-23 at 20:31 -0300, Euler Taveira de Oliveira wrote:

> Marek Lewczuk wrote:
>
>> Please execute the following example:
>> select * from ts_debug('english', '<img width="182" height="120"
>> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
>>
>> As a result you will see that <img/> is not identified as an XML tag, but
>> is instead split into words, blank spaces, etc. The reason is that the last
>> attribute, "test_aa", contains an underscore in its name; when the
>> underscore is removed, the img tag is properly identified as an XML tag.
>>
>> The XML definition allows underscores in tag and attribute names.
>
> The problem is that we already allow it in tag names but not in attribute
> names. So the proper fix is to allow the underscore when the state is
> TPS_InTag; according to the XML spec [1], the underscore is a valid character
> in attribute names.
>
> A possible downside is that we don't have underscores in HTML attribute names.
> In this case, should it fail? I don't think so but...
>
> The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix, so I don't think
> there is a problem with back-patching it.

Fix committed to 8.3, 8.4, 8.5.

#6 Marek Lewczuk
newsy@lewczuk.com
In reply to: Peter Eisentraut (#5)
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore

On 2009-11-15 14:56, Peter Eisentraut wrote:

> On Wed, 2009-09-23 at 20:31 -0300, Euler Taveira de Oliveira wrote:
>
>> Marek Lewczuk wrote:
>>
>>> Please execute the following example:
>>> select * from ts_debug('english', '<img width="182" height="120"
>>> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
>
> Fix committed to 8.3, 8.4, 8.5.

Great. Thanks.

ML