BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore
The following bug has been logged online:
Bug reference: 5075
Logged by: Marek Lewczuk
Email address: marek@lewczuk.com
PostgreSQL version: 8.4.0
Operating system: All
Description: Text Search parser does not identify xml tag when
attribute name's contains underscore
Details:
Please execute the following example:
select * from ts_debug('english', '<img width="182" height="120"
align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
As a result you will see that <img/> is not identified as an XML tag but
is instead split into words, blank spaces, etc. The reason is that the
last attribute, "test_aa", contains an underscore in its name; when the
underscore is removed, the img tag is properly identified as an XML tag.
The XML definition allows underscores in tag and attribute names.
Marek Lewczuk wrote:
> Please execute the following example:
> select * from ts_debug('english', '<img width="182" height="120"
> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
> As a result you will see that <img/> is not identified as an XML tag but
> is instead split into words, blank spaces, etc. The reason is that the
> last attribute, "test_aa", contains an underscore in its name; when the
> underscore is removed, the img tag is properly identified as an XML tag.
> The XML definition allows underscores in tag and attribute names.
The problem is that we already allow the underscore in tag names but not in
attribute names. So the proper fix is to allow the underscore when the state
is TPS_InTag; according to the XML spec [1], the underscore is a valid
character in attribute names.
A possible downside is that we don't have underscores in HTML attribute names.
In this case, should it fail? I don't think so but...
The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix so I think there
isn't a problem to back-patch it.
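To make the proposed change concrete, here is a minimal, self-contained C sketch of a tag scanner. It is not the code from ts.diff and not PostgreSQL's parser: the function names (in_tag_char, scan_tag), the allow_underscore flag, and the exact set of accepted characters are assumptions made for illustration. The flag only models the behavior before (false) and after (true) accepting '_' inside the tag body, which is roughly what the TPS_InTag state covers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: is c acceptable inside a tag, outside quoted values?
 * The character set here is illustrative, not the parser's actual set.
 */
bool in_tag_char(unsigned char c, bool allow_underscore)
{
    if (isalnum(c) || isspace(c))
        return true;
    if (c == '_')
        return allow_underscore;    /* the behavior under discussion */
    return c == '=' || c == '-' || c == '/' || c == ':' || c == '.' || c == '#';
}

/*
 * Return the length of the tag that starts at s, or 0 if s does not start
 * with something this sketch accepts as a tag.
 */
size_t scan_tag(const char *s, bool allow_underscore)
{
    size_t i = 0;

    assert(s != NULL);

    if (s[i] != '<')
        return 0;
    i++;
    if (s[i] == '/')                /* closing tag such as </img> */
        i++;
    if (!isalpha((unsigned char) s[i]))
        return 0;

    /* Tag name: underscores are always accepted here, as in the thread. */
    while (isalnum((unsigned char) s[i]) || s[i] == '_' || s[i] == '-' ||
           s[i] == ':' || s[i] == '.')
        i++;

    /* Tag body: attribute names, '=', and quoted values up to '>'. */
    for (;;)
    {
        unsigned char c = (unsigned char) s[i];

        if (c == '\0')
            return 0;               /* input ended before '>' */
        if (c == '>')
            return i + 1;
        if (c == '"' || c == '\'')
        {
            /* Skip a quoted attribute value; any character is allowed. */
            i++;
            while (s[i] != '\0' && (unsigned char) s[i] != c)
                i++;
            if (s[i] == '\0')
                return 0;
            i++;
            continue;
        }
        if (!in_tag_char(c, allow_underscore))
            return 0;               /* not a tag under these rules */
        i++;
    }
}
```

With allow_underscore set to false, scan_tag rejects the img example from the report at the '_' in test_aa, so a caller would fall back to tokenizing it as ordinary text; with the flag set to true, the whole tag is consumed. A tag such as <my_tag> is accepted either way, mirroring the asymmetry between tag names and attribute names described above.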
[1]: http://www.w3.org/TR/REC-xml/#sec-common-syn
--
Euler Taveira de Oliveira
http://www.timbira.com/
Attachments:
ts.diff (text/plain, +1 -0)
On Wed, Sep 23, 2009 at 7:31 PM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:
> The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix so I think there
> isn't a problem to back-patch it.
This patch should probably be added to
https://commitfest.postgresql.org/action/commitfest_view/open so that
we don't lose track of it.
...Robert
On Sun, Sep 27, 2009 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> This patch should probably be added to
> https://commitfest.postgresql.org/action/commitfest_view/open so that
> we don't lose track of it.
Done.
--
http://chesnok.com/daily - me
http://endpoint.com - work
On Wed, 2009-09-23 at 20:31 -0300, Euler Taveira de Oliveira wrote:
> The problem exists in 8.3, 8.4 and HEAD. It is a trivial fix so I think there
> isn't a problem to back-patch it.
Fix committed to 8.3, 8.4, 8.5.
On 2009-11-15 14:56, Peter Eisentraut wrote:
> Fix committed to 8.3, 8.4, 8.5.
Great. Thanks.
ML