possible encoding issues with libxml2 functions
Hi
Today I played with xml_recv function and with xml processing functions.
xml_recv function ensures correct encoding from document encoding to server
encoding. But the decl section holds original encoding info - that should
be obsolete after encoding. Sometimes we solve this issue by removing decl
section - see the xml_out function.
Sometimes we don't do it - lot of functions uses direct conversion from
xmltype to xmlChar. Wrong encoding in decl section can breaks libxml2
parser with error
ERROR: could not parse XML document
DETAIL: input conversion failed due to input error, bytes 0x88 0x3C 0x2F
0x72
line 1: switching encoding: encoder error
This error is not often - but it is hard to find it - because there is
small but important difference between printed XML and used XML.
There are possible two fixes
a) clean decl on input - the encoding info can be removed from decl part
b) use xml_out_internal everywhere before transformation to
xmlChar. pg_xmlCharStrndup can be good candidate.
Comments?
Regards
Pavel
On Mon, Feb 20, 2017 at 07:48:18PM +0100, Pavel Stehule wrote:
Today I played with xml_recv function and with xml processing functions.
xml_recv function ensures correct encoding from document encoding to server
encoding. But the decl section holds original encoding info - that should
be obsolete after encoding. Sometimes we solve this issue by removing decl
section - see the xml_out function.Sometimes we don't do it - lot of functions uses direct conversion from
xmltype to xmlChar.
There are possible two fixes
a) clean decl on input - the encoding info can be removed from decl part
b) use xml_out_internal everywhere before transformation to
xmlChar. pg_xmlCharStrndup can be good candidate.
I'd prefer (a) if the xml type were a new feature, because no good can come of
storing an encoding in each xml field when we know the actual encoding is the
database encoding. However, if you implemented (a), we'd still see untreated
values brought over via pg_upgrade. Therefore, I would try (b) first. I
suspect the intent of xml_parse() was to implement (b); it will be interesting
to see your test case that malfunctions.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Mon, Feb 20, 2017 at 07:48:18PM +0100, Pavel Stehule wrote:
Today I played with xml_recv function and with xml processing functions.
xml_recv function ensures correct encoding from document encoding to
server
encoding. But the decl section holds original encoding info - that should
be obsolete after encoding. Sometimes we solve this issue by removingdecl
section - see the xml_out function.
Sometimes we don't do it - lot of functions uses direct conversion from
xmltype to xmlChar.There are possible two fixes
a) clean decl on input - the encoding info can be removed from decl part
b) use xml_out_internal everywhere before transformation to
xmlChar. pg_xmlCharStrndup can be good candidate.I'd prefer (a) if the xml type were a new feature, because no good can
come of
storing an encoding in each xml field when we know the actual encoding is
the
database encoding. However, if you implemented (a), we'd still see
untreated
values brought over via pg_upgrade. Therefore, I would try (b) first. I
suspect the intent of xml_parse() was to implement (b); it will be
interesting
to see your test case that malfunctions.
I looked there again and I found so this issue is related to xpath function
only
Functions based on xml_parse are working without problems. xpath_internal
uses own direct xmlCtxtReadMemory without correct encoding sanitation.
so fix is pretty simple
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index f81cf489d2..89aae48cb3 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3874,9 +3874,11 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
ns_count = 0;
}
- datastr = VARDATA(data);
- len = VARSIZE(data) - VARHDRSZ;
+ datastr = xml_out_internal(data, 0);
+ len = strlen(datastr);
+
xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ;
+
if (xpath_len == 0)
ereport(ERROR,
(errcode(ERRCODE_DATA_EXCEPTION),
Regards
Pavel
On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Mon, Feb 20, 2017 at 07:48:18PM +0100, Pavel Stehule wrote:
There are possible two fixes
a) clean decl on input - the encoding info can be removed from decl part
b) use xml_out_internal everywhere before transformation to
xmlChar. pg_xmlCharStrndup can be good candidate.I'd prefer (a) if the xml type were a new feature, because no good can
come of
storing an encoding in each xml field when we know the actual encoding is
the
database encoding. However, if you implemented (a), we'd still see
untreated
values brought over via pg_upgrade. Therefore, I would try (b) first. I
suspect the intent of xml_parse() was to implement (b); it will be
interesting
to see your test case that malfunctions.I looked there again and I found so this issue is related to xpath function
onlyFunctions based on xml_parse are working without problems. xpath_internal
uses own direct xmlCtxtReadMemory without correct encoding sanitation.so fix is pretty simple
Please add a test case.
--- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -3874,9 +3874,11 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces, ns_count = 0; }- datastr = VARDATA(data); - len = VARSIZE(data) - VARHDRSZ; + datastr = xml_out_internal(data, 0);
Why not use xml_parse() instead of calling xmlCtxtReadMemory() directly? The
answer is probably in the archives, because someone understood the problem
enough to document "Some XML-related functions may not work at all on
non-ASCII data when the server encoding is not UTF-8. This is known to be an
issue for xpath() in particular."
+ len = strlen(datastr); + xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ; +
The two lines of functional change don't create a cause for more newlines, so
don't add these two newlines.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Mon, Feb 20, 2017 at 07:48:18PM +0100, Pavel Stehule wrote:
There are possible two fixes
a) clean decl on input - the encoding info can be removed from decl
part
b) use xml_out_internal everywhere before transformation to
xmlChar. pg_xmlCharStrndup can be good candidate.I'd prefer (a) if the xml type were a new feature, because no good can
come of
storing an encoding in each xml field when we know the actual encodingis
the
database encoding. However, if you implemented (a), we'd still see
untreated
values brought over via pg_upgrade. Therefore, I would try (b)first. I
suspect the intent of xml_parse() was to implement (b); it will be
interesting
to see your test case that malfunctions.I looked there again and I found so this issue is related to xpath
function
only
Functions based on xml_parse are working without problems. xpath_internal
uses own direct xmlCtxtReadMemory without correct encoding sanitation.so fix is pretty simple
Please add a test case.
It needs a application - currently there is not possibility to import XML
document via recv API :(
I wrote a pgimportdoc utility, but it is not part of core
--- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -3874,9 +3874,11 @@ xpath_internal(text *xpath_expr_text, xmltype*data,
ArrayType *namespaces,
ns_count = 0;
}- datastr = VARDATA(data); - len = VARSIZE(data) - VARHDRSZ; + datastr = xml_out_internal(data, 0);Why not use xml_parse() instead of calling xmlCtxtReadMemory() directly?
The
answer is probably in the archives, because someone understood the problem
enough to document "Some XML-related functions may not work at all on
non-ASCII data when the server encoding is not UTF-8. This is known to be
an
issue for xpath() in particular."
Probably there are two possible issues
1. what I touched - recv function does encoding to database encoding - but
document encoding is not updated.
2. there are not possibility to encode from document encoding to database
encoding.
+ len = strlen(datastr); + xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ; +The two lines of functional change don't create a cause for more newlines,
so
don't add these two newlines.
ok
2017-03-12 22:26 GMT+01:00 Pavel Stehule <pavel.stehule@gmail.com>:
2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Mon, Feb 20, 2017 at 07:48:18PM +0100, Pavel Stehule wrote:
There are possible two fixes
a) clean decl on input - the encoding info can be removed from decl
part
b) use xml_out_internal everywhere before transformation to
xmlChar. pg_xmlCharStrndup can be good candidate.I'd prefer (a) if the xml type were a new feature, because no good can
come of
storing an encoding in each xml field when we know the actualencoding is
the
database encoding. However, if you implemented (a), we'd still see
untreated
values brought over via pg_upgrade. Therefore, I would try (b)first. I
suspect the intent of xml_parse() was to implement (b); it will be
interesting
to see your test case that malfunctions.I looked there again and I found so this issue is related to xpath
function
only
Functions based on xml_parse are working without problems.
xpath_internal
uses own direct xmlCtxtReadMemory without correct encoding sanitation.
so fix is pretty simple
Please add a test case.
It needs a application - currently there is not possibility to import XML
document via recv API :(I wrote a pgimportdoc utility, but it is not part of core
--- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -3874,9 +3874,11 @@ xpath_internal(text *xpath_expr_text, xmltype*data,
ArrayType *namespaces,
ns_count = 0;
}- datastr = VARDATA(data); - len = VARSIZE(data) - VARHDRSZ; + datastr = xml_out_internal(data, 0);Why not use xml_parse() instead of calling xmlCtxtReadMemory() directly?
The
answer is probably in the archives, because someone understood the problem
enough to document "Some XML-related functions may not work at all on
non-ASCII data when the server encoding is not UTF-8. This is known to be
an
issue for xpath() in particular."Probably there are two possible issues
1. what I touched - recv function does encoding to database encoding - but
document encoding is not updated.2. there are not possibility to encode from document encoding to database
encoding.+ len = strlen(datastr); + xpath_len = VARSIZE(xpath_expr_text) - VARHDRSZ; +The two lines of functional change don't create a cause for more
newlines, so
don't add these two newlines.ok
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 1908b13db5..2786d5b1cb 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3874,8 +3874,8 @@ xpath_internal(text *xpath_expr_text, xmltype
*data, ArrayType *namespaces,
ns_count = 0;
}
- datastr = VARDATA(data);
- len = VARSIZE(data) - VARHDRSZ;
+ datastr = xml_out_internal(data, 0);
+ len = strlen(datastr);
xpath_len = VARSIZE_ANY_EXHDR(xpath_expr_text);
if (xpath_len == 0)
ereport(ERROR,
Attachments:
xpath-bugfix.patchtext/x-patch; charset=US-ASCII; name=xpath-bugfix.patchDownload
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 1908b13db5..2786d5b1cb 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3874,8 +3874,8 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
ns_count = 0;
}
- datastr = VARDATA(data);
- len = VARSIZE(data) - VARHDRSZ;
+ datastr = xml_out_internal(data, 0);
+ len = strlen(datastr);
xpath_len = VARSIZE_ANY_EXHDR(xpath_expr_text);
if (xpath_len == 0)
ereport(ERROR,
On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
Please add a test case.
It needs a application - currently there is not possibility to import XML
document via recv API :(
I think xml_in() can create every value that xml_recv() can create; xml_recv()
is just more convenient given diverse source encodings. If you make your
application store the value into a table, does "pg_dump --inserts" emit code
that reproduces the same value? If so, you can use that in your test case.
If not, please provide precise instructions (code, SQL commands) for
reproducing the bug manually.
Why not use xml_parse() instead of calling xmlCtxtReadMemory() directly?
The
answer is probably in the archives, because someone understood the problem
enough to document "Some XML-related functions may not work at all on
non-ASCII data when the server encoding is not UTF-8. This is known to be
an
issue for xpath() in particular."Probably there are two possible issues
Would you research in the archives to confirm?
1. what I touched - recv function does encoding to database encoding - but
document encoding is not updated.
Using xml_parse() would fix that, right?
2. there are not possibility to encode from document encoding to database
encoding.
Both xml_in() and xml_recv() require the value to be representable in the
database encoding, so I don't think this particular problem can remain by the
time we reach an xpath_internal() call.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-03-17 4:23 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
Please add a test case.
I am sending test case.
I am afraid so this cannot be part of regress tests due strong dependency
on locale win1250.
It needs a application - currently there is not possibility to import XML
document via recv API :(I think xml_in() can create every value that xml_recv() can create;
xml_recv()
is just more convenient given diverse source encodings. If you make your
application store the value into a table, does "pg_dump --inserts" emit
code
that reproduces the same value? If so, you can use that in your test case.
If not, please provide precise instructions (code, SQL commands) for
reproducing the bug manually.
I was wrong - both methods produces broken internal XML document
<?xml version="1.0" encoding="windows-1250"?>
<enprimeur>
<vino>
<id>1</id>
<nazev>Alter Ego de Palmer</nazev>
<vyrobce>63</vyrobce>
<rocnik>2012</rocnik>
<cena0375>0</cena0375>
<cena1500>0</cena1500>
<cena3000>0</cena3000>
<cena6000>0</cena6000>
<cena0750>1425</cena0750>
<cenastart>1085</cenastart>
<min0375>0</min0375>
<min0750>0</min0750>
<odrudy>51 % Merlot, 40 % Cabernet Sauvignon,9 %
Petit Verdot</odrudy>
<bestin>2017 - 2026</bestin>
<klas>2</klas>
<sklad0375>0</sklad0375>
<sklad0750>0</sklad0750>
<sklad1500>0</sklad1500>
<sklad3000>0</sklad3000>
<sklad6000>0</sklad6000>
<alk>13,4 %</alk>
<remark>Premiant oblasti Margaux Ch. Palmer
tentokrát ve svých obou vínech těžil z dokonale zralého Merlotu, kterého do
svých směsí naládoval více než Cabernet Sauvignonu. 2. víno Alter Ego de
Palmer 2012 se může p
ochlubit skvělou kondicí. Není pochyb o tom, že na Ch. Palmer sklízeli z
vinice dokonale zralé hrozny. Alter Ego je mohutné, husté a syté víno,
nabízí objemnou dávku ovoce, voní po ostružinách, chuť je krásně kulatá,
pevná a svěží, taniny
vykazují fenomenální strukturu, takto krásné spojení všech aromatických a
chuťových složek s příměsí úžasných kyselin a alkoholu je radost mít ve
sklepě.</remark>
<rating>Robert Parker: /100
TOPVINO SCORE: 92-94/100
James Suckling: 92-93/100
Wine Spectator: 90-93/100</rating>
<zalozeno></zalozeno>
<rozloha></rozloha>
<stari></stari>
<puda></puda>
<produkce></produkce>
<zrani></zrani>
<active>1</active>
<stitky>
<stitek>8</stitek>
<stitek>1</stitek>
</stitky>
</vino>
</enprimeur>
Document is well encoded from win1250 to UTF8 - it is well readable, but
the header is wrong - it is not in win1250 now (after encoding). This xml
is attached (in original form (win1250 encoding)).
Import with
create table target(x xml);
\set xml `cat ~/Downloads/e6.xml`
postgres=# set client_encoding to win1250;
postgres=# insert into target values(:'xml');
disconnect or reset client encoding
Document should be correctly inserted.
Then run a query
postgres=# select * from target, xpath('/enprimeur/vino', x);
ERROR: could not parse XML document
DETAIL: input conversion failed due to input error, bytes 0x88 0x73 0x6B
0x75
input conversion failed due to input error, bytes 0x88 0x73 0x6B 0x75
encoder error
line 25: Premature end of data in tag remark line 25
line 25: Premature end of data in tag vino line 3
line 25: Premature end of data in tag enprimeur line 2
select xmltable.* from target, xmltable('/enprimeur/vino' passing x columns
id int, nazev text, remark text);
XMLTABLE is not vulnerable against this bug and result should be correct.
when you do
select x from target
you can see a correct xml without header
but select x::text from target; shows wrong header
Why not use xml_parse() instead of calling xmlCtxtReadMemory()
directly?
The
answer is probably in the archives, because someone understood theproblem
enough to document "Some XML-related functions may not work at all on
non-ASCII data when the server encoding is not UTF-8. This is known tobe
an
issue for xpath() in particular."Probably there are two possible issues
Would you research in the archives to confirm?
1. what I touched - recv function does encoding to database encoding -
but
document encoding is not updated.
Using xml_parse() would fix that, right?
It should to help, because it cuts a header - but it does little bit more
work, than we would - it check if xml is doc or not too.
In mostly functions there the important work is done in parse_xml_decl
function
One possible fix - and similar technique is used more times in xml.c code
.. xmlroot
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 2f87151..45faf83 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3834,6 +3834,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
Datum *ns_names_uris;
bool *ns_names_uris_nulls;
int ns_count;
+ size_t header_len;
/*
* Namespace mappings are passed as text[]. If an empty array is passed
@@ -3882,6 +3883,10 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
errmsg("empty XPath expression")));
string = pg_xmlCharStrndup(datastr, len);
+
+ /* remove header */
+ parse_xml_decl(string, &header_len, NULL, NULL, NULL);
+
xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3898,7 +3903,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
header_len, NULL, NULL, 0);
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
another possibility is using xml_out_internal - that is used in XMLTABLE
function - what was initial fix.
xml_out_internal uses parse_xml_decl too, but it is little bit more
expensive due call print_xml_decl that has not any effect in this case
(where only encoding is interesting) - but it can has sense, when server
encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 -
so now, I am think the correct solution should be using xml_out_internal -
because it appends a header with server encoding to XML doc
Regards
Pavel
Show quoted text
2. there are not possibility to encode from document encoding to database
encoding.Both xml_in() and xml_recv() require the value to be representable in the
database encoding, so I don't think this particular problem can remain by
the
time we reach an xpath_internal() call.
Attachments:
On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
2017-03-17 4:23 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
Please add a test case.
I am sending test case.
I am afraid so this cannot be part of regress tests due strong dependency
on locale win1250.
Right. Based on that, I've distilled this portable test case:
-- Round-trip non-ASCII data through xpath().
DO $$
DECLARE
xml_declaration text := '<?xml version="1.0" encoding="ISO-8859-1"?>';
degree_symbol text;
res xml[];
BEGIN
-- Per the documentation, xpath() doesn't work on non-ASCII data when
-- the server encoding is not UTF8. The EXCEPTION block below,
-- currently dead code, will be relevant if we remove this limitation.
IF current_setting('server_encoding') <> 'UTF8' THEN
RAISE LOG 'skip: encoding % unsupported for xml',
current_setting('server_encoding');
RETURN;
END IF;
degree_symbol := convert_from('\xc2b0', 'UTF8');
res := xpath('text()', (xml_declaration ||
'<x>' || degree_symbol || '</x>')::xml);
IF degree_symbol <> res[1]::text THEN
RAISE 'expected % (%), got % (%)',
degree_symbol, convert_to(degree_symbol, 'UTF8'),
res[1], convert_to(res[1]::text, 'UTF8');
END IF;
EXCEPTION
-- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has no equivalent in encoding "LATIN8"
WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
-- default conversion function for encoding "UTF8" to "MULE_INTERNAL" does not exist
WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
END
$$;
Why not use xml_parse() instead of calling xmlCtxtReadMemory()
directly? The answer is probably in the archives, because someone
understood the problem enough to document "Some XML-related functions
may not work at all on non-ASCII data when the server encoding is not
UTF-8. This is known to be an issue for xpath() in particular."Probably there are two possible issues
Would you research in the archives to confirm?
1. what I touched - recv function does encoding to database encoding -
but document encoding is not updated.Using xml_parse() would fix that, right?
It should to help, because it cuts a header - but it does little bit more
work, than we would - it check if xml is doc or not too.
That's true. xpath() currently works on both DOCUMENT and CONTENT xml values.
If xpath() used xml_parse(), this would stop working:
SELECT xpath('text()', XMLPARSE (DOCUMENT '<!DOCTYPE foo []><x>bar</x>'));
One possible fix - and similar technique is used more times in xml.c code
.. xmlroot
+ /* remove header */ + parse_xml_decl(string, &header_len, NULL, NULL, NULL);
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0); + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
another possibility is using xml_out_internal - that is used in XMLTABLE
function - what was initial fix.xml_out_internal uses parse_xml_decl too, but it is little bit more
expensive due call print_xml_decl that has not any effect in this case
(where only encoding is interesting) - but it can has sense, when server
encoding is not UTF8, because in this case, xmlstr is not encoded to UTF8 -
so now, I am think the correct solution should be using xml_out_internal -
because it appends a header with server encoding to XML doc
As you may have been implying here, your patch never adds an xml declaration
that bears an encoding. (That is because it passes targetencoding=0 to
xml_out_internal().) If we were going to fix xpath() to support non-ASCII
characters in non-UTF8 databases, we would not do that by adding xml
declarations. We would do our own conversion to UTF8, like xml_parse() does.
In that light, I like this parse_xml_decl() approach better. Would you update
your patch to use it and to add the test case I give above?
This change can make things worse in a non-UTF8 database. The following
happens to work today in a LATIN1 database, but it will cease to work:
SELECT xpath('string-length()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' ||
'<x>' || chr(176) || '</x>')::xml);
However, extracting actual text already gives wrong answers, because we treat
the UTF8 response from libxml2 as though it were LATIN1:
SELECT xpath('string()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' ||
'<x>' || chr(176) || '</x>')::xml);
I plan to back-patch this. The documentation says "Some XML-related functions
may not work at all on non-ASCII data when the server encoding is not
UTF-8. This is known to be an issue for xpath() in particular." Your patch
fixes a case where even a UTF8 database mishandles non-ASCII data. (Sometimes
it mishandles the data silently. Other times, it raises an error.) It's
worth further breaking a case where we already disclaim support to repair a
supported case. Other opinions?
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi
2017-08-08 2:10 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
2017-03-17 4:23 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
2017-03-12 21:57 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
2017-03-12 0:56 GMT+01:00 Noah Misch <noah@leadboat.com>:
Please add a test case.
I am sending test case.
I am afraid so this cannot be part of regress tests due strong dependency
on locale win1250.Right. Based on that, I've distilled this portable test case:
-- Round-trip non-ASCII data through xpath().
DO $$
DECLARE
xml_declaration text := '<?xml version="1.0"
encoding="ISO-8859-1"?>';
degree_symbol text;
res xml[];
BEGIN
-- Per the documentation, xpath() doesn't work on non-ASCII data
when
-- the server encoding is not UTF8. The EXCEPTION block below,
-- currently dead code, will be relevant if we remove this
limitation.
IF current_setting('server_encoding') <> 'UTF8' THEN
RAISE LOG 'skip: encoding % unsupported for xml',
current_setting('server_encoding');
RETURN;
END IF;degree_symbol := convert_from('\xc2b0', 'UTF8');
res := xpath('text()', (xml_declaration ||
'<x>' || degree_symbol || '</x>')::xml);
IF degree_symbol <> res[1]::text THEN
RAISE 'expected % (%), got % (%)',
degree_symbol, convert_to(degree_symbol, 'UTF8'),
res[1], convert_to(res[1]::text, 'UTF8');
END IF;
EXCEPTION
-- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has
no equivalent in encoding "LATIN8"
WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
-- default conversion function for encoding "UTF8" to
"MULE_INTERNAL" does not exist
WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
END
$$
yes, probably libXML2 try to do check from utf8 encoding to header
specified encoding.
Why not use xml_parse() instead of calling xmlCtxtReadMemory()
directly? The answer is probably in the archives, because someone
understood the problem enough to document "Some XML-relatedfunctions
may not work at all on non-ASCII data when the server encoding is
not
UTF-8. This is known to be an issue for xpath() in particular."
Probably there are two possible issues
Would you research in the archives to confirm?
1. what I touched - recv function does encoding to database encoding
-
but document encoding is not updated.
Using xml_parse() would fix that, right?
It should to help, because it cuts a header - but it does little bit more
work, than we would - it check if xml is doc or not too.That's true. xpath() currently works on both DOCUMENT and CONTENT xml
values.
If xpath() used xml_parse(), this would stop working:SELECT xpath('text()', XMLPARSE (DOCUMENT '<!DOCTYPE foo
[]><x>bar</x>'));One possible fix - and similar technique is used more times in xml.c code
.. xmlroot+ /* remove header */ + parse_xml_decl(string, &header_len, NULL, NULL, NULL);- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL,
0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
another possibility is using xml_out_internal - that is used in XMLTABLE
function - what was initial fix.xml_out_internal uses parse_xml_decl too, but it is little bit more
expensive due call print_xml_decl that has not any effect in this case
(where only encoding is interesting) - but it can has sense, when server
encoding is not UTF8, because in this case, xmlstr is not encoded toUTF8 -
so now, I am think the correct solution should be using xml_out_internal
-
because it appends a header with server encoding to XML doc
As you may have been implying here, your patch never adds an xml
declaration
that bears an encoding. (That is because it passes targetencoding=0 to
xml_out_internal().) If we were going to fix xpath() to support non-ASCII
characters in non-UTF8 databases, we would not do that by adding xml
declarations. We would do our own conversion to UTF8, like xml_parse()
does.
In that light, I like this parse_xml_decl() approach better. Would you
update
your patch to use it and to add the test case I give above?
I need to do some recapitulation (my analyses was wrong):
a) all values created by xml_in iterface are in database encoding - input
string is stored without any change. xml_parse is called only due
validation.
b) inside xml_parse, the input is converted to UTF8, and document is read
by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or
by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8"
and removed decl section.
So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document,
wellformated_xml) the database encoding is not important
c) xml_recv function does validation by xml_parse and translation to
database encoding.
Now I don't see a difference between @b and @c - so my hypotheses about
necessity to use recv interface was wrong.
So Looks like libXML2 behave - when we push document with encoding decl,
then this library expects document in this encoding
So test case is simple
-- should to work
select xpath('/enprimeur/vino/id'/,
'<enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
-- should to fail
select xpath('/enprimeur/vino/id'/, '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
I didn't expect this error on recv API, because we do implicit encoding to
database encoding - and I expected implicit removing of encoding
declaration. The problematic char is ok, the issue is different length
Other functions is working - so I have a expectation so xpath should to work
postgres=# select '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'::xml;
┌────────────────────────────────────────────────────────────────────┐
│ xml │
╞════════════════════════════════════════════════════════════════════╡
│ <enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur> │
└────────────────────────────────────────────────────────────────────┘
(1 row)
postgres=# select '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'
is document
postgres-# ;
┌──────────┐
│ ?column? │
╞══════════╡
│ t │
└──────────┘
(1 row)
postgres=# select xml_is_well_formed_document('<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌─────────────────────────────┐
│ xml_is_well_formed_document │
╞═════════════════════════════╡
│ t │
└─────────────────────────────┘
(1 row)
postgres=# select xml_is_well_formed_content('<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌────────────────────────────┐
│ xml_is_well_formed_content │
╞════════════════════════════╡
│ t │
└────────────────────────────┘
(1 row)
d) XMLEXISTS doesn't work too .. because it share code wit xpath function
This change can make things worse in a non-UTF8 database. The following
happens to work today in a LATIN1 database, but it will cease to work:
SELECT xpath('string-length()', ('<?xml version="1.0"
encoding="ISO-8859-1"?>' ||
'<x>' || chr(176) || '</x>')::xml);However, extracting actual text already gives wrong answers, because we
treat
the UTF8 response from libxml2 as though it were LATIN1:SELECT xpath('string()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' ||
'<x>' || chr(176) || '</x>')::xml);I plan to back-patch this. The documentation says "Some XML-related
functions
may not work at all on non-ASCII data when the server encoding is not
UTF-8. This is known to be an issue for xpath() in particular." Your patch
fixes a case where even a UTF8 database mishandles non-ASCII data.
(Sometimes
it mishandles the data silently. Other times, it raises an error.) It's
worth further breaking a case where we already disclaim support to repair a
supported case. Other opinions?
Isn't the most correct solution to call xml_parse function?
Regards
Pavel
Isn't the most correct solution to call xml_parse function?
I am reply to self. Probably not.
Now, I am thinking so I found a reason of this issue. The document
processed in xpath_internal is passed to libXML2 by
doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
We don't pass a encoding parameter so libXML2 expecting "UTF8" or expecting
correct encoding decl in XML document. When we pass incorrect document -
XML is in database encoding, but encoding decl is original, then it should
to fail.
the regress test can looks like your (but all chars are valid there)
postgres=# do $$
declare str text;
begin
if current_setting('server_encoding') <> 'UTF8' then return; end if;
str = '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
|| convert_from('\xc588', 'UTF8')
|| '</remark></vino></enprimeur>';
raise notice '%', xpath('/enprimeur/vino/id', str::xml);
end; $$;
ERROR: could not parse XML document
DETAIL: input conversion failed due to input error, bytes 0x88 0x3C 0x2F
0x72
line 1: switching encoding: encoder error
�</remark></vino></enprimeur>
^
CONTEXT: PL/pgSQL function inline_code_block line 8 at RAISE
After correct fix:
doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
pg_encoding_to_char(GetDatabaseEncoding()),
0);
It is working
postgres=# do $$
declare str text;
begin
if current_setting('server_encoding') <> 'UTF8' then return; end if;
str = '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
|| convert_from('\xc588', 'UTF8')
|| '</remark></vino></enprimeur>';
raise notice '%', xpath('/enprimeur/vino/id', str::xml);
end; $$;
NOTICE: {<id>909</id>}
DO
This fix should be apply to xmltable function too.
patch attached
It doesn't fix xpath and xmltable functions issues when server encoding is
not UTF8. Looks so XPath functions from libXML2 requires UTF8 encoded
strings and the result is in UTF8 too - so result should be recoded to
server encoding.
I didn't find any info how to enable libXML2 XPath functions for other
encoding than UTF8 :( ??
Regards
Pavel
Show quoted text
Regards
Pavel
Attachments:
encoding_for_xmlCtxtReadMemory.patchtext/x-patch; charset=US-ASCII; name=encoding_for_xmlCtxtReadMemory.patchDownload
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index c47624eff6..b09ce23cdb 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3911,7 +3911,9 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
+ pg_encoding_to_char(GetDatabaseEncoding()), 0);
+
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
@@ -4242,18 +4244,14 @@ XmlTableSetDocument(TableFuncScanState *state, Datum value)
xtCxt = GetXmlTableBuilderPrivateData(state, "XmlTableSetDocument");
- /*
- * Use out function for casting to string (remove encoding property). See
- * comment in xml_out.
- */
- str = xml_out_internal(xmlval, 0);
-
- length = strlen(str);
+ str = VARDATA(xmlval);
+ length = VARSIZE(xmlval) - VARHDRSZ;
xstr = pg_xmlCharStrndup(str, length);
PG_TRY();
{
- doc = xmlCtxtReadMemory(xtCxt->ctxt, (char *) xstr, length, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(xtCxt->ctxt, (char *) xstr, length, NULL,
+ pg_encoding_to_char(GetDatabaseEncoding()), 0);
if (doc == NULL || xtCxt->xmlerrcxt->err_occurred)
xml_ereport(xtCxt->xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index bcc585d427..6a43896d40 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1452,3 +1452,24 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
14
(4 rows)
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE str text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should to work
+ RAISE NOTICE '%', xpath('/enprimeur/vino/id', str::xml);
+ RAISE NOTICE '%', (SELECT id FROM xmltable('/enprimeur/vino' PASSING (str::xml) COLUMNS id int));
+END; $$;
+NOTICE: {<id>909</id>}
+NOTICE: 909
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index eb4687fb09..97a3aa9de2 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -558,3 +558,23 @@ INSERT INTO xmltest2 VALUES('<d><r><dc>2</dc></r></d>', 'D');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable('/d/r' PASSING x COLUMNS a int PATH '' || lower(_path) || 'c');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH '.');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH 'x' DEFAULT ascii(_path) - 54);
+
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE str text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should to work
+ RAISE NOTICE '%', xpath('/enprimeur/vino/id', str::xml);
+ RAISE NOTICE '%', (SELECT id FROM xmltable('/enprimeur/vino' PASSING (str::xml) COLUMNS id int));
+END; $$;
Hi
I am sending some POC - it does support XPATH and XMLTABLE for not UTF8
server encoding.
In this case, all strings should be converted to UTF8 before call libXML2
functions, and result should be converted back from UTF8.
I found some previous experiments https://marc.info/?l=pgsql-bug
s&m=123407176408688
Note: I got some information so used xmlNodeDump function is deprecated -
so we should to replace it too sometime.
Regards
Pavel
Attachments:
xpath-encoding-nonutf8-support.patchtext/x-patch; charset=US-ASCII; name=xpath-encoding-nonutf8-support.patchDownload
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index c47624eff6..d02cec88ab 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -147,6 +147,7 @@ static int xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
ArrayBuildState *astate,
PgXmlErrorContext *xmlerrcxt);
static xmlChar *pg_xmlCharStrndup(char *str, size_t len);
+static xmlChar *pg_xmlCharUtf8(char *str, size_t len);
#endif /* USE_LIBXML */
static void xmldata_root_element_start(StringInfo result, const char *eltname,
@@ -459,8 +460,28 @@ cstring_to_xmltype(const char *string)
static xmltype *
xmlBuffer_to_xmltype(xmlBufferPtr buf)
{
- return (xmltype *) cstring_to_text_with_len((const char *) xmlBufferContent(buf),
+ if (GetDatabaseEncoding() != PG_UTF8)
+ {
+ char *utf8str = (char *) xmlBufferContent(buf);
+ char *str;
+ xmltype *result;
+
+ str = (char *) pg_do_encoding_conversion((unsigned char *) utf8str,
+ xmlBufferLength(buf),
+ PG_UTF8,
+ GetDatabaseEncoding());
+
+ Assert(str != utf8str);
+ result = (xmltype *) cstring_to_text(str);
+ pfree(str);
+
+ return result;
+ }
+ else
+ {
+ return (xmltype *) cstring_to_text_with_len((const char *) xmlBufferContent(buf),
xmlBufferLength(buf));
+ }
}
#endif
@@ -1176,6 +1197,28 @@ pg_xmlCharStrndup(char *str, size_t len)
}
/*
+ * LibXML2 internal encoding is UTF8. Sometimes LibXML2 enforce
+ * encoding to UTF8 by self, sometimes it expects UTF8 strings.
+ * This function is used for encoding from database encoding to
+ * UTF8.
+ */
+static xmlChar *
+pg_xmlCharUtf8(char *str, size_t len)
+{
+ char *result;
+
+ result = (char *) pg_do_encoding_conversion((unsigned char *) str,
+ len,
+ GetDatabaseEncoding(),
+ PG_UTF8);
+
+ if (result != str)
+ return BAD_CAST result;
+
+ return pg_xmlCharStrndup(str, len);
+}
+
+/*
* str is the null-terminated input string. Remaining arguments are
* output arguments; each can be NULL if value is not wanted.
* version and encoding are returned as locally-palloc'd strings.
@@ -3714,9 +3757,16 @@ xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
}
else
{
- xmlChar *str;
+ xmlChar *utf8str;
+ char *str = NULL;
+
+ utf8str = xmlXPathCastNodeToString(cur);
+
+ str = (char *) pg_do_encoding_conversion((unsigned char *) utf8str,
+ strlen((char *) utf8str),
+ PG_UTF8,
+ GetDatabaseEncoding());
- str = xmlXPathCastNodeToString(cur);
PG_TRY();
{
/* Here we rely on XML having the same representation as TEXT */
@@ -3727,11 +3777,18 @@ xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
}
PG_CATCH();
{
- xmlFree(str);
+ if (str != (char *) utf8str)
+ pfree(str);
+
+ xmlFree(utf8str);
PG_RE_THROW();
}
PG_END_TRY();
- xmlFree(str);
+
+ if (str != (char *) utf8str)
+ pfree(str);
+
+ xmlFree(utf8str);
}
return result;
@@ -3758,6 +3815,7 @@ xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
Datum datum;
Oid datumtype;
char *result_str;
+ char *str = NULL;
switch (xpathobj->type)
{
@@ -3797,7 +3855,20 @@ xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
case XPATH_STRING:
if (astate == NULL)
return 1;
- datum = CStringGetDatum((char *) xpathobj->stringval);
+
+ /*
+ * returned string is in UTF8 encoding - should be encoded
+ * to database encoding first.
+ */
+ str = (char *) pg_do_encoding_conversion((unsigned char *) xpathobj->stringval,
+ strlen((char *) xpathobj->stringval),
+ PG_UTF8,
+ GetDatabaseEncoding());
+
+elog(NOTICE, ">>>%s<<<", str);
+
+ datum = CStringGetDatum(str);
+
datumtype = CSTRINGOID;
break;
@@ -3812,6 +3883,7 @@ xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
datum = PointerGetDatum(cstring_to_xmltype(result_str));
(void) accumArrayResult(astate, datum, false,
XMLOID, CurrentMemoryContext);
+
return 1;
}
@@ -3895,7 +3967,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
errmsg("empty XPath expression")));
string = pg_xmlCharStrndup(datastr, len);
- xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
+ xpath_expr = pg_xmlCharUtf8(VARDATA_ANY(xpath_expr_text), xpath_len);
xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3911,7 +3983,9 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
+ pg_encoding_to_char(GetDatabaseEncoding()), 0);
+
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
@@ -3929,22 +4003,25 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
{
for (i = 0; i < ns_count; i++)
{
- char *ns_name;
- char *ns_uri;
+ text *ns_name;
+ text *ns_uri;
if (ns_names_uris_nulls[i * 2] ||
ns_names_uris_nulls[i * 2 + 1])
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("neither namespace name nor URI may be null")));
- ns_name = TextDatumGetCString(ns_names_uris[i * 2]);
- ns_uri = TextDatumGetCString(ns_names_uris[i * 2 + 1]);
+ ns_name = DatumGetTextP(ns_names_uris[i * 2]);
+ ns_uri = DatumGetTextP(ns_names_uris[i * 2 + 1]);
if (xmlXPathRegisterNs(xpathctx,
- (xmlChar *) ns_name,
- (xmlChar *) ns_uri) != 0)
+ pg_xmlCharUtf8(VARDATA_ANY(ns_name),
+ VARSIZE(ns_name) - VARHDRSZ),
+ pg_xmlCharUtf8(VARDATA_ANY(ns_uri),
+ VARSIZE(ns_uri) - VARHDRSZ)) != 0)
ereport(ERROR, /* is this an internal error??? */
(errmsg("could not register XML namespace with name \"%s\" and URI \"%s\"",
- ns_name, ns_uri)));
+ TextDatumGetCString(ns_name),
+ TextDatumGetCString(ns_uri))));
}
}
@@ -4242,18 +4319,14 @@ XmlTableSetDocument(TableFuncScanState *state, Datum value)
xtCxt = GetXmlTableBuilderPrivateData(state, "XmlTableSetDocument");
- /*
- * Use out function for casting to string (remove encoding property). See
- * comment in xml_out.
- */
- str = xml_out_internal(xmlval, 0);
-
- length = strlen(str);
+ str = VARDATA(xmlval);
+ length = VARSIZE(xmlval) - VARHDRSZ;
xstr = pg_xmlCharStrndup(str, length);
PG_TRY();
{
- doc = xmlCtxtReadMemory(xtCxt->ctxt, (char *) xstr, length, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(xtCxt->ctxt, (char *) xstr, length, NULL,
+ pg_encoding_to_char(GetDatabaseEncoding()), 0);
if (doc == NULL || xtCxt->xmlerrcxt->err_occurred)
xml_ereport(xtCxt->xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
@@ -4301,8 +4374,8 @@ XmlTableSetNamespace(TableFuncScanState *state, char *name, char *uri)
xtCxt = GetXmlTableBuilderPrivateData(state, "XmlTableSetNamespace");
if (xmlXPathRegisterNs(xtCxt->xpathcxt,
- pg_xmlCharStrndup(name, strlen(name)),
- pg_xmlCharStrndup(uri, strlen(uri))))
+ pg_xmlCharUtf8(name, strlen(name)),
+ pg_xmlCharUtf8(uri, strlen(uri))))
xml_ereport(xtCxt->xmlerrcxt, ERROR, ERRCODE_DATA_EXCEPTION,
"could not set XML namespace");
#else
@@ -4328,7 +4401,7 @@ XmlTableSetRowFilter(TableFuncScanState *state, char *path)
(errcode(ERRCODE_DATA_EXCEPTION),
errmsg("row path filter must not be empty string")));
- xstr = pg_xmlCharStrndup(path, strlen(path));
+ xstr = pg_xmlCharUtf8(path, strlen(path));
xtCxt->xpathcomp = xmlXPathCompile(xstr);
if (xtCxt->xpathcomp == NULL || xtCxt->xmlerrcxt->err_occurred)
@@ -4359,7 +4432,7 @@ XmlTableSetColumnFilter(TableFuncScanState *state, char *path, int colnum)
(errcode(ERRCODE_DATA_EXCEPTION),
errmsg("column path filter must not be empty string")));
- xstr = pg_xmlCharStrndup(path, strlen(path));
+ xstr = pg_xmlCharUtf8(path, strlen(path));
xtCxt->xpathscomp[colnum] = xmlXPathCompile(xstr);
if (xtCxt->xpathscomp[colnum] == NULL || xtCxt->xmlerrcxt->err_occurred)
@@ -4502,7 +4575,15 @@ XmlTableGetValue(TableFuncScanState *state, int colnum,
{
PG_TRY();
{
- cstr = pstrdup((char *) str);
+ if (GetDatabaseEncoding() != PG_UTF8)
+ {
+ cstr = (char *) pg_do_encoding_conversion((unsigned char *) str,
+ strlen((char *) str),
+ PG_UTF8,
+ GetDatabaseEncoding());
+ }
+ else
+ cstr = pstrdup((char *) str);
}
PG_CATCH();
{
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index bcc585d427..6a43896d40 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1452,3 +1452,24 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
14
(4 rows)
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE str text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should to work
+ RAISE NOTICE '%', xpath('/enprimeur/vino/id', str::xml);
+ RAISE NOTICE '%', (SELECT id FROM xmltable('/enprimeur/vino' PASSING (str::xml) COLUMNS id int));
+END; $$;
+NOTICE: {<id>909</id>}
+NOTICE: 909
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index eb4687fb09..97a3aa9de2 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -558,3 +558,23 @@ INSERT INTO xmltest2 VALUES('<d><r><dc>2</dc></r></d>', 'D');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable('/d/r' PASSING x COLUMNS a int PATH '' || lower(_path) || 'c');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH '.');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH 'x' DEFAULT ascii(_path) - 54);
+
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE str text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should to work
+ RAISE NOTICE '%', xpath('/enprimeur/vino/id', str::xml);
+ RAISE NOTICE '%', (SELECT id FROM xmltable('/enprimeur/vino' PASSING (str::xml) COLUMNS id int));
+END; $$;
Hi
2017-08-19 22:53 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:
Hi
I am sending some POC - it does support XPATH and XMLTABLE for not UTF8
server encoding.In this case, all strings should be converted to UTF8 before call libXML2
functions, and result should be converted back from UTF8.I found some previous experiments https://marc.info/?l=pgsql-bugs&m=
123407176408688Note: I got some information so used xmlNodeDump function is deprecated -
so we should to replace it too sometime.Regards
I forgot a debug elog in previous patch
Show quoted text
Pavel
Attachments:
xpath-no-utf8-encodings.patchtext/x-patch; charset=US-ASCII; name=xpath-no-utf8-encodings.patchDownload
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index c47624eff6..a43cf13d16 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -147,6 +147,7 @@ static int xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
ArrayBuildState *astate,
PgXmlErrorContext *xmlerrcxt);
static xmlChar *pg_xmlCharStrndup(char *str, size_t len);
+static xmlChar *pg_xmlCharUtf8(char *str, size_t len);
#endif /* USE_LIBXML */
static void xmldata_root_element_start(StringInfo result, const char *eltname,
@@ -459,8 +460,28 @@ cstring_to_xmltype(const char *string)
static xmltype *
xmlBuffer_to_xmltype(xmlBufferPtr buf)
{
- return (xmltype *) cstring_to_text_with_len((const char *) xmlBufferContent(buf),
+ if (GetDatabaseEncoding() != PG_UTF8)
+ {
+ char *utf8str = (char *) xmlBufferContent(buf);
+ char *str;
+ xmltype *result;
+
+ str = (char *) pg_do_encoding_conversion((unsigned char *) utf8str,
+ xmlBufferLength(buf),
+ PG_UTF8,
+ GetDatabaseEncoding());
+
+ Assert(str != utf8str);
+ result = (xmltype *) cstring_to_text(str);
+ pfree(str);
+
+ return result;
+ }
+ else
+ {
+ return (xmltype *) cstring_to_text_with_len((const char *) xmlBufferContent(buf),
xmlBufferLength(buf));
+ }
}
#endif
@@ -1176,6 +1197,28 @@ pg_xmlCharStrndup(char *str, size_t len)
}
/*
+ * LibXML2 internal encoding is UTF8. Sometimes LibXML2 enforce
+ * encoding to UTF8 by self, sometimes it expects UTF8 strings.
+ * This function is used for encoding from database encoding to
+ * UTF8.
+ */
+static xmlChar *
+pg_xmlCharUtf8(char *str, size_t len)
+{
+ char *result;
+
+ result = (char *) pg_do_encoding_conversion((unsigned char *) str,
+ len,
+ GetDatabaseEncoding(),
+ PG_UTF8);
+
+ if (result != str)
+ return BAD_CAST result;
+
+ return pg_xmlCharStrndup(str, len);
+}
+
+/*
* str is the null-terminated input string. Remaining arguments are
* output arguments; each can be NULL if value is not wanted.
* version and encoding are returned as locally-palloc'd strings.
@@ -3714,9 +3757,16 @@ xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
}
else
{
- xmlChar *str;
+ xmlChar *utf8str;
+ char *str = NULL;
+
+ utf8str = xmlXPathCastNodeToString(cur);
+
+ str = (char *) pg_do_encoding_conversion((unsigned char *) utf8str,
+ strlen((char *) utf8str),
+ PG_UTF8,
+ GetDatabaseEncoding());
- str = xmlXPathCastNodeToString(cur);
PG_TRY();
{
/* Here we rely on XML having the same representation as TEXT */
@@ -3727,11 +3777,18 @@ xml_xmlnodetoxmltype(xmlNodePtr cur, PgXmlErrorContext *xmlerrcxt)
}
PG_CATCH();
{
- xmlFree(str);
+ if (str != (char *) utf8str)
+ pfree(str);
+
+ xmlFree(utf8str);
PG_RE_THROW();
}
PG_END_TRY();
- xmlFree(str);
+
+ if (str != (char *) utf8str)
+ pfree(str);
+
+ xmlFree(utf8str);
}
return result;
@@ -3758,6 +3815,7 @@ xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
Datum datum;
Oid datumtype;
char *result_str;
+ char *str = NULL;
switch (xpathobj->type)
{
@@ -3797,7 +3855,18 @@ xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
case XPATH_STRING:
if (astate == NULL)
return 1;
- datum = CStringGetDatum((char *) xpathobj->stringval);
+
+ /*
+ * returned string is in UTF8 encoding - should be encoded
+ * to database encoding first.
+ */
+ str = (char *) pg_do_encoding_conversion((unsigned char *) xpathobj->stringval,
+ strlen((char *) xpathobj->stringval),
+ PG_UTF8,
+ GetDatabaseEncoding());
+
+ datum = CStringGetDatum(str);
+
datumtype = CSTRINGOID;
break;
@@ -3812,6 +3881,7 @@ xml_xpathobjtoxmlarray(xmlXPathObjectPtr xpathobj,
datum = PointerGetDatum(cstring_to_xmltype(result_str));
(void) accumArrayResult(astate, datum, false,
XMLOID, CurrentMemoryContext);
+
return 1;
}
@@ -3895,7 +3965,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
errmsg("empty XPath expression")));
string = pg_xmlCharStrndup(datastr, len);
- xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
+ xpath_expr = pg_xmlCharUtf8(VARDATA_ANY(xpath_expr_text), xpath_len);
xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3911,7 +3981,9 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
+ pg_encoding_to_char(GetDatabaseEncoding()), 0);
+
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
@@ -3929,22 +4001,25 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
{
for (i = 0; i < ns_count; i++)
{
- char *ns_name;
- char *ns_uri;
+ text *ns_name;
+ text *ns_uri;
if (ns_names_uris_nulls[i * 2] ||
ns_names_uris_nulls[i * 2 + 1])
ereport(ERROR,
(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
errmsg("neither namespace name nor URI may be null")));
- ns_name = TextDatumGetCString(ns_names_uris[i * 2]);
- ns_uri = TextDatumGetCString(ns_names_uris[i * 2 + 1]);
+ ns_name = DatumGetTextP(ns_names_uris[i * 2]);
+ ns_uri = DatumGetTextP(ns_names_uris[i * 2 + 1]);
if (xmlXPathRegisterNs(xpathctx,
- (xmlChar *) ns_name,
- (xmlChar *) ns_uri) != 0)
+ pg_xmlCharUtf8(VARDATA_ANY(ns_name),
+ VARSIZE(ns_name) - VARHDRSZ),
+ pg_xmlCharUtf8(VARDATA_ANY(ns_uri),
+ VARSIZE(ns_uri) - VARHDRSZ)) != 0)
ereport(ERROR, /* is this an internal error??? */
(errmsg("could not register XML namespace with name \"%s\" and URI \"%s\"",
- ns_name, ns_uri)));
+ TextDatumGetCString(ns_name),
+ TextDatumGetCString(ns_uri))));
}
}
@@ -4242,18 +4317,14 @@ XmlTableSetDocument(TableFuncScanState *state, Datum value)
xtCxt = GetXmlTableBuilderPrivateData(state, "XmlTableSetDocument");
- /*
- * Use out function for casting to string (remove encoding property). See
- * comment in xml_out.
- */
- str = xml_out_internal(xmlval, 0);
-
- length = strlen(str);
+ str = VARDATA(xmlval);
+ length = VARSIZE(xmlval) - VARHDRSZ;
xstr = pg_xmlCharStrndup(str, length);
PG_TRY();
{
- doc = xmlCtxtReadMemory(xtCxt->ctxt, (char *) xstr, length, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(xtCxt->ctxt, (char *) xstr, length, NULL,
+ pg_encoding_to_char(GetDatabaseEncoding()), 0);
if (doc == NULL || xtCxt->xmlerrcxt->err_occurred)
xml_ereport(xtCxt->xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
@@ -4301,8 +4372,8 @@ XmlTableSetNamespace(TableFuncScanState *state, char *name, char *uri)
xtCxt = GetXmlTableBuilderPrivateData(state, "XmlTableSetNamespace");
if (xmlXPathRegisterNs(xtCxt->xpathcxt,
- pg_xmlCharStrndup(name, strlen(name)),
- pg_xmlCharStrndup(uri, strlen(uri))))
+ pg_xmlCharUtf8(name, strlen(name)),
+ pg_xmlCharUtf8(uri, strlen(uri))))
xml_ereport(xtCxt->xmlerrcxt, ERROR, ERRCODE_DATA_EXCEPTION,
"could not set XML namespace");
#else
@@ -4328,7 +4399,7 @@ XmlTableSetRowFilter(TableFuncScanState *state, char *path)
(errcode(ERRCODE_DATA_EXCEPTION),
errmsg("row path filter must not be empty string")));
- xstr = pg_xmlCharStrndup(path, strlen(path));
+ xstr = pg_xmlCharUtf8(path, strlen(path));
xtCxt->xpathcomp = xmlXPathCompile(xstr);
if (xtCxt->xpathcomp == NULL || xtCxt->xmlerrcxt->err_occurred)
@@ -4359,7 +4430,7 @@ XmlTableSetColumnFilter(TableFuncScanState *state, char *path, int colnum)
(errcode(ERRCODE_DATA_EXCEPTION),
errmsg("column path filter must not be empty string")));
- xstr = pg_xmlCharStrndup(path, strlen(path));
+ xstr = pg_xmlCharUtf8(path, strlen(path));
xtCxt->xpathscomp[colnum] = xmlXPathCompile(xstr);
if (xtCxt->xpathscomp[colnum] == NULL || xtCxt->xmlerrcxt->err_occurred)
@@ -4502,7 +4573,15 @@ XmlTableGetValue(TableFuncScanState *state, int colnum,
{
PG_TRY();
{
- cstr = pstrdup((char *) str);
+ if (GetDatabaseEncoding() != PG_UTF8)
+ {
+ cstr = (char *) pg_do_encoding_conversion((unsigned char *) str,
+ strlen((char *) str),
+ PG_UTF8,
+ GetDatabaseEncoding());
+ }
+ else
+ cstr = pstrdup((char *) str);
}
PG_CATCH();
{
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index bcc585d427..6a43896d40 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1452,3 +1452,24 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
14
(4 rows)
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE str text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should to work
+ RAISE NOTICE '%', xpath('/enprimeur/vino/id', str::xml);
+ RAISE NOTICE '%', (SELECT id FROM xmltable('/enprimeur/vino' PASSING (str::xml) COLUMNS id int));
+END; $$;
+NOTICE: {<id>909</id>}
+NOTICE: 909
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index eb4687fb09..97a3aa9de2 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -558,3 +558,23 @@ INSERT INTO xmltest2 VALUES('<d><r><dc>2</dc></r></d>', 'D');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable('/d/r' PASSING x COLUMNS a int PATH '' || lower(_path) || 'c');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH '.');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH 'x' DEFAULT ascii(_path) - 54);
+
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE str text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should to work
+ RAISE NOTICE '%', xpath('/enprimeur/vino/id', str::xml);
+ RAISE NOTICE '%', (SELECT id FROM xmltable('/enprimeur/vino' PASSING (str::xml) COLUMNS id int));
+END; $$;
On Fri, Aug 18, 2017 at 11:43:19PM +0200, Pavel Stehule wrote:
yes, probably libXML2 try to do check from utf8 encoding to header
specified encoding.
Yes. That has been the topic of this thread.
a) all values created by xml_in iterface are in database encoding - input
string is stored without any change. xml_parse is called only due
validation.b) inside xml_parse, the input is converted to UTF8, and document is read
by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or
by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8"
and removed decl section.So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document,
wellformated_xml) the database encoding is not importantc) xml_recv function does validation by xml_parse and translation to
database encoding.Now I don't see a difference between @b and @c - so my hypotheses about
necessity to use recv interface was wrong.
Yes. You posted, on 2017-04-05, a test case not requiring the recv interface.
On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
I didn't find any info how to enable libXML2 XPath functions for other
encoding than UTF8 :( ??
http://xmlsoft.org/encoding.html is the relevant authority. To summarize, we
should send only UTF8 to libxml2.
On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
I am sending some POC - it does support XPATH and XMLTABLE for not UTF8
server encoding.In this case, all strings should be converted to UTF8 before call libXML2
functions, and result should be converted back from UTF8.
Adding support for xpath in non-UTF8 databases is a v11 feature proposal.
Please start a new thread for this, and add it to the open CommitFest.
In this thread, would you provide the version of your patch that I described
in my 2017-08-08 post to this thread? That's a back-patchable bug fix.
I found some previous experiments https://marc.info/?l=pgsql-bugs&m=123407176408688
https://wiki.postgresql.org/wiki/Todo#XML links to other background on this
feature proposal. See Tom Lane's review of a previous patch. Ensure your
patch does not have the problems he found during that review. Do that before
starting a thread for this feature.
Thanks,
nm
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-08-20 4:17 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Fri, Aug 18, 2017 at 11:43:19PM +0200, Pavel Stehule wrote:
yes, probably libXML2 try to do check from utf8 encoding to header
specified encoding.Yes. That has been the topic of this thread.
a) all values created by xml_in iterface are in database encoding -
input
string is stored without any change. xml_parse is called only due
validation.b) inside xml_parse, the input is converted to UTF8, and document is read
by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or
by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8"
and removed decl section.So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document,
wellformated_xml) the database encoding is not importantc) xml_recv function does validation by xml_parse and translation to
database encoding.Now I don't see a difference between @b and @c - so my hypotheses about
necessity to use recv interface was wrong.Yes. You posted, on 2017-04-05, a test case not requiring the recv
interface.On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
I didn't find any info how to enable libXML2 XPath functions for other
encoding than UTF8 :( ??http://xmlsoft.org/encoding.html is the relevant authority. To
summarize, we
should send only UTF8 to libxml2.
libxml2 encodes XML to UTF8 by self. All others should be in UTF8. I found
some references to xmlSwitchEncoding function - but I didn't find any
examples of usage - probably nobody use it. Result is in UTF8 always.
On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
I am sending some POC - it does support XPATH and XMLTABLE for not UTF8
server encoding.In this case, all strings should be converted to UTF8 before call libXML2
functions, and result should be converted back from UTF8.Adding support for xpath in non-UTF8 databases is a v11 feature proposal.
Please start a new thread for this, and add it to the open CommitFest.In this thread, would you provide the version of your patch that I
described
in my 2017-08-08 post to this thread? That's a back-patchable bug fix.
There are three issues:
1. processing 1byte encoding XMLs documents with encoding declaration -
should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch is very
short and safe - can be apply immediately (there is regress tests)
2 encoding issues in XPath specification (and namespaces) - because
multibytes chars are not usually used in tag names, this issue hit minimum
users.
3. encoding issues in XPath and XMLTABLE results - this is bad issue - the
function XMLTABLE will not be functional on non UTF8 databases. Fortunately
- there are less users with this encoding, but probably should be apply as
fix in 10/11 Postgres.
I found some previous experiments https://marc.info/?l=pgsql-
bugs&m=123407176408688https://wiki.postgresql.org/wiki/Todo#XML links to other background on
this
feature proposal. See Tom Lane's review of a previous patch. Ensure your
patch does not have the problems he found during that review. Do that
before
starting a thread for this feature.
good information - thank you. I'll start new thread for @2 and @3 issues -
not sure if I prepare good enough patch for next commit fest - and later
commiter can decide if will do backpatching.
Regards
Pavel
Show quoted text
Thanks,
nm
On Sun, Aug 20, 2017 at 08:46:03AM +0200, Pavel Stehule wrote:
2017-08-20 4:17 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
I am sending some POC - it does support XPATH and XMLTABLE for not UTF8
server encoding.In this case, all strings should be converted to UTF8 before call libXML2
functions, and result should be converted back from UTF8.Adding support for xpath in non-UTF8 databases is a v11 feature proposal.
Please start a new thread for this, and add it to the open CommitFest.In this thread, would you provide the version of your patch that I
described
in my 2017-08-08 post to this thread? That's a back-patchable bug fix.There are three issues:
1. processing 1byte encoding XMLs documents with encoding declaration -
should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch is very
short and safe - can be apply immediately (there is regress tests)
We should fix that problem, yes. encoding_for_xmlCtxtReadMemory.patch is not
the right fix; see below.
2 encoding issues in XPath specification (and namespaces) - because
multibytes chars are not usually used in tag names, this issue hit minimum
users.3. encoding issues in XPath and XMLTABLE results - this is bad issue - the
function XMLTABLE will not be functional on non UTF8 databases. Fortunately
- there are less users with this encoding, but probably should be apply as
fix in 10/11 Postgres.
(2) and (3) are long-documented (since commit 14180f9, 2009-06) limitations in
xpath functions. That's why I would treat an improvement as a new feature and
not back-patch it. It is tempting to fix v10 so XMLTABLE is born without this
limitation, but it is too late in the release cycle.
Reviewing encoding_for_xmlCtxtReadMemory.patch:
On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0); + doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, + pg_encoding_to_char(GetDatabaseEncoding()), 0);
This assumes libxml2 understands every encoding name that
pg_encoding_to_char() can return, but it does not. xpath-bugfix.patch was
better. Here are the relevant parts of my review of that patch.
On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
One possible fix - and similar technique is used more times in xml.c code
.. xmlroot+ /* remove header */ + parse_xml_decl(string, &header_len, NULL, NULL, NULL);- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0); + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
I like this parse_xml_decl() approach better. Would you update
your patch to use it and to add the test case I give above?
Again, would you make those two edits?
Thanks,
nm
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-08-20 9:21 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Sun, Aug 20, 2017 at 08:46:03AM +0200, Pavel Stehule wrote:
2017-08-20 4:17 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Sat, Aug 19, 2017 at 10:53:19PM +0200, Pavel Stehule wrote:
I am sending some POC - it does support XPATH and XMLTABLE for not
UTF8
server encoding.
In this case, all strings should be converted to UTF8 before call
libXML2
functions, and result should be converted back from UTF8.
Adding support for xpath in non-UTF8 databases is a v11 feature
proposal.
Please start a new thread for this, and add it to the open CommitFest.
In this thread, would you provide the version of your patch that I
described
in my 2017-08-08 post to this thread? That's a back-patchable bug fix.There are three issues:
1. processing 1byte encoding XMLs documents with encoding declaration -
should be fixed by ecoding_for_xmlCtxtReadMemory.patch. This patch isvery
short and safe - can be apply immediately (there is regress tests)
We should fix that problem, yes. encoding_for_xmlCtxtReadMemory.patch is
not
the right fix; see below.2 encoding issues in XPath specification (and namespaces) - because
multibytes chars are not usually used in tag names, this issue hitminimum
users.
3. encoding issues in XPath and XMLTABLE results - this is bad issue -
the
function XMLTABLE will not be functional on non UTF8 databases.
Fortunately
- there are less users with this encoding, but probably should be apply
as
fix in 10/11 Postgres.
(2) and (3) are long-documented (since commit 14180f9, 2009-06)
limitations in
xpath functions. That's why I would treat an improvement as a new feature
and
not back-patch it. It is tempting to fix v10 so XMLTABLE is born without
this
limitation, but it is too late in the release cycle.
I agree
Reviewing encoding_for_xmlCtxtReadMemory.patch:
On Sat, Aug 19, 2017 at 09:13:50AM +0200, Pavel Stehule wrote:
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, +pg_encoding_to_char(GetDatabaseEncoding()), 0);
This assumes libxml2 understands every encoding name that
pg_encoding_to_char() can return, but it does not. xpath-bugfix.patch was
better. Here are the relevant parts of my review of that patch.
I understand to this argument.
On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
One possible fix - and similar technique is used more times in xml.c
code
.. xmlroot
+ /* remove header */ + parse_xml_decl(string, &header_len, NULL, NULL, NULL);- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len,
len -
I like this parse_xml_decl() approach better. Would you update
your patch to use it and to add the test case I give above?Again, would you make those two edits?
Now, I am not sure so it is correct fix. We will fix case, when server is
in UTF8, but maybe we will break some cases when server is not in UTF8
(although it is broken already).
I am think so correct solution is encoding to UTF8 and passing encoding
parameter. It will works safely in all cases - and we don't need cut
header. We should not to cut header if server encoding is not in UTF8 and
we don't pass encoding as parameter. When we pass encoding as parameter,
then cutting header is not necessary.
What do you think about last patch?
Regards
Pavel
Show quoted text
Thanks,
nm
Attachments:
xpath-parsing-error-fix.patchtext/x-patch; charset=US-ASCII; name=xpath-parsing-error-fix.patchDownload
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index c47624eff6..5ceda8034d 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3895,6 +3895,13 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
errmsg("empty XPath expression")));
string = pg_xmlCharStrndup(datastr, len);
+
+ /* It does nothing when string is in UTF8 already */
+ string = pg_do_encoding_conversion(string,
+ len,
+ GetDatabaseEncoding(),
+ PG_UTF8);
+
xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
@@ -3911,7 +3918,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, "UTF-8", 0);
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index bcc585d427..0a5a80255f 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -1452,3 +1452,29 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
14
(4 rows)
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+ str text;
+ result text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should not to raise XML parsing error
+ result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+ -- should be quiet when result is expected
+ -- result be same for all encodings when all is ok
+ IF result <> '909' THEN
+ RAISE EXCEPTION 'unexpected result: %s', result;
+ END IF;
+END; $$;
diff --git a/src/test/regress/expected/xml_1.out b/src/test/regress/expected/xml_1.out
index d3bd8c91d7..9c20d384c9 100644
--- a/src/test/regress/expected/xml_1.out
+++ b/src/test/regress/expected/xml_1.out
@@ -1302,3 +1302,33 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
---
(0 rows)
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+ str text;
+ result text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should not to raise XML parsing error
+ result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+ -- should be quiet when result is expected
+ -- result be same for all encodings when all is ok
+ IF result <> '909' THEN
+ RAISE EXCEPTION 'unexpected result: %s', result;
+ END IF;
+END; $$;
+ERROR: unsupported XML feature
+DETAIL: This functionality requires the server to be built with libxml support.
+HINT: You need to rebuild PostgreSQL using --with-libxml.
+CONTEXT: PL/pgSQL function inline_code_block line 15 at assignment
diff --git a/src/test/regress/expected/xml_2.out b/src/test/regress/expected/xml_2.out
index ff77132803..694163c7fb 100644
--- a/src/test/regress/expected/xml_2.out
+++ b/src/test/regress/expected/xml_2.out
@@ -1432,3 +1432,29 @@ SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c
14
(4 rows)
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+ str text;
+ result text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should not to raise XML parsing error
+ result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+ -- should be quiet when result is expected
+ -- result be same for all encodings when all is ok
+ IF result <> '909' THEN
+ RAISE EXCEPTION 'unexpected result: %s', result;
+ END IF;
+END; $$;
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index eb4687fb09..37b87ce5a2 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -558,3 +558,30 @@ INSERT INTO xmltest2 VALUES('<d><r><dc>2</dc></r></d>', 'D');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable('/d/r' PASSING x COLUMNS a int PATH '' || lower(_path) || 'c');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH '.');
SELECT xmltable.* FROM xmltest2, LATERAL xmltable(('/d/r/' || lower(_path) || 'c') PASSING x COLUMNS a int PATH 'x' DEFAULT ascii(_path) - 54);
+
+-- XML is saved in database encoding with original encoding declaration.
+-- There can be incosistency based on wrong user input, different server/client
+-- encoding or reading XML with recv function. All XML functions should to
+-- work with this partially broken XML.
+DO $$
+DECLARE
+ str text;
+ result text;
+BEGIN
+ -- leave early without error, when we are not sure about result of conversion
+ IF current_setting('server_encoding') NOT IN ('UTF8', 'LATIN2') THEN return; END IF;
+
+ -- build valid UTF8 XML with broken encoding declaration
+ str = '<?xml version="1.0" encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>'
+ || convert_from('\xf2', 'windows-1250')
+ || '</remark></vino></enprimeur>';
+
+ -- should not to raise XML parsing error
+ result := (xpath('/enprimeur/vino/id/text()', str::xml))[1];
+
+ -- should be quiet when result is expected
+ -- result be same for all encodings when all is ok
+ IF result <> '909' THEN
+ RAISE EXCEPTION 'unexpected result: %s', result;
+ END IF;
+END; $$;
On Sun, Aug 20, 2017 at 10:54:57AM +0200, Pavel Stehule wrote:
2017-08-20 9:21 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Mon, Aug 07, 2017 at 05:10:14PM -0700, Noah Misch wrote:
On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
One possible fix - and similar technique is used more times in xml.c code
.. xmlroot+ /* remove header */ + parse_xml_decl(string, &header_len, NULL, NULL, NULL);- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0); + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -I like this parse_xml_decl() approach better. Would you update
your patch to use it and to add the test case I give above?Again, would you make those two edits?
Now, I am not sure so it is correct fix. We will fix case, when server is
in UTF8, but maybe we will break some cases when server is not in UTF8
(although it is broken already).
That's right.
I am think so correct solution is encoding to UTF8 and passing encoding
parameter. It will works safely in all cases - and we don't need cut
header. We should not to cut header if server encoding is not in UTF8 and
we don't pass encoding as parameter. When we pass encoding as parameter,
then cutting header is not necessary.
xpath-bugfix.patch affected only xml values containing an xml declaration with
"encoding" attribute. In UTF8 databases, this latest proposal
(xpath-parsing-error-fix.patch) is equivalent to xpath-bugfix.patch. In
non-UTF8 databases, xpath-parsing-error-fix.patch affects all xml values
containing non-ASCII data. In a LATIN1 database, the following works today
but breaks under your latest proposal:
SELECT xpath('text()', ('<x>' || convert_from('\xc2b0', 'LATIN1') || '</x>')::xml);
It's acceptable to break that, since the documentation explicitly disclaims
support for it. xpath-bugfix.patch breaks different use cases, which are
likewise acceptable to break. See my 2017-08-08 review for details.
We have xpath-bugfix.patch and xpath-parsing-error-fix.patch. Both are
equivalent under supported use cases (xpath in UTF8 databases). Among
non-supported use cases, they each make different things better and different
things worse. We should prefer to back-patch the version harming fewer
applications. I expect non-ASCII data is more common than xml declarations
with "encoding" attribute, so xpath-bugfix.patch will harm fewer applications.
Having said that, I now see a third option. Condition this thread's patch's
effects on GetDatabaseEncoding()==PG_UTF8. That way, we fix supported cases,
and we remain bug-compatible in unsupported cases. I think that's better than
the other options discussed so far. If you agree, please send a patch based
on xpath-bugfix.patch with the GetDatabaseEncoding()==PG_UTF8 change and the
two edits I described earlier.
Thanks,
nm
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi
xpath-bugfix.patch affected only xml values containing an xml declaration
with
"encoding" attribute. In UTF8 databases, this latest proposal
(xpath-parsing-error-fix.patch) is equivalent to xpath-bugfix.patch. In
non-UTF8 databases, xpath-parsing-error-fix.patch affects all xml values
containing non-ASCII data. In a LATIN1 database, the following works today
but breaks under your latest proposal:SELECT xpath('text()', ('<x>' || convert_from('\xc2b0', 'LATIN1') ||
'</x>')::xml);
I don't understand, why it should not work?
It's acceptable to break that, since the documentation explicitly disclaims
support for it. xpath-bugfix.patch breaks different use cases, which are
likewise acceptable to break. See my 2017-08-08 review for details.We have xpath-bugfix.patch and xpath-parsing-error-fix.patch. Both are
equivalent under supported use cases (xpath in UTF8 databases). Among
non-supported use cases, they each make different things better and
different
things worse. We should prefer to back-patch the version harming fewer
applications. I expect non-ASCII data is more common than xml declarations
with "encoding" attribute, so xpath-bugfix.patch will harm fewer
applications.Having said that, I now see a third option. Condition this thread's
patch's
effects on GetDatabaseEncoding()==PG_UTF8. That way, we fix supported
cases,
and we remain bug-compatible in unsupported cases. I think that's better
than
the other options discussed so far. If you agree, please send a patch
based
on xpath-bugfix.patch with the GetDatabaseEncoding()==PG_UTF8 change and
the
two edits I described earlier.
I am sorry - too long day today. Do you think some like
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 24229c2dff..9fd6f3509f 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3914,7 +3914,14 @@ xpath_internal(text *xpath_expr_text, xmltype *data,
ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+
+ /*
+ * Passed XML is always in server encoding. When server encoding
+ * is UTF8, we can pass this information to libxml2 to ignore
+ * possible invalid encoding declaration in XML document.
+ */
+ doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL,
+ GetDatabaseEncoding() == PG_UTF8 ? "UTF-8" : NULL, 0);
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
This fix only UTF8 servers and has no any effect on other cases
?
Regards
Pavel
Show quoted text
Thanks,
nm
xpath-bugfix.patch affected only xml values containing an xml declaration
with
"encoding" attribute. In UTF8 databases, this latest proposal
(xpath-parsing-error-fix.patch) is equivalent to xpath-bugfix.patch. In
non-UTF8 databases, xpath-parsing-error-fix.patch affects all xml values
containing non-ASCII data. In a LATIN1 database, the following works today
but breaks under your latest proposal:SELECT xpath('text()', ('<x>' || convert_from('\xc2b0', 'LATIN1') ||
'</x>')::xml);It's acceptable to break that, since the documentation explicitly disclaims
support for it. xpath-bugfix.patch breaks different use cases, which are
likewise acceptable to break. See my 2017-08-08 review for details.
The fact so this code is working shows so a universe is pretty dangerous
place :)
2017-08-21 6:25 GMT+02:00 Pavel Stehule <pavel.stehule@gmail.com>:
xpath-bugfix.patch affected only xml values containing an xml declaration
with
"encoding" attribute. In UTF8 databases, this latest proposal
(xpath-parsing-error-fix.patch) is equivalent to xpath-bugfix.patch. In
non-UTF8 databases, xpath-parsing-error-fix.patch affects all xml values
containing non-ASCII data. In a LATIN1 database, the following works
today
but breaks under your latest proposal:SELECT xpath('text()', ('<x>' || convert_from('\xc2b0', 'LATIN1') ||
'</x>')::xml);It's acceptable to break that, since the documentation explicitly
disclaims
support for it. xpath-bugfix.patch breaks different use cases, which are
likewise acceptable to break. See my 2017-08-08 review for details.The fact so this code is working shows so a universe is pretty dangerous
place :)
ping?
will we continue in this topic?
Show quoted text
On Sun, Aug 20, 2017 at 10:37:10PM +0200, Pavel Stehule wrote:
We have xpath-bugfix.patch and xpath-parsing-error-fix.patch. Both are
equivalent under supported use cases (xpath in UTF8 databases). Among
non-supported use cases, they each make different things better and
different
things worse. We should prefer to back-patch the version harming fewer
applications. I expect non-ASCII data is more common than xml declarations
with "encoding" attribute, so xpath-bugfix.patch will harm fewer
applications.Having said that, I now see a third option. Condition this thread's
patch's
effects on GetDatabaseEncoding()==PG_UTF8. That way, we fix supported
cases,
and we remain bug-compatible in unsupported cases. I think that's better
than
the other options discussed so far. If you agree, please send a patch
based
on xpath-bugfix.patch with the GetDatabaseEncoding()==PG_UTF8 change and
the
two edits I described earlier.I am sorry - too long day today. Do you think some like
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 24229c2dff..9fd6f3509f 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -3914,7 +3914,14 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces, if (ctxt == NULL || xmlerrcxt->err_occurred) xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY, "could not allocate parser context"); - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0); + + /* + * Passed XML is always in server encoding. When server encoding + * is UTF8, we can pass this information to libxml2 to ignore + * possible invalid encoding declaration in XML document. + */ + doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, + GetDatabaseEncoding() == PG_UTF8 ? "UTF-8" : NULL, 0); if (doc == NULL || xmlerrcxt->err_occurred) xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT, "could not parse XML document");
No, that doesn't match my description above. I don't see a way to clarify the
description. Feel free to try again. Alternately, if you wait, I will
eventually construct the patch I described.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-10-17 1:57 GMT+02:00 Noah Misch <noah@leadboat.com>:
On Sun, Aug 20, 2017 at 10:37:10PM +0200, Pavel Stehule wrote:
We have xpath-bugfix.patch and xpath-parsing-error-fix.patch. Both are
equivalent under supported use cases (xpath in UTF8 databases). Among
non-supported use cases, they each make different things better and
different
things worse. We should prefer to back-patch the version harming fewer
applications. I expect non-ASCII data is more common than xmldeclarations
with "encoding" attribute, so xpath-bugfix.patch will harm fewer
applications.Having said that, I now see a third option. Condition this thread's
patch's
effects on GetDatabaseEncoding()==PG_UTF8. That way, we fix supported
cases,
and we remain bug-compatible in unsupported cases. I think that'sbetter
than
the other options discussed so far. If you agree, please send a patch
based
on xpath-bugfix.patch with the GetDatabaseEncoding()==PG_UTF8 changeand
the
two edits I described earlier.I am sorry - too long day today. Do you think some like
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 24229c2dff..9fd6f3509f 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -3914,7 +3914,14 @@ xpath_internal(text *xpath_expr_text, xmltype*data,
ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL,0);
+ + /* + * Passed XML is always in server encoding. When server encoding + * is UTF8, we can pass this information to libxml2 to ignore + * possible invalid encoding declaration in XML document. + */ + doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, + GetDatabaseEncoding() == PG_UTF8 ? "UTF-8" : NULL, 0); if (doc == NULL || xmlerrcxt->err_occurred) xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT, "could not parse XML document");No, that doesn't match my description above. I don't see a way to clarify
the
description. Feel free to try again. Alternately, if you wait, I will
eventually construct the patch I described.
Please, if you can, try it write. I am little bit lost :)
Regards
Pavel
On Tue, Oct 17, 2017 at 06:06:40AM +0200, Pavel Stehule wrote:
Please, if you can, try it write. I am little bit lost :)
I'm attaching the patch I desired. Please review. This will probably miss
this week's minor releases. If there's significant support, I could instead
push before the wrap.
Attachments:
stale-xmldecl-v1.patchtext/plain; charset=us-asciiDownload
diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 24229c2..3e3aed8 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -3845,6 +3845,7 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
int32 xpath_len;
xmlChar *string;
xmlChar *xpath_expr;
+ size_t xmldecl_len = 0;
int i;
int ndim;
Datum *ns_names_uris;
@@ -3900,6 +3901,16 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
string = pg_xmlCharStrndup(datastr, len);
xpath_expr = pg_xmlCharStrndup(VARDATA_ANY(xpath_expr_text), xpath_len);
+ /*
+ * In a UTF8 database, skip any xml declaration, which might assert
+ * another encoding. Ignore parse_xml_decl() failure, letting
+ * xmlCtxtReadMemory() report parse errors. Documentation disclaims
+ * xpath() support for non-ASCII data in non-UTF8 databases, so leave
+ * those scenarios bug-compatible with historical behavior.
+ */
+ if (GetDatabaseEncoding() == PG_UTF8)
+ parse_xml_decl(string, &xmldecl_len, NULL, NULL, NULL);
+
xmlerrcxt = pg_xml_init(PG_XML_STRICTNESS_ALL);
PG_TRY();
@@ -3914,7 +3925,8 @@ xpath_internal(text *xpath_expr_text, xmltype *data, ArrayType *namespaces,
if (ctxt == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_OUT_OF_MEMORY,
"could not allocate parser context");
- doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL, 0);
+ doc = xmlCtxtReadMemory(ctxt, (char *) string + xmldecl_len,
+ len - xmldecl_len, NULL, NULL, 0);
if (doc == NULL || xmlerrcxt->err_occurred)
xml_ereport(xmlerrcxt, ERROR, ERRCODE_INVALID_XML_DOCUMENT,
"could not parse XML document");
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index bcc585d..f7a8c38 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -670,6 +670,37 @@ SELECT xpath('/nosuchtag', '<root/>');
{}
(1 row)
+-- Round-trip non-ASCII data through xpath().
+DO $$
+DECLARE
+ xml_declaration text := '<?xml version="1.0" encoding="ISO-8859-1"?>';
+ degree_symbol text;
+ res xml[];
+BEGIN
+ -- Per the documentation, xpath() doesn't work on non-ASCII data when
+ -- the server encoding is not UTF8. The EXCEPTION block below,
+ -- currently dead code, will be relevant if we remove this limitation.
+ IF current_setting('server_encoding') <> 'UTF8' THEN
+ RAISE LOG 'skip: encoding % unsupported for xml',
+ current_setting('server_encoding');
+ RETURN;
+ END IF;
+
+ degree_symbol := convert_from('\xc2b0', 'UTF8');
+ res := xpath('text()', (xml_declaration ||
+ '<x>' || degree_symbol || '</x>')::xml);
+ IF degree_symbol <> res[1]::text THEN
+ RAISE 'expected % (%), got % (%)',
+ degree_symbol, convert_to(degree_symbol, 'UTF8'),
+ res[1], convert_to(res[1]::text, 'UTF8');
+ END IF;
+EXCEPTION
+ -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has no equivalent in encoding "LATIN8"
+ WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
+ -- default conversion function for encoding "UTF8" to "MULE_INTERNAL" does not exist
+ WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
+END
+$$;
-- Test xmlexists and xpath_exists
SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>');
xmlexists
diff --git a/src/test/regress/expected/xml_1.out b/src/test/regress/expected/xml_1.out
index d3bd8c9..1a9efa2 100644
--- a/src/test/regress/expected/xml_1.out
+++ b/src/test/regress/expected/xml_1.out
@@ -576,6 +576,41 @@ LINE 1: SELECT xpath('/nosuchtag', '<root/>');
^
DETAIL: This functionality requires the server to be built with libxml support.
HINT: You need to rebuild PostgreSQL using --with-libxml.
+-- Round-trip non-ASCII data through xpath().
+DO $$
+DECLARE
+ xml_declaration text := '<?xml version="1.0" encoding="ISO-8859-1"?>';
+ degree_symbol text;
+ res xml[];
+BEGIN
+ -- Per the documentation, xpath() doesn't work on non-ASCII data when
+ -- the server encoding is not UTF8. The EXCEPTION block below,
+ -- currently dead code, will be relevant if we remove this limitation.
+ IF current_setting('server_encoding') <> 'UTF8' THEN
+ RAISE LOG 'skip: encoding % unsupported for xml',
+ current_setting('server_encoding');
+ RETURN;
+ END IF;
+
+ degree_symbol := convert_from('\xc2b0', 'UTF8');
+ res := xpath('text()', (xml_declaration ||
+ '<x>' || degree_symbol || '</x>')::xml);
+ IF degree_symbol <> res[1]::text THEN
+ RAISE 'expected % (%), got % (%)',
+ degree_symbol, convert_to(degree_symbol, 'UTF8'),
+ res[1], convert_to(res[1]::text, 'UTF8');
+ END IF;
+EXCEPTION
+ -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has no equivalent in encoding "LATIN8"
+ WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
+ -- default conversion function for encoding "UTF8" to "MULE_INTERNAL" does not exist
+ WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
+END
+$$;
+ERROR: unsupported XML feature
+DETAIL: This functionality requires the server to be built with libxml support.
+HINT: You need to rebuild PostgreSQL using --with-libxml.
+CONTEXT: PL/pgSQL function inline_code_block line 17 at assignment
-- Test xmlexists and xpath_exists
SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>');
ERROR: unsupported XML feature
diff --git a/src/test/regress/expected/xml_2.out b/src/test/regress/expected/xml_2.out
index ff77132..3868da1 100644
--- a/src/test/regress/expected/xml_2.out
+++ b/src/test/regress/expected/xml_2.out
@@ -650,6 +650,37 @@ SELECT xpath('/nosuchtag', '<root/>');
{}
(1 row)
+-- Round-trip non-ASCII data through xpath().
+DO $$
+DECLARE
+ xml_declaration text := '<?xml version="1.0" encoding="ISO-8859-1"?>';
+ degree_symbol text;
+ res xml[];
+BEGIN
+ -- Per the documentation, xpath() doesn't work on non-ASCII data when
+ -- the server encoding is not UTF8. The EXCEPTION block below,
+ -- currently dead code, will be relevant if we remove this limitation.
+ IF current_setting('server_encoding') <> 'UTF8' THEN
+ RAISE LOG 'skip: encoding % unsupported for xml',
+ current_setting('server_encoding');
+ RETURN;
+ END IF;
+
+ degree_symbol := convert_from('\xc2b0', 'UTF8');
+ res := xpath('text()', (xml_declaration ||
+ '<x>' || degree_symbol || '</x>')::xml);
+ IF degree_symbol <> res[1]::text THEN
+ RAISE 'expected % (%), got % (%)',
+ degree_symbol, convert_to(degree_symbol, 'UTF8'),
+ res[1], convert_to(res[1]::text, 'UTF8');
+ END IF;
+EXCEPTION
+ -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has no equivalent in encoding "LATIN8"
+ WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
+ -- default conversion function for encoding "UTF8" to "MULE_INTERNAL" does not exist
+ WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
+END
+$$;
-- Test xmlexists and xpath_exists
SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>');
xmlexists
diff --git a/src/test/regress/sql/xml.sql b/src/test/regress/sql/xml.sql
index eb4687f..fdb51c9 100644
--- a/src/test/regress/sql/xml.sql
+++ b/src/test/regress/sql/xml.sql
@@ -189,6 +189,38 @@ SELECT xpath('count(//*)=3', '<root><sub/><sub/></root>');
SELECT xpath('name(/*)', '<root><sub/><sub/></root>');
SELECT xpath('/nosuchtag', '<root/>');
+-- Round-trip non-ASCII data through xpath().
+DO $$
+DECLARE
+ xml_declaration text := '<?xml version="1.0" encoding="ISO-8859-1"?>';
+ degree_symbol text;
+ res xml[];
+BEGIN
+ -- Per the documentation, xpath() doesn't work on non-ASCII data when
+ -- the server encoding is not UTF8. The EXCEPTION block below,
+ -- currently dead code, will be relevant if we remove this limitation.
+ IF current_setting('server_encoding') <> 'UTF8' THEN
+ RAISE LOG 'skip: encoding % unsupported for xml',
+ current_setting('server_encoding');
+ RETURN;
+ END IF;
+
+ degree_symbol := convert_from('\xc2b0', 'UTF8');
+ res := xpath('text()', (xml_declaration ||
+ '<x>' || degree_symbol || '</x>')::xml);
+ IF degree_symbol <> res[1]::text THEN
+ RAISE 'expected % (%), got % (%)',
+ degree_symbol, convert_to(degree_symbol, 'UTF8'),
+ res[1], convert_to(res[1]::text, 'UTF8');
+ END IF;
+EXCEPTION
+ -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has no equivalent in encoding "LATIN8"
+ WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
+ -- default conversion function for encoding "UTF8" to "MULE_INTERNAL" does not exist
+ WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
+END
+$$;
+
-- Test xmlexists and xpath_exists
SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>');
SELECT xmlexists('//town[text() = ''Cwmbran'']' PASSING BY REF '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>');
Hi
2017-11-05 4:07 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Tue, Oct 17, 2017 at 06:06:40AM +0200, Pavel Stehule wrote:
Please, if you can, try it write. I am little bit lost :)
I'm attaching the patch I desired. Please review. This will probably miss
this week's minor releases. If there's significant support, I could
instead
push before the wrap.
I have not any objection to this solution. It fixes my regress tests too.
I checked it and it is working.
Regards
Pavel
On Sun, Nov 05, 2017 at 06:10:04PM +0100, Pavel Stehule wrote:
Hi
2017-11-05 4:07 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Tue, Oct 17, 2017 at 06:06:40AM +0200, Pavel Stehule wrote:
Please, if you can, try it write. I am little bit lost :)
I'm attaching the patch I desired. Please review. This will probably miss
this week's minor releases. If there's significant support, I could
instead
push before the wrap.I have not any objection to this solution. It fixes my regress tests too.
I checked it and it is working.
Pushed, but the buildfarm shows I didn't get the test quite right for the
non-xml, non-UTF8 case. Fixing.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
2017-11-11 21:19 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Sun, Nov 05, 2017 at 06:10:04PM +0100, Pavel Stehule wrote:
Hi
2017-11-05 4:07 GMT+01:00 Noah Misch <noah@leadboat.com>:
On Tue, Oct 17, 2017 at 06:06:40AM +0200, Pavel Stehule wrote:
Please, if you can, try it write. I am little bit lost :)
I'm attaching the patch I desired. Please review. This will probably
miss
this week's minor releases. If there's significant support, I could
instead
push before the wrap.I have not any objection to this solution. It fixes my regress tests too.
I checked it and it is working.
Pushed, but the buildfarm shows I didn't get the test quite right for the
non-xml, non-UTF8 case. Fixing.
Thank you
Pavel