xml type and encodings
We need to decide on how to handle encoding information embedded in xml
data that is passed through the client/server encoding conversion.
Here is an example:
Client encoding is A, server encoding is B. Client sends an xml datum
that looks like this:
INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0"
encoding="C"?><content>...</content>'));
Assuming that A, B, and C are all distinct, this could fail at a number
of places.
I suggest that we make the system ignore all encoding declarations in
xml data. That is, in the above example, the string would actually
have to be encoded in client encoding A on the client, would be
converted to B on the server and stored as such. As far as I can tell,
this is easily implemented and allowed by the XML standard.
The same would be done on the way back. The datum would arrive in
encoding A on the client. It might be implementation-dependent whether
the datum actually contains an XML declaration specifying an encoding
and whether that encoding might read A, B, or C -- I haven't figured
that out yet -- but the client will always be required to consider it
to be A.
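The data flow proposed here can be sketched in a few lines of Python. This is only an illustration, not PostgreSQL code: `client_to_server` and `server_to_client` are hypothetical names, and Python codec names stand in for the client encoding, the server encoding, and the declared encoding C.

```python
def client_to_server(data: bytes, client_encoding: str, server_encoding: str) -> bytes:
    """Recode an incoming xml datum; any encoding declaration inside is ignored."""
    return data.decode(client_encoding).encode(server_encoding)


def server_to_client(data: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Recode a stored xml datum for the client, again without looking inside it."""
    return data.decode(server_encoding).encode(client_encoding)


# Assumed for the example: client encoding UTF-8, server encoding ISO-8859-1,
# and a declaration that names a third encoding (KOI8-R) which is never used.
doc = '<?xml version="1.0" encoding="KOI8-R"?><content>é</content>'
sent = doc.encode("utf-8")
stored = client_to_server(sent, "utf-8", "iso-8859-1")
returned = server_to_client(stored, "iso-8859-1", "utf-8")
```

The round trip returns the bytes that were sent, and the declaration still names KOI8-R at every stage, which is the part of the proposal that the rest of the thread questions.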
What should be done about the binary send/receive functionality?
Looking at the send/receive functions for the text type, they
communicate all data in the server encoding, so it seems reasonable to
do this here as well.
Comments?
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
Looking at the send/receive functions for the text type, they
communicate all data in the server encoding, so it seems reasonable to
do this here as well.
Uh, no, I'm pretty sure there's a translation to the client encoding.
It's in a subroutine not in text_send proper.
regards, tom lane
On 1/15/07, Peter Eisentraut <peter_e@gmx.net> wrote:
Client encoding is A, server encoding is B. Client sends an xml datum
that looks like this:
INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0"
encoding="C"?><content>...</content>'));
Assuming that A, B, and C are all distinct, this could fail at a number
of places.
I suggest that we make the system ignore all encoding declarations in
xml data. That is, in the above example, the string would actually
have to be encoded in client encoding A on the client, would be
converted to B on the server and stored as such. As far as I can tell,
this is easily implemented and allowed by the XML standard.
In other words, in case when B != C server must trigger an error, right?
--
Best regards,
Nikolay
Am Montag, 15. Januar 2007 12:42 schrieb Nikolay Samokhvalov:
On 1/15/07, Peter Eisentraut <peter_e@gmx.net> wrote:
Client encoding is A, server encoding is B. Client sends an xml datum
that looks like this:
INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0"
encoding="C"?><content>...</content>'));
Assuming that A, B, and C are all distinct, this could fail at a number
of places.
I suggest that we make the system ignore all encoding declarations in
xml data. That is, in the above example, the string would actually
have to be encoded in client encoding A on the client, would be
converted to B on the server and stored as such. As far as I can tell,
this is easily implemented and allowed by the XML standard.
In other words, in case when B != C server must trigger an error, right?
No, C is ignored in all cases.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut wrote:
Am Montag, 15. Januar 2007 12:42 schrieb Nikolay Samokhvalov:
On 1/15/07, Peter Eisentraut <peter_e@gmx.net> wrote:
Client encoding is A, server encoding is B. Client sends an xml datum
that looks like this:
INSERT INTO table VALUES (xmlparse(document '<?xml version="1.0"
encoding="C"?><content>...</content>'));
Assuming that A, B, and C are all distinct, this could fail at a number
of places.
I suggest that we make the system ignore all encoding declarations in
xml data. That is, in the above example, the string would actually
have to be encoded in client encoding A on the client, would be
converted to B on the server and stored as such. As far as I can tell,
this is easily implemented and allowed by the XML standard.
In other words, in case when B != C server must trigger an error, right?
No, C is ignored in all cases.
Would this mean that if the client_encoding is for example latin1, and I
retrieve an xml document uploaded by a client with client_encoding utf-8
(and thus having encoding="c" in the xml tag), that I would get a
document with latin1 encoding but saying that it's utf-8 in its xml tag?
greetings, Florian Pflug
Am Montag, 15. Januar 2007 17:33 schrieb Florian G. Pflug:
Would this mean that if the client_encoding is for example latin1, and I
retrieve an xml document uploaded by a client with client_encoding utf-8
(and thus having encoding="c" in the xml tag), that I would get a
document with latin1 encoding but saying that it's utf-8 in its xml tag?
That is likely to be a consequence of this proposed behaviour, but no doubt
not a nice one.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
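The consequence discussed in this exchange can be reproduced on the client side with a small Python sketch. It assumes the client hands the received bytes directly to an XML parser (here the standard library's `xml.etree.ElementTree`); `try_parse` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# Document uploaded by a UTF-8 client, so its declaration says UTF-8.
original = '<?xml version="1.0" encoding="UTF-8"?><content>é</content>'

# What a latin1 client would receive under the proposal:
# the bytes are recoded, the declaration is left alone.
received = original.encode("iso-8859-1")


def try_parse(data: bytes):
    """Return the root element's text, or the ParseError if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError as exc:
        return exc


result_utf8 = try_parse(original.encode("utf-8"))  # bytes match the declaration
result_latin1 = try_parse(received)                # bytes contradict the declaration
```

The parser trusts the declaration, so the latin1 byte for "é" is rejected as malformed UTF-8, while the same document in its declared encoding parses normally.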
On Mon, Jan 15, 2007 at 05:47:37PM +0100, Peter Eisentraut wrote:
Am Montag, 15. Januar 2007 17:33 schrieb Florian G. Pflug:
Would this mean that if the client_encoding is for example latin1, and I
retrieve an xml document uploaded by a client with client_encoding utf-8
(and thus having encoding="c" in the xml tag), that I would get a
document with latin1 encoding but saying that it's utf-8 in its xml tag?
That is likely to be a consequence of this proposed behaviour, but no doubt
not a nice one.
The only real alternative is to treat xml more like bytea than text
(ie, treat the input as a stream of octets). Whether that's "nice", I
have no idea.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Peter Eisentraut wrote:
Am Montag, 15. Januar 2007 17:33 schrieb Florian G. Pflug:
Would this mean that if the client_encoding is for example latin1, and I
retrieve an xml document uploaded by a client with client_encoding utf-8
(and thus having encoding="c" in the xml tag), that I would get a
document with latin1 encoding but saying that it's utf-8 in its xml tag?
That is likely to be a consequence of this proposed behaviour, but no doubt
not a nice one.
Couldn't the server change the encoding declaration inside the xml to
the correct one (the same as client_encoding) before returning the result?
Otherwise, parsing the xml on the client with some xml library becomes
difficult, because the library is likely to get confused by the wrong
encoding tag - and you can't even fix that by using the correct client
encoding, because you don't know what the encoding tag says until you
have retrieved the document...
greetings, Florian Pflug
Martijn van Oosterhout wrote:
The only real alternative is to treat xml more like bytea than text
(ie, treat the input as a stream of octets).
bytea isn't "treated" any different than other data types. You just
have to take care in the client that you escape every byte greater than
127. The same option is available to you in xml, if you escape all
suspicious characters using entities. Then, the encoding declaration
is immaterial anyway. (Unless you allow UTF-16 into the picture, but
let's say we exclude that implicitly.)
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
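The entity-escaping option mentioned in this message can be illustrated in Python. This is a sketch under assumptions: `escape_non_ascii` and `parse_with_declaration` are hypothetical helpers, and the escaping relies on Python's `xmlcharrefreplace` error handler, which only works when the non-ASCII characters occur in character data or attribute values (character references are not allowed in element names).

```python
import xml.etree.ElementTree as ET


def escape_non_ascii(text: str) -> bytes:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")


# The body is now pure ASCII, so it is byte-identical in any ASCII-compatible encoding.
body = escape_non_ascii("<content>Grüße, 日本語</content>")


def parse_with_declaration(declared: str) -> str:
    """Parse the escaped body under an arbitrary declared encoding."""
    header = f'<?xml version="1.0" encoding="{declared}"?>'.encode("ascii")
    return ET.fromstring(header + body).text


texts = {enc: parse_with_declaration(enc) for enc in ("UTF-8", "ISO-8859-1", "US-ASCII")}
```

All three declarations yield the same text, which is why the declaration is immaterial here. As the message notes, this stops holding once an encoding that is not ASCII-compatible, such as UTF-16, is allowed.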
Florian G. Pflug wrote:
Couldn't the server change the encoding declaration inside the xml to
the correct one (the same as client_encoding) before returning the result?
The data type output function doesn't know what the client encoding is
or whether the data will be shipped to the client at all. But what I'm
thinking is that we should remove the encoding declaration if possible.
At least that would be less confusing, albeit still potentially
incorrect if the client continues to process the document without care.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
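Removing the encoding declaration could look roughly like the following Python sketch. It is not how PostgreSQL does it; `strip_encoding_decl` and the regular expression are assumptions made for illustration, and they only handle an XML declaration at the very start of the document.

```python
import re

# A quoted pseudo-attribute value, in either double or single quotes.
_QUOTED = r"""(?:"[^"]*"|'[^']*')"""

# Group 1: '<?xml version=...'; then the encoding pseudo-attribute to drop;
# group 2: an optional standalone pseudo-attribute and the closing '?>'.
_XML_DECL = re.compile(
    r"^(<\?xml\s+version\s*=\s*" + _QUOTED + r")"
    + r"\s+encoding\s*=\s*" + _QUOTED
    + r"((?:\s+standalone\s*=\s*" + _QUOTED + r")?\s*\?>)"
)


def strip_encoding_decl(doc: str) -> str:
    """Drop encoding="..." from a leading XML declaration, if there is one."""
    return _XML_DECL.sub(r"\1\2", doc, count=1)


stripped = strip_encoding_decl('<?xml version="1.0" encoding="C"?><content>...</content>')
```

A document without a declaration, or with a declaration that has no encoding part, passes through unchanged. Note that an XML processor given no declaration and no byte order mark assumes UTF-8, so a client using another encoding still has to tell its parser which one, which matches the "still potentially incorrect" caveat above.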
Peter Eisentraut wrote:
Florian G. Pflug wrote:
Couldn't the server change the encoding declaration inside the xml to
the correct one (the same as client_encoding) before returning the result?
The data type output function doesn't know what the client encoding is
or whether the data will be shipped to the client at all. But what I'm
thinking is that we should remove the encoding declaration if possible.
At least that would be less confusing, albeit still potentially
incorrect if the client continues to process the document without care.
Sorry, I don't get it - how does this work for text, then? It works there
to dynamically recode the data from the database encoding to the client
encoding before sending it off to the client, no?
greetings, Florian Pflug
Florian G. Pflug wrote:
Sorry, I don't get it - how does this work for text, then? It works
there to dynamically recode the data from the database encoding to
the client encoding before sending it off to the client, no?
Sure, but it doesn't change the text inside the datum.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut wrote:
Florian G. Pflug wrote:
Couldn't the server change the encoding declaration inside the xml to
the correct one (the same as client_encoding) before returning the result?
The data type output function doesn't know what the client encoding is
or whether the data will be shipped to the client at all. But what I'm
thinking is that we should remove the encoding declaration if possible.
At least that would be less confusing, albeit still potentially
incorrect if the client continues to process the document without care.
The XML spec says:
"In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is a fatal error for an entity including an
encoding declaration to be presented to the XML processor in an encoding
other than that named in the declaration, or for an entity which begins
with neither a Byte Order Mark nor an encoding declaration to use an
encoding other than UTF-8. Note that since ASCII is a subset of UTF-8,
ordinary ASCII entities do not strictly need an encoding declaration."
ISTM we are reasonably entitled to require the client to pass in an xml
document that uses the client encoding, re-encoding it if necessary (and
adjusting the encoding decl if any in the process).
We should error out on any explicit encoding that conflicts with the
client encoding. I don't like the idea of just ignoring an explicit
encoding decl.
Are we going to ensure that what we hand back to another client has an
appropriate encoding decl? Or will we just remove it in all cases?
cheers
andrew
Andrew Dunstan wrote:
We should error out on any explicit encoding that conflicts with the
client encoding. I don't like the idea of just ignoring an explicit
encoding decl.
That is an instance of the problem of figuring out which encoding names
are equivalent, which I believe we have settled on finding impossible.
Are we going to ensure that what we hand back to another client has
an appropriate encoding decl? Or will we just remove it in all cases?
We can't do the former, but the latter might be doable.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
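To make the equivalence problem concrete, here is a Python sketch of the check that erroring out on a conflicting declaration would need. It uses Python's own codec alias table via `codecs.lookup`; that table is only a stand-in, since the names known to PostgreSQL, to libxml, and to document authors need not agree with it or with each other. `same_encoding` is a hypothetical helper.

```python
import codecs


def same_encoding(name_a: str, name_b: str) -> bool:
    """Compare two encoding names via Python's codec alias table."""
    try:
        return codecs.lookup(name_a).name == codecs.lookup(name_b).name
    except LookupError:
        # A name this registry does not know: we cannot say whether they match.
        raise ValueError(f"cannot compare {name_a!r} and {name_b!r}") from None


aliases_match = same_encoding("latin1", "ISO-8859-1")
distinct = same_encoding("utf-8", "iso-8859-1")
```

The check works only as far as the alias table reaches. Any name outside it leaves the question undecidable, and the caller would have to choose between a spurious error and silently accepting the declaration.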
I wrote:
We need to decide on how to handle encoding information embedded in
xml data that is passed through the client/server encoding
conversion.
Tangentially related, I'm currently experimenting with a setup that
stores all xml data in UTF-8 on the server, converting it back to the
server encoding on output. This doesn't do anything to solve the
problem above, but it makes the internal processing much simpler, since
all of libxml uses UTF-8 internally anyway. Is anyone opposed to that
setup on principle?
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut <peter_e@gmx.net> writes:
Andrew Dunstan wrote:
Are we going to ensure that what we hand back to another client has
an appropriate encoding decl? Or will we just remove it in all cases?
We can't do the former, but the latter might be doable.
I think that in the case of binary output, it'd be possible for xml_send
to include an encoding decl safely, because it could be sure that that's
where the data is going forthwith. Not sure if that's worth anything
though. The idea of text and binary output behaving differently on this
point doesn't seem all that attractive ...
regards, tom lane
Peter Eisentraut wrote:
I wrote:
We need to decide on how to handle encoding information embedded in
xml data that is passed through the client/server encoding
conversion.
Tangentially related, I'm currently experimenting with a setup that
stores all xml data in UTF-8 on the server, converting it back to the
server encoding on output. This doesn't do anything to solve the
problem above, but it makes the internal processing much simpler, since
all of libxml uses UTF-8 internally anyway. Is anyone opposed to that
setup on principle?
If you do that, maybe it would be the easiest and least confusing thing
to just _always_ represent an xml document in utf-8, ignoring the client_encoding
entirely for xml. The only good reason for not using utf-8 that comes to
my mind is the increased storage size, especially for eastern scripts where
nearly all characters need 2 or more bytes. But if you store it in utf-8
internally anyway, then I don't think this argument carries a lot of weight
anymore...
You could warn the user about that fact whenever he sends or receives an
xml document, and the client_encoding is not set to utf-8.
Not that I'm entirely convinced about this being a good idea myself - but I
think I'd prefer a clear rule like that over surprises like "text and binary
output have different semantics" or "the encoding information is totally misleading
and must be ignored". And most software that uses xml probably uses utf-8...
greetings, Florian Pflug
On Tue, Jan 16, 2007 at 06:41:56PM +0100, Florian G. Pflug wrote:
If you do that, maybe it would be the easiest and least confusing thing
to just _always_ represent an xml document in utf-8, ignoring the
client_encoding entirely for xml.
You can't do that. The server needs to parse the incoming string before
it knows it's dealing with XML. The string from the client must be
interpreted in the client encoding...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.