PQescapeBytea is not multibyte aware
PQescapebytea() is not multibyte aware and will produce bad multibyte
character sequences. Example:
INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea');
ERROR: Invalid EUC_JP character sequence found (0x8950)
I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
considered as "non printable characters".
Any objection?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
PQescapebytea() is not multibyte aware and will produce bad multibyte
character sequences. Example:
I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
considered as "non printable characters".
Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
them from being mistaken for parts of multibyte sequences? Ugh, but
you're probably right. It looks to me like byteaout does the reverse
already.
regards, tom lane
Tom Lane wrote:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
PQescapebytea() is not multibyte aware and will produce bad multibyte
character sequences. Example:
I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
considered as "non printable characters".Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
them from being mistaken for parts of multibyte sequences? Ugh, but
you're probably right. It looks to me like byteaout does the reverse
already.
But the error comes from pg_verifymbstr. Since bytea has no encoding
(it's just an array of bytes afterall), why does pg_verifymbstr get
applied at all to bytea data?
pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it
help to rewrite this as:
INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea'::bytea);
?
Joe
Joe Conway <mail@joeconway.com> writes:
But the error comes from pg_verifymbstr. Since bytea has no encoding
(it's just an array of bytes afterall), why does pg_verifymbstr get
applied at all to bytea data?
Because textin() is used for the initial conversion to an "unknown"
constant --- see make_const() in parse_node.c.
pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it
help to rewrite this as:
INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea'::bytea);
Probably that would cause the error to disappear, but it's hardly a
desirable answer.
I wonder whether this says that TEXT is not a good implementation of
type UNKNOWN. That choice was made on the assumption that TEXT would
faithfully preserve the contents of a C string ... but it seems that in
the multibyte world it ain't so. It would not be a huge amount of work
to write a couple more I/O routines and give UNKNOWN its own I/O
behavior.
OTOH, I was surprised to read your message because I had assumed the
damage was being done much further upstream, viz during collection of
the query string by pq_getstr(). Do we need to think twice about that
processing, as well?
regards, tom lane
Tom Lane wrote:
INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea'::bytea);Probably that would cause the error to disappear, but it's hardly a
desirable answer.I wonder whether this says that TEXT is not a good implementation of
type UNKNOWN. That choice was made on the assumption that TEXT would
faithfully preserve the contents of a C string ... but it seems that in
the multibyte world it ain't so. It would not be a huge amount of work
to write a couple more I/O routines and give UNKNOWN its own I/O
behavior.
I could take a look at this. Any guidance other than "faithfully
preserving the contents of a C string"?
OTOH, I was surprised to read your message because I had assumed the
damage was being done much further upstream, viz during collection of
the query string by pq_getstr(). Do we need to think twice about that
processing, as well?
I'll take a look at this as well.
Joe
Joe Conway <mail@joeconway.com> writes:
I could take a look at this. Any guidance other than "faithfully
preserving the contents of a C string"?
Take textin/textout, remove multibyte awareness? Actually the hard
part is to figure out which of the existing hardwired calls of textin
and textout would need to be replaced by calls to unknownin/unknownout.
I think the assumption UNKNOWN == TEXT has crept into a fair number of
places by now.
regards, tom lane
Tom Lane wrote:
OTOH, I was surprised to read your message because I had assumed the
damage was being done much further upstream, viz during collection of
the query string by pq_getstr(). Do we need to think twice about that
processing, as well?
I just looked in pq_getstr() I see:
#ifdef MULTIBYTE
p = (char *) pg_client_to_server((unsigned char *) s->data, s->len);
if (p != s->data) /* actual conversion has been done? */
and in pg_client_to_server I see:
if (ClientEncoding->encoding == DatabaseEncoding->encoding)
return s;
So I'm guessing that in Tatsuo's case, both client and database encoding
are the same, and therefore the string was passed as-is downstream. I
think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?
Joe
Joe Conway <mail@joeconway.com> writes:
I think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?
This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way
out. Nonetheless, if UNKNOWN conversion introduces additional failures
then it makes sense to fix that.
regards, tom lane
Tom Lane wrote:
Joe Conway <mail@joeconway.com> writes:
I think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way
Yuck! At that point you're no better off than converting to hex (and
worse off than converting to base64) for storage.
SQL99 actually defines BLOB as a binary string literal comprised of an
even number of hexadecimal digits, in single quotes, preceded by "X",
e.g. X'1a43fe'. Should we be looking at implementing the standard
instead of, or in addition to, octalizing? Maybe it is possible to do
this by creating a new datatype, BLOB, which uses new IN/OUT functions,
but otherwise uses the various bytea functions?
out. Nonetheless, if UNKNOWN conversion introduces additional failures
then it makes sense to fix that.
I'll follow up on this then.
Joe
Joe Conway <mail@joeconway.com> writes:
This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way
Yuck! At that point you're no better off than converting to hex (and
worse off than converting to base64) for storage.
No; the *storage* is still compact, it's just the I/O representation
that's not.
SQL99 actually defines BLOB as a binary string literal comprised of an
even number of hexadecimal digits, in single quotes, preceded by "X",
e.g. X'1a43fe'. Should we be looking at implementing the standard
instead of, or in addition to, octalizing?
Perhaps we should cause the system to regard hex-strings as literals of
type bytea. Right now I think they're taken to be integer constants,
which is clearly not per spec.
regards, tom lane
Tom Lane wrote:
Yuck! At that point you're no better off than converting to hex
(and worse off than converting to base64) for storage.No; the *storage* is still compact, it's just the I/O representation
that's not.
Yeah, I realized that after I pushed send ;)
But still, doesn't that mean roughly twice as much memory usage for each
copy of the string? And I seem to remember Jan saying that each datum
winds up having 4 copies in memory. It ends up impacting the practical
length limit for a bytea value.
SQL99 actually defines BLOB as a binary string literal comprised
of an even number of hexadecimal digits, in single quotes,
preceded by "X", e.g. X'1a43fe'. Should we be looking at
implementing the standard instead of, or in addition to,
octalizing?Perhaps we should cause the system to regard hex-strings as literals
of type bytea. Right now I think they're taken to be integer
constants, which is clearly not per spec.
Wow. I didn't realize this was possible:
test=# select X'ffff';
?column?
----------
65535
(1 row)
This does clearly conflict with the spec, but what about backward
compatibility? Do you think many people use this capability?
Joe
Joe Conway <mail@joeconway.com> writes:
But still, doesn't that mean roughly twice as much memory usage for each
copy of the string? And I seem to remember Jan saying that each datum
winds up having 4 copies in memory. It ends up impacting the practical
length limit for a bytea value.
Well, once the data actually reaches Datum form it'll be in internal
representation, hence compact. I'm not sure how many copies the parser
will make in the process of casting to UNKNOWN and then to bytea, but
I'm not terribly concerned by the above argument.
Wow. I didn't realize this was possible:
test=# select X'ffff';
?column?
----------
65535
(1 row)
This does clearly conflict with the spec, but what about backward
compatibility? Do you think many people use this capability?
No idea. I don't think it's documented anywhere, though...
regards, tom lane
Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
them from being mistaken for parts of multibyte sequences?
Yes.
Ugh, but
you're probably right. It looks to me like byteaout does the reverse
already.
As for the new UNKNOWN data type, that seems a good thing for
me. However, I think more aggressive soultion would be having an
encoding info in the text data type itself. This would also opens the
way to implement SQL99's CREATE CHARACTER SET stuffs. I have been
thinking about this for a while and want to make a RFC in the future(I
need to rethink my idea to adopt the SCHEMA you introduced).
BTW, for the 7.2.x tree we need a solution with lesser impact.
For this purpose, I would like to change PQescapeBytea as I stated in
the previous mail. Objection?
--
Tatsuo Ishii
Tatsuo Ishii wrote:
BTW, for the 7.2.x tree we need a solution with lesser impact.
For this purpose, I would like to change PQescapeBytea as I stated in
the previous mail. Objection?
--
Tatsuo Ishii
No objection here, but can we wrap the change in #ifdef MULTIBYTE so
there's no effect for people who don't use MULTIBYTE?
Joe
Joe Conway <mail@joeconway.com> writes:
No objection here, but can we wrap the change in #ifdef MULTIBYTE so
there's no effect for people who don't use MULTIBYTE?
That opens up the standard set of issues about "what if your server is
MULTIBYTE but your libpq is not?" It seems risky to me.
regards, tom lane
Tom Lane wrote:
Joe Conway <mail@joeconway.com> writes:
I think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way
out. Nonetheless, if UNKNOWN conversion introduces additional failures
then it makes sense to fix that.regards, tom lane
Here's a patch to add unknownin/unknownout support. I also poked around
looking for places that assume UNKNOWN == TEXT. One of those was the
"SET" type in pg_type.h, which was using textin/textout. This one I took
care of in this patch. The other suspicious place was in
string_to_dataum (which is defined in both selfuncs.c and indxpath.c). I
wasn't too sure about those, so I left them be.
Regression tests all pass with the exception of horology, which also
fails on CVS tip. It looks like that is a daylight savings time issue
though.
Also as a side note, I can't get make check to get past initdb if I
configure with --enable-multibyte on CVS tip. Is there a known problem
or am I just being clueless . . .wait, let's qualify that -- am I being
clueless on this one issue? ;-)
Joe
Attachments:
unk.r0.patchtext/plain; name=unk.r0.patchDownload+56-4
Joe Conway wrote:
Here's a patch to add unknownin/unknownout support. I also poked around
looking for places that assume UNKNOWN == TEXT. One of those was the
"SET" type in pg_type.h, which was using textin/textout. This one I took
care of in this patch. The other suspicious place was in
string_to_dataum (which is defined in both selfuncs.c and indxpath.c). I
wasn't too sure about those, so I left them be.
I found three other suspicious spots in the source, where UNKNOWN ==
TEXT is assumed. The first looks like it needs to be changed for sure,
the other two I'm less sure about. Feedback would be most appreciated
(on this and the patch itself).
(line numbers based on CVS from earlier today)
parse_node.c
line 428
parse_coerce.c
line 85
parse_coerce.c
line 403
Joe
Joe Conway <mail@joeconway.com> writes:
No objection here, but can we wrap the change in #ifdef MULTIBYTE so
there's no effect for people who don't use MULTIBYTE?That opens up the standard set of issues about "what if your server is
MULTIBYTE but your libpq is not?" It seems risky to me.
I have committed changes to the current source (without MULTIBYTE
ifdes). Will change t.2-stable tree soon.
I also added some careful handlings for memory allocation errors and
changed some questionable codes useing direct ASCII values 92 instead
of '\\' for example.
--
Tatsuo Ishii
Joe Conway <mail@joeconway.com> writes:
Regression tests all pass with the exception of horology, which also
fails on CVS tip. It looks like that is a daylight savings time issue
though.
Yup, ye olde DST-transition-makes-for-funny-day-length issue. This is
mentioned in the docs at
http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/regress-evaluation.html#AEN18363
although I see the troublesome tests are now in horology not timestamp.
(Docs fixed...)
Also as a side note, I can't get make check to get past initdb if I
configure with --enable-multibyte on CVS tip. Is there a known problem
News to me --- anyone else seeing that?
regards, tom lane
Yup, ye olde DST-transition-makes-for-funny-day-length issue. This is
mentioned in the docs at
http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/regres
s-evaluation.html#AEN18363
although I see the troublesome tests are now in horology not timestamp.
(Docs fixed...)Also as a side note, I can't get make check to get past initdb if I
configure with --enable-multibyte on CVS tip. Is there a known problemNews to me --- anyone else seeing that?
I get initdb failures all the time when building CVS. You need to gmake
clean to fix some things. Try doing a gmake clean && gmake check
Chris