PQescapeBytea is not multibyte aware

Started by Tatsuo Ishiiabout 24 years ago36 messageshackers

t-ishii@sra.co.jp

about 24 years ago

PQescapebytea() is not multibyte aware and will produce bad multibyte
character sequences. Example:

INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea');
ERROR: Invalid EUC_JP character sequence found (0x8950)

I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
considered as "non printable characters".

Any objection?
--
Tatsuo Ishii

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: PQescapeBytea is not multibyte aware

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

PQescapebytea() is not multibyte aware and will produce bad multibyte
character sequences. Example:
I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
considered as "non printable characters".

Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
them from being mistaken for parts of multibyte sequences? Ugh, but
you're probably right. It looks to me like byteaout does the reverse
already.

regards, tom lane

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: PQescapeBytea is not multibyte aware

Tom Lane wrote:

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

PQescapebytea() is not multibyte aware and will produce bad multibyte
character sequences. Example:
I think 0x89 should be converted to '\\211' since 0x89 of 0x8950 is
considered as "non printable characters".

Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
them from being mistaken for parts of multibyte sequences? Ugh, but
you're probably right. It looks to me like byteaout does the reverse
already.

But the error comes from pg_verifymbstr. Since bytea has no encoding
(it's just an array of bytes afterall), why does pg_verifymbstr get
applied at all to bytea data?

pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it
help to rewrite this as:

INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea'::bytea);
?

Joe

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Joe Conway (#3)

Re: PQescapeBytea is not multibyte aware

Joe Conway <mail@joeconway.com> writes:

But the error comes from pg_verifymbstr. Since bytea has no encoding
(it's just an array of bytes afterall), why does pg_verifymbstr get
applied at all to bytea data?

Because textin() is used for the initial conversion to an "unknown"
constant --- see make_const() in parse_node.c.

pg_verifymbstr is called by textin, bpcharin, and varcharin. Would it
help to rewrite this as:

INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea'::bytea);

Probably that would cause the error to disappear, but it's hardly a
desirable answer.

I wonder whether this says that TEXT is not a good implementation of
type UNKNOWN. That choice was made on the assumption that TEXT would
faithfully preserve the contents of a C string ... but it seems that in
the multibyte world it ain't so. It would not be a huge amount of work
to write a couple more I/O routines and give UNKNOWN its own I/O
behavior.

OTOH, I was surprised to read your message because I had assumed the
damage was being done much further upstream, viz during collection of
the query string by pq_getstr(). Do we need to think twice about that
processing, as well?

regards, tom lane

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: PQescapeBytea is not multibyte aware

Tom Lane wrote:

INSERT INTO t1(bytea_col) VALUES('characters produced by
PQescapebytea'::bytea);

Probably that would cause the error to disappear, but it's hardly a
desirable answer.

I wonder whether this says that TEXT is not a good implementation of
type UNKNOWN. That choice was made on the assumption that TEXT would
faithfully preserve the contents of a C string ... but it seems that in
the multibyte world it ain't so. It would not be a huge amount of work
to write a couple more I/O routines and give UNKNOWN its own I/O
behavior.

I could take a look at this. Any guidance other than "faithfully
preserving the contents of a C string"?

OTOH, I was surprised to read your message because I had assumed the
damage was being done much further upstream, viz during collection of
the query string by pq_getstr(). Do we need to think twice about that
processing, as well?

I'll take a look at this as well.

Joe

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Joe Conway (#5)

Re: PQescapeBytea is not multibyte aware

Joe Conway <mail@joeconway.com> writes:

I could take a look at this. Any guidance other than "faithfully
preserving the contents of a C string"?

Take textin/textout, remove multibyte awareness? Actually the hard
part is to figure out which of the existing hardwired calls of textin
and textout would need to be replaced by calls to unknownin/unknownout.
I think the assumption UNKNOWN == TEXT has crept into a fair number of
places by now.

regards, tom lane

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: PQescapeBytea is not multibyte aware

Tom Lane wrote:

OTOH, I was surprised to read your message because I had assumed the
damage was being done much further upstream, viz during collection of
the query string by pq_getstr(). Do we need to think twice about that
processing, as well?

I just looked in pq_getstr() I see:

#ifdef MULTIBYTE
p = (char *) pg_client_to_server((unsigned char *) s->data, s->len);
if (p != s->data) /* actual conversion has been done? */

and in pg_client_to_server I see:

if (ClientEncoding->encoding == DatabaseEncoding->encoding)
return s;

So I'm guessing that in Tatsuo's case, both client and database encoding
are the same, and therefore the string was passed as-is downstream. I
think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?

Joe

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Joe Conway (#7)

Re: PQescapeBytea is not multibyte aware

Joe Conway <mail@joeconway.com> writes:

I think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?

This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way
out. Nonetheless, if UNKNOWN conversion introduces additional failures
then it makes sense to fix that.

regards, tom lane

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: PQescapeBytea is not multibyte aware

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

I think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?

This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way

Yuck! At that point you're no better off than converting to hex (and
worse off than converting to base64) for storage.

SQL99 actually defines BLOB as a binary string literal comprised of an
even number of hexadecimal digits, in single quotes, preceded by "X",
e.g. X'1a43fe'. Should we be looking at implementing the standard
instead of, or in addition to, octalizing? Maybe it is possible to do
this by creating a new datatype, BLOB, which uses new IN/OUT functions,
but otherwise uses the various bytea functions?

out. Nonetheless, if UNKNOWN conversion introduces additional failures
then it makes sense to fix that.

I'll follow up on this then.

Joe

#10

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Joe Conway (#9)

Re: PQescapeBytea is not multibyte aware

Joe Conway <mail@joeconway.com> writes:

This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way

Yuck! At that point you're no better off than converting to hex (and
worse off than converting to base64) for storage.

No; the *storage* is still compact, it's just the I/O representation
that's not.

SQL99 actually defines BLOB as a binary string literal comprised of an
even number of hexadecimal digits, in single quotes, preceded by "X",
e.g. X'1a43fe'. Should we be looking at implementing the standard
instead of, or in addition to, octalizing?

Perhaps we should cause the system to regard hex-strings as literals of
type bytea. Right now I think they're taken to be integer constants,
which is clearly not per spec.

regards, tom lane

#11

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: PQescapeBytea is not multibyte aware

Tom Lane wrote:

Yuck! At that point you're no better off than converting to hex
(and worse off than converting to base64) for storage.

No; the *storage* is still compact, it's just the I/O representation
that's not.

Yeah, I realized that after I pushed send ;)

But still, doesn't that mean roughly twice as much memory usage for each
copy of the string? And I seem to remember Jan saying that each datum
winds up having 4 copies in memory. It ends up impacting the practical
length limit for a bytea value.

SQL99 actually defines BLOB as a binary string literal comprised
of an even number of hexadecimal digits, in single quotes,
preceded by "X", e.g. X'1a43fe'. Should we be looking at
implementing the standard instead of, or in addition to,
octalizing?

Perhaps we should cause the system to regard hex-strings as literals
of type bytea. Right now I think they're taken to be integer
constants, which is clearly not per spec.

Wow. I didn't realize this was possible:

test=# select X'ffff';
?column?
----------
65535
(1 row)

This does clearly conflict with the spec, but what about backward
compatibility? Do you think many people use this capability?

Joe

#12

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Joe Conway (#11)

Re: PQescapeBytea is not multibyte aware

Joe Conway <mail@joeconway.com> writes:

But still, doesn't that mean roughly twice as much memory usage for each
copy of the string? And I seem to remember Jan saying that each datum
winds up having 4 copies in memory. It ends up impacting the practical
length limit for a bytea value.

Well, once the data actually reaches Datum form it'll be in internal
representation, hence compact. I'm not sure how many copies the parser
will make in the process of casting to UNKNOWN and then to bytea, but
I'm not terribly concerned by the above argument.

Wow. I didn't realize this was possible:

test=# select X'ffff';
?column?
----------
65535
(1 row)

This does clearly conflict with the spec, but what about backward
compatibility? Do you think many people use this capability?

No idea. I don't think it's documented anywhere, though...

regards, tom lane

#13

Tatsuo Ishii

t-ishii@sra.co.jp

about 24 years ago

In reply to: Tom Lane (#2)

Re: PQescapeBytea is not multibyte aware

Hmm, so essentially we'd have to convert all codes >= 0x80 to prevent
them from being mistaken for parts of multibyte sequences?

Yes.

Ugh, but
you're probably right. It looks to me like byteaout does the reverse
already.

As for the new UNKNOWN data type, that seems a good thing for
me. However, I think more aggressive soultion would be having an
encoding info in the text data type itself. This would also opens the
way to implement SQL99's CREATE CHARACTER SET stuffs. I have been
thinking about this for a while and want to make a RFC in the future(I
need to rethink my idea to adopt the SCHEMA you introduced).

BTW, for the 7.2.x tree we need a solution with lesser impact.
For this purpose, I would like to change PQescapeBytea as I stated in
the previous mail. Objection?
--
Tatsuo Ishii

#14

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: PQescapeBytea is not multibyte aware

Tatsuo Ishii wrote:

BTW, for the 7.2.x tree we need a solution with lesser impact.
For this purpose, I would like to change PQescapeBytea as I stated in
the previous mail. Objection?
--
Tatsuo Ishii

No objection here, but can we wrap the change in #ifdef MULTIBYTE so
there's no effect for people who don't use MULTIBYTE?

Joe

#15

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Joe Conway (#14)

Re: PQescapeBytea is not multibyte aware

Joe Conway <mail@joeconway.com> writes:

No objection here, but can we wrap the change in #ifdef MULTIBYTE so
there's no effect for people who don't use MULTIBYTE?

That opens up the standard set of issues about "what if your server is
MULTIBYTE but your libpq is not?" It seems risky to me.

regards, tom lane

#16

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

unknownin/out patch (was [HACKERS] PQescapeBytea is not multibyte aware)

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

I think you're correct that in a client/database encoding mismatch
scenario, there would be bigger problems. Thoughts on this?

This scenario is probably why Tatsuo wants PQescapeBytea to octalize
everything with the high bit set; I'm not sure there's any lesser way
out. Nonetheless, if UNKNOWN conversion introduces additional failures
then it makes sense to fix that.

regards, tom lane

Here's a patch to add unknownin/unknownout support. I also poked around
looking for places that assume UNKNOWN == TEXT. One of those was the
"SET" type in pg_type.h, which was using textin/textout. This one I took
care of in this patch. The other suspicious place was in
string_to_dataum (which is defined in both selfuncs.c and indxpath.c). I
wasn't too sure about those, so I left them be.

Regression tests all pass with the exception of horology, which also
fails on CVS tip. It looks like that is a daylight savings time issue
though.

Also as a side note, I can't get make check to get past initdb if I
configure with --enable-multibyte on CVS tip. Is there a known problem
or am I just being clueless . . .wait, let's qualify that -- am I being
clueless on this one issue? ;-)

Joe

#17

Joe Conway

mail@joeconway.com

about 24 years ago

In reply to: Tatsuo Ishii (#1)

Re: unknownin/out patch (was [HACKERS] PQescapeBytea is

Joe Conway wrote:

Here's a patch to add unknownin/unknownout support. I also poked around
looking for places that assume UNKNOWN == TEXT. One of those was the
"SET" type in pg_type.h, which was using textin/textout. This one I took
care of in this patch. The other suspicious place was in
string_to_dataum (which is defined in both selfuncs.c and indxpath.c). I
wasn't too sure about those, so I left them be.

I found three other suspicious spots in the source, where UNKNOWN ==
TEXT is assumed. The first looks like it needs to be changed for sure,
the other two I'm less sure about. Feedback would be most appreciated
(on this and the patch itself).

(line numbers based on CVS from earlier today)
parse_node.c
line 428
parse_coerce.c
line 85
parse_coerce.c
line 403

Joe

#18

Tatsuo Ishii

t-ishii@sra.co.jp

about 24 years ago

In reply to: Tom Lane (#15)

Re: PQescapeBytea is not multibyte aware

Joe Conway <mail@joeconway.com> writes:

No objection here, but can we wrap the change in #ifdef MULTIBYTE so
there's no effect for people who don't use MULTIBYTE?

That opens up the standard set of issues about "what if your server is
MULTIBYTE but your libpq is not?" It seems risky to me.

I have committed changes to the current source (without MULTIBYTE
ifdes). Will change t.2-stable tree soon.

I also added some careful handlings for memory allocation errors and
changed some questionable codes useing direct ASCII values 92 instead
of '\\' for example.
--
Tatsuo Ishii

#19

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Joe Conway (#16)

Re: unknownin/out patch (was [HACKERS] PQescapeBytea is not multibyte aware)

Joe Conway <mail@joeconway.com> writes:

Regression tests all pass with the exception of horology, which also
fails on CVS tip. It looks like that is a daylight savings time issue
though.

Yup, ye olde DST-transition-makes-for-funny-day-length issue. This is
mentioned in the docs at
http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/regress-evaluation.html#AEN18363
although I see the troublesome tests are now in horology not timestamp.
(Docs fixed...)

Also as a side note, I can't get make check to get past initdb if I
configure with --enable-multibyte on CVS tip. Is there a known problem

News to me --- anyone else seeing that?

regards, tom lane

#20

Christopher Kings-Lynne

chriskl@familyhealth.com.au

about 24 years ago

In reply to: Tom Lane (#19)

Re: unknownin/out patch (was [HACKERS] PQescapeBytea is not multibyte aware)

Yup, ye olde DST-transition-makes-for-funny-day-length issue. This is
mentioned in the docs at
http://www.ca.postgresql.org/users-lounge/docs/7.2/postgres/regres
s-evaluation.html#AEN18363
although I see the troublesome tests are now in horology not timestamp.
(Docs fixed...)

Also as a side note, I can't get make check to get past initdb if I
configure with --enable-multibyte on CVS tip. Is there a known problem

News to me --- anyone else seeing that?

I get initdb failures all the time when building CVS. You need to gmake
clean to fix some things. Try doing a gmake clean && gmake check

Chris

#21

Tom Lane

tgl@sss.pgh.pa.us

about 24 years ago

In reply to: Tom Lane (#19)

#22

Tatsuo Ishii

t-ishii@sra.co.jp

about 24 years ago

In reply to: Tom Lane (#21)