invalidly encoded strings
I have been looking at fixing the issue of accepting strings that are
not valid in the database encoding. It appears from previous discussion
that we need to add a call to pg_verifymbstr() to the relevant input
routines and ensure that the chr() function returns a valid string. That
leaves several issues:
. which are the relevant input routines? I have identified the following
as needing remediation: textin(), bpcharin(), varcharin(), anyenum_in(),
namein(). Do we also need one for cstring_in()? Does the xml code
handle this as part of xml validation?
. what do we need to do to make the verification code more efficient? I
think we need to address the correctness issue first, but doing so
should certainly make us want to improve the verification code. For
example, I'm wondering if it might benefit from having a tiny cache.
. for chr() under UTF8, it seems to be generally agreed that the
argument should represent the codepoint and the function should return
the correspondingly encoded character. If so, possibly the argument
should be a bigint to accommodate the full range of possible code
points. It is not clear what the argument should represent for other
multi-byte encodings for any argument higher than 127. Similarly, it is
not clear what ascii() should return in such cases. I would be inclined
just to error out.
cheers
andrew
On Sun, Sep 09, 2007 at 12:02:28AM -0400, Andrew Dunstan wrote:
. what do we need to do to make the verification code more efficient? I
think we need to address the correctness issue first, but doing so
should certainly make us want to improve the verification code. For
example, I'm wondering if it might benefit from having a tiny cache.
It has been pointed out that the verification for UTF-8 is very
inefficient, involving several function calls to first get the length,
then check characters, etc. It could be significantly improved. I don't
know whether a cache would make any useful difference.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
On Sun, Sep 09, 2007 at 12:02:28AM -0400, Andrew Dunstan wrote:
. what do we need to do to make the verification code more efficient? I
think we need to address the correctness issue first, but doing so
should certainly make us want to improve the verification code. For
example, I'm wondering if it might benefit from having a tiny cache.
It has been pointed out that the verification for UTF-8 is very
inefficient, involving several function calls to first get the length,
then check characters, etc. It could be significantly improved. I don't
know whether a cache would make any useful difference.
Well, it looks to me like "several" = 2. If function call overhead is
the worry, we could either create inline versions of pg_utf_mblen and
pg_utf8_islegal and rely on the compiler to make things good for us, or
inline the calls directly ourselves. Either way there's some
duplication, since the whole "extern inline" thing is still a mess. I'd
be inclined to do the inlining by hand since it's not that much code and
it would be more portable. Unless I'm missing something it should be a
10 minute c&p job.
Is that all we're worried about?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
I have been looking at fixing the issue of accepting strings that are
not valid in the database encoding. It appears from previous discussion
that we need to add a call to pg_verifymbstr() to the relevant input
routines and ensure that the chr() function returns a valid string. That
leaves several issues:
. which are the relevant input routines? I have identified the following
as needing remediation: textin(), bpcharin(), varcharin(), anyenum_in(),
namein(). Do we also need one for cstring_in()? Does the xml code
handle this as part of xml validation?
This seems entirely the wrong approach, because 99% of the time a
check placed here will be redundant with the one in the main
client-input logic. (That was, indeed, the reason I took such checks
out of these places in the first place.) Moreover, as you've already
found out there are N places that would have to be fixed and we'd face
a constant hazard of new datatypes introducing new holes.
AFAICS the risk comes only from chr() and the possibility of numeric
backslash-escapes in SQL string literals, and we ought to think about
fixing it in those places.
A possible answer is to add a verifymbstr to the string literal
converter anytime it has processed a numeric backslash-escape in the
string. Open questions for that are (1) does it have negative effects
for bytea, and if so is there any hope of working around it? (2) how
can we postpone the conversion/test to the parse analysis phase?
(To the extent that database encoding is frozen it'd probably be OK
to do it in the scanner, but such a choice will come back to bite
us eventually.)
. for chr() under UTF8, it seems to be generally agreed that the
argument should represent the codepoint and the function should return
the correspondingly encoded character. If so, possibly the argument
should be a bigint to accommodate the full range of possible code
points. It is not clear what the argument should represent for other
multi-byte encodings for any argument higher than 127. Similarly, it is
not clear what ascii() should return in such cases. I would be inclined
just to error out.
In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings,
OK with throwing error.
regards, tom lane
Tom Lane wrote:
A possible answer is to add a verifymbstr to the string literal
converter anytime it has processed a numeric backslash-escape in the
string. Open questions for that are (1) does it have negative effects
for bytea, and if so is there any hope of working around it? (2) how
can we postpone the conversion/test to the parse analysis phase?
Finding out how to do (2) seems to me the only possible answer to (1).
I'll have a look.
Is that going to cover data coming in via COPY? and parameters for
prepared statements?
. for chr() under UTF8, it seems to be generally agreed that the
argument should represent the codepoint and the function should return
the correspondingly encoded character. If so, possibly the argument
should be a bigint to accommodate the full range of possible code
points. It is not clear what the argument should represent for other
multi-byte encodings for any argument higher than 127. Similarly, it is
not clear what ascii() should return in such cases. I would be inclined
just to error out.
In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings,
OK with throwing error.
I was planning on allowing up to 255 for all single byte encodings too.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Is that going to cover data coming in via COPY? and parameters for
prepared statements?
Those should be checked already --- if not, the right fix is still to
fix it there, not in per-datatype code. I think we are OK though,
eg see "need_transcoding" logic in copy.c.
In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings,
OK with throwing error.
I was planning on allowing up to 255 for all single byte encodings too.
OK, that sounds fine.
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Is that going to cover data coming in via COPY? and parameters for
prepared statements?
Those should be checked already --- if not, the right fix is still to
fix it there, not in per-datatype code. I think we are OK though,
eg see "need_transcoding" logic in copy.c.
Well, a little experimentation shows that we currently are not OK:
in foo.data:
\366\66
utf8test=# \copy xx from foo.data
utf8test=# select encode(t::bytea,'hex') from xx;
encode
--------
f636
(1 row)
utf8test=# \copy xx to bb.data
utf8test=# \copy xx from bb.data
ERROR: invalid byte sequence for encoding "UTF8": 0xf636
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY xx, line 1
utf8test=#
BTW, all the foo_recv functions that call pq_getmsgtext or
pq_getmsgstring are thereby calling pg_verify_mbstr already (via
pg_client_to_server). So I am still not 100% convinced that doing the
same directly in the corresponding foo_in functions is such a bad idea.
cheers
andrew
On Sun, 2007-09-09 at 10:51 -0400, Tom Lane wrote:
A possible answer is to add a verifymbstr to the string literal
converter anytime it has processed a numeric backslash-escape in the
string. Open questions for that are (1) does it have negative effects
for bytea, and if so is there any hope of working around it? (2) how
can we postpone the conversion/test to the parse analysis phase?
(To the extent that database encoding is frozen it'd probably be OK
to do it in the scanner, but such a choice will come back to bite
us eventually.)
Regarding #1:
Currently, you can pass a bytea literal as either: E'\377\377\377' or
E'\\377\\377\\377'.
The first strategy (single backslash) is not correct, because if you do
E'\377\000\377', the embedded null character counts as the end of the
cstring, even though there are bytes after it. Similar strange things
happen if you have a E'\134' (backslash) somewhere in the string.
However, I have no doubt that there are people who use the first
strategy anyway, and the proposed change would break that for them.
There may be more issues, too.
Regards,
Jeff Davis
Andrew Dunstan <andrew@dunslane.net> writes:
Well, a little experimentation shows that we currently are not OK:
This experiment is inadequately described.
What is the type of the column involved?
regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes:
Currently, you can pass a bytea literal as either: E'\377\377\377' or
E'\\377\\377\\377'.
The first strategy (single backslash) is not correct, because if you do
E'\377\000\377', the embedded null character counts as the end of the
cstring, even though there are bytes after it. Similar strange things
happen if you have a E'\134' (backslash) somewhere in the string.
However, I have no doubt that there are people who use the first
strategy anyway, and the proposed change would break that for them.
If their code is broken anyway, calling their attention to it would be a
good idea, hm?
If we are not going to reject the embedded-null case then there is
hardly any point in considering any behavioral change at all here.
I want to point out in particular that Andrew's proposal of making
datatype input functions responsible for encoding verification cannot
possibly handle this, since they have to take the "terminating" null
at face value.
regards, tom lane
On Sun, 2007-09-09 at 17:09 -0400, Tom Lane wrote:
Jeff Davis <pgsql@j-davis.com> writes:
Currently, you can pass a bytea literal as either: E'\377\377\377' or
E'\\377\\377\\377'.
The first strategy (single backslash) is not correct, because if you do
E'\377\000\377', the embedded null character counts as the end of the
cstring, even though there are bytes after it. Similar strange things
happen if you have a E'\134' (backslash) somewhere in the string.
However, I have no doubt that there are people who use the first
strategy anyway, and the proposed change would break that for them.
If their code is broken anyway, calling their attention to it would be a
good idea, hm?
Agreed.
If we are not going to reject the embedded-null case then there is
hardly any point in considering any behavioral change at all here.
I want to point out in particular that Andrew's proposal of making
datatype input functions responsible for encoding verification cannot
possibly handle this, since they have to take the "terminating" null
at face value.
Would stringTypeDatum() in parse_type.c be a good place to put the
pg_verifymbstr()?
Regards,
Jeff Davis
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Well, a little experimentation shows that we currently are not OK:
This experiment is inadequately described.
What is the type of the column involved?
Sorry. It's text.
cheers
andrew
Jeff Davis <pgsql@j-davis.com> writes:
Would stringTypeDatum() in parse_type.c be a good place to put the
pg_verifymbstr()?
Probably not, in its current form, since it hasn't got any idea where
the "char *string" came from; moreover it is not in any better position
than the typinput function to determine whether there was a bogus
embedded null.
OTOH, there may be no decent way to fix the embedded-null problem
other than by hacking the scanner to reject \0 immediately. If we
did that it would give us more flexibility about where to put the
encoding validity checks.
In any case, I feel dubious that checking in stringTypeDatum will cover
every code path. Somewhere around where A_Const gets transformed to
Const seems like it'd be a better plan. (But I think that in most
utility statement parsetrees, A_Const never does get transformed to
Const; and there seem to be a few places in gram.y where an SCONST
gives rise to something other than A_Const; so this is still not a
bulletproof choice, at least not without additional changes.)
In the short run it might be best to do it in scan.l after all. A few
minutes' thought about what it'd take to delay the decisions till later
yields a depressingly large number of changes; and we do not have time
to be developing mostly-cosmetic patches for 8.3. Given that
database_encoding is frozen for any one DB at the moment, and that that
is unlikely to change in the near future, insisting on a solution that
allows it to vary is probably unreasonable at this stage of the game.
regards, tom lane
Tom Lane wrote:
In the short run it might be best to do it in scan.l after all.
I have not come up with a way of doing that and handling the bytea case.
If you have I'm all ears. And then I am still worried about COPY.
cheers
andrew
On Sun, 2007-09-09 at 23:22 -0400, Tom Lane wrote:
In the short run it might be best to do it in scan.l after all. A few
minutes' thought about what it'd take to delay the decisions till later
yields a depressingly large number of changes; and we do not have time
to be developing mostly-cosmetic patches for 8.3. Given that
database_encoding is frozen for any one DB at the moment, and that that
is unlikely to change in the near future, insisting on a solution that
allows it to vary is probably unreasonable at this stage of the game.
Sounds reasonable to me.
Regards,
Jeff Davis
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
In the short run it might be best to do it in scan.l after all.
I have not come up with a way of doing that and handling the bytea case.
AFAICS we have no realistic choice other than to reject \0 in SQL
literals; to do otherwise requires API changes throughout that stack of
modules. And once you admit \0 is no good, it's not clear that
\somethingelse is any better for bytea-using clients. Moreover, given
that we are moving away from backslash escapes as fast as we can sanely
travel, expending large amounts of energy to make them work better
doesn't seem like a good use of development manpower.
If you have I'm all ears. And then I am still worried about COPY.
Haven't looked at your test case yet... but it sure looks to me like the
COPY code *ought* to cover this.
regards, tom lane
On Sun, 2007-09-09 at 23:33 -0400, Andrew Dunstan wrote:
Tom Lane wrote:
In the short run it might be best to do it in scan.l after all.
I have not come up with a way of doing that and handling the bytea case.
If you have I'm all ears. And then I am still worried about COPY.
If it's done in the scanner it should still accept things like:
E'\\377\\000\\377'::bytea
right?
I think the concern is when they use only one slash, like:
E'\377\000\377'::bytea
which, as I mentioned before, is not correct anyway.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
If it's done in the scanner it should still accept things like:
E'\\377\\000\\377'::bytea
right?
Right, that will work, because the transformed literal is '\377\000\377'
(no strange characters there, just what it says) and that has not got
any encoding violations.
I think the concern is when they use only one slash, like:
E'\377\000\377'::bytea
which, as I mentioned before, is not correct anyway.
Note also that if you have standard_conforming_strings set to ON,
this gives the very same result:
'\377\000\377'::bytea
I'm not convinced that we need to move heaven and earth to make this
work the same with or without E''. Basically what I'm thinking is we
should just decree that the de-escaped string literal still has to be
valid per the database encoding. If you don't want to be bound by that,
use the right number of backslashes to get the thing passed through to
the bytea input routine.
regards, tom lane
Tom Lane wrote:
. for chr() under UTF8, it seems to be generally agreed
that the argument should represent the codepoint and the
function should return the correspondingly encoded character.
If so, possibly the argument should be a bigint to
accommodate the full range of possible code points.
It is not clear what the argument should represent for other
multi-byte encodings for any argument higher than 127.
Similarly, it is not clear what ascii() should return in
such cases. I would be inclined just to error out.
In SQL_ASCII I'd argue for allowing 0..255. In actual MB
encodings, OK with throwing error.
I'd like to repeat my suggestion for chr() and ascii().
Instead of the code point, I'd prefer the actual encoding of
the character as argument to chr() and return value of ascii().
The advantage I see is the following:
- It would make these functions from oracle_compat.c
compatible with Oracle (Oracle's chr() and ascii() work
the way I suggest).
I agree with Tom's earlier suggestion to throw an error for
chr(0), although this is not what Oracle does.
Of course, if it is generally perceived that the code point
is more useful than the encoding, then Oracle compliance
is probably secondary.
Yours,
Laurenz Albe
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
Jeff Davis <pgsql@j-davis.com> writes:
I think the concern is when they use only one slash, like:
E'\377\000\377'::bytea
which, as I mentioned before, is not correct anyway.
Wait, why would this be wrong? How would you enter the three-byte bytea
consisting of those three bytes described above?
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com