[PATCH] json_lex_string: don't overread on bad UTF8

Started by Jacob Champion over 1 year ago · 14 messages
#1Jacob Champion
jacob.champion@enterprisedb.com
1 attachment(s)

Hi all,

When json_lex_string() hits certain types of invalid input, it calls
pg_encoding_mblen_bounded(), which assumes that its input is
null-terminated and calls strnlen(). But the JSON lexer is constructed
with an explicit string length, and we don't ensure that the string is
null-terminated in all cases, so we can walk off the end of the
buffer. This isn't really relevant on the server side, where you'd
have to get a superuser to help you break string encodings, but for
client-side usage on untrusted input (such as my OAuth patch) it would
be more important.
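
To make the failure mode concrete, here's a minimal standalone sketch of
the hazard (hypothetical names, not the actual jsonapi.c code): a
strnlen()-based length check keeps scanning for a NUL terminator that the
caller never wrote.

#include <string.h>

/* hypothetical stand-in for the strnlen()-based helper */
static int
mblen_bounded_like(const char *s, int reported_char_len)
{
	/* strnlen() reads byte by byte until it finds a NUL or hits the limit */
	return (int) strnlen(s, reported_char_len);
}

int
main(void)
{
	/* two bytes, not NUL-terminated: '"' followed by the bad lead byte 0xF5 */
	char		buf[2] = {'"', (char) 0xF5};

	/*
	 * A UTF-8 length table reports 4 bytes for an 0xF5 lead byte, so the
	 * helper scans up to 4 bytes starting at buf[1] looking for a NUL,
	 * reading past the end of buf since only one byte actually remains.
	 */
	(void) mblen_bounded_like(&buf[1], 4);
	return 0;
}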

Attached is a draft patch that explicitly checks against the
end-of-string pointer and clamps the token_terminator to it. Note that
this removes the only caller of pg_encoding_mblen_bounded() and I'm
not sure what we should do with that function. It seems like a
reasonable API, just not here.

The new test needs to record two versions of the error message, one
for invalid token and one for invalid escape sequence. This is
because, for smaller chunk sizes, the partial-token logic in the
incremental JSON parser skips the affected code entirely when it can't
find an ending double-quote.

Tangentially: Should we maybe rethink pieces of the json_lex_string
error handling? For example, do we really want to echo an incomplete
multibyte sequence once we know it's bad? It also looks like there are
places where the FAIL_AT_CHAR_END macro is called after the `s`
pointer has already advanced past the code point of interest. I'm not
sure if that's intentional.

Thanks,
--Jacob

Attachments:

0001-json_lex_string-don-t-overread-on-bad-UTF8.patch (application/octet-stream)
From bba7744e59941d8bb2f039e631d090d0e3956d6c Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Mon, 8 Apr 2024 15:31:17 -0700
Subject: [PATCH] json_lex_string: don't overread on bad UTF8

Inputs to pg_parse_json[_incremental] are not guaranteed to be
null-terminated, so pg_encoding_mblen_bounded (which uses strnlen) can
walk off the end of the buffer. Check against the end pointer instead.

TODO:
- pg_encoding_mblen_bounded() now has no callers; should we remove it?
- Do we really want to print incomplete UTF-8 sequences as-is once we
  know they're bad?
---
 src/common/jsonapi.c                              | 5 +++--
 src/test/modules/test_json_parser/t/002_inline.pl | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fc0cb36974..6633503490 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -1689,8 +1689,9 @@ json_lex_string(JsonLexContext *lex)
 	} while (0)
 #define FAIL_AT_CHAR_END(code) \
 	do { \
-		lex->token_terminator = \
-			s + pg_encoding_mblen_bounded(lex->input_encoding, s); \
+		lex->token_terminator = s + pg_encoding_mblen(lex->input_encoding, s); \
+		if (lex->token_terminator >= end) \
+			lex->token_terminator = end; \
 		return code; \
 	} while (0)
 
diff --git a/src/test/modules/test_json_parser/t/002_inline.pl b/src/test/modules/test_json_parser/t/002_inline.pl
index f83cec03f8..0335a26f47 100644
--- a/src/test/modules/test_json_parser/t/002_inline.pl
+++ b/src/test/modules/test_json_parser/t/002_inline.pl
@@ -128,5 +128,9 @@ test(
 	"incorrect escape count",
 	'"\\\\\\\\\\\\\\"',
 	error => qr/Token ""\\\\\\\\\\\\\\"" is invalid/);
+test(
+	"incomplete UTF-8 sequence",
+	"\"\\\x{F5}",  # three bytes: double-quote, backslash, <f5>
+	error => qr/(Token|Escape sequence) ""?\\\x{F5}" is invalid/);
 
 done_testing();
-- 
2.34.1

#2Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#1)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Tue, Apr 30, 2024 at 10:39:04AM -0700, Jacob Champion wrote:

When json_lex_string() hits certain types of invalid input, it calls
pg_encoding_mblen_bounded(), which assumes that its input is
null-terminated and calls strnlen(). But the JSON lexer is constructed
with an explicit string length, and we don't ensure that the string is
null-terminated in all cases, so we can walk off the end of the
buffer. This isn't really relevant on the server side, where you'd
have to get a superuser to help you break string encodings, but for
client-side usage on untrusted input (such as my OAuth patch) it would
be more important.

I'm not sure I much like the fact that this advances token_terminator
first. Wouldn't it be better to calculate pg_encoding_mblen() first,
then save token_terminator? I feel a bit uneasy about saving a value
in token_terminator past the end of the string. That's a nit in this
context, still..

Attached is a draft patch that explicitly checks against the
end-of-string pointer and clamps the token_terminator to it. Note that
this removes the only caller of pg_encoding_mblen_bounded() and I'm
not sure what we should do with that function. It seems like a
reasonable API, just not here.

Hmm. Keeping it around as currently designed means that it could
cause more harm than anything in the long term, even in the stable
branches if new code uses it. There is a risk of new code incorrectly
using it again, even if its top comment says that we rely on the
string being nul-terminated. A safer alternative would be to redesign
it so that the length of the string is provided as input, removing the
dependency on strnlen in its internals, perhaps. Anyway, without any
callers of it, I'd be tempted to wipe it from HEAD and call it a day.
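
For illustration, a length-aware replacement could look roughly like this
(hypothetical name; just a sketch of the idea, not a concrete proposal):

#include <stddef.h>

#include "mb/pg_wchar.h"

/*
 * Sketch: never report more bytes than actually remain in the buffer,
 * without relying on NUL termination.
 */
int
pg_encoding_mblen_clamped(int encoding, const char *mbstr, size_t remaining)
{
	int			len = pg_encoding_mblen(encoding, mbstr);

	return (len <= (int) remaining) ? len : (int) remaining;
}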

The new test needs to record two versions of the error message, one
for invalid token and one for invalid escape sequence. This is
because, for smaller chunk sizes, the partial-token logic in the
incremental JSON parser skips the affected code entirely when it can't
find an ending double-quote.

Ah, that makes sense. That looks OK here. A comment around the test
to document that would be appropriate, I guess.

Tangentially: Should we maybe rethink pieces of the json_lex_string
error handling? For example, do we really want to echo an incomplete
multibyte sequence once we know it's bad? It also looks like there are
places where the FAIL_AT_CHAR_END macro is called after the `s`
pointer has already advanced past the code point of interest. I'm not
sure if that's intentional.

Advancing the tracking pointer 's' before reporting an error related
to the end of the string is bad practice, IMO, and we should avoid
that. json_lex_string() does not offer a warm feeling regarding that
with escape characters, at least :/
--
Michael

#3Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Michael Paquier (#2)
1 attachment(s)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier <michael@paquier.xyz> wrote:

I'm not sure I much like the fact that this advances token_terminator
first. Wouldn't it be better to calculate pg_encoding_mblen() first,
then save token_terminator? I feel a bit uneasy about saving a value
in token_terminator past the end of the string. That's a nit in this
context, still..

v2 tries it that way; see what you think. Is the concern that someone
might add code later that escapes that macro early?

Hmm. Keeping it around as currently designed means that it could
cause more harm than anything in the long term, even in the stable
branches if new code uses it. There is a risk of new code incorrectly
using it again, even if its top comment says that we rely on the
string being nul-terminated. A safer alternative would be to redesign
it so that the length of the string is provided as input, removing the
dependency on strnlen in its internals, perhaps. Anyway, without any
callers of it, I'd be tempted to wipe it from HEAD and call it a day.

Removed in v2.

The new test needs to record two versions of the error message, one
for invalid token and one for invalid escape sequence. This is
because, for smaller chunk sizes, the partial-token logic in the
incremental JSON parser skips the affected code entirely when it can't
find an ending double-quote.

Ah, that makes sense. That looks OK here. A comment around the test
to document that would be appropriate, I guess.

Done.

Advancing the tracking pointer 's' before reporting an error related
to the end of the string is bad practice, IMO, and we should avoid
that. json_lex_string() does not offer a warm feeling regarding that
with escape characters, at least :/

Yeah... I think some expansion of the json_errdetail test coverage is
probably in my future. :)

Thanks,
--Jacob

Attachments:

v2-0001-json_lex_string-don-t-overread-on-bad-UTF8.patch (application/octet-stream)
From 7573859b6f66b4ed370725f33077361c1cb81cb7 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Mon, 8 Apr 2024 15:31:17 -0700
Subject: [PATCH v2] json_lex_string: don't overread on bad UTF8

Inputs to pg_parse_json[_incremental] are not guaranteed to be
null-terminated, so pg_encoding_mblen_bounded (which uses strnlen) can
walk off the end of the buffer. Check against the end pointer instead.

pg_encoding_mblen_bounded() no longer has any callers and has been
removed.

TODO:
- Do we really want to print incomplete UTF-8 sequences as-is once we
  know they're bad?
---
 src/common/jsonapi.c                              |  4 ++--
 src/common/wchar.c                                | 13 +------------
 src/include/mb/pg_wchar.h                         |  1 -
 src/test/modules/test_json_parser/t/002_inline.pl |  8 ++++++++
 4 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fc0cb36974..26e1f43ed3 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -1689,8 +1689,8 @@ json_lex_string(JsonLexContext *lex)
 	} while (0)
 #define FAIL_AT_CHAR_END(code) \
 	do { \
-		lex->token_terminator = \
-			s + pg_encoding_mblen_bounded(lex->input_encoding, s); \
+		char	   *term = s + pg_encoding_mblen(lex->input_encoding, s); \
+		lex->token_terminator = (term <= end) ? term : end; \
 		return code; \
 	} while (0)
 
diff --git a/src/common/wchar.c b/src/common/wchar.c
index 76b7dfdfcb..97e9b61dba 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -2062,8 +2062,7 @@ const pg_wchar_tbl pg_wchar_table[] = {
  *
  * Caution: when dealing with text that is not certainly valid in the
  * specified encoding, the result may exceed the actual remaining
- * string length.  Callers that are not prepared to deal with that
- * should use pg_encoding_mblen_bounded() instead.
+ * string length.
  */
 int
 pg_encoding_mblen(int encoding, const char *mbstr)
@@ -2073,16 +2072,6 @@ pg_encoding_mblen(int encoding, const char *mbstr)
 			pg_wchar_table[PG_SQL_ASCII].mblen((const unsigned char *) mbstr));
 }
 
-/*
- * Returns the byte length of a multibyte character; but not more than
- * the distance to end of string.
- */
-int
-pg_encoding_mblen_bounded(int encoding, const char *mbstr)
-{
-	return strnlen(mbstr, pg_encoding_mblen(encoding, mbstr));
-}
-
 /*
  * Returns the display length of a multibyte character.
  */
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 249cd18a35..ac65bfcbef 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -663,7 +663,6 @@ extern int	pg_valid_server_encoding_id(int encoding);
  * earlier in this file are also available from libpgcommon.
  */
 extern int	pg_encoding_mblen(int encoding, const char *mbstr);
-extern int	pg_encoding_mblen_bounded(int encoding, const char *mbstr);
 extern int	pg_encoding_dsplen(int encoding, const char *mbstr);
 extern int	pg_encoding_verifymbchar(int encoding, const char *mbstr, int len);
 extern int	pg_encoding_verifymbstr(int encoding, const char *mbstr, int len);
diff --git a/src/test/modules/test_json_parser/t/002_inline.pl b/src/test/modules/test_json_parser/t/002_inline.pl
index f83cec03f8..60bb930e92 100644
--- a/src/test/modules/test_json_parser/t/002_inline.pl
+++ b/src/test/modules/test_json_parser/t/002_inline.pl
@@ -128,5 +128,13 @@ test(
 	"incorrect escape count",
 	'"\\\\\\\\\\\\\\"',
 	error => qr/Token ""\\\\\\\\\\\\\\"" is invalid/);
+test(
+	"incomplete UTF-8 sequence",
+	# Three bytes: double-quote, backslash, <f5>
+	"\"\\\x{F5}",
+	# Both invalid-token and invalid-escape are possible, because for smaller
+	# chunk sizes the incremental parser will skip the string parsing when it
+	# can't find an ending quote.
+	error => qr/(Token|Escape sequence) ""?\\\x{F5}" is invalid/);
 
 done_testing();
-- 
2.34.1

#4Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#3)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Wed, May 01, 2024 at 04:22:24PM -0700, Jacob Champion wrote:

On Tue, Apr 30, 2024 at 11:09 PM Michael Paquier <michael@paquier.xyz> wrote:

I'm not sure I much like the fact that this advances token_terminator
first. Wouldn't it be better to calculate pg_encoding_mblen() first,
then save token_terminator? I feel a bit uneasy about saving a value
in token_terminator past the end of the string. That's a nit in this
context, still..

v2 tries it that way; see what you think. Is the concern that someone
might add code later that escapes that macro early?

Yeah, I am not sure if that's something that would really happen, but
that looks like good practice to follow anyway, so as to keep a clean
stack at any time.

Ah, that makes sense. That looks OK here. A comment around the test
to document that would be appropriate, I guess.

Done.

That seems OK at a quick glance. I don't have much room to do anything
about this patch this week, as an effect of Golden Week and the
buildfarm, but I should be able to get to it next week once the
next round of minor releases is tagged.

About the fact that we may end up printing unfinished UTF-8
sequences, I'd be curious to hear your thoughts. Now, the information
provided about the partial byte sequences can also be useful for
debugging, on top of having the error code, no?
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#4)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:

About the fact that we may end up printing unfinished UTF-8
sequences, I'd be curious to hear your thoughts. Now, the information
provided about the partial byte sequences can also be useful for
debugging, on top of having the error code, no?

By the way, as long as I have that in mind.. I am not sure that it is
worth spending cycles detecting the unfinished sequences and making
them printable. Wouldn't it be enough for most cases to adjust
token_error() to truncate the byte sequences we cannot print?

Another thing that I think would be nice would be to calculate the
location of what we're parsing on a given line, and provide that in
the error context. That would not be backpatchable as it requires a
change in JsonLexContext, unfortunately, but it would help make more
sense of an error when the incomplete byte sequence is at the
beginning of a token or after an expected character.
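
As a rough sketch of the idea (hypothetical helper, not a concrete design
for JsonLexContext), the location could be derived from the start of the
input and the failure point along these lines:

/*
 * Illustrative only: compute the 1-based line and column of errpos within
 * input.  A real version would need to be multibyte-aware for the column.
 */
static void
locate_error(const char *input, const char *errpos, int *line, int *column)
{
	const char *p;

	*line = 1;
	*column = 1;
	for (p = input; p < errpos; p++)
	{
		if (*p == '\n')
		{
			(*line)++;
			*column = 1;
		}
		else
			(*column)++;
	}
}
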
--
Michael

#6Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Michael Paquier (#5)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Wed, May 1, 2024 at 8:40 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:

About the fact that we may end up printing unfinished UTF-8
sequences, I'd be curious to hear your thoughts. Now, the information
provided about the partial byte sequences can also be useful for
debugging, on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

By the way, as long as I have that in mind.. I am not sure that it is
worth spending cycles detecting the unfinished sequences and making
them printable. Wouldn't it be enough for most cases to adjust
token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed.

There is the matter of a client that's not using UTF-8, though. Do we
deal with that correctly today? (I understand why it was done the way
it was, at least on the server side, but it's still really weird to
have code that parses "JSON" that isn't actually Unicode.)

Another thing that I think would be nice would be to calculate the
location of what we're parsing on a given line, and provide that in
the error context. That would not be backpatchable as it requires a
change in JsonLexContext, unfortunately, but it would help make more
sense of an error when the incomplete byte sequence is at the
beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.

Thanks,
--Jacob

#7Peter Eisentraut
peter@eisentraut.org
In reply to: Jacob Champion (#1)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On 30.04.24 19:39, Jacob Champion wrote:

Tangentially: Should we maybe rethink pieces of the json_lex_string
error handling? For example, do we really want to echo an incomplete
multibyte sequence once we know it's bad?

I can't quite find the place you might be looking at in
json_lex_string(), but for the general encoding conversion we have what
would appear to be the same behavior in report_invalid_encoding(), and
we go out of our way there to produce a verbose error message including
the invalid data.
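
A rough standalone sketch of that idea, i.e. rendering the offending bytes
as hex escapes so the message stays printable (hypothetical helper, not the
actual report_invalid_encoding() code):

#include <stdio.h>

/* format up to nbytes of an invalid sequence as "0xNN 0xNN ..." */
static void
format_bad_bytes(const unsigned char *bytes, size_t nbytes,
				 char *out, size_t outlen)
{
	size_t		off = 0;
	size_t		i;

	for (i = 0; i < nbytes && off + 6 <= outlen; i++)
		off += (size_t) snprintf(out + off, outlen - off,
								 i == 0 ? "0x%02x" : " 0x%02x", bytes[i]);
}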

#8Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Peter Eisentraut (#7)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 30.04.24 19:39, Jacob Champion wrote:

Tangentially: Should we maybe rethink pieces of the json_lex_string
error handling? For example, do we really want to echo an incomplete
multibyte sequence once we know it's bad?

I can't quite find the place you might be looking at in
json_lex_string(),

(json_lex_string() reports the beginning and end of the "area of
interest" via the JsonLexContext; it's json_errdetail() that turns
that into an error message.)

but for the general encoding conversion we have what
would appear to be the same behavior in report_invalid_encoding(), and
we go out of our way there to produce a verbose error message including
the invalid data.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes between delimiters around and it's
up to the caller to handle encodings... the exceptions to that are the
\uXXXX escapes and the error handling.
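
For instance, a \uXXXX escape is exactly the case where the parser has to
produce bytes in some concrete encoding itself rather than copying them
through; a minimal UTF-8-only sketch of that re-encoding step (hypothetical
helper, not the jsonapi.c implementation):

/* encode a Unicode code point (< 0x110000, not a surrogate) as UTF-8 */
static int
unicode_to_utf8_sketch(unsigned int cp, unsigned char *out)
{
	if (cp < 0x80)
	{
		out[0] = (unsigned char) cp;
		return 1;
	}
	if (cp < 0x800)
	{
		out[0] = (unsigned char) (0xC0 | (cp >> 6));
		out[1] = (unsigned char) (0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000)
	{
		out[0] = (unsigned char) (0xE0 | (cp >> 12));
		out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char) (0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char) (0xF0 | (cp >> 18));
	out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char) (0x80 | (cp & 0x3F));
	return 4;
}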

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)
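
To make that trail-byte point concrete with a small standalone example (not
jsonapi.c code): in Shift-JIS, U+8868 is encoded as 0x95 0x5C, and 0x5C is
'\' in ASCII, so a byte-at-a-time scan for escape characters would fire in
the middle of a multibyte character.

#include <stdio.h>

int
main(void)
{
	/* U+8868 in Shift-JIS: lead byte 0x95, trail byte 0x5C ('\') */
	const unsigned char sjis[] = {0x95, 0x5C};
	size_t		i;

	for (i = 0; i < sizeof(sjis); i++)
	{
		if (sjis[i] == '\\')
			printf("byte %zu looks like a backslash escape, "
				   "but it is a Shift-JIS trail byte\n", i);
	}
	return 0;
}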

Thanks,
--Jacob

#9Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#8)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:

On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:

but for the general encoding conversion we have what
would appear to be the same behavior in report_invalid_encoding(), and
we go out of our way there to produce a verbose error message including
the invalid data.

I was looking for that a couple of days ago in the backend but could
not put my finger on it. Thanks.

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes between delimiters around and it's
up to the caller to handle encodings... the exceptions to that are the
\uXXXX escapes and the error handling.

Hmm. That would still leave the backpatch issue at hand, which is
kind of confusing to leave as it is. Would it be complicated to
truncate the entire byte sequence in the error message and just give
up because we cannot do better if the input byte sequence is
incomplete? We could still have some information depending on the
string given as input, which should be enough, hopefully. With the
location pointing to the beginning of the sequence, even better.

Offhand, are all of our supported frontend encodings
self-synchronizing? By that I mean, is it safe to print a partial byte
sequence if the locale isn't UTF-8? (As I type this I'm staring at
Shift-JIS, and thinking "probably not.")

Actually -- hopefully this is not too much of a tangent -- that
further crystallizes a vague unease about the API that I have. The
JsonLexContext is initialized with something called the
"input_encoding", but that encoding is necessarily also the output
encoding for parsed string literals and error messages. For the server
side that's fine, but frontend clients have the input_encoding locked
to UTF-8, which seems like it might cause problems? Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)

Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
its conversion for FRONTEND, I guess? Yep. This limitation looks
like a problem, especially when plugging that into libpq.
--
Michael

#10Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Michael Paquier (#9)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Mon, May 6, 2024 at 8:43 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:

We could port something like that to src/common. IMO that'd be more
suited for an actual conversion routine, though, as opposed to a
parser that for the most part assumes you didn't lie about the input
encoding and is just trying not to crash if you're wrong. Most of the
time, the parser just copies bytes between delimiters around and it's
up to the caller to handle encodings... the exceptions to that are the
\uXXXX escapes and the error handling.

Hmm. That would still leave the backpatch issue at hand, which is
kind of confusing to leave as it is. Would it be complicated to
truncate the entire byte sequence in the error message and just give
up because we cannot do better if the input byte sequence is
incomplete?

Maybe I've misunderstood, but isn't that what's being done in v2?

Maybe I'm missing
code somewhere, but I don't see a conversion routine from
json_errdetail() to the actual client/locale encoding. (And the parser
does not support multibyte input_encodings that contain ASCII in trail
bytes.)

Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
its conversion for FRONTEND, I guess? Yep. This limitation looks
like a problem, especially when plugging that into libpq.

Okay. How we deal with that will likely guide the "optimal" fix to
error reporting, I think...

Thanks,
--Jacob

#11Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#10)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Tue, May 07, 2024 at 02:06:10PM -0700, Jacob Champion wrote:

Maybe I've misunderstood, but isn't that what's being done in v2?

Something a bit different.. I was wondering if it could be possible
to tweak this code to truncate the data in the generated error string
so that the incomplete multi-byte sequence is entirely cut out, which
would come down to setting token_terminator to "s" (last byte before
the incomplete byte sequence) rather than "term" (last byte available,
even if incomplete):

#define FAIL_AT_CHAR_END(code) \
	do { \
		char	   *term = s + pg_encoding_mblen(lex->input_encoding, s); \
		lex->token_terminator = (term <= end) ? term : s; \
		return code; \
	} while (0)

But looking closer, I can see that in the JSON_INVALID_TOKEN case,
when !tok_done, we set token_terminator to point to the end of the
token, and that would include an incomplete byte sequence like in your
case. :/

At the end of the day, I think that I'm OK with your patch to avoid
the overread for now in the back-branches. This situation makes me
uncomfortable and we should put more effort into printing error
messages in a readable format, but that could always be tackled later
as a separate problem.. And I don't see anything backpatchable in the
short term for v16.

Thoughts and/or objections?
--
Michael

#12Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Michael Paquier (#11)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Tue, May 7, 2024 at 10:31 PM Michael Paquier <michael@paquier.xyz> wrote:

But looking closer, I can see that in the JSON_INVALID_TOKEN case,
when !tok_done, we set token_terminator to point to the end of the
token, and that would include an incomplete byte sequence like in your
case. :/

Ah, I see what you're saying. Yeah, that approach would need some more
invasive changes.

This situation makes me
uncomfortable and we should put more effort into printing error
messages in a readable format, but that could always be tackled later
as a separate problem.. And I don't see anything backpatchable in the
short term for v16.

Agreed. Fortunately (or unfortunately?) I think the JSON
client-encoding work is now a prerequisite for OAuth in libpq, so
hopefully some improvements can fall out of that work too.

Thoughts and/or objections?

None here.

Thanks!
--Jacob

#13Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#12)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Wed, May 08, 2024 at 07:01:08AM -0700, Jacob Champion wrote:

On Tue, May 7, 2024 at 10:31 PM Michael Paquier <michael@paquier.xyz> wrote:

But looking closer, I can see that in the JSON_INVALID_TOKEN case,
when !tok_done, we set token_terminator to point to the end of the
token, and that would include an incomplete byte sequence like in your
case. :/

Ah, I see what you're saying. Yeah, that approach would need some more
invasive changes.

My first feeling was actually to do that, and report the location in
the input string where we are seeing issues. All code paths playing
with token_terminator would need to track that.

Agreed. Fortunately (or unfortunately?) I think the JSON
client-encoding work is now a prerequisite for OAuth in libpq, so
hopefully some improvements can fall out of that work too.

I'm afraid so. I don't quite see how this would be OK to tweak on
stable branches, but all areas that could report error states with
partial byte sequence contents would benefit from such a change.

Thoughts and/or objections?

None here.

This is a bit mitigated by the fact that d6607016c738 is recent, but
this has been incorrect since v13, so backpatched down to that.
--
Michael

#14Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Michael Paquier (#13)
Re: [PATCH] json_lex_string: don't overread on bad UTF8

On Wed, May 8, 2024 at 9:27 PM Michael Paquier <michael@paquier.xyz> wrote:

This is a bit mitigated by the fact that d6607016c738 is recent, but
this has been incorrect since v13, so backpatched down to that.

Thank you!

--Jacob