Decision by Monday: PQescapeString() vs. encoding violation

tgl@sss.pgh.pa.us

11 months ago

In reply to: Andrew Dunstan (#3)

1 attachment(s)

Re: Decision by Monday: PQescapeString() vs. encoding violation

I studied the v3 patch a little and realized that it only fixes the
behavior for the case of a complete-but-invalid multibyte character.
If we have an incomplete character at the end of the string, the
whole thing still gets dropped. Surely that's not what we want if
we are going to adopt this behavior.

Here's a v4 that fixes that. As a bonus, we can get rid of
duplicative coding since the "incomplete" and "invalid" cases
are now treated identically.

One minor point is that the error messages from PQescapeStringInternal
no longer distinguish "incomplete" from "invalid". I don't find that
to be a terribly useful distinction, so that's fine with me. But if
someone feels that's important to preserve, we could make it do
something like

if (conn)
{
if (remaining < charlen)
libpq_append_conn_error(conn, "incomplete multibyte character");
else
libpq_append_conn_error(conn, "invalid multibyte character");
}

regards, tom lane

Attachments:

v4-0001-Make-escaping-functions-retain-trailing-bytes-of-.patchtext/x-diff; charset=us-ascii; name*0=v4-0001-Make-escaping-functions-retain-trailing-bytes-of-.p; name*1=atchDownload

From 935ccea6da82d0150f65672825c8f5e6f86bca89 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 15 Feb 2025 12:58:22 -0500
Subject: [PATCH v4] Make escaping functions retain trailing bytes of an
 invalid character.

Instead of dropping the trailing byte(s) of an invalid or incomplete
multibyte character, replace only the first byte with a known-invalid
sequence, and process the rest normally.  This seems less likely to
confuse incautious callers than the behavior adopted in 5dc1e42b4.

Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20250215012712.45@rfd.leadboat.com
Backpatch-through: 13
---
 src/fe_utils/string_utils.c    | 91 +++++++++++++---------------------
 src/interfaces/libpq/fe-exec.c | 57 ++++++++-------------
 2 files changed, 55 insertions(+), 93 deletions(-)

diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index b6a7b19708..bdf35db614 100644
--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -180,40 +180,25 @@ fmtIdEnc(const char *rawid, int encoding)
 			/* Slow path for possible multibyte characters */
 			charlen = pg_encoding_mblen(encoding, cp);
 
-			if (remaining < charlen)
-			{
-				/*
-				 * If the character is longer than the available input,
-				 * replace the string with an invalid sequence. The invalid
-				 * sequence ensures that the escaped string will trigger an
-				 * error on the server-side, even if we can't directly report
-				 * an error here.
-				 */
-				enlargePQExpBuffer(id_return, 2);
-				pg_encoding_set_invalid(encoding,
-										id_return->data + id_return->len);
-				id_return->len += 2;
-				id_return->data[id_return->len] = '\0';
-
-				/* there's no more input data, so we can stop */
-				break;
-			}
-			else if (pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
+			if (remaining < charlen ||
+				pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
 			{
 				/*
 				 * Multibyte character is invalid.  It's important to verify
-				 * that as invalid multi-byte characters could e.g. be used to
+				 * that as invalid multibyte characters could e.g. be used to
 				 * "skip" over quote characters, e.g. when parsing
 				 * character-by-character.
 				 *
-				 * Replace the bytes corresponding to the invalid character
-				 * with an invalid sequence, for the same reason as above.
+				 * Replace the character's first byte with an invalid
+				 * sequence. The invalid sequence ensures that the escaped
+				 * string will trigger an error on the server-side, even if we
+				 * can't directly report an error here.
 				 *
 				 * It would be a bit faster to verify the whole string the
 				 * first time we encounter a set highbit, but this way we can
-				 * replace just the invalid characters, which probably makes
-				 * it easier for users to find the invalidly encoded portion
-				 * of a larger string.
+				 * replace just the invalid data, which probably makes it
+				 * easier for users to find the invalidly encoded portion of a
+				 * larger string.
 				 */
 				enlargePQExpBuffer(id_return, 2);
 				pg_encoding_set_invalid(encoding,
@@ -222,11 +207,13 @@ fmtIdEnc(const char *rawid, int encoding)
 				id_return->data[id_return->len] = '\0';
 
 				/*
-				 * Copy the rest of the string after the invalid multi-byte
-				 * character.
+				 * Handle the following bytes as if this byte didn't exist.
+				 * That's safer in case the subsequent bytes contain
+				 * characters that are significant for the caller (e.g. '>' in
+				 * html).
 				 */
-				remaining -= charlen;
-				cp += charlen;
+				remaining--;
+				cp++;
 			}
 			else
 			{
@@ -395,49 +382,39 @@ appendStringLiteral(PQExpBuffer buf, const char *str,
 		/* Slow path for possible multibyte characters */
 		charlen = PQmblen(source, encoding);
 
-		if (remaining < charlen)
-		{
-			/*
-			 * If the character is longer than the available input, replace
-			 * the string with an invalid sequence. The invalid sequence
-			 * ensures that the escaped string will trigger an error on the
-			 * server-side, even if we can't directly report an error here.
-			 *
-			 * We know there's enough space for the invalid sequence because
-			 * the "target" buffer is 2 * length + 2 long, and at worst we're
-			 * replacing a single input byte with two invalid bytes.
-			 */
-			pg_encoding_set_invalid(encoding, target);
-			target += 2;
-
-			/* there's no more valid input data, so we can stop */
-			break;
-		}
-		else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
+		if (remaining < charlen ||
+			pg_encoding_verifymbchar(encoding, source, charlen) == -1)
 		{
 			/*
 			 * Multibyte character is invalid.  It's important to verify that
-			 * as invalid multi-byte characters could e.g. be used to "skip"
+			 * as invalid multibyte characters could e.g. be used to "skip"
 			 * over quote characters, e.g. when parsing
 			 * character-by-character.
 			 *
-			 * Replace the bytes corresponding to the invalid character with
-			 * an invalid sequence, for the same reason as above.
+			 * Replace the character's first byte with an invalid sequence.
+			 * The invalid sequence ensures that the escaped string will
+			 * trigger an error on the server-side, even if we can't directly
+			 * report an error here.
+			 *
+			 * We know there's enough space for the invalid sequence because
+			 * the "target" buffer is 2 * length + 2 long, and at worst we're
+			 * replacing a single input byte with two invalid bytes.
 			 *
 			 * It would be a bit faster to verify the whole string the first
 			 * time we encounter a set highbit, but this way we can replace
-			 * just the invalid characters, which probably makes it easier for
-			 * users to find the invalidly encoded portion of a larger string.
+			 * just the invalid data, which probably makes it easier for users
+			 * to find the invalidly encoded portion of a larger string.
 			 */
 			pg_encoding_set_invalid(encoding, target);
 			target += 2;
-			remaining -= charlen;
 
 			/*
-			 * Copy the rest of the string after the invalid multi-byte
-			 * character.
+			 * Handle the following bytes as if this byte didn't exist. That's
+			 * safer in case the subsequent bytes contain important characters
+			 * for the caller (e.g. '>' in html).
 			 */
-			source += charlen;
+			source++;
+			remaining--;
 		}
 		else
 		{
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 120d4d032e..69554cd603 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -4102,50 +4102,34 @@ PQescapeStringInternal(PGconn *conn,
 		/* Slow path for possible multibyte characters */
 		charlen = pg_encoding_mblen(encoding, source);
 
-		if (remaining < charlen)
+		if (remaining < charlen ||
+			pg_encoding_verifymbchar(encoding, source, charlen) == -1)
 		{
 			/*
-			 * If the character is longer than the available input, report an
-			 * error if possible, and replace the string with an invalid
-			 * sequence. The invalid sequence ensures that the escaped string
-			 * will trigger an error on the server-side, even if we can't
-			 * directly report an error here.
+			 * Multibyte character is invalid.  It's important to verify that
+			 * as invalid multibyte characters could e.g. be used to "skip"
+			 * over quote characters, e.g. when parsing
+			 * character-by-character.
+			 *
+			 * Report an error if possible, and replace the character's first
+			 * byte with an invalid sequence. The invalid sequence ensures
+			 * that the escaped string will trigger an error on the
+			 * server-side, even if we can't directly report an error here.
 			 *
 			 * This isn't *that* crucial when we can report an error to the
-			 * caller, but if we can't, the caller will use this string
-			 * unmodified and it needs to be safe for parsing.
+			 * caller; but if we can't or the caller ignores it, the caller
+			 * will use this string unmodified and it needs to be safe for
+			 * parsing.
 			 *
 			 * We know there's enough space for the invalid sequence because
 			 * the "to" buffer needs to be at least 2 * length + 1 long, and
 			 * at worst we're replacing a single input byte with two invalid
 			 * bytes.
-			 */
-			if (error)
-				*error = 1;
-			if (conn)
-				libpq_append_conn_error(conn, "incomplete multibyte character");
-
-			pg_encoding_set_invalid(encoding, target);
-			target += 2;
-
-			/* there's no more input data, so we can stop */
-			break;
-		}
-		else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
-		{
-			/*
-			 * Multibyte character is invalid.  It's important to verify that
-			 * as invalid multi-byte characters could e.g. be used to "skip"
-			 * over quote characters, e.g. when parsing
-			 * character-by-character.
-			 *
-			 * Replace the bytes corresponding to the invalid character with
-			 * an invalid sequence, for the same reason as above.
 			 *
 			 * It would be a bit faster to verify the whole string the first
 			 * time we encounter a set highbit, but this way we can replace
-			 * just the invalid characters, which probably makes it easier for
-			 * users to find the invalidly encoded portion of a larger string.
+			 * just the invalid data, which probably makes it easier for users
+			 * to find the invalidly encoded portion of a larger string.
 			 */
 			if (error)
 				*error = 1;
@@ -4154,13 +4138,14 @@ PQescapeStringInternal(PGconn *conn,
 
 			pg_encoding_set_invalid(encoding, target);
 			target += 2;
-			remaining -= charlen;
 
 			/*
-			 * Copy the rest of the string after the invalid multi-byte
-			 * character.
+			 * Handle the following bytes as if this byte didn't exist. That's
+			 * safer in case the subsequent bytes contain important characters
+			 * for the caller (e.g. '>' in html).
 			 */
-			source += charlen;
+			source++;
+			remaining--;
 		}
 		else
 		{
-- 
2.43.5

andres@anarazel.de

11 months ago

In reply to: Tom Lane (#4)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Hi,

On 2025-02-15 13:08:03 -0500, Tom Lane wrote:

I studied the v3 patch a little and realized that it only fixes the
behavior for the case of a complete-but-invalid multibyte character.
If we have an incomplete character at the end of the string, the
whole thing still gets dropped. Surely that's not what we want if
we are going to adopt this behavior.

Good catch! The different behaviour made sense when we were dropping the
entire multi-byte character, but not anymore.

Here's a v4 that fixes that. As a bonus, we can get rid of
duplicative coding since the "incomplete" and "invalid" cases
are now treated identically.

One minor point is that the error messages from PQescapeStringInternal
no longer distinguish "incomplete" from "invalid". I don't find that
to be a terribly useful distinction, so that's fine with me. But if
someone feels that's important to preserve, we could make it do
something like

I think there's some value in the distinction, because incomplete characters
can happen a lot more easily due to truncating longer strings. I don't know if
it's worth the code complexity though.

From 935ccea6da82d0150f65672825c8f5e6f86bca89 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 15 Feb 2025 12:58:22 -0500
Subject: [PATCH v4] Make escaping functions retain trailing bytes of an
invalid character.

Instead of dropping the trailing byte(s) of an invalid or incomplete
multibyte character, replace only the first byte with a known-invalid
sequence, and process the rest normally. This seems less likely to
confuse incautious callers than the behavior adopted in 5dc1e42b4.

Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>

This should have been a Reported-by, not a Reviewed-by, my mistake...

It seems that nobody is arguing against the "just skip one byte" behaviour, so
I'm inclined to push this fairly soon, even if Noah's "24 hours" haven't quite
elapsed. A few more cycles in the buildfarm wouldn't hurt.

Greetings,

Andres Freund

tgl@sss.pgh.pa.us

11 months ago

In reply to: Andres Freund (#5)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Andres Freund <andres@anarazel.de> writes:

It seems that nobody is arguing against the "just skip one byte" behaviour, so
I'm inclined to push this fairly soon, even if Noah's "24 hours" haven't quite
elapsed. A few more cycles in the buildfarm wouldn't hurt.

Agreed. I thought there would be more discussion, but it seems
nobody really objects to changing this.

The other thing that was discussed in the security thread was
modifying PQescapeStringInternal and PQescapeInternal to produce
no more than one complaint about invalid multibyte characters,
on the grounds that input that's just plain in some other encoding
would otherwise produce a ton of repetitive messages. That seems
trivial enough to mechanize with a bool already_complained flag,
so I think we should incorporate that refinement while we're here.
I can write that patch if you're busy.

regards, tom lane

Noah Misch

noah@leadboat.com

11 months ago

In reply to: Tom Lane (#6)

Re: Decision by Monday: PQescapeString() vs. encoding violation

On Sat, Feb 15, 2025 at 01:23:51PM -0500, Andres Freund wrote:

On 2025-02-15 13:08:03 -0500, Tom Lane wrote:

I studied the v3 patch a little and realized that it only fixes the
behavior for the case of a complete-but-invalid multibyte character.
If we have an incomplete character at the end of the string, the
whole thing still gets dropped. Surely that's not what we want if
we are going to adopt this behavior.

Thanks for doing that.

Good catch! The different behaviour made sense when we were dropping the
entire multi-byte character, but not anymore.

Here's a v4 that fixes that. As a bonus, we can get rid of
duplicative coding since the "incomplete" and "invalid" cases
are now treated identically.

One minor point is that the error messages from PQescapeStringInternal
no longer distinguish "incomplete" from "invalid". I don't find that
to be a terribly useful distinction, so that's fine with me. But if
someone feels that's important to preserve, we could make it do
something like

I think there's some value in the distinction, because incomplete characters
can happen a lot more easily due to truncating longer strings. I don't know if
it's worth the code complexity though.

While I agree it doesn't feel like a terribly-important distinction, I'd
definitely keep that distinction today rather than drop it with so little
notice. (I'd feel fine about dropping it in a major release. I'd probably
not drop it in any minor release, but especially not this close to wrap.)

On Sat, Feb 15, 2025 at 01:35:10PM -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

It seems that nobody is arguing against the "just skip one byte" behaviour, so
I'm inclined to push this fairly soon, even if Noah's "24 hours" haven't quite
elapsed. A few more cycles in the buildfarm wouldn't hurt.

Works for me; we can always revert in the event of further discussion. For
what it's worth, I didn't intend this 24hr as a hard period for gating any
particular action.

Agreed. I thought there would be more discussion, but it seems
nobody really objects to changing this.

The other thing that was discussed in the security thread was
modifying PQescapeStringInternal and PQescapeInternal to produce
no more than one complaint about invalid multibyte characters,
on the grounds that input that's just plain in some other encoding
would otherwise produce a ton of repetitive messages. That seems
trivial enough to mechanize with a bool already_complained flag,
so I think we should incorporate that refinement while we're here.

tgl@sss.pgh.pa.us

11 months ago

In reply to: Noah Misch (#7)

1 attachment(s)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Noah Misch <noah@leadboat.com> writes:

On 2025-02-15 13:08:03 -0500, Tom Lane wrote:

The other thing that was discussed in the security thread was
modifying PQescapeStringInternal and PQescapeInternal to produce
no more than one complaint about invalid multibyte characters,
on the grounds that input that's just plain in some other encoding
would otherwise produce a ton of repetitive messages. That seems
trivial enough to mechanize with a bool already_complained flag,
so I think we should incorporate that refinement while we're here.

+1

On closer inspection, PQescapeInternal already issues only one
error message, since it does "return NULL" after detecting the
first error. So this makes PQescapeStringInternal behave more
like that.

regards, tom lane

Attachments:

v5-0001-Make-escaping-functions-retain-trailing-bytes-of-.patchtext/x-diff; charset=us-ascii; name*0=v5-0001-Make-escaping-functions-retain-trailing-bytes-of-.p; name*1=atchDownload

From 565b42cecf645b1dbde277512fd67836ee9081d1 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 15 Feb 2025 14:10:33 -0500
Subject: [PATCH v5] Make escaping functions retain trailing bytes of an
 invalid character.

Instead of dropping the trailing byte(s) of an invalid or incomplete
multibyte character, replace only the first byte with a known-invalid
sequence, and process the rest normally.  This seems less likely to
confuse incautious callers than the behavior adopted in 5dc1e42b4.

Author: Andres Freund <andres@anarazel.de>
Reported-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20250215012712.45@rfd.leadboat.com
Backpatch-through: 13
---
 src/fe_utils/string_utils.c    | 91 +++++++++++++---------------------
 src/interfaces/libpq/fe-exec.c | 69 ++++++++++++--------------
 2 files changed, 65 insertions(+), 95 deletions(-)

diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index b6a7b19708..bdf35db614 100644
--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -180,40 +180,25 @@ fmtIdEnc(const char *rawid, int encoding)
 			/* Slow path for possible multibyte characters */
 			charlen = pg_encoding_mblen(encoding, cp);
 
-			if (remaining < charlen)
-			{
-				/*
-				 * If the character is longer than the available input,
-				 * replace the string with an invalid sequence. The invalid
-				 * sequence ensures that the escaped string will trigger an
-				 * error on the server-side, even if we can't directly report
-				 * an error here.
-				 */
-				enlargePQExpBuffer(id_return, 2);
-				pg_encoding_set_invalid(encoding,
-										id_return->data + id_return->len);
-				id_return->len += 2;
-				id_return->data[id_return->len] = '\0';
-
-				/* there's no more input data, so we can stop */
-				break;
-			}
-			else if (pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
+			if (remaining < charlen ||
+				pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
 			{
 				/*
 				 * Multibyte character is invalid.  It's important to verify
-				 * that as invalid multi-byte characters could e.g. be used to
+				 * that as invalid multibyte characters could e.g. be used to
 				 * "skip" over quote characters, e.g. when parsing
 				 * character-by-character.
 				 *
-				 * Replace the bytes corresponding to the invalid character
-				 * with an invalid sequence, for the same reason as above.
+				 * Replace the character's first byte with an invalid
+				 * sequence. The invalid sequence ensures that the escaped
+				 * string will trigger an error on the server-side, even if we
+				 * can't directly report an error here.
 				 *
 				 * It would be a bit faster to verify the whole string the
 				 * first time we encounter a set highbit, but this way we can
-				 * replace just the invalid characters, which probably makes
-				 * it easier for users to find the invalidly encoded portion
-				 * of a larger string.
+				 * replace just the invalid data, which probably makes it
+				 * easier for users to find the invalidly encoded portion of a
+				 * larger string.
 				 */
 				enlargePQExpBuffer(id_return, 2);
 				pg_encoding_set_invalid(encoding,
@@ -222,11 +207,13 @@ fmtIdEnc(const char *rawid, int encoding)
 				id_return->data[id_return->len] = '\0';
 
 				/*
-				 * Copy the rest of the string after the invalid multi-byte
-				 * character.
+				 * Handle the following bytes as if this byte didn't exist.
+				 * That's safer in case the subsequent bytes contain
+				 * characters that are significant for the caller (e.g. '>' in
+				 * html).
 				 */
-				remaining -= charlen;
-				cp += charlen;
+				remaining--;
+				cp++;
 			}
 			else
 			{
@@ -395,49 +382,39 @@ appendStringLiteral(PQExpBuffer buf, const char *str,
 		/* Slow path for possible multibyte characters */
 		charlen = PQmblen(source, encoding);
 
-		if (remaining < charlen)
-		{
-			/*
-			 * If the character is longer than the available input, replace
-			 * the string with an invalid sequence. The invalid sequence
-			 * ensures that the escaped string will trigger an error on the
-			 * server-side, even if we can't directly report an error here.
-			 *
-			 * We know there's enough space for the invalid sequence because
-			 * the "target" buffer is 2 * length + 2 long, and at worst we're
-			 * replacing a single input byte with two invalid bytes.
-			 */
-			pg_encoding_set_invalid(encoding, target);
-			target += 2;
-
-			/* there's no more valid input data, so we can stop */
-			break;
-		}
-		else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
+		if (remaining < charlen ||
+			pg_encoding_verifymbchar(encoding, source, charlen) == -1)
 		{
 			/*
 			 * Multibyte character is invalid.  It's important to verify that
-			 * as invalid multi-byte characters could e.g. be used to "skip"
+			 * as invalid multibyte characters could e.g. be used to "skip"
 			 * over quote characters, e.g. when parsing
 			 * character-by-character.
 			 *
-			 * Replace the bytes corresponding to the invalid character with
-			 * an invalid sequence, for the same reason as above.
+			 * Replace the character's first byte with an invalid sequence.
+			 * The invalid sequence ensures that the escaped string will
+			 * trigger an error on the server-side, even if we can't directly
+			 * report an error here.
+			 *
+			 * We know there's enough space for the invalid sequence because
+			 * the "target" buffer is 2 * length + 2 long, and at worst we're
+			 * replacing a single input byte with two invalid bytes.
 			 *
 			 * It would be a bit faster to verify the whole string the first
 			 * time we encounter a set highbit, but this way we can replace
-			 * just the invalid characters, which probably makes it easier for
-			 * users to find the invalidly encoded portion of a larger string.
+			 * just the invalid data, which probably makes it easier for users
+			 * to find the invalidly encoded portion of a larger string.
 			 */
 			pg_encoding_set_invalid(encoding, target);
 			target += 2;
-			remaining -= charlen;
 
 			/*
-			 * Copy the rest of the string after the invalid multi-byte
-			 * character.
+			 * Handle the following bytes as if this byte didn't exist. That's
+			 * safer in case the subsequent bytes contain important characters
+			 * for the caller (e.g. '>' in html).
 			 */
-			source += charlen;
+			source++;
+			remaining--;
 		}
 		else
 		{
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 120d4d032e..294843ed8b 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -4076,6 +4076,7 @@ PQescapeStringInternal(PGconn *conn,
 	const char *source = from;
 	char	   *target = to;
 	size_t		remaining = strnlen(from, length);
+	bool		already_complained = false;
 
 	if (error)
 		*error = 0;
@@ -4102,65 +4103,57 @@ PQescapeStringInternal(PGconn *conn,
 		/* Slow path for possible multibyte characters */
 		charlen = pg_encoding_mblen(encoding, source);
 
-		if (remaining < charlen)
+		if (remaining < charlen ||
+			pg_encoding_verifymbchar(encoding, source, charlen) == -1)
 		{
 			/*
-			 * If the character is longer than the available input, report an
-			 * error if possible, and replace the string with an invalid
-			 * sequence. The invalid sequence ensures that the escaped string
-			 * will trigger an error on the server-side, even if we can't
-			 * directly report an error here.
+			 * Multibyte character is invalid.  It's important to verify that
+			 * as invalid multibyte characters could e.g. be used to "skip"
+			 * over quote characters, e.g. when parsing
+			 * character-by-character.
+			 *
+			 * Report an error if possible, and replace the character's first
+			 * byte with an invalid sequence. The invalid sequence ensures
+			 * that the escaped string will trigger an error on the
+			 * server-side, even if we can't directly report an error here.
 			 *
 			 * This isn't *that* crucial when we can report an error to the
-			 * caller, but if we can't, the caller will use this string
-			 * unmodified and it needs to be safe for parsing.
+			 * caller; but if we can't or the caller ignores it, the caller
+			 * will use this string unmodified and it needs to be safe for
+			 * parsing.
 			 *
 			 * We know there's enough space for the invalid sequence because
 			 * the "to" buffer needs to be at least 2 * length + 1 long, and
 			 * at worst we're replacing a single input byte with two invalid
 			 * bytes.
-			 */
-			if (error)
-				*error = 1;
-			if (conn)
-				libpq_append_conn_error(conn, "incomplete multibyte character");
-
-			pg_encoding_set_invalid(encoding, target);
-			target += 2;
-
-			/* there's no more input data, so we can stop */
-			break;
-		}
-		else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
-		{
-			/*
-			 * Multibyte character is invalid.  It's important to verify that
-			 * as invalid multi-byte characters could e.g. be used to "skip"
-			 * over quote characters, e.g. when parsing
-			 * character-by-character.
-			 *
-			 * Replace the bytes corresponding to the invalid character with
-			 * an invalid sequence, for the same reason as above.
 			 *
 			 * It would be a bit faster to verify the whole string the first
 			 * time we encounter a set highbit, but this way we can replace
-			 * just the invalid characters, which probably makes it easier for
-			 * users to find the invalidly encoded portion of a larger string.
+			 * just the invalid data, which probably makes it easier for users
+			 * to find the invalidly encoded portion of a larger string.
 			 */
 			if (error)
 				*error = 1;
-			if (conn)
-				libpq_append_conn_error(conn, "invalid multibyte character");
+			if (conn && !already_complained)
+			{
+				if (remaining < charlen)
+					libpq_append_conn_error(conn, "incomplete multibyte character");
+				else
+					libpq_append_conn_error(conn, "invalid multibyte character");
+				/* Issue a complaint only once per string */
+				already_complained = true;
+			}
 
 			pg_encoding_set_invalid(encoding, target);
 			target += 2;
-			remaining -= charlen;
 
 			/*
-			 * Copy the rest of the string after the invalid multi-byte
-			 * character.
+			 * Handle the following bytes as if this byte didn't exist. That's
+			 * safer in case the subsequent bytes contain important characters
+			 * for the caller (e.g. '>' in html).
 			 */
-			source += charlen;
+			source++;
+			remaining--;
 		}
 		else
 		{
-- 
2.43.5

andres@anarazel.de

11 months ago

In reply to: Tom Lane (#8)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Hi,

On 2025-02-15 14:12:06 -0500, Tom Lane wrote:

On closer inspection, PQescapeInternal already issues only one
error message, since it does "return NULL" after detecting the
first error. So this makes PQescapeStringInternal behave more
like that.

This looks good to me.

I looked through the diff between what test_escape -v before/after this change
prints out, looks good to me.

From 565b42cecf645b1dbde277512fd67836ee9081d1 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 15 Feb 2025 14:10:33 -0500
Subject: [PATCH v5] Make escaping functions retain trailing bytes of an
invalid character.

Instead of dropping the trailing byte(s) of an invalid or incomplete
multibyte character, replace only the first byte with a known-invalid
sequence, and process the rest normally. This seems less likely to
confuse incautious callers than the behavior adopted in 5dc1e42b4.

Author: Andres Freund <andres@anarazel.de>

I think you deserve primary authorship for the change now...

Are you planning to push / backpatch, or should I?

Greetings,

Andres Freund

#10

pgsql@j-davis.com

11 months ago

In reply to: Noah Misch (#1)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Expanding on the reasoning a bit:

This discussion is only relevant only when the application meets the
following conditions:

A. Sends invalidly-encoded input to an escaping routine. Many
languages protect against this, such as python and rust. But other
languages, like C, Go, and Ruby do not.

B. Uses PQescapeStringConn() and ignores the error, or an escaping
routine that doesn't provide an error at all. (Earlier versions of
PQescapeStringConn ignored invalid sequences in the middle of the
string, but 5dc1e42b4f fixed that.)

C. Does some kind of post-processing to the escaped output to remove
invalidly-encoded data before sending it to the server. If the invalid
data wasn't removed, the server would throw an immediate error anyway.

Given that the situation already involves A, B & C, it's already
somewhat of a special case and I don't consider it a clear-and-present
security threat beyond what was fixed for the CVE. Instead, this is
more about following an accepted practice that is less likely to result
in security problems even if the application code is imperfect.

On Fri, 2025-02-14 at 17:27 -0800, Noah Misch wrote:

Question for all: would you switch to the "remove fewer bytes"
behavior in
next week's releases, switch later, or switch never? Why so?

The reasoning why it's the right change are:

1. It seems reasonable to assume most invalid byte sequences in the
middle of a string usually resulted from some kind of byte splicing or
concatenation. For instance, an application taking untrusted input
bytes, validating only at the byte level, and then concatenating with
other text (which might be done by string interpolation or by applying
a template).

- Given the above assumption, valid initial bytes in a byte sequence
are much more likely to be the start of a new valid character than part
of the previous invalid character.

- If it's the start of a new valid character, it might be an
important delimiter, and removing it could have security consequences,
perhaps not for Postgres itself, but for whatever the application does
with that string (with a missing delimiter) later.

2. The comparable risk on the other side seems substantially lower.
Let's say your application considers 0x7C ('|') to be a very important
delimiter. If the untrusted input contains, e.g. "\xC2\x7C" (invalid
sequence), one could imagine PQescapeStringConn() saving the day by
removing both bytes. But that seems more like an accident than anything
else: the application needs to validate the untrusted input regardless;
otherwise the attacker could just supply a valid '|'.

- Caveat: the application's validation could be broken in such a way
that it skips over multibyte characters without checking for invalid
sequences, i.e. seeing the \xC2 and skipping over the \x7C without
escaping it. Unfortunately, that's exactly what PQescapeStringConn()
did until recently, so this is a reasonable argument in favor of
"remove more bytes". But we shouldn't favor multibyte-half-aware
validators at the cost of breaking byte-by-byte validators.

3. Unicode strongly recommends that we take the "remove fewer bytes"
option. They are in a good position to have an informed opinion about
the security consequences of either choice, so this is partially an
appeal to authority. But that also means that it's the way other string
handling functions are likley to behave, e.g. Ruby's String#scrub seems
to take the "remove fewer bytes" approach. Following convention has its
own security benefits.

Regards,
Jeff Davis

#11

[1]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G48534
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G48534

pgsql@j-davis.com

11 months ago

In reply to: Noah Misch (#1)

Re: Decision by Monday: PQescapeString() vs. encoding violation

On Fri, 2025-02-14 at 17:27 -0800, Noah Misch wrote:

I'm attaching a WIP patch from Andres Freund.

I am not suggesting a change, but there's a minor point about the
behavior of the replacement that I'd like to highlight:

Unicode discusses a choice[1]https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G48534: "An ill-formed subsequence consisting of
more than one code unit could be treated as a single error or as
multiple errors."

The patch implements the latter. Escaping:
<7A F0 80 80 41 7A>
results in:
<7A C0 20 C0 20 C0 20 41 7A>

The Unicode standard suggests[2]https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453 that the former approach may provide
more consistency in how it's done, but that doesn't seem important or
relevant for our purposes. I'd favor whichever approach results in
simpler code.

Regards,
Jeff Davis

[2]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453

#12

andres@anarazel.de

11 months ago

In reply to: Jeff Davis (#11)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Hi,

On 2025-02-15 12:35:45 -0800, Jeff Davis wrote:

I am not suggesting a change, but there's a minor point about the
behavior of the replacement that I'd like to highlight:

Unicode discusses a choice[1]: "An ill-formed subsequence consisting of
more than one code unit could be treated as a single error or as
multiple errors."

The patch implements the latter. Escaping:
<7A F0 80 80 41 7A>
results in:
<7A C0 20 C0 20 C0 20 41 7A>

The Unicode standard suggests[2] that the former approach may provide
more consistency in how it's done, but that doesn't seem important or
relevant for our purposes. I'd favor whichever approach results in
simpler code.

It seems completely infeasible to me to to implement the "single error"
approach in a minor version. It'd afaict require non-trivial new
infrastructure. We can't just consume up to the next byte without a high bit,
because some encodings have subsequent bytes that are not guaranteed to have a
high bit set.

Greetings,

Andres Freund

#13

tgl@sss.pgh.pa.us

11 months ago

In reply to: Andres Freund (#12)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Andres Freund <andres@anarazel.de> writes:

On 2025-02-15 12:35:45 -0800, Jeff Davis wrote:

I am not suggesting a change, but there's a minor point about the
behavior of the replacement that I'd like to highlight:
Unicode discusses a choice[1]: "An ill-formed subsequence consisting of
more than one code unit could be treated as a single error or as
multiple errors."

It seems completely infeasible to me to to implement the "single error"
approach in a minor version. It'd afaict require non-trivial new
infrastructure. We can't just consume up to the next byte without a high bit,
because some encodings have subsequent bytes that are not guaranteed to have a
high bit set.

Yeah. Also I think that probably depends on being able to tell the
difference between a first byte and a not-first byte of a multibyte
character, something that works in UTF-8 but not necessarily elsewhere.
As I commented in the security thread, Unicode's recommendations seem
pretty UTF-8-centric; I'm hesitant to adopt them wholesale in code
that has to deal with other encodings.

The v5 patch seems Good Enough(TM) to me. We can refine it later
perhaps; I don't think something like the above would affect
anything that external code should care about.

regards, tom lane

#14

pgsql@j-davis.com

11 months ago

In reply to: Andres Freund (#12)

Re: Decision by Monday: PQescapeString() vs. encoding violation

On Sat, 2025-02-15 at 15:43 -0500, Andres Freund wrote:

It seems completely infeasible to me to to implement the "single
error"
approach in a minor version.

+1, keep with the behavior in the proposed patches.

Even in the next major version, I'd be inclined to try to move toward
interfaces that can produce an error reliably rather than spending more
time trying to use a "better" pattern of invalid bytes.

I just wanted to point out that some kind of decision was made and link
to relevant background information.

Regards,
Jeff Davis

#15

tgl@sss.pgh.pa.us

11 months ago

In reply to: Andres Freund (#9)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Andres Freund <andres@anarazel.de> writes:

Are you planning to push / backpatch, or should I?

I can take care of it, if you like.

regards, tom lane

#16

andres@anarazel.de

11 months ago

In reply to: Tom Lane (#13)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Hi,

On 2025-02-15 15:52:01 -0500, Tom Lane wrote:

The v5 patch seems Good Enough(TM) to me.

Agreed.

We can refine it later perhaps; I don't think something like the above would
affect anything that external code should care about.

I don't really think it's worth spending cycles on this anytime soon. It makes
sense to put the effort in to replace invalid "characters" in a minimal way
when intending to actually use the "stripped" output permanently. But all
we're trying to do here is to a) ensure that the backend will error out b)
reduce the chances that other tooling (psql, xml parsers, ..) get confused due
to invalidly encoded data.

Greetings,

Andres Freund

#17

andres@anarazel.de

11 months ago

In reply to: Tom Lane (#15)

Re: Decision by Monday: PQescapeString() vs. encoding violation

On 2025-02-15 15:53:20 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Are you planning to push / backpatch, or should I?

I can take care of it, if you like.

That'd be appreciated!

#18

tgl@sss.pgh.pa.us

11 months ago

In reply to: Andres Freund (#17)

Re: Decision by Monday: PQescapeString() vs. encoding violation

Andres Freund <andres@anarazel.de> writes:

On 2025-02-15 15:53:20 -0500, Tom Lane wrote:

I can take care of it, if you like.

That'd be appreciated!

Done.

regards, tom lane

#19