Bug in COPY FROM backslash escaping multi-byte chars

Started by Heikki Linnakangasalmost 5 years ago5 messages

hlinnaka@iki.fi

almost 5 years ago

1 attachment(s)

Hi,

While playing with COPY FROM refactorings in another thread, I noticed
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:

\么.foo

I hope that came through in this email correctly as UTF-8. The string
contains a sequence of: backslash, multibyte-character and a dot.

The documentation says:

Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters

So I believe escaping multi-byte characters is supposed to work, and it
usually does.

However, let's consider the same string in Big5 encoding (in hex escaped
format):

\x5ca45c2e666f6f

The first byte 0x5c, is the backslash. The multi-byte character consists
of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash.

That confuses the parser in CopyReadLineText, so that you get an error:

postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR: end-of-copy marker corrupt
CONTEXT: COPY copytest, line 1

What happens is that when the parser sees the backslash, it looks ahead
at the next byte, and when it's not a dot, it skips over it:

else if (!cstate->opts.csv_mode)

/*
* If we are here, it means we found a backslash followed by
* something other than a period. In non-CSV mode, anything
* after a backslash is special, so we skip over that second
* character too. If we didn't do that \\. would be
* considered an eof-of copy, while in non-CSV mode it is a
* literal backslash followed by a period. In CSV mode,
* backslashes are not special, so we want to process the
* character after the backslash just like a normal character,
* so we don't increment in those cases.
*/
raw_buf_ptr++;

However, in a multi-byte encoding that might "embed" ascii characters,
it should skip over the next *character*, not byte.

Attached is a pretty straightforward patch to fix that. Anyone see a
problem with this?

- Heikki

Attachments:

0001-Fix-a-corner-case-in-COPY-FROM-backslash-processing.patchtext/x-patch; charset=UTF-8; name=0001-Fix-a-corner-case-in-COPY-FROM-backslash-processing.patchDownload

From 632549e79dc98cb338d7fc22888d8b6237a47d10 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 3 Feb 2021 14:00:32 +0200
Subject: [PATCH 1/1] Fix a corner-case in COPY FROM backslash processing.

If a multi-byte character is escaped with a backslash in TEXT mode input,
we didn't always treat the escape correctly. If:

- a multi-byte character is escaped with a backslash, and
- the second byte of the character is 0x5C, i.e. the ASCII code of a
  backslash (\), and
- the next character is a dot (.), then

copyreadline we would incorrectly interpret the sequence as an end-of-copy
marker (\.). this can only happen in encodings that can "embed" ascii
characters as the second byte. one example of such sequence is
'\x5ca45c2e666f6f' in Big5 encoding. If you put that in a file, and
load it with COPY FROM, you'd incorrectly get an "end-of-copy marker
corrupt" error.
---
 src/backend/commands/copyfromparse.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 798e18e013..c20ec482db 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1084,7 +1084,7 @@ CopyReadLineText(CopyFromState cstate)
 				break;
 			}
 			else if (!cstate->opts.csv_mode)
-
+			{
 				/*
 				 * If we are here, it means we found a backslash followed by
 				 * something other than a period.  In non-CSV mode, anything
@@ -1095,8 +1095,17 @@ CopyReadLineText(CopyFromState cstate)
 				 * backslashes are not special, so we want to process the
 				 * character after the backslash just like a normal character,
 				 * so we don't increment in those cases.
+				 *
+				 * In principle, we would need to use pg_encoding_mblen() to
+				 * skip over the character in some encodings, like at the end
+				 * of the loop.  But if it's a multi-byte character, it cannot
+				 * have any special meaning and skipping isn't necessary for
+				 * correctness.  We can fall through to process it normally
+				 * instead.
 				 */
-				raw_buf_ptr++;
+				if (!IS_HIGHBIT_SET(c2))
+					raw_buf_ptr++;
+			}
 		}
 
 		/*
-- 
2.30.0

John Naylor

john.naylor@enterprisedb.com

almost 5 years ago

In reply to: Heikki Linnakangas (#1)

Re: Bug in COPY FROM backslash escaping multi-byte chars

On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hi,

While playing with COPY FROM refactorings in another thread, I noticed
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:

\么.foo

I've seen multibyte delimiters in the wild, so it's not as outlandish as it
seems. The fix is simple enough, so +1.

--
John Naylor
EDB: http://www.enterprisedb.com

Heikki Linnakangas

hlinnaka@iki.fi

almost 5 years ago

In reply to: John Naylor (#2)

Re: Bug in COPY FROM backslash escaping multi-byte chars

On 03/02/2021 15:38, John Naylor wrote:

On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi
<mailto:hlinnaka@iki.fi>> wrote:

Hi,

While playing with COPY FROM refactorings in another thread, I noticed
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:

\么.foo

I've seen multibyte delimiters in the wild, so it's not as outlandish as
it seems.

We don't actually support multi-byte characters as delimiters or quote
or escape characters:

postgres=# copy copytest from 'foo' with (delimiter '么');
ERROR: COPY delimiter must be a single one-byte character

The fix is simple enough, so +1.

Thanks, I'll commit and backpatch shortly.

- Heikki

Kyotaro Horiguchi

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: Heikki Linnakangas (#3)

Re: Bug in COPY FROM backslash escaping multi-byte chars

At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in

On 03/02/2021 15:38, John Naylor wrote:

On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi
<mailto:hlinnaka@iki.fi>> wrote:

Hi,

While playing with COPY FROM refactorings in another thread, I noticed
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:

\么.foo

I've seen multibyte delimiters in the wild, so it's not as outlandish
as it seems.

We don't actually support multi-byte characters as delimiters or quote
or escape characters:

postgres=# copy copytest from 'foo' with (delimiter '么');
ERROR: COPY delimiter must be a single one-byte character

The fix is simple enough, so +1.

Thanks, I'll commit and backpatch shortly.

I'm not sure the assumption in the second hunk always holds, but
that's fine at least with Shift-JIS and -2004 since they are two-byte
encoding.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Heikki Linnakangas

hlinnaka@iki.fi

almost 5 years ago

In reply to: Kyotaro Horiguchi (#4)

1 attachment(s)

Re: Bug in COPY FROM backslash escaping multi-byte chars

On 04/02/2021 03:50, Kyotaro Horiguchi wrote:

At Wed, 3 Feb 2021 15:46:30 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in

On 03/02/2021 15:38, John Naylor wrote:

On Wed, Feb 3, 2021 at 8:08 AM Heikki Linnakangas <hlinnaka@iki.fi
<mailto:hlinnaka@iki.fi>> wrote:

Hi,

While playing with COPY FROM refactorings in another thread, I noticed
corner case where I think backslash escaping doesn't work correctly.
Consider the following input:

\么.foo

I've seen multibyte delimiters in the wild, so it's not as outlandish
as it seems.

We don't actually support multi-byte characters as delimiters or quote
or escape characters:

postgres=# copy copytest from 'foo' with (delimiter '么');
ERROR: COPY delimiter must be a single one-byte character

The fix is simple enough, so +1.

Thanks, I'll commit and backpatch shortly.

I'm not sure the assumption in the second hunk always holds, but
that's fine at least with Shift-JIS and -2004 since they are two-byte
encoding.

The assumption is that a multi-byte character cannot have a special
meaning, as far as the loop in CopyReadLineText is concerned. The
characters with special meaning are '\\', '\n' and '\r'. That hold
regardless of encoding.

Thinking about this a bit more, I think the attached patch is slightly
better. Normally in the loop, raw_buf_ptr points to the next byte to
consume, and 'c' is the last consumed byte. At the end of the loop, we
check 'c' to see if it was a multi-byte character, and skip its 2nd, 3rd
and 4th byte if necessary. The crux of the bug is that after the
"raw_buf_ptr++;" to skip the character after the backslash, we left c to
'\\', even though we already consumed the first byte of the next
character. Because of that, the end-of-the-loop check didn't correctly
treat it as a multi-byte character. So a more straightforward fix is to
set 'c' to the byte we skipped over.

- Heikki

Attachments:

v2-0001-Fix-a-corner-case-in-COPY-FROM-backslash-processi.patchtext/x-patch; charset=UTF-8; name=v2-0001-Fix-a-corner-case-in-COPY-FROM-backslash-processi.patchDownload

From a9411516dcdcf4c91a9f88dd05741a1479dec7a3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Feb 2021 21:36:46 +0200
Subject: [PATCH v2 1/1] Fix a corner-case in COPY FROM backslash processing.

If a multi-byte character is escaped with a backslash in TEXT mode input,
we didn't always treat the escape correctly. If:

- a multi-byte character is escaped with a backslash, and
- the second byte of the character is 0x5C, i.e. the ASCII code of a
  backslash (\), and
- the next character is a dot (.), then

CopyReadLineText function would incorrectly interpret the sequence as an
end-of-copy marker (\.). This can only happen in encodings that can
"embed" ascii characters as the second byte. One example of such sequence
is '\x5ca45c2e666f6f' in Big5 encoding. If you put that in a file, and
load it with COPY FROM, you'd incorrectly get an "end-of-copy marker
corrupt" error.

Backpatch to all supported versions.

Reviewed-by: John Naylor, Kyotaro Horiguchi
Discussion: https://www.postgresql.org/message-id/a897f84f-8dca-8798-3139-07da5bb38728%40iki.fi
---
 src/backend/commands/copyfromparse.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index b843d315b1..315b16fd7a 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1084,7 +1084,7 @@ CopyReadLineText(CopyFromState cstate)
 				break;
 			}
 			else if (!cstate->opts.csv_mode)
-
+			{
 				/*
 				 * If we are here, it means we found a backslash followed by
 				 * something other than a period.  In non-CSV mode, anything
@@ -1095,8 +1095,16 @@ CopyReadLineText(CopyFromState cstate)
 				 * backslashes are not special, so we want to process the
 				 * character after the backslash just like a normal character,
 				 * so we don't increment in those cases.
+				 *
+				 * Set 'c' to skip whole character correctly in multi-byte
+				 * encodings.  If we don't have the whole character in the
+				 * buffer yet, we might loop back to process it, after all,
+				 * but that's OK because multi-byte characters cannot have any
+				 * special meaning.
 				 */
 				raw_buf_ptr++;
+				c = c2;
+			}
 		}
 
 		/*
-- 
2.30.0