[BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

Started by Joel Jacobsonabout 1 year ago5 messages

joel@compiler.org

about 1 year ago

1 attachment(s)

Hi hackers,

When using \copy from in psql to import files containing NUL bytes (\0) in
'text' or 'csv' format, the NUL bytes were not detected and did not result in
an error, leading to silent data corruption.

This behavior is inconsistent with server-side COPY FROM, which reports an error
upon encountering NUL bytes in the 'text' and 'csv' formats.

Fix by adjusting handleCopyIn() to use the binary code path also when the copy
source is a file (i.e., copystream != pset.cur_cmd_source), even in textual
copies. This ensures that NUL bytes are detected and reported in the same way
as server-side COPY.

Example:

% printf 'test_col\nfoo\n\x00\nbar\nbaz\x00aar\nbizarre\n' > nul_bytes.data
% cat -v nul_bytes.data
test_col
foo
^@
bar
baz^@aar
bizarre

% psql
create table nul_bytes_test (test_col text);

--
-- HEAD
--
\copy nul_bytes_test (test_col) from 'nul_bytes.data' (format csv, header match);
COPY 3
select * from nul_bytes_test;
test_col
------------
foo
bar
bazbizarre
(3 rows)

\copy nul_bytes_test (test_col) from 'nul_bytes.data' (format text, header match);
select * from nul_bytes_test;
test_col
------------
foo
bar
bazbizarre
(3 rows)

--
-- 0001-psql-Make-copy-from-text-and-csv-formats-fail-on-NUL.patch
--
\copy nul_bytes_test (test_col) from 'nul_bytes.data' (format csv, header match);
ERROR: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY nul_bytes_test, line 3
\copy nul_bytes_test (test_col) from 'nul_bytes.data' (format text, header match);
ERROR: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY nul_bytes_test, line 3

/Joel

Attachments:

0001-psql-Make-copy-from-text-and-csv-formats-fail-on-NUL.patchapplication/octet-stream; name="=?UTF-8?Q?0001-psql-Make-copy-from-text-and-csv-formats-fail-on-NUL.patc?= =?UTF-8?Q?h?="Download

From b6824faf03d4d4bdb909b33230fab0011cbcfc69 Mon Sep 17 00:00:00 2001
From: Joel Jacobson <joel@compiler.org>
Date: Sun, 10 Nov 2024 22:18:48 +0100
Subject: [PATCH] psql: Make \copy from 'text' and 'csv' formats fail on NUL
 bytes

When using \copy from in psql to import files containing NUL bytes (\0) in
'text' or 'csv' format, the NUL bytes were not detected and did not result in
an error, leading to silent data corruption.

This behavior is inconsistent with server-side COPY FROM, which reports an error
upon encountering NUL bytes in the 'text' and 'csv' formats.

Fix by adjusting handleCopyIn() to use the binary code path also when the copy
source is a file (i.e., copystream != pset.cur_cmd_source), even in textual
copies. This ensures that NUL bytes are detected and reported in the same way
as server-side COPY.
---
 src/bin/psql/copy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/psql/copy.c b/src/bin/psql/copy.c
index e020e4d665d..306f5161e1d 100644
--- a/src/bin/psql/copy.c
+++ b/src/bin/psql/copy.c
@@ -544,7 +544,7 @@ handleCopyIn(PGconn *conn, FILE *copystream, bool isbinary, PGresult **res)
 
 	OK = true;
 
-	if (isbinary)
+	if (isbinary || copystream != pset.cur_cmd_source)
 	{
 		/* interactive input probably silly, but give one prompt anyway */
 		if (showprompt)
-- 
2.45.1

Tom Lane

tgl@sss.pgh.pa.us

about 1 year ago

In reply to: Joel Jacobson (#1)

Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

"Joel Jacobson" <joel@compiler.org> writes:

Fix by adjusting handleCopyIn() to use the binary code path also when the copy
source is a file (i.e., copystream != pset.cur_cmd_source), even in textual
copies.

That seems like a hack, as it also changes the behavior w.r.t.
prompts and EOF-mark detection, neither for the better.

I'm not really convinced that we need to do anything at all
about this.

regards, tom lane

Joel Jacobson

joel@compiler.org

about 1 year ago

In reply to: Tom Lane (#2)

Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

On Sun, Nov 10, 2024, at 22:37, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

Fix by adjusting handleCopyIn() to use the binary code path also when the copy
source is a file (i.e., copystream != pset.cur_cmd_source), even in textual
copies.

That seems like a hack, as it also changes the behavior w.r.t.
prompts and EOF-mark detection, neither for the better.

Hmm, in what way does it change the behavior?
I ran almost all the tests, including the TAP ones, and none of them failed with
this fix. Is there some behavior that is currently not covered by the test suite?

I'm not really convinced that we need to do anything at all
about this.

I think this should be fixed, because 0x00 is valid UTF-8 for the Unicode
Character 'NULL' (U+0000), which makes me think there is a risk CSV files
coming from other systems than PostgreSQL, could contain such text strings
which would then silently cause data corruption.

/Joel

Tom Lane

tgl@sss.pgh.pa.us

about 1 year ago

In reply to: Joel Jacobson (#3)

Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

"Joel Jacobson" <joel@compiler.org> writes:

On Sun, Nov 10, 2024, at 22:37, Tom Lane wrote:

That seems like a hack, as it also changes the behavior w.r.t.
prompts and EOF-mark detection, neither for the better.

Hmm, in what way does it change the behavior?
I ran almost all the tests, including the TAP ones, and none of them failed with
this fix. Is there some behavior that is currently not covered by the test suite?

Of course. (Our regression tests are very far from covering 100% of
the behavioral space.) In the case at hand, this would break the
expected line-by-line prompting for "\copy ... from '/dev/tty'".

Playing with this just now, I notice that the prompt you get
still claims that \. works to end input, although right now
it does not:

regression=# \copy int8_tbl from '/dev/tty'
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.

1 3
5 6
\.
^D

COPY 2

I'm inclined to think that the prompt still describes what should
happen, at least in non-binary mode, although in binary mode we
probably ought to just say (and do) "End with an EOF signal".

So perhaps the if-test to choose the code path could be

if ((isbinary || copystream != pset.cur_cmd_source) && !showprompt)

which would allow dropping the vestigial prompt logic in the first
path, and we would also need to change the test in the second path
that decides if we should check for \. (Likely this should be
refactored a bit to make it more understandable. An intermediate
flag saying whether we intend to check for \. might help.)

Anyway, my point is that we need to think through the desirable
behavior for each possible combination of showprompt, isbinary, and
copystream != pset.cur_cmd_source, because all 8 cases are reachable.

regards, tom lane

Joel Jacobson

joel@compiler.org

about 1 year ago

In reply to: Tom Lane (#4)

Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

On Sun, Nov 10, 2024, at 23:14, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

On Sun, Nov 10, 2024, at 22:37, Tom Lane wrote:

That seems like a hack, as it also changes the behavior w.r.t.
prompts and EOF-mark detection, neither for the better.

Hmm, in what way does it change the behavior?
I ran almost all the tests, including the TAP ones, and none of them failed with
this fix. Is there some behavior that is currently not covered by the test suite?

Of course. (Our regression tests are very far from covering 100% of
the behavioral space.)

In the case at hand, this would break the
expected line-by-line prompting for "\copy ... from '/dev/tty'".

Yes, of course, just wondered what kind of behavior in this area
that wasn't tested, thanks for explaining it.

Playing with this just now, I notice that the prompt you get
still claims that \. works to end input, although right now
it does not:

regression=# \copy int8_tbl from '/dev/tty'
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.

1 3
5 6
\.
^D

COPY 2

I'm inclined to think that the prompt still describes what should
happen, at least in non-binary mode, although in binary mode we
probably ought to just say (and do) "End with an EOF signal".

So perhaps the if-test to choose the code path could be

if ((isbinary || copystream != pset.cur_cmd_source) && !showprompt)

Thanks for guidance, that seems to fix, for 6/8 cases I've figured out how to test.

which would allow dropping the vestigial prompt logic in the first
path,

I assume you mean that due to "&& !showprompt" the "if (showprompt)"
becomes unreachable and can therefore be dropped?

and we would also need to change the test in the second path
that decides if we should check for \. (Likely this should be
refactored a bit to make it more understandable. An intermediate
flag saying whether we intend to check for \. might help.)

Maybe check_dot_command?

const bool check_dot_command = (copystream == pset.cur_cmd_source);

I haven't tried yet to refactor the code,
except than replacing the two "copystream == pset.cur_cmd_source"
occurrences with the new check_dot_command flag.

First want to understand if the two remaining cases are valid,
and if they can be tested:

Anyway, my point is that we need to think through the desirable
behavior for each possible combination of showprompt, isbinary, and
copystream != pset.cur_cmd_source, because all 8 cases are reachable.

I guess these are the 8 cases?

+--------+-------------+----------+------------------+
| CASE   | showprompt  | isbinary | check_dot_command |
+--------+-------------+----------+------------------+
|   1    |    false    |   false  |      false      |
|   2    |    false    |   false  |      true       |
|   3    |    false    |   true   |      false      |
|   4    |    false    |   true   |      true       |
|   5    |    true     |   false  |      false      |
|   6    |    true     |   false  |      true       |
|   7*   |    true     |   true   |      false      |
|   8*   |    true     |   true   |      true       |
+--------+-------------+----------+------------------+
* Cases 7 and 8 not tested yet

With the changed if-test, case 1-6 works,
and for case 1, then binary mode branch is taken
instead of the text mode branch,
whereas cases 2-6 take the same branch as before.

joel@Joels-MBP psql_tester % git diff --no-index -U100 /tmp/psql.log.HEAD /tmp/psql.log
diff --git a/tmp/psql.log.HEAD b/tmp/psql.log
index 5e44e30..1f48ac9 100644
--- a/tmp/psql.log.HEAD
+++ b/tmp/psql.log
@@ -1,6 +1,6 @@
-COPY case: 1 TEXT MODE
+COPY case: 1 BINARY MODE
 COPY case: 2 TEXT MODE
 COPY case: 3 BINARY MODE
 COPY case: 4 BINARY MODE
 COPY case: 5 TEXT MODE
 COPY case: 6 TEXT MODE

Here is how I tested each case:

# CASE 1:
# showprompt: false
# isbinary: false
# check_dot_command: false
psql -c "\copy int8_tbl from '/tmp/int8_tbl.data'"

# CASE 2:
# showprompt: false
# isbinary: false
# check_dot_command: true
psql -f /tmp/copy_stdin_text.sql

# CASE 3:
# showprompt: false
# isbinary: true
# check_dot_command: false
psql -c "\copy int8_tbl from '/tmp/int8_tbl.bin' (format binary)"

# CASE 4:
# showprompt: false
# isbinary: true
# check_dot_command: true
printf '\\copy int8_tbl from stdin (format binary)\n' >/tmp/copy_stdin_binary.sql
cat /tmp/int8_tbl.bin >>/tmp/copy_stdin_binary.sql
psql -f copy_stdin_binary.sql

# CASE 5:
# showprompt: true
# isbinary: false
# check_dot_command: false

psql
\copy int8_tbl from '/dev/tty'
17 18
19 20
\.
# Send EOF (Ctrl+D)

# CASE 6:
# showprompt: true
# isbinary: false
# check_dot_command: true

psql
\copy int8_tbl from stdin
21 22
23 24
\.

# CASE 7:
# showprompt: true
# isbinary: true
# check_dot_command: false

CASE 7 would be like CASE 5, with the addition of (format binary)
that is:

\copy int8_tbl from '/dev/tty' (format binary)

but I wonder if this is a valid case? Could we really copy/paste
or by some other means give psql binary data here?
I tried to wrap psql and send the binary content of
my '/tmp/int8_tbl.bin' file, and tried sending the EOF control code,
and I see I get PGCOPY-message, but the txn isn't committed,
so not sure what's happening. Before I continue trying
to figure this one out, I just wanted to make sure this is
a valid case, and how it is supposed to be used if so?

# CASE 8:
# showprompt: true
# isbinary: true
# check_dot_command: true

CASE 8 would be like CASE 6, with the addition of (format binary)
that is:

\copy int8_tbl from stdin (format binary)

I've had the same problem with this one, as with CASE 7.

Many thanks for guidance.

/Joel