Fixing backslash dot for COPY FROM...CSV

Started by Daniel Veriteover 2 years ago31 messageshackers
Jump to latest
#1Daniel Verite
daniel@manitou-mail.org

Hi,

PFA a patch that attempts to fix the bug that \. on a line
by itself is handled incorrectly by COPY FROM ... CSV.
This issue has been discussed several times previously,
for instance in [1]/messages/by-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org and [2]/messages/by-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com, and mentioned in the
doc for \copy in commit 42d3125.

There's one case that works today: when
the line is part of a multi-line quoted section,
and the data is read from a file, not from the client.
In other situations, an error is raised or the data is cut at
the point of \. without an error.

The patch addresses that issue in the server and in psql,
except for the case of inlined data, where \. cannot be
both valid data and an EOF marker at the same time, so
it keeps treating it as an EOF marker.

[1]: /messages/by-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org
/messages/by-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org
[2]: /messages/by-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com
/messages/by-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachments:

v1-0001-Support-backslash-dot-on-a-line-by-itself-as-vali.patchtext/plainDownload+51-25
#2vignesh C
vignesh21@gmail.com
In reply to: Daniel Verite (#1)
Re: Fixing backslash dot for COPY FROM...CSV

On Tue, 19 Dec 2023 at 02:06, Daniel Verite <daniel@manitou-mail.org> wrote:

Hi,

PFA a patch that attempts to fix the bug that \. on a line
by itself is handled incorrectly by COPY FROM ... CSV.
This issue has been discussed several times previously,
for instance in [1] and [2], and mentioned in the
doc for \copy in commit 42d3125.

There's one case that works today: when
the line is part of a multi-line quoted section,
and the data is read from a file, not from the client.
In other situations, an error is raised or the data is cut at
the point of \. without an error.

The patch addresses that issue in the server and in psql,
except for the case of inlined data, where \. cannot be
both valid data and an EOF marker at the same time, so
it keeps treating it as an EOF marker.

I noticed that these tests are passing without applying patch too:
+++ b/src/test/regress/sql/copy.sql
@@ -38,6 +38,17 @@ copy copytest2 from :'filename' csv quote '''' escape E'\\';

select * from copytest except select * from copytest2;

+--- test unquoted .\ as data inside CSV
+
+truncate copytest2;
+
+insert into copytest2(test) values('line1'), ('\.'), ('line2');
+copy (select test from copytest2 order by test collate "C") to :'filename' csv;
+-- get the data back in with copy
+truncate copytest2;
+copy copytest2(test) from :'filename' csv;
+select test from copytest2 order by test collate "C";

I was not sure if this was intentional. Can we add a test which fails
in HEAD and passes with the patch applied.

Regards,
Vignesh

#3Daniel Verite
daniel@manitou-mail.org
In reply to: vignesh C (#2)
Re: Fixing backslash dot for COPY FROM...CSV

vignesh C wrote:

I noticed that these tests are passing without applying patch too:

+insert into copytest2(test) values('line1'), ('\.'), ('line2');
+copy (select test from copytest2 order by test collate "C") to :'filename'
csv;
+-- get the data back in with copy
+truncate copytest2;
+copy copytest2(test) from :'filename' csv;
+select test from copytest2 order by test collate "C";

I was not sure if this was intentional. Can we add a test which fails
in HEAD and passes with the patch applied.

Thanks for checking this out.
Indeed, that was not intentional. I've been using static files
in my tests and forgot that if the data was produced with
COPY OUT, it would quote backslash-dot so that COPY IN could
reload it without problem.

PFA an updated version that uses \qecho to produce the
data instead of COPY OUT. This test on unpatched HEAD
shows that copytest2 is missing 2 rows after COPY IN.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachments:

v2-0001-Support-backslash-dot-on-a-line-by-itself-as-vali.patchtext/plainDownload+55-25
#4vignesh C
vignesh21@gmail.com
In reply to: Daniel Verite (#3)
Re: Fixing backslash dot for COPY FROM...CSV

On Tue, 19 Dec 2023 at 16:57, Daniel Verite <daniel@manitou-mail.org> wrote:

vignesh C wrote:

I noticed that these tests are passing without applying patch too:

+insert into copytest2(test) values('line1'), ('\.'), ('line2');
+copy (select test from copytest2 order by test collate "C") to :'filename'
csv;
+-- get the data back in with copy
+truncate copytest2;
+copy copytest2(test) from :'filename' csv;
+select test from copytest2 order by test collate "C";

I was not sure if this was intentional. Can we add a test which fails
in HEAD and passes with the patch applied.

Thanks for checking this out.
Indeed, that was not intentional. I've been using static files
in my tests and forgot that if the data was produced with
COPY OUT, it would quote backslash-dot so that COPY IN could
reload it without problem.

PFA an updated version that uses \qecho to produce the
data instead of COPY OUT. This test on unpatched HEAD
shows that copytest2 is missing 2 rows after COPY IN.

Thanks for the updated patch, any reason why this is handled only in csv.
postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out';
COPY 1
postgres=# select * from test1;
c1
-------
line1
(1 row)

postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out' csv;
COPY 1
postgres=# select * from test1;
c1
-------
line1
\.
line2
(3 rows)

As the documentation at [1]https://www.postgresql.org/docs/devel/sql-copy.html says:
An end-of-data marker is not necessary when reading from a file, since
the end of file serves perfectly well; it is needed only when copying
data to or from client applications using pre-3.0 client protocol.

[1]: https://www.postgresql.org/docs/devel/sql-copy.html

Regards,
Vignesh

#5Daniel Verite
daniel@manitou-mail.org
In reply to: vignesh C (#4)
Re: Fixing backslash dot for COPY FROM...CSV

vignesh C wrote:

Thanks for the updated patch, any reason why this is handled only in csv.
postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out';
COPY 1
postgres=# select * from test1;
c1
-------
line1
(1 row)

I believe it's safer to not change anything to the normal "non-csv"
text mode.
The current doc says that \. will not be taken as data in this format.
From https://www.postgresql.org/docs/current/sql-copy.html :

Any other backslashed character that is not mentioned in the above
table will be taken to represent itself. However, beware of adding
backslashes unnecessarily, since that might accidentally produce a
string matching the end-of-data marker (\.) or the null string (\N
by default). These strings will be recognized before any other
backslash processing is done.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#6vignesh C
vignesh21@gmail.com
In reply to: Daniel Verite (#5)
Re: Fixing backslash dot for COPY FROM...CSV

On Fri, 22 Dec 2023 at 01:17, Daniel Verite <daniel@manitou-mail.org> wrote:

vignesh C wrote:

Thanks for the updated patch, any reason why this is handled only in csv.
postgres=# copy test1 from '/home/vignesh/postgres/inst/bin/copy1.out';
COPY 1
postgres=# select * from test1;
c1
-------
line1
(1 row)

I believe it's safer to not change anything to the normal "non-csv"
text mode.
The current doc says that \. will not be taken as data in this format.
From https://www.postgresql.org/docs/current/sql-copy.html :

Any other backslashed character that is not mentioned in the above
table will be taken to represent itself. However, beware of adding
backslashes unnecessarily, since that might accidentally produce a
string matching the end-of-data marker (\.) or the null string (\N
by default). These strings will be recognized before any other
backslash processing is done.

Thanks for the clarification. Then let's keep it as you have implemented.

Regards,
Vignesh

#7Daniel Verite
daniel@manitou-mail.org
In reply to: Daniel Verite (#3)
Re: Fixing backslash dot for COPY FROM...CSV

Hi,

The CI patch tester fails on this patch, because it has a label
at the end of a C block, which I'm learning is a C23 feature
that happens to be supported by gcc 11 [1]https://gcc.gnu.org/onlinedocs/gcc/Mixed-Labels-and-Declarations.html, but is not portable.

PFA an update fixing this, plus removing an obsolete chunk
in the COPY documentation that v2 left out.

[1]: https://gcc.gnu.org/onlinedocs/gcc/Mixed-Labels-and-Declarations.html

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachments:

v3-0001-Support-backslash-dot-on-a-line-by-itself-as-vali.patchtext/plainDownload+67-63
#8Robert Haas
robertmhaas@gmail.com
In reply to: Daniel Verite (#1)
Re: Fixing backslash dot for COPY FROM...CSV

On Mon, Dec 18, 2023 at 3:36 PM Daniel Verite <daniel@manitou-mail.org> wrote:

PFA a patch that attempts to fix the bug that \. on a line
by itself is handled incorrectly by COPY FROM ... CSV.
This issue has been discussed several times previously,
for instance in [1] and [2], and mentioned in the
doc for \copy in commit 42d3125.

Those links unfortunately seem not to be entirely specific to this
issue. Other, related things seem to be discussed there, and it's not
obvious that everyone agrees on what to do, or really that anyone
agrees on what to do. The best link that I found for this exact issue
is /messages/by-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org
but the thread isn't very conclusive and is so old that any
conclusions reached then might no longer be considered valid today.

And I guess the reason I mention is that, supposing your patch were
technically perfect in every way, would everyone agree that it ought
to be committed? If Alice is a user with a CSV file that might contain
\. on a line by itself within a quoted CSV field, then Alice is
currently sad because she can't necessarily load all of her CSV files.
The patch would fix that, and make her happy. On the other hand, if
Bob is a user with a CSV-ish file that definitely doesn't contain \.
on a line by itself within a quoted CSV field but might have been
truncated in the middle of a quoted field, maybe Bob will be sad if
this patch gets committed, because he will no longer be able to append
\n\.\n to whatever junk he's got in the file and let the server sort
out whether to throw an error.

I have to admit that it seems more likely that there are people in the
world with Alice's problem rather than people in the world with Bob's
problem. We'd probably make more people happy with the change than
sad. But it is a definitional change, I think, and that's a little
scary, and maybe somebody will think that's a reason why we should
change nothing here. Part of my hesitancy, I suppose, is that I don't
understand why we even have this strange convention of making \.
terminate the input in the first place -- I mean, why wouldn't that be
done in some kind of out-of-band way, rather than including a special
marker in the data?

I can't help feeling a bit nervous about this first documentation hunk:

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY <replaceable class="parameter">count</replaceable>
     format, <literal>\.</literal>, the end-of-data marker, could also appear
     as a data value.  To avoid any misinterpretation, a <literal>\.</literal>
     data value appearing as a lone entry on a line is automatically
-    quoted on output, and on input, if quoted, is not interpreted as the
-    end-of-data marker.  If you are loading a file created by another
-    application that has a single unquoted column and might have a
-    value of <literal>\.</literal>, you might need to quote that value in the
-    input file.
+    quoted on output.
    </para>

<note>

It doesn't feel right to me to just replace all of this text with
nothing. That leaves us documenting only the behavior on output,
whereas the previous text documents both the output behavior (we quote
it) and the input behavior (it has to be quoted to avoid being taken
as the EOF marker).

         /*
-         * In CSV mode, we only recognize \. alone on a line.  This is because
-         * \. is a valid CSV data value.
+         * In CSV mode, backslash is a normal character.
          */
-        if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line))
+        if (c == '\\' && !cstate->opts.csv_mode)

I don't think that the update comment accurately describes the
behavior. If I understand correctly what you're trying to fix, I'd
expect the new behavior to be that we only recognize \. alone on a
line and even then only if we're not inside a quoting string, but
that's not what the revised comment says. Instead, it claims that
backslash is just a normal character, but if that were true, the whole
if-statement wouldn't exist at all, since its purpose is to provide
special-handling for backslashes -- and indeed the patch does not
change that, since the if-statement is still looking for a backslash
and doing something special if one is found.

Hmm. Looking at the rest of the patch, it seems like you're removing
the logic that prevents us from interpreting

\. lksdghksdhgjskdghjs

as an end-of-file while in CSV mode. But I would have thought based on
what problem you're trying to fix that you would have wanted to keep
that logic and further restrict it so that it only applies when not
within a quoted string.

Maybe I'm misunderstanding what bug you're trying to fix?

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Daniel Verite
daniel@manitou-mail.org
In reply to: Robert Haas (#8)
Re: Fixing backslash dot for COPY FROM...CSV

Robert Haas wrote:

Part of my hesitancy, I suppose, is that I don't
understand why we even have this strange convention of making \.
terminate the input in the first place -- I mean, why wouldn't that be
done in some kind of out-of-band way, rather than including a special
marker in the data?

The v3 protocol added the out-of-band method, but the v2 protocol
did not have it, and as far as I understand, this is the reason why
CopyReadLineText() must interpret \. as an end-of-data marker.

The v2 protocol was removed in pg14
https://www.postgresql.org/docs/release/14.0/
<quote>
Remove server and libpq support for the version 2 wire protocol (Heikki
Linnakangas)
This was last used as the default in PostgreSQL 7.3 (released in 2002).
</quote>

Also I hadnt' noticed this before, but the current doc has this mention
that is relevant to this patch:

https://www.postgresql.org/docs/current/protocol-changes.html
"Summary of Changes since Protocol 2.0"
<quote>
COPY data is now encapsulated into CopyData and CopyDone
messages. There is a well-defined way to recover from errors during
COPY. The special “\.” last line is not needed anymore, and is not
sent during COPY OUT. (It is still recognized as a terminator during
COPY IN, but its use is deprecated and will eventually be removed.)
</quote>

What the present patch does is essentially, for the server-side part,
stop recognizing "\." as as terminator, like this paragraph says, but
it does that for CSV only, not for TEXT.

Hmm. Looking at the rest of the patch, it seems like you're removing
the logic that prevents us from interpreting

\. lksdghksdhgjskdghjs

as an end-of-file while in CSV mode. But I would have thought based on
what problem you're trying to fix that you would have wanted to keep
that logic and further restrict it so that it only applies when not
within a quoted string.

Maybe I'm misunderstanding what bug you're trying to fix?

The fix is that \. is no longer recognized as special in CSV, whether
alone on a line or not, and whether in a quoted section or not.
It's always interpreted as data, like it would have been in
the first place, I imagine, if the v2 protocol could have handled
it. This is why the patch consists mostly of removing code and
simplifying comments.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#10Daniel Verite
daniel@manitou-mail.org
In reply to: Robert Haas (#8)
Re: Fixing backslash dot for COPY FROM...CSV

Robert Haas wrote:

Those links unfortunately seem not to be entirely specific to this
issue. Other, related things seem to be discussed there, and it's not
obvious that everyone agrees on what to do, or really that anyone
agrees on what to do. The best link that I found for this exact issue
is
/messages/by-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org
but the thread isn't very conclusive and is so old that any
conclusions reached then might no longer be considered valid today.

To refresh the problem statement, 4 cases that need fixing as
of HEAD can be distinguished:

#1. copy csv from file, single column, no quoting involved.
COPY will stop at \. and ignore the rest of the file without
any error or warning.

$ cat >/tmp/file.csv <<EOF
line1
\.
line2
EOF

$ psql <<EOF
create table contents(t text);
copy contents from '/tmp/file.csv' (format csv);
table contents;
EOF

Results in
t
-------
line1
(1 row)

The bug is that a single row is imported instead of the 3 rows of the file.

#2. Same as the previous case, but with file_fdw

$ psql <<EOF
CREATE EXTENSION file_fdw;

CREATE FOREIGN TABLE csv_data(line text) SERVER myserver
OPTIONS (filename '/tmp/file.csv', format 'csv');

TABLE csv_data;
EOF

Results in:

line
-------
line1
(1 row)

The bug is that rows 2 and 3 are missing, as in case #1.

#3. \copy csv from file with \. inside a quoted multi-line section

This is the case that the above linked report mentioned,
resulting in a failure to import.
In addition to being legal CSV, these contents can be produced by
Postgres itself exporting in CSV.

$ cat >/tmp/file-quoted.csv <<EOF
line1
"
\.
"
line2
EOF

$ psql <<EOF
create table contents(t text);
\copy contents from '/tmp/file-quoted.csv' csv;
EOF

Results in an error:

ERROR: unterminated CSV quoted field
CONTEXT: COPY contents, line 4: ""
\.
"

The expected result is that it imports 3 rows without error.

#4. \copy csv from file, single column, no quoting involved
This is the same case as #1 except it uses the client-server protocol.

$ cat >/tmp/file.csv <<EOF
line1
\.
line2
EOF

$ psql <<EOF
create table contents(t text);
\copy contents from '/tmp/file.csv' (format csv);
TABLE contents;
EOF

Results in
t
-------
line1
(1 row)

As in case #1, a single row is imported instead of 3 rows.

The proposed patch addresses these cases by making the sequence
\. non-special in CSV (in fact backslash itself is a normal character in
CSV).
It does not address the cases when the data is embedded after
the COPY command or typed interactively in psql, since these cases
require an explicit end-of-data marker, and CSV does not have
the concept of an end-of-data marker.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#10)
Re: Fixing backslash dot for COPY FROM...CSV

"Daniel Verite" <daniel@manitou-mail.org> writes:

Robert Haas wrote:

Those links unfortunately seem not to be entirely specific to this
issue. Other, related things seem to be discussed there, and it's not
obvious that everyone agrees on what to do, or really that anyone
agrees on what to do.

The proposed patch addresses these cases by making the sequence
\. non-special in CSV (in fact backslash itself is a normal character in
CSV).
It does not address the cases when the data is embedded after
the COPY command or typed interactively in psql, since these cases
require an explicit end-of-data marker, and CSV does not have
the concept of an end-of-data marker.

I've looked over this patch and I generally agree that this is a
reasonable solution. While it's barely possible that somebody
out there is depending on the current behavior of \. in CSV mode,
it seems considerably more likely that people who run into it
would consider it a bug. In any case, the patch isn't proposed
for back-patch, and as cross-version incompatibilities go this
seems like a pretty small one.

I concur with Robert's doubts about some of the doc changes though.
In particular, since we're not changing the behavior for non-CSV
mode, we shouldn't remove any statements that apply to non-CSV mode.

I'm also wondering why the patch adds a test for
"PQprotocolVersion(conn) >= 3" in handleCopyIn. As already
noted upthread, we ripped out all support for protocol 2.0
some time ago, so that sure looks like dead code.

regards, tom lane

#12Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#11)
Re: Fixing backslash dot for COPY FROM...CSV

Tom Lane wrote:

I've looked over this patch and I generally agree that this is a
reasonable solution.

Thanks for reviewing this!

I'm also wondering why the patch adds a test for
"PQprotocolVersion(conn) >= 3" in handleCopyIn.

I've removed this in the attached update.

I concur with Robert's doubts about some of the doc changes though.
In particular, since we're not changing the behavior for non-CSV
mode, we shouldn't remove any statements that apply to non-CSV mode.

I don't quite understand the issues with the doc changes. Let me
detail the changes.

The first change is under
<refsect2>
<title>CSV Format</title>
so it does no concern non-CSV modes.

--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -761,11 +761,7 @@ COPY <replaceable class="parameter">count</replaceable>
    format, <literal>\.</literal>, the end-of-data marker, could also appear
    as a data value.  To avoid any misinterpretation, a <literal>\.</literal>
    data value appearing as a lone entry on a line is automatically
-    quoted on output, and on input, if quoted, is not interpreted as the
-    end-of-data marker.  If you are loading a file created by another
-    application that has a single unquoted column and might have a
-    value of <literal>\.</literal>, you might need to quote that value in
the
-    input file.
+    quoted on output.
   </para>

The part about quoting output is kept because the code still does that.

About this bit:
"and on input, if quoted, is not interpreted as the end-of-data marker."

So the patch removes that part. The reason is \. is now not interpreted as
EOD in both cases, quoted or unquoted, conforming to spec.
Previously, what the reader should have understood by "if quoted, ..."
is that it implies "if not quoted, then .\ will be interpreted as EOD
even though that behavior does not conform to the CSV spec".
If we documented the new behavior, it would be something like:
when quoted, it works as expected, and when unquoted, it works as
expected too. Isn't it simpler not to say anything?

About the next sentence:
"If you are loading a file created by another application
that has a single unquoted column and might have a value of \., you
might need to quote that value in the input file."

It's removed as well because it's not longer necessary
to do that.

The other hunk is in psql doc:

--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1119,10 +1119,6 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value'
'second value' \g
	 destination, because all data must pass through the client/server
	 connection.  For large amounts of data the <acronym>SQL</acronym>
	 command might be preferable.
-	 Also, because of this pass-through method, <literal>\copy
-	 ... from</literal> in <acronym>CSV</acronym> mode will erroneously
-	 treat a <literal>\.</literal> data value alone on a line as an
-	 end-of-input marker.

That behavior no longer happens, so this gets removed as well.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachments:

v4-0001-Support-backslash-dot-on-a-line-by-itself-as-vali.patchtext/plainDownload+65-63
#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#12)
Re: Fixing backslash dot for COPY FROM...CSV

"Daniel Verite" <daniel@manitou-mail.org> writes:

Tom Lane wrote:

I've looked over this patch and I generally agree that this is a
reasonable solution.

Thanks for reviewing this!

While testing this, I tried running the tests with an updated server
and non-updated psql, and not only did the new test case fail, but
so did a bunch of existing ones. That's because existing psql will
send the trailing "\." of inlined data to the server, and the updated
server will now think that's data if it's in CSV mode.

So this means that the patch introduces a rather serious cross-version
compatibility problem. I doubt we can consider inlined CSV data to be
a niche case that few people use; but it will fail every time if your
psql is older than your server.

Not sure what to do here. One idea is to install just the psql-side
fix, which should break nothing now that version-2 protocol is dead,
and then wait a few years before introducing the server-side change.
That seems kind of sad though.

An argument for not waiting is that psql may not be the only client
that needs this behavioral adjustment, and if so there's going to
be breakage anyway when we change the server; so we might as well
get it over with.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#13)
Re: Fixing backslash dot for COPY FROM...CSV

I wrote:

So this means that the patch introduces a rather serious cross-version
compatibility problem. I doubt we can consider inlined CSV data to be
a niche case that few people use; but it will fail every time if your
psql is older than your server.

On third thought, that may not be as bad as I was thinking.
We don't blink at the idea that an older psql's \d commands may
malfunction with a newer server, and I think most users have
internalized the idea that they want psql >= server. If the
patch created an incompatibility with that direction, it'd be
a problem, but I don't think it does.

regards, tom lane

#15Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#13)
Re: Fixing backslash dot for COPY FROM...CSV

Tom Lane wrote:

Not sure what to do here. One idea is to install just the psql-side
fix, which should break nothing now that version-2 protocol is dead,
and then wait a few years before introducing the server-side change.
That seems kind of sad though.

Wouldn't backpatching solve this? Then only the users who don't
apply the minor updates would have non-matching server and psql.
Initially I though that obsoleting the v2 protocol was a recent
move, but reading older messages from the list I've got the
impression that it was more or less in the pipeline since way
before version 10.

Also one of the cases the patch fixes, the one when imported
data are silently truncated at the point of \., is quite nasty
IMO.
I can imagine an app where user-supplied data would be
appended row-by-row into a CSV file, and would be processed
periodically by batch. Under some conditions, in particular
if newlines in the first column are allowed, a malevolent user could
submit a \. sequence to cause the batch to miss the rest of the data
without any error being raised.

[1]: /messages/by-id/11648.1403147417@sss.pgh.pa.us
/messages/by-id/11648.1403147417@sss.pgh.pa.us

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#15)
Re: Fixing backslash dot for COPY FROM...CSV

"Daniel Verite" <daniel@manitou-mail.org> writes:

Tom Lane wrote:

Not sure what to do here. One idea is to install just the psql-side
fix, which should break nothing now that version-2 protocol is dead,
and then wait a few years before introducing the server-side change.
That seems kind of sad though.

Wouldn't backpatching solve this?

No, it'd just reduce the surface area a bit. People on less-than-
the-latest-minor-release would still have the issue. In any case
back-patching further than v14 would be a nonstarter, because we
didn't remove protocol v2 support till then.

However, the analogy to "\d commands might fail against a newer
server" reduces my level of concern quite a lot: it's hard to
draw much of a line separating that kind of issue from "inline
COPY CSV will fail against a newer server". It's not like such
failures won't be obvious and fairly easy to diagnose.

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#16)
Re: Fixing backslash dot for COPY FROM...CSV

After some more poking at this topic, I realize that there is already
very strange and undocumented behavior for backslash-dot even in
non-CSV mode. Create a file like this:

$ cat eofdata
foobar
foobaz\.
more
\.
yet more

and try importing it with COPY:

regression=# create table eofdata(f1 text);
CREATE TABLE
regression=# copy eofdata from '/home/tgl/pgsql/eofdata';
COPY 2
regression=# table eofdata;
f1
--------
foobar
foobaz
(2 rows)

That's what you get in 9.0 and earlier versions, and it's already
not-as-documented, because we claim that only \. alone on a line is an
EOF marker; we certainly don't suggest that what's in front of it will
be taken as valid data. However, somebody broke it some more in 9.1,
because 9.1 up to HEAD produce this result:

regression=# create table eofdata(f1 text);
CREATE TABLE
regression=# copy eofdata from '/home/tgl/pgsql/eofdata';
COPY 3
regression=# table eofdata;
f1
--------
foobar
foobaz
more
(3 rows)

So the current behavior is that \. that is on the end of a line,
but is not the whole line, is silently discarded and we keep going.

All versions throw "end-of-copy marker corrupt" if there is
something after \. on the same line.

This is sufficiently weird that I'm starting to come around to
Daniel's original proposal that we just drop the server's recognition
of \. altogether (which would allow removal of some dozens of lines of
complicated and now known-buggy code). Alternatively, we could fix it
so that \. at the end of a line draws "end-of-copy marker corrupt",
which would at least make things consistent, but I'm not sure that has
any great advantage. I surely don't want to document the current
behavioral details as being the right thing that we're going to keep
doing.

Thoughts?

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: Fixing backslash dot for COPY FROM...CSV

I wrote:

So the current behavior is that \. that is on the end of a line,
but is not the whole line, is silently discarded and we keep going.

All versions throw "end-of-copy marker corrupt" if there is
something after \. on the same line.

This is sufficiently weird that I'm starting to come around to
Daniel's original proposal that we just drop the server's recognition
of \. altogether (which would allow removal of some dozens of lines of
complicated and now known-buggy code).

I experimented with that and soon ran into a nasty roadblock: it
breaks dump/restore, because pg_dump includes a "\." line after
COPY data whether or not it really needs one. Worse, that's
implemented by including the "\." line into the archive format,
so that existing dump files contain it. Getting rid of it would
require an archive format version bump, plus some hackery to allow
removal of the line when reading old dump files.

While that's surely doable with enough effort, it's not the kind
of thing to be undertaking with less than 2 days to feature freeze.
Not to mention that I'm not sure we have consensus to do it at all.

More fun stuff: PQgetline actually invents a "\." line when it
sees server end-of-copy, and we tell users of that function to
check for that not an out-of-band return value to detect EOF.
It looks like we have no callers of that in the core distro,
but do we want to deprecate it completely?

So I feel like we need to put this patch on the shelf for the moment
and come back to it early in v18. Although it seems reasonably clear
what to do on the CSV side of things, it's very much less clear what
to do about text-format handling of EOD markers, and I don't want to
change one of those things in v17 and the other in v18. Also it
seems like there are more dependencies on "\." than we realized.

There could be an argument for applying just the psql change now,
to remove its unnecessary sending of "\.". That won't break
anything and it would give us at least one year's leg up on
compatibility issues.

regards, tom lane

#19Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#17)
Re: Fixing backslash dot for COPY FROM...CSV

Tom Lane wrote:

This is sufficiently weird that I'm starting to come around to
Daniel's original proposal that we just drop the server's recognition
of \. altogether (which would allow removal of some dozens of lines of
complicated and now known-buggy code)

FWIW my plan was to not change anything in the TEXT mode,
but I wasn't aware it had this issue that you found when
\. is not in a line by itself.

Alternatively, we could fix it so that \. at the end of a line draws
"end-of-copy marker corrupt"
which would at least make things consistent, but I'm not sure that has
any great advantage. I surely don't want to document the current
behavioral details as being the right thing that we're going to keep
doing.

Agreed we don't want to document that, but also why doesn't \. in the
contents represent just a dot (as opposed to being an error),
just like \a is a?

I mean if eofdata contains

foobar\a
foobaz\aother

then we get after import:
f1
--------------
foobara
foobazaother
(2 rows)

Reading the current doc on the text format, I can't see why
importing:

foobar\.
foobar\.other

is not supposed to produce
f1
--------------
foobar.
foobaz.other
(2 rows)

I see these rules in [1]https://www.postgresql.org/docs/current/sql-copy.html about backslash:

#1.
"End of data can be represented by a single line containing just
backslash-period (\.)."

foobar\. and foobar\.other do not match that so #1 does not describe
how they're interpreted.

#2.
"Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column
delimiters."

Dot is not a column delimiter (it's forbidden anyway), so #2 does
not apply.

#3.
"In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character"

Dot is not in that list so #3 does not apply.

#4.
"The following special backslash sequences are recognized by COPY
FROM:" (followed by the table with \b \f, ...,)

Dot is not mentioned.

#5.
"Any other backslashed character that is not mentioned in the above
table will be taken to represent itself"

Here we say that backslash dot represents a dot (unless other
rules apply)

foobar\. => foobar.
foobar\.other => foobar.other

#6.
"However, beware of adding backslashes unnecessarily, since that
might accidentally produce a string matching the end-of-data marker
(\.) or the null string (\N by default)."

So we *recommend* not to use \. but as I understand it, the warning
with the EOD marker is about accidentally creating a line that matches #1,
that is, \. alone on a line.

#7
"These strings will be recognized before any other backslash
processing is done."

TBH I don't understand what #7 implies. The order in backslash
processing looks like an implementation detail that should not
matter in understanding the format?

Considering this, it seems to me that #5 says that
backslash-dot represents a dot unless #1 applies, and the
other #2 #3 #4 #6 #7 rules do not state anything that would
contradict that.

[1]: https://www.postgresql.org/docs/current/sql-copy.html

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#20Bruce Momjian
bruce@momjian.us
In reply to: Daniel Verite (#19)
Re: Fixing backslash dot for COPY FROM...CSV

On Sun, Apr 7, 2024 at 12:00:25AM +0200, Daniel Verite wrote:

Tom Lane wrote:

This is sufficiently weird that I'm starting to come around to
Daniel's original proposal that we just drop the server's recognition
of \. altogether (which would allow removal of some dozens of lines of
complicated and now known-buggy code)

FWIW my plan was to not change anything in the TEXT mode,
but I wasn't aware it had this issue that you found when
\. is not in a line by itself.

Alternatively, we could fix it so that \. at the end of a line draws
"end-of-copy marker corrupt"
which would at least make things consistent, but I'm not sure that has
any great advantage. I surely don't want to document the current
behavioral details as being the right thing that we're going to keep
doing.

Agreed we don't want to document that, but also why doesn't \. in the
contents represent just a dot (as opposed to being an error),
just like \a is a?

I looked into this and started to realize that \. is the only copy
backslash command where we define the behavior only alone at the
beginning of a line, and not in other circumstances. The \a example
above suggests \. should be period in all other cases, as suggested, but
I don't know the ramifications if that.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#20)
#22Sutou Kouhei
kou@clear-code.com
In reply to: Tom Lane (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sutou Kouhei (#22)
#24Daniel Verite
daniel@manitou-mail.org
In reply to: Sutou Kouhei (#22)
#25Sutou Kouhei
kou@clear-code.com
In reply to: Daniel Verite (#24)
#26Artur Zakirov
zaartur@gmail.com
In reply to: Daniel Verite (#24)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#24)
#28Daniel Verite
daniel@manitou-mail.org
In reply to: Artur Zakirov (#26)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#28)
#30Daniel Verite
daniel@manitou-mail.org
In reply to: Tom Lane (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Verite (#30)