Should CSV parsing be stricter about mid-field quotes?

Started by Joel Jacobson over 2 years ago · 38 messages
#1 Joel Jacobson
joel@compiler.org

Hi hackers,

I've come across an unexpected behavior in our CSV parser that I'd like to
bring up for discussion.

% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"

% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;

This gives:

id | rating | review
----+--------+-------------------------------------------------------------
1 | 5 | Great product, will buy again.
2 | 3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)

The parser currently accepts quoting within an unquoted field. This can lead to
data misinterpretation when the quote is part of the field data (e.g.,
for inches, like in the example).

Our CSV output rules quote an entire field or not at all. But the import of
fields with mid-field quotes might lead to surprising and undetected outcomes.

I think we should throw a parsing error for unescaped mid-field quotes,
and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
quotes are necessary. The error message could suggest this option when it
encounters an unescaped mid-field quote.

I think the convenience of not having to use an extra option doesn't outweigh
the risk of undetected data integrity issues.
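
To illustrate the stricter rule I have in mind, here is a rough sketch in Python (a toy validator, not PostgreSQL's actual parser; check_csv_line is a hypothetical helper, assuming the default " quote and doubling as the escape):

```python
def check_csv_line(line, quote='"', delim=','):
    """Reject lines with unescaped mid-field quotes (strict RFC 4180 reading)."""
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == quote:              # quoted field
            i += 1
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                if line[i] == quote:
                    if i + 1 < n and line[i + 1] == quote:
                        i += 2                      # doubled quote = escaped
                        continue
                    i += 1                          # closing quote
                    break
                i += 1
            if i < n and line[i] != delim:
                raise ValueError("unescaped quote after quoted field")
        else:                                       # unquoted field
            while i < n and line[i] != delim:
                if line[i] == quote:
                    raise ValueError("quote inside unquoted field")
                i += 1
        if i >= n:
            return True
        i += 1                                      # skip delimiter
```

With this rule, the first data row above parses fine, while the second raises an error instead of being silently reinterpreted.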

Thoughts?

/Joel

#2 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#1)
Re: Should CSV parsing be stricter about mid-field quotes?

On Thu, May 11, 2023 at 16:04, Joel Jacobson <joel@compiler.org> wrote:

Hi hackers,

I've come across an unexpected behavior in our CSV parser that I'd like to
bring up for discussion.

% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"

% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;

This gives:

id | rating | review
----+--------+-------------------------------------------------------------
1 | 5 | Great product, will buy again.
2 | 3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)

The parser currently accepts quoting within an unquoted field. This can
lead to
data misinterpretation when the quote is part of the field data (e.g.,
for inches, like in the example).

Our CSV output rules quote an entire field or not at all. But the import of
fields with mid-field quotes might lead to surprising and undetected
outcomes.

I think we should throw a parsing error for unescaped mid-field quotes,
and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
quotes are necessary. The error message could suggest this option when it
encounters an unescaped mid-field quote.

I think the convenience of not having to use an extra option doesn't
outweigh
the risk of undetected data integrity issues.

Thoughts?

+1

Pavel


#3 Greg Stark
stark@mit.edu
In reply to: Joel Jacobson (#1)
Re: Should CSV parsing be stricter about mid-field quotes?

On Thu, 11 May 2023 at 10:04, Joel Jacobson <joel@compiler.org> wrote:

The parser currently accepts quoting within an unquoted field. This can lead to
data misinterpretation when the quote is part of the field data (e.g.,
for inches, like in the example).

I think you're thinking about it differently than the parser. I think
the parser is treating this the way, say, the shell treats quotes.
That is, it sees a quoted "I bought this for my 6" followed by an
unquoted "a laptop but it didn't fit my 8" followed by a quoted "
tablet".
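
A rough sketch of that shell-style reading (a toy in Python, not the actual COPY code; shell_style_field is a hypothetical helper, assuming the default " quote with doubling as the escape):

```python
def shell_style_field(raw, quote='"'):
    """Quotes toggle an 'in quoted section' mode; a doubled quote inside
    a quoted section is a literal quote. All quote characters consumed
    this way disappear from the output, as in shell word splitting."""
    out, i, n, in_q = [], 0, len(raw), False
    while i < n:
        if raw[i] == quote:
            if in_q and i + 1 < n and raw[i + 1] == quote:
                out.append(quote)          # escaped quote, kept literally
                i += 2
                continue
            in_q = not in_q                # enter/leave a quoted section
            i += 1
        else:
            out.append(raw[i])
            i += 1
    return ''.join(out)
```

Applied to the raw third field of the example file, it yields exactly the stripped output shown above.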

So for example, in that world you might only quote commas and newlines
so you might print something like

1,2,I bought this for my "6"" laptop
" but it "didn't" fit my "8""" laptop

The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only
allows fully quoted or fully unquoted fields and there can only be
escaped double-doublequote characters in quoted fields and no
doublequote characters in unquoted fields.

But it also says

Due to lack of a single specification, there are considerable
differences among implementations. Implementors should "be
conservative in what you do, be liberal in what you accept from
others" (RFC 793 [8]) when processing CSV files. An attempt at a
common definition can be found in Section 2.

So the real question is are there tools out there that generate
entries like this and what are their intentions?
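
As a data point on how much implementations differ: Python's csv module (non-strict by default) also accepts the line, but interprets it differently again, unwrapping only the first quoted run and passing the later quote characters through as data:

```python
import csv
import io

line = '2,3,"I bought this for my 6" laptop but it didn\'t fit my 8" tablet"'
row = next(csv.reader(io.StringIO(line)))
# The first quoted run is unwrapped; the parser then falls back to an
# unquoted state and keeps the remaining quote characters literally.
```

So there are already at least three readings in the wild: strict RFC 4180 (reject), the shell-style one described above (strip all balanced quotes), and Python's (strip only the leading quoted run).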

I think we should throw a parsing error for unescaped mid-field quotes,
and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
quotes are necessary. The error message could suggest this option when it
encounters an unescaped mid-field quote.

I think the convenience of not having to use an extra option doesn't outweigh
the risk of undetected data integrity issues.

It's also a pretty annoying experience to get a message saying "error,
turn this option on to not get an error". But I get what you're saying
too; whether it's more of a risk depends on whether turning off the error
is really the right thing most of the time, or is just causing data to
be read incorrectly.

--
greg

#4 Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#1)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2023-05-11 Th 10:03, Joel Jacobson wrote:

Hi hackers,

I've come across an unexpected behavior in our CSV parser that I'd like to
bring up for discussion.

% cat example.csv
id,rating,review
1,5,"Great product, will buy again."
2,3,"I bought this for my 6" laptop but it didn't fit my 8" tablet"

% psql
CREATE TABLE reviews (id int, rating int, review text);
\COPY reviews FROM example.csv WITH CSV HEADER;
SELECT * FROM reviews;

This gives:

id | rating |                           review
----+--------+-------------------------------------------------------------
  1 |      5 | Great product, will buy again.
  2 |      3 | I bought this for my 6 laptop but it didn't fit my 8 tablet
(2 rows)

Maybe this is unexpected by you, but it's not by me. What other sane
interpretation of that data could there be? And what CSV producer
outputs such horrible content? As you've noted, ours certainly does not.
Our rules are clear: quotes within quotes must be escaped (default
escape is by doubling the quote char). Allowing partial fields to be
quoted was a deliberate decision when CSV parsing was implemented,
because examples have been seen in the wild.

So I don't think our behaviour is broken or needs fixing. As mentioned
by Greg, this is an example of the adage about being liberal in what you
accept.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#5 Joel Jacobson
joel@compiler.org
In reply to: Andrew Dunstan (#4)
Re: Should CSV parsing be stricter about mid-field quotes?

On Fri, May 12, 2023, at 21:57, Andrew Dunstan wrote:

Maybe this is unexpected by you, but it's not by me. What other sane interpretation of that data could there be? And what CSV producer outputs such horrible content? As you've noted, ours certainly does not. Our rules are clear: quotes within quotes must be escaped (default escape is by doubling the quote char). Allowing partial fields to be quoted was a deliberate decision when CSV parsing was implemented, because examples have been seen in the wild.

So I don't think our behaviour is broken or needs fixing. As mentioned by Greg, this is an example of the adage about being liberal in what you accept.

I understand your position, and your points are indeed in line with the
traditional "Robustness Principle" (aka "Postel's Law") [1] from 1980, which
suggests "be conservative in what you send, be liberal in what you accept."
However, I'd like to offer a different perspective that might be worth
considering.

A 2021 IETF draft, "The Harmful Consequences of the Robustness Principle" [2],
argues that the flexibility advocated by Postel's Law can lead to problems such
as unclear specifications and a multitude of varying implementations. Features
that initially seem helpful can unexpectedly turn into bugs, resulting in
unanticipated consequences and data integrity risks.

Based on the feedback from you and others, I'd like to revise my earlier
proposal. Rather than adding an option to preserve the existing behavior, I now
think it's better to simply report an error in such cases. This approach offers
several benefits: it simplifies the CSV parser, reduces the risk of
misinterpreting data due to malformed input, and prevents the all-too-familiar
situation where users blindly apply an error hint without understanding the
consequences.

Finally, I acknowledge that we can't foresee the number of CSV producers that
produce mid-field quoting, and this change may cause compatibility issues for
some users. However, I consider this an acceptable tradeoff. Users encountering
the error would receive a clear message explaining that mid-field quoting is not
allowed and that they should change their CSV producer's settings to escape
quotes by doubling the quote character. Importantly, this change guarantees that
previously parsed data won't be misinterpreted, as it only enforces stricter
parsing rules.

[1]: https://datatracker.ietf.org/doc/html/rfc761#section-2.10
[2]: https://www.ietf.org/archive/id/draft-iab-protocol-maintenance-05.html

/Joel

#6 Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#5)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2023-05-13 Sa 04:20, Joel Jacobson wrote:

On Fri, May 12, 2023, at 21:57, Andrew Dunstan wrote:

Maybe this is unexpected by you, but it's not by me. What other sane
interpretation of that data could there be? And what CSV producer
outputs such horrible content? As you've noted, ours certainly does
not. Our rules are clear: quotes within quotes must be escaped
(default escape is by doubling the quote char). Allowing partial
fields to be quoted was a deliberate decision when CSV parsing was
implemented, because examples have been seen in the wild.

So I don't think our behaviour is broken or needs fixing. As
mentioned by Greg, this is an example of the adage about being
liberal in what you accept.

I understand your position, and your points are indeed in line with the
traditional "Robustness Principle" (aka "Postel's Law") [1] from 1980, which
suggests "be conservative in what you send, be liberal in what you accept."
However, I'd like to offer a different perspective that might be worth
considering.

A 2021 IETF draft, "The Harmful Consequences of the Robustness Principle" [2],
argues that the flexibility advocated by Postel's Law can lead to problems such
as unclear specifications and a multitude of varying implementations. Features
that initially seem helpful can unexpectedly turn into bugs, resulting in
unanticipated consequences and data integrity risks.

Based on the feedback from you and others, I'd like to revise my earlier
proposal. Rather than adding an option to preserve the existing behavior, I now
think it's better to simply report an error in such cases. This approach offers
several benefits: it simplifies the CSV parser, reduces the risk of
misinterpreting data due to malformed input, and prevents the all-too-familiar
situation where users blindly apply an error hint without understanding the
consequences.

Finally, I acknowledge that we can't foresee the number of CSV producers that
produce mid-field quoting, and this change may cause compatibility issues for
some users. However, I consider this an acceptable tradeoff. Users encountering
the error would receive a clear message explaining that mid-field quoting is not
allowed and that they should change their CSV producer's settings to escape
quotes by doubling the quote character. Importantly, this change guarantees that
previously parsed data won't be misinterpreted, as it only enforces stricter
parsing rules.

[1] https://datatracker.ietf.org/doc/html/rfc761#section-2.10
[2] https://www.ietf.org/archive/id/draft-iab-protocol-maintenance-05.html

I'm pretty reluctant to change something that's been working as designed
for almost 20 years, and about which we have hitherto had zero
complaints that I recall.

I could see an argument for a STRICT mode which would disallow partially
quoted fields, although I'd like some evidence that we're dealing with a
real problem here. Is there really a CSV producer that produces output
like that you showed in your example? And if so has anyone objected to
them about the insanity of that?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#6)
Re: Should CSV parsing be stricter about mid-field quotes?

Andrew Dunstan <andrew@dunslane.net> writes:

I could see an argument for a STRICT mode which would disallow partially
quoted fields, although I'd like some evidence that we're dealing with a
real problem here. Is there really a CSV producer that produces output
like that you showed in your example? And if so has anyone objected to
them about the insanity of that?

I think you'd want not just "some evidence" but "compelling evidence".
Any such option is going to add cycles into the low-level input parser
for COPY, which we know is a hot spot and we've expended plenty of
sweat on. Adding a speed penalty that will be paid by the 99.99%
of users who don't have an issue here is going to be a hard sell.

regards, tom lane

#8 Greg Stark
stark@mit.edu
In reply to: Tom Lane (#7)
Re: Should CSV parsing be stricter about mid-field quotes?

On Sat, 13 May 2023 at 09:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I could see an argument for a STRICT mode which would disallow partially
quoted fields, although I'd like some evidence that we're dealing with a
real problem here. Is there really a CSV producer that produces output
like that you showed in your example? And if so has anyone objected to
them about the insanity of that?

I think you'd want not just "some evidence" but "compelling evidence".
Any such option is going to add cycles into the low-level input parser
for COPY, which we know is a hot spot and we've expended plenty of
sweat on. Adding a speed penalty that will be paid by the 99.99%
of users who don't have an issue here is going to be a hard sell.

Well I'm not sure that follows. Joel specifically claimed that an
implementation that didn't accept inputs like this would actually be
simpler and that might mean it would actually be faster.

And I don't think you have to look very hard for inputs like this --
plenty of people generate CSV files from simple templates or script
outputs that don't understand escaping quotation marks at all. Outputs
like that will be fine as long as there's no doublequotes in the
inputs but then one day someone will enter a doublequote in a form
somewhere and blammo.

So I guess the real question is whether accepting inputs with
unescaped quotes and interpreting them the way we do is really the
best interpretation. Is the user best served by a) assuming they
intended to quote part of the field and not quote part of it, b) assuming
they failed to escape the quotation mark, or c) assuming something's gone
wrong and the input is entirely untrustworthy?

--
greg

#9 Andrew Dunstan
andrew@dunslane.net
In reply to: Greg Stark (#8)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2023-05-13 Sa 23:11, Greg Stark wrote:

On Sat, 13 May 2023 at 09:46, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Andrew Dunstan<andrew@dunslane.net> writes:

I could see an argument for a STRICT mode which would disallow partially
quoted fields, although I'd like some evidence that we're dealing with a
real problem here. Is there really a CSV producer that produces output
like that you showed in your example? And if so has anyone objected to
them about the insanity of that?

I think you'd want not just "some evidence" but "compelling evidence".
Any such option is going to add cycles into the low-level input parser
for COPY, which we know is a hot spot and we've expended plenty of
sweat on. Adding a speed penalty that will be paid by the 99.99%
of users who don't have an issue here is going to be a hard sell.

Well I'm not sure that follows. Joel specifically claimed that an
implementation that didn't accept inputs like this would actually be
simpler and that might mean it would actually be faster.

And I don't think you have to look very hard for inputs like this --
plenty of people generate CSV files from simple templates or script
outputs that don't understand escaping quotation marks at all. Outputs
like that will be fine as long as there's no doublequotes in the
inputs but then one day someone will enter a doublequote in a form
somewhere and blammo.

The procedure described is plain wrong, and I don't have too much
sympathy for people who implement it. Parsing CSV files might be a mild
PITN, but constructing them is pretty darn simple. Something like this
perl fragment should do it:

do {
  s/"/""/g;
  $_ = qq{"$_"} if /[",\r\n]/;
} foreach @fields;
print join(',',@fields),"\n";

And if people do follow the method you describe then their input with
unescaped quotes will be rejected 999 times out of 1000. It's only cases
where the field happens to have an even number of embedded quotes, like
Joel's somewhat contrived example, that the input will be accepted.

So I guess the real question is whether accepting inputs with
unescaped quotes and interpreting them the way we do is really the
best interpretation. Is the user best served by a) assuming they
intended to quote part of the field and not quote part of it, b) assuming
they failed to escape the quotation mark, or c) assuming something's gone
wrong and the input is entirely untrustworthy?

As I said earlier, I'm quite reluctant to break things that might have
been working happily for people for many years, in order to accommodate
people who can't do the minimum required to produce correct CSVs. I have
no idea how many are relying on it, but I would be slightly surprised if
the number were zero.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#10 Joel Jacobson
joel@compiler.org
In reply to: Andrew Dunstan (#9)
Re: Should CSV parsing be stricter about mid-field quotes?

On Sun, May 14, 2023, at 16:58, Andrew Dunstan wrote:

And if people do follow the method you describe then their input with
unescaped quotes will be rejected 999 times out of 1000. It's only cases where
the field happens to have an even number of embedded quotes, like Joel's
somewhat contrived example, that the input will be accepted.

I concur with Andrew that my previous example might've been somewhat
contrived, as it deliberately included two instances of the term "inches".
It's only a matter of time before someone submits a review featuring an odd
number of "inches", leading to an error.

Having done some additional digging, I stumbled upon three instances [1] [2] [3]
where users have misidentified their TSV/TEXT files as CSV. In those situations,
an imbalance in quotes triggered an error, shielding the users from silent
failure.

However, in cases where a field is utilized to store text in which the double
quotation mark is rarely used to denote inches, but instead, for quotation
and/or HTML attributes, it's quite feasible that a large amount of
user-generated text could contain balanced quotes. Even more concerning is the
potential for cases where the vast majority of inputs may not even contain
double quotation marks at all. This would effectively render the issue invisible,
even upon manual data inspection.

Here's a problem scenario that I believe is plausible:

1. The user wishes to import a .TXT file into PostgreSQL.

2. The user examines the .TXT file and observes column headers separated by a
delimiter like TAB or semicolon, with subsequent rows of data also separated by
the same delimiter.

3. The user is familiar with "CSV" (412M hits on Google) but not "TSV" (48M hits
on Google), leading to a false assumption that their file is in CSV format.

4. A Google search for "import csv into postgresql" leads the user to a tutorial
titled "Import CSV File Into PostgreSQL Table". An example found therein:

COPY persons(first_name, last_name, dob, email)
FROM 'C:\sampledb\persons.csv'
DELIMITER ','
CSV HEADER;

5. The user, now confident, believes they understand how to import their "CSV"
file.

6. In contrast to the "ERROR: unterminated CSV quoted field" examples below,
this user's .TXT file contains fields with balanced midfield quote-marks:

blog_posts.txt:
id message
1 This is a <b>bold</b> statement

7. The user copies the COPY command from the tutorial and modifies the file path
and delimiter accordingly. The user then concludes that the code is functioning
as expected and proceeds to deploy it.

8. Weeks later, users complain about broken links in their blog posts. Upon
inspection of the blog_posts table, the user identifies an issue:

SELECT * FROM blog_posts;
id | message
----+------------------------------------------------------------------------
1 | This is a <b>bold</b> statement
2 | Check <a href=http://example.com/?param1=Midfield quoting>this</a> out
(2 rows)

One of the users has used balanced quotes for the href attribute, which was
imported successfully but the quotes were stripped, contrary to the intention of
preserving them.

Content of blog_posts.txt:
id message
1 This is a <b>bold</b> statement
2 Check <a href="http://example.com/?param1=Midfield quoting">this</a> out

If we made midfield quoting a CSV error, those users who are currently mistaken
about their TSV/TEXT files being CSV while also having balanced quotes in their
data, would encounter an error rather than a silent failure, which I believe
would be an enhancement.

/Joel

[1]: /messages/by-id/1upfg19cru2jigbm553fugj5k6iebtd4ps@4ax.com
[2]: https://stackoverflow.com/questions/44108286/unterminated-csv-quoted-field-in-postgres
[3]: https://dba.stackexchange.com/questions/306662/unterminated-csv-quoted-field-when-to-import-csv-data-file-into-postgresql

#11 Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#10)
Re: Should CSV parsing be stricter about mid-field quotes?

On Tue, May 16, 2023, at 13:43, Joel Jacobson wrote:

If we made midfield quoting a CSV error, those users who are currently mistaken
about their TSV/TEXT files being CSV while also having balanced quotes in their
data, would encounter an error rather than a silent failure, which I believe
would be an enhancement.

Furthermore, I think it could be beneficial to add a HINT message for all types
of CSV/TEXT parsing errors, since the precise ERROR messages might just cause
the user to tinker with the options until it works, instead of carefully reading
through the documentation on the various formats.

Perhaps something like this:

HINT: Are you sure the FORMAT matches your input?

Also, the COPY documentation says nothing about TSV, and I know TEXT isn't
exactly TSV, but it's at least much more TSV than CSV, so maybe we should
describe the differences, such as \N. I think the best advice to users would be
to avoid exporting to .TSV and use .CSV instead, since I've noticed that e.g.
Google Sheets replaces newlines in fields with blank space when
exporting .TSV, which effectively destroys data.

The first search results for "postgresql tsv" on Google link to postgresql.org
pages, but the COPY docs are not one of them unfortunately.

The first relevant hit is this one:

"Importing a TSV File into Postgres | by Riley Wong" [1]

Sadly, this author has also misunderstood how to properly import a .TSV file;
he doesn't understand, or at least doesn't mention, that there are more
differences than just the delimiter:

COPY listings
FROM '/home/ec2-user/list.tsv'
DELIMITER E'\t'
CSV HEADER;

I must confess I have used PostgreSQL for over two decades without having really
understood the detailed differences between TEXT and CSV, until recently.

[1]: https://medium.com/@rlwong2/importing-a-tsv-file-into-postgres-364572a004bf

#12 Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#11)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2023-05-16 Tu 13:15, Joel Jacobson wrote:

On Tue, May 16, 2023, at 13:43, Joel Jacobson wrote:

If we made midfield quoting a CSV error, those users who are currently mistaken
about their TSV/TEXT files being CSV while also having balanced quotes in their
data, would encounter an error rather than a silent failure, which I believe
would be an enhancement.

Furthermore, I think it could be beneficial to add a HINT message for all types
of CSV/TEXT parsing errors, since the precise ERROR messages might just cause
the user to tinker with the options until it works, instead of carefully reading
through the documentation on the various formats.

Perhaps something like this:

HINT: Are you sure the FORMAT matches your input?

Also, the COPY documentation says nothing about TSV, and I know TEXT isn't
exactly TSV, but it's at least much more TSV than CSV, so maybe we should
describe the differences, such as \N. I think the best advice to users would be
to avoid exporting to .TSV and use .CSV instead, since I've noticed that e.g.
Google Sheets replaces newlines in fields with blank space when
exporting .TSV, which effectively destroys data.

The first search results for "postgresql tsv" on Google link to postgresql.org
pages, but the COPY docs are not one of them unfortunately.

The first relevant hit is this one:

"Importing a TSV File into Postgres | by Riley Wong" [1]

Sadly, this author has also misunderstood how to properly import a .TSV file;
he doesn't understand, or at least doesn't mention, that there are more
differences than just the delimiter:

COPY listings
FROM '/home/ec2-user/list.tsv'
DELIMITER E'\t'
CSV HEADER;

I must confess I have used PostgreSQL for over two decades without having
really understood the detailed differences between TEXT and CSV, until recently.

[1] https://medium.com/@rlwong2/importing-a-tsv-file-into-postgres-364572a004bf

You can use CSV mode pretty reliably for TSV files. The trick is to use
a quoting char that shouldn't appear, such as E'\x01' as well as setting
the delimiter to E'\t'. Yes, it's far from obvious.
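
For what it's worth, the same trick works in other CSV readers; here's an illustrative parallel using Python's csv module (not the COPY code path), where an unused quotechar plays the role of QUOTE E'\x01':

```python
import csv
import io

tsv = 'id\tmsg\n1\t"quoted" at start\n'

# Default quotechar: the leading quote opens a quoted field and is eaten.
naive = list(csv.reader(io.StringIO(tsv), delimiter='\t'))
# A quotechar that cannot occur (the QUOTE E'\x01' trick): quotes survive.
safe = list(csv.reader(io.StringIO(tsv), delimiter='\t', quotechar='\x01'))
```

The first reader mangles the field that happens to begin with a quote; the second passes it through byte for byte.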

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#13 Joel Jacobson
joel@compiler.org
In reply to: Andrew Dunstan (#12)
Re: Should CSV parsing be stricter about mid-field quotes?

On Wed, May 17, 2023, at 19:42, Andrew Dunstan wrote:

You can use CSV mode pretty reliably for TSV files. The trick is to use a
quoting char that shouldn't appear, such as E'\x01' as well as setting the
delimiter to E'\t'. Yes, it's far from obvious.

I've been using that trick myself many times in the past, but thanks to this
deep-dive into the topic, it looks to me like TEXT would be a better format fit
when dealing with unquoted TSV files, no?

OTOH, one would then need to check that the TSV file doesn't contain \. alone
on a line...

I was about to suggest we perhaps should consider adding a TSV format that is
like TEXT but excludes the PostgreSQL-specific things like \. and \N, but then
I tested exporting TSV from Numbers on Mac and from Google Sheets, and I can
see there are incompatible differences. Numbers quotes fields that contain
double-quote marks, while Google Sheets doesn't. Neither of them
(unsurprisingly) uses mid-field quoting, though.

Is there anyone using Excel who could try exporting the following example as CSV/TSV?

CREATE TABLE t (a text, b text, c text, d text);
INSERT INTO t (a, b, c, d)
VALUES ('unquoted','a "quoted" string', 'field, with a comma', E'field\t with a tab');

I agree with you that it's unwise to change something that's been working great
for such a long time, and I agree CSV files are probably not a problem per se,
but I think you will agree with me that TSV files are a different story, from a
user-friendliness and correctness perspective. Sure, we could just say "Don't
use TSV! Use CSV instead!" in the docs; that would be an improvement, I think,
but there is currently nothing on "TSV" in the docs, so users will google and
find all these broken, dangerous suggestions for work-arounds.

Thoughts?

/Joel

#14 Kirk Wolak
wolakk@gmail.com
In reply to: Joel Jacobson (#13)
4 attachment(s)
Re: Should CSV parsing be stricter about mid-field quotes?

On Wed, May 17, 2023 at 5:47 PM Joel Jacobson <joel@compiler.org> wrote:

On Wed, May 17, 2023, at 19:42, Andrew Dunstan wrote:

You can use CSV mode pretty reliably for TSV files. The trick is to use a
quoting char that shouldn't appear, such as E'\x01', as well as setting the
delimiter to E'\t'. Yes, it's far from obvious.

I've been using that trick myself many times in the past, but thanks to this
deep-dive into the topic, it looks to me like TEXT would be a better format fit
when dealing with unquoted TSV files, no?

OTOH, one would then need to check that the TSV file doesn't contain \. alone
on a line...

I was about to suggest we perhaps should consider adding a TSV format, that
is like TEXT excluding the PostgreSQL specific things like \. and \N,
but then I tested exporting TSV from Numbers on Mac and Google Sheets,
and I can see there are incompatible differences. Numbers quote fields
that contain double-quote marks, while Google Sheets doesn't.
None of them (unsurpringly) uses midfield quoting though.

Anyone using Excel that could try exporting the following example as
CSV/TSV?

CREATE TABLE t (a text, b text, c text, d text);
INSERT INTO t (a, b, c, d)
VALUES ('unquoted','a "quoted" string', 'field, with a comma', E'field\t
with a tab');

Here you go. Not horrible handling. (I use DataGrip so I saved it from
there directly as TSV,
just for an extra datapoint).

FWIW, if you copy/paste the data in Windows, the field with the tab gets split
into another column in Excel. But I also tried saving it as a file and opening
it, and saving it as XLSX and then having Excel save it as a TSV (versus
opening a text file and saving it back).

Kirk...

Attachments:

t_xlsx_saved_as_tsv..txt (text/plain; charset=US-ASCII)
t_test_excel.csv (text/csv; charset=UTF-8)
t_test_excel.tsv.txt (text/plain; charset=US-ASCII)
t_test_DataGrip.tsv (application/octet-stream)
#15 Joel Jacobson
joel@compiler.org
In reply to: Kirk Wolak (#14)
Re: Should CSV parsing be stricter about mid-field quotes?

On Thu, May 18, 2023, at 00:18, Kirk Wolak wrote:

Here you go. Not horrible handling. (I use DataGrip so I saved it from there
directly as TSV, just for an extra datapoint).

FWIW, if you copy/paste the data in Windows, the field with the tab gets split
into another column in Excel. But I also tried saving it as a file and opening
it, and saving it as XLSX and then having Excel save it as a TSV (versus
opening a text file and saving it back).

Very useful, thanks.

Interesting: DataGrip, contrary to Excel, doesn't quote fields with commas in
TSV. All the DataGrip/Excel TSV variants use quoting when necessary, contrary
to Google Sheets's TSV format, which doesn't quote fields at all.

DataGrip/Excel also terminate the last record with a newline, while Google
Sheets omits the newline for the last record (which is bad, since a streaming
reader then wouldn't know whether the last record is complete or not).

This makes me think we probably shouldn't add a new TSV format,
since there is no consistency between vendors.
It's impossible to deduce with certainty whether a TSV field that
begins with a double quotation mark is quoted or unquoted.
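The ambiguity can be made concrete with a small Python sketch (purely illustrative; both readings are valid TSV under some vendor's rules):

```python
import csv

# One physical TSV line; is the leading double quote a quote or data?
line = '"6\tlaptop"'

# Read as quoted TSV (Excel/DataGrip style): one field containing a tab.
print(next(csv.reader([line], delimiter='\t')))
# ['6\tlaptop']

# Read as unquoted TSV (Google Sheets style): two literal fields.
print(next(csv.reader([line], delimiter='\t', quoting=csv.QUOTE_NONE)))
# ['"6', 'laptop"']
```

A consumer that guesses the wrong dialect silently gets a different row shape, which is exactly why deducing quotedness from the data alone can't work.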

Two alternative ideas:

1. How about adding a `WITHOUT QUOTE` or `QUOTE NONE` option in conjunction
with `COPY ... WITH CSV`?

Internally, it would just set

quotec = '\0';

so it wouldn't affect performance at all.

2. How about adding a note on the complexities of dealing with TSV files in the
COPY documentation?

/Joel

#16Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#15)
Re: Should CSV parsing be stricter about mid-field quotes?

On Thu, May 18, 2023, at 08:00, Joel Jacobson wrote:

1. How about adding a `WITHOUT QUOTE` or `QUOTE NONE` option in conjunction
with `COPY ... WITH CSV`?

More ideas:
[ QUOTE 'quote_character' | UNQUOTED ]
or
[ QUOTE 'quote_character' | NO_QUOTE ]

Thinking about it, I recall another hack;
specifying a non-existing char as the delimiter to force the entire line into a
single column table. For that use-case, we could also provide an option that
would internally set:

delimc = '\0';

How about:

[DELIMITER 'delimiter_character' | UNDELIMITED ]
or
[DELIMITER 'delimiter_character' | NO_DELIMITER ]
or it should be more use-case-based and intuitive:
[DELIMITER 'delimiter_character' | WHOLE_LINE_AS_RECORD ]

/Joel

#17Pavel Stehule
pavel.stehule@gmail.com
In reply to: Joel Jacobson (#15)
Re: Should CSV parsing be stricter about mid-field quotes?

čt 18. 5. 2023 v 8:01 odesílatel Joel Jacobson <joel@compiler.org> napsal:

On Thu, May 18, 2023, at 00:18, Kirk Wolak wrote:

Here you go. Not horrible handling. (I use DataGrip so I saved it from

there

directly as TSV, just for an extra datapoint).

FWIW, if you copy/paste in windows, the data, the field with the tab gets
split into another column in Excel. But saving it as a file, and opening

it.

Saving it as XLSX, and then having Excel save it as a TSV (versus

opening a

text file, and saving it back)

Very useful, thanks.

Interesting, DataGrip contrary to Excel doesn't quote fields with commas
in TSV.
All the DataGrip/Excel TSV variants uses quoting when necessary,
contrary to Google Sheets's TSV-format, that doesn't quote fields at all.

Maybe there is another third implementation in Libre Office.

Generally TSV is not well specified, and so the implementations are not
consistent.


DataGrip/Excel terminate also the last record with newline,
while Google Sheets omit the newline for the last record,
(which is bad, since then a streaming reader wouldn't know
if the last record is completed or not.)

This makes me think we probably shouldn't add a new TSV format,
since there is no consistency between vendors.
It's impossible to deduce with certainty if a TSV-field that
begins with a double quotation mark is quoted or unquoted.

Two alternative ideas:

1. How about adding a `WITHOUT QUOTE` or `QUOTE NONE` option in conjunction
with `COPY ... WITH CSV`?

Internally, it would just set

quotec = '\0';`

so it would't affect performance at all.

2. How about adding a note on the complexities of dealing with TSV files
in the
COPY documentation?

/Joel

#18Joel Jacobson
joel@compiler.org
In reply to: Pavel Stehule (#17)
Re: Should CSV parsing be stricter about mid-field quotes?

On Thu, May 18, 2023, at 08:35, Pavel Stehule wrote:

Maybe there is another third implementation in Libre Office.

Generally TSV is not well specified, and then the implementations are not consistent.

Thanks Pavel, that was a very interesting case indeed:

Libre Office (tested on Mac) doesn't have a separate TSV format,
but its CSV format allows specifying custom "Field delimiter" and
"String delimiter".

How peculiar, in Libre Office, when trying to write double quotation marks
(using Shift+2 on my keyboard) you actually don't get the normal double
quotation marks, but some special type of Unicode-quoting,
e2 80 9c ("LEFT DOUBLE QUOTATION MARK") and
e2 80 9d ("RIGHT DOUBLE QUOTATION MARK"),
and in the .CSV file you get the normal double quotation marks as
"String delimiter":

a,b,c,d,e
unquoted,“this field is quoted”,this “word” is quoted,"field with , comma",field with tab

So, my "this field is quoted" experiment was exported unquoted since their
quotation marks don't need to be quoted.

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#16)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2023-05-18 Th 02:19, Joel Jacobson wrote:

On Thu, May 18, 2023, at 08:00, Joel Jacobson wrote:

1. How about adding a `WITHOUT QUOTE` or `QUOTE NONE` option in

conjunction

with `COPY ... WITH CSV`?

More ideas:
[ QUOTE 'quote_character' | UNQUOTED ]
or
[ QUOTE 'quote_character' | NO_QUOTE ]

Thinking about it, I recall another hack;
specifying a non-existing char as the delimiter to force the entire
line into a
single column table. For that use-case, we could also provide an
option that
would internally set:

    delimc = '\0';

How about:

[DELIMITER 'delimiter_character' | UNDELIMITED ]
or
[DELIMITER 'delimiter_character' | NO_DELIMITER ]
or it should be more use-case-based and intuitive:
[DELIMITER 'delimiter_character' | WHOLE_LINE_AS_RECORD ]

QUOTE NONE and DELIMITER NONE should work fine. NONE is already a
keyword, so the disturbance should be minimal.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#20Daniel Verite
daniel@manitou-mail.org
In reply to: Joel Jacobson (#13)
Re: Should CSV parsing be stricter about mid-field quotes?

Joel Jacobson wrote:

I've been using that trick myself many times in the past, but thanks to this
deep-dive into this topic, it looks to me like TEXT would be a better format
fit when dealing with unquoted TSV files, or?

OTOH, one would then need to check that the TSV file doesn't contain \. on a
line by itself...

Note that this is the case for valid CSV contents, since backslash-dot
on a line by itself is both an end-of-data marker for COPY FROM and a
valid CSV line.
Having this line in the data results in either an error or in the
rest of the data being silently discarded, depending on the context. There
is some previous discussion about this in [1].
Since the TEXT format doesn't have this kind of problem, one solution
is to filter the data through PROGRAM with an [untrusted CSV]->TEXT
filter. This is to be preferred over direct CSV loading when
strictness or robustness are more important than convenience.

[1] /messages/by-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org
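A minimal sketch of such a PROGRAM filter in Python (the function name is hypothetical; COPY's text format escapes backslash, tab, newline and carriage return, so an unquoted \. line cannot occur in the output):

```python
import csv
import sys

def csv_to_copy_text(src, dst):
    """Re-emit untrusted CSV as COPY's text format.  Because every
    backslash in the data is escaped, a bare \\. end-of-data line
    can no longer appear in the stream."""
    def esc(field):
        # Escape the backslash first, then the control characters.
        return (field.replace('\\', '\\\\')
                     .replace('\t', '\\t')
                     .replace('\n', '\\n')
                     .replace('\r', '\\r'))
    for row in csv.reader(src):
        dst.write('\t'.join(esc(f) for f in row) + '\n')

if __name__ == '__main__':
    csv_to_copy_text(sys.stdin, sys.stdout)
```

This sketch deliberately ignores NULL handling (empty CSV fields become empty strings, not \N), encodings, and column-count validation; a production filter would need to decide on all three.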

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#21Joel Jacobson
joel@compiler.org
In reply to: Daniel Verite (#20)
Re: Should CSV parsing be stricter about mid-field quotes?

On Thu, May 18, 2023, at 18:48, Daniel Verite wrote:

Joel Jacobson wrote:

OTOH, one would then need to inspect the TSV file doesn't contain \. on an
empty line...

Note that this is the case for valid CSV contents, since backslash-dot
on a line by itself is both an end-of-data marker for COPY FROM and a
valid CSV line.
Having this line in the data results in either an error or having the
rest of the data silently discarded, depending on the context. There
is some previous discussion about this in [1].
Since the TEXT format doesn't have this kind of problem, one solution
is to filter the data through PROGRAM with an [untrusted CSV]->TEXT
filter. This is to be preferred over direct CSV loading when
strictness or robustness are more important than convenience.

[1]
/messages/by-id/10e3eff6-eb04-4b3f-aeb4-b920192b977a@manitou-mail.org

Thanks for sharing the old thread, very useful.
I see I've failed miserably to understand all the details of the COPY command.

Upon reading the thread, I'm still puzzled about one thing:

Why does \. need to have a special meaning when using COPY FROM with files?

I understand its necessity for STDIN, given that the end of input needs to be
explicitly defined.
However, for files, we have a known file size and the end-of-file can be
detected without the need for special markers.

Also, is the difference in how server-side COPY CSV is capable of dealing
with \. but apparently not the client-side \COPY CSV documented somewhere?

CREATE TABLE t (c text);
INSERT INTO t (c) VALUES ('foo'), (E'\n\\.\n'), ('bar');

-- Works OK:
COPY t TO '/tmp/t.csv' WITH CSV;
TRUNCATE t;
COPY t FROM '/tmp/t.csv' WITH CSV;

-- Doesn't work:
\COPY t TO '/tmp/t.csv' WITH CSV;
TRUNCATE t;
\COPY t FROM '/tmp/t.csv' WITH CSV;
ERROR: unterminated CSV quoted field
CONTEXT: COPY t, line 4: ""
\.
"

/Joel

#22Daniel Verite
daniel@manitou-mail.org
In reply to: Joel Jacobson (#21)
Re: Should CSV parsing be stricter about mid-field quotes?

Joel Jacobson wrote:

I understand its necessity for STDIN, given that the end of input needs to
be explicitly defined.
However, for files, we have a known file size and the end-of-file can be
detected without the need for special markers.

Also, is the difference in how server-side COPY CSV is capable of dealing
with \. but apparently not the client-side \COPY CSV documented somewhere?

psql implements the client-side "\copy table from file..." with
COPY table FROM STDIN ...

COPY FROM file CSV somewhat differs, as your example shows,
but it still mishandles \. when unquoted. For instance, consider this
file to load with COPY t FROM '/tmp/t.csv' WITH CSV
$ cat /tmp/t.csv
line 1
\.
line 3
line 4

It results in having only "line 1" being imported.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#23Joel Jacobson
joel@compiler.org
In reply to: Daniel Verite (#22)
Re: Should CSV parsing be stricter about mid-field quotes?

On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:

COPY FROM file CSV somewhat differs as your example shows,
but it still mishandle \. when unquoted. For instance, consider this
file to load with COPY t FROM '/tmp/t.csv' WITH CSV
$ cat /tmp/t.csv
line 1
\.
line 3
line 4

It results in having only "line 1" being imported.

Hmm, this is a problem for one of the new use-cases I brought up that would be
possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
where each raw line should be imported "as is" into a single text column.

Is there a valid reason why \. is needed for COPY FROM filename?
It seems to me it would only be necessary for the COPY FROM STDIN case,
since files have a natural end-of-file and a known file size.

/Joel

#24Daniel Verite
daniel@manitou-mail.org
In reply to: Joel Jacobson (#23)
Re: Should CSV parsing be stricter about mid-field quotes?

Joel Jacobson wrote:

Is there a valid reason why \. is needed for COPY FROM filename?
It seems to me it would only be necessary for the COPY FROM STDIN case,
since files have a natural end-of-file and a known file size.

Looking at CopyReadLineText() over at [1], I don't see a reason why
the unquoted \. could not be handled with COPY FROM file.
Even COPY FROM STDIN looks like it could benefit, so that
\copy from file csv would hopefully not choke or truncate the data.
There's still the case when the CSV data is embedded in a psql script
(psql is unable to know where it ends), but for that, "don't do that"
might be a reasonable answer.

[1] https://doxygen.postgresql.org/copyfromparse_8c.html#a90201f711221dd82d0c08deedd91e1b3

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#25Kirk Wolak
wolakk@gmail.com
In reply to: Daniel Verite (#24)
Re: Should CSV parsing be stricter about mid-field quotes?

On Mon, May 22, 2023 at 12:13 PM Daniel Verite <daniel@manitou-mail.org>
wrote:

Joel Jacobson wrote:

Is there a valid reason why \. is needed for COPY FROM filename?
It seems to me it would only be necessary for the COPY FROM STDIN case,
since files have a natural end-of-file and a known file size.

Looking at CopyReadLineText() over at [1], I don't see a reason why
the unquoted \. could not be handled with COPY FROM file.
Even COPY FROM STDIN looks like it could be benefit, so that
\copy from file csv would hopefully not choke or truncate the data.
There's still the case when the CSV data is embedded in a psql script
(psql is unable to know where it ends), but for that, "don't do that"
might be a reasonable answer.

Don't have what looks like a pg_dump script?
We specifically create such SQL files with embedded data. Depending on the
circumstances, we either confirm that indexes are dropped and triggers are
disabled... [Or we create a dynamic name and insert it into a queue table
for later processing], and then we COPY the data, ending in
\.

We do NOT do "CSV", we mimic pg_dump.

Now, if you are talking about only impacting a fixed data file format...
Sure. But impacting how psql
processes these \i included files... (that could hurt)

#26Daniel Verite
daniel@manitou-mail.org
In reply to: Kirk Wolak (#25)
Re: Should CSV parsing be stricter about mid-field quotes?

Kirk Wolak wrote:

We do NOT do "CSV", we mimic pg_dump.

pg_dump uses the text format (as opposed to csv), where
\. on a line by itself cannot appear in the data, so there's
no problem. The problem is limited to the csv format.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#27Noah Misch
noah@leadboat.com
In reply to: Joel Jacobson (#23)
Re: Should CSV parsing be stricter about mid-field quotes?

On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:

On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:

COPY FROM file CSV somewhat differs as your example shows,
but it still mishandle \. when unquoted. For instance, consider this
file to load with COPY t FROM '/tmp/t.csv' WITH CSV
$ cat /tmp/t.csv
line 1
\.
line 3
line 4

It results in having only "line 1" being imported.

Hmm, this is a problem for one of the new use-cases I brought up that would be
possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
where each raw line should be imported "as is" into a single text column.

Is there a valid reason why \. is needed for COPY FROM filename?

No.

It seems to me it would only be necessary for the COPY FROM STDIN case,
since files have a natural end-of-file and a known file size.

Right. Even for COPY FROM STDIN, it's not needed anymore since the removal of
protocol v2. psql would still use it to find the end of inline COPY data,
though. Here's another relevant thread:
/messages/by-id/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2@beta.fastmail.com

#28Joel Jacobson
joel@compiler.org
In reply to: Noah Misch (#27)
Re: Should CSV parsing be stricter about mid-field quotes?

On Sun, Jul 2, 2023, at 07:45, Noah Misch wrote:

On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:

On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:

COPY FROM file CSV somewhat differs as your example shows,
but it still mishandle \. when unquoted. For instance, consider this
file to load with COPY t FROM '/tmp/t.csv' WITH CSV
$ cat /tmp/t.csv
line 1
\.
line 3
line 4

It results in having only "line 1" being imported.

Hmm, this is a problem for one of the new use-cases I brought up that would be
possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
where each raw line should be imported "as is" into a single text column.

Is there a valid reason why \. is needed for COPY FROM filename?

No.

It seems to me it would only be necessary for the COPY FROM STDIN case,
since files have a natural end-of-file and a known file size.

Right. Even for COPY FROM STDIN, it's not needed anymore since the
removal of
protocol v2. psql would still use it to find the end of inline COPY
data,
though. Here's another relevant thread:
/messages/by-id/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2@beta.fastmail.com

I was very pleased to see commit 7702337:

Do not treat \. as an EOF marker in CSV mode for COPY IN.

Great job!

Thanks to this fix, maybe there is now interest to resume the discussion on
the ideas discussed in this thread?

Recap of ideas:

1. Stricter parsing, reject mid-field quotes

The idea is to prevent balanced mid-field quotes from being silently removed.

Example:

% cat example.csv
id,post
1,<p>Hello there!</p>
2,<a href="http://example.com">Click me!</a>

% psql
# \copy posts from example.csv with csv header;
COPY 2
# SELECT * FROM posts;
id | post
----+------------------------------------------
1 | <p>Hello there!</p>
2 | <a href=http://example.com>Click me!</a>
(2 rows)

Note how the quotes around the URL disappeared.
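For comparison, here is a hypothetical illustration using Python's csv module, whose default behavior likewise drops mid-field quotes silently, while its strict mode rejects them much like the error proposed in this thread:

```python
import csv

line = ['2,3,"I bought this for my 6" laptop']

# Default: the closing mid-field quote is silently swallowed.
print(next(csv.reader(line)))
# ['2', '3', 'I bought this for my 6 laptop']

# strict=True raises an error instead, comparable to the stricter
# parsing proposed for COPY ... WITH CSV.
try:
    next(csv.reader(line, strict=True))
except csv.Error as e:
    print('rejected:', e)
```

Note that Python only rejects trailing data after a quoted field; it is shown here purely to illustrate that other parsers already offer an opt-in strict mode.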

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

Best regards,

Joel

#29Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#28)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:

On Sun, Jul 2, 2023, at 07:45, Noah Misch wrote:

On Sat, May 20, 2023 at 09:16:30AM +0200, Joel Jacobson wrote:

On Fri, May 19, 2023, at 18:06, Daniel Verite wrote:

COPY FROM file CSV somewhat differs as your example shows,
but it still mishandle \. when unquoted. For instance, consider this
file to load with COPY t FROM '/tmp/t.csv' WITH CSV
$ cat /tmp/t.csv
line 1
\.
line 3
line 4

It results in having only "line 1" being imported.

Hmm, this is a problem for one of the new use-cases I brought up that would be
possible with DELIMITER NONE QUOTE NONE, i.e. to import unstructured log files,
where each raw line should be imported "as is" into a single text column.

Is there a valid reason why \. is needed for COPY FROM filename?

No.

It seems to me it would only be necessary for the COPY FROM STDIN case,
since files have a natural end-of-file and a known file size.

Right. Even for COPY FROM STDIN, it's not needed anymore since the
removal of
protocol v2. psql would still use it to find the end of inline COPY
data,
though. Here's another relevant thread:
/messages/by-id/flat/bfcd57e4-8f23-4c3e-a5db-2571d09208e2@beta.fastmail.com

I was very pleased to see commit 7702337:

Do not treat \. as an EOF marker in CSV mode for COPY IN.

Great job!

Thanks to this fix, maybe there is now interest to resume the discussion on
the ideas discussed in this thread?

Recap of ideas:

1. Stricter parsing, reject mid-field quotes

The idea is to prevent balanced mid-field quotes from being silently removed.

Example:

% cat example.csv
id,post
1,<p>Hello there!</p>
2,<a href="http://example.com">Click me!</a>

% psql
# \copy posts from example.csv with csv header;
COPY 2
# SELECT * FROM posts;
id | post
----+------------------------------------------
1 | <p>Hello there!</p>
2 | <a href=http://example.com>Click me!</a>
(2 rows)

Note how the quotes around the URL disappeared.

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

As I think I previously indicated, I'm perfectly happy about 2, because
it replaces a far from obvious hack, but I am at best dubious about 1.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#30Joel Jacobson
joel@compiler.org
In reply to: Andrew Dunstan (#29)
Re: Should CSV parsing be stricter about mid-field quotes?

On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:

On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

As I think I previously indicated, I'm perfectly happy about 2, because
it replaces a far from obvious hack, but I am at best dubious about 1.

I've looked at how to implement this, and there is quite a lot of complexity
having to do with quoting and escaping.

Need guidance on what you think would be best to do:

2a) Should we aim to support all NONE combinations, at the cost of increased
complexity in all code having to do with quoting, escaping and delimiters?

2b) Should we aim to only support the QUOTE NONE DELIMITER NONE ESCAPE NONE case,
useful for the real-life scenario we've identified, that is, importing raw log
lines into a single column, which could then be handled by a much simpler and
probably faster version of CopyReadAttributesCSV(),
e.g. named CopyReadAttributesUnquotedUnDelimited() or
maybe CopyReadAttributesRaw()?
(We would also need to modify CopyReadLineText(), but it seems we only need a
quote_none bool to skip over the quoting code there, so I don't think a
separate function is warranted.)

I think ESCAPE NONE should be implied from QUOTE NONE, since the default escape
character is the same as the quote character, so if there isn't any
quote character, then I think that would imply no escape character either.

Can we think of any other valid, useful, realistic, and safe combinations of
QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting
to support?

If not, then I think 2b looks more interesting, to reduce risk of accidental
misuse, simpler implementation, and since it also should allow importing
raw log files faster, thanks to the reduced complexity.

Best regards,

Joel

#31Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#30)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:

On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:

On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

As I think I previously indicated, I'm perfectly happy about 2, because
it replaces a far from obvious hack, but I am at best dubious about 1.

I've looked at how to implement this, and there is quite a lot of complexity
having to do with quoting and escaping.

Need guidance on what you think would be best to do:

2a) Should we aim to support all NONE combinations, at the cost of increasing the
complexity at all code having to do with quoting, escaping and delimiters?

2b) Should we aim to only support the QUOTE NONE DELIMITER NONE ESCAPE NONE case,
useful to the real-life scenario we've identified, that is, importing raw log
lines into a single column, which could then be handed by a much simpler and
probably faster version of CopyReadAttributesCSV(),
e.g. named CopyReadAttributesUnquotedUnDelimited() or
maybe CopyReadAttributesRaw()?
(We also need to modify CopyReadLineText(), but seems we only need a
quote_none bool, to skip over the quoting code there, so don't think a
separate function is warranted there.)

I think ESCAPE NONE should be implied from QUOTE NONE, since the default escape
character is the same as the quote character, so if there isn't any
quote character, then I think that would imply no escape character either.

Can we think of any other valid, useful, realistic, and safe combinations of
QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting
to support?

If not, then I think 2b looks more interesting, to reduce risk of accidental
misuse, simpler implementation, and since it also should allow importing
raw log files faster, thanks to the reduced complexity.

Off hand I can't think of a case other than 2b that would apply in the
real world, although others might like to chime in here. If we're going
to do that, let's find a shorter way to spell it. In fact, we should do
that even if we go with 2a.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#32Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#31)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2024-10-09 We 8:00 AM, Andrew Dunstan wrote:

On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:

On Sun, Oct 6, 2024, at 15:12, Andrew Dunstan wrote:

On 2024-10-04 Fr 12:19 PM, Joel Jacobson wrote:

2. Avoid needing hacks like using E'\x01' as quoting char.

Introduce QUOTE NONE and DELIMITER NONE,
to allow raw lines to be imported "as is" into a single text column.

As I think I previously indicated, I'm perfectly happy about 2, because
it replaces a far from obvious hack, but I am at best dubious about 1.

I've looked at how to implement this, and there is quite a lot of
complexity
having to do with quoting and escaping.

Need guidance on what you think would be best to do:

2a) Should we aim to support all NONE combinations, at the cost of
increasing the
complexity at all code having to do with quoting, escaping and
delimiters?

2b) Should we aim to only support the QUOTE NONE DELIMITER NONE
ESCAPE NONE case,
useful to the real-life scenario we've identified, that is, importing
raw log
lines into a single column, which could then be handed by a much
simpler and
probably faster version of CopyReadAttributesCSV(),
e.g. named CopyReadAttributesUnquotedUnDelimited() or
maybe CopyReadAttributesRaw()?
(We also need to modify CopyReadLineText(), but seems we only need a
quote_none bool, to skip over the quoting code there, so don't think a
separate function is warranted there.)

I think ESCAPE NONE should be implied from QUOTE NONE, since the
default escape
character is the same as the quote character, so if there isn't any
quote character, then I think that would imply no escape character
either.

Can we think of any other valid, useful, realistic, and safe
combinations of
QUOTE NONE, DELIMITER NONE and ESCAPE NONE, that would be interesting
to support?

If not, then I think 2b looks more interesting, to reduce risk of
accidental
misuse, simpler implementation, and since it also should allow importing
raw log files faster, thanks to the reduced complexity.

Off hand I can't think of a case other than 2b that would apply in the
real world, although others might like to chime in here. If we're
going to do that, let's find a shorter way to spell it. In fact, we
should do that even if we go with 2a.

At the very least you should not need to say ESCAPE NONE, since the
default is to have ESCAPE the same as QUOTE, so QUOTE NONE should imply
ESCAPE NONE.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#33Joel Jacobson
joel@compiler.org
In reply to: Andrew Dunstan (#32)
Re: Should CSV parsing be stricter about mid-field quotes?

On Wed, Oct 9, 2024, at 14:45, Andrew Dunstan wrote:

On 2024-10-09 We 8:00 AM, Andrew Dunstan wrote:

On 2024-10-08 Tu 3:25 AM, Joel Jacobson wrote:

2b) Should we aim to only support the QUOTE NONE DELIMITER NONE
ESCAPE NONE case,
useful to the real-life scenario we've identified, that is, importing
raw log
lines into a single column,

...

Off hand I can't think of a case other than 2b that would apply in the
real world, although others might like to chime in here. If we're
going to do that, let's find a shorter way to spell it. In fact, we
should do that even if we go with 2a.

I agree a shorter way to spell it would be nice.

But yet another new option would risk option creep IMO,
especially since we already have quite a lot of options
that can only be used for some formats and/or not together
with other options.

If there had been an easy way to hack this into the CSV mode, in a
clean and understandable way that wouldn't risk confusing users who are
actually importing real CSV files, that would seem OK to me, but it seems
difficult.

However, I feel it might be too much shoehorning to add this to the CSV mode.

Also, if there is no delimiter, then by definition there cannot be
any "separated" values, which is what the "S" in "CSV" stands for.

I think it would be nicest to introduce a new "raw" FORMAT
that clearly gets things right already at the top level,
instead of trying to customize any of the existing formats.

Arbitrary unstructured text files, such as some log files,
seem like a big enough class of files
that users are probably already importing into PostgreSQL
using the various hacks we've discussed.

So providing such users with a new FORMAT that precisely matches their use-case
seems like the simplest and most intuitive solution.

With a new format, we could clearly define its purpose in the docs:

Raw Format

The "raw" format is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. This format
is ideal when dealing with data that doesn't conform to a structured,
tabular format and lacks delimiters.

Key Characteristics:

- No Field Delimiters:
Each line is considered a complete value without any field separation.

- Single Column Requirement:
The COPY command must specify exactly one column when using the raw format.
Specifying multiple columns will result in an error.

- Literal Data Interpretation:
All characters are taken literally.
There is no special handling for quotes, backslashes, or escape sequences.

- No NULL Distinction:
Empty lines are imported as empty strings, not as NULL values.

- No Headers or Metadata:
The format does not support header rows or end-of-data markers;
every line is treated as data.

Notes:

- Error Handling:
An error will occur if you use the raw format without specifying exactly
one column or if the table has multiple columns and no column list is provided.

- Data Preservation:
All characters, including whitespace and special characters, are preserved
exactly as they appear in the file.
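The key characteristics above could be modeled, hypothetically and in Python rather than the eventual C implementation, roughly as:

```python
def copy_from_raw(data: str):
    """Hypothetical model of the proposed raw format: every line is one
    single-column row, all characters (quotes, backslashes, \\.) are
    literal data, and empty lines become empty strings, never NULL."""
    rows = data.split('\n')
    # A trailing newline terminates the final record rather than
    # starting an empty one.  (Other line endings ignored for brevity.)
    if rows and rows[-1] == '':
        rows.pop()
    return [(line,) for line in rows]
```

For example, a log fragment containing an empty line and a bare \. would come back as three literal rows, with no end-of-data or NULL interpretation applied.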

/Joel

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#33)
Re: Should CSV parsing be stricter about mid-field quotes?

"Joel Jacobson" <joel@compiler.org> writes:

I think it would be nicest to introduce a new "raw" FORMAT,
that clearly get things right already at the top-level,
instead of trying to customize any of the existing formats.

FWIW, I like this idea. It makes it much clearer how to have a
fast parsing path.

regards, tom lane

#35Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#34)
Re: Should CSV parsing be stricter about mid-field quotes?

On 2024-10-09 We 11:58 AM, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

I think it would be nicest to introduce a new "raw" FORMAT,
that clearly get things right already at the top-level,
instead of trying to customize any of the existing formats.

FWIW, I like this idea. It makes it much clearer how to have a
fast parsing path.

WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}? We don't seem
to have an existing keyword that we could sanely reuse here.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#36Daniel Verite
daniel@manitou-mail.org
In reply to: Joel Jacobson (#33)
Re: Should CSV parsing be stricter about mid-field quotes?

Joel Jacobson wrote:

- No Headers or Metadata:

It's not clear why it's necessary to disable the HEADER option
for this format?

The format does not support header rows or end-of-data markers;
every line is treated as data.

With COPY FROM STDIN followed by inline data in a script,
an end-of-data marker is required. That's also a problem
for CSV except it's mitigated by the possibility of quoting
(using "\." instead of \.)

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#37Joel Jacobson
joel@compiler.org
In reply to: Daniel Verite (#36)
1 attachment(s)
Re: Should CSV parsing be stricter about mid-field quotes?

On Thu, Oct 10, 2024, at 10:37, Daniel Verite wrote:

Joel Jacobson wrote:

- No Headers or Metadata:

It's not clear why it's necessary to disable the HEADER option
for this format?

It's not necessary, no; I just couldn't see a use-case,
since I only thought about the COPY FROM case,
where one would be dealing with unstructured, undelimited
text files, such as log files coming from some other system,
which I've never seen have header rows.

However, thanks to your question, I see how a user
might want to use the raw format to export a text
column "as is" using COPY TO, in which case it would
be useful to use HEADER and then HEADER MATCH
for COPY FROM.

I therefore think the HEADER option should be supported
for the new raw format.

The format does not support header rows or end-of-data markers;
every line is treated as data.

With COPY FROM STDIN followed by inline data in a script,
an end-of-data marker is required. That's also a problem
for CSV except it's mitigated by the possibility of quoting
(using "\." instead of \.)

Right. As long as \. won't have any special meaning for the raw format
except in the STDIN case, that seems fine.
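
To spell out the STDIN case, here is a minimal inline-COPY sketch
(hypothetical table name). As Daniel notes, in CSV mode a data line
that is literally \. can be protected by quoting it, which the raw
format would not be able to do:

```
COPY notes (txt) FROM STDIN WITH (FORMAT csv);
first line
"\."
last line
\.
```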

I haven't looked at that part of the code in detail yet though.

As a preparatory step, I think we should replace the two
"binary" and "csv_mode" bool fields in CopyFormatOptions
with a new "format" field of a new CopyFormat enum type.

If we instead introduced yet another bool field, I think the
code would become too cluttered.

Best regards,

Joel

Attachments:

v1-0001-Replace-binary-flags-binary-and-csv_mode-with-format.patch (application/octet-stream)
From d621bb2fd0d0d6079ec16a92f5c925fd9fa0baaa Mon Sep 17 00:00:00 2001
From: Joel Jakobsson <github@compiler.org>
Date: Thu, 10 Oct 2024 08:33:33 +0200
Subject: [PATCH] Replace binary flags `binary` and `csv_mode` with `format`
 enum.

---
 src/backend/commands/copy.c          | 44 ++++++++++++++--------------
 src/backend/commands/copyfrom.c      | 10 +++----
 src/backend/commands/copyfromparse.c | 34 ++++++++++-----------
 src/backend/commands/copyto.c        | 20 ++++++-------
 src/include/commands/copy.h          | 13 ++++++--
 5 files changed, 65 insertions(+), 56 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 03eb7a4eba..2021300308 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -493,11 +493,11 @@ ProcessCopyOptions(ParseState *pstate,
 				errorConflictingDefElem(defel, pstate);
 			format_specified = true;
 			if (strcmp(fmt, "text") == 0)
-				 /* default format */ ;
+				opts_out->format = COPY_FORMAT_TEXT;
 			else if (strcmp(fmt, "csv") == 0)
-				opts_out->csv_mode = true;
+				opts_out->format = COPY_FORMAT_CSV;
 			else if (strcmp(fmt, "binary") == 0)
-				opts_out->binary = true;
+				opts_out->format = COPY_FORMAT_BINARY;
 			else
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -650,36 +650,36 @@ ProcessCopyOptions(ParseState *pstate,
 	 * Check for incompatible options (must do these two before inserting
 	 * defaults)
 	 */
-	if (opts_out->binary && opts_out->delim)
+	if (opts_out->format == COPY_FORMAT_BINARY && opts_out->delim)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 		/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
 				 errmsg("cannot specify %s in BINARY mode", "DELIMITER")));
 
-	if (opts_out->binary && opts_out->null_print)
+	if (opts_out->format == COPY_FORMAT_BINARY && opts_out->null_print)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("cannot specify %s in BINARY mode", "NULL")));
 
-	if (opts_out->binary && opts_out->default_print)
+	if (opts_out->format == COPY_FORMAT_BINARY && opts_out->default_print)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("cannot specify %s in BINARY mode", "DEFAULT")));
 
-	if (opts_out->binary && opts_out->on_error != COPY_ON_ERROR_STOP)
+	if (opts_out->format == COPY_FORMAT_BINARY && opts_out->on_error != COPY_ON_ERROR_STOP)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("only ON_ERROR STOP is allowed in BINARY mode")));
 
 	/* Set defaults for omitted options */
 	if (!opts_out->delim)
-		opts_out->delim = opts_out->csv_mode ? "," : "\t";
+		opts_out->delim = opts_out->format == COPY_FORMAT_CSV ? "," : "\t";
 
 	if (!opts_out->null_print)
-		opts_out->null_print = opts_out->csv_mode ? "" : "\\N";
+		opts_out->null_print = opts_out->format == COPY_FORMAT_CSV ? "" : "\\N";
 	opts_out->null_print_len = strlen(opts_out->null_print);
 
-	if (opts_out->csv_mode)
+	if (opts_out->format == COPY_FORMAT_CSV)
 	{
 		if (!opts_out->quote)
 			opts_out->quote = "\"";
@@ -727,7 +727,7 @@ ProcessCopyOptions(ParseState *pstate,
 	 * future-proofing.  Likewise we disallow all digits though only octal
 	 * digits are actually dangerous.
 	 */
-	if (!opts_out->csv_mode &&
+	if (opts_out->format != COPY_FORMAT_CSV &&
 		strchr("\\.abcdefghijklmnopqrstuvwxyz0123456789",
 			   opts_out->delim[0]) != NULL)
 		ereport(ERROR,
@@ -735,43 +735,43 @@ ProcessCopyOptions(ParseState *pstate,
 				 errmsg("COPY delimiter cannot be \"%s\"", opts_out->delim)));
 
 	/* Check header */
-	if (opts_out->binary && opts_out->header_line)
+	if (opts_out->format == COPY_FORMAT_BINARY && opts_out->header_line)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
 				 errmsg("cannot specify %s in BINARY mode", "HEADER")));
 
 	/* Check quote */
-	if (!opts_out->csv_mode && opts_out->quote != NULL)
+	if (opts_out->format != COPY_FORMAT_CSV && opts_out->quote != NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
 				 errmsg("COPY %s requires CSV mode", "QUOTE")));
 
-	if (opts_out->csv_mode && strlen(opts_out->quote) != 1)
+	if (opts_out->format == COPY_FORMAT_CSV && strlen(opts_out->quote) != 1)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("COPY quote must be a single one-byte character")));
 
-	if (opts_out->csv_mode && opts_out->delim[0] == opts_out->quote[0])
+	if (opts_out->format == COPY_FORMAT_CSV && opts_out->delim[0] == opts_out->quote[0])
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("COPY delimiter and quote must be different")));
 
 	/* Check escape */
-	if (!opts_out->csv_mode && opts_out->escape != NULL)
+	if (opts_out->format != COPY_FORMAT_CSV && opts_out->escape != NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
 				 errmsg("COPY %s requires CSV mode", "ESCAPE")));
 
-	if (opts_out->csv_mode && strlen(opts_out->escape) != 1)
+	if (opts_out->format == COPY_FORMAT_CSV && strlen(opts_out->escape) != 1)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("COPY escape must be a single one-byte character")));
 
 	/* Check force_quote */
-	if (!opts_out->csv_mode && (opts_out->force_quote || opts_out->force_quote_all))
+	if (opts_out->format != COPY_FORMAT_CSV && (opts_out->force_quote || opts_out->force_quote_all))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
@@ -785,7 +785,7 @@ ProcessCopyOptions(ParseState *pstate,
 						"COPY FROM")));
 
 	/* Check force_notnull */
-	if (!opts_out->csv_mode && opts_out->force_notnull != NIL)
+	if (opts_out->format != COPY_FORMAT_CSV && opts_out->force_notnull != NIL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
@@ -799,7 +799,7 @@ ProcessCopyOptions(ParseState *pstate,
 						"COPY TO")));
 
 	/* Check force_null */
-	if (!opts_out->csv_mode && opts_out->force_null != NIL)
+	if (opts_out->format != COPY_FORMAT_CSV && opts_out->force_null != NIL)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		/*- translator: %s is the name of a COPY option, e.g. ON_ERROR */
@@ -822,7 +822,7 @@ ProcessCopyOptions(ParseState *pstate,
 						"NULL")));
 
 	/* Don't allow the CSV quote char to appear in the null string. */
-	if (opts_out->csv_mode &&
+	if (opts_out->format == COPY_FORMAT_CSV &&
 		strchr(opts_out->null_print, opts_out->quote[0]) != NULL)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -858,7 +858,7 @@ ProcessCopyOptions(ParseState *pstate,
 							"DEFAULT")));
 
 		/* Don't allow the CSV quote char to appear in the default string. */
-		if (opts_out->csv_mode &&
+		if (opts_out->format == COPY_FORMAT_CSV &&
 			strchr(opts_out->default_print, opts_out->quote[0]) != NULL)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 9139a40785..46a662465a 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -122,7 +122,7 @@ CopyFromErrorCallback(void *arg)
 				   cstate->cur_relname);
 		return;
 	}
-	if (cstate->opts.binary)
+	if (cstate->opts.format == COPY_FORMAT_BINARY)
 	{
 		/* can't usefully display the data */
 		if (cstate->cur_attname)
@@ -1576,7 +1576,7 @@ BeginCopyFrom(ParseState *pstate,
 	cstate->raw_buf_index = cstate->raw_buf_len = 0;
 	cstate->raw_reached_eof = false;
 
-	if (!cstate->opts.binary)
+	if (cstate->opts.format != COPY_FORMAT_BINARY)
 	{
 		/*
 		 * If encoding conversion is needed, we need another buffer to hold
@@ -1627,7 +1627,7 @@ BeginCopyFrom(ParseState *pstate,
 			continue;
 
 		/* Fetch the input function and typioparam info */
-		if (cstate->opts.binary)
+		if (cstate->opts.format == COPY_FORMAT_BINARY)
 			getTypeBinaryInputInfo(att->atttypid,
 								   &in_func_oid, &typioparams[attnum - 1]);
 		else
@@ -1768,14 +1768,14 @@ BeginCopyFrom(ParseState *pstate,
 
 	pgstat_progress_update_multi_param(3, progress_cols, progress_vals);
 
-	if (cstate->opts.binary)
+	if (cstate->opts.format == COPY_FORMAT_BINARY)
 	{
 		/* Read and verify binary header */
 		ReceiveCopyBinaryHeader(cstate);
 	}
 
 	/* create workspace for CopyReadAttributes results */
-	if (!cstate->opts.binary)
+	if (cstate->opts.format != COPY_FORMAT_BINARY)
 	{
 		AttrNumber	attr_count = list_length(cstate->attnumlist);
 
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 654fecb1b1..50bb4b7750 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -163,7 +163,7 @@ ReceiveCopyBegin(CopyFromState cstate)
 {
 	StringInfoData buf;
 	int			natts = list_length(cstate->attnumlist);
-	int16		format = (cstate->opts.binary ? 1 : 0);
+	int16		format = (cstate->opts.format == COPY_FORMAT_BINARY ? 1 : 0);
 	int			i;
 
 	pq_beginmessage(&buf, PqMsg_CopyInResponse);
@@ -749,7 +749,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
 	bool		done;
 
 	/* only available for text or csv input */
-	Assert(!cstate->opts.binary);
+	Assert(cstate->opts.format != COPY_FORMAT_BINARY);
 
 	/* on input check that the header line is correct if needed */
 	if (cstate->cur_lineno == 0 && cstate->opts.header_line)
@@ -766,7 +766,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
 		{
 			int			fldnum;
 
-			if (cstate->opts.csv_mode)
+			if (cstate->opts.format == COPY_FORMAT_CSV)
 				fldct = CopyReadAttributesCSV(cstate);
 			else
 				fldct = CopyReadAttributesText(cstate);
@@ -821,7 +821,7 @@ NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
 		return false;
 
 	/* Parse the line into de-escaped field values */
-	if (cstate->opts.csv_mode)
+	if (cstate->opts.format == COPY_FORMAT_CSV)
 		fldct = CopyReadAttributesCSV(cstate);
 	else
 		fldct = CopyReadAttributesText(cstate);
@@ -865,7 +865,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 	MemSet(nulls, true, num_phys_attrs * sizeof(bool));
 	MemSet(cstate->defaults, false, num_phys_attrs * sizeof(bool));
 
-	if (!cstate->opts.binary)
+	if (cstate->opts.format != COPY_FORMAT_BINARY)
 	{
 		char	  **field_strings;
 		ListCell   *cur;
@@ -906,7 +906,7 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext,
 				continue;
 			}
 
-			if (cstate->opts.csv_mode)
+			if (cstate->opts.format == COPY_FORMAT_CSV)
 			{
 				if (string == NULL &&
 					cstate->opts.force_notnull_flags[m])
@@ -1179,7 +1179,7 @@ CopyReadLineText(CopyFromState cstate)
 	char		quotec = '\0';
 	char		escapec = '\0';
 
-	if (cstate->opts.csv_mode)
+	if (cstate->opts.format == COPY_FORMAT_CSV)
 	{
 		quotec = cstate->opts.quote[0];
 		escapec = cstate->opts.escape[0];
@@ -1256,7 +1256,7 @@ CopyReadLineText(CopyFromState cstate)
 		prev_raw_ptr = input_buf_ptr;
 		c = copy_input_buf[input_buf_ptr++];
 
-		if (cstate->opts.csv_mode)
+		if (cstate->opts.format == COPY_FORMAT_CSV)
 		{
 			/*
 			 * If character is '\r', we may need to look ahead below.  Force
@@ -1295,7 +1295,7 @@ CopyReadLineText(CopyFromState cstate)
 		}
 
 		/* Process \r */
-		if (c == '\r' && (!cstate->opts.csv_mode || !in_quote))
+		if (c == '\r' && (cstate->opts.format != COPY_FORMAT_CSV || !in_quote))
 		{
 			/* Check for \r\n on first line, _and_ handle \r\n. */
 			if (cstate->eol_type == EOL_UNKNOWN ||
@@ -1323,10 +1323,10 @@ CopyReadLineText(CopyFromState cstate)
 					if (cstate->eol_type == EOL_CRNL)
 						ereport(ERROR,
 								(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-								 !cstate->opts.csv_mode ?
+								 cstate->opts.format != COPY_FORMAT_CSV ?
 								 errmsg("literal carriage return found in data") :
 								 errmsg("unquoted carriage return found in data"),
-								 !cstate->opts.csv_mode ?
+								 cstate->opts.format != COPY_FORMAT_CSV ?
 								 errhint("Use \"\\r\" to represent carriage return.") :
 								 errhint("Use quoted CSV field to represent carriage return.")));
 
@@ -1340,10 +1340,10 @@ CopyReadLineText(CopyFromState cstate)
 			else if (cstate->eol_type == EOL_NL)
 				ereport(ERROR,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-						 !cstate->opts.csv_mode ?
+						 cstate->opts.format != COPY_FORMAT_CSV ?
 						 errmsg("literal carriage return found in data") :
 						 errmsg("unquoted carriage return found in data"),
-						 !cstate->opts.csv_mode ?
+						 cstate->opts.format != COPY_FORMAT_CSV ?
 						 errhint("Use \"\\r\" to represent carriage return.") :
 						 errhint("Use quoted CSV field to represent carriage return.")));
 			/* If reach here, we have found the line terminator */
@@ -1351,15 +1351,15 @@ CopyReadLineText(CopyFromState cstate)
 		}
 
 		/* Process \n */
-		if (c == '\n' && (!cstate->opts.csv_mode || !in_quote))
+		if (c == '\n' && (cstate->opts.format != COPY_FORMAT_CSV || !in_quote))
 		{
 			if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
 				ereport(ERROR,
 						(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
-						 !cstate->opts.csv_mode ?
+						 cstate->opts.format != COPY_FORMAT_CSV ?
 						 errmsg("literal newline found in data") :
 						 errmsg("unquoted newline found in data"),
-						 !cstate->opts.csv_mode ?
+						 cstate->opts.format != COPY_FORMAT_CSV ?
 						 errhint("Use \"\\n\" to represent newline.") :
 						 errhint("Use quoted CSV field to represent newline.")));
 			cstate->eol_type = EOL_NL;	/* in case not set yet */
@@ -1371,7 +1371,7 @@ CopyReadLineText(CopyFromState cstate)
 		 * Process backslash, except in CSV mode where backslash is a normal
 		 * character.
 		 */
-		if (c == '\\' && !cstate->opts.csv_mode)
+		if (c == '\\' && cstate->opts.format != COPY_FORMAT_CSV)
 		{
 			char		c2;
 
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 463083e645..78531ae846 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -134,7 +134,7 @@ SendCopyBegin(CopyToState cstate)
 {
 	StringInfoData buf;
 	int			natts = list_length(cstate->attnumlist);
-	int16		format = (cstate->opts.binary ? 1 : 0);
+	int16		format = (cstate->opts.format == COPY_FORMAT_BINARY ? 1 : 0);
 	int			i;
 
 	pq_beginmessage(&buf, PqMsg_CopyOutResponse);
@@ -191,7 +191,7 @@ CopySendEndOfRow(CopyToState cstate)
 	switch (cstate->copy_dest)
 	{
 		case COPY_FILE:
-			if (!cstate->opts.binary)
+			if (cstate->opts.format != COPY_FORMAT_BINARY)
 			{
 				/* Default line termination depends on platform */
 #ifndef WIN32
@@ -236,7 +236,7 @@ CopySendEndOfRow(CopyToState cstate)
 			break;
 		case COPY_FRONTEND:
 			/* The FE/BE protocol uses \n as newline for all platforms */
-			if (!cstate->opts.binary)
+			if (cstate->opts.format != COPY_FORMAT_BINARY)
 				CopySendChar(cstate, '\n');
 
 			/* Dump the accumulated row as one CopyData message */
@@ -771,7 +771,7 @@ DoCopyTo(CopyToState cstate)
 		bool		isvarlena;
 		Form_pg_attribute attr = TupleDescAttr(tupDesc, attnum - 1);
 
-		if (cstate->opts.binary)
+		if (cstate->opts.format == COPY_FORMAT_BINARY)
 			getTypeBinaryOutputInfo(attr->atttypid,
 									&out_func_oid,
 									&isvarlena);
@@ -792,7 +792,7 @@ DoCopyTo(CopyToState cstate)
 											   "COPY TO",
 											   ALLOCSET_DEFAULT_SIZES);
 
-	if (cstate->opts.binary)
+	if (cstate->opts.format == COPY_FORMAT_BINARY)
 	{
 		/* Generate header for a binary copy */
 		int32		tmp;
@@ -833,7 +833,7 @@ DoCopyTo(CopyToState cstate)
 
 				colname = NameStr(TupleDescAttr(tupDesc, attnum - 1)->attname);
 
-				if (cstate->opts.csv_mode)
+				if (cstate->opts.format == COPY_FORMAT_CSV)
 					CopyAttributeOutCSV(cstate, colname, false);
 				else
 					CopyAttributeOutText(cstate, colname);
@@ -880,7 +880,7 @@ DoCopyTo(CopyToState cstate)
 		processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
 	}
 
-	if (cstate->opts.binary)
+	if (cstate->opts.format == COPY_FORMAT_BINARY)
 	{
 		/* Generate trailer for a binary copy */
 		CopySendInt16(cstate, -1);
@@ -908,7 +908,7 @@ CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)
 	MemoryContextReset(cstate->rowcontext);
 	oldcontext = MemoryContextSwitchTo(cstate->rowcontext);
 
-	if (cstate->opts.binary)
+	if (cstate->opts.format == COPY_FORMAT_BINARY)
 	{
 		/* Binary per-tuple header */
 		CopySendInt16(cstate, list_length(cstate->attnumlist));
@@ -917,7 +917,7 @@ CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)
 	/* Make sure the tuple is fully deconstructed */
 	slot_getallattrs(slot);
 
-	if (!cstate->opts.binary)
+	if (cstate->opts.format != COPY_FORMAT_BINARY)
 	{
 		bool		need_delim = false;
 
@@ -937,7 +937,7 @@ CopyOneRowTo(CopyToState cstate, TupleTableSlot *slot)
 			{
 				string = OutputFunctionCall(&out_functions[attnum - 1],
 											value);
-				if (cstate->opts.csv_mode)
+				if (cstate->opts.format == COPY_FORMAT_CSV)
 					CopyAttributeOutCSV(cstate, string,
 										cstate->opts.force_quote_flags[attnum - 1]);
 				else
diff --git a/src/include/commands/copy.h b/src/include/commands/copy.h
index 6f64d97fdd..4b4079db95 100644
--- a/src/include/commands/copy.h
+++ b/src/include/commands/copy.h
@@ -51,6 +51,16 @@ typedef enum CopyLogVerbosityChoice
 	COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */
 } CopyLogVerbosityChoice;
 
+/*
+ * Represents the format of the COPY operation.
+ */
+typedef enum CopyFormat
+{
+	COPY_FORMAT_TEXT,
+	COPY_FORMAT_BINARY,
+	COPY_FORMAT_CSV
+} CopyFormat;
+
 /*
  * A struct to hold COPY options, in a parsed form. All of these are related
  * to formatting, except for 'freeze', which doesn't really belong here, but
@@ -61,9 +71,8 @@ typedef struct CopyFormatOptions
 	/* parameters from the COPY command */
 	int			file_encoding;	/* file or remote side's character encoding,
 								 * -1 if not specified */
-	bool		binary;			/* binary format? */
+	CopyFormat	format;			/* format of the COPY operation */
 	bool		freeze;			/* freeze rows on loading? */
-	bool		csv_mode;		/* Comma Separated Value format? */
 	CopyHeaderChoice header_line;	/* header line? */
 	char	   *null_print;		/* NULL marker string (server encoding!) */
 	int			null_print_len; /* length of same */
-- 
2.45.1

#38Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#37)
Re: Should CSV parsing be stricter about mid-field quotes?

On Fri, Oct 11, 2024, at 15:04, Joel Jacobson wrote:

On Thu, Oct 10, 2024, at 10:37, Daniel Verite wrote:

Joel Jacobson wrote:

- No Headers or Metadata:

It's not clear why it's necessary to disable the HEADER option
for this format?

It's not necessary, no; I just couldn't see a use case,
since I only thought about the COPY FROM case,
where one would be dealing with unstructured, undelimited
text files, such as log files coming from some other system,
which I've never seen have header rows.

However, thanks to your question, I see how a user
might want to use the raw format to export a text
column "as is" using COPY TO, in which case it would
be useful to use HEADER and then HEADER MATCH
for COPY FROM.

I therefore think the HEADER option should be supported
for the new raw format.

The format does not support header rows or end-of-data markers;
every line is treated as data.

With COPY FROM STDIN followed by inline data in a script,
an end-of-data marker is required. That's also a problem
for CSV except it's mitigated by the possibility of quoting
(using "\." instead of \.)

Right. As long as \. won't have any special meaning for the raw format
except in the STDIN case, that seems fine.

I haven't looked at that part of the code in detail yet though.

As a preparatory step, I think we should replace the two
"binary" and "csv_mode" bool fields in CopyFormatOptions
with a new "format" field of a new CopyFormat enum type.

If we instead introduced yet another bool field, I think the
code would become too cluttered.

I'm starting a new thread for this with a more suitable subject.

/Joel