New "raw" COPY format

Started by Joel Jacobsonover 1 year ago46 messageshackers

joel@compiler.org

over 1 year ago

Hi hackers,

This thread is about implementing a new "raw" COPY format.

This idea came up in a different thread [1]/messages/by-id/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f@app.fastmail.com, moved here.

[1]: /messages/by-id/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f@app.fastmail.com

The main use-case for the raw format, is when needing to import arbitrary
unstructured text files, such as log files, into a single text column
of a table.

The name "raw" is just a working title. Andrew had some other good name ideas:

WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}?

Below is the draft of its description, sent previously [1]/messages/by-id/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f@app.fastmail.com,
adjusted thanks to feedback from Daniel Verite, who made me realize the
HEADER option should be made available also for this format.

--- START OF DESCRIPTION ---

Raw Format

The "raw" format is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. This format
is ideal when dealing with data that doesn't conform to a structured,
tabular format and lacks delimiters.

Key Characteristics:

- No Field Delimiters:
Each line is considered a complete value without any field separation.

- Single Column Requirement:
The COPY command must specify exactly one column when using the raw format.
Specifying multiple columns will result in an error.

- Literal Data Interpretation:
All characters are taken literally.
There is no special handling for quotes, backslashes, or escape sequences.

- No NULL Distinction:
Empty lines are imported as empty strings, not as NULL values.

Notes:

- Error Handling:
An error will occur if you use the raw format without specifying exactly
one column or if the table has multiple columns and no column list is
provided.

- Data Preservation:
All characters, including whitespace and special characters, are preserved
exactly as they appear in the file.

--- END OF DESCRIPTION ---

After having studied the code that will be affected,
I feel that before making any changes, I would like to try to improve
ProcessCopyOptions, in terms of readability and maintainability, first.

This seems possible by just reorganize it a bit.

It is actually already organized quite nicely, where the code is mostly
organized per-option, but not always, as it sometimes is spread across
different sections.

It seems possible to organize even more of it per-option,
which would make it easier to reason about each option separately.

This seems possible by organizing the checks per option,
under a single if-branch per option, and moving the setting
of defaults per option (when applicable) to the corresponding
else-branch.

This would also avoid setting defaults for options that are not applicable
for a given format, and instead let their initial NULL value remain untouched,
rather than setting unnecessary defaults.

Some of the checks depend on multiple options in an interdependent way,
not belonging to a specific option more than another. I think such checks
would be nice to place at the end under a separate section.

I also think it would be more readable to use the existing bool variables
named [option]_specified, to determine if an option has been set,
rather than relying on the option's default enum value to evaluate to false.

The attached patch implements the above ideas.

I think with these changes, it would be easier to hack on new and existing
copy options and formats.

/Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#1)

Re: New "raw" COPY format

On Fri, Oct 11, 2024, at 22:29, Joel Jacobson wrote:

Hi hackers,

This thread is about implementing a new "raw" COPY format.

...

The attached patch implements the above ideas.

I think with these changes, it would be easier to hack on new and existing
copy options and formats.

/Joel

Attachments:
* v1-0001-Replace-binary-flags-binary-and-csv_mode-with-format.patch
* v1-0002-Reorganize-ProcessCopyOptions-for-clarity-and-consis.patch

Ops, I see I failed to use the correct way to check if
opts_out->force_notnull or opts_out->force_null
have been set, that is using != NIL.

However, thanks to not just blindly copy/pasting this code,
I see I actually fixed a bug in HEAD, by also checking
opts_out->force_notnull_all or opts_out->force_null_all,
which HEAD currently fails to do:

joel=# copy t to '/tmp/t.csv' (format text, FORCE_NOT_NULL (c1));
ERROR: COPY FORCE_NOT_NULL requires CSV mode
joel=# copy t to '/tmp/t.csv' (format text, FORCE_NOT_NULL *);
COPY 0
joel=# copy t to '/tmp/t.csv' (format text, FORCE_NULL (c1));
ERROR: COPY FORCE_NULL requires CSV mode
joel=# copy t to '/tmp/t.csv' (format text, FORCE_NULL *);
COPY 0

Fixed in new version:

joel=# copy t to '/tmp/t.csv' (format text, FORCE_NOT_NULL *);
ERROR: COPY FORCE_NOT_NULL requires CSV mode
joel=# copy t to '/tmp/t.csv' (format text, FORCE_NULL *);
ERROR: COPY FORCE_NULL requires CSV mode

/Joel

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Joel Jacobson (#2)

Re: New "raw" COPY format

On Sat, Oct 12, 2024 at 5:02 AM Joel Jacobson <joel@compiler.org> wrote:

On Fri, Oct 11, 2024, at 22:29, Joel Jacobson wrote:

Hi hackers,

This thread is about implementing a new "raw" COPY format.

...

The attached patch implements the above ideas.

I think with these changes, it would be easier to hack on new and existing
copy options and formats.

/Joel

Attachments:
* v1-0001-Replace-binary-flags-binary-and-csv_mode-with-format.patch
* v1-0002-Reorganize-ProcessCopyOptions-for-clarity-and-consis.patch

Ops, I see I failed to use the correct way to check if
opts_out->force_notnull or opts_out->force_null
have been set, that is using != NIL.

However, thanks to not just blindly copy/pasting this code,
I see I actually fixed a bug in HEAD, by also checking
opts_out->force_notnull_all or opts_out->force_null_all,
which HEAD currently fails to do:

git version 2.34.1
cannot do `git apply`

trying:
patch -p1 < patch -p1 <
$PATCHES/v2-0001-Replace-binary-flags-binary-and-csv_mode-with-format.patch
patch -p1 < $PATCHES/v2-0002-Reorganize-ProcessCopyOptions-for-clarity-and-consis.patch
After that, I still cannot apply.

typedef enum CopyFormat
{
COPY_FORMAT_TEXT,
COPY_FORMAT_BINARY,
COPY_FORMAT_CSV
} CopyFormat;
the last element should add a comma.

CopyFormat should add to
src/tools/pgindent/typedefs.list

+        /* Assert options have been set (defaults applied if not specified) */
+        Assert(opts_out->delim);
+        Assert(opts_out->null_print);
+
+        /* Don't allow the delimiter to appear in the NULL or DEFAULT
strings */
+
+        if (strchr(opts_out->null_print, opts_out->delim[0]) != NULL)

+        Assert(opts_out->delim);
+        Assert(opts_out->quote);
+        Assert(opts_out->null_print);
+
+        if (opts_out->delim[0] == opts_out->quote[0])
these Asserts, no need? Without it, if conditions are not met, it will
still segfault.

there is no sql example, like
copy the_table from :'filename' (format raw);

in the patch.
I thought you were going to implement something like that.

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: jian he (#3)

Re: New "raw" COPY format

On Sat, Oct 12, 2024, at 02:48, jian he wrote:

git version 2.34.1
cannot do `git apply`

Sorry about that, fixed.

typedef enum CopyFormat
{
COPY_FORMAT_TEXT,
COPY_FORMAT_BINARY,
COPY_FORMAT_CSV
} CopyFormat;

Thanks, fixed.

CopyFormat should add to
src/tools/pgindent/typedefs.list

Thanks, fixed.

+        /* Assert options have been set (defaults applied if not specified) */
+        Assert(opts_out->delim);
+        Assert(opts_out->null_print);
+
+        /* Don't allow the delimiter to appear in the NULL or DEFAULT
strings */
+
+        if (strchr(opts_out->null_print, opts_out->delim[0]) != NULL)

+        Assert(opts_out->delim);
+        Assert(opts_out->quote);
+        Assert(opts_out->null_print);
+
+        if (opts_out->delim[0] == opts_out->quote[0])
these Asserts, no need? Without it, if conditions are not met, it will
still segfault.

The asserts are only there to indicate that at this point in the code,
we can be certain the delim, quote and null_print have been set,
since the format is COPY_FORMAT_CSV, otherwise it would be a bug.

If you don't think they add any documentation value, I'm OK with removing them.

there is no sql example, like
copy the_table from :'filename' (format raw);

in the patch.
I thought you were going to implement something like that.

Sorry if that was unclear, yes, that's the plan, but as I wrote:

After having studied the code that will be affected,
I feel that before making any changes, I would like to try to improve
ProcessCopyOptions, in terms of readability and maintainability, first.

So, I just wanted to get some feedback first, if this reorganization of ProcessCopyOptions,
would be OK to do first, which I think is needed for it to be easily maintainable.

/Joel

Tatsuo Ishii

t-ishii@sra.co.jp

over 1 year ago

In reply to: Joel Jacobson (#1)

Re: New "raw" COPY format

Hi hackers,

This thread is about implementing a new "raw" COPY format.

This idea came up in a different thread [1], moved here.

[1] /messages/by-id/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f@app.fastmail.com

The main use-case for the raw format, is when needing to import arbitrary
unstructured text files, such as log files, into a single text column
of a table.

After copy imported the "unstructured text file" in "row" COPY format,
what the column type is? text? or bytea? If it's text, how do you
handle encoding conversion if the "unstructured text file" is encoded
in server side unsafe encoding such as SJIS?

All characters are taken literally.
There is no special handling for quotes, backslashes, or escape sequences.

If SJIS text is imported "literally" (i.e. no encoding conversion), it
should be rejected.

Best reagards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Tatsuo Ishii (#5)

Re: New "raw" COPY format

On Sun, Oct 13, 2024, at 11:52, Tatsuo Ishii wrote:

After copy imported the "unstructured text file" in "row" COPY format,
what the column type is? text? or bytea? If it's text, how do you
handle encoding conversion if the "unstructured text file" is encoded
in server side unsafe encoding such as SJIS?

All characters are taken literally.
There is no special handling for quotes, backslashes, or escape sequences.

If SJIS text is imported "literally" (i.e. no encoding conversion), it
should be rejected.

I think encoding conversion is still necessary,
and should work the same as for the COPY formats "text" and "csv".

/Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#6)

Re: New "raw" COPY format

On Sun, Oct 13, 2024, at 14:39, Joel Jacobson wrote:

On Sun, Oct 13, 2024, at 11:52, Tatsuo Ishii wrote:

After copy imported the "unstructured text file" in "row" COPY format,
what the column type is? text? or bytea? If it's text, how do you
handle encoding conversion if the "unstructured text file" is encoded
in server side unsafe encoding such as SJIS?

All characters are taken literally.
There is no special handling for quotes, backslashes, or escape sequences.

If SJIS text is imported "literally" (i.e. no encoding conversion), it
should be rejected.

I think encoding conversion is still necessary,
and should work the same as for the COPY formats "text" and "csv".

Attached is a first draft implementation of the new proposed COPY "raw" format.

The first two patches are just the bug fix in HEAD, reported separately:
https://commitfest.postgresql.org/50/5297/

* v4-0001-Fix-thinko-in-tests-for-COPY-options-force_not_null-.patch
The first patch fixes a thinko in tests for COPY options force_not_null and force_null.

* v4-0002-Fix-validation-of-FORCE_NOT_NULL-FORCE_NULL-for-all-.patch
The second patch fixes validation of FORCE_NOT_NULL/FORCE_NULL for all-columns case.

* v4-0003-Replace-binary-flags-binary-and-csv_mode-with-format.patch
The third patch introduces a new enum CopyFormat, with options for the three current formats.

* v4-0004-Reorganize-ProcessCopyOptions-for-clarity-and-consis.patch
The fourth patch reorganize ProcessCopyOptions for clarity and consistent option handling.

* v4-0005-Add-raw-COPY-format-support-for-unstructured-text-da.patch
Finally, the firth patch introduces the new "raw" COPY format.

Docs and tests updated.

The raw format currently goes through the same multiple stages,
as the text and CSV formats. I'm not sure what the best approach would be,
if we would want to create a special fast parsing path for this.

/Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#7)

Re: New "raw" COPY format

On Mon, Oct 14, 2024, at 10:07, Joel Jacobson wrote:

Attached is a first draft implementation of the new proposed COPY "raw" format.

The first two patches are just the bug fix in HEAD, reported separately:
https://commitfest.postgresql.org/50/5297/

I forgot about adding support for the old syntax format.
Fixed in new version. Only the fifth patch is updated.

Before, only the new syntax worked:
COPY .... (FORMAT raw);

Now this also works:
COPY ... RAW;

/Joel

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#8)

Re: New "raw" COPY format

On Mon, Oct 14, 2024, at 10:51, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 10:07, Joel Jacobson wrote:

Attached is a first draft implementation of the new proposed COPY "raw" format.

The first two patches are just the bug fix in HEAD, reported separately:
https://commitfest.postgresql.org/50/5297/

Rebase only.

/Joel

#10

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#9)

Re: New "raw" COPY format

On Mon, Oct 14, 2024, at 21:59, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 10:51, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 10:07, Joel Jacobson wrote:

Attached is a first draft implementation of the new proposed COPY "raw" format.

The first two patches are just the bug fix in HEAD, reported separately:
https://commitfest.postgresql.org/50/5297/

I noticed tests were failing in cfbot,
which surprised me since tests were passing locally,
but it was due to me not running the full test suite.

Sorry about the noise. I'm not running the full test suite,
with tap and `meson test --num-processes 32`,
so hopefully I won't cause cfbot failures as often any longer.

(The bug was due to an invalid assert; I wrongly assumed
Assert(opts_out->quote) would be sane inside the
if (opts_out->default_print) branch, but it is of course
wrong, since quote is NULL for the text format,
and the quote check was only performed if format was CSV,
so the assert was unnecessary and invalid).

I've also change the :filename I'm using in copy2.sql
for the raw format tests, to not be the same as in copy.sql,
since I guess this could cause problems with tests
running in parallell.

I've now split the reorganization of ProcessCopyOptions,
into separate easily small steps to make it easier to review.

Each step breaks out the validation of a COPY option,
into its own section, except for DELIMITER and NULL
that had to be changed together in a single commit.

/Joel

#11

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#10)

Re: New "raw" COPY format

On Tue, Oct 15, 2024, at 03:35, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 21:59, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 10:51, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 10:07, Joel Jacobson wrote:

Attached is a first draft implementation of the new proposed COPY "raw" format.

The first two patches are just the bug fix in HEAD, reported separately:
https://commitfest.postgresql.org/50/5297/

...

Sorry about the noise. I'm not running the full test suite,
with tap and `meson test --num-processes 32`,
so hopefully I won't cause cfbot failures as often any longer.

Ops, that should have said:
"Sorry about the noise. I'm *now* running the full test suite"

However, I see Windows still failed on copy2.sql,
and I think the reason could be the use of \qecho -n
to create files with inconsistent newline style, e.g.:

\o :filename
\qecho -n line1
\qecho -n '\n'
\qecho -n line2
\qecho -n '\r\n'
\o
COPY copy_raw_test_errors (col1) FROM :'filename' (FORMAT raw);

Maybe Windows automatically translates \n into \r\n, and vice versa?
If so, this would explain why this test failed on Windows.

Btw, anyone know if it's possible to download the "regression.diffs" file from a the Ci task?

I've downloaded all the crashlog, meason_log, testrun zip files from

https://cirrus-ci.com/task/4564405273231360

but none of these contained the "regression.diffs" mentioned here:

[02:09:42.431] # The differences that caused some tests to fail can be viewed in the file
"C:/cirrus/build/testrun/regress/regress/regression.diffs".

Anyhow, I think I've fixed the problem now, in a cross-platform safe way,
by shipping src/test/regress/data/newline*.data files:

newlines_cr.data
newlines_cr_lr.data
newlines_cr_lr_nolast.data
newlines_cr_nolast.data
newlines_lr.data
newlines_lr_nolast.data
newlines_mixed_1.data
newlines_mixed_2.data
newlines_mixed_3.data
newlines_mixed_4.data
newlines_mixed_5.data

These are then used in copy.sql and copy2.sql, e.g.:

copy.sql:

\set filename :abs_srcdir '/data/newlines_lr.data'
TRUNCATE copy_raw_test;
COPY copy_raw_test (col) FROM :'filename' (FORMAT raw);
SELECT col, col IS NULL FROM copy_raw_test ORDER BY id;

copy2.sql:

-- Test inconsistent newline style
\set filename :abs_srcdir '/data/newlines_mixed_1.data'
COPY copy_raw_test_errors (col1) FROM :'filename' (FORMAT raw);

Attaching new version. It's only patch 0016 that has been updated.

/Joel

#12

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#11)

Re: New "raw" COPY format

On Tue, Oct 15, 2024, at 09:54, Joel Jacobson wrote:

On Tue, Oct 15, 2024, at 03:35, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 21:59, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 10:51, Joel Jacobson wrote:

On Mon, Oct 14, 2024, at 10:07, Joel Jacobson wrote:

Attached is a first draft implementation of the new proposed COPY "raw" format.

The first two patches are just the bug fix in HEAD, reported separately:
https://commitfest.postgresql.org/50/5297/

...

Btw, anyone know if it's possible to download the "regression.diffs"
file from a the Ci task?

Thanks @Matthias for the help,
found it at https://api.cirrus-ci.com/v1/artifact/task/5938219148115968/testrun/build/testrun/regress/regress/regression.diffs

The Windows problem was due to a test that inserted a "\n" as a text column,
to test that it should be parsed as an extra newline.
I've removed that part now from the test, since it's covered by the other tests,
with the hard-coded files.

/Joel

#13

Jacob Champion

jacob.champion@enterprisedb.com

over 1 year ago

In reply to: Joel Jacobson (#12)

Re: New "raw" COPY format

Hi,

Idle thoughts from a design perspective -- feel free to ignore, since
I'm not the target audience for the feature:

- If the column data stored in Postgres contains newlines, it seems
like COPY TO won't work "correctly". Is that acceptable?
- RAW seems like an okay-ish label, but for something that's doing as
much magic end-of-line detection as this patch is, I'd personally
prefer SINGLE (as in, "single column").
- Speaking of magic end-of-line detection, can there be a way to turn
that off? Say, via DELIMITER?
- Generic DELIMITER support, for any single-byte separator at all,
might make a "single-column" format more generally applicable. But I
might be over-architecting. And it would make the COPY TO issue even
worse...

Thanks,
--Jacob

#14

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Jacob Champion (#13)

Re: New "raw" COPY format

On Tue, Oct 15, 2024, at 19:30, Jacob Champion wrote:

Hi,

Idle thoughts from a design perspective -- feel free to ignore, since
I'm not the target audience for the feature:

Many thanks for looking at this!

- If the column data stored in Postgres contains newlines, it seems
like COPY TO won't work "correctly". Is that acceptable?

That's an interesting edge-case to think about.

Rejecting such column data, to ensure all data that can be COPY TO'd,
can be loaded back using COPY FROM, preserving the data intact,
seems attractive because it protects users against unintentional mistakes,
and all the other COPY formats are able to preserve the data unchanged,
when doing a COPY FROM of a file created with COPY TO.

OTOH, if we think of COPY TO in this case as a way of `cat`-ing
text values, it might be more pragmatic to allow it.
With `cat`-ing I mean like if in Unix doing...
cat file1.txt file2.txt > file3.txt
...there is no way to reverse that operation,
that is, to reconstruct file1.txt and file2.txt from file3.txt.

However, I thinking rejecting such column data seems like the
better alternative, to ensure data exported with COPY TO
can always be imported back using COPY FROM,
for the same format. If text column data contains newlines,
users probably ought to be using the text or csv format instead.

- RAW seems like an okay-ish label, but for something that's doing as
much magic end-of-line detection as this patch is, I'd personally
prefer SINGLE (as in, "single column").

It's actually the same end-of-line detection as the text format
in copyfromparse.c's CopyReadLineText(), except the code
is simpler thanks to not having to deal with quotes or escapes.

It basically just learns the newline sequence based on the first
occurrence, and then require it to be the same throughout the file.

The same data files can be tested with the text format,
since they don't contain any escape, quote or delimiter characters.

Different newline styles detected automatically also by format text:

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_lr.data' (FORMAT text);
COPY 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr.data' (FORMAT text);
COPY 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_cr_lr.data' (FORMAT text);
COPY 2

The mixed newline style causes errors also for the text format:

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_1.data' (FORMAT text);
ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_2.data' (FORMAT text);
ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_3.data' (FORMAT text);
ERROR: literal carriage return found in data
HINT: Use "\r" to represent carriage return.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_4.data' (FORMAT text);
ERROR: literal carriage return found in data
HINT: Use "\r" to represent carriage return.
CONTEXT: COPY t, line 2

COPY t FROM '/home/joel/postgresql/src/test/regress/data/newlines_mixed_5.data' (FORMAT text);
ERROR: literal carriage return found in data
HINT: Use "\r" to represent carriage return.
CONTEXT: COPY t, line 2

I must confess, I didn't know about this newline detection before reading the
copyfromparse.c source code.

- Speaking of magic end-of-line detection, can there be a way to turn
that off? Say, via DELIMITER?
- Generic DELIMITER support, for any single-byte separator at all,
might make a "single-column" format more generally applicable. But I
might be over-architecting. And it would make the COPY TO issue even
worse...

That's an interesting idea that would provide more flexibility,
though, at the cost of complicating things by overloading the meaning
of DELIMITER.

If aiming to make this more generally applicable,
then at least DELIMITER would need to be multi-byte,
since otherwise the Windows case \r\n couldn't be specified.

But I feel COPY already has quite a lot of options, and I fear it's
quite complicated for users as it is.

What I found appealing with the idea of a new COPY format,
was that instead of overloading the existing options
with more complexity, a new format wouldn't need to affect
the existing options, and the new format could be explained
separately, without making things worse for users not
using this format.

/Joel

#15

jian he

jian.universality@gmail.com

over 1 year ago

In reply to: Joel Jacobson (#12)

Re: New "raw" COPY format

On Tue, Oct 15, 2024 at 8:50 PM Joel Jacobson <joel@compiler.org> wrote:

Hi.
I only checked 0001, 0002, 0003.
the raw format patch is v9-0016.
003-0016 is a lot of small patches, maybe you can consolidate it to
make the review more easier.

-COPY x to stdin (format TEXT, force_quote(a));
+COPY x to stdout (format TEXT, force_quote(a));
0001 make sense to me, i think generally we do "to stdout", "from stdin"

v9-0002-Fix-validation-of-FORCE_NOT_NULL-FORCE_NULL-for-all-.patch
looks good.

typedef enum CopyLogVerbosityChoice
{
COPY_LOG_VERBOSITY_SILENT = -1, /* logs none */
COPY_LOG_VERBOSITY_DEFAULT = 0, /* logs no additional messages. As this is
* the default, assign 0 */
COPY_LOG_VERBOSITY_VERBOSE, /* logs additional messages */
} CopyLogVerbosityChoice;
/*
* Represents the format of the COPY operation.
*/
typedef enum CopyFormat
{
COPY_FORMAT_TEXT,
COPY_FORMAT_BINARY,
COPY_FORMAT_CSV,
} CopyFormat;

BeginCopyTo
cstate = (CopyToStateData *) palloc0(sizeof(CopyToStateData));
ProcessCopyOptions(pstate, &cstate->opts, false /* is_from */ , options);

palloc0(sizeof(CopyToStateData)); makes the default format to COPY_FORMAT_TEXT.

I think you may need COPY_FORMAT_TEXT = 0, even though based on [1]https://stackoverflow.com/questions/6434105/are-default-enum-values-in-c-the-same-for-all-compilers,
it seems not required.

[1]: https://stackoverflow.com/questions/6434105/are-default-enum-values-in-c-the-same-for-all-compilers

#16

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: jian he (#15)

Re: New "raw" COPY format

On Wed, Oct 16, 2024, at 05:31, jian he wrote:

Hi.
I only checked 0001, 0002, 0003.
the raw format patch is v9-0016.
003-0016 is a lot of small patches, maybe you can consolidate it to
make the review more easier.

Thanks for reviewing.

OK, I've consolidated the v9 0003-0016 into a single patch.

(I submitted them as separate smaller patches since it might be difficult
otherwise to verify the correctness of the refactoring of ProcessCopyOptions().
So if needed, the smaller patches can be viewed in the previous email.
I've only squashed them in the attached patch set, except the setting
of COPY_FORMAT_TEXT = 0, see below.)

-COPY x to stdin (format TEXT, force_quote(a));
+COPY x to stdout (format TEXT, force_quote(a));
0001 make sense to me, i think generally we do "to stdout", "from stdin"
v9-0002-Fix-validation-of-FORCE_NOT_NULL-FORCE_NULL-for-all-.patch
looks good.

OK, cool.

typedef enum CopyFormat
{
COPY_FORMAT_TEXT,
COPY_FORMAT_BINARY,
COPY_FORMAT_CSV,
} CopyFormat;

...

I think you may need COPY_FORMAT_TEXT = 0, even though based on [1],
it seems not required.

OK, changed, and I agree, since this seems to be the style in copy.h,
even if not setting = 0 seems to be more popular in the codebase in general.

/Joel

#17

Jacob Champion

jacob.champion@enterprisedb.com

over 1 year ago

In reply to: Joel Jacobson (#14)

Re: New "raw" COPY format

On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel@compiler.org> wrote:

However, I thinking rejecting such column data seems like the
better alternative, to ensure data exported with COPY TO
can always be imported back using COPY FROM,
for the same format. If text column data contains newlines,
users probably ought to be using the text or csv format instead.

Yeah. I think _someone's_ going to have strong opinions one way or the
other, but that person is not me. And I assume a contents check during
COPY TO is going to have a noticeable performance impact...

- RAW seems like an okay-ish label, but for something that's doing as
much magic end-of-line detection as this patch is, I'd personally
prefer SINGLE (as in, "single column").

It's actually the same end-of-line detection as the text format
in copyfromparse.c's CopyReadLineText(), except the code
is simpler thanks to not having to deal with quotes or escapes.

Right, sorry, I hadn't meant to imply that you made it up. :D Just
that a "raw" format that is actually automagically detecting things
doesn't seem very "raw" to me, so I prefer the other name.

It basically just learns the newline sequence based on the first
occurrence, and then require it to be the same throughout the file.

A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme:
COPY FROM will see the "mixed" line endings and complain, even though
there's no ambiguity.

Maybe no one will run into that problem in practice? But if they did,
I think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".

- Speaking of magic end-of-line detection, can there be a way to turn
that off? Say, via DELIMITER?
- Generic DELIMITER support, for any single-byte separator at all,
might make a "single-column" format more generally applicable. But I
might be over-architecting. And it would make the COPY TO issue even
worse...

That's an interesting idea that would provide more flexibility,
though, at the cost of complicating things by overloading the meaning
of DELIMITER.

I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.

If aiming to make this more generally applicable,
then at least DELIMITER would need to be multi-byte,
since otherwise the Windows case \r\n couldn't be specified.

True.

What I found appealing with the idea of a new COPY format,
was that instead of overloading the existing options
with more complexity, a new format wouldn't need to affect
the existing options, and the new format could be explained
separately, without making things worse for users not
using this format.

I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
proposing that the other formats should change.

--Jacob

#18

Daniel Verite

daniel@manitou-mail.org

over 1 year ago

In reply to: Joel Jacobson (#14)

Re: New "raw" COPY format

Joel Jacobson wrote:

However, I thinking rejecting such column data seems like the
better alternative, to ensure data exported with COPY TO
can always be imported back using COPY FROM,
for the same format.

On the other hand, that might prevent cases where we
want to export, for instance, a valid json array:

copy (select json_agg(col) from table ) to 'file' RAW

This is a variant of the discussion in [1]/messages/by-id/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com where the OP does:

copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'

and he complains that both text and csv "break the JSON".
That discussion morphed into a proposed patch adding JSON
format to COPY, but RAW would work directly as the OP
expected.

That is, unless <query> happens to include JSON fields with LF/CRLF
in them, and the RAW format says this is an error condition.
In that case it's quite annoying to make it an error, rather than
simply let it pass.

[1]: /messages/by-id/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com
/messages/by-id/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#19

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Jacob Champion (#17)

Re: New "raw" COPY format

On Wed, Oct 16, 2024, at 18:04, Jacob Champion wrote:

A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme:
COPY FROM will see the "mixed" line endings and complain, even though
there's no ambiguity.

Yeah, that's quite an ugly limitation.

Maybe no one will run into that problem in practice? But if they did,
I think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".

That would be nice.

That's an interesting idea that would provide more flexibility,
though, at the cost of complicating things by overloading the meaning
of DELIMITER.

I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.

Yeah, I meant the docs, but that's probably fine,
we could just add <note> to DELIMITER.

What I found appealing with the idea of a new COPY format,
was that instead of overloading the existing options
with more complexity, a new format wouldn't need to affect
the existing options, and the new format could be explained
separately, without making things worse for users not
using this format.

I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
proposing that the other formats should change.

Right, I didn't think you did either, I meant overloading the existing
options, from a docs perspective.

But I agree it's probably fine if we just overload DELIMITER in the docs,
that should be possible to explain in a pedagogic way,
without causing confusion.

/Joel

#20

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Daniel Verite (#18)

Re: New "raw" COPY format

On Wed, Oct 16, 2024, at 18:34, Daniel Verite wrote:

Joel Jacobson wrote:

However, I thinking rejecting such column data seems like the
better alternative, to ensure data exported with COPY TO
can always be imported back using COPY FROM,
for the same format.

On the other hand, that might prevent cases where we
want to export, for instance, a valid json array:

copy (select json_agg(col) from table ) to 'file' RAW

This is a variant of the discussion in [1] where the OP does:

copy (select json_agg(row_to_json(t)) from <query> t) TO 'file'

and he complains that both text and csv "break the JSON".
That discussion morphed into a proposed patch adding JSON
format to COPY, but RAW would work directly as the OP
expected.

That is, unless <query> happens to include JSON fields with LF/CRLF
in them, and the RAW format says this is an error condition.
In that case it's quite annoying to make it an error, rather than
simply let it pass.

[1]
/messages/by-id/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com

Thanks for finding this related thread.

Very good example, that would be solved with RAW.

I've used a different hack myself many times to export text "as is",
which is to use COPY TO ... BINARY, and then to manually edit the
file using vim, stripping away the header and footer from the file. :D

But I don't see how JSON fields that come from PostgreSQL
could contain LF/CRLF though, could they?
Since LF/CR must be escaped inside a JSON field,
and when casting JSONB to text, there are no newlines
injected anywhere in between fields.
I can only see how a value of the legacy JSON type could
have newlines in it.

That doesn't matter though, this is still a very good example
on the need to export text "as is" from PostgreSQL to a file.

I like Jacob's idea of letting the user specify a DELIMITER
also for the RAW format, to override the automagical
newline detection. This would e.g. allow importing values
containing e.g. \r when the newline delimiter is \n,
which would otherwise be reject with an
"inconsistent newline style" error.

I think it would also be useful to allow specifying
DELIMITER NONE, which would allow importing an
entire text file, "as is", into a single column and single row.

To export a single column single row, the delimiter
wouldn't matter, since there wouldn't be any.
But DELIMITER NONE would be useful also for COPY TO,
if wanting to concatenate text "as is" without adding
newlines in between, to a file, similar to the
UNIX "cat" command.

Regarding the name for the format, I thought SINGLE
was nice before I read this message, but now since
the need for more rawness seems desirable,
I think I like RAW the most again, since if the
automagical newline detection can be overridden,
then it's really raw for real.

A final thought is to maybe consider just skipping
the automagical newline detection for RAW?

Instead of the automagical detection,
the default newline delimiter could be the OS default,
similar to how COPY TO works.

That way, it would almost always just work for most users,
as long as processing files within their OS,
and when not, they would just need to specify the DELIMITER.

/Joel

#21

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#20)

#22

Joel Jacobson

joel@compiler.org

over 1 year ago

In reply to: Joel Jacobson (#21)

#23