New "single" COPY format

Started by Joel Jacobsonover 1 year ago34 messageshackers
Jump to latest
#1Joel Jacobson
joel@compiler.org

Hi hackers,

Thread [1]/messages/by-id/c12516b1-77dc-4ad3-94a7-88527360aee0@app.fastmail.com renamed, since the format name has now been changed from 'raw' to
'single', as suggested by Andrew Dunstan and Jacob Champion.

[1]: /messages/by-id/c12516b1-77dc-4ad3-94a7-88527360aee0@app.fastmail.com

Recap: This is about adding support to import/export text-based formats such as
JSONL, or any unstructured text file, where wanting to import each line "as is"
into a single column, or wanting to export a single column to a text file.

Example importing the meson-logs/testlog.json file Meson generates
when building PostgreSQL, which is in JSONL format:

# create table meson_log (log_line jsonb);
# \copy meson_log from meson-logs/testlog.json (format single);
COPY 306
# select log_line->'name' name, log_line->'result' result from meson_log limit 3;
name | result
-----------------------------------------+--------
"postgresql:setup / tmp_install" | "OK"
"postgresql:setup / install_test_files" | "OK"
"postgresql:setup / initdb_cache" | "OK"
(3 rows)

Changes since v16:

* EOL handling now works the same as for 'text' and 'csv'.
In v16, we supported multi-byte delimiters to allow specifying
e.g. Windows EOL (\r\n), but this seemed unnecessary, if we just do what we do
for text/csv, that is, to auto-detect the EOL for COPY FROM, and use
the OS default EOL for COPY TO.
The DELIMITER option is therefore invalid for the 'single' format.
This is the biggest change in the code, between v16 and v18.
CopyReadLineRawText() has been renamed to CopyReadLineSingleText(),
and changed accordingly.

* A final EOL is now emitted to the last record in COPY TO.
So now it works just like 'text' and 'csv'.

* HEADER [ boolean | MATCH ] now supported
This is now again supported, as previously suggested by Daniel Verite,
possible thanks to the EOL handling.

* Docs updated.

Below is quoted directly from the copy.sgml, but in plaintext:

---
Single Format

This format option is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. It is
useful for data that does not conform to a structured, tabular format and
lacks delimiters.

In the single format, each line of the input or output is
considered a complete value without any field separation. There are no
field delimiters, and all characters are taken literally. There is no
special handling for quotes, backslashes, or escape sequences. All
characters, including whitespace and special characters, are preserved
exactly as they appear in the file. However, it's important to note that
the text is still interpreted according to the specified ENCODING
option or the current client encoding for input, and encoded using the
specified ENCODING or the current client encoding for output.

When using this format, the COPY command must specify exactly one column.
Specifying multiple columns will result in an error.
If the table has multiple columns and no column list is provided, an error
will occur.

The single format does not distinguish a NULL value from an empty string.
Empty lines are imported as empty strings, not as NULL values.

Encoding works the same as in the text and CSV formats.
---

On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:

I think it would be better to explain how to parse data in raw mode,
especially which steps in the pipeline we skip, in the comment at the
top of copyfromparse.c.

Good idea. I've explained it in the comment.

/Joel

Attachments:

v18-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patchapplication/octet-stream; name="=?UTF-8?Q?v18-0001-Introduce-CopyFormat-and-replace-csv=5Fmode-and-binar?= =?UTF-8?Q?y.patch?="Download+70-59
v18-0002-Add-COPY-format-single.patchapplication/octet-stream; name="=?UTF-8?Q?v18-0002-Add-COPY-format-single.patch?="Download+442-47
v18-0003-Reorganize-option-validations.patchapplication/octet-stream; name="=?UTF-8?Q?v18-0003-Reorganize-option-validations.patch?="Download+259-202
#2Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Joel Jacobson (#1)
Re: New "single" COPY format

Hi,

On Thu, Nov 7, 2024 at 8:16 AM Joel Jacobson <joel@compiler.org> wrote:

Hi hackers,

Thread [1] renamed, since the format name has now been changed from 'raw' to
'single', as suggested by Andrew Dunstan and Jacob Champion.

[1] /messages/by-id/c12516b1-77dc-4ad3-94a7-88527360aee0@app.fastmail.com

Recap: This is about adding support to import/export text-based formats such as
JSONL, or any unstructured text file, where wanting to import each line "as is"
into a single column, or wanting to export a single column to a text file.

Example importing the meson-logs/testlog.json file Meson generates
when building PostgreSQL, which is in JSONL format:

# create table meson_log (log_line jsonb);
# \copy meson_log from meson-logs/testlog.json (format single);
COPY 306
# select log_line->'name' name, log_line->'result' result from meson_log limit 3;
name | result
-----------------------------------------+--------
"postgresql:setup / tmp_install" | "OK"
"postgresql:setup / install_test_files" | "OK"
"postgresql:setup / initdb_cache" | "OK"
(3 rows)

Changes since v16:

* EOL handling now works the same as for 'text' and 'csv'.
In v16, we supported multi-byte delimiters to allow specifying
e.g. Windows EOL (\r\n), but this seemed unnecessary, if we just do what we do
for text/csv, that is, to auto-detect the EOL for COPY FROM, and use
the OS default EOL for COPY TO.
The DELIMITER option is therefore invalid for the 'single' format.
This is the biggest change in the code, between v16 and v18.
CopyReadLineRawText() has been renamed to CopyReadLineSingleText(),
and changed accordingly.

In earlier versions, we supported loading the whole file into a single
tuple. Is there any reason that it doesn't support it in v18? I think
if it's useful we can improve it in a separate patch.

* A final EOL is now emitted to the last record in COPY TO.
So now it works just like 'text' and 'csv'.

+1

* HEADER [ boolean | MATCH ] now supported
This is now again supported, as previously suggested by Daniel Verite,
possible thanks to the EOL handling.

It makes sense to support it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#3Joel Jacobson
joel@compiler.org
In reply to: Masahiko Sawada (#2)
Re: New "single" COPY format

On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote:

In earlier versions, we supported loading the whole file into a single
tuple. Is there any reason that it doesn't support it in v18? I think
if it's useful we can improve it in a separate patch.

Not sure how useful it is, since we already have pg_read_file().

Also, I think it's out of scope for the 'single' format, since I think it should
be about about processing text line by line, in the same way 'csv' and 'text'
work.

The implementation depended on the delimiter option, where the default was
no delimiter, which then read the entire file, and since we don't have
the delimiter option anymore, that approach won't work.

From a docs perspective, it would get quite ugly and confusing, since we would
need to rephrase sentences like the below, since they would then not always
be true:

"each line is treated as a single field"

"each line of the input or output is considered a
complete value without any field separation"

* A final EOL is now emitted to the last record in COPY TO.
So now it works just like 'text' and 'csv'.

+1

* HEADER [ boolean | MATCH ] now supported
This is now again supported, as previously suggested by Daniel Verite,
possible thanks to the EOL handling.

It makes sense to support it.

/Joel

#4David G. Johnston
david.g.johnston@gmail.com
In reply to: Joel Jacobson (#3)
Re: New "single" COPY format

On Thursday, November 7, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote:

In earlier versions, we supported loading the whole file into a single
tuple. Is there any reason that it doesn't support it in v18? I think
if it's useful we can improve it in a separate patch.

Not sure how useful it is, since we already have pg_read_file().

Being forced to have the file server-readable, non-stdin, destroys quite a
bit of the usefulness of pg_read_file.

If we want clients to be able to pass the effort here to the server, copy
is definitely the most useful way to do so.

I’d be concerned choosing “single” given this future possibility. I do
agree that such an enhancement would be best done in its own patch.

David J.

#5Joel Jacobson
joel@compiler.org
In reply to: David G. Johnston (#4)
Re: New "single" COPY format

On Fri, Nov 8, 2024, at 07:14, David G. Johnston wrote:

On Thursday, November 7, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Fri, Nov 8, 2024, at 00:13, Masahiko Sawada wrote:

In earlier versions, we supported loading the whole file into a single
tuple. Is there any reason that it doesn't support it in v18? I think
if it's useful we can improve it in a separate patch.

Not sure how useful it is, since we already have pg_read_file().

Being forced to have the file server-readable, non-stdin, destroys
quite a bit of the usefulness of pg_read_file.

If we want clients to be able to pass the effort here to the server,
copy is definitely the most useful way to do so.

Right, good point, I agree.

I’d be concerned choosing “single” given this future possibility. I do
agree that such an enhancement would be best done in its own patch.

OK, sounds good to do it in its own patch.

If the name "single" doesn't work for this reason, I see at least
two alternatives:

1) Keep "single" as format name, and let it only be concerned about line by line
processing, and introduce a different format for entire file processing,
in its own patch.

2) Some other format name ("raw"?) that allows such future enhancement to
be done within the same format, in its own patch.

Other ideas?

/Joel

#6Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#5)
Re: New "single" COPY format

On Fri, Nov 8, 2024, at 08:42, Joel Jacobson wrote:

I’d be concerned choosing “single” given this future possibility. I do
agree that such an enhancement would be best done in its own patch.

OK, sounds good to do it in its own patch.

If the name "single" doesn't work for this reason, I see at least
two alternatives:

1) Keep "single" as format name, and let it only be concerned about line by line
processing, and introduce a different format for entire file processing,
in its own patch.

2) Some other format name ("raw"?) that allows such future enhancement to
be done within the same format, in its own patch.

Other ideas?

Sorry for noise, "raw" is of course not an option given how v18 works,
due to the auto-magic EOL detection, same as in "text" and "csv",
as pointed out by others earlier in the thread.

How about "single_column"?

Then, a future patch could implement a "single_value" format,
to process an entire file or value. Such format could then also
support binary data, by detecting if the column type is "bytea".

/Joel

#7Aleksander Alekseev
aleksander@timescale.com
In reply to: Joel Jacobson (#1)
Re: New "single" COPY format

Hi Joel,

Recap: This is about adding support to import/export text-based formats such as
JSONL, or any unstructured text file, where wanting to import each line "as is"
into a single column, or wanting to export a single column to a text file.

Sorry for being late for the discussion.

I disagree with the idea of adding a new format name for this. Mostly
because it is *not* a new format and pretending that it is will be
just a poor and/or confusing user interface.

IMO it should be 'text' we already have with special options e.g.
DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
and column delimiters (and no NULLs designations, and what else I
forgot) then your text file just contains one tuple per line.

--
Best regards,
Aleksander Alekseev

#8Aleksander Alekseev
aleksander@timescale.com
In reply to: Aleksander Alekseev (#7)
Re: New "single" COPY format

Hi,

Recap: This is about adding support to import/export text-based formats such as
JSONL, or any unstructured text file, where wanting to import each line "as is"
into a single column, or wanting to export a single column to a text file.

Sorry for being late for the discussion.

I disagree with the idea of adding a new format name for this. Mostly
because it is *not* a new format and pretending that it is will be
just a poor and/or confusing user interface.

IMO it should be 'text' we already have with special options e.g.
DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
and column delimiters (and no NULLs designations, and what else I
forgot) then your text file just contains one tuple per line.

Personally I wouldn't mind a special syntax such as LINES AS IS or
maybe COPY AS IS for convenience. Perhaps we should discuss it
separately though as a syntax sugar for a long list of options we
already support.

--
Best regards,
Aleksander Alekseev

#9Joel Jacobson
joel@compiler.org
In reply to: Aleksander Alekseev (#8)
Re: New "single" COPY format

On Fri, Nov 8, 2024, at 12:25, Aleksander Alekseev wrote:

Sorry for being late for the discussion.

No worries, better late than never, thanks for chiming in.

I disagree with the idea of adding a new format name for this. Mostly
because it is *not* a new format and pretending that it is will be
just a poor and/or confusing user interface.

IMO it should be 'text' we already have with special options e.g.
DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
and column delimiters (and no NULLs designations, and what else I
forgot) then your text file just contains one tuple per line.

Personally I wouldn't mind a special syntax such as LINES AS IS or
maybe COPY AS IS for convenience. Perhaps we should discuss it
separately though as a syntax sugar for a long list of options we
already support.

From an implementation perspective, I agree with you that this could
be handled by tweaking the existing code for the 'text' and 'csv' formats,
although with a performance penalty for the existing formats.

But from a user-perspective, the implementation is of course irrelevant,
then what I think is important, is that the format should have an intuitive name,
where the default behaviour should match a typical file in the format,
as closely as possible.

For this reason, the 'text' format is unfortunately a poor name,
since it gives the impression it's a generic format for text files,
which it's not, it's a PostgreSQL-specific format, where "\." on a
single line has special meaning, and other defaults such as \N
are also PostgreSQL-specific, and needs to be overriden, if dealing
with a non-PostgreSQL specific text file.
Users who fail to understand these details, risks being surprised.

In contrast, the 'csv' format, works quite as expected.

So for this reason, I think a new format, is a good idea, not only
because it makes it much clearer how to have a fast parsing path
in the implementation, but also because it will increase the chances
users will get things right, when dealing with non-PostgreSQL specific
text files, such as JSONL and log files.

Sure, from an implementation perspective, we could have separate
specialized functions, to allow for fast parsing paths, even if
just overloading the existing options, but that would be a bit awkward I think.

The "DELIMITER AS NULL ESCAPE AS NULL" idea was proposed in the old thread
"Should CSV parsing be stricter about mid-field quotes?" [1] /messages/by-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com

[1]:  /messages/by-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com

However, some of us came to the conclusion that it would be
better to introduce a new format, for reasons explained below,
quoted from the old thread [1] /messages/by-id/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com:

On Wed, Oct 9, 2024, at 18:14, Andrew Dunstan wrote:

On 2024-10-09 We 11:58 AM, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

I think it would be nicest to introduce a new "raw" FORMAT,
that clearly get things right already at the top-level,
instead of trying to customize any of the existing formats.

FWIW, I like this idea. It makes it much clearer how to have a
fast parsing path.

WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}? We don't seem
to have an existing keyword that we could sanely reuse here.

To add to that, I think there is value of a new format, from a user-friendiness
perspective, by keeping the existing formats and their options intact,
and instead just add a new format, with a clear name, with well-defined
semantics, explained in the docs under its own section, to avoid cluttering
the documentation further, where users would need to assemble various options,
and understand their intricate details, in order to get things right.

/Joel

#10Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#1)
Re: New "single" COPY format

On Thu, Nov 7, 2024, at 17:15, Joel Jacobson wrote:

Attachments:
* v18-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
* v18-0002-Add-COPY-format-single.patch
* v18-0003-Reorganize-option-validations.patch

I want to bring up a potential problem with v18, which has been discussed
before:

On Tue, Oct 15, 2024, at 19:30, Jacob Champion wrote:

Hi,

Idle thoughts from a design perspective -- feel free to ignore, since
I'm not the target audience for the feature:

- If the column data stored in Postgres contains newlines, it seems
like COPY TO won't work "correctly". Is that acceptable?

Example:

CREATE TABLE log (line TEXT);
INSERT INTO log (line) VALUES (E'foo\nbar'), ('baz');
COPY log TO '/tmp/log.txt' (FORMAT 'single');
COPY 2

% cat log.txt
foo
bar
baz

TRUNCATE log;
COPY log FROM '/tmp/log.txt' (FORMAT 'single');
SELECT * FROM log;
line
------
foo
bar
baz
(3 rows)

It would be nice if we could come up with an approach, that didn't introduce
this footgun, while at the same time being convenient for the common use cases.

Ideas?

/Joel

#11Daniel Verite
daniel@manitou-mail.org
In reply to: Aleksander Alekseev (#7)
Re: New "single" COPY format

Aleksander Alekseev wrote:

IMO it should be 'text' we already have with special options e.g.
DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
and column delimiters (and no NULLs designations, and what else I
forgot) then your text file just contains one tuple per line.

+1 for the idea that accepting "no delimiter" and "no escape"
as a valid combination for the text format seems better
than adding a new format.
However inviting "NULL" into that syntax when it has nothing to do
with the SQL "NULL" does not look like a good idea.
Maybe DELIMITER '' ESCAPE '', or DELIMITER NONE ESCAPE NONE.

Besides, "single" as a format name does not sound right.
Generally the name for a text format designates a set
of characteristics meaning that certain combinations of
characters have specific behaviors.
Sometimes "plain" is used in the context of text formats
to indicate that no character is special ("plain" is also the
default subtype of "text" in MIME types).

"single" as proposed is to be understood as "single-column",
which is a consequence of the lack of a field delimiter, but
not an intrinsic characteristic of the format.
If COPY accepted fixed-length fields, it could be in a
no-delimiter no-escape mode and still handle multiple
columns, in opposition to what "single" suggests.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#12Joel Jacobson
joel@compiler.org
In reply to: Daniel Verite (#11)
Re: New "single" COPY format

On Fri, Nov 8, 2024, at 20:44, Daniel Verite wrote:

Aleksander Alekseev wrote:

IMO it should be 'text' we already have with special options e.g.
DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
and column delimiters (and no NULLs designations, and what else I
forgot) then your text file just contains one tuple per line.

+1 for the idea that accepting "no delimiter" and "no escape"
as a valid combination for the text format seems better
than adding a new format.
However inviting "NULL" into that syntax when it has nothing to do
with the SQL "NULL" does not look like a good idea.
Maybe DELIMITER '' ESCAPE '', or DELIMITER NONE ESCAPE NONE.

Okay, let's see if we can solve all problems I see with
overloading the 'text' format:

1. Text files containing \. in the middle of the file
% cat /tmp/test.txt
foo
\.
bar

How do we import such a file?
Is it not supported?
Or another option to turn off the special meaning of \.?
Both seems like bad ideas to me, maybe there is a nice idea I fail to see?

2. NULL option is \N for 'text', so to import a plain text
file safely, where \N lines should not be converted to NULL,
users would need to also specify NULL '', which seems
like a footgun to me.

3. What should happen if specifying DELIMITER NONE, and:
- specifying a column list with more than one column?
- not also specifying ESCAPE NONE?

4. What should happen if specifying ESCAPE NONE, and
- specifying a column list with more than one column?

5. What about the isomorphism violation, I brought up in my
previous email, that is, the non-bijective mapping and irreversibility,
for records with embedded newlines?
This is also a problem with a separate format,
but I wonder what you think about the problem,
if it's acceptable, or needs to be solved, and if so,
if you see any solutions.

Besides, "single" as a format name does not sound right.
Generally the name for a text format designates a set
of characteristics meaning that certain combinations of
characters have specific behaviors.
Sometimes "plain" is used in the context of text formats
to indicate that no character is special ("plain" is also the
default subtype of "text" in MIME types).

"single" as proposed is to be understood as "single-column",
which is a consequence of the lack of a field delimiter, but
not an intrinsic characteristic of the format.
If COPY accepted fixed-length fields, it could be in a
no-delimiter no-escape mode and still handle multiple
columns, in opposition to what "single" suggests.

Good points. I agree "plain" is a better name.

/Joel

#13David G. Johnston
david.g.johnston@gmail.com
In reply to: Joel Jacobson (#12)
Re: New "single" COPY format

On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:

1. Text files containing \. in the middle of the file
% cat /tmp/test.txt
foo
\.
bar

Or another option to turn off the special meaning of \.?

This does seem like an orthogonal option worth considering.

Besides, "single" as a format name does not sound right.
Generally the name for a text format designates a set
of characteristics meaning that certain combinations of
characters have specific behaviors.
Sometimes "plain" is used in the context of text formats
to indicate that no character is special ("plain" is also the
default subtype of "text" in MIME types).

"single" as proposed is to be understood as "single-column",
which is a consequence of the lack of a field delimiter, but
not an intrinsic characteristic of the format.
If COPY accepted fixed-length fields, it could be in a
no-delimiter no-escape mode and still handle multiple
columns, in opposition to what "single" suggests.

Good points. I agree "plain" is a better name.

I'm on board with a new named format that selects the desired defaults
instead of requiring the user to know and change them all manually.

This seems to me like a "list" format. Implying each row is a list entry.
Since we have tables the concept of list would likewise reasonably imply a
single column.

Since newlines are special, i.e., record delimiters, "plain" thus would
remain misleading. It could be used for a case where the entire file is
loaded into a new row, single column.

David J.

#14Joel Jacobson
joel@compiler.org
In reply to: David G. Johnston (#13)
Re: New "single" COPY format

On Fri, Nov 8, 2024, at 22:47, David G. Johnston wrote:

On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:

1. Text files containing \. in the middle of the file
% cat /tmp/test.txt
foo
\.
bar

Or another option to turn off the special meaning of \.?

This does seem like an orthogonal option worth considering.

I agree; if we want to integrate this into 'text', it's an option worth considering.

I'm on board with a new named format that selects the desired defaults
instead of requiring the user to know and change them all manually.

This seems to me like a "list" format. Implying each row is a list
entry. Since we have tables the concept of list would likewise
reasonably imply a single column.

Since newlines are special, i.e., record delimiters, "plain" thus would
remain misleading. It could be used for a case where the entire file
is loaded into a new row, single column.

I think 'list' is the best proposal I've heard so far.
New patch attached, only change since v18 is the renaming.

There is one remaining important issue though:

Fields that contain newline characters, cause an irreversibility problem.

It feels wrong to leave this as a potential pitfall for users.

Here's a draft of an idea I'm considering (not yet implemented):

- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.

- Handling newlines for other types:
For any other types, we would need to scan the string for newline characters.
If a newline is encountered, it would, by default, result in an error when
using the list format, unless:

- Optional quoting mechanism:
If the QUOTE option is specified, we can allow newlines within fields by
quoting the entire line. Any quote characters within the field would be
handled by doubling them, similar to CSV escaping rules.

/Joel

Attachments:

v19-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patchapplication/octet-stream; name="=?UTF-8?Q?v19-0001-Introduce-CopyFormat-and-replace-csv=5Fmode-and-binar?= =?UTF-8?Q?y.patch?="Download+70-59
v19-0002-Add-COPY-format-list.patchapplication/octet-stream; name="=?UTF-8?Q?v19-0002-Add-COPY-format-list.patch?="Download+442-47
v19-0003-Reorganize-option-validations.patchapplication/octet-stream; name="=?UTF-8?Q?v19-0003-Reorganize-option-validations.patch?="Download+259-202
#15Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#14)
Re: New "single" COPY format

On Sat, Nov 9, 2024, at 08:07, Joel Jacobson wrote:

Here's a draft of an idea I'm considering (not yet implemented):

I realize the last part about optional quoting is unnecessary,
since if quoting is desired, users could just use the 'csv' format.

Revised draft of the idea (not yet implemented):

- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.

- Handling newlines for other types:
For any other types, we would need to scan the string for newline characters.
If a newline is encountered, it results in an error.

This brings up the question on what to offer users wanting to export text
values containing newlines.

To address this need, I think that's out of scope for the 'list' format,
and is better handled by a separate 'value' format:
- Such a format would be specialized for exporting a value "as is" to a file,
or importing en entire file as a single value.
- Such a value could be a physical single-column single-row,
but could also be constructed using e.g. string_agg().
- The 'value' format could also easily support import/export
binary data (bytea), to e.g. allow importing/exporting images, etc.

Dimensionality perspective on formats:

2D formats: 'text', 'csv', 'binary' (tabular formats)
1D format: 'list' (single-column)
0D format: 'value' (single-column, single-row)

/Joel

#16David G. Johnston
david.g.johnston@gmail.com
In reply to: Joel Jacobson (#15)
Re: New "single" COPY format

On Saturday, November 9, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Sat, Nov 9, 2024, at 08:07, Joel Jacobson wrote:

Here's a draft of an idea I'm considering (not yet implemented):

I realize the last part about optional quoting is unnecessary,
since if quoting is desired, users could just use the 'csv' format.

Revised draft of the idea (not yet implemented):

- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation
cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.

- Handling newlines for other types:
For any other types, we would need to scan the string for newline
characters.
If a newline is encountered, it results in an error.

Make sense to me.

David J.

#17David G. Johnston
david.g.johnston@gmail.com
In reply to: Joel Jacobson (#14)
Re: New "single" COPY format

On Saturday, November 9, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Fri, Nov 8, 2024, at 22:47, David G. Johnston wrote:

On Fri, Nov 8, 2024 at 2:20 PM Joel Jacobson <joel@compiler.org> wrote:

1. Text files containing \. in the middle of the file
% cat /tmp/test.txt
foo
\.
bar

Or another option to turn off the special meaning of \.?

This does seem like an orthogonal option worth considering.

I agree; if we want to integrate this into 'text', it's an option worth
considering.

PostgreSQL cannot store the NUL byte. Would that be an option for the
record separator. Default to new line but accept NUL if one needs to
input/output lists containing newlines. Or whatever character the user
believes is not part of their data - tab probably being a popular option.

David J.

#18Joel Jacobson
joel@compiler.org
In reply to: David G. Johnston (#16)
Re: New "single" COPY format

On Sat, Nov 9, 2024, at 15:13, David G. Johnston wrote:

On Saturday, November 9, 2024, Joel Jacobson <joel@compiler.org> wrote:

On Sat, Nov 9, 2024, at 08:07, Joel Jacobson wrote:

Here's a draft of an idea I'm considering (not yet implemented):

I realize the last part about optional quoting is unnecessary,
since if quoting is desired, users could just use the 'csv' format.

Revised draft of the idea (not yet implemented):

- Fast path for newline-free types:
For the list of built-in types where we know the ::text representation cannot
contain newlines, we take the fast path in NextCopyFromRawFields(),
pointing cstate->raw_fields[0] directly to cstate->line_buf.data.

Ops, the above should of course have said:
"we take the fast path in CopyAttributeOutList()".

- Handling newlines for other types:
For any other types, we would need to scan the string for newline characters.
If a newline is encountered, it results in an error.

Make sense to me.

Cool. I've drafted a new patch on this approach.
The list of newline-free built-in types is not exhaustive, yet.

/Joel

Attachments:

v20-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patchapplication/octet-stream; name="=?UTF-8?Q?v20-0001-Introduce-CopyFormat-and-replace-csv=5Fmode-and-binar?= =?UTF-8?Q?y.patch?="Download+70-59
v20-0002-Add-COPY-format-list.patchapplication/octet-stream; name="=?UTF-8?Q?v20-0002-Add-COPY-format-list.patch?="Download+522-48
v20-0003-Reorganize-option-validations.patchapplication/octet-stream; name="=?UTF-8?Q?v20-0003-Reorganize-option-validations.patch?="Download+259-202
#19Joel Jacobson
joel@compiler.org
In reply to: David G. Johnston (#17)
Re: New "single" COPY format

On Sat, Nov 9, 2024, at 15:28, David G. Johnston wrote:

PostgreSQL cannot store the NUL byte. Would that be an option for the
record separator. Default to new line but accept NUL if one needs to
input/output lists containing newlines. Or whatever character the user
believes is not part of their data - tab probably being a popular
option.

Clever idea, could work, but using NUL bytes in text files feels a bit
unorthodox, and I can imagine surprising results in other systems having to deal
with such files.

I have no idea how useful such file format would be, but some googling suggest
it's a trick that's used out there, so I won't exclude the idea entirely, just
feels like a type of hack where it's difficult to foresee the consequences
of allowing it.

/Joel

#20David G. Johnston
david.g.johnston@gmail.com
In reply to: Joel Jacobson (#19)
Re: New "single" COPY format

On Sat, Nov 9, 2024 at 1:48 PM Joel Jacobson <joel@compiler.org> wrote:

On Sat, Nov 9, 2024, at 15:28, David G. Johnston wrote:

PostgreSQL cannot store the NUL byte. Would that be an option for the
record separator. Default to new line but accept NUL if one needs to
input/output lists containing newlines. Or whatever character the user
believes is not part of their data - tab probably being a popular
option.

Clever idea, could work, but using NUL bytes in text files feels a bit
unorthodox, and I can imagine surprising results in other systems having
to deal
with such files.

Yeah. I was inspired by xargs and find but for a permanent file it is a
bit different.

David J.

#21jian he
jian.universality@gmail.com
In reply to: Joel Jacobson (#18)
#22David G. Johnston
david.g.johnston@gmail.com
In reply to: jian he (#21)
#23Joel Jacobson
joel@compiler.org
In reply to: jian he (#21)
#24Joel Jacobson
joel@compiler.org
In reply to: David G. Johnston (#22)
#25Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#24)
#26Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#25)
#27Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#26)
#28Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#27)
#29jian he
jian.universality@gmail.com
In reply to: Joel Jacobson (#28)
#30Joel Jacobson
joel@compiler.org
In reply to: jian he (#29)
#31Andrew Dunstan
andrew@dunslane.net
In reply to: Joel Jacobson (#28)
#32Joel Jacobson
joel@compiler.org
In reply to: Andrew Dunstan (#31)
#33David G. Johnston
david.g.johnston@gmail.com
In reply to: Joel Jacobson (#32)
#34Andrew Dunstan
andrew@dunslane.net
In reply to: David G. Johnston (#33)