Allow COPY's 'text' format to output a header
This patch adds the capability to use the HEADER feature with the "text"
format of the COPY command. The patch includes the related update to
documentation and an additional regression test for this feature.
Currently you can only add a header line (which lists the column names)
when exporting with COPY to the CSV format, but I much prefer using the
default "text" format. This feature is also currently listed on the to-do
list (https://wiki.postgresql.org/wiki/Todo#COPY) where it seems to have
been requested some years ago.
Hopefully I've done everything correctly and the patch is acceptable enough
to be considered for application.
Simon Muller
Attachments:
0001-Allow-COPY-s-text-format-to-output-a-header.patchapplication/octet-stream; name=0001-Allow-COPY-s-text-format-to-output-a-header.patchDownload+14-6
Hi Simon,
On 5/13/18 6:18 PM, Simon Muller wrote:
This patch adds the capability to use the HEADER feature with the "text"
format of the COPY command. The patch includes the related update to
documentation and an additional regression test for this feature.Currently you can only add a header line (which lists the column names)
when exporting with COPY to the CSV format, but I much prefer using the
default "text" format. This feature is also currently listed on the
to-do list (https://wiki.postgresql.org/wiki/Todo#COPY) where it seems
to have been requested some years ago.Hopefully I've done everything correctly and the patch is acceptable
enough to be considered for application.
This patch makes sense to me and looks reasonable.
We're in the middle of a feature freeze that will last most of the
summer, so be sure to enter your patch into the next commitfest so it
can be considered when the freeze is over.
https://commitfest.postgresql.org/18/
Regards,
--
-David
david@pgmasters.net
On Sun, May 13, 2018 at 07:01:00PM -0400, David Steele wrote:
This patch makes sense to me and looks reasonable.
One "potential" problem is if a relation has a full set of column which
allows the input of text-like data: if the header has been added with
COPY TO, and that the user forgets to add again the header option with
COPY FROM, then an extra row will be generated but there is the same
problem with CSV format :)
One comment I have about the patch is that there is no test for
COPY FROM with an output file which has a header. In this case if
HEADER is true then the file can be loaded. If HEADER is wrong, an
error should normally be raised because of the format (well, let's
discard the case of the relation with text-only columns). So the tests
could be extended a bit even for CSV.
We're in the middle of a feature freeze that will last most of the
summer, so be sure to enter your patch into the next commitfest so it
can be considered when the freeze is over.
Yes, you will need to be patient a couple of months here.
--
Michael
Okay, I've added this to the next commitfest at
https://commitfest.postgresql.org/18/1629/.
Thanks both Michael and David for the feedback so far.
On 14 May 2018 at 02:37, Michael Paquier <michael@paquier.xyz> wrote:
Show quoted text
On Sun, May 13, 2018 at 07:01:00PM -0400, David Steele wrote:
This patch makes sense to me and looks reasonable.
One "potential" problem is if a relation has a full set of column which
allows the input of text-like data: if the header has been added with
COPY TO, and that the user forgets to add again the header option with
COPY FROM, then an extra row will be generated but there is the same
problem with CSV format :)One comment I have about the patch is that there is no test for
COPY FROM with an output file which has a header. In this case if
HEADER is true then the file can be loaded. If HEADER is wrong, an
error should normally be raised because of the format (well, let's
discard the case of the relation with text-only columns). So the tests
could be extended a bit even for CSV.We're in the middle of a feature freeze that will last most of the
summer, so be sure to enter your patch into the next commitfest so it
can be considered when the freeze is over.Yes, you will need to be patient a couple of months here.
--
Michael
On 05/14/2018 02:35 AM, Simon Muller wrote:
Okay, I've added this to the next commitfest at
https://commitfest.postgresql.org/18/1629/.Thanks both Michael and David for the feedback so far.
(Please don't top-post on PostgreSQL lists.)
I'm not necessarily opposed to this, but I'm not certain about the use
case either. The original request seemed to stem from a false impression
that CSV mode can't produce or consume tab-delimited files. But it can,
and in fact it's saner for almost all uses than text format. Postgres'
text format is really intended for Postgres' use. CSV format is more
appropriate for dealing with external programs, whether the delimiter be
a tab or a comma.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 14, 2018 at 09:37:07AM +0900, Michael Paquier wrote:
On Sun, May 13, 2018 at 07:01:00PM -0400, David Steele wrote:
This patch makes sense to me and looks reasonable.
One "potential" problem is if a relation has a full set of column which
allows the input of text-like data: if the header has been added with
COPY TO, and that the user forgets to add again the header option with
COPY FROM, then an extra row will be generated but there is the same
problem with CSV format :)
Yeah, I wonder if that can be addressed.
I wonder if there was a way to let COPY FROM detect or ignore headers
as appropriate and rather than cause silently result in headers being
added as data.
Maybe a blank line after the header line could prevent this confusion?
Garick
On Mon, May 14, 2018 at 11:44 AM, Garick Hamlin <ghamlin@isc.upenn.edu>
wrote:
I wonder if there was a way to let COPY FROM detect or ignore headers
as appropriate and rather than cause silently result in headers being
added as data.
Not reliably
Maybe a blank line after the header line could prevent this confusion
No
+1 for allowing HEADER with FORMAT text. It doesn't interfere with COPY
and even if I were to agree that CSV format is the better one this seems
like an unnecessary area to impose preferences. If TSV with Header meets
someone's need providing a minimal (and consistent with expectations)
syntax to accomplish that goal seems reasonable, as does the patch.
David J.
While we're discussing COPY options, what do people think of an option for
COPY FROM with header to require that the headers match the target column
names? This would help to ensure that the file is actually the right one.
On 14 May 2018 at 14:55, David G. Johnston <david.g.johnston@gmail.com>
wrote:
Show quoted text
On Mon, May 14, 2018 at 11:44 AM, Garick Hamlin <ghamlin@isc.upenn.edu>
wrote:I wonder if there was a way to let COPY FROM detect or ignore headers
as appropriate and rather than cause silently result in headers being
added as data.
Not reliably
Maybe a blank line after the header line could prevent this confusion
No
+1 for allowing HEADER with FORMAT text. It doesn't interfere with COPY
and even if I were to agree that CSV format is the better one this seems
like an unnecessary area to impose preferences. If TSV with Header meets
someone's need providing a minimal (and consistent with expectations)
syntax to accomplish that goal seems reasonable, as does the patch.David J.
On Mon, May 14, 2018 at 04:08:47PM -0400, Isaac Morland wrote:
While we're discussing COPY options, what do people think of an option for
COPY FROM with header to require that the headers match the target column
names? This would help to ensure that the file is actually the right one.
I am personally not much into such sanity check logics in COPY FWIW if
we can live without.
--
Michael
Andrew Dunstan wrote:
I'm not necessarily opposed to this, but I'm not certain about the use
case either.
+1.
The downside is that it would create the need, when using COPY TO,
to know whether an input file was generated with or without header,
and a hazard on mistakes.
If you say it was and it wasn't, you quietly loose the first row of data.
If you say it wasn't and in fact it was, either there's a
datatype mismatch or you quietly get a spurious row of data.
This complication should be balanced by some advantage.
What can we do with the header?
If you already have the table ready to COPY in, you don't
need that information. The only reason why COPY TO
needs to know about the header is to throw it away.
And if you don't have the table created yet, a header
with just the column names is hardly sufficient to create it,
isn't it?
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On 15 May 2018 at 10:26, Daniel Verite <daniel@manitou-mail.org> wrote:
Andrew Dunstan wrote:
I'm not necessarily opposed to this, but I'm not certain about the use
case either.+1.
The downside is that it would create the need, when using COPY TO,
to know whether an input file was generated with or without header,
and a hazard on mistakes.
If you say it was and it wasn't, you quietly loose the first row of data.
If you say it wasn't and in fact it was, either there's a
datatype mismatch or you quietly get a spurious row of data.
Just to be clear, we're talking about my "header match" feature, not the
basic idea of allowing a header in text format?
You already need to know whether or not there is a header, no matter what:
there is no way to avoid needing to know the format of the data to be
imported. And certainly if "header" is an option, one has to know whether
or not to set it in any given situation.
The "header match" helps ensure the file is the right one by requiring the
header contents to match the field names, rather than just being thrown
away.
I don't view it as a way to avoid pre-defining the table. It just increases
the chance that the wrong file won't load but will instead trigger an error
condition immediately.
Note that this advantage includes what happens if you specify header but
the file has no header: as long as you actually specified header match, the
error will be caught unless the first row of actual data happens to match
the field names, which is almost always highly unlikely and frequently
impossible (e.g., a person with firstname "firstname", surname "surname",
birthday "birthday" and so on).
One can imagine extensions of the idea: for example, the header could
actually be used to identify the columns, so the column order in the file
doesn't matter. There could also be an "AS" syntax to allow the target
field names to be different from the field names in the header. I have
occasionally found myself wanting to ignore certain columns of the file.
But these are all significantly more complicated than just looking at the
header and requiring it to match the target field names.
If one specifies no header but there actually is a header in the file, then
loading will fail in many cases but it depends on what the header in the
file looks like. This part is unaffected by my idea.
Show quoted text
This complication should be balanced by some advantage.
What can we do with the header?
If you already have the table ready to COPY in, you don't
need that information. The only reason why COPY TO
needs to know about the header is to throw it away.
And if you don't have the table created yet, a header
with just the column names is hardly sufficient to create it,
isn't it?
Isaac Morland <isaac.morland@gmail.com> writes:
On 15 May 2018 at 10:26, Daniel Verite <daniel@manitou-mail.org> wrote:
Andrew Dunstan wrote:
I'm not necessarily opposed to this, but I'm not certain about the use
case either.
The downside is that it would create the need, when using COPY TO,
to know whether an input file was generated with or without header,
and a hazard on mistakes.
If you say it was and it wasn't, you quietly loose the first row of data.
If you say it wasn't and in fact it was, either there's a
datatype mismatch or you quietly get a spurious row of data.
Just to be clear, we're talking about my "header match" feature, not the
basic idea of allowing a header in text format?
AFAICS, Daniel's just reacting to the basic idea of a header line.
I agree that by itself that's not worth much. However, if we added
your proposed option to insist that the column names match during COPY
IN, I think that that could have some value. It would allow
forestalling one common type of pilot error, ie copying the wrong file
entirely. (It'd also prevent copying in data that has the wrong column
order, but I think that's a less common scenario. I might be wrong
about that.)
One can imagine extensions of the idea: for example, the header could
actually be used to identify the columns, so the column order in the file
doesn't matter. There could also be an "AS" syntax to allow the target
field names to be different from the field names in the header. I have
occasionally found myself wanting to ignore certain columns of the file.
But these are all significantly more complicated than just looking at the
header and requiring it to match the target field names.
Yeah, and every bit of flexibility you add raises the chance of an
undetected error. COPY isn't intended as a general ETL facility,
so I'd mostly be -1 on adding such things. But I can see the value
of confirming that you're copying the right file, and a header match
check would go a long way towards doing that.
regards, tom lane
On Tuesday, May 15, 2018, Tom Lane <tgl@sss.pgh.pa.us> wrote:
AFAICS, Daniel's just reacting to the basic idea of a header line.
I agree that by itself that's not worth much. However, if we added
your proposed option to insist that the column names match during COPY
IN, I think that that could have some value.
I'm fine for adding it without the added matching behavior, though turning
the boolean into an enum is appealing.
HEADER { true | false | match }
Though we'd need to accept all variants of Boolean for compatability...
I'm of the opinion that text and csv should be the same excepting their
defaults for some of the options.
David J.
Isaac Morland wrote:
Just to be clear, we're talking about my "header match" feature, not the
basic idea of allowing a header in text format?
For my reply it was on merely allowing it, as does the current
patch at https://commitfest.postgresql.org/18/1629
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On Tue, May 15, 2018 at 12:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
One can imagine extensions of the idea: for example, the header could
actually be used to identify the columns, so the column order in the file
doesn't matter. There could also be an "AS" syntax to allow the target
field names to be different from the field names in the header. I have
occasionally found myself wanting to ignore certain columns of the file.
But these are all significantly more complicated than just looking at the
header and requiring it to match the target field names.Yeah, and every bit of flexibility you add raises the chance of an
undetected error. COPY isn't intended as a general ETL facility,
so I'd mostly be -1 on adding such things. But I can see the value
of confirming that you're copying the right file, and a header match
check would go a long way towards doing that.
True.
FWIW, I'm +1 on this idea. I think a header line is a pretty common
need, and if you're exporting a large amount of data, it could be
pretty annoying to have to first run COPY, and then do
(echo blah,blah1,blah2; cat copyoutput.txt)>whatireallywant.txt
There's a lot of value in being able to export from program A
*exactly* what program B wants to import, rather than something that
is close but has to be massaged.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 14 May 2018 at 08:35, Simon Muller <samullers@gmail.com> wrote:
Okay, I've added this to the next commitfest at
https://commitfest.postgresql.org/18/1629/.Thanks both Michael and David for the feedback so far.
I noticed through the patch tester link at http://commitfest.cputube.org/
that my patch caused a file_fdw test to fail (since I previously tested
only with "make check" and not with "make check-world").
This v2 patch should fix that.
Attachments:
text_header_v2.patchapplication/octet-stream; name=text_header_v2.patchDownload+15-9
On 4 July 2018 at 22:44, Simon Muller <samullers@gmail.com> wrote:
I noticed through the patch tester link at http://commitfest.cputube.org/
that my patch caused a file_fdw test to fail (since I previously tested
only with "make check" and not with "make check-world").This v2 patch should fix that.
This patch just fixes a newline issue introduced in my previous patch.
Attachments:
text_header_v3.patchapplication/octet-stream; name=text_header_v3.patchDownload+15-9
On 4 July 2018 at 22:44, Simon Muller <samullers@gmail.com <mailto:samullers@gmail.com>> wrote:
I noticed through the patch tester link at http://commitfest.cputube.org/ <http://commitfest.cputube.org/> that my patch caused a file_fdw test to fail (since I previously tested only with "make check" and not with "make check-world").This v2 patch should fix that.
This patch just fixes a newline issue introduced in my previous patch.
I've reviewed this patch and feel this patch addresses the original ask. I tested it manually trying to break it and, as mentioned previously, it's behavior is the same as the CSV copy with regards to it's shortcomings. However, I feel
1) a "copy from" test is needed and
2) the current "copy to" test is (along with a few others) in the wrong file.
With regards to #2, the copy.source tests are for things requiring replacement when running the tests. Given that these copy tests do not, I have moved the current last set of copy tests to the copy2.sql file and have provided an attached patch.
With regards to #1, the patch I have provided can then be used and the following added as the COPY TO/FROM tests (perhaps after line 426 of the attached copy2.sql file). Note that I moved the FROM test before the TO test and omitted the "(format text, header true)" in the FROM test since it is another way the command can be invoked.
copy copytest3 from stdin header;
this is just a line full of junk that would error out if parsed
11 a 1
22 b 2
\.
copy copytest3 to stdout with (format text, header true);
As for the matching check of the header in the discussion of this patch, I feel that is a separate patch that can be added later since it would affect the general functionality of the copy command, not just the ability to have a text header.
Best,
- Cynthia Shang
Attachments:
move-copy-tests-v1.patchapplication/octet-stream; name=move-copy-tests-v1.patch; x-unix-mode=0666Download+23-24
On Wed, Jul 25, 2018 at 1:24 PM, Cynthia Shang
<cynthia.shang@crunchydata.com> wrote:
With regards to #2, the copy.source tests are for things requiring
replacement when running the tests. Given that these copy tests do not, I
have moved the current last set of copy tests to the copy2.sql file and have
provided an attached patch.
The patch appears in the RAW and in your email (hopefully) but it
doesn't appear in the thread archive so I am reattaching from a
different email client.
Attachments:
move-copy-tests-v1.patchapplication/octet-stream; name=move-copy-tests-v1.patchDownload+23-24
On 25 July 2018 at 19:24, Cynthia Shang <cynthia.shang@crunchydata.com>
wrote:
I've reviewed this patch and feel this patch addresses the original ask. I
tested it manually trying to break it and, as mentioned previously, it's
behavior is the same as the CSV copy with regards to it's shortcomings.
However, I feel
1) a "copy from" test is needed and
2) the current "copy to" test is (along with a few others) in the wrong
file.With regards to #2, the copy.source tests are for things requiring
replacement when running the tests. Given that these copy tests do not, I
have moved the current last set of copy tests to the copy2.sql file and
have provided an attached patch.
Thanks for reviewing the patch.
I agree that moving those previous and these new tests out of the .source
files seems to make more sense as they don't make use of the
preprocessing/replacement feature.
With regards to #1, the patch I have provided can then be used and the
following added as the COPY TO/FROM tests (perhaps after line 426 of the
attached copy2.sql file). Note that I moved the FROM test before the TO
test and omitted the "(format text, header true)" in the FROM test since it
is another way the command can be invoked.copy copytest3 from stdin header;
this is just a line full of junk that would error out if parsed
11 a 1
22 b 2
\.copy copytest3 to stdout with (format text, header true);
I've incorporated both your suggestions and included the patch you provided
in the attached patch. Hope it's as expected.
As for the matching check of the header in the discussion of this patch, I
feel that is a separate patch that can be added later since it would affect
the general functionality of the copy command, not just the ability to have
a text header.Best,
- Cynthia Shang
P.S. I did receive the first attached patch, but on my Ubuntu I had to
apply it using "git apply --ignore-space-change --ignore-whitespace",
probably due to line ending differences.
--
Simon Muller