Nasty, propagating POLA violation in COPY CSV HEADER

Started by David Fetterover 13 years ago14 messages

david@fetter.org

over 13 years ago

Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT. It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) "feature" of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line in should be parsed to establish which columns are
where and rearranging the output if needed.

I'm proposing to make the code change here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: David Fetter (#1)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

David Fetter <david@fetter.org> writes:

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line in should be parsed to establish which columns are
where and rearranging the output if needed.

This is not "fix a POLA violation". This is a non-backwards-compatible
new feature, which would have to have a switch to turn it on.

regards, tom lane

Andrew Dunstan

andrew@dunslane.net

over 13 years ago

In reply to: David Fetter (#1)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

On 06/20/2012 11:02 AM, David Fetter wrote:

Folks,

A co-worker filed a bug against file_fdw where the columns in a
FOREIGN TABLE were scrambled on SELECT. It turned out that this comes
from the (yes, it's documented, but since it's documented in a place
not obviously linked to the bug, it's pretty useless) "feature" of
COPY CSV HEADER whereby the header line is totally ignored in COPY
OUT.

Rather than being totally ignored in the COPY OUT (CSV HEADER) case,
the header line in should be parsed to establish which columns are
where and rearranging the output if needed.

I'm proposing to make the code change here:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/copy.c;h=98bcb2fcf3370c72b0f0a7c0df76ebe4512e9ab0;hb=refs/heads/master#l2436

and a suitable doc change that talks about reading the header only for
the purpose of matching column names to columns, and throwing away the
output as before.

What say?

First you are talking about COPY IN, not COPY OUT, surely.

This is not a bug, it is documented in exactly the place that all other
COPY options are documented. The file_fdw page refers the reader to the
COPY docs for details. Unless you want us to duplicate the entire COPY
docs in the file_fdw page this seems entirely reasonable.

The current behaviour was discussed at some length back when we
implemented the HEADER feature, IIRC, and is quite intentional. I don't
think we should alter the current behaviour, as plenty of people rely on
it, some to my certain knowledge. I do see a reasonable case for adding
a new behaviour which takes notice of the header line, although it's
likely to have plenty of wrinkles.

Reordering columns like you suggest might well have a significant impact
on COPY performance, BTW. Also note that I created the file_text_array
FDW precisely for people who want to be able to cherry pick and reorder
columns. See <https://github.com/adunstan/file_text_array_fdw>

cheers

andrew

David Fetter

david@fetter.org

over 13 years ago

In reply to: Tom Lane (#2)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

On Wed, Jun 20, 2012 at 11:47:14AM -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

Rather than being totally ignored in the COPY OUT (CSV HEADER)
case, the header line in should be parsed to establish which
columns are where and rearranging the output if needed.

This is not "fix a POLA violation". This is a
non-backwards-compatible new feature, which would have to have a
switch to turn it on.

OK, new proposal:

COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: David Fetter (#4)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

David Fetter <david@fetter.org> writes:

OK, new proposal:

COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list. (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.) If we're just checking for a match, that
question goes away.

regards, tom lane

David Fetter

david@fetter.org

over 13 years ago

In reply to: Tom Lane (#5)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

OK, new proposal:

COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.

Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen
if the incoming file didn't have the same set of columns called out
in the COPY command's column list. (That is, while *rearranging*
the columns might be thought non-astonishing, I'm less convinced
about a copy operation that inserts or defaults columns differently
from what the command said should happen.) If we're just checking
for a match, that question goes away.

Let's imagine we have a table foo(id serial, t text, n numeric) and a
CSV file foo.csv with headers (n, t). Just to emphasize, the column
ordering is different in the places where they match.

Would

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER, CHECK_HEADER)

now "just work" (possibly with some performance penalty) and

COPY foo (t,n) FROM '/path/to/foo.csv' WITH (CSV, HEADER)

only "work," i.e. import gobbledygook, in the case where every t entry
in foo.csv happened to be castable to NUMERIC?

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Alvaro Herrera

alvherre@commandprompt.com

over 13 years ago

In reply to: David Fetter (#6)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

Excerpts from David Fetter's message of mié jun 20 12:43:31 -0400 2012:

On Wed, Jun 20, 2012 at 12:18:45PM -0400, Tom Lane wrote:

In your original proposal, I was rather wondering what should happen
if the incoming file didn't have the same set of columns called out
in the COPY command's column list. (That is, while *rearranging*
the columns might be thought non-astonishing, I'm less convinced
about a copy operation that inserts or defaults columns differently
from what the command said should happen.) If we're just checking
for a match, that question goes away.

Let's imagine we have a table foo(id serial, t text, n numeric) and a
CSV file foo.csv with headers (n, t). Just to emphasize, the column
ordering is different in the places where they match.

For the record, IIRC we had one person trying to do this in the spanish
list not long ago.

Another related case: you have a file with headers and columns (n, t, x, y, z)
but your table only has n and t. How would you tell COPY to discard the
junk columns? Currently it just complains that they are there.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Andrew Dunstan

andrew@dunslane.net

over 13 years ago

In reply to: Tom Lane (#5)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

On 06/20/2012 12:18 PM, Tom Lane wrote:

David Fetter<david@fetter.org> writes:

OK, new proposal:
COPY FROM (Thanks, Andrew! Must not post while asleep...) should have
an option which requires that HEADER be enabled and mandates that the
column names in the header match the columns coming in.
Has someone got a better name for this option than
KEEP_HEADER_COLUMN_NAMES?

Well, if it's just checking that the list matches, then maybe
CHECK_HEADER would do.

In your original proposal, I was rather wondering what should happen if
the incoming file didn't have the same set of columns called out in the
COPY command's column list. (That is, while *rearranging* the columns
might be thought non-astonishing, I'm less convinced about a copy
operation that inserts or defaults columns differently from what the
command said should happen.) If we're just checking for a match, that
question goes away.

In the past people have asked me to have copy use the CSV header line in
place of supplying a column list in the COPY command. I can certainly
see some utility in that, and I think it might achieve what David wants.
Using that scenario it would be an error to supply an explicit column
list and also use the header line. But then I don't think CHECK_HEADER
would be the right name for the option. In any case, specifying a name
before we settle on an exact behaviour seems wrong :-)

cheers

andrew

Andrew Dunstan

andrew@dunslane.net

over 13 years ago

In reply to: Alvaro Herrera (#7)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

On 06/20/2012 12:50 PM, Alvaro Herrera wrote:

Another related case: you have a file with headers and columns (n, t,
x, y, z) but your table only has n and t. How would you tell COPY to
discard the junk columns? Currently it just complains that they are
there.

That's one of the main use cases for file_text_array_fdw.

cheers

andrew

#10

Josh Berkus

josh@agliodbs.com

over 13 years ago

In reply to: Andrew Dunstan (#8)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

In the past people have asked me to have copy use the CSV header line in
place of supplying a column list in the COPY command. I can certainly
see some utility in that, and I think it might achieve what David wants.
Using that scenario it would be an error to supply an explicit column
list and also use the header line. But then I don't think CHECK_HEADER
would be the right name for the option. In any case, specifying a name
before we settle on an exact behaviour seems wrong :-)

Actually, I can see three valid and valuable-to-users behaviors here:

1) current behavior, header line is ignored.

2) CHECK HEADER: a column list is supplied, but a "check header" flag is
set. If the column list and header list don't match *exactly*, you get
an error.

3) USE HEADER: no column list is supplied, but a "use header" flag is
set. A column list is created to match the columns from the CSV header.
Of necessity, this will consist of all or some of the columns in the
table. If columns are supplied which are not in the table, then you get
an error (as well as if columns are missing which are NOT NULL, as you
get at present).

(2) is more valuable to people who want to check data integrity
rigorously, and test for unexpected API changes. (3) is more valuable
for regular users who want CSV import to "just work". (1) is valuable
for backwards compatibility, and for cases where the CSV header was
generated by another program (Excel?) so the column names don't match.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#11

Marc Mamin

M.Mamin@intershop.de

over 13 years ago

In reply to: David Fetter (#1)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

-----Ursprüngliche Nachricht-----
Von: pgsql-hackers-owner@postgresql.org im Auftrag von Josh Berkus
Gesendet: Mi 6/20/2012 7:06
An: pgsql-hackers@postgresql.org
Betreff: Re: [HACKERS] Nasty, propagating POLA violation in COPY CSV HEADER

...

(1) is valuable
for backwards compatibility, and for cases where the CSV header was
generated by another program (Excel?) so the column names don't match.

4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)

to cover the case when column names do not match.

my 2 pences,

Marc Mamin

#12

Josh Berkus

josh@agliodbs.com

over 13 years ago

In reply to: Marc Mamin (#11)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)

to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with
COPY. At that point, check out the CSV-array FDW.

Of course, if someone writes a WIP patch which implements the above, we
can evaluate it then.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#13

Alvaro Herrera

alvherre@commandprompt.com

over 13 years ago

In reply to: Andrew Dunstan (#9)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

Excerpts from Andrew Dunstan's message of mié jun 20 12:56:52 -0400 2012:

On 06/20/2012 12:50 PM, Alvaro Herrera wrote:

Another related case: you have a file with headers and columns (n, t,
x, y, z) but your table only has n and t. How would you tell COPY to
discard the junk columns? Currently it just complains that they are
there.

That's one of the main use cases for file_text_array_fdw.

Ah, great, thanks.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#14

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Josh Berkus (#12)

Re: Nasty, propagating POLA violation in COPY CSV HEADER

Josh Berkus <josh@agliodbs.com> writes:

4) MAP_HEADER ('column 1'-> 'col_1', 'Date' -> 'input_date' ...)

to cover the case when column names do not match.

Personally, I think that's going way beyond what we want to do with
COPY.

I agree --- if you know that much about the incoming data, you might as
well just specify the correct column list in the COPY command. This
would only have use if you know somebody used weird column names, but
not the order they put the columns in; which seems like a thin use-case.

The three options Josh enumerated seem reasonably sane to me.

regards, tom lane