Ragged CSV import

Started by Andrew Dunstan over 16 years ago · 49 messages · pgsql-hackers
#1 Andrew Dunstan
andrew@dunslane.net

I have received a requirement for the ability to import ragged CSV
files, i.e. files that contain variable numbers of columns per row. The
requirement is that extra columns would be ignored and missing columns
filled with NULL. The client wanting this has wrestled with some
preprocessors to try to get what they want, but they would feel happier
with this built in. This isn't the first time I have received this
request since we implemented CSV import. People have complained on
numerous occasions about the strictness of the import routines w.r.t.
the number of columns.
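The kind of preprocessing such clients typically wrestle with can be sketched as a small Python filter (a hypothetical illustration; the `regularize` helper and its calling convention are invented, not part of any proposal here):

```python
import csv
import io

def regularize(rows, ncols):
    """Force every row to exactly ncols fields: pad short rows with empty
    strings (which COPY ... CSV reads as NULL by default) and truncate
    long ones, dropping the extra columns."""
    for row in rows:
        padded = row + [''] * max(0, ncols - len(row))
        yield padded[:ncols]

# Feed a ragged file through the filter before handing it to COPY FROM STDIN.
ragged = io.StringIO('1,apple\n2,banana,a comment off to the right\n')
fixed = [row for row in regularize(csv.reader(ragged), 3)]
# fixed == [['1', 'apple', ''], ['2', 'banana', 'a comment off to the right']]
```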

I don't think that providing a CSV import mode for reading these files
will be very difficult or invasive. If it's not acceptable, I will
provide it to the client as a patch, and I will be retrofitting it to
8.4 anyway. But I think it's not an unreasonable request.

Thoughts?

cheers

andrew

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: Ragged CSV import

Andrew Dunstan <andrew@dunslane.net> writes:

> I have received a requirement for the ability to import ragged CSV
> files, i.e. files that contain variable numbers of columns per row. The
> requirement is that extra columns would be ignored and missing columns
> filled with NULL. The client wanting this has wrestled with some
> preprocessors to try to get what they want, but they would feel happier
> with this built in. This isn't the first time I have received this
> request since we implemented CSV import. People have complained on
> numerous occasions about the strictness of the import routines w.r.t.
> the number of columns.

Hmm. Accepting too few columns and filling with nulls isn't any
different than what INSERT has always done. But ignoring extra columns
seems like a different ballgame. Can you talk your client out of that
one? It just seems like a bad idea.

As for the "numerous occasions", maybe I've not been paying attention,
but I don't recall any ...

regards, tom lane

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Ragged CSV import

Tom Lane wrote:

> As for the "numerous occasions", maybe I've not been paying attention,
> but I don't recall any ...

I don't know about numerous, but I've seen it in the Spanish list; for
example:

http://archives.postgresql.org/pgsql-es-ayuda/2007-03/msg00901.php

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: Ragged CSV import

Andrew Dunstan <andrew@dunslane.net> writes:

> I have received a requirement for the ability to import ragged CSV
> files, i.e. files that contain variable numbers of columns per row.

BTW, one other thought about this: I think the historical reason for
COPY being strict about the number of incoming columns was that it
provided a useful cross-check that the parsing hadn't gone off into
the weeds. We have certainly seen enough examples where the reported
manifestation of, say, an escaping mistake was that COPY saw the row
as having too many or too few columns. So being permissive about it
would lose some error detection capability. I am not clear about
whether CSV format is sufficiently more robust than the traditional
COPY format to render this an acceptable loss. Comments?

(One possible answer to this is to make the behavior optional, though
surely COPY has got too many options already :-()

regards, tom lane

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: Ragged CSV import

Tom Lane wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> I have received a requirement for the ability to import ragged CSV
>> files, i.e. files that contain variable numbers of columns per row.
>
> BTW, one other thought about this: I think the historical reason for
> COPY being strict about the number of incoming columns was that it
> provided a useful cross-check that the parsing hadn't gone off into
> the weeds. We have certainly seen enough examples where the reported
> manifestation of, say, an escaping mistake was that COPY saw the row
> as having too many or too few columns. So being permissive about it
> would lose some error detection capability. I am not clear about
> whether CSV format is sufficiently more robust than the traditional
> COPY format to render this an acceptable loss. Comments?

I think accepting fewer columns and filling with nulls should be
safe enough for this not to be a problem; if the parser goes nuts,
it will die eventually. Silently dropping excess trailing columns
does not seem acceptable, though; you could lose entire rows and not
notice.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#6 Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#1)
Re: Ragged CSV import

On 9/9/09 12:59 PM, Andrew Dunstan wrote:

> I have received a requirement for the ability to import ragged CSV
> files, i.e. files that contain variable numbers of columns per row. The
> requirement is that extra columns would be ignored and missing columns
> filled with NULL. The client wanting this has wrestled with some
> preprocessors to try to get what they want, but they would feel happier
> with this built in. This isn't the first time I have received this
> request since we implemented CSV import. People have complained on
> numerous occasions about the strictness of the import routines w.r.t.
> the number of columns.

Would this just work on columns on the end, or would it work on the
basis of parsing the CSV header and matching columns?

While the former functionality would be relatively simple, I think the
latter is what people really want.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

#7 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#2)
Re: Ragged CSV import

Tom Lane wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> I have received a requirement for the ability to import ragged CSV
>> files, i.e. files that contain variable numbers of columns per row. The
>> requirement is that extra columns would be ignored and missing columns
>> filled with NULL. The client wanting this has wrestled with some
>> preprocessors to try to get what they want, but they would feel happier
>> with this built in. This isn't the first time I have received this
>> request since we implemented CSV import. People have complained on
>> numerous occasions about the strictness of the import routines w.r.t.
>> the number of columns.
>
> Hmm. Accepting too few columns and filling with nulls isn't any
> different than what INSERT has always done. But ignoring extra columns
> seems like a different ballgame. Can you talk your client out of that
> one? It just seems like a bad idea.

No, that's critical. The application this is wanted for uploads data
that users put in spreadsheets. The users apparently expect that they
will be able to put comments on some rows off to the right of the data
they want loaded, and have it ignored.

To answer your other point made later, my intention was to make this
optional behaviour, not default behaviour. I agree that it would be too
slack for default behaviour. Yes, we have quite a few options, but
that's not surprising in dealing with a format that is at best
ill-defined and which we do not control.

> As for the "numerous occasions", maybe I've not been paying attention,
> but I don't recall any ...

The requests have been made on IRC, at conferences, in private emails.

cheers

andrew

#8 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Josh Berkus (#6)
Re: Ragged CSV import

Josh Berkus <josh@agliodbs.com> wrote:

> Would this just work on columns on the end, or would it work on the
> basis of parsing the CSV header and matching columns?
>
> While the former functionality would be relatively simple, I think the
> latter is what people really want.

It's been a while since I've had a need for something like this, but of
the copy features not currently available in PostgreSQL, the two most
useful are to read in only some of the defined columns, and to output to
a separate disk file any rows which failed to match the expected format.
The latter would not cause the copy to fail unless the count of such
rows exceeded a user-specified threshold.

-Kevin

#9 Andrew Dunstan
andrew@dunslane.net
In reply to: Josh Berkus (#6)
Re: Ragged CSV import

Josh Berkus wrote:

> On 9/9/09 12:59 PM, Andrew Dunstan wrote:
>
>> I have received a requirement for the ability to import ragged CSV
>> files, i.e. files that contain variable numbers of columns per row. The
>> requirement is that extra columns would be ignored and missing columns
>> filled with NULL. The client wanting this has wrestled with some
>> preprocessors to try to get what they want, but they would feel happier
>> with this built in. This isn't the first time I have received this
>> request since we implemented CSV import. People have complained on
>> numerous occasions about the strictness of the import routines w.r.t.
>> the number of columns.
>
> Would this just work on columns on the end, or would it work on the
> basis of parsing the CSV header and matching columns?
>
> While the former functionality would be relatively simple, I think the
> latter is what people really want.

No, it's the former, and that's exactly what the client requested. I'm
not talking about parsing the header line - that's a much larger can of
worms.

cheers

andrew

#10 Hannu Krosing
hannu@tm.ee
In reply to: Alvaro Herrera (#5)
Re: Ragged CSV import

On Wed, 2009-09-09 at 16:34 -0400, Alvaro Herrera wrote:

> Tom Lane wrote:
>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>
>>> I have received a requirement for the ability to import ragged CSV
>>> files, i.e. files that contain variable numbers of columns per row.
>
>> BTW, one other thought about this: I think the historical reason for
>> COPY being strict about the number of incoming columns was that it
>> provided a useful cross-check that the parsing hadn't gone off into
>> the weeds. We have certainly seen enough examples where the reported
>> manifestation of, say, an escaping mistake was that COPY saw the row
>> as having too many or too few columns. So being permissive about it
>> would lose some error detection capability. I am not clear about
>> whether CSV format is sufficiently more robust than the traditional
>> COPY format to render this an acceptable loss. Comments?
>
> I think accepting fewer columns and filling with nulls should be
> safe enough for this not to be a problem; if the parser goes nuts,
> it will die eventually. Silently dropping excess trailing columns
> does not seem acceptable, though; you could lose entire rows and not
> notice.

Maybe we could put a catch-all "text" or even "text[]" column as the
last one of the table and gather all extra columns there?
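Client-side, that catch-all idea might be sketched like this (a hypothetical Python illustration, not existing COPY behaviour; the helper name is invented):

```python
import csv
import io

def split_with_catchall(rows, nfixed):
    """Keep the first nfixed fields as regular columns (padding with None
    where a row is short) and collect anything beyond them into one list,
    mimicking a trailing catch-all text[] column."""
    for row in rows:
        fixed = (row + [None] * nfixed)[:nfixed]
        yield fixed, row[nfixed:]

data = io.StringIO('1,foo,extra1,extra2\n2,bar\n')
out = list(split_with_catchall(csv.reader(data), 2))
# out == [(['1', 'foo'], ['extra1', 'extra2']), (['2', 'bar'], [])]
```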

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training

#11 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andrew Dunstan (#1)
Re: Ragged CSV import

"Dann Corbit" <DCorbit@connx.com> wrote:

> Kevin Grittner
>
>> It's been a while since I've had a need for something like this,
>> but of the copy features not currently available in PostgreSQL,
>> the two most useful are to read in only some of the defined
>> columns, and to output to a separate disk file any rows which
>> failed to match the expected format.
>> The latter would not cause the copy to fail unless the count of
>> such rows exceeded a user-specified threshold.
>
> Perhaps something like SQL Server's BCP format files could be
> used.
> http://support.microsoft.com/kb/67409
> http://technet.microsoft.com/en-us/library/ms178129.aspx
> http://www.nigelrivett.net/SQLTsql/BCP_quoted_CSV_Format_file.html
> http://doc.ddart.net/mssql/sql70/impt_bcp_16.htm

That is what I was thinking of when I wrote the above, although in
the Sybase incarnation, before Microsoft split off on their own. (I
see they haven't changed sybchar from the Sybase name yet....)

My reservation about referencing it is that it includes so many
bells and whistles that it's not as easy to use as it might be, even
with the "wizard" to generate the format description files. The
other problem with it is that it was far and away the *least* stable
part of the DBMS. You could count on it being broken in any version
until six months to a year into that version's life. We eventually
moved away from it because of the instability -- we could write code
from scratch each time with more confidence of correct behavior. I
think keeping it a little simpler might help with keeping it stable.

-Kevin

#12 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#1)
Re: Ragged CSV import

Dann Corbit wrote:

> Perhaps something like SQL Server's BCP format files could be used.
> http://support.microsoft.com/kb/67409
> http://technet.microsoft.com/en-us/library/ms178129.aspx
> http://www.nigelrivett.net/SQLTsql/BCP_quoted_CSV_Format_file.html
> http://doc.ddart.net/mssql/sql70/impt_bcp_16.htm

This looks vastly more complex than anything I am interested in doing.

cheers

andrew

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: Ragged CSV import

On Wed, Sep 9, 2009 at 4:13 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> I have received a requirement for the ability to import ragged CSV
>> files, i.e. files that contain variable numbers of columns per row. The
>> requirement is that extra columns would be ignored and missing columns
>> filled with NULL. The client wanting this has wrestled with some
>> preprocessors to try to get what they want, but they would feel happier
>> with this built in. This isn't the first time I have received this
>> request since we implemented CSV import. People have complained on
>> numerous occasions about the strictness of the import routines w.r.t.
>> the number of columns.
>
> Hmm.  Accepting too few columns and filling with nulls isn't any
> different than what INSERT has always done.  But ignoring extra columns
> seems like a different ballgame.  Can you talk your client out of that
> one?  It just seems like a bad idea.

I agree that ignoring extra columns is a bad idea, but I don't even
like the idea of ignoring missing columns. It doesn't seem like a
good idea to take a spreadsheet and feed it into COPY without doing
any validation anyway, and this is the kind of thing that is trivial
to clean up with a thin layer of Perl or your scripting language of
choice.

...Robert

#14 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#13)
Re: Ragged CSV import

Robert Haas wrote:

> I agree that ignoring extra columns is a bad idea, but I don't even
> like the idea of ignoring missing columns. It doesn't seem like a
> good idea to take a spreadsheet and feed it into COPY without doing
> any validation anyway, and this is the kind of thing that is trivial
> to clean up with a thin layer of Perl or your scripting language of
> choice.

If it's an optional feature then I don't see why there is a problem.
What skin is it off anyone else's nose but those who choose this
behaviour?

I am perfectly familiar with Perl and so is the client that requested
this feature. They are quite technically savvy. They are using a
scripting solution now but they find it cumbersome.

As for general validation, the requestor's application in fact loads the
spreadsheet into a temp table of text columns and then subjects it to a
large variety of complex business rule checking before adding the data
to the main tables. It is a whole lot faster and cleaner to do it that
way than before the data is loaded at all. That's why they aren't
concerned about missing columns.

cheers

andrew

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#14)
Re: Ragged CSV import

On Wed, Sep 9, 2009 at 7:41 PM, Andrew Dunstan<andrew@dunslane.net> wrote:

> Robert Haas wrote:
>
>> I agree that ignoring extra columns is a bad idea, but I don't even
>> like the idea of ignoring missing columns.  It doesn't seem like a
>> good idea to take a spreadsheet and feed it into COPY without doing
>> any validation anyway, and this is the kind of thing that is trivial
>> to clean up with a thin layer of Perl or your scripting language of
>> choice.
>
> If it's an optional feature then I don't see why there is a problem. What
> skin is it off anyone else's nose but those who choose this behaviour?

I have to admit I'm usually an advocate of that way of thinking, so
maybe I am all wet. I suppose it comes down to whether you think this
particular feature is something with broad applicability, or whether
there are 100 other equally plausible features.

I wonder whether it would be appropriate to do something like
implement a method by which copy could return text[] and then one
could write wrappers around that functionality to do this as well as
other things. For example, suppose you wanted to have rows of the
form:

A,B,C,X1,Y1,X2,Y2,X3,Y3

...which gets transformed into an insert of (A,B,C) into a main table
and (A,X1,Y1), (A,X2,Y2), (A,X3,Y3) into a side table. (I have actual
knowledge of a widely-deployed system produced by a large company that
outputs data in a format similar to this, though the actual format is
considerably more complex.)
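The transformation described above could be prototyped in a thin wrapper along these lines (a hypothetical sketch; the row layout and helper name are invented for illustration):

```python
import csv
import io

def split_main_and_side(rows):
    """Turn a row A,B,C,X1,Y1,X2,Y2,... into one main-table record (A,B,C)
    plus one side-table record (A,Xi,Yi) for each trailing X/Y pair."""
    for row in rows:
        a, b, c, *pairs = row
        side = [(a, pairs[i], pairs[i + 1]) for i in range(0, len(pairs) - 1, 2)]
        yield (a, b, c), side

data = io.StringIO('k1,b,c,1,2,3,4\n')
main, side = next(split_main_and_side(csv.reader(data)))
# main == ('k1', 'b', 'c'); side == [('k1', '1', '2'), ('k1', '3', '4')]
```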

...Robert

#16 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#15)
Re: Ragged CSV import

Robert Haas wrote:

> I wonder whether it would be appropriate to do something like
> implement a method by which copy could return text[] and then one
> could write wrappers around that functionality to do this as well as
> other things.

It's not a function ... writing a copy() function returning text[] (or
setof text[], which would be better) in plperlu would be trivial. But it
couldn't read from stdin or be able to be fed data from a client in
copy-in mode.

And even if we have it what does the sql look like to insert the values
into some table with n columns?

Anything along these lines in C is likely to be far larger than what I
had in mind, which was a fairly minor enhancement.

cheers

andrew

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#16)
Re: Ragged CSV import

On Wed, Sep 9, 2009 at 10:40 PM, Andrew Dunstan<andrew@dunslane.net> wrote:

> Robert Haas wrote:
>
>> I wonder whether it would be appropriate to do something like
>> implement a method by which copy could return text[] and then one
>> could write wrappers around that functionality to do this as well as
>> other things.
>
> It's not a function ... writing a copy() function returning text[] (or setof
> text[], which would be better) in plperlu would be trivial. But it couldn't
> read from stdin or be able to be fed data from a client in copy-in mode.

Yes, setof text[] is what I meant. Making it be able to read from
stdin or be fed data from a client in copy-in mode would be the whole
point of the feature, I suppose.

> And even if we have it what does the sql look like to insert the values into
> some table with n columns?

INSERT INTO table
SELECT t[1], t[2], t[3], t[4], ..., t[n]
FROM (...however we get the copy results...)

...although I'm not entirely sure that's compatible with the
client/server protocol.

> Anything along these lines in C is likely to be far larger than what I had
> in mind, which was a fairly minor enhancement.

It wouldn't be the first time that someone was told that a particular
enhancement didn't have enough value and that they had to do something
larger if they wanted it in core, but on the other hand, I am only
expressing my opinion, which is not binding on you or anyone else, nor
even a fully-fleshed-out proposal. All I'm saying is that it seems to
me that there is value in being able to use the CSV (or other) parsing
code, but have some way to make modifications to how/where the data is
actually inserted. However, I'm skeptical about whether the specific
thing you want to do after parsing (namely, drop excess columns,
null-fill missing ones) is sufficiently common to warrant a feature to
do only that. YMMV, of course.

...Robert

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#16)
Re: Ragged CSV import

Andrew Dunstan <andrew@dunslane.net> writes:

> Robert Haas wrote:
>
>> I wonder whether it would be appropriate to do something like
>> implement a method by which copy could return text[] and then one
>> could write wrappers around that functionality to do this as well as
>> other things.
>
> Anything along these lines in C is likely to be far larger than what I
> had in mind, which was a fairly minor enhancement.

I think Robert's got a point though. What you're talking about is a
fairly specialized single-purpose feature, which nonetheless is going to
require a lot of infrastructure (for example, teaching psql's \copy
about it). Perhaps, for approximately the same amount of overhead,
we could provide something more general.

I don't agree with the idea of injecting something "behind" copy though.
The thought that comes to mind for me is something "in front of" copy,
that is, give it the text of each line and let it do a text-to-text
transformation before COPY chews on it.

Any of this is getting into territory we had previously agreed not to
let COPY venture into, ie general purpose data transformation. I'm not
sure I want to cross that bridge and only get "ignore extra columns" out
of it.

regards, tom lane

#19 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#17)
Re: Ragged CSV import

Robert Haas wrote:

> However, I'm skeptical about whether the specific
> thing you want to do after parsing (namely, drop excess columns,
> null-fill missing ones) is sufficiently common to warrant a feature to
> do only that. YMMV, of course.

So might my experience. I can tell you that I have been asked about this
dozens of times since implementing CSV import, in various forums. From
day one people have wanted a way of handling CSVs which were not
strictly regular. Pretty much every time the request has been something
fairly close to this.

As for the feature not being large enough, we have a regrettable
(IMNSHO) tendency in this project to take simple proposals and embellish
them with many bells and whistles. Personally, I'm much more of a fan of
incremental development.

cheers

andrew

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#18)
Re: Ragged CSV import

On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> Robert Haas wrote:
>
>>> I wonder whether it would be appropriate to do something like
>>> implement a method by which copy could return text[] and then one
>>> could write wrappers around that functionality to do this as well as
>>> other things.
>
>> Anything along these lines in C is likely to be far larger than what I
>> had in mind, which was a fairly minor enhancement.
>
> I think Robert's got a point though.  What you're talking about is a
> fairly specialized single-purpose feature, which nonetheless is going to
> require a lot of infrastructure (for example, teaching psql's \copy
> about it).  Perhaps, for approximately the same amount of overhead,
> we could provide something more general.
>
> I don't agree with the idea of injecting something "behind" copy though.
> The thought that comes to mind for me is something "in front of" copy,
> that is, give it the text of each line and let it do a text-to-text
> transformation before COPY chews on it.

That seems to me to be a whole lot less useful. As I see it, the
whole point of any enhancement in this area is to reuse the parsing
code. If I have a CSV file (or some other format COPY understands),
I don't want to have to write my own parser for that format in order
to do some simple data transformation (like dropping columns >n).

Previous agreements notwithstanding, I think letting COPY do general
transformations is a great idea. But I'm a lot more skeptical about
one specific transformation without some framework of which that case
is the first instance.

...Robert

#21 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#18)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#21)
#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#20)
#24 Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#13)
#25 Nikhil Sontakke
nikhil.sontakke@enterprisedb.com
In reply to: Kevin Grittner (#8)
#26 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Lane (#22)
#27 Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Eisentraut (#24)
#28 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#22)
#29 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#19)
#30 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitri Fontaine (#26)
#31 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#23)
#32 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#30)
#33 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#31)
#34 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#32)
#35 Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#33)
#36 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#30)
#37 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#36)
#38 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#35)
#39 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#37)
#40 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#39)
#41 Stephen Frost
sfrost@snowman.net
In reply to: Andrew Dunstan (#28)
#42 Andrew Dunstan
andrew@dunslane.net
In reply to: Stephen Frost (#41)
#43 Stephen Frost
sfrost@snowman.net
In reply to: Andrew Dunstan (#42)
#44 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Andrew Dunstan (#42)
#45 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#42)
#46 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#45)
#47 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#46)
#48 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#47)
#49 daveg
daveg@sonic.net
In reply to: Dimitri Fontaine (#44)