COPY formatting
Hi,
in TODO is item: "* Allow dump/load of CSV format". I don't think
it's clean idea. Why CSV and why not something other? :-)
A why not allow to users full control of the format by they own
function. It means something like:
COPY tablename [ ( column [, ...] ) ]
TO { 'filename' | STDOUT }
[ [ WITH ]
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ]
[ FORMAT funcname ] ]
^^^^^^^^^^^^^^^^
The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)
-- it's pseudocode of course, it should be use standard fmgr
interface.
It's probably interesting for non-binary COPY version.
Comments?
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
Karel Zak wrote:
Hi,
in TODO is item: "* Allow dump/load of CSV format". I don't think
it's clean idea. Why CSV and why not something other? :-)A why not allow to users full control of the format by they own
function. It means something like:COPY tablename [ ( column [, ...] ) ]
TO { 'filename' | STDOUT }
[ [ WITH ]
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ]
[ FORMAT funcname ] ]
^^^^^^^^^^^^^^^^The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)-- it's pseudocode of course, it should be use standard fmgr
interface.It's probably interesting for non-binary COPY version.
Comments?
Karel
Karel,
This seems to be an excellent idea.
People have already asked for many different formats.
Usually I recommend them to use "psql -c "COPY ..." dbname | awk ...".
Since Windows will be supported soon, it will be hard to pipe data to a
useful program (awk, sed, ...). Maybe this feature would help a lot in
this case.
Regards,
Hans
--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at
Hans-J�rgen Sch�nig wrote:
Karel Zak wrote:
Hi,
in TODO is item: "* Allow dump/load of CSV format". I don't think
it's clean idea. Why CSV and why not something other? :-)
A why not allow to users full control of the format by they own
function. It means something like:COPY tablename [ ( column [, ...] ) ]
TO { 'filename' | STDOUT }
[ [ WITH ] [ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ]
[ FORMAT funcname ] ]
^^^^^^^^^^^^^^^^
The formatting
function API can be pretty simple:text *my_copy_format(text *attrdata, int direction, int
nattrs, int attr, oid attrtype, oid relation)-- it's pseudocode of course, it should be use standard fmgr
interface.It's probably interesting for non-binary COPY version.
Comments?
Karel
Karel,
This seems to be an excellent idea.
People have already asked for many different formats.
Usually I recommend them to use "psql -c "COPY ..." dbname | awk ...".
Since Windows will be supported soon, it will be hard to pipe data to a
useful program (awk, sed, ...). Maybe this feature would help a lot in
this case.
Could a CSV-generating function be provided with the distribution then?
Regards,
Fernando
Karel Zak <zakkr@zf.jcu.cz> writes:
The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)
This seems like it could only reasonably be implemented as a C function.
I can't really imagine the average user of COPY wanting to write C in
preference to, say, an external perl script. What's the real use-case
for the feature?
regards, tom lane
Karel Zak wrote:
Hi,
in TODO is item: "* Allow dump/load of CSV format". I don't think
it's clean idea. Why CSV and why not something other? :-)A why not allow to users full control of the format by they own
function. It means something like:COPY tablename [ ( column [, ...] ) ]
TO { 'filename' | STDOUT }
[ [ WITH ]
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ]
[ FORMAT funcname ] ]
^^^^^^^^^^^^^^^^The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)-- it's pseudocode of course, it should be use standard fmgr
interface.It's probably interesting for non-binary COPY version.
Interesting ... The alternative might be an external program to munge
CSVs and whatever other format people want to support and then call the
exisiting COPY- either in bin or contrib. I have seen lots of people
wanting to import CSVs, and that's even before we get a Windows port.
cheers
andrew
Tom Lane wrote:
Karel Zak <zakkr@zf.jcu.cz> writes:
The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)This seems like it could only reasonably be implemented as a C function.
I can't really imagine the average user of COPY wanting to write C in
preference to, say, an external perl script. What's the real use-case
for the feature?
That is why I suggested providing a pre-written/pre-compiled/installed
function for CSV (call it CSV?). Advanced users could still write their
own as people can write many other things if they know their ways.
andrew@dunslane.net (Andrew Dunstan) writes:
Karel Zak wrote:
Hi,
in TODO is item: "* Allow dump/load of CSV format". I don't think
it's clean idea. Why CSV and why not something other? :-)A why not allow to users full control of the format by they own
function. It means something like:
COPY tablename [ ( column [, ...] ) ]
TO { 'filename' | STDOUT }
[ [ WITH ] [ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ]
[ FORMAT funcname ] ]
^^^^^^^^^^^^^^^^
The formatting
function API can be pretty simple:text *my_copy_format(text *attrdata, int direction, int
nattrs, int attr, oid attrtype, oid relation)-- it's pseudocode of course, it should be use standard fmgr
interface.
It's probably interesting for non-binary COPY version.Interesting ... The alternative might be an external program to munge
CSVs and whatever other format people want to support and then call
the exisiting COPY- either in bin or contrib. I have seen lots of
people wanting to import CSVs, and that's even before we get a Windows
port.
I know Jan Wieck has been working on something like this, with a bit
of further smarts...
- By having, alongside, a table definition, the table can be created
concurrently;
- A set of mapping functions can be used, so that if, for instance,
the program generating the data was Excel, and you have a field with
values like 37985, 38045, or 38061, they can respectively be mapped
to '2004-01-01', '2004-03-01', and '2004-03-17';
- It can load whatever data is loadable, and use Ethernet-like
backoffs when it encounters bad records so that it loads all the data
that is good, and leaves a bundle of 'crud' that is left over.
He had been prototyping it in Tcl; I'm not sure how far a port to C
has gotten. It looked pretty neat; it sure seems better to put the
"cleverness" in userspace than to try to increase the complexity of
the postmaster...
--
output = ("cbbrowne" "@" "cbbrowne.com")
http://cbbrowne.com/info/linuxxian.html
Have you heard of the new Macsyma processor? It has three instructions --
LOAD, STORE, and SKIP IF INTEGRABLE.
That is why I suggested providing a pre-written/pre-compiled/installed
function for CSV (call it CSV?). Advanced users could still write
their own as people can write many other things if they know their ways.
As someone who just went through a whole truckload of crap getting
delimited files parsed from MSSQL to PostgreSQL. I believe yes this
would be great thing. We ended up using plPython with the CSV module.
Sincerely,
Joshua Drake
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
I deal with this daily in a cygwin environment. I wrote a simple c++
program where I hardcoded the input file name/location and output file
name/location. I strip the quotation marks out where they are used for
identifying text fields and change the comma's used as CSV's to pipes.
I use a combination of bash scripting to execute the c++ program and
then Perl to execute a stored procedure. I am new to Perl so I have not
yet considered migrating it all into Perl.
The dos2unix tools in cygwin always messed up the first character of the
first line.
I thought the real issue with the copy function and CSVs was that it did
not like the use of quotations around the fields to identify text
fields.
For a true Windows port handling MS Excel files in their native format
would be a goal I would hope. If your api could handle that then I would
agree with your method.
Show quoted text
On Wed, 2004-03-17 at 22:14, Joshua D. Drake wrote:
That is why I suggested providing a pre-written/pre-compiled/installed
function for CSV (call it CSV?). Advanced users could still write
their own as people can write many other things if they know their ways.As someone who just went through a whole truckload of crap getting
delimited files parsed from MSSQL to PostgreSQL. I believe yes this
would be great thing. We ended up using plPython with the CSV module.Sincerely,
Joshua Drake
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
Karel Zak <zakkr@zf.jcu.cz> writes:
The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)This seems like it could only reasonably be implemented as a C function.
Why? I said it's pseudo code. It should use standard fmgr API like
every other PostgreSQL function or is it problem and I overlook
something? It must to support arbitrary programming language and not
C only.
I can't really imagine the average user of COPY wanting to write C in
preference to, say, an external perl script. What's the real use-case
for the feature?
Don't hardcode any format to PostgreSQL, be open for others formats.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
On Wed, Mar 17, 2004 at 10:11:48AM -0500, Fernando Nasser wrote:
Could a CSV-generating function be provided with the distribution then?
I think yes. Any other format suggestion?
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
On Thu, Mar 18, 2004 at 07:48:40AM +0100, Karel Zak wrote:
On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
Karel Zak <zakkr@zf.jcu.cz> writes:
The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)This seems like it could only reasonably be implemented as a C function.
Why? I said it's pseudo code. It should use standard fmgr API like
every other PostgreSQL function or is it problem and I overlook
something? It must to support arbitrary programming language and not
C only.
Well, I look over the COPY code and best will start with hardcoded
version, but make it modular in code and if all will right we can think
about some interface for others formats definition. OK?
It's pity that main idea of current COPY is based on separated lines
and it is not more common interface for streaming data between FE and BE.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
To be honest this idea strikes me as overkill - over
engineering. While there is a clear need for proper CSV import
(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
strings) I cannot see how this would prove useful, or who would use
it?
While i have done a lot of messing around reading/writing the binary
format (and been stung by changes in that format) if you are using
this format then you're 99% likely to be in control of the
incoming/outgoing data and thus able to format to your wishes outwith
COPY.
Something else in the TODO regarding COPY is XML import/export, and
for this to be supported in your proposed implementation the function
would need to be passed in a heap more information.
L.
Karel Zak writes:
Show quoted text
Hi,
in TODO is item: "* Allow dump/load of CSV format". I don't think
it's clean idea. Why CSV and why not something other? :-)A why not allow to users full control of the format by they own
function. It means something like:COPY tablename [ ( column [, ...] ) ]
TO { 'filename' | STDOUT }
[ [ WITH ]
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ]
[ FORMAT funcname ] ]
^^^^^^^^^^^^^^^^The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)-- it's pseudocode of course, it should be use standard fmgr
interface.It's probably interesting for non-binary COPY version.
Comments?
Karel
On Thu, Mar 18, 2004 at 09:29:03AM +0000, Lee Kindness wrote:
To be honest this idea strikes me as overkill - over
engineering.
It was suggestion, maybe you're right :-)
While i have done a lot of messing around reading/writing the binary
format (and been stung by changes in that format) if you are using
this format then you're 99% likely to be in control of the
incoming/outgoing data and thus able to format to your wishes outwith
COPY.
I partly agree.
But.. there is possible write directly final file by backend without
data transfer to client. If we want to support this feature we need
control output format by server...
And.. I can image format that is use for BE/FE data transfer only and
not for some final data presentation. For example compression of data
stream from/to BE without PostgreSQL protocol change.
Something else in the TODO regarding COPY is XML import/export, and
for this to be supported in your proposed implementation the function
would need to be passed in a heap more information.
Yes, very probably some struct with all COPY information and format
specific stuff. Tom was right that in this case it will C functions
only. As I said I will try implement it without user defined function
call for format conversion, but I will do it modular and in future we
can create some interface for user defined formats.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
Lee Kindness wrote:
To be honest this idea strikes me as overkill - over
engineering. While there is a clear need for proper CSV import
(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
strings) I cannot see how this would prove useful, or who would use
it?
I agree. My modest proposal for handling CSVs would be to extend the
DELIMITER parameter to allow up to 3 characters - separator, quote and
escape. Escape would default to the quote char and the quote char would
default to unspecified. This would involve no grammar changes and fairly
isolated and small code changes, I believe. In the most common CSV cases
you would just use $$,"$$ or $$,'$$. :-)
COPY is basically line/tuple oriented, and that alone would exclude many
file formats (e.g. imagine wanting to import a spreadsheet where each
worksheet was the table name and the first row on each worksheet was the
field names - I have seen such beasts more than once). If we want a
general facility for loading and exporting foreign file formats, I
really believe that is the province of a utility program rather than a
database engine function.
The reason in my mind for making CSV a special case is that it is very
easy to do and so often asked for.
(I used to set parsing CSVs as a basic programming exercise - it is
amazing how many way people find to get it wrong).
cheers
andrew
Andrew Dunstan wrote:
Lee Kindness wrote:
To be honest this idea strikes me as overkill - over
engineering. While there is a clear need for proper CSV import
(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
strings) I cannot see how this would prove useful, or who would use
it?I agree. My modest proposal for handling CSVs would be to extend the
DELIMITER parameter to allow up to 3 characters - separator, quote and
escape. Escape would default to the quote char and the quote char would
default to unspecified. This would involve no grammar changes and fairly
isolated and small code changes, I believe. In the most common CSV cases
you would just use $$,"$$ or $$,'$$. :-)COPY is basically line/tuple oriented, and that alone would exclude many
file formats (e.g. imagine wanting to import a spreadsheet where each
worksheet was the table name and the first row on each worksheet was the
field names - I have seen such beasts more than once). If we want a
general facility for loading and exporting foreign file formats, I
really believe that is the province of a utility program rather than a
database engine function.The reason in my mind for making CSV a special case is that it is very
easy to do and so often asked for.(I used to set parsing CSVs as a basic programming exercise - it is
amazing how many way people find to get it wrong).
I like the separator, quote, and escape idea. It allows variety without
requiring folks to code in C.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Karel Zak <zakkr@zf.jcu.cz> writes:
On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
Karel Zak <zakkr@zf.jcu.cz> writes:
This seems like it could only reasonably be implemented as a C function.
Why? I said it's pseudo code. It should use standard fmgr API like
every other PostgreSQL function or is it problem and I overlook
something? It must to support arbitrary programming language and not
C only.
Sure, but the question is whether the *stuff it has to do* can
reasonably be coded in anything but C. Why are you passing in a
relation OID, if not for lookups in relcache entries that are simply
not accessible above the C level? (Don't tell me you want the function
to do a bunch of actual SELECTs from system catalogs for every line
of the copy...)
Passing in a relation OID is probably a bad idea anyway, as it ties this
API to the assumption that COPY is only for complete relations. There's
been talk before of allowing a SELECT result to be presented via the
COPY protocol, for instance. What might be a more usable API is
COPY OUT:
function formatter_out(text[]) returns text
COPY IN:
function formatter_in(text) returns text[]
where the text array is either the results of or the input to the
per-column datatype I/O routines. This makes it explicit that the
formatter's job is solely to determine the column-level wrapping and
unwrapping of the data. I'm assuming here that there is no good reason
for the formatter to care about the specific datatypes involved; can you
give a counterexample?
It's pity that main idea of current COPY is based on separated lines
and it is not more common interface for streaming data between FE and BE.
Yeah, that was another concern I had. This API would let the formatter
control line-level layout but it would not eliminate the hard-wired
significance of newline. What's worse, there isn't any clean way to
deal with reading quoted newlines --- the formatter can't really replace
the default quoting rules if the low-level code is going to decide
whether a newline is quoted or not.
We could possibly solve that by specifying that the text output or input
(respectively) is the complete line sent to or from the client,
including newline or whatever other line-level formatting you are using.
This still leaves the problem of how the low-level COPY IN code knows
what is a complete line to pass off to the formatter_in routine. We
could possibly fix this by adding a second input-control routine
function formatter_linelength(text) returns integer
which is defined to return -1 if the input isn't a complete line yet
(i.e., read some more data, append to the buffer, and try again), or
= 0 to indicate that the first N bytes of the buffer represent a
complete line to be passed off to formatter_in. I don't see a way to
combine formatter_in and formatter_linelength into a single function
without relying on "out" parameters, which would again confine the
feature to format functions written in C.
It's a tad annoying that we need two functions for input. One way that
we could still keep the COPY option syntax to be just
FORMAT csv
is to create an arbitrary difference in the signatures of the input
functions. Then we could have coexisting functions
csv(text[]) returns text
csv(text) returns text[]
csv(text, ...) returns int
that are referenced by "FORMAT csv".
regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes:
Lee Kindness wrote:
To be honest this idea strikes me as overkill - over
engineering.I agree. My modest proposal for handling CSVs would be to extend the
DELIMITER parameter to allow up to 3 characters - separator, quote and
escape. Escape would default to the quote char and the quote char would
default to unspecified.
I could go with that too. The question here is do we have any popular
use-cases that aren't solved by that extension, but could be solved by
simple user-level data formatting functions? I'm not real eager to add
such a feature as an "if we build it they will come" speculation, but
if anyone can point to solid use-cases besides handling CSV, then it
probably is worth doing.
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Lee Kindness wrote:
To be honest this idea strikes me as overkill - over
engineering.I agree. My modest proposal for handling CSVs would be to extend the
DELIMITER parameter to allow up to 3 characters - separator, quote and
escape. Escape would default to the quote char and the quote char would
default to unspecified.I could go with that too. The question here is do we have any popular
use-cases that aren't solved by that extension, but could be solved by
simple user-level data formatting functions? I'm not real eager to add
such a feature as an "if we build it they will come" speculation, but
if anyone can point to solid use-cases besides handling CSV, then it
probably is worth doing.
The thing I liked about Andrew's idea is that it even covers escape
quoting for CVS, which might change from implementation to implentation,
and it is flexible without requiring C coding.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Karel, Andrew, Fernando:
On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
Karel Zak <zakkr@zf.jcu.cz> writes:
The formatting function API can be pretty simple:
text *my_copy_format(text *attrdata, int direction,
int nattrs, int attr, oid attrtype, oid relation)
No offense, but isn't this whole thing more appropriate for a client program?
Like the pg_import and pg_export projects on GBorg? Has anyone looked at
those projects?
I can see making a special provision for CSV in COPY, just because it's such a
universal format. But I personally don't see that a complex, sophisticated
import/export formatter belongs on the SQL command line. Particularly since
most users will want a GUI to handle it.
And, BTW, I deal with CSV *all the time* for my insurance clients, and I can
tell you that that format hasn't changed in 20 years. We can hard-code it
if it's easier.
--
-Josh Berkus
Aglio Database Solutions
San Francisco
Import Notes
Reply to msg id not found: auto-000004647277@davinci.ethosmedia.comReference msg id not found: auto-000004647277@davinci.ethosmedia.com | Resolved by subject fallback