Proposal: new pg_dump options --copy-delimiter and --copy-null

Started by David Fetteralmost 20 years ago14 messages
#1David Fetter
david@fetter.org

Folks,

This came up at work...

I have seed database scripts quasi-generated from pg_dump which
include COPY statements, but the data is hard to edit (especially cut
& paste operations) when the COPY delimiter is some non-visible
character like \t. So I thought it would be handy to be able to
control the DELIMITER and NULL options in COPY statements that pg_dump
uses.

Although it would be nice to make CSV and its dependencies one of the
options, I'm not sure how pg_dump would handle the end-of-line
problem, so I've skipped that part in the patch I've put together.
The other option, FORCE QUOTE, doesn't make sense to me as a pg_dump
option, but I'm not the arbiter of these things.

With the patch, pg_dump would work exactly as usual without options,
but it now has two extra options: --copy-delimiter and --copy-null.

If set, these will be incorporated in COPY commands as appropriate.
--copy-delimiter accepts any single byte other than '\r' or '\n'.
--copy-null accepts any input. The patched pg_dump ignores --copy-*
options in cases where COPY wouldn't happen anyway.

What do you folks think?

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#1)
Re: Proposal: new pg_dump options --copy-delimiter and --copy-null

David Fetter <david@fetter.org> writes:

I have seed database scripts quasi-generated from pg_dump which
include COPY statements, but the data is hard to edit (especially cut
& paste operations) when the COPY delimiter is some non-visible
character like \t.

This seems like an awfully weak use-case for adding to pg_dump's already
overly complicated feature set. The difficulty of parsing COPY output
is not simplified by making the delimiter variable --- more likely the
reverse. Furthermore, it's quite unclear why you'd use pg_dump at all
to generate a data file that you intend to feed to some other program.
Seems to me that "psql -c 'COPY ...'" is a more likely front-end for
such a process.

regards, tom lane

#3David Fetter
david@fetter.org
In reply to: Tom Lane (#2)
1 attachment(s)
Re: Proposal: new pg_dump options --copy-delimiter and --copy-null

On Thu, Jan 26, 2006 at 10:17:05PM -0500, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

I have seed database scripts quasi-generated from pg_dump which
include COPY statements, but the data is hard to edit (especially
cut & paste operations) when the COPY delimiter is some
non-visible character like \t.

This seems like an awfully weak use-case for adding to pg_dump's
already overly complicated feature set.

Those who don't use it will never see it.

The difficulty of parsing COPY output is not simplified by making
the delimiter variable --- more likely the reverse.

It's fairly straight-forward.

Furthermore, it's quite unclear why you'd use pg_dump at all to
generate a data file that you intend to feed to some other program.

In my case, it's about being copy/paste friendly.

Seems to me that "psql -c 'COPY ...'" is a more likely front-end for
such a process.

Actually, it's not. I'm attaching my preliminary patch, as I see I
haven't explained it well enough.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

Attachments:

pg_dump_copy.difftext/plain; charset=us-asciiDownload
? pg_dump_copy.diff
Index: doc/src/sgml/ref/pg_dump.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.81
diff -c -r1.81 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml	1 Nov 2005 21:09:50 -0000	1.81
--- doc/src/sgml/ref/pg_dump.sgml	27 Jan 2006 02:22:41 -0000
***************
*** 163,168 ****
--- 163,188 ----
       </varlistentry>
  
       <varlistentry>
+       <term><option>--copy-delimiter=<replaceable class="parameter">delimiter</replaceable></option></term>
+       <listitem>
+        <para>
+         Use <replaceable class="parameter">delimiter</replaceable>
+         instead of the default tab character in <command>COPY</command> statements. 
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
+       <term><option>--copy-null=<replaceable class="parameter">string_for_nulls</replaceable></option></term>
+       <listitem>
+        <para>
+         Use  <replaceable class="parameter">string_for_nulls</replaceable> instead of the
+         default \N in <command>COPY</command> statements.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
        <term><option>-d</option></term>
        <term><option>--inserts</option></term>
        <listitem>
Index: src/bin/pg_dump/pg_dump.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.427
diff -c -r1.427 pg_dump.c
*** src/bin/pg_dump/pg_dump.c	21 Jan 2006 02:16:20 -0000	1.427
--- src/bin/pg_dump/pg_dump.c	27 Jan 2006 02:22:48 -0000
***************
*** 111,116 ****
--- 111,121 ----
  /* flag to turn on/off dollar quoting */
  static int	disable_dollar_quoting = 0;
  
+ /* Things used when caller invokes COPY options. */
+ const char *copy_delimiter_default = "\t";
+ const char *copy_delimiter = "\t";
+ const char *copy_null_default = "\\N";
+ const char *copy_null = "\\N";
  
  static void help(const char *progname);
  static NamespaceInfo *findNamespace(Oid nsoid, Oid objoid);
***************
*** 246,251 ****
--- 251,265 ----
  		{"disable-triggers", no_argument, &disable_triggers, 1},
  		{"use-set-session-authorization", no_argument, &use_setsessauth, 1},
  
+ 		/*
+ 		 * The following options don't have an equivalent short option
+ 		 * letter, and are not available as -X long-name.  Use the
+ 		 * long form instead.
+ 		 */
+ 
+ 		{"copy-delimiter", required_argument, NULL, 2},
+ 		{"copy-null", required_argument, NULL, 3},
+ 
  		{NULL, 0, NULL, 0}
  	};
  	int			optindex;
***************
*** 414,419 ****
--- 428,453 ----
  				break;
  				/* This covers the long options equivalent to -X xxx. */
  
+ 			case 2:
+ 				copy_delimiter = strdup(optarg);
+ 				if (strlen(copy_delimiter) != 1)
+ 				{
+ 					fprintf(stderr, _("In %s, copy-delimiter must be exactly one byte long, not %d\n"),
+ 								progname, strlen(copy_delimiter));
+ 					exit(1);
+ 				}
+ 				if ( (*copy_delimiter == '\r') || (*copy_delimiter == '\n') )
+ 				{
+ 					fprintf(stderr, _("In %s, copy-delimiter may not be \\r or \\n.\n"),
+ 								progname);
+ 					exit(1);
+ 				}
+ 				break;
+ 
+ 			case 3:
+ 				copy_null = strdup(optarg);
+ 				break;
+ 
  			case 0:
  				break;
  
***************
*** 816,836 ****
  		column_list = fmtCopyColumnList(tbinfo);
  	else
  		column_list = "";		/* can't select columns in COPY */
  
- 	if (oids && hasoids)
- 	{
- 		appendPQExpBuffer(q, "COPY %s %s WITH OIDS TO stdout;",
- 						  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
- 										 classname),
- 						  column_list);
- 	}
- 	else
- 	{
- 		appendPQExpBuffer(q, "COPY %s %s TO stdout;",
- 						  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
- 										 classname),
- 						  column_list);
- 	}
  	res = PQexec(g_conn, q->data);
  	check_sql_result(res, g_conn, q->data, PGRES_COPY_OUT);
  
--- 850,869 ----
  		column_list = fmtCopyColumnList(tbinfo);
  	else
  		column_list = "";		/* can't select columns in COPY */
+ 	appendPQExpBuffer(q, "COPY %s %s %sTO stdout",
+ 					 fmtQualifiedId(tbinfo->dobj.namespace->dobj.name, classname),
+ 					 column_list,
+ 					 /* dump OIDs if requested */
+ 					 (oids && hasoids) ?  "WITH OIDS " : ""
+ 	);
+ 	/* Add a DELIMITER if copy_delimiter is not the default */
+ 	if (strcmp(copy_delimiter, copy_delimiter_default) != 0)
+ 		appendPQExpBuffer(q, " DELIMITER AS '%s'", copy_delimiter);
+ 	/* Add a NULL AS stanza if copy_null is not the default */
+ 	if (strcmp(copy_null,copy_null_default) != 0)
+ 		appendPQExpBuffer(q, " NULL AS '%s'", copy_null);
+ 	appendPQExpBuffer(q, ";");
  
  	res = PQexec(g_conn, q->data);
  	check_sql_result(res, g_conn, q->data, PGRES_COPY_OUT);
  
***************
*** 1085,1093 ****
  		/* must use 2 steps here 'cause fmtId is nonreentrant */
  		appendPQExpBuffer(copyBuf, "COPY %s ",
  						  fmtId(tbinfo->dobj.name));
! 		appendPQExpBuffer(copyBuf, "%s %sFROM stdin;\n",
! 						  fmtCopyColumnList(tbinfo),
! 					  (tdinfo->oids && tbinfo->hasoids) ? "WITH OIDS " : "");
  		copyStmt = copyBuf->data;
  	}
  	else
--- 1118,1135 ----
  		/* must use 2 steps here 'cause fmtId is nonreentrant */
  		appendPQExpBuffer(copyBuf, "COPY %s ",
  						  fmtId(tbinfo->dobj.name));
! 		appendPQExpBuffer(copyBuf, "%s %sFROM stdin",
! 						fmtCopyColumnList(tbinfo),
! 						(tdinfo->oids && tbinfo->hasoids) ? "WITH OIDS " : ""
! 		);
! 		/* Add DELIMITER AS stanza if not the default */
! 		if (strcmp(copy_delimiter, copy_delimiter_default) != 0)
! 			appendPQExpBuffer(copyBuf, " DELIMITER AS '%s'", copy_delimiter);
! 		/* Add NULL AS stanza if not the default */
! 		if (strcmp(copy_null, copy_null_default) != 0)
! 			appendPQExpBuffer(copyBuf, " NULL AS '%s'", copy_null);
! 		appendPQExpBuffer(copyBuf, ";\n");
! 
  		copyStmt = copyBuf->data;
  	}
  	else
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#3)
Re: Proposal: new pg_dump options --copy-delimiter and --copy-null

David Fetter <david@fetter.org> writes:

On Thu, Jan 26, 2006 at 10:17:05PM -0500, Tom Lane wrote:

Seems to me that "psql -c 'COPY ...'" is a more likely front-end for
such a process.

Actually, it's not. I'm attaching my preliminary patch, as I see I
haven't explained it well enough.

The patch conveys nothing you didn't explain already, and I still don't
see why one would use pg_dump instead of psql. If you use pg_dump then
you have to cope with a pile of random SQL and comments as well as the
actual data.

regards, tom lane

#5Joshua D. Drake
jd@commandprompt.com
In reply to: David Fetter (#3)
Re: Proposal: new pg_dump options --copy-delimiter and

Those who don't use it will never see it.

It does however add more maintenance to the code.

Furthermore, it's quite unclear why you'd use pg_dump at all to
generate a data file that you intend to feed to some other program.

In my case, it's about being copy/paste friendly.

David I don't get this... what are you copying from/to that would
wouldn't just script? If you throw into a script you can change
the delimiter on the fly using translation.

Actually, it's not. I'm attaching my preliminary patch, as I see I
haven't explained it well enough.

Perhaps a test case that shows the productivity of it? I am not
arguing whether or not this is a useful feature but I don't
see the purpose.

Sincerely,

Joshua D. Drake

Cheers,
D

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/

#6Greg Stark
gsstark@mit.edu
In reply to: Joshua D. Drake (#5)
Re: Proposal: new pg_dump options --copy-delimiter and

"Joshua D. Drake" <jd@commandprompt.com> writes:

David I don't get this... what are you copying from/to that would
wouldn't just script? If you throw into a script you can change
the delimiter on the fly using translation.

I think what he's getting at is for things like, say, a contrib package with a
README that includes an example. He wants to be able to say "just paste these
commands into psql".

The problems are a) there's no guarantee the data is safe to put through your
hypothetical tab-destroying copy/paste anyways. There could be tabs or other
unsafe characters in the data. b) We have no way of knowing which characters
are or aren't safe in your hypothetical copy/paste system. Why is tab unsafe
in the first place?

Personally I find anything that would encourage people to use anything other
than tabs evil anyways. All those people who think | is somehow a reasonable
choice or want to use commas and then get all confused trying to escape them
and invent ever more confused syntaxes for escaping the escape characters.
Just use tab separated data like man was meant to.

--
greg

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Greg Stark (#6)
Re: Proposal: new pg_dump options --copy-delimiter and

On Fri, 2006-01-27 at 13:12 -0500, Greg Stark wrote:

Personally I find anything that would encourage people to use anything other
than tabs evil anyways. All those people who think | is somehow a reasonable
choice or want to use commas and then get all confused trying to escape them
and invent ever more confused syntaxes for escaping the escape characters.
Just use tab separated data like man was meant to.

I could not disagree more. The invisibility of tabs makes their use as a
delimiter wholly evil. I have lost count of the number of corrupted
makefiles etc. I have seen because a tab got converted to a space and it
was impossible to tell.

More tears have been shed over tabs used than tabs unused. (Apologies to
St Theresa).

We now return you to normal -hacking.

cheers

andrew

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#7)
Re: Proposal: new pg_dump options --copy-delimiter and

Andrew Dunstan <andrew@dunslane.net> writes:

I could not disagree more. The invisibility of tabs makes their use as a
delimiter wholly evil. I have lost count of the number of corrupted
makefiles etc. I have seen because a tab got converted to a space and it
was impossible to tell.

More tears have been shed over tabs used than tabs unused. (Apologies to
St Theresa).

That line of argument leads to the suggestion that pg_dump should just
use something else (I'd vote for "|"), all the time, in order to produce
more robust dump files. I still don't see the argument for exposing
a switch though.

regards, tom lane

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: Proposal: new pg_dump options --copy-delimiter and

On Fri, 2006-01-27 at 13:43 -0500, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

I could not disagree more. The invisibility of tabs makes their use as a
delimiter wholly evil. I have lost count of the number of corrupted
makefiles etc. I have seen because a tab got converted to a space and it
was impossible to tell.

More tears have been shed over tabs used than tabs unused. (Apologies to
St Theresa).

That line of argument leads to the suggestion that pg_dump should just
use something else (I'd vote for "|"), all the time, in order to produce
more robust dump files. I still don't see the argument for exposing
a switch though.

If we regard them as suitable for human editing for normal use, yes.

cheers

andrew

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#9)
Re: Proposal: new pg_dump options --copy-delimiter and

Andrew Dunstan <andrew@dunslane.net> writes:

On Fri, 2006-01-27 at 13:43 -0500, Tom Lane wrote:

That line of argument leads to the suggestion that pg_dump should just
use something else (I'd vote for "|"), all the time, in order to produce
more robust dump files. I still don't see the argument for exposing
a switch though.

If we regard them as suitable for human editing for normal use, yes.

No, that actually was no part of my point. A pg_dump file that doesn't
use tabs is more likely to survive being emailed, for instance. I'm not
sure whether that is a large enough consideration to justify a change,
but you don't have to assume that anyone's intending to edit anything
to make the argument for it.

regards, tom lane

#11Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#10)
Re: Proposal: new pg_dump options --copy-delimiter and

On Fri, 2006-01-27 at 14:08 -0500, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

On Fri, 2006-01-27 at 13:43 -0500, Tom Lane wrote:

That line of argument leads to the suggestion that pg_dump should just
use something else (I'd vote for "|"), all the time, in order to produce
more robust dump files. I still don't see the argument for exposing
a switch though.

If we regard them as suitable for human editing for normal use, yes.

No, that actually was no part of my point. A pg_dump file that doesn't
use tabs is more likely to survive being emailed, for instance. I'm not
sure whether that is a large enough consideration to justify a change,
but you don't have to assume that anyone's intending to edit anything
to make the argument for it.

Not sure how much that matters in these days of MIME, but it's a fair
point nevertheless.

cheers

andrew

#12Greg Stark
gsstark@mit.edu
In reply to: Tom Lane (#10)
Re: Proposal: new pg_dump options --copy-delimiter and

Tom Lane <tgl@sss.pgh.pa.us> writes:

Andrew Dunstan <andrew@dunslane.net> writes:

On Fri, 2006-01-27 at 13:43 -0500, Tom Lane wrote:

That line of argument leads to the suggestion that pg_dump should just
use something else (I'd vote for "|"), all the time, in order to produce
more robust dump files. I still don't see the argument for exposing
a switch though.

If we regard them as suitable for human editing for normal use, yes.

No, that actually was no part of my point. A pg_dump file that doesn't
use tabs is more likely to survive being emailed, for instance.

Except it's not at all. It's perhaps more likely to load but load incorrectly
though. There's no guarantee there aren't tabs in the data for example. Or
once we start being this paranoid, there's no guarantee that some other
character won't survive.

The only place this line of argument ends up is with pg_dump automatically
base64 encoding its entire output.

--
greg

#13Bruce Momjian
pgman@candle.pha.pa.us
In reply to: David Fetter (#3)
Re: Proposal: new pg_dump options --copy-delimiter and --copy-null

David Fetter wrote:

On Thu, Jan 26, 2006 at 10:17:05PM -0500, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

I have seed database scripts quasi-generated from pg_dump which
include COPY statements, but the data is hard to edit (especially
cut & paste operations) when the COPY delimiter is some
non-visible character like \t.

This seems like an awfully weak use-case for adding to pg_dump's
already overly complicated feature set.

Those who don't use it will never see it.

Documentation itself is providing the option to the user and they have
to decide if it is useful. No visible feature has zero cost for our
userbase.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#14David Fetter
david@fetter.org
In reply to: Greg Stark (#6)
Re: Proposal: new pg_dump options --copy-delimiter and

On Fri, Jan 27, 2006 at 01:12:35PM -0500, Greg Stark wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

David I don't get this... what are you copying from/to that would
wouldn't just script? If you throw into a script you can change
the delimiter on the fly using translation.

The problems are a) there's no guarantee the data is safe to put through your
hypothetical tab-destroying copy/paste anyways. There could be tabs or other
unsafe characters in the data. b) We have no way of knowing which characters
are or aren't safe in your hypothetical copy/paste system. Why is tab unsafe
in the first place?

Personally I find anything that would encourage people to use anything other
than tabs evil anyways. All those people who think | is somehow a reasonable
choice or want to use commas and then get all confused trying to escape them
and invent ever more confused syntaxes for escaping the escape characters.
Just use tab separated data like man was meant to.

I'd say that the multiplicity of strong opinions on The Right
Delimiter is a pretty strong argument for having a switch to control
it.

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 415 235 3778

Remember to vote!