Restoring large tables with COPY

Started by Marko Kreen about 24 years ago (10 messages)
#1 Marko Kreen
marko@l-t.ee

Maybe I am missing something obvious, but I am unable to load
larger tables (~300k rows) with the COPY command that pg_dump
produces by default. Yes, dumping as INSERTs works, but it is slow.

"Cant" as in "it does not work with the default setup I have
running on devel machine" - 128M mem, 128M swap, basically
default postgresql.conf:

1) Too few WAL files.
- well, increase the wal_files (eg to 32),

2) Machine runs out of swap, PostgreSQL seems to keep the whole TX
in memory.
- So I must add 1G of swap? But what if I have 1G of rows?

Or shortly: during pg_restore the resource requirements are
an order of magnitude higher than during pg_dump, which is
non-obvious and may be a bad surprise when you are in real trouble.

This is annoying, especially as dumping as COPY should be
preferred, as it is faster and smaller. Of course,
dump-as-INSERTs also has a positive side, e.g. doing ALTER TABLE
DROP COLUMN with sed...

The patch below implements a '-m NUM' switch for pg_dump, which
splits each COPY command into chunks of at most NUM rows.
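For example, with '-m 2' and a small made-up table, the output
would look roughly like this (hypothetical table name and data,
just to show where the COPY gets split):

  COPY "mytable" FROM stdin;
  1	foo
  2	bar
  \.
  COPY "mytable" FROM stdin;
  3	baz
  \.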

Comments? What am I missing?

--
marko

Index: doc/src/sgml/ref/pg_dump.sgml
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/doc/src/sgml/ref/pg_dump.sgml,v
retrieving revision 1.41
diff -u -c -r1.41 pg_dump.sgml
*** doc/src/sgml/ref/pg_dump.sgml	8 Dec 2001 03:24:37 -0000	1.41
--- doc/src/sgml/ref/pg_dump.sgml	11 Dec 2001 03:58:30 -0000
***************
*** 35,40 ****
--- 35,41 ----
     <arg>-f <replaceable>file</replaceable></arg> 
     <arg>-F <replaceable>format</replaceable></arg>
     <arg>-i</arg>
+    <arg>-m <replaceable>num_rows</replaceable></arg>
     <group> <arg>-n</arg> <arg>-N</arg> </group>
     <arg>-o</arg>
     <arg>-O</arg>
***************
*** 301,306 ****
--- 302,321 ----
  	if you need to override the version check (and if
  	<command>pg_dump</command> then fails, don't
  	say you weren't warned).
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry>
+       <term>-m <replaceable class="parameter">num_rows</replaceable></term>
+       <term>--maxrows=<replaceable class="parameter">num_rows</replaceable></term>
+       <listitem>
+        <para>
+ 	Set the maximum number of rows to put into one COPY statement.
+ 	A new COPY command is started after every
+ 	<replaceable class="parameter">num_rows</replaceable> rows.
+ 	This is useful on large tables to avoid restoring the whole table
+ 	in one transaction, which may consume a lot of resources.
         </para>
        </listitem>
       </varlistentry>
Index: src/bin/pg_dump/pg_dump.c
===================================================================
RCS file: /opt/cvs/pgsql/pgsql/src/bin/pg_dump/pg_dump.c,v
retrieving revision 1.236
diff -u -c -r1.236 pg_dump.c
*** src/bin/pg_dump/pg_dump.c	28 Oct 2001 06:25:58 -0000	1.236
--- src/bin/pg_dump/pg_dump.c	11 Dec 2001 04:48:42 -0000
***************
*** 116,121 ****
--- 116,123 ----
  bool		dataOnly;
  bool		aclsSkip;
+ int			g_max_copy_rows = 0;
+ 
  char		g_opaque_type[10];	/* name for the opaque type */
  /* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
  				 "  -h, --host=HOSTNAME      database server host name\n"
  				 "  -i, --ignore-version     proceed even when server version mismatches\n"
  				 "                           pg_dump version\n"
+ 				 "  �m, --maxrows=NUM        max rows in one COPY command\n"
  	"  -n, --no-quotes          suppress most quotes around identifiers\n"
  	 "  -N, --quotes             enable most quotes around identifiers\n"
  				 "  -o, --oids               include oids in dump\n"
***************
*** 187,192 ****
--- 190,196 ----
  				 "                           pg_dump version\n"
  	"  -n                       suppress most quotes around identifiers\n"
  	 "  -N                       enable most quotes around identifiers\n"
+ 				 "  �m NUM                   max rows in one COPY command\n"
  				 "  -o                       include oids in dump\n"
  				 "  -O                       do not output \\connect commands in plain\n"
  				 "                           text format\n"
***************
*** 244,249 ****
--- 248,255 ----
  	int			ret;
  	bool		copydone;
  	char		copybuf[COPYBUFSIZ];
+ 	int			cur_row;
+ 	int			linestart;
  	if (g_verbose)
  		write_msg(NULL, "dumping out the contents of table %s\n", classname);
***************
*** 297,302 ****
--- 303,310 ----
  		else
  		{
  			copydone = false;
+ 			linestart = 1;
+ 			cur_row = 0;
  			while (!copydone)
  			{
***************
*** 310,316 ****
--- 318,338 ----
  				}
  				else
  				{
+ 					/*
+ 					 * Avoid too large transactions by breaking them up.
+ 					 */
+ 					if (g_max_copy_rows > 0 && linestart
+ 							&& cur_row >= g_max_copy_rows)
+ 					{
+ 						cur_row = 0;
+ 						archputs("\\.\n", fout);
+ 						archprintf(fout, "COPY %s %sFROM stdin;\n",
+ 								fmtId(classname, force_quotes),
+ 								(oids && hasoids) ? "WITH OIDS " : "");
+ 					}
+ 
  					archputs(copybuf, fout);
+ 					
  					switch (ret)
  					{
  						case EOF:
***************
*** 318,325 ****
--- 340,350 ----
  							/* FALLTHROUGH */
  						case 0:
  							archputc('\n', fout);
+ 							cur_row++;
+ 							linestart = 1;
  							break;
  						case 1:
+ 							linestart = 0;
  							break;
  					}
  				}
***************
*** 696,701 ****
--- 721,727 ----
  		{"compress", required_argument, NULL, 'Z'},
  		{"help", no_argument, NULL, '?'},
  		{"version", no_argument, NULL, 'V'},
+ 		{"maxrows", required_argument, NULL, 'm'},

/*
* the following options don't have an equivalent short option
***************
*** 748,756 ****
}

#ifdef HAVE_GETOPT_LONG
! while ((c = getopt_long(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
#else
! while ((c = getopt(argc, argv, "abcCdDf:F:h:inNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
#endif

  	{
--- 774,782 ----
  	}

#ifdef HAVE_GETOPT_LONG
! while ((c = getopt_long(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?", long_options, &optindex)) != -1)
#else
! while ((c = getopt(argc, argv, "abcCdDf:F:h:im:nNoOp:RsS:t:uU:vWxX:zZ:V?-")) != -1)
#endif

  	{
***************
*** 798,803 ****
--- 824,833 ----
  			case 'i':			/* ignore database version mismatch */
  				ignore_version = true;
+ 				break;
+ 
+ 			case 'm':
+ 				g_max_copy_rows = atoi(optarg);
  				break;

case 'n': /* Do not force double-quotes on

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#1)
Re: Restoring large tables with COPY

Marko Kreen <marko@l-t.ee> writes:

Maybe I am missing something obvious, but I am unable to load
larger tables (~300k rows) with the COPY command that pg_dump
produces by default.

I'd like to find out what the problem is, rather than work around it
with such an ugly hack.

1) Too few WAL files.
- well, increase the wal_files (eg to 32),

What PG version are you running? 7.1.3 or later should not have a
problem with WAL file growth.

2) Machine runs out of swap, PostgreSQL seems to keep the whole TX
in memory.

That should not happen either. Could we see the full schema of the
table you are having trouble with?

Or shortly: during pg_restore the resource requirements are
an order of magnitude higher than during pg_dump,

We found some client-side memory leaks in pg_restore recently; is that
what you're talking about?

regards, tom lane

#3 Marko Kreen
marko@l-t.ee
In reply to: Tom Lane (#2)
Re: Restoring large tables with COPY

On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:

Marko Kreen <marko@l-t.ee> writes:

Maybe I am missing something obvious, but I am unable to load
larger tables (~300k rows) with the COPY command that pg_dump
produces by default.

I'd like to find out what the problem is, rather than work around it
with such an ugly hack.

1) Too few WAL files.
- well, increase the wal_files (eg to 32),

What PG version are you running? 7.1.3 or later should not have a
problem with WAL file growth.

7.1.3

2) Machine runs out of swap, PostgreSQL seems to keep the whole TX
in memory.

That should not happen either. Could we see the full schema of the
table you are having trouble with?

Well, there are several such tables; I will reproduce it,
then send the schema. I guess it's the first one, but maybe
not. postgres gets killed by the Linux OOM handler, so I can't
tell from the messages which one it was. (hmm, I should probably
run it as psql -q -a > log).

Or shortly: during pg_restore the resource requirements are
an order of magnitude higher than during pg_dump,

We found some client-side memory leaks in pg_restore recently; is that
what you're talking about?

No, it's the postgres process that's memory-hungry; it happens
with "psql < db.dump" too.

If I run a dump that was produced with "pg_dump -m 5000", then
memory use loops between 20M and 10M, which is much better. (The
10M depends on shared_buffers, I guess.)
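
For reference, the exact commands I mean (database and file names
are just placeholders):

  # split dump: at most 5000 rows per COPY command (patched pg_dump)
  pg_dump -m 5000 mydb > db.dump

  # restore, echoing every statement into a log to see where it dies
  psql -q -a mydb < db.dump > log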

--
marko

#4 Marko Kreen
marko@l-t.ee
In reply to: Tom Lane (#2)
1 attachment(s)
Re: Restoring large tables with COPY

On Tue, Dec 11, 2001 at 10:55:30AM -0500, Tom Lane wrote:

Marko Kreen <marko@l-t.ee> writes:

Maybe I am missing something obvious, but I am unable to load
larger tables (~300k rows) with the COPY command that pg_dump
produces by default.

I'd like to find out what the problem is, rather than work around it
with such an ugly hack.

Hmm, the problem was more 'interesting' than I thought.
Basically:

1) pg_dump of 7.1.3 dumps constraints and primary keys
with table defs in this case, so they are run during COPY.
2) I have some tricky CHECK constraints.

Look at the attached Python script, it reproduces the problem.
Sorry, cannot test on 7.2 at the moment.
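
The pattern is roughly this (a simplified sketch with a dummy check
body, not my real schema; the attached script has the actual
reproduction):

  CREATE FUNCTION check_code(text) RETURNS boolean AS '
      SELECT length($1) = 10;
  ' LANGUAGE 'sql';

  CREATE TABLE codes (
      code text CHECK (check_code(code))
  );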

--
marko

Attachments:

xx.py (text/plain; charset=us-ascii)
#5 Serguei Mokhov
sa_mokho@alcor.concordia.ca
In reply to: Marko Kreen (#1)
Re: Restoring large tables with COPY

----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 10:10 AM

If this thing ever gets through, shouldn't this

/* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
"  -h, --host=HOSTNAME      database server host name\n"
"  -i, --ignore-version     proceed even when server version mismatches\n"
"                           pg_dump version\n"
+ "  �m, --maxrows=NUM        max rows in one COPY command\n"

say '-m'

+ " �m NUM max rows in one COPY command\n"

and this one too?

#6 Marko Kreen
marko@l-t.ee
In reply to: Serguei Mokhov (#5)
Re: Restoring large tables with COPY

On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:

If this thing ever gets through, shouldn't this

/* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
"  -h, --host=HOSTNAME      database server host name\n"
"  -i, --ignore-version     proceed even when server version mismatches\n"
"                           pg_dump version\n"
+ "  �m, --maxrows=NUM        max rows in one COPY command\n"

say '-m'

+ " �m NUM max rows in one COPY command\n"

and this one too?

One is for systems that have 'getopt_long', the second for
short-getopt-only ones. The '-h, --host=HOSTNAME' means
that '-h HOSTNAME' and '--host=HOSTNAME' are the same.

--
marko

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marko Kreen (#4)
Re: Restoring large tables with COPY

Marko Kreen <marko@l-t.ee> writes:

Look at the attached Python script, it reproduces the problem.

Hmm. You'd probably have much better luck if you rewrote the check_code
function in plpgsql: that should eliminate the memory-leak problem, and
also speed things up because plpgsql knows about caching query plans
across function calls. IIRC, sql functions don't do that.
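
Roughly along these lines (just a sketch with a dummy body, not your
real check; note that plpgsql has to be installed in the database
first, e.g. with createlang):

  -- plpgsql version of the check function (dummy body, for illustration)
  CREATE FUNCTION check_code(text) RETURNS boolean AS '
  BEGIN
      RETURN length($1) = 10;
  END;
  ' LANGUAGE 'plpgsql';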

The memory leakage is definitely a bug, but not one that is going to get
fixed for 7.2. It'll take some nontrivial work on the SQL function
executor...

regards, tom lane

#8 Serguei Mokhov
sa_mokho@alcor.concordia.ca
In reply to: Marko Kreen (#1)
Re: Restoring large tables with COPY

----- Original Message -----
From: Marko Kreen <marko@l-t.ee>
Sent: Tuesday, December 11, 2001 12:38 PM

On Tue, Dec 11, 2001 at 12:29:07PM -0500, Serguei Mokhov wrote:

If this thing ever gets through, shouldn't this

/* placeholders for the delimiters for comments */
***************
*** 151,156 ****
--- 153,159 ----
"  -h, --host=HOSTNAME      database server host name\n"
"  -i, --ignore-version     proceed even when server version mismatches\n"
"                           pg_dump version\n"
+ "  �m, --maxrows=NUM        max rows in one COPY command\n"

say '-m'

+ " �m NUM max rows in one COPY command\n"

and this one too?

One is for systems that have 'getopt_long', the second for
short-getopt-only ones. The '-h, --host=HOSTNAME' means
that '-h HOSTNAME' and '--host=HOSTNAME' are the same.

I know, I know. I was just trying to point out a typo :)
You forgot to add the '-' before 'm' in the messages.

#9 Marko Kreen
marko@l-t.ee
In reply to: Serguei Mokhov (#8)
Re: Restoring large tables with COPY

+ " �m, --maxrows=NUM max rows in one COPY command\n"

say '-m'

You forgot to add '-' in the messages before 'm'.

Ah. On my screen it looks a lot like a '-', but od shows 0xAD...
Well, that's VIM's digraph feature in action ;)

--
marko

#10 Marko Kreen
marko@l-t.ee
In reply to: Tom Lane (#7)
Re: Restoring large tables with COPY

On Tue, Dec 11, 2001 at 01:06:13PM -0500, Tom Lane wrote:

Marko Kreen <marko@l-t.ee> writes:

Look at the attached Python script, it reproduces the problem.

Hmm. You'd probably have much better luck if you rewrote the check_code
function in plpgsql: that should eliminate the memory-leak problem, and
also speed things up because plpgsql knows about caching query plans
across function calls. IIRC, sql functions don't do that.

And I thought that 'sql' was the more lightweight approach...

Thanks, now it seems to work.

--
marko