pg_dump is O(N) in DB table count N even if dumping only one table

Started by Gunnlaugur Thor Briemalmost 13 years ago7 messagesbugs
Jump to latest
#1Gunnlaugur Thor Briem
gunnlaugur@gmail.com

Hi,

pg_dump takes O(N) time dumping just one table (or a few) explicitly
specified with a -t parameter. It thus becomes painfully slow on a database
with very many tables.

(The use case is copying a few tables over to a test DB, from a large
production data warehouse.)

The three queries taking O(N) time are listed below. AFAICT each of these
queries could be filtered by table name/OID, at least when the number of
tables matching the -t parameters is small, allowing pg_dump to complete in
seconds rather than minutes.

SELECT c.tableoid, c.oid, c.relname, c.relacl, c.relkind, c.relnamespace,
(SELECT rolname FROM pg_catalog.pg_roles WHERE oid = c.relowner) AS
rolname, c.relchecks, c.relhastriggers, c.relhasindex, c.relhasrules,
c.relhasoids, c.relfrozenxid, tc.oid AS toid, tc.relfrozenxid AS
tfrozenxid, c.relpersistence, CASE WHEN c.reloftype <> 0 THEN
c.reloftype::pg_catalog.regtype ELSE NULL END AS reloftype, d.refobjid AS
owning_tab, d.refobjsubid AS owning_col, (SELECT spcname FROM pg_tablespace
t WHERE t.oid = c.reltablespace) AS reltablespace,
array_to_string(c.reloptions, ', ') AS reloptions,
array_to_string(array(SELECT 'toast.' || x FROM unnest(tc.reloptions) x),
', ') AS toast_reloptions FROM pg_class c LEFT JOIN pg_depend d ON
(c.relkind = 'S' AND d.classid = c.tableoid AND d.objid = c.oid AND
d.objsubid = 0 AND
d.refclassid = c.tableoid AND d.deptype = 'a') LEFT JOIN pg_class tc ON
(c.reltoastrelid = tc.oid) WHERE c.relkind in ('r', 'S', 'v', 'c', 'f')
ORDER BY c.oid

SELECT tableoid, oid, typname, typnamespace, '{=U}' AS typacl, (SELECT
rolname FROM pg_catalog.pg_roles WHERE oid = typowner) AS rolname,
typinput::oid AS typinput, typoutput::oid AS typoutput, typelem, typrelid,
CASE WHEN typrelid = 0 THEN ' '::"char" ELSE (SELECT relkind FROM pg_class
WHERE oid = typrelid) END AS typrelkind, typtype, typisdefined, typname[0]
= '_' AND typelem != 0 AND (SELECT typarray FROM pg_type te WHERE oid =
pg_type.typelem) = oid AS isarray FROM pg_type

SELECT classid, objid, refclassid, refobjid, deptype FROM pg_depend WHERE
deptype != 'p' AND deptype != 'e' ORDER BY 1,2

Cheers,

Gulli

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gunnlaugur Thor Briem (#1)
Re: pg_dump is O(N) in DB table count N even if dumping only one table

Gunnlaugur Thor Briem <gunnlaugur@gmail.com> writes:

pg_dump takes O(N) time dumping just one table (or a few) explicitly
specified with a -t parameter. It thus becomes painfully slow on a database
with very many tables.

This is not a bug. It needs information about all the tables anyway
to deal with dependencies (possible inheritance and similar situations).

Having said that, it does look like getTables is pulling back a lot of
info that we don't need *yet*, and would never need if we conclude we
don't need to dump the table. Possibly some of this work could usefully
be postponed to, say, getTableAttrs. OTOH, if that makes the normal
dump-everything case noticeably slower, it's unlikely such a patch would
get accepted.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3Andres Freund
andres@anarazel.de
In reply to: Gunnlaugur Thor Briem (#1)
Re: pg_dump is O(N) in DB table count N even if dumping only one table

On 2013-06-10 13:28:32 +0000, Gunnlaugur Thor Briem wrote:

Hi,

pg_dump takes O(N) time dumping just one table (or a few) explicitly
specified with a -t parameter. It thus becomes painfully slow on a database
with very many tables.

(The use case is copying a few tables over to a test DB, from a large
production data warehouse.)

The three queries taking O(N) time are listed below. AFAICT each of these
queries could be filtered by table name/OID, at least when the number of
tables matching the -t parameters is small, allowing pg_dump to complete in
seconds rather than minutes.

Which of those queries take how long in your case?

E.g. I could imagine that the first two can be sped up a bit without
changing the result. E.g. not calling pg_tablespace_location() for every
relation and such.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Gunnlaugur Thor Briem
gunnlaugur@gmail.com
In reply to: Tom Lane (#2)
Re: pg_dump is O(N) in DB table count N even if dumping only one table

OK, that was what I thought at first, but then I read this note in `man
pg_dump`:

When -t is specified, pg_dump makes no attempt to dump any other
database objects that the selected table(s) might depend upon.

so I supposed that that dependency information was *not* required. So I
posted the bug.

Is that note then outdated/incorrect? Or am I mistaken in finding it at
odds with your explanation?

Regards,

Gulli

On Mon, Jun 10, 2013 at 2:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Show quoted text

Gunnlaugur Thor Briem <gunnlaugur@gmail.com> writes:

pg_dump takes O(N) time dumping just one table (or a few) explicitly
specified with a -t parameter. It thus becomes painfully slow on a

database

with very many tables.

This is not a bug. It needs information about all the tables anyway
to deal with dependencies (possible inheritance and similar situations).

Having said that, it does look like getTables is pulling back a lot of
info that we don't need *yet*, and would never need if we conclude we
don't need to dump the table. Possibly some of this work could usefully
be postponed to, say, getTableAttrs. OTOH, if that makes the normal
dump-everything case noticeably slower, it's unlikely such a patch would
get accepted.

regards, tom lane

#5Gunnlaugur Thor Briem
gunnlaugur@gmail.com
In reply to: Andres Freund (#3)
Re: pg_dump is O(N) in DB table count N even if dumping only one table

On Mon, Jun 10, 2013 at 2:08 PM, Andres Freund <andres@2ndquadrant.com>wrote:

On 2013-06-10 13:28:32 +0000, Gunnlaugur Thor Briem wrote:

The three queries taking O(N) time are listed below. AFAICT each of these

Which of those queries take how long in your case?

They were as follows:

duration: 27770.917 ms statement: SELECT c.tableoid, c.oid, c.relname, ...
duration: 28133.407 ms statement: SELECT tableoid, oid, typname, ...
duration: 55751.996 ms statement: SELECT classid, objid, refclassid, ...

However, that was running pg_dump remotely over a slow network, and that
may exaggerate durations since it's a bunch of data. So I ran this again
locally on the production server:

duration: 16012.518 ms statement: SELECT c.tableoid, c.oid, c.relname, ...
duration: 16516.708 ms statement: SELECT tableoid, oid, typname, ...
duration: 13400.694 ms statement: SELECT classid, objid, refclassid, ...

Duration decrease is probably *not* significantly due to a warmer cache,
because I had two runs over the slow network and their durations were
similar (actually longer the second time).

Gulli

#6Andres Freund
andres@anarazel.de
In reply to: Gunnlaugur Thor Briem (#5)
Re: pg_dump is O(N) in DB table count N even if dumping only one table

Hi,

On 2013-06-10 14:24:14 +0000, Gunnlaugur Thor Briem wrote:

So I ran this again locally on the production server:

duration: 16012.518 ms statement: SELECT c.tableoid, c.oid, c.relname, ...
duration: 16516.708 ms statement: SELECT tableoid, oid, typname, ...
duration: 13400.694 ms statement: SELECT classid, objid, refclassid, ...

Duration decrease is probably *not* significantly due to a warmer cache,
because I had two runs over the slow network and their durations were
similar (actually longer the second time).

Hm. 13s for pg_depend locally vs 55s remotely. You need to have a
tremendous amount of dependencies.
Could you do a count(*) of pg_depend, pg_type and pg_class?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7Gunnlaugur Thor Briem
gunnlaugur@gmail.com
In reply to: Andres Freund (#6)
Re: pg_dump is O(N) in DB table count N even if dumping only one table

On Mon, Jun 10, 2013 at 2:38 PM, Andres Freund <andres@2ndquadrant.com>wrote:

Hm. 13s for pg_depend locally vs 55s remotely. You need to have a
tremendous amount of dependencies.
Could you do a count(*) of pg_depend, pg_type and pg_class?

Yep, there's rather a lot:

select count(*) from pg_depend;
7692365

select count(*) from pg_type;
1302222

select count(*) from pg_class;
1788727

It's a very-broad-scope heterogeneous data warehouse, hence the unusual
table count. PostgreSQL has displayed tremendous grace under this fire, the
only trouble we've encountered is small niggles like tab completion in psql
and pg_dump performance (and the latter was helped a lot by the pg_dump
perf improvements in 9.1.4 and 9.2, for larger dumps). All the same we will
undoubtedly end up spreading all this to more instances/clusters eventually
(the dependency structure allows for that).

Gulli