pg_dump object sorting

Started by Andrew Dunstan, about 18 years ago, 5 messages, hackers

Andrew Dunstan

andrew@dunslane.net

about 18 years ago

I have been looking at refining the sorting of objects in pg_dump to
make it take advantage of buffering and synchronised scanning, and
possibly make parallel restoration simpler and more efficient.

My first thought was to sort indexes by <namespace, tablename,
indexname> instead of by <namespace, indexname>. However, that doesn't
go far enough, I think. Is there any reason we can't do all of a table's
indexes and non-FK constraints together? Will that affect anything other
than PK and UNIQUE constraints, as NOT NULL and CHECK constraints are
included in table definitions?
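
The difference between the two sort keys can be sketched as follows. This is only an illustration in Python, not pg_dump's C code: the IndexEntry record, the helper functions, and the sample index names are all hypothetical.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    # Hypothetical record for illustration; pg_dump's real structures are C structs.
    namespace: str
    table: str
    name: str


def sort_by_index_name(entries):
    """Ordering described as the current one: <namespace, indexname>."""
    return sorted(entries, key=lambda e: (e.namespace, e.name))


def sort_by_table(entries):
    """Ordering proposed in the message: <namespace, tablename, indexname>."""
    return sorted(entries, key=lambda e: (e.namespace, e.table, e.name))


# Made-up sample data: two tables with two indexes each.
entries = [
    IndexEntry("public", "orders", "by_status"),
    IndexEntry("public", "customers", "by_email"),
    IndexEntry("public", "orders", "by_created"),
    IndexEntry("public", "customers", "by_name"),
]

by_name = [f"{e.table}.{e.name}" for e in sort_by_index_name(entries)]
by_table = [f"{e.table}.{e.name}" for e in sort_by_table(entries)]

print(by_name)   # indexes of the two tables are interleaved
print(by_table)  # each table's indexes are adjacent
```

With the table name in the key, all indexes of one table come out next to each other regardless of how the indexes themselves are named.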

cheers

andrew

Jeff Davis

pgsql@j-davis.com

about 18 years ago

In reply to: Andrew Dunstan (#1)

Re: pg_dump object sorting

On Mon, 2008-04-14 at 11:18 -0400, Andrew Dunstan wrote:

> I have been looking at refining the sorting of objects in pg_dump to
> make it take advantage of buffering and synchronised scanning, and
> possibly make parallel restoration simpler and more efficient.

Synchronized scanning is explicitly disabled in pg_dump. That was a
last-minute change to answer Greg Stark's complaint about dumping a
clustered table:

http://archives.postgresql.org/pgsql-hackers/2008-01/msg00987.php

That hopefully won't be a permanent solution, because I think
synchronized scans are useful for pg_dump.

However, I'm not clear on how the pg_dump order would be able to better
take advantage of synchronized scans anyway. What did you have in mind?

Regards,
Jeff Davis

Andrew Dunstan

andrew@dunslane.net

about 18 years ago

In reply to: Jeff Davis (#2)

Re: pg_dump object sorting

Jeff Davis wrote:

> On Mon, 2008-04-14 at 11:18 -0400, Andrew Dunstan wrote:
>
>> I have been looking at refining the sorting of objects in pg_dump to
>> make it take advantage of buffering and synchronised scanning, and
>> possibly make parallel restoration simpler and more efficient.
>
> Synchronized scanning is explicitly disabled in pg_dump. That was a
> last-minute change to answer Greg Stark's complaint about dumping a
> clustered table:
>
> http://archives.postgresql.org/pgsql-hackers/2008-01/msg00987.php
>
> That hopefully won't be a permanent solution, because I think
> synchronized scans are useful for pg_dump.
>
> However, I'm not clear on how the pg_dump order would be able to better
> take advantage of synchronized scans anyway. What did you have in mind?

I should have expressed it better. The idea is to have pg_dump emit the
objects in an order that allows the restore to take advantage of sync
scans. So sync scans being disabled in pg_dump would not at all matter.

cheers

andrew

Tom Lane

tgl@sss.pgh.pa.us

about 18 years ago

In reply to: Andrew Dunstan (#3)

Re: pg_dump object sorting

Andrew Dunstan <andrew@dunslane.net> writes:

> I should have expressed it better. The idea is to have pg_dump emit the
> objects in an order that allows the restore to take advantage of sync
> scans. So sync scans being disabled in pg_dump would not at all matter.

Unless you do something to explicitly parallelize the operations,
how will a different ordering improve matters?

I thought we had a paper design for this, and it involved teaching
pg_restore how to use multiple connections. In that context it's
entirely up to pg_restore to manage the ordering and ensure dependencies
are met. So I'm not seeing how it helps to have a different sort rule
at pg_dump time --- it won't really make pg_restore's task any easier.
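
The point that a dependency-aware restorer fixes its own order can be illustrated with a small sketch. It is not the paper design mentioned above and not pg_restore code: the item names and the restore_waves helper are hypothetical, and it relies on Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def restore_waves(deps):
    """Group items into waves. Every item in a wave has all of its
    dependencies in earlier waves, so the items of one wave could be
    handed to separate connections."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dump contents: each item maps to the items it depends on.
deps = {
    "fk orders_customer": {"index orders_pkey", "index customers_pkey"},
    "index orders_pkey": {"data orders"},
    "index customers_pkey": {"data customers"},
    "data orders": {"table orders"},
    "data customers": {"table customers"},
    "table orders": set(),
    "table customers": set(),
}

waves = restore_waves(deps)

# The same dependencies listed in the opposite order give the same waves.
reversed_deps = dict(reversed(list(deps.items())))
same_waves = restore_waves(reversed_deps)

for wave in waves:
    print(wave)
```

Because the schedule is derived from the dependency edges alone, the order in which the items were written to the archive does not change the result in this sketch.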

regards, tom lane

Andrew Dunstan

andrew@dunslane.net

about 18 years ago

In reply to: Tom Lane (#4)

Re: pg_dump object sorting

Tom Lane wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> I should have expressed it better. The idea is to have pg_dump emit the
>> objects in an order that allows the restore to take advantage of sync
>> scans. So sync scans being disabled in pg_dump would not at all matter.
>
> Unless you do something to explicitly parallelize the operations,
> how will a different ordering improve matters?
>
> I thought we had a paper design for this, and it involved teaching
> pg_restore how to use multiple connections. In that context it's
> entirely up to pg_restore to manage the ordering and ensure dependencies
> are met. So I'm not seeing how it helps to have a different sort rule
> at pg_dump time --- it won't really make pg_restore's task any easier.

Well, what actually got me going on this initially was that I got
annoyed by having indexes not grouped by table when I dumped out the
schema of a database, because it seemed a bit illogical. Then I started
thinking about it and it seemed to me that even without synchronised
scanning or parallel restoration, we might benefit from building all the
indexes of a given table together, especially if the whole table could
fit in either our cache or the OS cache.
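
As a rough way to picture the grouping argument, the sketch below collapses a serial index build order into runs of consecutive builds on the same table. It measures nothing about real cache behaviour; the table_runs helper and the sample build orders are hypothetical.

```python
from itertools import groupby


def table_runs(build_order):
    """Collapse consecutive index builds on the same table into
    (table, number_of_builds) runs."""
    return [
        (table, len(list(group)))
        for table, group in groupby(build_order, key=lambda item: item[0])
    ]


# Hypothetical serial build orders, as (table, index) pairs.
ordered_by_index_name = [
    ("orders", "by_created"),
    ("customers", "by_email"),
    ("customers", "by_name"),
    ("orders", "by_status"),
]
grouped_by_table = sorted(ordered_by_index_name)

interleaved_runs = table_runs(ordered_by_index_name)
grouped_runs = table_runs(grouped_by_table)

print(interleaved_runs)  # "orders" is visited in two separate runs
print(grouped_runs)      # each table is visited in a single run
```

A table that appears in more than one run is one that a serial restore would have to read again after working on something else, which is the case the grouping is meant to avoid.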

cheers

andrew