parallel pg_restore design issues

Started by Andrew Dunstan over 17 years ago (2 messages)
#1 Andrew Dunstan
andrew@dunslane.net

There are a couple of open questions for parallel pg_restore.

First, we need a way to decide the boundary between the serially run
"pre-data" section and the remainder of the items in the TOC. Currently
the code uses the first TABLEDATA item as the boundary. That's not
terribly robust (what if there aren't any?). Also, people have wanted to
steer clear of hardcoding much knowledge of archive member types into
pg_restore as a way of future-proofing it somewhat. I'm wondering if we
should have pg_dump explicitly mark items as pre-data, data, or post-data.
For legacy archives we could still check for either a TABLEDATA item or
something known to sort after those (i.e. a BLOB, BLOB COMMENT,
CONSTRAINT, INDEX, RULE, TRIGGER or FK CONSTRAINT item).
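
To make that concrete, here is roughly what I have in mind (the section
enum, the has_marker argument, and the exact desc strings below are
purely illustrative, not something the archive format carries today):

#include <string.h>
#include <stdbool.h>

typedef enum
{
    SECTION_PRE_DATA,
    SECTION_DATA,
    SECTION_POST_DATA
} TocSection;

/*
 * New-style archives would carry an explicit section written by pg_dump;
 * for legacy archives we fall back to inspecting the entry type, as
 * described above.
 */
static TocSection
toc_entry_section(const char *desc, bool has_marker, TocSection marked)
{
    if (has_marker)
        return marked;

    if (strcmp(desc, "TABLE DATA") == 0)
        return SECTION_DATA;

    if (strcmp(desc, "BLOB") == 0 ||
        strcmp(desc, "BLOB COMMENT") == 0 ||
        strcmp(desc, "CONSTRAINT") == 0 ||
        strcmp(desc, "INDEX") == 0 ||
        strcmp(desc, "RULE") == 0 ||
        strcmp(desc, "TRIGGER") == 0 ||
        strcmp(desc, "FK CONSTRAINT") == 0)
        return SECTION_POST_DATA;

    return SECTION_PRE_DATA;
}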

Another item we have already discussed is how to prevent concurrent
processes from trying to take conflicting locks. Here we really can't
rely on pg_dump to help us out, as lock requirements might change (a
little bird has already whispered in my ear about reducing the strength
of FK CONSTRAINT locks taken). I haven't got a really good answer here.
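
The best I can come up with so far is to track which table(s) each
running item needs to lock and simply not dispatch a second item that
touches the same table. Something like the sketch below, where all the
bookkeeping is hypothetical:

#include <stdbool.h>

#define MAX_RUNNING 64

/* catalog ids of tables touched by items currently running in workers */
static int running_table_ids[MAX_RUNNING];
static int n_running = 0;

/*
 * Before dispatching a TOC entry, check whether any already-running item
 * operates on one of the same tables; if so, skip it for now rather than
 * risk a lock conflict in the backend.
 */
static bool
conflicts_with_running(const int *item_table_ids, int n_tables)
{
    for (int i = 0; i < n_tables; i++)
        for (int j = 0; j < n_running; j++)
            if (item_table_ids[i] == running_table_ids[j])
                return true;
    return false;
}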

Last, there is the question of what algorithm to use in choosing the next
item to run. Currently, I am using "next item in the queue whose
dependencies have been met", with no queue reordering.

Another possible algorithm would reorder the queue by elevating any item
whose dependencies have been met. This will mean all the indexes for a
table will tend to be grouped together, which might well be a good
thing, and will tend to limit the tendency to do all the data loading at
once.

Both of these could be modified by explicitly limiting TABLEDATA items
to a certain proportion (say, one quarter) of the processing slots
available, if other items are available.
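
As a sketch of how such a cap might look in the scheduling loop (the
queue structure and names are made up for illustration):

#include <string.h>
#include <stdbool.h>

typedef struct
{
    const char *desc;      /* e.g. "TABLE DATA", "INDEX" */
    bool        deps_met;  /* all dependencies restored? */
    bool        done;      /* already dispatched or finished */
} QueueItem;

static int
pick_next_item(QueueItem *queue, int nitems,
               int slots_total, int data_slots_in_use)
{
    bool data_capped = (data_slots_in_use >= slots_total / 4);

    /* First choice: anything ready that is not a (capped) data load. */
    for (int i = 0; i < nitems; i++)
    {
        QueueItem *it = &queue[i];

        if (it->done || !it->deps_met)
            continue;
        if (data_capped && strcmp(it->desc, "TABLE DATA") == 0)
            continue;
        return i;
    }

    /* Nothing else is available, so allow a data item after all. */
    for (int i = 0; i < nitems; i++)
        if (!queue[i].done && queue[i].deps_met)
            return i;

    return -1;              /* nothing ready right now */
}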

I'm actually somewhat inclined to make provision for all of these
possibilities via a command line option, with the first being the
default. One size doesn't fit all, I suspect, and if it does we'll need
lots of data before deciding what that size is. The extra logic won't
really involve all that much code, and it will all be confined to a
couple of functions.

Thoughts?

cheers

andrew

#2 Philip Warner
pjw@rhyme.com.au
In reply to: Andrew Dunstan (#1)
Re: parallel pg_restore design issues

Another 0.02c, bringing the grand total to 0.04c.

Andrew Dunstan wrote:

First, we need a way to decide the boundary between the serially run
"pre-data" section and the remainder of the items in the TOC.
Currently the code uses the first TABLEDATA item as the boundary.
That's not terribly robust (what if there aren't any?).

Drawing on your later comments, and bearing in mind that it is always
recommended to use the *new* pg_dump to upgrade, I'd suggest adding some
flags to the header of the dump that say 'contains data' (might already
be there), then adding flags to TOC entries that indicate useful
properties. ISTM there are effectively 5 types of TOC entry which can be
characterised by: table definitions (including schema -- anything
defining the structure of the db), data load, performance items (index
definitions), integrity items (constraints and triggers) and function
definitions (usually part of an application interface or triggers).

So, based on the current needs, adding:

- isData
- isStructural (must be a better name) -- for anything that is
required *before* data can be loaded.
- isPerformance (likely to improve database performance after they
are applied)

would seem to satisfy your need.
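
In code terms I'm picturing something along these lines (the names and
representation are made up on the spot):

/* Hypothetical per-entry property flags */
#define TOC_IS_DATA        (1 << 0)   /* data load */
#define TOC_IS_STRUCTURAL  (1 << 1)   /* required *before* data can be loaded */
#define TOC_IS_PERFORMANCE (1 << 2)   /* indexes etc.; speed up later items */

typedef struct
{
    const char  *tag;    /* object name */
    unsigned int flags;  /* combination of the TOC_IS_* bits */
} TocEntryProps;

/* pg_restore can then ask generic questions without hardcoding
 * knowledge of individual entry types: */
static int
restore_before_data(const TocEntryProps *e)
{
    return (e->flags & TOC_IS_STRUCTURAL) != 0;
}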

I'm making this up as I go along, so adding any other set of flags that
abstracts the questions you are asking in a generic and useful way would
be fine. I'm pretty sure we have places already in pg_dump that could
use such flags.

Also, people have wanted to steer clear of hardcoding much knowledge
of archive member types into pg_restore as a way of future-proofing it
somewhat. I'm wondering if we should have pg_dump explicitly mark
items as pre-data, data, or post-data. For legacy archives we could
still check for either a TABLEDATA item or something known to sort
after those (i.e. a BLOB, BLOB COMMENT, CONSTRAINT, INDEX, RULE,
TRIGGER or FK CONSTRAINT item).

This is, at least to some extent, answered above. Anything that is not
isData or isStructural is post-data.

Another item we have already discussed is how to prevent concurrent
processes from trying to take conflicting locks. Here we really can't
rely on pg_dump to help us out, as lock requirements might change (a
little bird has already whispered in my ear about reducing the
strength of FK CONSTRAINT locks taken). I haven't got a really good
answer here.

Several things occurred to me here.

- you need to avoid shooting yourself in the foot by, for example,
trying to define an FK before relevant indexes are defined. This is
probably one area where checking the TOC entry type becomes necessary,
unless we create an 'isPerformance' flag (see above), in which case
you load all isPerformance entries immediately after data is loaded.
AFAICT, isPerformance becomes a synonym for entry_type='INDEX' (not
sure), but adding the flag (required for all TOC entries) will aid in
future-proofing in ways that checking entry type values does not.

- It would be interesting to see benchmarks of pg_restore where
*anything* that shares dependencies is disallowed from running
concurrently, vs. anything that's not an index, vs. anything that's not
'isPerformance'.

- allowing lock failures might help; i.e. if a statement produces a lock
failure, just wait a bit and put it back at the end of the queue. If it
fails a second time, mark it as 'single-thread-only'. It would be
interesting to see if this improved the pre-data load, or made it slower.
(A rough sketch of this retry logic follows after this list.)

- Ultimately, I suspect trying to second-guess backend locking will be
a hard-fought battle that will not be worth the resources expended.
Getting 95% of the way there will be good enough so long as lock
conflicts do not crash pg_restore (see previous point).
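
For the retry idea in the third point above, I'm picturing something as
simple as this (the per-item state is, again, made up):

#include <stdbool.h>

typedef struct
{
    int  retries;             /* lock failures so far */
    bool single_thread_only;  /* run it alone once other workers are idle */
} RetryState;

/*
 * On a lock failure, requeue the item once; on the second failure, mark
 * it so it is only run when nothing else is active.
 */
static void
handle_lock_failure(RetryState *st)
{
    st->retries++;
    if (st->retries >= 2)
        st->single_thread_only = true;
    /* otherwise: put the item back at the end of the queue and move on */
}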

Another possible algorithm would reorder the queue by elevating any
item whose dependencies have been met.

Bad idea; FKs benefit from indexes.

I think, based on the benefit provided by 'isPerformance' items (by
definition), you probably need to introduce a choke point to ensure all
isPerformance items have been executed before moving on to other items.
The locking strategy above may effectively do that (shared dependencies
will be disallowed unless both are 'isPerformance').
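
As a sketch of that choke point (illustrative only):

#include <stdbool.h>

/*
 * Hold back non-performance items until every isPerformance item
 * (index builds etc.) has completed.
 */
static bool
may_dispatch(bool item_is_performance, int performance_items_remaining)
{
    if (item_is_performance)
        return true;                          /* index builds may proceed */
    return performance_items_remaining == 0;  /* others wait at the choke point */
}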

This will mean all the indexes for a table will tend to be grouped
together, which might well be a good thing, and will tend to limit the
tendency to do all the data loading at once.

Or it might be a bad thing. Not sure why you can't intermix data and
schema items ('isStructural'), though.

And, I'm curious: what is wrong with loading all the data at once? That
will be the result, I suspect, for simple queueing reasons: data takes a
long time, so eventually, without a scheduler and explicit thread limits
(see below), all threads will be loading data.

Both of these could be modified by explicitly limiting TABLEDATA items
to a certain proportion (say, one quarter) of the processing slots
available, if other items are available.

Yep. But is there a benefit?

I have not played with this patch, but does it dump stats of thread
usage (stalled, executing, etc.)?

I think there could be some benefit in reordering stuff, but it is not
clear to me what the best ordering would be.

I'm actually somewhat inclined to make provision for all of these
possibilities via a command line option, with the first being the
default. One size doesn't fit all, I suspect, and if it does we'll
need lots of data before deciding what that size is. The extra logic
won't really involve all that much code, and it will all be confined
to a couple of functions.

Keeping it encapsulated with the default being a NOP sounds good to me.
Random might even be useful.