pg_dump test instability
During the development of an unrelated feature, I have experienced
failures in a pg_dump test case, specifically
t/002_pg_dump.pl ....... 1825/6052
# Failed test 'defaults_parallel: should not dump COPY
fk_reference_test_table second'
# at t/002_pg_dump.pl line 3454.
This test sets up two tables connected by a foreign key and checks that
a data_only dump dumps them ordered so that the primary key table comes
first.
But because of the way the tests are set up, it also checks that in all
other dumps (i.e., non-data_only) it does *not* dump them in that order.
This is kind of irrelevant to the test, but there is no way in the
pg_dump tests to express that certain scenarios should not be checked.
In a non-data_only dump, the order of the tables doesn't matter, because
the foreign keys are added at the very end. In parallel dumps, the
tables are in addition sorted by size, so the resultant order is
different from a single-threaded dump. This can be seen by comparing
the dumped TOCs of the defaults_dir_format and defaults_parallel cases.
But it all happens to pass the tests right now.
In my hacking I have added another test table to the pg_dump test set,
which seems to have thrown off the sorting and scheduling, so that the
two tables now happen to come out in primary-key-first order anyway,
which causes the test to fail.
I have developed the attached rough patch to add a third option to
pg_dump test cases: besides like and unlike, add a "skip" option to
disregard the result of the test.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Add-option-to-skip-certain-pg_dump-tests.patch (text/plain)
Greetings,
* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
During the development of an unrelated feature, I have experienced
failures in a pg_dump test case, specifically
t/002_pg_dump.pl ....... 1825/6052
# Failed test 'defaults_parallel: should not dump COPY
fk_reference_test_table second'
# at t/002_pg_dump.pl line 3454.
This test sets up two tables connected by a foreign key and checks that
a data_only dump dumps them ordered so that the primary key table comes
first.
But because of the way the tests are set up, it also checks that in all
other dumps (i.e., non-data_only) it does *not* dump them in that order.
This is kind of irrelevant to the test, but there is no way in the
pg_dump tests to express that certain scenarios should not be checked.
Hmmm, yeah, that's a good point. Most of the checks are set up that way
because it makes writing checks simpler and we have fewer of them, but
in this case we don't want that requirement to be levied on
non-data-only dumps.
In a non-data_only dump, the order of the tables doesn't matter, because
the foreign keys are added at the very end. In parallel dumps, the
tables are in addition sorted by size, so the resultant order is
different from a single-threaded dump. This can be seen by comparing
the dumped TOCs of the defaults_dir_format and defaults_parallel cases.
But it all happens to pass the tests right now.
Occasionally I get lucky, apparently. :) Though I think this might have
been different before I reworked those tests to be simpler.
In my hacking I have added another test table to the pg_dump test set,
which seems to have thrown off the sorting and scheduling, so that the
two tables now happen to come out in primary-key-first order anyway,
which causes the test to fail.
Ok.
I have developed the attached rough patch to add a third option to
pg_dump test cases: besides like and unlike, add a "skip" option to
disregard the result of the test.
If I read this correctly, the actual test isn't run at all (though, of
course, the tables are created and such).
In any case though, this doesn't completely solve the problem, does it?
Skipping 'defaults' also causes 'defaults_parallel' to be skipped and
therefore avoids the issue for now, but if some other change caused the
ordering to be different in the regular (non-data_only) cases then this
test would blow up again.
Looking back at the 9.6 tests, tests were only run for the runs where
they were explicitly specified, which led to tests being missed that
shouldn't have been, and that led to the approach now used where every
test is run against every run.
As this test should really only ever be applied to the 'data_only' run,
it seems like we should have a 'run' list and for this test that would
be just 'data_only'. I haven't looked into what's supposed to happen
here in a parallel data-only test, but if we expect the ordering to
still be honored then we should probably have a test for that.
So, in short, I don't really agree with this 'skip' approach as it
doesn't properly solve this problem but would rather see an 'only_runs'
option or similar where this test is only tried against the data_only
run (and possibly a data_only_parallel_restore one, if one such was
added). In any case, please be sure to also update the documentation
under 'Definition of the tests to run.' (about line 335) whenever the
set of options that can be specified for the tests is changed, and
let's strongly discourage the use of this feature for most tests as
these kinds of one-offs are quite rare.
Thanks!
Stephen
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
In a non-data_only dump, the order of the tables doesn't matter, because
the foreign keys are added at the very end. In parallel dumps, the
tables are in addition sorted by size, so the resultant order is
different from a single-threaded dump. This can be seen by comparing
the dumped TOCs of the defaults_dir_format and defaults_parallel cases.
But it all happens to pass the tests right now.
I noticed that business about sorting the TOC by size yesterday.
I think that's a completely bletcherous hack, and we ought to get
rid of it in favor of keeping the TOC order the same between parallel
and non-parallel cases, and instead doing size comparisons during
parallel worker dispatch.
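Tom's idea (keep the TOC order identical between parallel and serial
dumps, and apply the size comparison only when handing out jobs to
workers) might be sketched roughly like this. This is an illustrative
sketch, not pg_dump's actual code; FakeTocEntry and both function names
are made up:

```c
#include <stdlib.h>

/* Stand-in for a TOC entry; the field name "dataLength" mirrors the
 * kind of size estimate pg_dump keeps, but this struct is hypothetical. */
typedef struct
{
    int     dumpId;
    double  dataLength;     /* size estimate used only for scheduling */
} FakeTocEntry;

/* Comparator: largest size first, ties broken by dump ID so the order
 * stays deterministic. */
static int
cmp_by_size_desc(const void *a, const void *b)
{
    const FakeTocEntry *ta = *(FakeTocEntry *const *) a;
    const FakeTocEntry *tb = *(FakeTocEntry *const *) b;

    if (ta->dataLength > tb->dataLength)
        return -1;
    if (ta->dataLength < tb->dataLength)
        return 1;
    return ta->dumpId - tb->dumpId;
}

/* Sort only the dispatch candidates, leaving the archive TOC itself in
 * its canonical order. */
static void
sort_ready_by_size(FakeTocEntry **tes, int n)
{
    qsort(tes, n, sizeof(FakeTocEntry *), cmp_by_size_desc);
}
```

The point of doing this at dispatch time is that the dump's TOC stays
the same whether or not -j was used, so a serial and a parallel dump of
the same database produce comparable archives.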
The primary reason why it's a bletcherous hack is that it's designed
around the assumption that parallel dumps will be restored in parallel
and non-parallel dumps will be restored non-parallel. That assumption
is wrong on its face. But it explains why the code sorts both tables
and indexes, even though pg_dump doesn't have any need to parallelize
index dumps: it's expecting that a subsequent parallel restore will have
use for building indexes largest-first. Of course, if you made the dump
non-parallel, you're out of luck on that.
Another reason why the code is bogus is that it sorts only a consecutive
sequence of DO_INDEX items. This fails to work at all for indexes that
are constraints, and even for ones that aren't, there's no very good
reason to assume that they aren't interspersed with other sorts of
objects due to dependencies. So I suspect an awful lot of parallelism
is being left on the table so far as indexes are concerned.
So if we made this less of a quick-n-dirty kluge, we could remove the
hazard of different-looking dump results in parallel and serial cases,
and probably improve restore performance as well.
A small problem with postponing sorting to the restore side is that
I don't think we record relpages info anywhere in the archive format.
However, at least for the directory-format case (which I think is the
only one supported for parallel restore), we could make it compare the
file sizes of the TABLE DATA items. That'd work pretty well as a proxy
for both the amount of effort needed for table restore, and the amount
of effort needed to build indexes on the tables afterwards.
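The directory-format proxy described here amounts to a stat() on each
TABLE DATA file. A minimal sketch, with the function name being
illustrative rather than anything in pg_dump:

```c
#include <sys/stat.h>

/* Return the on-disk size of a table's dumped data file, for use as a
 * scheduling proxy for restore effort.  A missing or unreadable file
 * counts as size 0, i.e. it gets scheduled like a small table. */
static long
table_data_file_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        return (long) st.st_size;
    return 0;
}
```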
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
In a non-data_only dump, the order of the tables doesn't matter, because
the foreign keys are added at the very end. In parallel dumps, the
tables are in addition sorted by size, so the resultant order is
different from a single-threaded dump. This can be seen by comparing
the dumped TOCs of the defaults_dir_format and defaults_parallel cases.
But it all happens to pass the tests right now.I noticed that business about sorting the TOC by size yesterday.
I think that's a completely bletcherous hack, and we ought to get
rid of it in favor of keeping the TOC order the same between parallel
and non-parallel cases, and instead doing size comparisons during
parallel worker dispatch.
So instead of dumping things by the order of the TOC, we'll perform the
sorting later on before handing out jobs to workers? That seems alright
to me for the most part. One thing I do wonder about is if we should
also be sorting by tablespace and not just size, to try to maximize
throughput (that is, assign parallel workers to each tablespace, each
going after the largest table in that tablespace, before coming back
around to assigning the next-largest file to a second worker on a given
tablespace, presuming we have more workers than tablespaces). That's
what we've seen work rather well in pgbackrest.
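The tablespace idea could be sketched as: give each idle worker the
largest remaining table in the tablespace with the fewest active
workers. All of the names below are hypothetical; this is not pgbackrest
or pg_dump code:

```c
/* Hypothetical dispatch candidate: which tablespace it lives in, its
 * size estimate, and whether it has already been handed to a worker. */
typedef struct
{
    int     tablespace;     /* index into the active[] counters */
    double  size;
    int     taken;
} Cand;

/* Pick the largest not-yet-taken candidate in the least-busy
 * tablespace; returns -1 if nothing is left. */
static int
pick_next(const Cand *c, int n, const int *active)
{
    int     best = -1;

    for (int i = 0; i < n; i++)
    {
        if (c[i].taken)
            continue;
        if (best < 0
            || active[c[i].tablespace] < active[c[best].tablespace]
            || (active[c[i].tablespace] == active[c[best].tablespace]
                && c[i].size > c[best].size))
            best = i;
    }
    return best;
}
```

The effect is that I/O is spread across tablespaces (which are usually
separate devices) before two workers pile onto the same one.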
However, at least for the directory-format case (which I think is the
only one supported for parallel restore), we could make it compare the
file sizes of the TABLE DATA items. That'd work pretty well as a proxy
for both the amount of effort needed for table restore, and the amount
of effort needed to build indexes on the tables afterwards.
Parallel restore also works w/ custom-format dumps.
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
However, at least for the directory-format case (which I think is the
only one supported for parallel restore), we could make it compare the
file sizes of the TABLE DATA items. That'd work pretty well as a proxy
for both the amount of effort needed for table restore, and the amount
of effort needed to build indexes on the tables afterwards.
Parallel restore also works w/ custom-format dumps.
Really. Well then the existing code is even more broken, because it
only does this sorting for directory output:
/* If we do a parallel dump, we want the largest tables to go first */
if (archiveFormat == archDirectory && numWorkers > 1)
sortDataAndIndexObjectsBySize(dobjs, numObjs);
so that parallel restore is completely left in the lurch with a
custom-format dump.
But I imagine we can get some measure of table data size out of a custom
dump too.
regards, tom lane
Greetings,
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
However, at least for the directory-format case (which I think is the
only one supported for parallel restore), we could make it compare the
file sizes of the TABLE DATA items. That'd work pretty well as a proxy
for both the amount of effort needed for table restore, and the amount
of effort needed to build indexes on the tables afterwards.
Parallel restore also works w/ custom-format dumps.
Really. Well then the existing code is even more broken, because it
only does this sorting for directory output:
/* If we do a parallel dump, we want the largest tables to go first */
if (archiveFormat == archDirectory && numWorkers > 1)
sortDataAndIndexObjectsBySize(dobjs, numObjs);
so that parallel restore is completely left in the lurch with a
custom-format dump.
Sorry for not being clear- it's only possible to parallel *dump* to a
directory-format dump, and the above code is for performing that
sort-by-size before executing a parallel dump. One might wonder why
there's the check for archiveFormat at all though- numWorkers shouldn't
be able to be >1 except in the case where the archiveFormat supports
parallel dump, and if it supports parallel dump, then we should try to
dump out the tables largest-first.
Parallel *restore* can be done from either a custom-format dump or from
a directory-format dump. I agree that we should separate the concerns
and perform independent sorting on the restore side of things based on
the relative sizes of tables in the dump (be it custom format or
directory format). While compression might make us not exactly correct
on the restore side, I expect that we'll generally be close enough to
avoid most cases where a single worker gets stuck working on a large
table at the end after all the other work is done.
But I imagine we can get some measure of table data size out of a custom
dump too.
I would think so.
Thanks!
Stephen
Stephen Frost <sfrost@snowman.net> writes:
Parallel *restore* can be done from either a custom-format dump or from
a directory-format dump. I agree that we should separate the concerns
and perform independent sorting on the restore side of things based on
the relative sizes of tables in the dump (be it custom format or
directory format). While compression might make us not exactly correct
on the restore side, I expect that we'll generally be close enough to
avoid most cases where a single worker gets stuck working on a large
table at the end after all the other work is done.
Here's a proposed patch for this. It removes the hacking of the TOC list
order, solving Peter's original problem, and instead sorts-by-size
in the actual parallel dump or restore control code. There are a
number of ensuing performance benefits:
* The BLOBS entry, if any, gets to participate in the ordering decision
during parallel dumps. As the code stands, all the effort to avoid
scheduling a long job last is utterly wasted if you've got a lot of
blobs, because that entry stayed at the end. I didn't work real hard
on that, just gave it a large size so it would go first not last. If
you just have a few blobs, that's not necessary, but I doubt it hurts
either.
* During restore, we insert actual size numbers into the BLOBS and
TABLE DATA items, and then anything that depends on a TABLE DATA item
inherits its size. This results in size-based prioritization not just
for simple indexes as before, but also for constraint indexes (UNIQUE
or PRIMARY KEY), foreign key verifications, delayed CHECK constraints,
etc. It also means that stuff like triggers and rules get reinstalled
in size-based order, which doesn't really help, but again I don't
think it hurts.
* Parallel restore scheduling by size works for custom dumps as well
as directory ones (as long as the dump file was seekable when created,
but you'll be hurting anyway if it wasn't).
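The size-inheritance idea in the second point above can be sketched as a
single pass over the entries, assuming (as an illustration only; Entry
and inherit_sizes are made-up names) that entries appear after the
things they depend on:

```c
#include <stddef.h>

/* Toy entry: a size estimate plus the indexes of the entries this one
 * depends on.  A TABLE DATA item has deps == NULL; an index, constraint,
 * or FK-verification entry depends on its table's data item. */
typedef struct
{
    double  dataLength;
    int     nDeps;
    int    *deps;
} Entry;

/* Give each entry the largest size among its dependencies, so that an
 * index build on a big table is prioritized like the table itself.
 * Assumes dependencies precede dependents in the array. */
static void
inherit_sizes(Entry *entries, int n)
{
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < entries[i].nDeps; j++)
        {
            double  parent = entries[entries[i].deps[j]].dataLength;

            if (parent > entries[i].dataLength)
                entries[i].dataLength = parent;
        }
    }
}
```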
I have not really tried to demonstrate performance benefits, because
the results would depend a whole lot on your test case; but at least
in principle this should result in far more intelligent scheduling
of parallel restores.
While I haven't done so here, I'm rather tempted to rename the
par_prev/par_next fields and par_list_xxx functions to pending_prev,
pending_next, pending_list_xxx, since they now have only one use.
(BTW, I tried really hard to get rid of par_prev/par_next altogether,
in favor of keeping the pending entries in the unused space in the
"ready" TocEntry* array. But it didn't work out well --- seems like
a list really is the natural data structure for that.)
regards, tom lane
Attachments:
smarter-parallel-dump-restore-1.patch (text/x-diff)
On 28/08/2018 20:47, Tom Lane wrote:
Here's a proposed patch for this. It removes the hacking of the TOC list
order, solving Peter's original problem, and instead sorts-by-size
in the actual parallel dump or restore control code.
I have reviewed this patch. I haven't done any major performance tests
or the like, but the improvements are clear in principle.
It does solve the issue that I had originally reported, when I apply it
on top of my development branch.
Some small comments on the code:
Maybe add a ready_list_free() to go with ready_list_init(), instead of
calling pg_free(ready_list.tes) directly.
get_next_work_item() has been changed to remove the work item from the
ready_list. Maybe rename to something like pop_next_work_item()?
I'm confused by what ready_list_remove() is doing when it's not removing
the first item. It looks like it's removing all leading items up to the
i'th one. Is that what we want? In some cases, we are skipping over
things that we are not interested in at all, so this would work, but if
we're just skipping over an item because of a lock conflict, then it's
not right.
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
Some small comments on the code:
Maybe add a ready_list_free() to go with ready_list_init(), instead of
calling pg_free(ready_list.tes) directly.
get_next_work_item() has been changed to remove the work item from the
ready_list. Maybe rename to something like pop_next_work_item()?
Both seem reasonable, will do.
I'm confused by what ready_list_remove() is doing when it's not removing
the first item. It looks like it's removing all leading items up to the
i'th one. Is that what we want? In some cases, we are skipping over
things that we are not interested at all, so this would work, but if
we're just skipping over an item because of a lock conflict, then it's
not right.
No. In both code paths, the array slot at index first_te is being
physically dropped from the set of valid entries (by incrementing
first_te). In the first path, that slot holds the item we want to
remove logically from the set, so that incrementing first_te is
all we have to do: the remaining entries are still in the range
first_te..last_te, and they're still sorted. In the second code
path, the item that was in that slot is still wanted as part of
the set, so we copy it into the valid range (overwriting the item
in slot i, which is no longer wanted). Now the valid range is
probably not sorted, so we have to flag that a re-sort is needed.
I expect that most of the time the first code path will be taken,
because usually we'll be able to dispatch the highest-priority
ready entry. We'll only take the second path when we have to postpone
the highest-priority entry because of a potential lock conflict
against some already-running task. Any items between first_te and i
are other tasks that also have lock conflicts and can't be dispatched
yet; we certainly don't want to lose them, and this code doesn't.
If you can suggest comments that would clarify this more,
I'm all ears.
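A stripped-down sketch of the two paths as described, with a toy
ready-list holding ints in place of TocEntry pointers (the structure and
names here are illustrative, not the patch's actual code):

```c
/* Toy ready-list: tes[first_te..last_te] is the valid range, kept
 * sorted by priority until the "sorted" flag is cleared. */
typedef struct
{
    int  *tes;
    int   first_te;
    int   last_te;
    int   sorted;
} ReadyList;

/* Remove the entry in slot i from the valid range. */
static void
ready_list_remove(ReadyList *rl, int i)
{
    if (i == rl->first_te)
    {
        /* Fast path: drop the head slot; the rest stays sorted. */
        rl->first_te++;
    }
    else
    {
        /* Slow path: the head item is still wanted, so copy it over
         * slot i (whose item is being removed), then drop the head
         * slot.  The range is probably no longer sorted. */
        rl->tes[i] = rl->tes[rl->first_te];
        rl->first_te++;
        rl->sorted = 0;
    }
}
```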
regards, tom lane
On 12/09/2018 18:06, Tom Lane wrote:
No. In both code paths, the array slot at index first_te is being
physically dropped from the set of valid entries (by incrementing
first_te). In the first path, that slot holds the item we want to
remove logically from the set, so that incrementing first_te is
all we have to do: the remaining entries are still in the range
first_te..last_te, and they're still sorted. In the second code
path, the item that was in that slot is still wanted as part of
the set, so we copy it into the valid range (overwriting the item
in slot i, which is no longer wanted). Now the valid range is
probably not sorted, so we have to flag that a re-sort is needed.
I see. Why not shift all items up to the i'th up by one place, instead
of moving only the first one? That way the sortedness would be
preserved. Otherwise we'd move the first one into the middle, then
sorting would move it to the front again, etc.
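The shift-instead-of-copy suggestion, as a standalone sketch on the same
kind of toy int-array ready-list (names are illustrative, not the
patch's code):

```c
#include <string.h>

/* Toy ready-list: tes[first_te..last_te] is the valid, sorted range. */
typedef struct
{
    int  *tes;
    int   first_te;
    int   last_te;
} ReadyList;

/* Remove the entry in slot i by shifting slots first_te..i-1 up one
 * place, then advancing first_te.  Unlike copying the head into slot i,
 * this keeps the valid range sorted, so no re-sort is needed. */
static void
ready_list_remove_shift(ReadyList *rl, int i)
{
    memmove(&rl->tes[rl->first_te + 1],
            &rl->tes[rl->first_te],
            (i - rl->first_te) * sizeof(int));
    rl->first_te++;
}
```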
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 12/09/2018 18:06, Tom Lane wrote:
No. In both code paths, the array slot at index first_te is being
physically dropped from the set of valid entries (by incrementing
first_te). In the first path, that slot holds the item we want to
remove logically from the set, so that incrementing first_te is
all we have to do: the remaining entries are still in the range
first_te..last_te, and they're still sorted. In the second code
path, the item that was in that slot is still wanted as part of
the set, so we copy it into the valid range (overwriting the item
in slot i, which is no longer wanted). Now the valid range is
probably not sorted, so we have to flag that a re-sort is needed.
I see. Why not shift all items up to the i'th up by one place, instead
of moving only the first one? That way the sortedness would be
preserved. Otherwise we'd move the first one into the middle, then
sorting would move it to the front again, etc.
Hmmm ... might be worth doing, but I'm not sure. The steady-state cycle
will probably be that after one task has been dispatched, we'll sleep
until some task finishes and then that will unblock some pending items,
resulting in new entries at the end of the list, forcing a sort anyway
before we next dispatch a task. So I was expecting that avoiding a sort
here wasn't really going to be worth expending much effort for. But my
intuition about that could be wrong. I'll run a test case with some
instrumentation added and see how often we could avoid sorts by
memmove'ing.
regards, tom lane
I wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
I see. Why not shift all items up to the i'th up by one place, instead
of moving only the first one? That way the sortedness would be
preserved. Otherwise we'd move the first one into the middle, then
sorting would move it to the front again, etc.
Hmmm ... might be worth doing, but I'm not sure. The steady-state cycle
will probably be that after one task has been dispatched, we'll sleep
until some task finishes and then that will unblock some pending items,
resulting in new entries at the end of the list, forcing a sort anyway
before we next dispatch a task. So I was expecting that avoiding a sort
here wasn't really going to be worth expending much effort for. But my
intuition about that could be wrong. I'll run a test case with some
instrumentation added and see how often we could avoid sorts by
memmove'ing.
OK, my intuition was faulty. At least when testing with the regression
database, situations where we are taking the slow path at all seem to
involve several interrelated dump objects (eg indexes of a table) that
are all waiting for the same lock, such that we may be able to dispatch a
number of unrelated tasks before anything gets added from the pending
list. Doing it as you suggest eliminates a significant fraction of
the re-sort operations.
Attached updated patch does it like that and makes the cosmetic
adjustments you suggested. I also went ahead and did the renaming
of par_prev/par_next/par_list_xxx that I'd suggested upthread.
I think this is committable ...
regards, tom lane
Attachments:
smarter-parallel-dump-restore-2.patch (text/x-diff)
On 13/09/2018 23:03, Tom Lane wrote:
Attached updated patch does it like that and makes the cosmetic
adjustments you suggested. I also went ahead and did the renaming
of par_prev/par_next/par_list_xxx that I'd suggested upthread.
I think this is committable ...
Yes, this looks good to me.
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 13/09/2018 23:03, Tom Lane wrote:
Attached updated patch does it like that and makes the cosmetic
adjustments you suggested. I also went ahead and did the renaming
of par_prev/par_next/par_list_xxx that I'd suggested upthread.
I think this is committable ...
Yes, this looks good to me.
Pushed, thanks for reviewing.
regards, tom lane