Fundamental scheduling bug in parallel restore of partitioned tables

Started by Tom Lane · about 1 year ago · 9 messages · pgsql-hackers
#1 Tom Lane
tgl@sss.pgh.pa.us

Looking into the complaint at [1], I find that pg_restore really gets
it quite wrong when trying to do a parallel restore of partitioned
tables with foreign key constraints. The historical way that we
got parallel restore to work correctly with FK constraints is:

1. The dump file shows dependencies of the FK constraint object on
the two tables named in the constraint. (This matches what it
says in pg_depend.)

2. If it's a parallel restore, repoint_table_dependencies() replaces
the dependencies on the tables with dependencies on their TABLE DATA
objects.

3. Now restore of the FK constraint will not be started until all
the relevant data is loaded.

However, if we're talking about partitioned tables, there is no
TABLE DATA object for a partitioned table; only for its leaf
partitions. So repoint_table_dependencies() does nothing for the
FK object's dependencies, meaning that it's a candidate to be
launched as soon as we begin the parallel restore phase.
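To make the gap concrete, here is a hedged sketch of the repointing step in Python (the real logic lives in C in pg_backup_archiver.c; the TOC model and entry names below are invented for illustration):

```python
# Toy model of repoint_table_dependencies(): replace each dependency on
# a TABLE with a dependency on that table's TABLE DATA entry, when one
# exists.  Names ("fk", "parent", "child_data") are illustrative only.

def repoint_table_dependencies(toc_entries, table_data_ids):
    """Map table dependencies to the tables' TABLE DATA objects."""
    for entry in toc_entries:
        entry["deps"] = [table_data_ids.get(d, d) for d in entry["deps"]]

# One partitioned table "parent" with leaf "child"; the FK depends on
# the parent table object, matching what pg_depend records.
toc = [
    {"id": "fk",         "deps": ["parent"]},
    {"id": "child_data", "deps": ["child"]},
]
# Only leaf partitions have TABLE DATA objects; "parent" has none.
table_data = {"child": "child_data"}

repoint_table_dependencies(toc, table_data)
# The FK's dependency on "parent" is left untouched, so nothing delays
# it until child_data has been loaded.
print(toc[0]["deps"])   # -> ['parent']
```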

This is disastrous for assorted reasons. The ALTER ADD CONSTRAINT
command might fail outright if we've loaded data for the referencing
table but not the referenced table. It could deadlock against other
parallel restore jobs, as reported in [1] (and which I find not
too terribly hard to reproduce here). Even if it doesn't fail, if
it completes before we load data for the referencing table, we'll
have to do retail FK checks, greatly slowing that data load.

I think that the most intellectually rigorous solution is to
generate dummy TABLE DATA objects for partitioned tables, which
don't actually contain data but merely carry dependencies on
each of the child tables' TABLE DATA objects. (In a multilevel
partitioned structure, this'd result in the top TABLE DATA having
indirect dependencies on all the leaf partition TABLE DATAs.)
Then repoint_table_dependencies() does the right thing with a
dependency on a partitioned table, and the dependency-driven
scheduler will take care of the rest.
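Continuing the toy model from before (again, invented names, not pg_restore's actual data structures), the proposed fix amounts to registering a dummy data object for the partitioned table so the ordinary repointing does the right thing:

```python
# Sketch of the proposed fix: a dummy "PARTITIONED DATA" entry that
# loads nothing but depends on every leaf partition's TABLE DATA
# object.  All identifiers here are illustrative.

def repoint(toc_entries, table_data_ids):
    for entry in toc_entries:
        entry["deps"] = [table_data_ids.get(d, d) for d in entry["deps"]]

toc = [
    {"id": "fk",          "deps": ["parent"]},
    {"id": "child1_data", "deps": ["child1"]},
    {"id": "child2_data", "deps": ["child2"]},
    # Dummy data object for the partitioned table itself:
    {"id": "parent_data", "deps": ["child1_data", "child2_data"]},
]
# The dummy object is registered just like a real TABLE DATA entry.
table_data = {"child1": "child1_data", "child2": "child2_data",
              "parent": "parent_data"}

repoint(toc, table_data)
# Now the FK depends (transitively) on every leaf's TABLE DATA, so the
# dependency-driven scheduler won't launch it until all data is loaded.
print(toc[0]["deps"])   # -> ['parent_data']
```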

There are two places we could make that happen. The easiest
to code, likely, is to get pg_dump to create such objects and
include them in the dump. We could possibly get pg_restore
to fake up such objects from the data it has available, but
I expect that that will be harder and slower than having
pg_dump do it. So I'm leaning towards the first way.
The disadvantage is that existing dump files would still be
hazardous to restore in parallel. But given that this has
been broken for some time and nobody's reported it till now,
I feel maybe that's okay. (I don't think there would be
a backwards compatibility problem in introducing such new
objects into dumps, because AFAICS pg_restore would not need
any explicit knowledge of them.)

Thoughts?

regards, tom lane

PS: attached is a text dump file for a trivial DB containing
two partitioned tables with an FK. If I load this,
dump it into an -Fc-format dump file, and run something like
pg_restore -j10 src.dump -d target
then I can reproduce the reported deadlock failure --- not
entirely reliably, but it does happen.

[1]: /messages/by-id/67469c1c-38bc-7d94-918a-67033f5dd731@gmx.net

Attachments:

src.sql (text/plain)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

I wrote:

I think that the most intellectually rigorous solution is to
generate dummy TABLE DATA objects for partitioned tables, which
don't actually contain data but merely carry dependencies on
each of the child tables' TABLE DATA objects.

Here's a draft patch for this. It seems to fix the problem in
light testing. Some notes:

* Quite a lot of the patch is concerned with making various places
treat the new PARTITIONED DATA TOC entry type the same as TABLE DATA.
I considered removing that distinction and representing a partitioned
table's data object as TABLE DATA with no dataDumper, but it seems to
me this way is clearer. Maybe others will think differently though;
it'd make for a smaller patch.

* It's annoying that we have to touch _tocEntryRequired's "Special
Case" logic for deciding whether an entry is schema or data, because
that means that old copies of pg_restore will think these entries are
schema and thus ignore them in a data-only restore. But I think it
doesn't matter too much, because in a data-only restore we'd not be
creating indexes or foreign keys, so the scheduling bug isn't really
problematic.

* I'm not quite certain whether identify_locking_dependencies() needs
to treat PARTITIONED DATA dependencies as lockable. I assumed here
that it does, but maybe we don't take out exclusive locks on
partitioned tables during restore?

* I noticed that a --data-only dump of the regression database now
complains:

$ pg_dump --data-only regression >r.dump
pg_dump: warning: there are circular foreign-key constraints on this table:
pg_dump: detail: parted_self_fk
pg_dump: hint: You might not be able to restore the dump without using --disable-triggers or temporarily dropping the constraints.
pg_dump: hint: Consider using a full dump instead of a --data-only dump to avoid this problem.

The existing code does not produce this warning, but I think doing so
is correct. The reason we missed the issue before is that
getTableDataFKConstraints ignores tables without a dataObj, so before
this patch it ignored partitioned tables altogether.

Comments?

regards, tom lane

Attachments:

v1-handle-partitioned-tables-better.patch (text/x-diff, +223 −41)
#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

I wrote:

Here's a draft patch for this. It seems to fix the problem in
light testing.

I realized that the "repro" I had for this isn't testing the same
thing that Dimitrios is seeing; what it is exposing looks more like
a bug or at least a behavioral change due to the v18 work to record
not-null constraints in pg_constraint [1]. So my patch may fix his
problem or it may not. It would be good to have a reproducer that
fails (not necessarily every time) in v17 or earlier.

In addition to that uncertainty, pushing the patch now would get in
the way of identifying what's really going on at [1]. So I'm going
to sit on it for now, and maybe it's going to turn into v19 material.

regards, tom lane

[1]: /messages/by-id/1280408.1744650810@sss.pgh.pa.us

#4 Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tom Lane (#3)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

On Mon, Apr 14, 2025 at 11:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Here's a draft patch for this. It seems to fix the problem in
light testing.

I realized that the "repro" I had for this isn't testing the same
thing that Dimitrios is seeing; what it is exposing looks more like
a bug or at least a behavioral change due to the v18 work to record
not-null constraints in pg_constraint [1]. So my patch may fix his
problem or it may not. It would be good to have a reproducer that
fails (not necessarily every time) in v17 or earlier.

I tried to reproduce the problem using your script on v17, but couldn't
get either a deadlock or a constraint-violation error.

This is disastrous for assorted reasons. The ALTER ADD CONSTRAINT
command might fail outright if we've loaded data for the referencing
table but not the referenced table.

There's a comment in getConstraints()
/*
* Restoring an FK that points to a partitioned table requires that
* all partition indexes have been attached beforehand. Ensure that
* happens by making the constraint depend on each index partition
* attach object.
*/
FK constraint addition will wait for the child indexes of the referenced
partitioned table to be attached, which in turn wait for data to be
loaded into the child tables. So it doesn't look like we will see ADD
CONSTRAINT failing. I may be missing something though.

It could deadlock against other
parallel restore jobs, as reported in [1] (and which I find not
too terribly hard to reproduce here).

Even if it doesn't fail, if
it completes before we load data for the referencing table, we'll
have to do retail FK checks, greatly slowing that data load.

FWIW, executing pg_restore -j2 -v, I think I see evidence that the FK
constraint is created before data is loaded into the referencing
table.

pg_restore: processing data for table "public.c22"
... snip
pg_restore: launching item 2477 FK CONSTRAINT parent2 parent2_ref_fkey
pg_restore: creating FK CONSTRAINT "public.parent2 parent2_ref_fkey"
pg_restore: finished item 2626 TABLE DATA c22
... snip
pg_restore: launching item 2625 TABLE DATA c21
pg_restore: processing data for table "public.c21"
pg_restore: finished item 2477 FK CONSTRAINT parent2 parent2_ref_fkey
... snip
pg_restore: finished item 2625 TABLE DATA c21

I tried applying your patch on v17 to see whether it causes the FK
creation to wait, but the patch doesn't apply cleanly.

--
Best Wishes,
Ashutosh Bapat

#5 Dimitrios Apostolou
jimis@gmx.net
In reply to: Tom Lane (#3)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

On Mon, 14 Apr 2025, Tom Lane wrote:

I wrote:

Here's a draft patch for this. It seems to fix the problem in
light testing.

I realized that the "repro" I had for this isn't testing the same
thing that Dimitrios is seeing; what it is exposing looks more like
a bug or at least a behavioral change due to the v18 work to record
not-null constraints in pg_constraint [1]. So my patch may fix his
problem or it may not. It would be good to have a reproducer that
fails (not necessarily every time) in v17 or earlier.

Thank you for your work on it.

I only got the "ERROR: deadlock detected" message once, with pg_restore
compiled from master branch. My dump is too large to test it many times on
v17 so I can't tell if it occurs there.

In general I believe that dependency resolution is not optimal, whether
there is a deadlock bug or not. It can definitely be improved, as work
(mostly post-data) is not parallelized as much as it could be.

Anyway if I get the deadlock on v17 I'll update the initial thread.

Thanks,
Dimitris

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dimitrios Apostolou (#5)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

Dimitrios Apostolou <jimis@gmx.net> writes:

I only got the "ERROR: deadlock detected" message once, with pg_restore
compiled from master branch.

Oh, hmm ... do you recall just when on the master branch? I'm
wondering if it was before or after 14e87ffa5 (2024-11-08).
If it was after, then it might have been subject to the same
issue with not-null constraints that I ran into.

regards, tom lane

#7 Dimitrios Apostolou
jimis@gmx.net
In reply to: Tom Lane (#6)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

On 15 April 2025 18:18:41 CEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dimitrios Apostolou <jimis@gmx.net> writes:

I only got the "ERROR: deadlock detected" message once, with pg_restore
compiled from master branch.

Oh, hmm ... do you recall just when on the master branch? I'm
wondering if it was before or after 14e87ffa5 (2024-11-08).
If it was after, then it might have been subject to the same
issue with not-null constraints that I ran into.

regards, tom lane

Definitely after.

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ashutosh Bapat (#4)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:

On Mon, Apr 14, 2025 at 11:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This is disastrous for assorted reasons. The ALTER ADD CONSTRAINT
command might fail outright if we've loaded data for the referencing
table but not the referenced table.

There's a comment in getConstraints()
/*
* Restoring an FK that points to a partitioned table requires that
* all partition indexes have been attached beforehand. Ensure that
* happens by making the constraint depend on each index partition
* attach object.
*/

Ah, that is an excellent point which I missed. And the INDEX ATTACH
objects have dependencies on the leaf tables, which *will* get
repointed to their TABLE DATA objects by repoint_table_dependencies.
So by the time we are ready to restore the FK CONSTRAINT object,
we are certain to have loaded all the data of the referenced table.
But there's nothing delaying the constraint till after the referencing
table's data is loaded.
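The asymmetry described above can be shown with a small scheduling sketch (entry names invented; this models only the readiness rule, i.e. an entry may launch once all its dependencies have finished):

```python
# Toy dependency graph for the analysis above: the FK reaches the
# *referenced* table's leaf data via its INDEX ATTACH dependency, but
# nothing links it to the *referencing* table's data.

deps = {
    "fk":              {"index_attach"},   # FK waits for index attach
    "index_attach":    {"ref_leaf_data"},  # ...which waits for the
                                           # referenced table's data
    "ref_leaf_data":   set(),
    "fkref_leaf_data": set(),              # referencing side: unlinked
}

def ready(done):
    """Entries whose dependencies are all finished."""
    return {e for e, d in deps.items() if e not in done and d <= done}

# With the referenced side loaded but the referencing side not even
# started, the FK is already runnable:
done = {"ref_leaf_data", "index_attach"}
print("fk" in ready(done))   # -> True
```

So the ADD CONSTRAINT cannot run too early relative to the referenced data, but it can beat the referencing table's load, which is exactly the retail-check performance problem.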

Even if it doesn't fail, if
it completes before we load data for the referencing table, we'll
have to do retail FK checks, greatly slowing that data load.

FWIW, Executing pg_restore -j2 -v, I think, I see evidence that the FK
constraint is created before data is loaded into the referencing
table.

Yes, I reproduced that as well. That squares with the above
analysis.

So at this point we have:

#1: ADD CONSTRAINT failure because of missing referenced data:
not possible after all.

#2: Deadlock between parallel restore jobs: possible in HEAD, but
it seems likely to be a bug introduced by the not-null-constraint
work rather than being pg_restore's fault. We have no evidence
that such a deadlock can happen in released branches, and the lack
of field reports suggests that it can't.

#3: Restoring the FK constraint before referencing data is loaded:
this seems to be possible, and it's a performance problem, but
no more than that.

So now I withdraw the suggestion that this patch needs to be
back-patched. We may not even need it in v18, if another fix
for #2 is found. Fixing #3 would be a desirable thing to do
in v19, but if that's the only thing at stake then it's not
something to break feature freeze for.

For the moment I'll mark this CF entry as meant for v19.
We can resurrect consideration of it for v18 if there's not
a better way to fix the deadlock problem.

regards, tom lane

#9 Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tom Lane (#8)
Re: Fundamental scheduling bug in parallel restore of partitioned tables

On Wed, Apr 16, 2025 at 12:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:

On Mon, Apr 14, 2025 at 11:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

This is disastrous for assorted reasons. The ALTER ADD CONSTRAINT
command might fail outright if we've loaded data for the referencing
table but not the referenced table.

There's a comment in getConstraints()
/*
* Restoring an FK that points to a partitioned table requires that
* all partition indexes have been attached beforehand. Ensure that
* happens by making the constraint depend on each index partition
* attach object.
*/

Ah, that is an excellent point which I missed. And the INDEX ATTACH
objects have dependencies on the leaf tables, which *will* get
repointed to their TABLE DATA objects by repoint_table_dependencies.
So by the time we are ready to restore the FK CONSTRAINT object,
we are certain to have loaded all the data of the referenced table.
But there's nothing delaying the constraint till after the referencing
table's data is loaded.

Yes.

So at this point we have:

#1: ADD CONSTRAINT failure because of missing referenced data:
not possible after all.

check

#2: Deadlock between parallel restore jobs: possible in HEAD, but
it seems likely to be a bug introduced by the not-null-constraint
work rather than being pg_restore's fault. We have no evidence
that such a deadlock can happen in released branches, and the lack
of field reports suggests that it can't.

I agree. I ran that repro several times against v17 and never saw a
deadlock, even after changing the number of partitions, sizes of
partitions, adding more tables etc.

Creating an FK constraint happens either before or after loading data
into one of the partitions. That might point towards either constraint
creation or data load waiting for the other. But I have not seen
evidence that a full deadlock chain is possible.

#3: Restoring the FK constraint before referencing data is loaded:
this seems to be possible, and it's a performance problem, but
no more than that.

Yes. And once we fix this, there won't be waiting between constraint
creation and data load, so the remote possibility of a deadlock also
vanishes.

So now I withdraw the suggestion that this patch needs to be
back-patched. We may not even need it in v18, if another fix
for #2 is found. Fixing #3 would be a desirable thing to do
in v19, but if that's the only thing at stake then it's not
something to break feature freeze for.

For the moment I'll mark this CF entry as meant for v19.
We can resurrect consideration of it for v18 if there's not
a better way to fix the deadlock problem.

+1.

--
Best Wishes,
Ashutosh Bapat