parallel restore fixes

Started by Andrew Dunstanabout 17 years ago12 messageshackers
Jump to latest
#1Andrew Dunstan
andrew@dunslane.net

The attached patch fixes two issues with parallel restore:

* the static buffer problem in dumputils.c::fmtId() on Windows
(solution: use thread-local storage)
* ReopenPtr() is called too often

There is one remaining bug I know of that I can reproduce: we can get
into deadlock when two tables are foreign keyed to each other. So I need
to get a bit more paranoid about dependencies.

I can't reproduce Olivier Prennant's file closing problem on Unixware.
Is it still happening after application of this patch?

cheers

andrew

Attachments:

parallel_fix.patchtext/x-patch; charset=iso-8859-1; name=parallel_fix.patchDownload+62-20
#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Dunstan (#1)
Re: parallel restore fixes

Andrew Dunstan wrote:

The attached patch fixes two issues with parallel restore:

* the static buffer problem in dumputils.c::fmtId() on Windows
(solution: use thread-local storage)
* ReopenPtr() is called too often

Hmm, if pg_restore is the only program that's threaded, why are you
calling init_dump_utils on pg_dump and pg_dumpall? It makes me a bit
nervous because there are some other programs that are linking
dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.

Also I think the fmtId comment needs to be updated.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#1)
Re: parallel restore fixes

Andrew Dunstan <andrew@dunslane.net> writes:

+ void
+ init_dump_utils()

This should be

+ void
+ init_dump_utils(void)

please. We don't do K&R C around here. I'd lose the added retval
variable too; that's not contributing anything.

! #endif;

Semicolon is bogus here.

Looks pretty sane otherwise.

regards, tom lane

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: parallel restore fixes

Alvaro Herrera <alvherre@commandprompt.com> writes:

Hmm, if pg_restore is the only program that's threaded, why are you
calling init_dump_utils on pg_dump and pg_dumpall?

... because fmtId will crash on *any* use without that.

It makes me a bit
nervous because there are some other programs that are linking
dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.

Actually, why bother with init_dump_utils at all? fmtId could be made
to initialize the ID variable for itself on first call, with only one
extra if-test, which is hardly gonna matter.

regards, tom lane

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#4)
Re: parallel restore fixes

Tom Lane wrote:

It makes me a bit
nervous because there are some other programs that are linking
dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.

Actually, why bother with init_dump_utils at all? fmtId could be made
to initialize the ID variable for itself on first call, with only one
extra if-test, which is hardly gonna matter.

Well, the Windows reference I have suggests TlsAlloc() needs to be
called early in the initialisation process ... I guess I could force it
with a dummy call to fmtId() in restore_toc_entries_parallel() before it
starts spawning children, so we'd be sure there wasn't a race condition,
and nothing else is going to have threads so it won't matter. We'd need
a long comment to that effect, though ;-)

I'd lose the added retval
variable too; that's not contributing anything.

It is, in fact. Until I put that in I was getting constant crashes. I
suspect it's something to do with stuff Windows does under the hood on
function return.

cheers

andrew

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: parallel restore fixes

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Actually, why bother with init_dump_utils at all?

Well, the Windows reference I have suggests TlsAlloc() needs to be
called early in the initialisation process ...

How early is early? The proposed call sites for init_dump_utils seem
already long past the point where any libc-level infrastructure would
think it is "initialization" time.

I'd lose the added retval
variable too; that's not contributing anything.

It is, in fact. Until I put that in I was getting constant crashes. I
suspect it's something to do with stuff Windows does under the hood on
function return.

Pardon me while I retrieve my eyebrows from the ceiling. I think you've
got something going on there you don't understand, and you need to
understand it not just put in a cargo-cult fix. (Especially one that's
not documented and hence likely to be removed by the next person who
touches the code.)

regards, tom lane

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#6)
Re: parallel restore fixes

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

Actually, why bother with init_dump_utils at all?

Well, the Windows reference I have suggests TlsAlloc() needs to be
called early in the initialisation process ...

How early is early? The proposed call sites for init_dump_utils seem
already long past the point where any libc-level infrastructure would
think it is "initialization" time.

Well, I think at least it needs to be done where other threads won't be
calling it at the same time.

cheers

andrew

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#7)
Re: parallel restore fixes

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

How early is early? The proposed call sites for init_dump_utils seem
already long past the point where any libc-level infrastructure would
think it is "initialization" time.

Well, I think at least it needs to be done where other threads won't be
calling it at the same time.

Oh, I see, ye olde race condition. Still, it seems like a bad idea
to expect that we will catch every program that might call fmtId;
as Alvaro notes, that's all over our client-side code.

How about this: by default, fmtId uses the same logic as now (one static
PQExpBuffer). If told to by a call of init_parallel_dump_utils(), which
need only be called by pg_restore during its startup, then it switches to
using per-thread storage. init_parallel_dump_utils can be the place
that calls TlsAlloc. This is almost the same as what you suggested a
couple messages back, but perhaps a bit clearer as to what's going on;
and it definitely cuts the number of places we need to touch.

regards, tom lane

#9Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#8)
Re: parallel restore fixes

Tom Lane wrote:

How about this: by default, fmtId uses the same logic as now (one static
PQExpBuffer). If told to by a call of init_parallel_dump_utils(), which
need only be called by pg_restore during its startup, then it switches to
using per-thread storage. init_parallel_dump_utils can be the place
that calls TlsAlloc. This is almost the same as what you suggested a
couple messages back, but perhaps a bit clearer as to what's going on;
and it definitely cuts the number of places we need to touch.

OK, here 'tis.

Moving on to the deadlock with crossed FKs issue.

cheers

andrew

Attachments:

parallel_fix.patchtext/x-patch; charset=iso-8859-1; name=parallel_fix.patchDownload+81-37
#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#9)
Re: parallel restore fixes

Andrew Dunstan <andrew@dunslane.net> writes:

OK, here 'tis.

Looks fairly reasonable to me, but of course I haven't tested it.

regards, tom lane

#11Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#10)
Re: parallel restore fixes

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

OK, here 'tis.

Looks fairly reasonable to me, but of course I haven't tested it.

Well, I have to do a blitz of parallel restores next week, so hopefully
that will hit any soft spots.

--Josh

#12Andrew Dunstan
andrew@dunslane.net
In reply to: Josh Berkus (#11)
Re: parallel restore fixes

Josh Berkus wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

OK, here 'tis.

Looks fairly reasonable to me, but of course I haven't tested it.

Well, I have to do a blitz of parallel restores next week, so
hopefully that will hit any soft spots.

I have a known outstanding bug to do with deadlock from FKs that cross
(i.e. A has an FK that references B, and B has an FK that references A).
The solution to this could be mildly complex, but I have an outline of
it in my head. Workaround: recreate the failed FK at the end of the restore.

The only other reported problem is the one on Unixware to do with
closing the archive. I haven't been able to reproduce it on Linux or
Windows, the two platforms I test on, although it might be fixed by the
patch I just applied.

cheers

andrew