parallel restore fixes
The attached patch fixes two issues with parallel restore:
* the static buffer problem in dumputils.c::fmtId() on Windows
(solution: use thread-local storage)
* ReopenPtr() is called too often
There is one remaining bug I know of that I can reproduce: we can get
into deadlock when two tables are foreign keyed to each other. So I need
to get a bit more paranoid about dependencies.
I can't reproduce Olivier Prennant's file closing problem on Unixware.
Is it still happening after application of this patch?
cheers
andrew
Attachments:
parallel_fix.patchtext/x-patch; charset=iso-8859-1; name=parallel_fix.patchDownload+62-20
Andrew Dunstan wrote:
The attached patch fixes two issues with parallel restore:
* the static buffer problem in dumputils.c::fmtId() on Windows
(solution: use thread-local storage)
* ReopenPtr() is called too often
Hmm, if pg_restore is the only program that's threaded, why are you
calling init_dump_utils on pg_dump and pg_dumpall? It makes me a bit
nervous because there are some other programs that are linking
dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.
Also I think the fmtId comment needs to be updated.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Andrew Dunstan <andrew@dunslane.net> writes:
+ void
+ init_dump_utils()
This should be
+ void
+ init_dump_utils(void)
please. We don't do K&R C around here. I'd lose the added retval
variable too; that's not contributing anything.
! #endif;
Semicolon is bogus here.
Looks pretty sane otherwise.
regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
Hmm, if pg_restore is the only program that's threaded, why are you
calling init_dump_utils on pg_dump and pg_dumpall?
... because fmtId will crash on *any* use without that.
It makes me a bit
nervous because there are some other programs that are linking
dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.
Actually, why bother with init_dump_utils at all? fmtId could be made
to initialize the ID variable for itself on first call, with only one
extra if-test, which is hardly gonna matter.
regards, tom lane
Tom Lane wrote:
It makes me a bit
nervous because there are some other programs that are linking
dumputils.c (psql and some in src/bin/scripts/) and even calling fmtId.Actually, why bother with init_dump_utils at all? fmtId could be made
to initialize the ID variable for itself on first call, with only one
extra if-test, which is hardly gonna matter.
Well, the Windows reference I have suggests TlsAlloc() needs to be
called early in the initialisation process ... I guess I could force it
with a dummy call to fmtId() in restore_toc_entries_parallel() before it
starts spawning children, so we'd be sure there wasn't a race condition,
and nothing else is going to have threads so it won't matter. We'd need
a long comment to that effect, though ;-)
I'd lose the added retval
variable too; that's not contributing anything.
It is, in fact. Until I put that in I was getting constant crashes. I
suspect it's something to do with stuff Windows does under the hood on
function return.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
Actually, why bother with init_dump_utils at all?
Well, the Windows reference I have suggests TlsAlloc() needs to be
called early in the initialisation process ...
How early is early? The proposed call sites for init_dump_utils seem
already long past the point where any libc-level infrastructure would
think it is "initialization" time.
I'd lose the added retval
variable too; that's not contributing anything.
It is, in fact. Until I put that in I was getting constant crashes. I
suspect it's something to do with stuff Windows does under the hood on
function return.
Pardon me while I retrieve my eyebrows from the ceiling. I think you've
got something going on there you don't understand, and you need to
understand it not just put in a cargo-cult fix. (Especially one that's
not documented and hence likely to be removed by the next person who
touches the code.)
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
Actually, why bother with init_dump_utils at all?
Well, the Windows reference I have suggests TlsAlloc() needs to be
called early in the initialisation process ...How early is early? The proposed call sites for init_dump_utils seem
already long past the point where any libc-level infrastructure would
think it is "initialization" time.
Well, I think at least it needs to be done where other threads won't be
calling it at the same time.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
How early is early? The proposed call sites for init_dump_utils seem
already long past the point where any libc-level infrastructure would
think it is "initialization" time.
Well, I think at least it needs to be done where other threads won't be
calling it at the same time.
Oh, I see, ye olde race condition. Still, it seems like a bad idea
to expect that we will catch every program that might call fmtId;
as Alvaro notes, that's all over our client-side code.
How about this: by default, fmtId uses the same logic as now (one static
PQExpBuffer). If told to by a call of init_parallel_dump_utils(), which
need only be called by pg_restore during its startup, then it switches to
using per-thread storage. init_parallel_dump_utils can be the place
that calls TlsAlloc. This is almost the same as what you suggested a
couple messages back, but perhaps a bit clearer as to what's going on;
and it definitely cuts the number of places we need to touch.
regards, tom lane
Tom Lane wrote:
How about this: by default, fmtId uses the same logic as now (one static
PQExpBuffer). If told to by a call of init_parallel_dump_utils(), which
need only be called by pg_restore during its startup, then it switches to
using per-thread storage. init_parallel_dump_utils can be the place
that calls TlsAlloc. This is almost the same as what you suggested a
couple messages back, but perhaps a bit clearer as to what's going on;
and it definitely cuts the number of places we need to touch.
OK, here 'tis.
Moving on to the deadlock with crossed FKs issue.
cheers
andrew
Attachments:
parallel_fix.patchtext/x-patch; charset=iso-8859-1; name=parallel_fix.patchDownload+81-37
Andrew Dunstan <andrew@dunslane.net> writes:
OK, here 'tis.
Looks fairly reasonable to me, but of course I haven't tested it.
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
OK, here 'tis.
Looks fairly reasonable to me, but of course I haven't tested it.
Well, I have to do a blitz of parallel restores next week, so hopefully
that will hit any soft spots.
--Josh
Josh Berkus wrote:
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
OK, here 'tis.
Looks fairly reasonable to me, but of course I haven't tested it.
Well, I have to do a blitz of parallel restores next week, so
hopefully that will hit any soft spots.
I have a known outstanding bug to do with deadlock from FKs that cross
(i.e. A has an FK that references B, and B has an FK that references A).
The solution to this could be mildly complex, but I have an outline of
it in my head. Workaround: recreate the failed FK at the end of the restore.
The only other reported problem is the one on Unixware to do with
closing the archive. I haven't been able to reproduce it on Linux or
Windows, the two platforms I test on, although it might be fixed by the
patch I just applied.
cheers
andrew