pg_restore crash when there is a failure before all child processes are created
Hi,
I found a crash in pg_restore that occurs when there is a failure before
all the child workers are created. The backtrace is given below:
#0 0x00007f9c6d31e337 in raise () from /lib64/libc.so.6
#1 0x00007f9c6d31fa28 in abort () from /lib64/libc.so.6
#2 0x00007f9c6d317156 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f9c6d317202 in __assert_fail () from /lib64/libc.so.6
#4 0x0000000000407c9e in WaitForTerminatingWorkers (pstate=0x14af7f0) at
parallel.c:515
#5 0x0000000000407bf9 in ShutdownWorkersHard (pstate=0x14af7f0) at
parallel.c:451
#6 0x0000000000407ae9 in archive_close_connection (code=1, arg=0x6315a0
<shutdown_info>) at parallel.c:368
#7 0x000000000041a7c7 in exit_nicely (code=1) at pg_backup_utils.c:99
#8 0x0000000000408180 in ParallelBackupStart (AH=0x14972e0) at
parallel.c:967
#9 0x000000000040a3dd in RestoreArchive (AHX=0x14972e0) at
pg_backup_archiver.c:661
#10 0x0000000000404125 in main (argc=6, argv=0x7ffd5146f308) at
pg_restore.c:443
The problem is as follows:
- The variable pstate->numWorkers is initially set to the requested
number of workers in ParallelBackupStart.
- The workers are then created one by one.
- A failure occurs before all of the processes have been created.
- The parent then terminates the child processes and waits for all of
them to exit.
- WaitForTerminatingWorkers checks whether every process has terminated
by calling HasEveryWorkerTerminated.
- HasEveryWorkerTerminated always returns false, because it checks
against numWorkers rather than the number of processes actually forked,
so the parent loops until it hits the assertion
"Assert(j < pstate->numWorkers);".
The attached patch fixes this by setting pstate->numWorkers to the
actual worker count as the child processes are created.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachments:
0001-pg_restore-crash-when-there-is-a-failure-before-all-worker-creation.patch
On Wed, Jan 29, 2020 at 6:54 PM Ahsan Hadi <ahsan.hadi@gmail.com> wrote:
Hi Vignesh,
Can you share a test case or steps that you are using to reproduce this issue? Are you reproducing this using a debugger?
I could reproduce it with the following steps:
- Set up a cluster.
- Create a few tables.
- Take a dump in directory format using pg_dump.
- Restore the dump generated above using pg_restore with a very high
value for the --jobs option, around 600.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
vignesh C <vignesh21@gmail.com> writes:
On Wed, Jan 29, 2020 at 6:54 PM Ahsan Hadi <ahsan.hadi@gmail.com> wrote:
Can you share a test case or steps that you are using to reproduce this issue? Are you reproducing this using a debugger?
I could reproduce it with the following steps:
- Set up a cluster.
- Create a few tables.
- Take a dump in directory format using pg_dump.
- Restore the dump generated above using pg_restore with a very high
value for the --jobs option, around 600.
I agree this is quite broken. Another way to observe the crash is
to make the fork() call randomly fail, as per booby-trap-fork.patch
below (not intended for commit, obviously).
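For illustration, a booby trap along these lines might look like the
following hypothetical sketch (not the actual booby-trap-fork.patch):

#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wrap fork() so that it fails at random, exercising the error path
 * where only some of the requested workers were ever created.
 */
static pid_t
flaky_fork(void)
{
    /* fail roughly one call in four, as if the system were out of PIDs */
    if (random() % 4 == 0)
    {
        errno = EAGAIN;
        return -1;
    }
    return fork();
}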
I don't especially like the proposed patch, though, as it introduces
a great deal of confusion into what ParallelState.numWorkers means.
I think it's better to leave that as being the allocated array size,
and instead clean up all the fuzzy thinking about whether workers
are actually running or not. Like 0001-fix-worker-status.patch below.
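A hypothetical sketch of that approach (not the actual
0001-fix-worker-status.patch; the names are illustrative) keeps
numWorkers as the allocated array size and tracks per-slot status
explicitly:

#include <stdbool.h>

typedef enum
{
    WRKR_NOT_STARTED,   /* slot allocated, fork() never performed */
    WRKR_RUNNING,       /* worker forked and alive */
    WRKR_TERMINATED     /* worker has exited and been reaped */
} T_WorkerStatus;

typedef struct
{
    T_WorkerStatus workerStatus;
} ParallelSlot;

typedef struct
{
    int          numWorkers;    /* allocated array size */
    ParallelSlot *parallelSlot;
} ParallelState;

static bool
HasEveryWorkerTerminated(ParallelState *pstate)
{
    for (int i = 0; i < pstate->numWorkers; i++)
    {
        /* never-started slots no longer count as "still running" */
        if (pstate->parallelSlot[i].workerStatus == WRKR_RUNNING)
            return false;
    }
    return true;
}

With explicit per-slot states, a slot that was never forked is
distinguishable from one that is running or has terminated, so the
wait loop can exit even after a partial startup failure.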
regards, tom lane
I have applied and tested both patches separately and ran the
regression tests with each; no new test cases fail with either patch.
The issue is fixed by both patches, though the fix from Tom looks more
elegant. I haven't done a detailed code review.
--
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: ahsan.hadi@highgo.ca
The patch looks fine to me. The failing test scenario is also fixed by
the patch.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: tested, passed
Documentation: not tested
I have applied and tested both patches separately and ran the regression tests with each; no new test cases fail with either patch.
The issue is fixed by both patches, though the fix from Tom (0001-fix-worker-status.patch) looks more elegant. I haven't done a detailed code review.
ahsan hadi <ahsan.hadi@gmail.com> writes:
I have applied and tested both patches separately and ran the regression tests with each; no new test cases fail with either patch.
The issue is fixed by both patches, though the fix from Tom (0001-fix-worker-status.patch) looks more elegant. I haven't done a detailed code review.
Pushed, thanks for looking!
regards, tom lane