BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

Started by PG Bug reporting formabout 6 years ago8 messages
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 16147
Logged by: Bill Tihen
Email address: btihen@gmail.com
PostgreSQL version: 12.1
Operating system: MacOS 10.15.1
Description:

The following command crashes with any database I've tried (both large and
small) DBs:
`pg_restore -U wti0405 -d stage3 -h localhost --jobs=8 -Fc
database_12_04-01-00.bak -x`

I get the error others have associated with a locking problem in earlier
versions:
`pg_restore: error: a worker process died unexpectedly`

Interesting both of the following commands work!
`pg_restore -U wti0405 -d stage3 --jobs=8 -Fc database_12_04-01-00.bak -x`
&
`pg_restore -U wti0405 -d stage3 -h localhost -Fc database_12_04-01-00.bak
-x`

So it appears this is only a problem when using parallelization with a TCP
and not a local socket.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

PG Bug reporting form <noreply@postgresql.org> writes:

The following bug has been logged on the website:
Bug reference: 16147
Logged by: Bill Tihen
Email address: btihen@gmail.com
PostgreSQL version: 12.1
Operating system: MacOS 10.15.1
Description:

The following command crashes with any database I've tried (both large and
small) DBs:
`pg_restore -U wti0405 -d stage3 -h localhost --jobs=8 -Fc
database_12_04-01-00.bak -x`

I failed to reproduce this on my own 10.15.1 laptop, using manual
builds of either HEAD or the v12 branch. Plausible reasons for
the difference in results might include:

* There's something different about the homebrew build (could we
see the output of pg_config?)

* There's something unusual about your configuration (one thought
that comes to mind: do you have SSL turned on for localhost
connections?)

* There's something about the data in this specific database
(your report that it happens for multiple databases puts a crimp
in this idea, though maybe they all share a common feature)

Anyway, we need more info to investigate. You might try looking
into the server log to see what the failure looks like from that
side --- is there a query error, or just the worker disconnecting
unexpectedly?

regards, tom lane

#3PikachuEXE
pikachuexe@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

I encounter similar issue with the following differences to OP

- Operating system: MacOS 10.14.6
- PostgreSQL (both server & pg_restore) installed via PSQL.app
https://github.com/PostgresApp/PostgresApp/releases/tag/v2.3.2
<https://github.com/PostgresApp/PostgresApp/releases/tag/v2.3.2&gt;

pg_restore's path is
/Applications/Postgres.app/Contents/Versions/latest/bin/pg_restore

Same as OP
If the number of jobs is 1
or setting host to unix socket

The error "pg_restore: error: a worker process died unexpectedly" won't be
raised
However this error doesn't happen on Ubuntu 18.04 version of psql
(
https://www.ubuntuupdates.org/package/postgresql/bionic-pgdg/main/base/postgresql-12
<https://www.ubuntuupdates.org/package/postgresql/bionic-pgdg/main/base/postgresql-12&gt;
)
So it might or might not be build issue for MacOS version
Though having the same issue on both distribution channels (homebrew &
psql.app) seem rare

Anyway reported also to psql.app too
https://github.com/PostgresApp/PostgresApp/issues/538
<https://github.com/PostgresApp/PostgresApp/issues/538&gt;

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html

#4sailor
pavlok@enapps.co.uk
In reply to: PikachuEXE (#3)
Re: BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

Hello,
It doesn't work for me also even on PG version 12.2

OS: Ubuntu 18.04.4 LTS
postgresql-12 12.2-1.pgdg18.04+1
amd64
pg_dump (PostgreSQL) 12.2 (Ubuntu 12.2-1.pgdg18.04+1)
pg_restore (PostgreSQL) 12.2 (Ubuntu 12.2-1.pgdg18.04+1)

pg_restore: error: could not find block ID 11550 in archive -- possibly due
to out-of-order restore request, which cannot be handled due to lack of data
offsets in archive
pg_restore: error: could not find block ID 11546 in archive -- possibly due
to out-of-order restore request, which cannot be handled due to lack of data
offsets in archive
pg_restore: error: a worker process died unexpectedly

However, it works with --jobs=1

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html

#5David Zhang
david.zhang@highgo.ca
In reply to: Tom Lane (#2)
Re: BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

Hi,

I can reproduce this pg_restore crash issue (pg_dump crash too when
running with multiple jobs) on MacOS 10.14 Mojave and MacOS 10.15
Catalina using following steps.

1. build pg_resotre from 12.2 with "--with-gssapi" enabled, or use the
release from https://www.postgresql.org/download/macosx/

2. start pg server and generate some load,
    pgbench -i -p 5432 -d postgres -s 10"

3. backup database,
    pg_dump -h localhost -Fc --no-acl --no-owner postgres > /tmp/128m

4. drop the tables,
    psql -d postgres -c "drop table pgbench_accounts; drop table
pgbench_branches; drop table pgbench_history; drop table pgbench_tellers;"

5. restore database,
    pg_restore -d postgres -h localhost -Fc /tmp/128m --jobs=2
    Password:
    pg_restore: error: a worker process died unexpectedly

6. check tables, all display size 0 bytes.
    postgres=# \d+
                              List of relations
     Schema |       Name       | Type  |  Owner   |  Size   | Description
--------+------------------+-------+----------+---------+-------------
     public | pgbench_accounts | table | postgres | 0 bytes |
     public | pgbench_branches | table | postgres | 0 bytes |
     public | pgbench_history  | table | postgres | 0 bytes |
     public | pgbench_tellers  | table | postgres | 0 bytes |
    (4 rows)

7. core dump, about 2G,
(lldb) bt all
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff6c29c44e
libdispatch.dylib`_dispatch_mgr_queue_push + 41
    frame #1: 0x00007fff41475a74
Security`___ZN8Security12KeychainCore14StorageManager14tickleKeychainEPNS0_12KeychainImplE_block_invoke_2
+ 76
    frame #2: 0x00007fff6c29250e
libdispatch.dylib`_dispatch_client_callout + 8
    frame #3: 0x00007fff6c29e567
libdispatch.dylib`_dispatch_lane_barrier_sync_invoke_and_complete + 60
    frame #4: 0x00007fff41475935
Security`Security::KeychainCore::StorageManager::tickleKeychain(Security::KeychainCore::KeychainImpl*)
+ 485
    frame #5: 0x00007fff412400d8
Security`Security::KeychainCore::KCCursorImpl::next(Security::KeychainCore::Item&)
+ 352
    frame #6: 0x00007fff41417975
Security`Security::KeychainCore::IdentityCursor::next(Security::SecPointer<Security::KeychainCore::Identity>&)
+ 217
    frame #7: 0x00007fff4143c4c3 Security`SecIdentitySearchCopyNext + 155
    frame #8: 0x00007fff414477d8
Security`SecItemCopyMatching_osx(__CFDictionary const*, void const**) + 261
    frame #9: 0x00007fff4144b024 Security`SecItemCopyMatching + 338
    frame #10: 0x00007fff56dab303 Heimdal`keychain_query + 531
    frame #11: 0x00007fff56da8f4c Heimdal`hx509_certs_find + 92
    frame #12: 0x00007fff56d67b52 Heimdal`_krb5_pk_find_cert + 466
    frame #13: 0x00007fff376da9bb GSS`_gsspku2u_acquire_cred + 619
    frame #14: 0x00007fff376bfc1c GSS`gss_acquire_cred + 940
    frame #15: 0x000000010016e6e1
libpq.5.dylib`pg_GSS_have_cred_cache(cred_out=0x0000000100505688) at
fe-gssapi-common.c:67:10
    frame #16: 0x000000010014f769
libpq.5.dylib`PQconnectPoll(conn=0x0000000100505310) at fe-connect.c:2785:22
    frame #17: 0x000000010014be9f
libpq.5.dylib`connectDBComplete(conn=0x0000000100505310) at
fe-connect.c:2095:10
    frame #18: 0x000000010014bb0c
libpq.5.dylib`PQconnectdbParams(keywords=0x00007ffeefbfeee0,
values=0x00007ffeefbfeea0, expand_dbname=1) at fe-connect.c:625:10
    frame #19: 0x000000010000ec20
pg_restore`ConnectDatabase(AHX=0x0000000100505070, dbname="postgres",
pghost="david.highgo.ca", pgport=0x0000000000000000, username="david",
prompt_password=TRI_DEFAULT) at pg_backup_db.c:287:20
    frame #20: 0x000000010000a75a
pg_restore`CloneArchive(AH=0x00000001002020f0) at
pg_backup_archiver.c:4850:3
    frame #21: 0x0000000100017b4b
pg_restore`RunWorker(AH=0x00000001002020f0, slot=0x0000000100221718) at
parallel.c:866:7
    frame #22: 0x00000001000179f5
pg_restore`ParallelBackupStart(AH=0x00000001002020f0) at parallel.c:1028:4
    frame #23: 0x0000000100004473
pg_restore`RestoreArchive(AHX=0x00000001002020f0) at
pg_backup_archiver.c:662:12
    frame #24: 0x0000000100001be4 pg_restore`main(argc=10,
argv=0x00007ffeefbff8f0) at pg_restore.c:447:3
    frame #25: 0x00007fff6c2eb7fd libdyld.dylib`start + 1
(lldb)

8. however it works with either,
    PGGSSENCMODE=disable pg_restore -d postgres -h localhost -Fc
/tmp/128m --jobs=2
or,
    pg_restore -d "dbname=postgres gssencmode=disable" -h localhost -Fc
/tmp/128m --jobs=2

9. pg_config output and versions, no SSL configured,

$ pg_config
BINDIR = /Users/david/sandbox/pg122/app/bin
DOCDIR = /Users/david/sandbox/pg122/app/share/doc/postgresql
HTMLDIR = /Users/david/sandbox/pg122/app/share/doc/postgresql
INCLUDEDIR = /Users/david/sandbox/pg122/app/include
PKGINCLUDEDIR = /Users/david/sandbox/pg122/app/include/postgresql
INCLUDEDIR-SERVER = /Users/david/sandbox/pg122/app/include/postgresql/server
LIBDIR = /Users/david/sandbox/pg122/app/lib
PKGLIBDIR = /Users/david/sandbox/pg122/app/lib/postgresql
LOCALEDIR = /Users/david/sandbox/pg122/app/share/locale
MANDIR = /Users/david/sandbox/pg122/app/share/man
SHAREDIR = /Users/david/sandbox/pg122/app/share/postgresql
SYSCONFDIR = /Users/david/sandbox/pg122/app/etc/postgresql
PGXS =
/Users/david/sandbox/pg122/app/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-gssapi' '--prefix=/Users/david/sandbox/pg122/app'
'--enable-debug' 'CFLAGS=-ggdb -O0 -fno-omit-frame-pointer'
CC = gcc
CPPFLAGS = -isysroot
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -g -ggdb -O0
-fno-omit-frame-pointer
CFLAGS_SL =
LDFLAGS = -Wl,-dead_strip_dylibs
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lgssapi_krb5 -lz -lreadline -lm
VERSION = PostgreSQL 12.2

$ lldb --version
lldb-1100.0.30.12
Apple Swift version 5.1.3 (swiftlang-1100.0.282.1 clang-1100.0.33.15)

$ klist --version
klist (Heimdal 1.5.1apple1)
Copyright 1995-2011 Kungliga Tekniska Högskolan
Send bug-reports to heimdal-bugs@h5l.org

Hopefully the above information can help.

On 2019-12-04 6:03 a.m., Tom Lane wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

The following bug has been logged on the website:
Bug reference: 16147
Logged by: Bill Tihen
Email address: btihen@gmail.com
PostgreSQL version: 12.1
Operating system: MacOS 10.15.1
Description:
The following command crashes with any database I've tried (both large and
small) DBs:
`pg_restore -U wti0405 -d stage3 -h localhost --jobs=8 -Fc
database_12_04-01-00.bak -x`

I failed to reproduce this on my own 10.15.1 laptop, using manual
builds of either HEAD or the v12 branch. Plausible reasons for
the difference in results might include:

* There's something different about the homebrew build (could we
see the output of pg_config?)

* There's something unusual about your configuration (one thought
that comes to mind: do you have SSL turned on for localhost
connections?)

* There's something about the data in this specific database
(your report that it happens for multiple databases puts a crimp
in this idea, though maybe they all share a common feature)

Anyway, we need more info to investigate. You might try looking
into the server log to see what the failure looks like from that
side --- is there a query error, or just the worker disconnecting
unexpectedly?

regards, tom lane

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

#6David Gilman
davidgilman1@gmail.com
In reply to: Tom Lane (#2)
3 attachment(s)
Re: BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

I did some of the digging here. The issue still exists in the latest
git master. A git bisect pointed me to commit
548e50976ce721b5e927d42a105c2f05b51b52a6 which did touch the
pg_restore work. The bug doesn't seem to be with the pg_dump -Fc
format as I was using an older dump version to do the bisect and it
could still trigger the issue in recent PostgreSQL. I have noticed
this issue on several Macs. Doing some Google searching you also see
people complaining about it on Mac:

https://dba.stackexchange.com/questions/257398/pg-restore-with-jobs-flag-results-in-pg-restore-error-a-worker-process-di

https://github.com/thoughtbot/parity/issues/175

mentioning things like Homebrew and Postgres.app that are only on Mac.
The recommendation seems to be to go back to pg_restore from
postgresql 11.

I tried a bit of sleuthing here. I've attached a file
toc_contents.txt which is the pg_restore -l output of my dump file
with some relevant bits grepped. I've attached a file
printf_output.txt that has some manual printfs shoved into pg_restore
to try and see what's going on. Note that I dropped all but the last
few restore_toc_entry calls. And finally I attached printf.patch so
you can see what I'm logging. The issue seems to be that pg_restore
is trying to find a TOC that was earlier in the file than the current
offset, never finds that TOC and fails with the "possibly due to
out-of-order restore request". I've run pg_restore with these patches
a few times now and it always fails in the same way (never seeks up in
the file) but in different places each time.

Tom, if you or anyone else with PostgreSQL would appreciate the
pg_dump file I can send it to you out of band, it's only a few
megabytes. I have pg_restore with debug symbols too if you want me to
try anything.

On Fri, May 15, 2020 at 9:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

PG Bug reporting form <noreply@postgresql.org> writes:

The following bug has been logged on the website:
Bug reference: 16147
Logged by: Bill Tihen
Email address: btihen@gmail.com
PostgreSQL version: 12.1
Operating system: MacOS 10.15.1
Description:

The following command crashes with any database I've tried (both large and
small) DBs:
`pg_restore -U wti0405 -d stage3 -h localhost --jobs=8 -Fc
database_12_04-01-00.bak -x`

I failed to reproduce this on my own 10.15.1 laptop, using manual
builds of either HEAD or the v12 branch. Plausible reasons for
the difference in results might include:

* There's something different about the homebrew build (could we
see the output of pg_config?)

* There's something unusual about your configuration (one thought
that comes to mind: do you have SSL turned on for localhost
connections?)

* There's something about the data in this specific database
(your report that it happens for multiple databases puts a crimp
in this idea, though maybe they all share a common feature)

Anyway, we need more info to investigate. You might try looking
into the server log to see what the failure looks like from that
side --- is there a query error, or just the worker disconnecting
unexpectedly?

regards, tom lane

--
David Gilman
:DG<

Attachments:

toc_contents.txttext/plain; charset=US-ASCII; name=toc_contents.txtDownload
printf.patchapplication/octet-stream; name=printf.patchDownload
diff --git src/bin/pg_dump/pg_backup_archiver.c src/bin/pg_dump/pg_backup_archiver.c
index 8f0b32ca17..449157a1b5 100644
--- src/bin/pg_dump/pg_backup_archiver.c
+++ src/bin/pg_dump/pg_backup_archiver.c
@@ -755,6 +755,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
 	int			status = WORKER_OK;
 	teReqs		reqs;
 	bool		defnDumped;
+    printf("restore_toc_entry initial AH->FH: %d\n", ftello(AH->FH));
 
 	AH->currentTE = te;
 
@@ -863,6 +864,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
 			if (AH->PrintTocDataPtr != NULL)
 			{
 				_printTocEntry(AH, te, true);
+                printf("restore_toc_entry _printTocEntry AH->FH: %d\n", ftello(AH->FH));
 
 				if (strcmp(te->desc, "BLOBS") == 0 ||
 					strcmp(te->desc, "BLOB COMMENTS") == 0)
@@ -870,6 +872,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
 					pg_log_info("processing %s", te->desc);
 
 					_selectOutputSchema(AH, "pg_catalog");
+                    printf("restore_toc_entry _selectOutputSchema AH->FH: %d\n", ftello(AH->FH));
 
 					/* Send BLOB COMMENTS data to ExecuteSimpleCommands() */
 					if (strcmp(te->desc, "BLOB COMMENTS") == 0)
diff --git src/bin/pg_dump/pg_backup_custom.c src/bin/pg_dump/pg_backup_custom.c
index 369dcea429..8eb7f465f8 100644
--- src/bin/pg_dump/pg_backup_custom.c
+++ src/bin/pg_dump/pg_backup_custom.c
@@ -415,10 +415,13 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
 	int			blkType;
 	int			id;
+    int         thisPid = getpid();
 
 	if (tctx->dataState == K_OFFSET_NO_DATA)
 		return;
 
+    printf("pid: %d, ctx->hasSeek: %d, tctx->dataState: %d, te->dumpId: %d\n", thisPid,
+            ctx->hasSeek, tctx->dataState, te->dumpId);
 	if (!ctx->hasSeek || tctx->dataState == K_OFFSET_POS_NOT_SET)
 	{
 		/*
@@ -427,6 +430,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 		 * are asked to restore items out-of-order.
 		 */
 		_readBlockHeader(AH, &blkType, &id);
+        printf("initial readBlockHeader: blkType: %d, id: %d, AH->FH: %d\n", blkType, id, ftello(AH->FH));
 
 		while (blkType != EOF && id != te->dumpId)
 		{
@@ -446,6 +450,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 					break;
 			}
 			_readBlockHeader(AH, &blkType, &id);
+            printf("readBlockHeader: blkType: %d, id: %d, AH->FH: %d\n", blkType, id, ftello(AH->FH));
 		}
 	}
 	else
@@ -496,6 +501,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 				  blkType);
 			break;
 	}
+    printf("\n");
 }
 
 /*
@@ -773,6 +779,7 @@ _PrepParallelRestore(ArchiveHandle *AH)
 	TocEntry   *prev_te = NULL;
 	lclTocEntry *prev_tctx = NULL;
 	TocEntry   *te;
+    printf("PrepParallelRestore\n");
 
 	/*
 	 * Knowing that the data items were dumped out in TOC order, we can
@@ -800,6 +807,7 @@ _PrepParallelRestore(ArchiveHandle *AH)
 
 		prev_te = te;
 		prev_tctx = tctx;
+        printf("PrepParallelRestore prev_te: %d, prev_te->dataLength: %d\n", prev_te, prev_te->dataLength);
 	}
 
 	/* If OK to seek, we can determine the length of the last item */
@@ -812,6 +820,7 @@ _PrepParallelRestore(ArchiveHandle *AH)
 		endpos = ftello(AH->FH);
 		if (endpos > prev_tctx->dataPos)
 			prev_te->dataLength = endpos - prev_tctx->dataPos;
+        printf("PrepParallelRestore endpos: %d, prev_te->dataLength: %d\n", endpos, prev_te->dataLength);
 	}
 }
 
printf_output.txttext/plain; charset=US-ASCII; name=printf_output.txtDownload
#7Justin Pryzby
pryzby@telsasoft.com
In reply to: David Zhang (#5)
Re: BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

On Thu, Mar 05, 2020 at 07:53:35PM -0800, David Zhang wrote:

I can reproduce this pg_restore crash issue (pg_dump crash too when running
with multiple jobs) on MacOS 10.14 Mojave and MacOS 10.15 Catalina using
following steps.

Isn't this the same as here?
/messages/by-id/16041-b44f9931ad91fc3d@postgresql.org
..concluding that macos library fails after forking.

..I found that via: https://github.com/PostgresApp/PostgresApp/issues/538
=> /messages/by-id/1575881854624-0.post@n3.nabble.com

--
Justin

#8David Gilman
davidgilman1@gmail.com
In reply to: Justin Pryzby (#7)
Re: BUG #16147: postgresql 12.1 (from homebrew) - pg_restore -h localhost --jobs=2 crashes

It looks like there is a whole bunch of problems here:

- Bill Tihen's original post about restoring over sockets
- sailor's post about "out-of-order restore request" failure
- David Zhang's crash inside of GSS

I was running into the "out-of-order restore request" issue. My post
above was the first half of my investigations and was incorrect. The
"out-of-order restore request" comes from custom dumps who were
written to a non-seekable file descriptor. They silently lack certain
metadata and can't be used for parallel restores in PostgreSQL 12 or
later. If you're running into "out-of-order restore request" errors
try running pg_dump with the -f flag to force it to create a dump file
with the correct metadata. If you can't change how the dumps are made
you can't use parallel pg_restore with them. There's isn't a code fix
to this problem, everything is working as expected, but I've sent in a
patch to help document the behavior.
/messages/by-id/CALBH9DDuJ+scZc4MEvw5uO-=vRyR2=QF9+Yh=3hPEnKHWfS81A@mail.gmail.com

On Mon, May 18, 2020 at 10:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

On Thu, Mar 05, 2020 at 07:53:35PM -0800, David Zhang wrote:

I can reproduce this pg_restore crash issue (pg_dump crash too when running
with multiple jobs) on MacOS 10.14 Mojave and MacOS 10.15 Catalina using
following steps.

Isn't this the same as here?
/messages/by-id/16041-b44f9931ad91fc3d@postgresql.org
..concluding that macos library fails after forking.

..I found that via: https://github.com/PostgresApp/PostgresApp/issues/538
=> /messages/by-id/1575881854624-0.post@n3.nabble.com

--
Justin

--
David Gilman
:DG<