Warn when parallel restoring a custom dump without data offsets

Started by David Gilman over 5 years ago, 17 messages
#1 David Gilman
davidgilman1@gmail.com
1 attachment(s)

If pg_dump can't seek on its output stream when writing a dump in the
custom archive format (possibly because you piped its stdout to a file)
it can't update that file with data offsets. These files will often
break parallel restoration. Warn when the user is doing pg_restore on
such a file to give them a hint as to why their restore is about to
fail.
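
To make "can't seek" concrete: the check boils down to whether the output
stream can be repositioned. A minimal sketch of that idea, assuming plain
POSIX stdio rather than pg_dump's actual checkSeek(), looks something like
this:

/*
 * Illustration only, not pg_dump's actual checkSeek(): treat the output
 * as seekable if we can read its position and seek back to it.  On a
 * pipe, ftello()/fseeko() fail, so pg_dump can never go back and fill
 * in the data offsets in the TOC it wrote at the start of the file.
 */
#include <sys/types.h>
#include <stdio.h>
#include <stdbool.h>

static bool
output_is_seekable(FILE *fp)
{
    off_t   pos = ftello(fp);

    if (pos < 0)
        return false;       /* position unknown, e.g. a pipe */
    return fseeko(fp, pos, SEEK_SET) == 0;
}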

The documentation for pg_restore -j is also updated to suggest that you
dump custom archive formats with the -f option.
---
doc/src/sgml/ref/pg_restore.sgml | 9 +++++++++
src/bin/pg_dump/pg_backup_custom.c | 8 ++++++++
2 files changed, 17 insertions(+)

Attachments:

0001-Warn-when-parallel-restoring-a-custom-dump-without-d.patch (application/octet-stream)
From 876338886ab5f23a5379e180f87ff7011fb3cd80 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 16 May 2020 16:27:17 -0400
Subject: [PATCH] Warn when parallel restoring a custom dump without data
 offsets

If pg_dump can't seek on its output stream when writing a dump in the
custom archive format (possibly because you piped its stdout to a file)
it can't update that file with data offsets. These files will often
break parallel restoration. Warn when the user is doing pg_restore on
such a file to give them a hint as to why their restore is about to
fail.

The documentation for pg_restore -j is also updated to suggest that you
dump custom archive formats with the -f option.
---
 doc/src/sgml/ref/pg_restore.sgml   | 9 +++++++++
 src/bin/pg_dump/pg_backup_custom.c | 8 ++++++++
 2 files changed, 17 insertions(+)

diff --git doc/src/sgml/ref/pg_restore.sgml doc/src/sgml/ref/pg_restore.sgml
index 232f88024f..3c0f56db51 100644
--- doc/src/sgml/ref/pg_restore.sgml
+++ doc/src/sgml/ref/pg_restore.sgml
@@ -279,6 +279,15 @@ PostgreSQL documentation
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>
+
+       <para>
+        The custom archive format may not work with the <option>-j</option>
+        option if the archive was originally created by redirecting the
+        standard output stream of <application>pg_dump</application>.
+        For the best concurrent restoration performance with the custom
+        archive format use <application>pg_dump</application>'s
+        <option>--file</option> option to specify an output file.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git src/bin/pg_dump/pg_backup_custom.c src/bin/pg_dump/pg_backup_custom.c
index 369dcea429..9473e43b1b 100644
--- src/bin/pg_dump/pg_backup_custom.c
+++ src/bin/pg_dump/pg_backup_custom.c
@@ -773,6 +773,7 @@ _PrepParallelRestore(ArchiveHandle *AH)
 	TocEntry   *prev_te = NULL;
 	lclTocEntry *prev_tctx = NULL;
 	TocEntry   *te;
+	int        hasPos = 0;
 
 	/*
 	 * Knowing that the data items were dumped out in TOC order, we can
@@ -800,6 +801,7 @@ _PrepParallelRestore(ArchiveHandle *AH)
 
 		prev_te = te;
 		prev_tctx = tctx;
+		hasPos = 1;
 	}
 
 	/* If OK to seek, we can determine the length of the last item */
@@ -811,8 +813,14 @@ _PrepParallelRestore(ArchiveHandle *AH)
 			fatal("error during file seek: %m");
 		endpos = ftello(AH->FH);
 		if (endpos > prev_tctx->dataPos)
+		{
 			prev_te->dataLength = endpos - prev_tctx->dataPos;
+			hasPos = 1;
+		}
 	}
+
+	if (!hasPos)
+		pg_log_warning("dump file lacks all data offsets possibly preventing parallel restoration");
 }
 
 /*
-- 
2.26.2

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: David Gilman (#1)
Re: Warn when parallel restoring a custom dump without data offsets

On Sat, May 16, 2020 at 04:57:46PM -0400, David Gilman wrote:

If pg_dump can't seek on its output stream when writing a dump in the
custom archive format (possibly because you piped its stdout to a file)
it can't update that file with data offsets. These files will often
break parallel restoration. Warn when the user is doing pg_restore on
such a file to give them a hint as to why their restore is about to
fail.

You didn't say so, but I gather this is related to this other thread (which
seems to represent two separate issues).
/messages/by-id/1582010626326-0.post@n3.nabble.com

Tom, if you or anyone else working on PostgreSQL would like the
pg_dump file, I can send it to you out of band; it's only a few
megabytes. I have pg_restore with debug symbols too if you want me to
try anything.

Would you send it to me or post a link to a file-sharing site so I can try
to reproduce it? So far, no luck.

You should include your diagnosis from that thread here, or add it to the
commit message, and mention the suspect commit (548e50976). Eventually, add
the patch to the next commitfest: https://commitfest.postgresql.org/

I guess you're also involved in this conversation:
https://dba.stackexchange.com/questions/257398/pg-restore-with-jobs-flag-results-in-pg-restore-error-a-worker-process-di

--
Justin

#3 Justin Pryzby
pryzby@telsasoft.com
In reply to: David Gilman (#1)
Re: Warn when parallel restoring a custom dump without data offsets

I started fooling with this at home while our ISP is down (pardon my brevity).

Maybe you also saw commit b779ea8a9a2dc3a089b3ac152b1ec4568bfeb26f
"Fix pg_restore so parallel restore doesn't fail when the input file
doesn't contain data offsets (which it won't, if pg_dump thought its
output wasn't seekable)..."

...which I guess should actually say "doesn't NECESSARILY fail", since
it also adds this comment:
"This could fail if we are asked to restore items out-of-order."

So this is a known issue and not a regression. I think the PG11
commit you mentioned (548e5097) happens to make some databases fail in
parallel restore that previously worked (I didn't check). Possibly
some databases (or some pre-existing dumps) that used to fail now
succeed, too.

Your patch adds a warning when a dump written to unseekable output might
fail during parallel restore. I'm not opposed to that, but can we just
make pg_restore work in that case? If the input is unseekable, then we can
never do a parallel restore at all. If it *is* seekable, could we
make _PrintTocData rewind with fseeko(AH->FH, 0, SEEK_SET) when it gets
to EOF and re-scan from the beginning? Would you want to try that?

#4 David Gilman
davidgilman1@gmail.com
In reply to: Justin Pryzby (#3)
Re: Warn when parallel restoring a custom dump without data offsets

Your understanding of the issue is mostly correct:

I think the PG11
commit you mentioned (548e5097) happens to make some databases fail in
parallel restore that previously worked (I didn't check).

Correct, if you do the bisect around that yourself you'll see
pg_restore start failing with the expected "possibly due to
out-of-order restore request" on offset-less dumps. It is a known
issue, but it's only documented in code comments, not anywhere
user-facing, which is sending people to StackOverflow.

If the input is unseekable, then we can
never do a parallel restore at all.

I don't know if this is strictly true. Imagine a database dump of a
single large table with a few indexes, simple enough that everything in
the file is going to be in restore order. It might seem silly to
parallel-restore a single table, but remember that pg_restore also
creates indexes in parallel, and on a typical development workstation
with a few CPU cores and an SSD that's a substantial improvement. There
are probably some other corner cases where you can get lucky with an
offset-less dump and it'll work. That's why my gut instinct was to warn
instead of fail.

If it *is* seekable, could we
make _PrintTocData rewind with fseeko(AH->FH, 0, SEEK_SET) when it gets
to EOF and re-scan from the beginning? Would you want to try that?

I will try this and report back. I will also see if I can get an strace.

--
David Gilman
:DG<

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Gilman (#4)
Re: Warn when parallel restoring a custom dump without data offsets

David Gilman <davidgilman1@gmail.com> writes:

I think the PG11
commit you mentioned (548e5097) happens to make some databases fail in
parallel restore that previously worked (I didn't check).

Correct, if you do the bisect around that yourself you'll see
pg_restore start failing with the expected "possibly due to
out-of-order restore request" on offset-less dumps.

Yeah. Now, the whole point of that patch was to decouple the restore
order from the dump order ... but with an offset-less dump file, we
can't do that, or at least the restore order is greatly constrained.
I wonder if it'd be sensible for pg_restore to use a different parallel
scheduling algorithm if it notices that the input lacks offsets.
(There could still be some benefit from parallelism, just not as much.)
No idea if this is going to be worth the trouble, but it probably
is worth looking into.

regards, tom lane

#6 David Gilman
davidgilman1@gmail.com
In reply to: Tom Lane (#5)
1 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

I did some more digging. To keep everyone on the same page, there are
four different ways to order TOC entries:

1. topological order,
2. dataLength order (the size of the table's data, which is always zero when pg_dump can't seek),
3. dumpId order, which should be thought of as random but roughly
correlates with topological order to make things fun,
4. file order, the order in which tables are physically stored in the
custom dump file.

Without being able to seek backwards, a parallel restore of the custom
dump archive format has to be ordered by #1 and #4. The reference
counting that reduce_dependencies does inside the parallel restore
logic upholds ordering #1. Unfortunately, 548e50976ce changed
TocEntrySizeCompare (which is used to break ties within #1) to order
by #2, then by #3. This most often breaks on dumps written by pg_dump
without seeks (everything has a dataLength of zero), as it then falls
back to #3 ordering every time. But because nothing in pg_restore does
any ordering by #4, you could potentially run into this with any custom
dump, so I think it's a regression.
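
To make the tie-breaking concrete, here is a rough sketch of the comparison
described above, using a stripped-down stand-in struct rather than pg_dump's
real TocEntry or the actual TocEntrySizeCompare() code. With no data offsets
every dataLength is zero, so the comparison always falls through to dumpId
order:

/*
 * Sketch only: stand-ins for the relevant TocEntry fields and for the
 * tie-breaking comparison; the real TocEntrySizeCompare() lives in
 * pg_backup_archiver.c and differs in detail.
 */
#include <stdlib.h>

typedef struct
{
    int     dumpId;         /* #3: roughly correlates with topological order */
    long    dataLength;     /* #2: always zero when pg_dump could not seek */
} TocEntryStub;

static int
toc_size_compare_stub(const void *a, const void *b)
{
    const TocEntryStub *te1 = a;
    const TocEntryStub *te2 = b;

    /* bigger data items sort first (#2) ... */
    if (te1->dataLength != te2->dataLength)
        return (te1->dataLength < te2->dataLength) ? 1 : -1;
    /* ... and ties fall back to dumpId (#3), never file order (#4) */
    return (te1->dumpId > te2->dumpId) - (te1->dumpId < te2->dumpId);
}

Sorting an offset-less ready list with qsort(entries, n,
sizeof(TocEntryStub), toc_size_compare_stub) therefore orders it by dumpId
alone, which is why nothing ever preserves file order (#4).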

For some troubleshooting I changed ready_list_sort to never call
qsort. This fixes the problem by never ordering by #3, leaving things
in #4 order, but breaks the new algorithm introduced in 548e50976ce.

I did what Justin suggested earlier and it works great. Parallel
restore requires seekable input (enforced elsewhere) so everyone's
parallel restores should work again.

On Wed, May 20, 2020 at 10:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Gilman <davidgilman1@gmail.com> writes:

I think the PG11
commit you mentioned (548e5097) happens to make some databases fail in
parallel restore that previously worked (I didn't check).

Correct, if you do the bisect around that yourself you'll see
pg_restore start failing with the expected "possibly due to
out-of-order restore request" on offset-less dumps.

Yeah. Now, the whole point of that patch was to decouple the restore
order from the dump order ... but with an offset-less dump file, we
can't do that, or at least the restore order is greatly constrained.
I wonder if it'd be sensible for pg_restore to use a different parallel
scheduling algorithm if it notices that the input lacks offsets.
(There could still be some benefit from parallelism, just not as much.)
No idea if this is going to be worth the trouble, but it probably
is worth looking into.

regards, tom lane

--
David Gilman
:DG<

Attachments:

0001-pg_restore-fix-v2.patch (application/octet-stream)
From 63ae0cabc3691dffc8227ac6f9d23aca96ea0095 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Wed, 20 May 2020 22:49:28 -0400
Subject: [PATCH] pg_restore fix v2

commit message to come
---
 doc/src/sgml/ref/pg_restore.sgml   |  9 +++++++++
 src/bin/pg_dump/pg_backup_custom.c | 20 ++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git doc/src/sgml/ref/pg_restore.sgml doc/src/sgml/ref/pg_restore.sgml
index 232f88024f..fd23d6720c 100644
--- doc/src/sgml/ref/pg_restore.sgml
+++ doc/src/sgml/ref/pg_restore.sgml
@@ -279,6 +279,15 @@ PostgreSQL documentation
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>
+
+       <para>
+        The custom archive format may not work with the <option>-j</option>
+        option if the archive was originally created by writing the archive
+        to an unseekable output file. For the best concurrent restoration
+        performance with the custom archive format use
+        <application>pg_dump</application>'s <option>--file</option> option
+        to specify an output file.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git src/bin/pg_dump/pg_backup_custom.c src/bin/pg_dump/pg_backup_custom.c
index 369dcea429..8dfb6581d1 100644
--- src/bin/pg_dump/pg_backup_custom.c
+++ src/bin/pg_dump/pg_backup_custom.c
@@ -423,9 +423,25 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.
 		 */
+
+		if (ctx->hasSeek)
+		{
+			/*
+			 * Start searching from the first block. If this is possible, we're
+			 * all but guaranteed to find the block, although at the cost of a bunch
+			 * of redundant, tiny reads. TOC requests aren't guaranteed to come in
+			 * disk order so this is a necessary evil.
+			 *
+			 * If the input file can't be seeked we're at the mercy of the
+			 * file's TOC layout on disk. An out-of-order restore request will
+			 * halt the restore.
+			 */
+			if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+				fatal("error during file seek: %m");
+		}
+
 		_readBlockHeader(AH, &blkType, &id);
 
 		while (blkType != EOF && id != te->dumpId)
-- 
2.26.2

#7 David Gilman
davidgilman1@gmail.com
In reply to: David Gilman (#6)
3 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

I've rounded this patch out with a test and I've set up the commitfest
website for this thread. The latest patches are attached and I think
they are ready for review.

On Wed, May 20, 2020 at 11:05 PM David Gilman <davidgilman1@gmail.com> wrote:

I did some more digging. To keep everyone on the same page, there are
four different ways to order TOC entries:

1. topological order,
2. dataLength order (the size of the table's data, which is always zero when pg_dump can't seek),
3. dumpId order, which should be thought of as random but roughly
correlates with topological order to make things fun,
4. file order, the order in which tables are physically stored in the
custom dump file.

Without being able to seek backwards, a parallel restore of the custom
dump archive format has to be ordered by #1 and #4. The reference
counting that reduce_dependencies does inside the parallel restore
logic upholds ordering #1. Unfortunately, 548e50976ce changed
TocEntrySizeCompare (which is used to break ties within #1) to order
by #2, then by #3. This most often breaks on dumps written by pg_dump
without seeks (everything has a dataLength of zero), as it then falls
back to #3 ordering every time. But because nothing in pg_restore does
any ordering by #4, you could potentially run into this with any custom
dump, so I think it's a regression.

For some troubleshooting I changed ready_list_sort to never call
qsort. This fixes the problem by never ordering by #3, leaving things
in #4 order, but breaks the new algorithm introduced in 548e50976ce.

I did what Justin suggested earlier and it works great. Parallel
restore requires seekable input (enforced elsewhere) so everyone's
parallel restores should work again.

On Wed, May 20, 2020 at 10:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Gilman <davidgilman1@gmail.com> writes:

I think the PG11
commit you mentioned (548e5097) happens to make some databases fail in
parallel restore that previously worked (I didn't check).

Correct, if you do the bisect around that yourself you'll see
pg_restore start failing with the expected "possibly due to
out-of-order restore request" on offset-less dumps.

Yeah. Now, the whole point of that patch was to decouple the restore
order from the dump order ... but with an offset-less dump file, we
can't do that, or at least the restore order is greatly constrained.
I wonder if it'd be sensible for pg_restore to use a different parallel
scheduling algorithm if it notices that the input lacks offsets.
(There could still be some benefit from parallelism, just not as much.)
No idea if this is going to be worth the trouble, but it probably
is worth looking into.

regards, tom lane

--
David Gilman
:DG<

--
David Gilman
:DG<

Attachments:

0003-Add-integration-test-for-out-of-order-TOC-requests-i.patch (application/octet-stream)
From 58c4c37f78194e9fdc1b167cf224980ffd8e7660 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 23 May 2020 10:49:33 -0400
Subject: [PATCH 3/3] Add integration test for out-of-order TOC requests in
 pg_restore

Also add undocumented --disable-seeking argument to pg_dump to emulate
writing to an unseekable output file.
---
 src/bin/pg_dump/pg_backup.h          |  4 +-
 src/bin/pg_dump/pg_backup_archiver.c | 14 +++--
 src/bin/pg_dump/pg_backup_archiver.h |  1 +
 src/bin/pg_dump/pg_backup_custom.c   | 10 +++-
 src/bin/pg_dump/pg_dump.c            |  6 +-
 src/bin/pg_dump/t/002_pg_dump.pl     | 87 ++++++++++++++++++++++++++--
 6 files changed, 109 insertions(+), 13 deletions(-)

diff --git src/bin/pg_dump/pg_backup.h src/bin/pg_dump/pg_backup.h
index 8c0cedcd98..160d705eb5 100644
--- src/bin/pg_dump/pg_backup.h
+++ src/bin/pg_dump/pg_backup.h
@@ -158,6 +158,7 @@ typedef struct _dumpOptions
 	int			use_setsessauth;
 	int			enable_row_security;
 	int			load_via_partition_root;
+	int			disable_seeking;
 
 	/* default, if no "inclusion" switches appear, is to dump everything */
 	bool		include_everything;
@@ -270,7 +271,8 @@ extern Archive *OpenArchive(const char *FileSpec, const ArchiveFormat fmt);
 /* Create a new archive */
 extern Archive *CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 							  const int compression, bool dosync, ArchiveMode mode,
-							  SetupWorkerPtrType setupDumpWorker);
+							  SetupWorkerPtrType setupDumpWorker,
+							  bool disableSeeking);
 
 /* The --list option */
 extern void PrintTOCSummary(Archive *AH);
diff --git src/bin/pg_dump/pg_backup_archiver.c src/bin/pg_dump/pg_backup_archiver.c
index 8f0b32ca17..0e5d4e0a60 100644
--- src/bin/pg_dump/pg_backup_archiver.c
+++ src/bin/pg_dump/pg_backup_archiver.c
@@ -69,7 +69,8 @@ typedef struct _parallelReadyList
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const int compression, bool dosync, ArchiveMode mode,
-							   SetupWorkerPtrType setupWorkerPtr);
+							   SetupWorkerPtrType setupWorkerPtr,
+							   bool disableSeeking);
 static void _getObjectDescription(PQExpBuffer buf, TocEntry *te,
 								  ArchiveHandle *AH);
 static void _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData);
@@ -233,11 +234,12 @@ setupRestoreWorker(Archive *AHX)
 Archive *
 CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 			  const int compression, bool dosync, ArchiveMode mode,
-			  SetupWorkerPtrType setupDumpWorker)
+			  SetupWorkerPtrType setupDumpWorker,
+			  bool disableSeeking)
 
 {
 	ArchiveHandle *AH = _allocAH(FileSpec, fmt, compression, dosync,
-								 mode, setupDumpWorker);
+								 mode, setupDumpWorker, disableSeeking);
 
 	return (Archive *) AH;
 }
@@ -247,7 +249,7 @@ CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 Archive *
 OpenArchive(const char *FileSpec, const ArchiveFormat fmt)
 {
-	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker);
+	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker, false);
 
 	return (Archive *) AH;
 }
@@ -2274,7 +2276,8 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 static ArchiveHandle *
 _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 		 const int compression, bool dosync, ArchiveMode mode,
-		 SetupWorkerPtrType setupWorkerPtr)
+		 SetupWorkerPtrType setupWorkerPtr,
+		 bool disableSeeking)
 {
 	ArchiveHandle *AH;
 
@@ -2325,6 +2328,7 @@ _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 	AH->mode = mode;
 	AH->compression = compression;
 	AH->dosync = dosync;
+	AH->disableSeeking = disableSeeking;
 
 	memset(&(AH->sqlparse), 0, sizeof(AH->sqlparse));
 
diff --git src/bin/pg_dump/pg_backup_archiver.h src/bin/pg_dump/pg_backup_archiver.h
index f9e6b42752..acf38c20da 100644
--- src/bin/pg_dump/pg_backup_archiver.h
+++ src/bin/pg_dump/pg_backup_archiver.h
@@ -340,6 +340,7 @@ struct _archiveHandle
 	bool		dosync;			/* data requested to be synced on sight */
 	ArchiveMode mode;			/* File mode - r or w */
 	void	   *formatData;		/* Header data specific to file format */
+	bool		disableSeeking;	/* Don't use fseeko() */
 
 	/* these vars track state to avoid sending redundant SET commands */
 	char	   *currUser;		/* current username, or NULL if unknown */
diff --git src/bin/pg_dump/pg_backup_custom.c src/bin/pg_dump/pg_backup_custom.c
index 8dfb6581d1..332c557616 100644
--- src/bin/pg_dump/pg_backup_custom.c
+++ src/bin/pg_dump/pg_backup_custom.c
@@ -145,6 +145,10 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
 
 	ctx->filePos = 0;
+	/* Question for reviewer: why is this int and not bool? Should I fix it? */
+	/* does int vs bool actually matter in the ctx struct? */
+	/* checkSeek() returns bool */
+	ctx->hasSeek = 0;
 
 	/*
 	 * Now open the file
@@ -164,7 +168,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open output file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 	}
 	else
 	{
@@ -181,7 +186,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open input file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 
 		ReadHead(AH);
 		ReadToc(AH);
diff --git src/bin/pg_dump/pg_dump.c src/bin/pg_dump/pg_dump.c
index a4e949c636..53357557c3 100644
--- src/bin/pg_dump/pg_dump.c
+++ src/bin/pg_dump/pg_dump.c
@@ -319,6 +319,7 @@ main(int argc, char **argv)
 	int			plainText = 0;
 	ArchiveFormat archiveFormat = archUnknown;
 	ArchiveMode archiveMode;
+	bool		disableSeeking = false;
 
 	static DumpOptions dopt;
 
@@ -360,6 +361,7 @@ main(int argc, char **argv)
 		{"binary-upgrade", no_argument, &dopt.binary_upgrade, 1},
 		{"column-inserts", no_argument, &dopt.column_inserts, 1},
 		{"disable-dollar-quoting", no_argument, &dopt.disable_dollar_quoting, 1},
+		{"disable-seeking", no_argument, &dopt.disable_seeking, 1},
 		{"disable-triggers", no_argument, &dopt.disable_triggers, 1},
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
@@ -673,6 +675,8 @@ main(int argc, char **argv)
 	if (archiveFormat == archNull)
 		plainText = 1;
 
+	disableSeeking = !!dopt.disable_seeking;
+
 	/* Custom and directory formats are compressed by default, others not */
 	if (compressLevel == -1)
 	{
@@ -715,7 +719,7 @@ main(int argc, char **argv)
 
 	/* Open the output file */
 	fout = CreateArchive(filename, archiveFormat, compressLevel, dosync,
-						 archiveMode, setupDumpWorker);
+						 archiveMode, setupDumpWorker, disableSeeking);
 
 	/* Make dump options accessible right away */
 	SetArchiveOptions(fout, &dopt, NULL);
diff --git src/bin/pg_dump/t/002_pg_dump.pl src/bin/pg_dump/t/002_pg_dump.pl
index e116235769..e7b1355f1c 100644
--- src/bin/pg_dump/t/002_pg_dump.pl
+++ src/bin/pg_dump/t/002_pg_dump.pl
@@ -31,9 +31,20 @@ my $tempdir_short = TestLib::tempdir_short;
 # generate a text file to run through the tests from the
 # non-text file generated by pg_dump.
 #
-# TODO: Have pg_restore actually restore to an independent
-# database and then pg_dump *that* database (or something along
-# those lines) to validate that part of the process.
+# secondary_dump_cmd is an optional key that enables a
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, real database) ->
+#	pg_dump (secondary_dump_cmd, SQL output (for matching))
+#
+# test process instead of the default
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, SQL output (for matching))
+#
+# test process. The secondary_dump database is created for
+# you; your restore_cmd and secondary_dump_cmd must read
+# from and write to it.
 
 my %pgdump_runs = (
 	binary_upgrade => {
@@ -140,6 +151,24 @@ my %pgdump_runs = (
 		],
 	},
 
+	defaults_custom_format_no_seek_parallel_restore => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump', '-Fc', '-Z6', '--no-sync', '--disable-seeking',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.dump", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fc', '--jobs=2',
+			'--dbname=secondary_dump',
+			"$tempdir/defaults_custom_format_no_seek_parallel_restore.dump",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.sql",
+			'secondary_dump',
+		],
+	},
+
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_dir_format => {
 		test_key => 'defaults',
@@ -3298,6 +3327,27 @@ my %tests = (
 			%full_runs, %dump_test_schema_runs, section_pre_data => 1,
 		},
 		unlike => { exclude_dump_test_schema => 1 },
+	},
+
+	# This reliably produced the "possibly due to out-of-order restore request"
+	# failure when restoring with -j 2 at the time of writing, but future changes
+	# to how pg_dump works could accidentally order things correctly, as we're
+	# not really telling pg_dump how to do its on-disk layout.
+	'custom dump out-of-order' => {
+		create_sql   => '
+			CREATE TABLE ooo_parent0 (id integer primary key);
+			CREATE TABLE ooo_child0 (parent0 int references ooo_parent0 (id));
+			CREATE TABLE ooo_parent1 (id integer primary key);
+			CREATE TABLE ooo_child1 (parent0 int references ooo_parent1 (id));',
+		regexp => qr/^
+			 \QCREATE TABLE public.ooo_child0 (\E\n
+			 \s+\Qparent0 integer\E\n
+			 \Q);\E\n/xm,
+		like => {
+			%full_runs, section_pre_data => 1,
+			defaults_custom_format_no_seek => 1,
+			defaults_custom_format_no_seek_parallel_restore => 1,
+		},
 	});
 
 #########################################
@@ -3350,6 +3400,11 @@ foreach my $run (sort keys %pgdump_runs)
 		$num_tests++;
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$num_tests++;
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3491,7 +3546,12 @@ command_fails_like(
 #########################################
 # Run all runs
 
-foreach my $run (sort keys %pgdump_runs)
+foreach my $run (
+	sort {
+		defined($pgdump_runs{$a}->{secondary_dump_cmd}) <=>
+		  defined($pgdump_runs{$b}->{secondary_dump_cmd})
+		  || $a cmp $b
+	} keys %pgdump_runs)
 {
 	my $test_key = $run;
 	my $run_db   = 'postgres';
@@ -3499,12 +3559,23 @@ foreach my $run (sort keys %pgdump_runs)
 	$node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
 		"$run: pg_dump runs");
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('postgres', 'create database secondary_dump;');
+	}
+
 	if ($pgdump_runs{$run}->{restore_cmd})
 	{
 		$node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
 			"$run: pg_restore runs");
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->command_ok(\@{ $pgdump_runs{$run}->{secondary_dump_cmd} },
+			"$run: secondary pg_dump runs");
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3561,6 +3632,14 @@ foreach my $run (sort keys %pgdump_runs)
 			}
 		}
 	}
+
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('secondary_dump',
+			'alter subscription sub1 set (slot_name = NONE);');
+		$node->safe_psql('secondary_dump', 'drop subscription sub1;');
+		$node->safe_psql('postgres',       'drop database secondary_dump;');
+	}
 }
 
 #########################################
-- 
2.26.2

0001-Remove-unused-seek-check-in-tar-dump-format.patch (application/octet-stream)
From 5ffc0420bc6040cc26b1a36ac522b696ef428464 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Fri, 22 May 2020 17:27:51 -0400
Subject: [PATCH 1/3] Remove unused seek check in tar dump format

---
 src/bin/pg_dump/pg_backup_tar.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git src/bin/pg_dump/pg_backup_tar.c src/bin/pg_dump/pg_backup_tar.c
index d5bfa55646..bb5d3b1f45 100644
--- src/bin/pg_dump/pg_backup_tar.c
+++ src/bin/pg_dump/pg_backup_tar.c
@@ -82,7 +82,6 @@ typedef struct
 
 typedef struct
 {
-	int			hasSeek;
 	pgoff_t		filePos;
 	TAR_MEMBER *blobToc;
 	FILE	   *tarFH;
@@ -192,8 +191,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 		 */
 		/* setvbuf(ctx->tarFH, NULL, _IONBF, 0); */
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * We don't support compression because reading the files back is not
 		 * possible since gzdopen uses buffered IO which totally screws file
@@ -226,8 +223,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 
 		ctx->tarFHpos = 0;
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * Forcibly unmark the header as read since we use the lookahead
 		 * buffer
-- 
2.26.2

0002-Scan-all-TOCs-when-restoring-a-custom-dump-file-with.patch (application/octet-stream)
From 130b7614b4592ea2f0fee44b7a7b59703e80806e Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Wed, 20 May 2020 22:49:28 -0400
Subject: [PATCH 2/3] Scan all TOCs when restoring a custom dump file without
 offsets

TOC requests are not guaranteed to come in disk order. If the custom
dump file was written with data offsets pg_restore can seek directly to
the data, making request order irrelevant. If there are no data offsets
pg_restore would only find the TOC if it happened to be immediately
after the current one on disk. 548e50976 changed how pg_restore's
parallel algorithm worked at the cost of greatly increasing out-of-order
TOC requests.

This patch changes pg_restore to scan through all TOCs to service a TOC
read request only when restoring a custom dump file without data offsets.
The odds of getting a successful parallel restore go way up at the cost
of a bunch of extra tiny reads when pg_restore starts up.

The pg_restore manpage now warns against running pg_dump with an
unseekable output file and suggests that if you plan on doing a parallel
restore of a custom dump you should run pg_dump with --file.
---
 doc/src/sgml/ref/pg_restore.sgml   |  9 +++++++++
 src/bin/pg_dump/pg_backup_custom.c | 20 ++++++++++++++++++--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git doc/src/sgml/ref/pg_restore.sgml doc/src/sgml/ref/pg_restore.sgml
index 232f88024f..fd23d6720c 100644
--- doc/src/sgml/ref/pg_restore.sgml
+++ doc/src/sgml/ref/pg_restore.sgml
@@ -279,6 +279,15 @@ PostgreSQL documentation
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>
+
+       <para>
+        The custom archive format may not work with the <option>-j</option>
+        option if the archive was originally created by writing the archive
+        to an unseekable output file. For the best concurrent restoration
+        performance with the custom archive format use
+        <application>pg_dump</application>'s <option>--file</option> option
+        to specify an output file.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git src/bin/pg_dump/pg_backup_custom.c src/bin/pg_dump/pg_backup_custom.c
index 369dcea429..8dfb6581d1 100644
--- src/bin/pg_dump/pg_backup_custom.c
+++ src/bin/pg_dump/pg_backup_custom.c
@@ -423,9 +423,25 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.
 		 */
+
+		if (ctx->hasSeek)
+		{
+			/*
+			 * Start searching from the first block. If this is possible, we're
+			 * all but guaranteed to find the block, although at the cost of a bunch
+			 * of redundant, tiny reads. TOC requests aren't guaranteed to come in
+			 * disk order so this is a necessary evil.
+			 *
+			 * If the input file can't be seeked we're at the mercy of the
+			 * file's TOC layout on disk. An out-of-order restore request will
+			 * halt the restore.
+			 */
+			if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+				fatal("error during file seek: %m");
+		}
+
 		_readBlockHeader(AH, &blkType, &id);
 
 		while (blkType != EOF && id != te->dumpId)
-- 
2.26.2

#8 Justin Pryzby
pryzby@telsasoft.com
In reply to: David Gilman (#7)
Re: Warn when parallel restoring a custom dump without data offsets

On Sat, May 23, 2020 at 03:54:30PM -0400, David Gilman wrote:

I've rounded this patch out with a test and I've set up the commitfest
website for this thread. The latest patches are attached and I think
they are ready for review.

Thanks. https://commitfest.postgresql.org/28/2568/
I'm not sure this will be considered a bugfix, since the behavior is known.
Maybe eligible for backpatch though (?)

Your patch was encoded, so this is failing:
http://cfbot.cputube.org/david-gilman.html

Ideally CFBOT would deal with that (maybe by using git-am; adding Thomas), but
I think you sent it using the Gmail web interface, which also reordered the
patches. (CFBOT *does* sort them, but it's a known annoyance).

dump file was written with data offsets pg_restore can seek directly to

offsets COMMA

pg_restore would only find the TOC if it happened to be immediately

"immediately" is wrong, no ? I thought the problem was if we seeked to D and
then looked for C, we wouldn't attempt to go backwards.

read request only when restoring a custom dump file without data offsets.

remove "only"

of a bunch of extra tiny reads when pg_restore starts up.

I would have thought to mention the seeks() ; but it's true that the read()s now
grow quadratically. I did run a test, but I don't know how many objects would
be unreasonable or how many it'd take to show a problem.

Maybe we should avoid fseeko(0, SEEK_SET) unless we need to wrap around after
EOF - I'm not sure.

Maybe the cleanest way would be to pre-populate a structure with all the TOC
data and loop around that instead of seeking around the file ? Can we use the
same structure as pg_dump ?
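
Something along these lines, as a sketch only (read_data_block_header() and
skip_data_block() below are hypothetical placeholders, not real pg_dump
functions): one forward pass records where each data block starts, keyed by
dumpId, so a later out-of-order request can seek straight to the right block.

#include <sys/types.h>
#include <stdio.h>
#include <stdbool.h>

typedef struct
{
    int     dumpId;
    off_t   offset;
} BlockIndexEntry;

extern bool read_data_block_header(FILE *fh, int *dumpId);  /* assumed helper */
extern void skip_data_block(FILE *fh);                      /* assumed helper */

static int
build_block_index(FILE *fh, BlockIndexEntry *index, int maxentries)
{
    int     n = 0;
    int     dumpId;

    while (n < maxentries)
    {
        off_t   here = ftello(fh);

        if (!read_data_block_header(fh, &dumpId))
            break;              /* end of the data section */
        index[n].dumpId = dumpId;
        index[n].offset = here;
        n++;
        skip_data_block(fh);    /* jump over this block's contents */
    }
    return n;
}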

Otherwise, that makes me think of commit 42f70cd9c. Maybe it's not a good
parallel or example for this case, though.

+        The custom archive format may not work with the <option>-j</option>
+        option if the archive was originally created by writing the archive
+        to an unseekable output file. For the best concurrent restoration

Can I suggest something like: pg_restore with parallel jobs may fail if the
archive dump was written to an unseekable output stream, like stdout.

+ * If the input file can't be seeked we're at the mercy of the

seeked COMMA

Subject: [PATCH 3/3] Add integration test for out-of-order TOC requests in pg_restore

Well done - thanks for that.

Also add undocumented --disable-seeking argument to pg_dump to emulate
writing to an unseekable output file.

Remove "also".

Is it possible to dump to stdout (or pipe to cat or dd) to avoid a new option ?

Maybe that would involve changing the test process to use the shell (system() vs
execve()), or maybe you could write:

/* sh handles output redirection and arg splitting */
'sh', '-c', 'pg_dump -Fc -Z6 --no-sync --disable-seeking postgres > $tempdir/defaults_custom_format_no_seek_parallel_restore.dump',

But I think that would need to then separately handle WIN32, so maybe it's not
worth it.

#9 David Gilman
dgilman@gilslotd.com
In reply to: Justin Pryzby (#8)
3 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

Updated patches are attached; I ditched the Gmail web interface, so
hopefully this works.

Not mentioned in Justin's feedback: I dropped the extra sort in the test
as it's no longer necessary. I also added a parallel dump -> parallel
restore -> dump test run for the directory format to get some free test
coverage.

On Sat, May 23, 2020 at 05:47:51PM -0500, Justin Pryzby wrote:

I'm not sure this will be considered a bugfix, since the behavior is known.
Maybe eligible for backpatch though (?)

I'm not familiar with how your release management works, but I'm
personally fine with whatever version you can get it into. I urge you to
try landing this as soon as possible. The reproducible example in the
test case is very minimal, and I imagine all real-world databases are
going to trigger this.

I would have thought to mention the seeks() ; but it's true that the read()s now
grow quadratically. I did run a test, but I don't know how many objects would
be unreasonable or how many it'd take to show a problem.

And I misunderstood how bad it was. I thought it was reading little
header structs off the disk but it's actually reading the entire table
(see _skipData). So you're quadratically rereading entire tables and
thrashing your cache. Oops.

Maybe we should avoid fseeko(0, SEEK_SET) unless we need to wrap around after
EOF - I'm not sure.

The seek position is already at the end of the last good object, so
just adding wraparound gives the good algorithmic performance of the
technique in commit 42f70cd9c. I've gone ahead and implemented this.

Is it possible to dump to stdout (or pipe to cat or dd) to avoid a new option ?

The underlying IPC::Run code seems to support piping in a cross-platform
way. I am not a Perl master, though, and after spending an evening trying
to get it to work I went with this approach. If you can put me in touch
with anyone who can help me out here, I'd appreciate it.

--
David Gilman :DG<
https://gilslotd.com

Attachments:

0001-Remove-unused-seek-check-in-tar-dump-format.patch (text/x-diff; charset=us-ascii)
From 5ffc0420bc6040cc26b1a36ac522b696ef428464 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Fri, 22 May 2020 17:27:51 -0400
Subject: [PATCH 1/3] Remove unused seek check in tar dump format

---
 src/bin/pg_dump/pg_backup_tar.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git src/bin/pg_dump/pg_backup_tar.c src/bin/pg_dump/pg_backup_tar.c
index d5bfa55646..bb5d3b1f45 100644
--- src/bin/pg_dump/pg_backup_tar.c
+++ src/bin/pg_dump/pg_backup_tar.c
@@ -82,7 +82,6 @@ typedef struct
 
 typedef struct
 {
-	int			hasSeek;
 	pgoff_t		filePos;
 	TAR_MEMBER *blobToc;
 	FILE	   *tarFH;
@@ -192,8 +191,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 		 */
 		/* setvbuf(ctx->tarFH, NULL, _IONBF, 0); */
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * We don't support compression because reading the files back is not
 		 * possible since gzdopen uses buffered IO which totally screws file
@@ -226,8 +223,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 
 		ctx->tarFHpos = 0;
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * Forcibly unmark the header as read since we use the lookahead
 		 * buffer
-- 
2.26.2

0002-Scan-all-TOCs-when-restoring-a-custom-dump-file-with.patch (text/x-diff; charset=us-ascii)
From 3e2a291103ea14f3abeab861197f76d59e7d30e0 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Wed, 20 May 2020 22:49:28 -0400
Subject: [PATCH 2/3] Scan all TOCs when restoring a custom dump file without
 offsets

TOC requests are not guaranteed to come in disk order. If the custom
dump file was written with data offsets, pg_restore can seek directly to
the data, making request order irrelevant. If there are no data offsets
pg_restore would never attempt to seek backwards, even when it was
possible, and would not find TOCs before the current read position in
the file. 548e50976 changed how pg_restore's parallel algorithm worked
at the cost of greatly increasing out-of-order TOC requests.

This patch changes pg_restore to scan through all TOCs to service a TOC
read request when restoring a custom dump file without data offsets.
The odds of getting a successful parallel restore go way up at the cost
of a bunch of extra tiny reads when pg_restore starts up.

The pg_restore manpage now warns against running pg_dump with an
unseekable output file and suggests that if you plan on doing a parallel
restore of a custom dump you should run pg_dump with --file.
---
 doc/src/sgml/ref/pg_restore.sgml   |  8 ++++++++
 src/bin/pg_dump/pg_backup_custom.c | 24 ++++++++++++++++++++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git doc/src/sgml/ref/pg_restore.sgml doc/src/sgml/ref/pg_restore.sgml
index 232f88024f..a0bf64767b 100644
--- doc/src/sgml/ref/pg_restore.sgml
+++ doc/src/sgml/ref/pg_restore.sgml
@@ -279,6 +279,14 @@ PostgreSQL documentation
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>
+
+       <para>
+        <application>pg_restore</application> with concurrent jobs may fail
+        when restoring a custom archive format dump written to an unseekable
+        output stream, like stdout. For the best performance when restoring
+        a custom archive format dump use <application>pg_dump</application>'s
+        <option>--file</option> option to specify an output file.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git src/bin/pg_dump/pg_backup_custom.c src/bin/pg_dump/pg_backup_custom.c
index 369dcea429..91c60f7103 100644
--- src/bin/pg_dump/pg_backup_custom.c
+++ src/bin/pg_dump/pg_backup_custom.c
@@ -415,6 +415,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
 	int			blkType;
 	int			id;
+	bool		initialScan = true;
 
 	if (tctx->dataState == K_OFFSET_NO_DATA)
 		return;
@@ -423,11 +424,19 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.
 		 */
+
 		_readBlockHeader(AH, &blkType, &id);
 
+		if (blkType == EOF && ctx->hasSeek) {
+			/* Started at the end of the file */
+			initialScan = false;
+			if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+				fatal("error during file seek: %m");
+			_readBlockHeader(AH, &blkType, &id);
+		}
+
 		while (blkType != EOF && id != te->dumpId)
 		{
 			switch (blkType)
@@ -446,6 +455,17 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 					break;
 			}
 			_readBlockHeader(AH, &blkType, &id);
+
+			if (blkType == EOF && ctx->hasSeek && initialScan) {
+				/*
+				 * This was possibly an out-of-order request.
+				 * Try one extra pass over the file to find it.
+				 */
+				initialScan = false;
+				if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+					fatal("error during file seek: %m");
+				_readBlockHeader(AH, &blkType, &id);
+			}
 		}
 	}
 	else
-- 
2.26.2

0003-Add-integration-test-for-out-of-order-TOC-requests-i.patch (text/x-diff; charset=us-ascii)
From e1136ca7817e36f48346da8771ca1f0100158295 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 23 May 2020 10:49:33 -0400
Subject: [PATCH 3/3] Add integration test for out-of-order TOC requests in
 pg_restore

Add undocumented --disable-seeking argument to pg_dump to emulate
writing to an unseekable output file.
---
 src/bin/pg_dump/pg_backup.h          |  4 +-
 src/bin/pg_dump/pg_backup_archiver.c | 14 ++--
 src/bin/pg_dump/pg_backup_archiver.h |  1 +
 src/bin/pg_dump/pg_backup_custom.c   |  7 +-
 src/bin/pg_dump/pg_dump.c            |  6 +-
 src/bin/pg_dump/t/002_pg_dump.pl     | 97 +++++++++++++++++++++++++++-
 6 files changed, 117 insertions(+), 12 deletions(-)

diff --git src/bin/pg_dump/pg_backup.h src/bin/pg_dump/pg_backup.h
index 8c0cedcd98..160d705eb5 100644
--- src/bin/pg_dump/pg_backup.h
+++ src/bin/pg_dump/pg_backup.h
@@ -158,6 +158,7 @@ typedef struct _dumpOptions
 	int			use_setsessauth;
 	int			enable_row_security;
 	int			load_via_partition_root;
+	int			disable_seeking;
 
 	/* default, if no "inclusion" switches appear, is to dump everything */
 	bool		include_everything;
@@ -270,7 +271,8 @@ extern Archive *OpenArchive(const char *FileSpec, const ArchiveFormat fmt);
 /* Create a new archive */
 extern Archive *CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 							  const int compression, bool dosync, ArchiveMode mode,
-							  SetupWorkerPtrType setupDumpWorker);
+							  SetupWorkerPtrType setupDumpWorker,
+							  bool disableSeeking);
 
 /* The --list option */
 extern void PrintTOCSummary(Archive *AH);
diff --git src/bin/pg_dump/pg_backup_archiver.c src/bin/pg_dump/pg_backup_archiver.c
index 8f0b32ca17..0e5d4e0a60 100644
--- src/bin/pg_dump/pg_backup_archiver.c
+++ src/bin/pg_dump/pg_backup_archiver.c
@@ -69,7 +69,8 @@ typedef struct _parallelReadyList
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const int compression, bool dosync, ArchiveMode mode,
-							   SetupWorkerPtrType setupWorkerPtr);
+							   SetupWorkerPtrType setupWorkerPtr,
+							   bool disableSeeking);
 static void _getObjectDescription(PQExpBuffer buf, TocEntry *te,
 								  ArchiveHandle *AH);
 static void _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData);
@@ -233,11 +234,12 @@ setupRestoreWorker(Archive *AHX)
 Archive *
 CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 			  const int compression, bool dosync, ArchiveMode mode,
-			  SetupWorkerPtrType setupDumpWorker)
+			  SetupWorkerPtrType setupDumpWorker,
+			  bool disableSeeking)
 
 {
 	ArchiveHandle *AH = _allocAH(FileSpec, fmt, compression, dosync,
-								 mode, setupDumpWorker);
+								 mode, setupDumpWorker, disableSeeking);
 
 	return (Archive *) AH;
 }
@@ -247,7 +249,7 @@ CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 Archive *
 OpenArchive(const char *FileSpec, const ArchiveFormat fmt)
 {
-	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker);
+	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker, false);
 
 	return (Archive *) AH;
 }
@@ -2274,7 +2276,8 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 static ArchiveHandle *
 _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 		 const int compression, bool dosync, ArchiveMode mode,
-		 SetupWorkerPtrType setupWorkerPtr)
+		 SetupWorkerPtrType setupWorkerPtr,
+		 bool disableSeeking)
 {
 	ArchiveHandle *AH;
 
@@ -2325,6 +2328,7 @@ _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 	AH->mode = mode;
 	AH->compression = compression;
 	AH->dosync = dosync;
+	AH->disableSeeking = disableSeeking;
 
 	memset(&(AH->sqlparse), 0, sizeof(AH->sqlparse));
 
diff --git src/bin/pg_dump/pg_backup_archiver.h src/bin/pg_dump/pg_backup_archiver.h
index f9e6b42752..acf38c20da 100644
--- src/bin/pg_dump/pg_backup_archiver.h
+++ src/bin/pg_dump/pg_backup_archiver.h
@@ -340,6 +340,7 @@ struct _archiveHandle
 	bool		dosync;			/* data requested to be synced on sight */
 	ArchiveMode mode;			/* File mode - r or w */
 	void	   *formatData;		/* Header data specific to file format */
+	bool		disableSeeking;	/* Don't use fseeko() */
 
 	/* these vars track state to avoid sending redundant SET commands */
 	char	   *currUser;		/* current username, or NULL if unknown */
diff --git src/bin/pg_dump/pg_backup_custom.c src/bin/pg_dump/pg_backup_custom.c
index 91c60f7103..d471dfe267 100644
--- src/bin/pg_dump/pg_backup_custom.c
+++ src/bin/pg_dump/pg_backup_custom.c
@@ -145,6 +145,7 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
 
 	ctx->filePos = 0;
+	ctx->hasSeek = 0;
 
 	/*
 	 * Now open the file
@@ -164,7 +165,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open output file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 	}
 	else
 	{
@@ -181,7 +183,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open input file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 
 		ReadHead(AH);
 		ReadToc(AH);
diff --git src/bin/pg_dump/pg_dump.c src/bin/pg_dump/pg_dump.c
index a4e949c636..6ed45d319f 100644
--- src/bin/pg_dump/pg_dump.c
+++ src/bin/pg_dump/pg_dump.c
@@ -319,6 +319,7 @@ main(int argc, char **argv)
 	int			plainText = 0;
 	ArchiveFormat archiveFormat = archUnknown;
 	ArchiveMode archiveMode;
+	bool		disableSeeking = false;
 
 	static DumpOptions dopt;
 
@@ -360,6 +361,7 @@ main(int argc, char **argv)
 		{"binary-upgrade", no_argument, &dopt.binary_upgrade, 1},
 		{"column-inserts", no_argument, &dopt.column_inserts, 1},
 		{"disable-dollar-quoting", no_argument, &dopt.disable_dollar_quoting, 1},
+		{"disable-seeking", no_argument, &dopt.disable_seeking, 1},
 		{"disable-triggers", no_argument, &dopt.disable_triggers, 1},
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
@@ -673,6 +675,8 @@ main(int argc, char **argv)
 	if (archiveFormat == archNull)
 		plainText = 1;
 
+	disableSeeking = (bool) dopt.disable_seeking;
+
 	/* Custom and directory formats are compressed by default, others not */
 	if (compressLevel == -1)
 	{
@@ -715,7 +719,7 @@ main(int argc, char **argv)
 
 	/* Open the output file */
 	fout = CreateArchive(filename, archiveFormat, compressLevel, dosync,
-						 archiveMode, setupDumpWorker);
+						 archiveMode, setupDumpWorker, disableSeeking);
 
 	/* Make dump options accessible right away */
 	SetArchiveOptions(fout, &dopt, NULL);
diff --git src/bin/pg_dump/t/002_pg_dump.pl src/bin/pg_dump/t/002_pg_dump.pl
index e116235769..2423ec5b1d 100644
--- src/bin/pg_dump/t/002_pg_dump.pl
+++ src/bin/pg_dump/t/002_pg_dump.pl
@@ -31,9 +31,20 @@ my $tempdir_short = TestLib::tempdir_short;
 # generate a text file to run through the tests from the
 # non-text file generated by pg_dump.
 #
-# TODO: Have pg_restore actually restore to an independent
-# database and then pg_dump *that* database (or something along
-# those lines) to validate that part of the process.
+# secondary_dump_cmd is an optional key that enables a
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, real database) ->
+#	pg_dump (secondary_dump_cmd, SQL output (for matching))
+#
+# test process instead of the default
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, SQL output (for matching))
+#
+# test process. The secondary_dump database is created for
+# you; your restore_cmd and secondary_dump_cmd must read
+# from and write to it.
 
 my %pgdump_runs = (
 	binary_upgrade => {
@@ -139,6 +150,23 @@ my %pgdump_runs = (
 			"$tempdir/defaults_custom_format.dump",
 		],
 	},
+	defaults_custom_format_no_seek_parallel_restore => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump', '-Fc', '-Z6', '--no-sync', '--disable-seeking',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.dump", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fc', '--jobs=2',
+			'--dbname=secondary_dump',
+			"$tempdir/defaults_custom_format_no_seek_parallel_restore.dump",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_dir_format => {
@@ -153,6 +181,24 @@ my %pgdump_runs = (
 			"$tempdir/defaults_dir_format",
 		],
 	},
+	defaults_dir_format_parallel => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump',                                      '-Fd',
+			'--jobs=2',                                     '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fd',
+			'--jobs=2', '--dbname=secondary_dump',
+			"$tempdir/defaults_dir_format_parallel",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_parallel => {
@@ -3298,6 +3344,27 @@ my %tests = (
 			%full_runs, %dump_test_schema_runs, section_pre_data => 1,
 		},
 		unlike => { exclude_dump_test_schema => 1 },
+	},
+
+	# This reliably produced the "possibly due to out-of-order restore request"
+	# failure when restoring with -j 2 at the time of writing, but future changes
+	# to how pg_dump works could accidentally order things correctly, as we're
+	# not really telling pg_dump how to do its on-disk layout.
+	'custom dump out-of-order' => {
+		create_sql   => '
+			CREATE TABLE ooo_parent0 (id integer primary key);
+			CREATE TABLE ooo_child0 (parent0 int references ooo_parent0 (id));
+			CREATE TABLE ooo_parent1 (id integer primary key);
+			CREATE TABLE ooo_child1 (parent0 int references ooo_parent1 (id));',
+		regexp => qr/^
+			 \QCREATE TABLE public.ooo_child0 (\E\n
+			 \s+\Qparent0 integer\E\n
+			 \Q);\E\n/xm,
+		like => {
+			%full_runs, section_pre_data => 1,
+			defaults_custom_format_no_seek => 1,
+			defaults_custom_format_no_seek_parallel_restore => 1,
+		},
 	});
 
 #########################################
@@ -3350,6 +3417,11 @@ foreach my $run (sort keys %pgdump_runs)
 		$num_tests++;
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$num_tests++;
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3499,12 +3571,23 @@ foreach my $run (sort keys %pgdump_runs)
 	$node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
 		"$run: pg_dump runs");
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('postgres', 'create database secondary_dump;');
+	}
+
 	if ($pgdump_runs{$run}->{restore_cmd})
 	{
 		$node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
 			"$run: pg_restore runs");
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->command_ok(\@{ $pgdump_runs{$run}->{secondary_dump_cmd} },
+			"$run: secondary pg_dump runs");
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3561,6 +3644,14 @@ foreach my $run (sort keys %pgdump_runs)
 			}
 		}
 	}
+
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('secondary_dump',
+			'alter subscription sub1 set (slot_name = NONE);');
+		$node->safe_psql('secondary_dump', 'drop subscription sub1;');
+		$node->safe_psql('postgres',       'drop database secondary_dump;');
+	}
 }
 
 #########################################
-- 
2.26.2

#10 David Gilman
dgilman@gilslotd.com
In reply to: David Gilman (#9)
4 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

The earlier patches weren't applying because I had "git config
diff.noprefix true" set globally and that was messing up the git
format-patch output.

On Mon, May 25, 2020 at 01:54:29PM -0500, David Gilman wrote:

And I misunderstood how bad it was. I thought it was reading little
header structs off the disk but it's actually reading the entire table
(see _skipData). So you're quadratically rereading entire tables and
thrashing your cache. Oops.

I changed _skipData to fseeko() instead of fread() when possible to cut
down on this thrashing further.

--
David Gilman :DG<
https://gilslotd.com

Attachments:

0001-Remove-unused-seek-check-in-tar-dump-format.patch (text/x-diff; charset=us-ascii)
From 5ffc0420bc6040cc26b1a36ac522b696ef428464 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Fri, 22 May 2020 17:27:51 -0400
Subject: [PATCH 1/4] Remove unused seek check in tar dump format

---
 src/bin/pg_dump/pg_backup_tar.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index d5bfa55646..bb5d3b1f45 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -82,7 +82,6 @@ typedef struct
 
 typedef struct
 {
-	int			hasSeek;
 	pgoff_t		filePos;
 	TAR_MEMBER *blobToc;
 	FILE	   *tarFH;
@@ -192,8 +191,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 		 */
 		/* setvbuf(ctx->tarFH, NULL, _IONBF, 0); */
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * We don't support compression because reading the files back is not
 		 * possible since gzdopen uses buffered IO which totally screws file
@@ -226,8 +223,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 
 		ctx->tarFHpos = 0;
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * Forcibly unmark the header as read since we use the lookahead
 		 * buffer
-- 
2.26.2

0002-Skip-tables-in-pg_restore-by-seeking-instead-of-read.patch (text/x-diff; charset=us-ascii)
From 8687d3947cab7ffadc0c7bdc61e19c4b8a94da2d Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Mon, 25 May 2020 17:34:52 -0400
Subject: [PATCH 2/4] Skip tables in pg_restore by seeking instead of reading

---
 src/bin/pg_dump/pg_backup_custom.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 369dcea429..8c84e3c611 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -562,19 +562,27 @@ _skipData(ArchiveHandle *AH)
 	blkLen = ReadInt(AH);
 	while (blkLen != 0)
 	{
-		if (blkLen > buflen)
+		if (ctx->hasSeek)
 		{
-			if (buf)
-				free(buf);
-			buf = (char *) pg_malloc(blkLen);
-			buflen = blkLen;
+			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
+				fatal("error during file seek: %m");
 		}
-		if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+		else
 		{
-			if (feof(AH->FH))
-				fatal("could not read from input file: end of file");
-			else
-				fatal("could not read from input file: %m");
+			if (blkLen > buflen)
+			{
+				if (buf)
+					free(buf);
+				buf = (char *) pg_malloc(blkLen);
+				buflen = blkLen;
+			}
+			if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+			{
+				if (feof(AH->FH))
+					fatal("could not read from input file: end of file");
+				else
+					fatal("could not read from input file: %m");
+			}
 		}
 
 		ctx->filePos += blkLen;
-- 
2.26.2

0003-Scan-all-TOCs-when-restoring-a-custom-dump-file-with.patch (text/x-diff; charset=us-ascii)
From ed106edbfd2eeeadec74289afb8d8bfe8f37fff1 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Wed, 20 May 2020 22:49:28 -0400
Subject: [PATCH 3/4] Scan all TOCs when restoring a custom dump file without
 offsets

TOC requests are not guaranteed to come in disk order. If the custom
dump file was written with data offsets, pg_restore can seek directly to
the data, making request order irrelevant. If there are no data offsets
pg_restore would never attempt to seek backwards, even when it was
possible, and would not find TOCs before the current read position in
the file. 548e50976 changed how pg_restore's parallel algorithm worked
at the cost of greatly increasing out-of-order TOC requests.

This patch changes pg_restore to scan through all TOCs to service a TOC
read request when restoring a custom dump file without data offsets.
The odds of getting a successful parallel restore go way up at the cost
of a bunch of extra tiny reads when pg_restore starts up.

The pg_restore manpage now warns against running pg_dump with an
unseekable output file and suggests that if you plan on doing a parallel
restore of a custom dump you should run pg_dump with --file.
---
 doc/src/sgml/ref/pg_restore.sgml   |  8 ++++++++
 src/bin/pg_dump/pg_backup_custom.c | 24 ++++++++++++++++++++++--
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 232f88024f..a0bf64767b 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -279,6 +279,14 @@ PostgreSQL documentation
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>
+
+       <para>
+        <application>pg_restore</application> with concurrent jobs may fail
+        when restoring a custom archive format dump written to an unseekable
+        output stream, like stdout. For the best performance when restoring
+        a custom archive format dump use <application>pg_dump</application>'s
+        <option>--file</option> option to specify an output file.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 8c84e3c611..851add4915 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -415,6 +415,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
 	int			blkType;
 	int			id;
+	bool		initialScan = true;
 
 	if (tctx->dataState == K_OFFSET_NO_DATA)
 		return;
@@ -423,11 +424,19 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.
 		 */
+
 		_readBlockHeader(AH, &blkType, &id);
 
+		if (blkType == EOF && ctx->hasSeek) {
+			/* Started at the end of the file */
+			initialScan = false;
+			if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+				fatal("error during file seek: %m");
+			_readBlockHeader(AH, &blkType, &id);
+		}
+
 		while (blkType != EOF && id != te->dumpId)
 		{
 			switch (blkType)
@@ -446,6 +455,17 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 					break;
 			}
 			_readBlockHeader(AH, &blkType, &id);
+
+			if (blkType == EOF && ctx->hasSeek && initialScan) {
+				/*
+				 * This was possibly an out-of-order request.
+				 * Try one extra pass over the file to find it.
+				 */
+				initialScan = false;
+				if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+					fatal("error during file seek: %m");
+				_readBlockHeader(AH, &blkType, &id);
+			}
 		}
 	}
 	else
-- 
2.26.2

0004-Add-integration-test-for-out-of-order-TOC-requests-i.patch (text/x-diff; charset=us-ascii)
From 69d7a9bafa342dc558c0d8efcee1af4b9e7e07d5 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 23 May 2020 10:49:33 -0400
Subject: [PATCH 4/4] Add integration test for out-of-order TOC requests in
 pg_restore

Add undocumented --disable-seeking argument to pg_dump to emulate
writing to an unseekable output file.
---
 src/bin/pg_dump/pg_backup.h          |  4 +-
 src/bin/pg_dump/pg_backup_archiver.c | 14 ++--
 src/bin/pg_dump/pg_backup_archiver.h |  1 +
 src/bin/pg_dump/pg_backup_custom.c   |  7 +-
 src/bin/pg_dump/pg_dump.c            |  6 +-
 src/bin/pg_dump/t/002_pg_dump.pl     | 97 +++++++++++++++++++++++++++-
 6 files changed, 117 insertions(+), 12 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 8c0cedcd98..160d705eb5 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -158,6 +158,7 @@ typedef struct _dumpOptions
 	int			use_setsessauth;
 	int			enable_row_security;
 	int			load_via_partition_root;
+	int			disable_seeking;
 
 	/* default, if no "inclusion" switches appear, is to dump everything */
 	bool		include_everything;
@@ -270,7 +271,8 @@ extern Archive *OpenArchive(const char *FileSpec, const ArchiveFormat fmt);
 /* Create a new archive */
 extern Archive *CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 							  const int compression, bool dosync, ArchiveMode mode,
-							  SetupWorkerPtrType setupDumpWorker);
+							  SetupWorkerPtrType setupDumpWorker,
+							  bool disableSeeking);
 
 /* The --list option */
 extern void PrintTOCSummary(Archive *AH);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 8f0b32ca17..0e5d4e0a60 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -69,7 +69,8 @@ typedef struct _parallelReadyList
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const int compression, bool dosync, ArchiveMode mode,
-							   SetupWorkerPtrType setupWorkerPtr);
+							   SetupWorkerPtrType setupWorkerPtr,
+							   bool disableSeeking);
 static void _getObjectDescription(PQExpBuffer buf, TocEntry *te,
 								  ArchiveHandle *AH);
 static void _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData);
@@ -233,11 +234,12 @@ setupRestoreWorker(Archive *AHX)
 Archive *
 CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 			  const int compression, bool dosync, ArchiveMode mode,
-			  SetupWorkerPtrType setupDumpWorker)
+			  SetupWorkerPtrType setupDumpWorker,
+			  bool disableSeeking)
 
 {
 	ArchiveHandle *AH = _allocAH(FileSpec, fmt, compression, dosync,
-								 mode, setupDumpWorker);
+								 mode, setupDumpWorker, disableSeeking);
 
 	return (Archive *) AH;
 }
@@ -247,7 +249,7 @@ CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 Archive *
 OpenArchive(const char *FileSpec, const ArchiveFormat fmt)
 {
-	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker);
+	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker, false);
 
 	return (Archive *) AH;
 }
@@ -2274,7 +2276,8 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 static ArchiveHandle *
 _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 		 const int compression, bool dosync, ArchiveMode mode,
-		 SetupWorkerPtrType setupWorkerPtr)
+		 SetupWorkerPtrType setupWorkerPtr,
+		 bool disableSeeking)
 {
 	ArchiveHandle *AH;
 
@@ -2325,6 +2328,7 @@ _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 	AH->mode = mode;
 	AH->compression = compression;
 	AH->dosync = dosync;
+	AH->disableSeeking = disableSeeking;
 
 	memset(&(AH->sqlparse), 0, sizeof(AH->sqlparse));
 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index f9e6b42752..acf38c20da 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -340,6 +340,7 @@ struct _archiveHandle
 	bool		dosync;			/* data requested to be synced on sight */
 	ArchiveMode mode;			/* File mode - r or w */
 	void	   *formatData;		/* Header data specific to file format */
+	bool		disableSeeking;	/* Don't use fseeko() */
 
 	/* these vars track state to avoid sending redundant SET commands */
 	char	   *currUser;		/* current username, or NULL if unknown */
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 851add4915..79445ac363 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -145,6 +145,7 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
 
 	ctx->filePos = 0;
+	ctx->hasSeek = 0;
 
 	/*
 	 * Now open the file
@@ -164,7 +165,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open output file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 	}
 	else
 	{
@@ -181,7 +183,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open input file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 
 		ReadHead(AH);
 		ReadToc(AH);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a4e949c636..6ed45d319f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -319,6 +319,7 @@ main(int argc, char **argv)
 	int			plainText = 0;
 	ArchiveFormat archiveFormat = archUnknown;
 	ArchiveMode archiveMode;
+	bool		disableSeeking = false;
 
 	static DumpOptions dopt;
 
@@ -360,6 +361,7 @@ main(int argc, char **argv)
 		{"binary-upgrade", no_argument, &dopt.binary_upgrade, 1},
 		{"column-inserts", no_argument, &dopt.column_inserts, 1},
 		{"disable-dollar-quoting", no_argument, &dopt.disable_dollar_quoting, 1},
+		{"disable-seeking", no_argument, &dopt.disable_seeking, 1},
 		{"disable-triggers", no_argument, &dopt.disable_triggers, 1},
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
@@ -673,6 +675,8 @@ main(int argc, char **argv)
 	if (archiveFormat == archNull)
 		plainText = 1;
 
+	disableSeeking = (bool) dopt.disable_seeking;
+
 	/* Custom and directory formats are compressed by default, others not */
 	if (compressLevel == -1)
 	{
@@ -715,7 +719,7 @@ main(int argc, char **argv)
 
 	/* Open the output file */
 	fout = CreateArchive(filename, archiveFormat, compressLevel, dosync,
-						 archiveMode, setupDumpWorker);
+						 archiveMode, setupDumpWorker, disableSeeking);
 
 	/* Make dump options accessible right away */
 	SetArchiveOptions(fout, &dopt, NULL);
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index e116235769..2423ec5b1d 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -31,9 +31,20 @@ my $tempdir_short = TestLib::tempdir_short;
 # generate a text file to run through the tests from the
 # non-text file generated by pg_dump.
 #
-# TODO: Have pg_restore actually restore to an independent
-# database and then pg_dump *that* database (or something along
-# those lines) to validate that part of the process.
+# secondary_dump_cmd is an optional key that enables a
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, real database) ->
+#	pg_dump (secondary_dmp_cmd, SQL output (for matching))
+#
+# test process instead of the default
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, SQL output (for matching))
+#
+# test process. The secondary_dump database is created for
+# you and your restore_cmd and secondary_dmp_cmd must read
+# and write from it.
 
 my %pgdump_runs = (
 	binary_upgrade => {
@@ -139,6 +150,23 @@ my %pgdump_runs = (
 			"$tempdir/defaults_custom_format.dump",
 		],
 	},
+	defaults_custom_format_no_seek_parallel_restore => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump', '-Fc', '-Z6', '--no-sync', '--disable-seeking',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.dump", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fc', '--jobs=2',
+			'--dbname=secondary_dump',
+			"$tempdir/defaults_custom_format_no_seek_parallel_restore.dump",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_dir_format => {
@@ -153,6 +181,24 @@ my %pgdump_runs = (
 			"$tempdir/defaults_dir_format",
 		],
 	},
+	defaults_dir_format_parallel => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump',                                      '-Fd',
+			'--jobs=2',                                     '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fd',
+			'--jobs=2', '--dbname=secondary_dump',
+			"$tempdir/defaults_dir_format_parallel",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_parallel => {
@@ -3298,6 +3344,27 @@ my %tests = (
 			%full_runs, %dump_test_schema_runs, section_pre_data => 1,
 		},
 		unlike => { exclude_dump_test_schema => 1 },
+	},
+
+	# This reliably produces the "possibly due to out-of-order restore request"
+	# when restoring with -j 2 when this was written but future changes to how
+	# pg_dump works could accidentally correctly order things as we're not really
+	# telling pg_dump how to do its on-disk layout.
+	'custom dump out-of-order' => {
+		create_sql   => '
+			CREATE TABLE ooo_parent0 (id integer primary key);
+			CREATE TABLE ooo_child0 (parent0 int references ooo_parent0 (id));
+			CREATE TABLE ooo_parent1 (id integer primary key);
+			CREATE TABLE ooo_child1 (parent0 int references ooo_parent1 (id));',
+		regexp => qr/^
+			 \QCREATE TABLE public.ooo_child0 (\E\n
+			 \s+\Qparent0 integer\E\n
+			 \Q);\E\n/xm,
+		like => {
+			%full_runs, section_pre_data => 1,
+			defaults_custom_format_no_seek => 1,
+			defaults_custom_format_no_seek_parallel_restore => 1,
+		},
 	});
 
 #########################################
@@ -3350,6 +3417,11 @@ foreach my $run (sort keys %pgdump_runs)
 		$num_tests++;
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$num_tests++;
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3499,12 +3571,23 @@ foreach my $run (sort keys %pgdump_runs)
 	$node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
 		"$run: pg_dump runs");
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('postgres', 'create database secondary_dump;');
+	}
+
 	if ($pgdump_runs{$run}->{restore_cmd})
 	{
 		$node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
 			"$run: pg_restore runs");
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->command_ok(\@{ $pgdump_runs{$run}->{secondary_dump_cmd} },
+			"$run: secondary pg_dump runs");
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3561,6 +3644,14 @@ foreach my $run (sort keys %pgdump_runs)
 			}
 		}
 	}
+
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('secondary_dump',
+			'alter subscription sub1 set (slot_name = NONE);');
+		$node->safe_psql('secondary_dump', 'drop subscription sub1;');
+		$node->safe_psql('postgres',       'drop database secondary_dump;');
+	}
 }
 
 #########################################
-- 
2.26.2

#11 David Gilman
dgilman@gilslotd.com
In reply to: David Gilman (#10)
4 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

I've attached the latest patches after further review from Justin Pryzby.

--
David Gilman :DG<
https://gilslotd.com

Attachments:

0001-Scan-all-TOCs-when-restoring-a-custom-dump-file-with.patch (text/x-diff; charset=us-ascii)
From 90e06cbb724f6f6a244dfc69f3d59ca2e7d29c01 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Wed, 20 May 2020 22:49:28 -0400
Subject: [PATCH 1/4] Scan all TOCs when restoring a custom dump file without
 offsets

TOC requests are not guaranteed to come in disk order. If the custom
dump file was written with data offsets, pg_restore can seek directly to
the data, making request order irrelevant. If there are no data offsets,
pg_restore would never attempt to seek backwards, even when seeking is
possible, and would be unable to find TOCs before the current read
position in the file. 548e50976 changed how pg_restore's parallel
algorithm worked at the cost of greatly increasing out-of-order TOC
requests.

This patch changes pg_restore to scan through all TOCs to service a TOC
read request when restoring a custom dump file without data offsets.
The odds of getting a successful parallel restore go way up at the cost
of a bunch of extra tiny reads when pg_restore starts up.

The pg_restore manpage now warns against running pg_dump with an
unseekable output file and suggests that if you plan on doing a parallel
restore of a custom dump, you should run pg_dump with --file.
---
 doc/src/sgml/ref/pg_restore.sgml   |  8 ++++++++
 src/bin/pg_dump/pg_backup_custom.c | 25 ++++++++++++++++++++-----
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 232f88024f..23286bb076 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -279,6 +279,14 @@ PostgreSQL documentation
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>
+
+       <para>
+        <application>pg_restore</application> with concurrent jobs may fail
+        when restoring a custom archive format dump written to an unseekable
+        output stream, like stdout. To allow for concurrent restoration of
+        a custom archive format dump, use <application>pg_dump</application>'s
+        <option>--file</option> option to specify an output file.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 369dcea429..5aa7ab33db 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -415,6 +415,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
 	int			blkType;
 	int			id;
+	bool		initialScan = true;
 
 	if (tctx->dataState == K_OFFSET_NO_DATA)
 		return;
@@ -423,13 +424,28 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.
 		 */
-		_readBlockHeader(AH, &blkType, &id);
 
-		while (blkType != EOF && id != te->dumpId)
+		for (;;)
 		{
+			_readBlockHeader(AH, &blkType, &id);
+
+			if (blkType == EOF && ctx->hasSeek && initialScan)
+			{
+				/*
+				 * This was possibly an out-of-order request. Try one extra
+				 * pass over the file to find the TOC.
+				 */
+				initialScan = false;
+				if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+					fatal("error during file seek: %m");
+				continue;
+			}
+
+			if (blkType == EOF || id == te->dumpId)
+				break;
+
 			switch (blkType)
 			{
 				case BLK_DATA:
@@ -445,7 +461,6 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 						  blkType);
 					break;
 			}
-			_readBlockHeader(AH, &blkType, &id);
 		}
 	}
 	else
-- 
2.26.2

0002-Add-integration-test-for-out-of-order-TOC-requests-i.patch (text/x-diff; charset=us-ascii)
From 750958499b19e6295a15b01ccb3a9e3ce963af2c Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 23 May 2020 10:49:33 -0400
Subject: [PATCH 2/4] Add integration test for out-of-order TOC requests in
 pg_restore

Add undocumented --disable-seeking argument to pg_dump to emulate
writing to an unseekable output file.
---
 src/bin/pg_dump/pg_backup.h          |  4 +-
 src/bin/pg_dump/pg_backup_archiver.c | 14 ++--
 src/bin/pg_dump/pg_backup_archiver.h |  1 +
 src/bin/pg_dump/pg_backup_custom.c   |  7 +-
 src/bin/pg_dump/pg_dump.c            |  6 +-
 src/bin/pg_dump/t/002_pg_dump.pl     | 97 +++++++++++++++++++++++++++-
 6 files changed, 117 insertions(+), 12 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 8c0cedcd98..160d705eb5 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -158,6 +158,7 @@ typedef struct _dumpOptions
 	int			use_setsessauth;
 	int			enable_row_security;
 	int			load_via_partition_root;
+	int			disable_seeking;
 
 	/* default, if no "inclusion" switches appear, is to dump everything */
 	bool		include_everything;
@@ -270,7 +271,8 @@ extern Archive *OpenArchive(const char *FileSpec, const ArchiveFormat fmt);
 /* Create a new archive */
 extern Archive *CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 							  const int compression, bool dosync, ArchiveMode mode,
-							  SetupWorkerPtrType setupDumpWorker);
+							  SetupWorkerPtrType setupDumpWorker,
+							  bool disableSeeking);
 
 /* The --list option */
 extern void PrintTOCSummary(Archive *AH);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 8f0b32ca17..0e5d4e0a60 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -69,7 +69,8 @@ typedef struct _parallelReadyList
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const int compression, bool dosync, ArchiveMode mode,
-							   SetupWorkerPtrType setupWorkerPtr);
+							   SetupWorkerPtrType setupWorkerPtr,
+							   bool disableSeeking);
 static void _getObjectDescription(PQExpBuffer buf, TocEntry *te,
 								  ArchiveHandle *AH);
 static void _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData);
@@ -233,11 +234,12 @@ setupRestoreWorker(Archive *AHX)
 Archive *
 CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 			  const int compression, bool dosync, ArchiveMode mode,
-			  SetupWorkerPtrType setupDumpWorker)
+			  SetupWorkerPtrType setupDumpWorker,
+			  bool disableSeeking)
 
 {
 	ArchiveHandle *AH = _allocAH(FileSpec, fmt, compression, dosync,
-								 mode, setupDumpWorker);
+								 mode, setupDumpWorker, disableSeeking);
 
 	return (Archive *) AH;
 }
@@ -247,7 +249,7 @@ CreateArchive(const char *FileSpec, const ArchiveFormat fmt,
 Archive *
 OpenArchive(const char *FileSpec, const ArchiveFormat fmt)
 {
-	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker);
+	ArchiveHandle *AH = _allocAH(FileSpec, fmt, 0, true, archModeRead, setupRestoreWorker, false);
 
 	return (Archive *) AH;
 }
@@ -2274,7 +2276,8 @@ _discoverArchiveFormat(ArchiveHandle *AH)
 static ArchiveHandle *
 _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 		 const int compression, bool dosync, ArchiveMode mode,
-		 SetupWorkerPtrType setupWorkerPtr)
+		 SetupWorkerPtrType setupWorkerPtr,
+		 bool disableSeeking)
 {
 	ArchiveHandle *AH;
 
@@ -2325,6 +2328,7 @@ _allocAH(const char *FileSpec, const ArchiveFormat fmt,
 	AH->mode = mode;
 	AH->compression = compression;
 	AH->dosync = dosync;
+	AH->disableSeeking = disableSeeking;
 
 	memset(&(AH->sqlparse), 0, sizeof(AH->sqlparse));
 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index f9e6b42752..acf38c20da 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -340,6 +340,7 @@ struct _archiveHandle
 	bool		dosync;			/* data requested to be synced on sight */
 	ArchiveMode mode;			/* File mode - r or w */
 	void	   *formatData;		/* Header data specific to file format */
+	bool		disableSeeking;	/* Don't use fseeko() */
 
 	/* these vars track state to avoid sending redundant SET commands */
 	char	   *currUser;		/* current username, or NULL if unknown */
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 5aa7ab33db..90a026e0f4 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -145,6 +145,7 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
 
 	ctx->filePos = 0;
+	ctx->hasSeek = 0;
 
 	/*
 	 * Now open the file
@@ -164,7 +165,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open output file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 	}
 	else
 	{
@@ -181,7 +183,8 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 				fatal("could not open input file: %m");
 		}
 
-		ctx->hasSeek = checkSeek(AH->FH);
+		if (!AH->disableSeeking)
+			ctx->hasSeek = checkSeek(AH->FH);
 
 		ReadHead(AH);
 		ReadToc(AH);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a4e949c636..6ed45d319f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -319,6 +319,7 @@ main(int argc, char **argv)
 	int			plainText = 0;
 	ArchiveFormat archiveFormat = archUnknown;
 	ArchiveMode archiveMode;
+	bool		disableSeeking = false;
 
 	static DumpOptions dopt;
 
@@ -360,6 +361,7 @@ main(int argc, char **argv)
 		{"binary-upgrade", no_argument, &dopt.binary_upgrade, 1},
 		{"column-inserts", no_argument, &dopt.column_inserts, 1},
 		{"disable-dollar-quoting", no_argument, &dopt.disable_dollar_quoting, 1},
+		{"disable-seeking", no_argument, &dopt.disable_seeking, 1},
 		{"disable-triggers", no_argument, &dopt.disable_triggers, 1},
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
@@ -673,6 +675,8 @@ main(int argc, char **argv)
 	if (archiveFormat == archNull)
 		plainText = 1;
 
+	disableSeeking = (bool) dopt.disable_seeking;
+
 	/* Custom and directory formats are compressed by default, others not */
 	if (compressLevel == -1)
 	{
@@ -715,7 +719,7 @@ main(int argc, char **argv)
 
 	/* Open the output file */
 	fout = CreateArchive(filename, archiveFormat, compressLevel, dosync,
-						 archiveMode, setupDumpWorker);
+						 archiveMode, setupDumpWorker, disableSeeking);
 
 	/* Make dump options accessible right away */
 	SetArchiveOptions(fout, &dopt, NULL);
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index e116235769..2423ec5b1d 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -31,9 +31,20 @@ my $tempdir_short = TestLib::tempdir_short;
 # generate a text file to run through the tests from the
 # non-text file generated by pg_dump.
 #
-# TODO: Have pg_restore actually restore to an independent
-# database and then pg_dump *that* database (or something along
-# those lines) to validate that part of the process.
+# secondary_dump_cmd is an optional key that enables a
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, real database) ->
+#	pg_dump (secondary_dmp_cmd, SQL output (for matching))
+#
+# test process instead of the default
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, SQL output (for matching))
+#
+# test process. The secondary_dump database is created for
+# you and your restore_cmd and secondary_dmp_cmd must read
+# and write from it.
 
 my %pgdump_runs = (
 	binary_upgrade => {
@@ -139,6 +150,23 @@ my %pgdump_runs = (
 			"$tempdir/defaults_custom_format.dump",
 		],
 	},
+	defaults_custom_format_no_seek_parallel_restore => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump', '-Fc', '-Z6', '--no-sync', '--disable-seeking',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.dump", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fc', '--jobs=2',
+			'--dbname=secondary_dump',
+			"$tempdir/defaults_custom_format_no_seek_parallel_restore.dump",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_dir_format => {
@@ -153,6 +181,24 @@ my %pgdump_runs = (
 			"$tempdir/defaults_dir_format",
 		],
 	},
+	defaults_dir_format_parallel => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump',                                      '-Fd',
+			'--jobs=2',                                     '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fd',
+			'--jobs=2', '--dbname=secondary_dump',
+			"$tempdir/defaults_dir_format_parallel",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_parallel => {
@@ -3298,6 +3344,27 @@ my %tests = (
 			%full_runs, %dump_test_schema_runs, section_pre_data => 1,
 		},
 		unlike => { exclude_dump_test_schema => 1 },
+	},
+
+	# This reliably produces the "possibly due to out-of-order restore request"
+	# when restoring with -j 2 when this was written but future changes to how
+	# pg_dump works could accidentally correctly order things as we're not really
+	# telling pg_dump how to do its on-disk layout.
+	'custom dump out-of-order' => {
+		create_sql   => '
+			CREATE TABLE ooo_parent0 (id integer primary key);
+			CREATE TABLE ooo_child0 (parent0 int references ooo_parent0 (id));
+			CREATE TABLE ooo_parent1 (id integer primary key);
+			CREATE TABLE ooo_child1 (parent0 int references ooo_parent1 (id));',
+		regexp => qr/^
+			 \QCREATE TABLE public.ooo_child0 (\E\n
+			 \s+\Qparent0 integer\E\n
+			 \Q);\E\n/xm,
+		like => {
+			%full_runs, section_pre_data => 1,
+			defaults_custom_format_no_seek => 1,
+			defaults_custom_format_no_seek_parallel_restore => 1,
+		},
 	});
 
 #########################################
@@ -3350,6 +3417,11 @@ foreach my $run (sort keys %pgdump_runs)
 		$num_tests++;
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$num_tests++;
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3499,12 +3571,23 @@ foreach my $run (sort keys %pgdump_runs)
 	$node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
 		"$run: pg_dump runs");
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('postgres', 'create database secondary_dump;');
+	}
+
 	if ($pgdump_runs{$run}->{restore_cmd})
 	{
 		$node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
 			"$run: pg_restore runs");
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->command_ok(\@{ $pgdump_runs{$run}->{secondary_dump_cmd} },
+			"$run: secondary pg_dump runs");
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3561,6 +3644,14 @@ foreach my $run (sort keys %pgdump_runs)
 			}
 		}
 	}
+
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('secondary_dump',
+			'alter subscription sub1 set (slot_name = NONE);');
+		$node->safe_psql('secondary_dump', 'drop subscription sub1;');
+		$node->safe_psql('postgres',       'drop database secondary_dump;');
+	}
 }
 
 #########################################
-- 
2.26.2

0003-Remove-unused-seek-check-in-tar-dump-format.patch (text/x-diff; charset=us-ascii)
From 35d05669a0455bb928e51ea532a791e7f097987d Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Fri, 22 May 2020 17:27:51 -0400
Subject: [PATCH 3/4] Remove unused seek check in tar dump format

---
 src/bin/pg_dump/pg_backup_tar.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index d5bfa55646..bb5d3b1f45 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -82,7 +82,6 @@ typedef struct
 
 typedef struct
 {
-	int			hasSeek;
 	pgoff_t		filePos;
 	TAR_MEMBER *blobToc;
 	FILE	   *tarFH;
@@ -192,8 +191,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 		 */
 		/* setvbuf(ctx->tarFH, NULL, _IONBF, 0); */
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * We don't support compression because reading the files back is not
 		 * possible since gzdopen uses buffered IO which totally screws file
@@ -226,8 +223,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 
 		ctx->tarFHpos = 0;
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * Forcibly unmark the header as read since we use the lookahead
 		 * buffer
-- 
2.26.2

0004-Skip-tables-in-pg_restore-by-seeking-instead-of-read.patch (text/x-diff; charset=us-ascii)
From aabcd1d843f063b7f9de2e44df7c6147226ad389 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Mon, 25 May 2020 17:34:52 -0400
Subject: [PATCH 4/4] Skip tables in pg_restore by seeking instead of reading

---
 src/bin/pg_dump/pg_backup_custom.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 90a026e0f4..662e7a1793 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -580,19 +580,27 @@ _skipData(ArchiveHandle *AH)
 	blkLen = ReadInt(AH);
 	while (blkLen != 0)
 	{
-		if (blkLen > buflen)
+		if (ctx->hasSeek)
 		{
-			if (buf)
-				free(buf);
-			buf = (char *) pg_malloc(blkLen);
-			buflen = blkLen;
+			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
+				fatal("error during file seek: %m");
 		}
-		if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+		else
 		{
-			if (feof(AH->FH))
-				fatal("could not read from input file: end of file");
-			else
-				fatal("could not read from input file: %m");
+			if (blkLen > buflen)
+			{
+				if (buf)
+					free(buf);
+				buf = (char *) pg_malloc(blkLen);
+				buflen = blkLen;
+			}
+			if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+			{
+				if (feof(AH->FH))
+					fatal("could not read from input file: end of file");
+				else
+					fatal("could not read from input file: %m");
+			}
 		}
 
 		ctx->filePos += blkLen;
-- 
2.26.2

#12 Justin Pryzby
pryzby@telsasoft.com
In reply to: David Gilman (#9)
Re: Warn when parallel restoring a custom dump without data offsets

On Mon, May 25, 2020 at 01:54:29PM -0500, David Gilman wrote:

Is it possible to dump to stdout (or pipe to cat or dd) to avoid a new option?

The underlying IPC::Run code seems to support piping in a cross-platform
way. I am not a Perl master though and after spending an evening trying
to get it to work I went with this approach. If you can put me in touch
with anyone to help me out here I'd appreciate it.

I think you can do what's needed like so:

--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -152,10 +152,13 @@ my %pgdump_runs = (
        },
        defaults_custom_format_no_seek_parallel_restore => {
                test_key => 'defaults',
-               dump_cmd => [
-                       'pg_dump', '-Fc', '-Z6', '--no-sync', '--disable-seeking',
+               dump_cmd => (
+                       [
+                       'pg_dump', '-Fc', '-Z6', '--no-sync',
                        "--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.dump", 'postgres',
-               ],
+                       ],
+                       "|", [ "cat" ], # disable seeking
+               ),

Also, these are failing intermittently:

t/002_pg_dump.pl .............. 1649/6758
# Failed test 'defaults_custom_format_no_seek_parallel_restore: should dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public'
# at t/002_pg_dump.pl line 3635.
# Review defaults_custom_format_no_seek_parallel_restore results in /var/lib/pgsql/postgresql.src/src/bin/pg_dump/tmp_check/tmp_test_NqRC

t/002_pg_dump.pl .............. 2060/6758
# Failed test 'defaults_dir_format_parallel: should dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public'
# at t/002_pg_dump.pl line 3635.
# Review defaults_dir_format_parallel results in /var/lib/pgsql/postgresql.src/src/bin/pg_dump/tmp_check/tmp_test_NqRC

If you can address those, I think this will be "ready for committer".

--
Justin

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Gilman (#11)
Re: Warn when parallel restoring a custom dump without data offsets

David Gilman <dgilman@gilslotd.com> writes:

I've attached the latest patches after further review from Justin Pryzby.

I guess I'm completely confused about the purpose of these patches.
Far from coping with the situation of an unseekable file, they appear
to change pg_restore so that it fails altogether if it can't seek
its input file. Why would we want to do this?

regards, tom lane

#14 David Gilman
dgilman@gilslotd.com
In reply to: Tom Lane (#13)
4 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

On Thu, Jul 02, 2020 at 05:25:21PM -0400, Tom Lane wrote:

I guess I'm completely confused about the purpose of these patches.
Far from coping with the situation of an unseekable file, they appear
to change pg_restore so that it fails altogether if it can't seek
its input file. Why would we want to do this?

I'm not sure where the "fails altogether if it can't seek" is. The
"Skip tables in pg_restore" patch retains the old fread() logic. The
--disable-seeking stuff was just to support tests, and thanks to
help from Justin Pryzby the tests no longer require it. I've attached
the updated patch set.

Note that this still shouldn't be merged because of Justin's bug report
in 20200706050129.GW4107@telsasoft.com, which is unrelated to this change
but will leave you with flaky CI until it's fixed.

--
David Gilman :DG<
https://gilslotd.com

Attachments:

0001-Scan-all-TOCs-when-restoring-a-custom-dump-file-with.patch (text/x-diff; charset=us-ascii)
From a671dc328a7defa403ceb46b5107ed4ee5f53103 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Wed, 20 May 2020 22:49:28 -0400
Subject: [PATCH 1/5] Scan all TOCs when restoring a custom dump file without
 offsets

TOC requests are not guaranteed to come in disk order. If the custom
dump file was written with data offsets, pg_restore can seek directly to
the data, making request order irrelevant. If there are no data offsets,
pg_restore would never attempt to seek backwards, even when seeking is
possible, and would be unable to find TOCs before the current read
position in the file. 548e50976 changed how pg_restore's parallel
algorithm worked at the cost of greatly increasing out-of-order TOC
requests.

This patch changes pg_restore to scan through all TOCs to service a TOC
read request when restoring a custom dump file without data offsets.
The odds of getting a successful parallel restore go way up at the cost
of a bunch of extra tiny reads when pg_restore starts up.

The pg_restore manpage now warns against running pg_dump with an
unseekable output file and suggests that if you plan on doing a parallel
restore of a custom dump, you should run pg_dump with --file.
---
 doc/src/sgml/ref/pg_restore.sgml   |  8 ++++++++
 src/bin/pg_dump/pg_backup_custom.c | 25 ++++++++++++++++++++-----
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 232f88024f..23286bb076 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -279,6 +279,14 @@ PostgreSQL documentation
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>
+
+       <para>
+        <application>pg_restore</application> with concurrent jobs may fail
+        when restoring a custom archive format dump written to an unseekable
+        output stream, like stdout. To allow for concurrent restoration of
+        a custom archive format dump, use <application>pg_dump</application>'s
+        <option>--file</option> option to specify an output file.
+       </para>
       </listitem>
      </varlistentry>
 
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 6ab122242c..a873ac3afa 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -413,6 +413,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
 	int			blkType;
 	int			id;
+	bool		initialScan = true;
 
 	if (tctx->dataState == K_OFFSET_NO_DATA)
 		return;
@@ -421,13 +422,28 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.
 		 */
-		_readBlockHeader(AH, &blkType, &id);
 
-		while (blkType != EOF && id != te->dumpId)
+		for (;;)
 		{
+			_readBlockHeader(AH, &blkType, &id);
+
+			if (blkType == EOF && ctx->hasSeek && initialScan)
+			{
+				/*
+				 * This was possibly an out-of-order request. Try one extra
+				 * pass over the file to find the TOC.
+				 */
+				initialScan = false;
+				if (fseeko(AH->FH, ctx->dataStart, SEEK_SET) != 0)
+					fatal("error during file seek: %m");
+				continue;
+			}
+
+			if (blkType == EOF || id == te->dumpId)
+				break;
+
 			switch (blkType)
 			{
 				case BLK_DATA:
@@ -443,7 +459,6 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 						  blkType);
 					break;
 			}
-			_readBlockHeader(AH, &blkType, &id);
 		}
 	}
 	else
-- 
2.27.0

0002-Add-integration-test-for-out-of-order-TOC-requests-i.patch (text/x-diff; charset=us-ascii)
From 6e8294021482fb76cb17d6c150ec6acb6d0cac37 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 23 May 2020 10:49:33 -0400
Subject: [PATCH 2/5] Add integration test for out-of-order TOC requests in
 pg_restore

---
 src/bin/pg_dump/t/002_pg_dump.pl | 107 ++++++++++++++++++++++++++++++-
 src/test/perl/TestLib.pm         |  12 +++-
 2 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index e116235769..d99e7a5d22 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -31,9 +31,20 @@ my $tempdir_short = TestLib::tempdir_short;
 # generate a text file to run through the tests from the
 # non-text file generated by pg_dump.
 #
-# TODO: Have pg_restore actually restore to an independent
-# database and then pg_dump *that* database (or something along
-# those lines) to validate that part of the process.
+# secondary_dump_cmd is an optional key that enables a
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, real database) ->
+#	pg_dump (secondary_dump_cmd, SQL output (for matching))
+#
+# test process instead of the default
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, SQL output (for matching))
+#
+# test process. The secondary_dump database is created for
+# you and your restore_cmd and secondary_dump_cmd must read
+# and write from it.
 
 my %pgdump_runs = (
 	binary_upgrade => {
@@ -139,6 +150,25 @@ my %pgdump_runs = (
 			"$tempdir/defaults_custom_format.dump",
 		],
 	},
+	defaults_custom_format_no_seek_parallel_restore => {
+		test_key => 'defaults',
+		dump_cmd => [
+			['pg_dump', '-Fc', '-Z6', '--no-sync', 'postgres'],
+			'|',
+			['perl', '-pe', ''],  # make pg_dump's stdout unseekable
+			">$tempdir/defaults_custom_format_no_seek_parallel_restore"
+		],
+		restore_cmd => [
+			'pg_restore', '-Fc', '--jobs=2',
+			'--dbname=secondary_dump',
+			"$tempdir/defaults_custom_format_no_seek_parallel_restore",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_dir_format => {
@@ -153,6 +183,24 @@ my %pgdump_runs = (
 			"$tempdir/defaults_dir_format",
 		],
 	},
+	defaults_dir_format_parallel => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump',                                      '-Fd',
+			'--jobs=2',                                     '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fd',
+			'--jobs=2', '--dbname=secondary_dump',
+			"$tempdir/defaults_dir_format_parallel",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_parallel => {
@@ -3298,6 +3346,35 @@ my %tests = (
 			%full_runs, %dump_test_schema_runs, section_pre_data => 1,
 		},
 		unlike => { exclude_dump_test_schema => 1 },
+	},
+
+	# This reliably produces the "possibly due to out-of-order restore request"
+	# when restoring with -j 2 when this was written but future changes to how
+	# pg_dump works could accidentally correctly order things as we're not really
+	# telling pg_dump how to do its on-disk layout.
+	'custom dump out-of-order' => {
+		create_sql   => '
+			CREATE TABLE dump_test.ooo_parent0 (id int primary key);
+			CREATE TABLE dump_test.ooo_child0 (
+				parent0 int references dump_test.ooo_parent0 (id)
+			);
+			CREATE TABLE dump_test.ooo_parent1 (id int primary key);
+			CREATE TABLE dump_test.ooo_child1 (
+				parent0 int references dump_test.ooo_parent1 (id)
+			);',
+		regexp => qr/^
+			 \QCREATE TABLE dump_test.ooo_child0 (\E\n
+			 \s+\Qparent0 integer\E\n
+			 \Q);\E\n/xm,
+		like => {
+			%full_runs, %dump_test_schema_runs,
+			section_pre_data => 1,
+			defaults_custom_format_no_seek => 1,
+			defaults_custom_format_no_seek_parallel_restore => 1,
+		},
+		unlike => {
+			exclude_dump_test_schema => 1,
+		},
 	});
 
 #########################################
@@ -3350,6 +3427,11 @@ foreach my $run (sort keys %pgdump_runs)
 		$num_tests++;
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$num_tests++;
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3499,12 +3581,23 @@ foreach my $run (sort keys %pgdump_runs)
 	$node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
 		"$run: pg_dump runs");
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('postgres', 'create database secondary_dump;');
+	}
+
 	if ($pgdump_runs{$run}->{restore_cmd})
 	{
 		$node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
 			"$run: pg_restore runs");
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->command_ok(\@{ $pgdump_runs{$run}->{secondary_dump_cmd} },
+			"$run: secondary pg_dump runs");
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3561,6 +3654,14 @@ foreach my $run (sort keys %pgdump_runs)
 			}
 		}
 	}
+
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('secondary_dump',
+			'alter subscription sub1 set (slot_name = NONE);');
+		$node->safe_psql('secondary_dump', 'drop subscription sub1;');
+		$node->safe_psql('postgres',       'drop database secondary_dump;');
+	}
 }
 
 #########################################
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index d579d5c177..fb90d1abcc 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -338,8 +338,16 @@ The return value from the command is passed through.
 
 sub run_log
 {
-	print("# Running: " . join(" ", @{ $_[0] }) . "\n");
-	return IPC::Run::run(@_);
+	my $flatten = sub {
+		return map { ref eq 'ARRAY' ? @$_ : $_ } @_;
+	};
+	my @x = @{$_[0]};
+	print("# Running: " . join(" ", $flatten->(@x)) . "\n");
+	if (ref($x[0]) ne '') {
+		return IPC::Run::run(@x);
+	} else {
+		return IPC::Run::run(@_);
+	}
 }
 
 =pod
-- 
2.27.0

0003-Remove-unused-seek-check-in-tar-dump-format.patch (text/x-diff; charset=us-ascii)
From fa48bd25d5f4a3c4a36a0f6ef09898f816d79286 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Fri, 22 May 2020 17:27:51 -0400
Subject: [PATCH 3/5] Remove unused seek check in tar dump format

---
 src/bin/pg_dump/pg_backup_tar.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index b4f5942959..bc88c311c7 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -82,7 +82,6 @@ typedef struct
 
 typedef struct
 {
-	int			hasSeek;
 	pgoff_t		filePos;
 	TAR_MEMBER *blobToc;
 	FILE	   *tarFH;
@@ -192,8 +191,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 		 */
 		/* setvbuf(ctx->tarFH, NULL, _IONBF, 0); */
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * We don't support compression because reading the files back is not
 		 * possible since gzdopen uses buffered IO which totally screws file
@@ -226,8 +223,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 
 		ctx->tarFHpos = 0;
 
-		ctx->hasSeek = checkSeek(ctx->tarFH);
-
 		/*
 		 * Forcibly unmark the header as read since we use the lookahead
 		 * buffer
-- 
2.27.0

0004-Skip-tables-in-pg_restore-by-seeking-instead-of-read.patch (text/x-diff; charset=us-ascii)
From 881d5db123af6fca008eb479ac89624be1b75c50 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Mon, 25 May 2020 17:34:52 -0400
Subject: [PATCH 4/5] Skip tables in pg_restore by seeking instead of reading

---
 src/bin/pg_dump/pg_backup_custom.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index a873ac3afa..16414113af 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -575,19 +575,27 @@ _skipData(ArchiveHandle *AH)
 	blkLen = ReadInt(AH);
 	while (blkLen != 0)
 	{
-		if (blkLen > buflen)
+		if (ctx->hasSeek)
 		{
-			if (buf)
-				free(buf);
-			buf = (char *) pg_malloc(blkLen);
-			buflen = blkLen;
+			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
+				fatal("error during file seek: %m");
 		}
-		if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+		else
 		{
-			if (feof(AH->FH))
-				fatal("could not read from input file: end of file");
-			else
-				fatal("could not read from input file: %m");
+			if (blkLen > buflen)
+			{
+				if (buf)
+					free(buf);
+				buf = (char *) pg_malloc(blkLen);
+				buflen = blkLen;
+			}
+			if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+			{
+				if (feof(AH->FH))
+					fatal("could not read from input file: end of file");
+				else
+					fatal("could not read from input file: %m");
+			}
 		}
 
 		ctx->filePos += blkLen;
-- 
2.27.0

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Gilman (#14)
2 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

David Gilman <dgilman@gilslotd.com> writes:

On Thu, Jul 02, 2020 at 05:25:21PM -0400, Tom Lane wrote:

I guess I'm completely confused about the purpose of these patches.
Far from coping with the situation of an unseekable file, they appear
to change pg_restore so that it fails altogether if it can't seek
its input file. Why would we want to do this?

I'm not sure where the "fails altogether if it can't seek" is.

I misread the patch, is where :-(

As penance, I spent some time studying this patchset, and have a few
comments:

1. The proposed doc change in 0001 seems out-of-date; isn't it adding a
warning about exactly the deficiency that the rest of the patch is
eliminating? Note that the preceding para already says that the input
has to be seekable, so that's covered. Maybe there is reason for
documenting that parallel restore will be slower if the archive was
written in a non-seekable way ... but that's not what this says.

2. It struck me that the patch is still pretty inefficient, in that
anytime it has to back up in an offset-less archive, it blindly rewinds
to dataStart and rescans everything. In the worst case that'd still be
O(N^2) work, and it's really not necessary, because once we've seen a
given data block we know where it is. We just have to remember that,
which seems easy enough. (Well, on Windows it's a bit trickier because
the state in question is shared across threads; but that's good, it might
save some work.)

3. Extending on #2, we actually don't need the rewind and retry logic
at all. If we are looking for a block we haven't already seen, and we
get to the end of the archive, it ain't there. (This is a bit less
obvious in the Windows case than otherwise, but I think it's still true,
given that the start state is either "all offsets known" or "no offsets
known". A particular thread might skip over some blocks on the strength
of an offset established by another thread, but the blocks ahead of that
spot must now all have known offsets.)

4. Patch 0002 seems mighty expensive for the amount of code coverage
it's adding. On my machine it seems to raise the overall runtime of
pg_dump's "make installcheck" by about 10%, and the only new coverage
is of the few lines added here. I wonder if we couldn't cover that
more cheaply by testing what happens when we use a "-L" option with
an intentionally mis-sorted restore list.
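
For illustration, a -L based test along those lines might look roughly like
this in 002_pg_dump.pl's terms (a sketch only, not one of the attached
patches; the file names, the list-reversal step, and the exact wiring are
assumptions):

# Sketch: build an intentionally mis-sorted list and restore with -L.
# The dump must still lack data offsets (e.g. be written through a pipe,
# as in the attached test) for this to exercise the out-of-order path.
my $dump = "$tempdir/ooo.dump";
my $list = "$tempdir/ooo.list";

# Reverse the TOC listing; pg_restore -L ignores the ';' comment lines,
# so reversing the whole file is enough to mis-sort the entries.
my $toc;
IPC::Run::run([ 'pg_restore', '-l', $dump ], '>', \$toc);
open my $fh, '>', $list or die $!;
print $fh join("\n", reverse split(/\n/, $toc)), "\n";
close $fh;

# Emitting a script with -f needs no secondary database but still reads
# every data block, now in list order rather than disk order.
$node->command_ok(
	[ 'pg_restore', '-L', $list, '-f', "$tempdir/ooo.sql", $dump ],
	'pg_restore runs with a mis-sorted -L list');

That would drop the secondary restore database and the extra pg_dump pass
entirely; whether it is actually cheaper would need timing.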

5. I'm inclined to reject 0003. It's not saving anything very meaningful,
and we'd just have to put the code back whenever somebody gets around
to making pg_backup_tar.c capable of out-of-order restores like
pg_backup_custom.c is now able to do.

The attached 0001 rewrites your 0001 as per the above ideas (dropping
the proposed doc change for now), and includes your 0004 for simplicity.
I'm including your 0002 verbatim just so the cfbot will be able to do a
meaningful test on 0001; but as stated, I don't really want to commit it.

regards, tom lane

Attachments:

0001-remember-seek-positions.patch (text/x-diff; charset=us-ascii)
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 6ab122242c..2659676fd9 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -421,13 +421,47 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.  Remember the
+		 * positions of skipped-over blocks, so that if we later decide we
+		 * need to read one, we'll be able to seek to it.
 		 */
-		_readBlockHeader(AH, &blkType, &id);
-
-		while (blkType != EOF && id != te->dumpId)
+		for (;;)
 		{
+			pgoff_t		thisBlkPos = _getFilePos(AH, ctx);
+			TocEntry   *otherte;
+
+			_readBlockHeader(AH, &blkType, &id);
+
+			if (blkType == EOF || id == te->dumpId)
+				break;
+
+			otherte = getTocEntryByDumpId(AH, id);
+			if (otherte && otherte->formatData)
+			{
+				lclTocEntry *othertctx = (lclTocEntry *) otherte->formatData;
+
+				/*
+				 * Note: on Windows, multiple threads might access/update the
+				 * same lclTocEntry concurrently, but that should be safe as
+				 * long as we update dataPos before dataState.  Ideally, we'd
+				 * use pg_write_barrier() to enforce that, but the needed
+				 * infrastructure doesn't exist in frontend code.  But Windows
+				 * only runs on machines with strong store ordering, so it
+				 * should be okay for now.
+				 */
+				if (othertctx->dataState == K_OFFSET_POS_NOT_SET)
+				{
+					othertctx->dataPos = thisBlkPos;
+					othertctx->dataState = K_OFFSET_POS_SET;
+				}
+				else if (othertctx->dataPos != thisBlkPos ||
+						 othertctx->dataState != K_OFFSET_POS_SET)
+				{
+					/* sanity check */
+					pg_log_warning("data block %d has wrong seek position", id);
+				}
+			}
+
 			switch (blkType)
 			{
 				case BLK_DATA:
@@ -443,7 +477,6 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 						  blkType);
 					break;
 			}
-			_readBlockHeader(AH, &blkType, &id);
 		}
 	}
 	else
@@ -455,20 +488,18 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 		_readBlockHeader(AH, &blkType, &id);
 	}
 
-	/* Produce suitable failure message if we fell off end of file */
+	/*
+	 * If we reached EOF without finding the block we want, then either it
+	 * doesn't exist, or it does but we lack the ability to seek back to it.
+	 */
 	if (blkType == EOF)
 	{
-		if (tctx->dataState == K_OFFSET_POS_NOT_SET)
-			fatal("could not find block ID %d in archive -- "
-				  "possibly due to out-of-order restore request, "
-				  "which cannot be handled due to lack of data offsets in archive",
-				  te->dumpId);
-		else if (!ctx->hasSeek)
+		if (!ctx->hasSeek)
 			fatal("could not find block ID %d in archive -- "
 				  "possibly due to out-of-order restore request, "
 				  "which cannot be handled due to non-seekable input file",
 				  te->dumpId);
-		else					/* huh, the dataPos led us to EOF? */
+		else
 			fatal("could not find block ID %d in archive -- "
 				  "possibly corrupt archive",
 				  te->dumpId);
@@ -560,19 +591,27 @@ _skipData(ArchiveHandle *AH)
 	blkLen = ReadInt(AH);
 	while (blkLen != 0)
 	{
-		if (blkLen > buflen)
+		if (ctx->hasSeek)
 		{
-			if (buf)
-				free(buf);
-			buf = (char *) pg_malloc(blkLen);
-			buflen = blkLen;
+			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
+				fatal("error during file seek: %m");
 		}
-		if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+		else
 		{
-			if (feof(AH->FH))
-				fatal("could not read from input file: end of file");
-			else
-				fatal("could not read from input file: %m");
+			if (blkLen > buflen)
+			{
+				if (buf)
+					free(buf);
+				buf = (char *) pg_malloc(blkLen);
+				buflen = blkLen;
+			}
+			if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+			{
+				if (feof(AH->FH))
+					fatal("could not read from input file: end of file");
+				else
+					fatal("could not read from input file: %m");
+			}
 		}
 
 		ctx->filePos += blkLen;
@@ -830,10 +869,13 @@ _Clone(ArchiveHandle *AH)
 		fatal("compressor active");
 
 	/*
+	 * We intentionally do not clone TOC-entry-local state: it's useful to
+	 * share knowledge about where the data blocks are across threads.
+	 * _PrintTocData has to be careful about the order of operations on that
+	 * state, though.
+	 *
 	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.  Likewise,
-	 * TOC-entry-local state isn't an issue because any one TOC entry is
-	 * touched by just one worker child.
+	 * entry per archive, so no parallelism is possible.
 	 */
 }
 
0002-Add-integration-test-for-out-of-order-TOC-requests-i.patch (text/x-diff)
From 6e8294021482fb76cb17d6c150ec6acb6d0cac37 Mon Sep 17 00:00:00 2001
From: David Gilman <davidgilman1@gmail.com>
Date: Sat, 23 May 2020 10:49:33 -0400
Subject: [PATCH 2/5] Add integration test for out-of-order TOC requests in
 pg_restore

---
 src/bin/pg_dump/t/002_pg_dump.pl | 107 ++++++++++++++++++++++++++++++-
 src/test/perl/TestLib.pm         |  12 +++-
 2 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index e116235769..d99e7a5d22 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -31,9 +31,20 @@ my $tempdir_short = TestLib::tempdir_short;
 # generate a text file to run through the tests from the
 # non-text file generated by pg_dump.
 #
-# TODO: Have pg_restore actually restore to an independent
-# database and then pg_dump *that* database (or something along
-# those lines) to validate that part of the process.
+# secondary_dump_cmd is an optional key that enables a
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, real database) ->
+#	pg_dump (secondary_dump_cmd, SQL output (for matching))
+#
+# test process instead of the default
+#
+# 	pg_dump (dump_cmd) ->
+# 	pg_restore (restore_cmd, SQL output (for matching))
+#
+# test process. The secondary_dump database is created for
+# you and your restore_cmd and secondary_dump_cmd must read
+# and write from it.
 
 my %pgdump_runs = (
 	binary_upgrade => {
@@ -139,6 +150,25 @@ my %pgdump_runs = (
 			"$tempdir/defaults_custom_format.dump",
 		],
 	},
+	defaults_custom_format_no_seek_parallel_restore => {
+		test_key => 'defaults',
+		dump_cmd => [
+			['pg_dump', '-Fc', '-Z6', '--no-sync', 'postgres'],
+			'|',
+			['perl', '-pe', ''],  # make pg_dump's stdout unseekable
+			">$tempdir/defaults_custom_format_no_seek_parallel_restore"
+		],
+		restore_cmd => [
+			'pg_restore', '-Fc', '--jobs=2',
+			'--dbname=secondary_dump',
+			"$tempdir/defaults_custom_format_no_seek_parallel_restore",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_custom_format_no_seek_parallel_restore.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_dir_format => {
@@ -153,6 +183,24 @@ my %pgdump_runs = (
 			"$tempdir/defaults_dir_format",
 		],
 	},
+	defaults_dir_format_parallel => {
+		test_key => 'defaults',
+		dump_cmd => [
+			'pg_dump',                                      '-Fd',
+			'--jobs=2',                                     '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel", 'postgres',
+		],
+		restore_cmd => [
+			'pg_restore', '-Fd',
+			'--jobs=2', '--dbname=secondary_dump',
+			"$tempdir/defaults_dir_format_parallel",
+		],
+		secondary_dump_cmd => [
+			'pg_dump', '-Fp', '--no-sync',
+			"--file=$tempdir/defaults_dir_format_parallel.sql",
+			'secondary_dump',
+		],
+	},
 
 	# Do not use --no-sync to give test coverage for data sync.
 	defaults_parallel => {
@@ -3298,6 +3346,35 @@ my %tests = (
 			%full_runs, %dump_test_schema_runs, section_pre_data => 1,
 		},
 		unlike => { exclude_dump_test_schema => 1 },
+	},
+
+	# This reliably produces the "possibly due to out-of-order restore request"
+	# when restoring with -j 2 when this was written but future changes to how
+	# pg_dump works could accidentally correctly order things as we're not really
+	# telling pg_dump how to do its on-disk layout.
+	'custom dump out-of-order' => {
+		create_sql   => '
+			CREATE TABLE dump_test.ooo_parent0 (id int primary key);
+			CREATE TABLE dump_test.ooo_child0 (
+				parent0 int references dump_test.ooo_parent0 (id)
+			);
+			CREATE TABLE dump_test.ooo_parent1 (id int primary key);
+			CREATE TABLE dump_test.ooo_child1 (
+				parent0 int references dump_test.ooo_parent1 (id)
+			);',
+		regexp => qr/^
+			 \QCREATE TABLE dump_test.ooo_child0 (\E\n
+			 \s+\Qparent0 integer\E\n
+			 \Q);\E\n/xm,
+		like => {
+			%full_runs, %dump_test_schema_runs,
+			section_pre_data => 1,
+			defaults_custom_format_no_seek => 1,
+			defaults_custom_format_no_seek_parallel_restore => 1,
+		},
+		unlike => {
+			exclude_dump_test_schema => 1,
+		},
 	});
 
 #########################################
@@ -3350,6 +3427,11 @@ foreach my $run (sort keys %pgdump_runs)
 		$num_tests++;
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$num_tests++;
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3499,12 +3581,23 @@ foreach my $run (sort keys %pgdump_runs)
 	$node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
 		"$run: pg_dump runs");
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('postgres', 'create database secondary_dump;');
+	}
+
 	if ($pgdump_runs{$run}->{restore_cmd})
 	{
 		$node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
 			"$run: pg_restore runs");
 	}
 
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->command_ok(\@{ $pgdump_runs{$run}->{secondary_dump_cmd} },
+			"$run: secondary pg_dump runs");
+	}
+
 	if ($pgdump_runs{$run}->{test_key})
 	{
 		$test_key = $pgdump_runs{$run}->{test_key};
@@ -3561,6 +3654,14 @@ foreach my $run (sort keys %pgdump_runs)
 			}
 		}
 	}
+
+	if ($pgdump_runs{$run}->{secondary_dump_cmd})
+	{
+		$node->safe_psql('secondary_dump',
+			'alter subscription sub1 set (slot_name = NONE);');
+		$node->safe_psql('secondary_dump', 'drop subscription sub1;');
+		$node->safe_psql('postgres',       'drop database secondary_dump;');
+	}
 }
 
 #########################################
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index d579d5c177..fb90d1abcc 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -338,8 +338,16 @@ The return value from the command is passed through.
 
 sub run_log
 {
-	print("# Running: " . join(" ", @{ $_[0] }) . "\n");
-	return IPC::Run::run(@_);
+	my $flatten = sub {
+		return map { ref eq 'ARRAY' ? @$_ : $_ } @_;
+	};
+	my @x = @{$_[0]};
+	print("# Running: " . join(" ", $flatten->(@x)) . "\n");
+	if (ref($x[0]) ne '') {
+		return IPC::Run::run(@x);
+	} else {
+		return IPC::Run::run(@_);
+	}
 }
 
 =pod
-- 
2.27.0

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#15)
2 attachment(s)
Re: Warn when parallel restoring a custom dump without data offsets

I wrote:

The attached 0001 rewrites your 0001 as per the above ideas (dropping
the proposed doc change for now), and includes your 0004 for simplicity.
I'm including your 0002 verbatim just so the cfbot will be able to do a
meaningful test on 0001; but as stated, I don't really want to commit it.

I spent some more time testing this, by trying to dump and restore the
core regression database. I immediately noticed that I sometimes got
"ftell mismatch with expected position -- ftell used" warnings, though
it was a bit variable depending on the -j level. The reason was fairly
apparent on looking at the code: we had various fseeko() calls in code
paths that did not bother to correct ctx->filePos afterwards. In fact,
*none* of the four existing fseeko calls in pg_backup_custom.c did so.
It's fairly surprising that that hadn't caused a problem up to now.

I started to add adjustments of ctx->filePos after all the fseeko calls,
but then began to wonder why we don't just rip the variable out entirely.
The only places where we need it are to set dataPos for data blocks,
but that's an entirely pointless activity if we don't have seek
capability, because we're not going to be able to rewrite the TOC
to emit the updated values.

Hence, the 0000 patch attached rips out ctx->filePos, and then
0001 is the currently-discussed patch rebased on that. I also added
an additional refinement, which is to track the furthest point we've
scanned to while looking for data blocks in an offset-less file.
If we have seek capability, then when we need to resume looking for
data blocks we can search forward from that spot rather than wherever
we happened to have stopped at. This fixes an additional source
of potentially-O(N^2) behavior if we have to restore blocks in a
very out-of-order fashion. I'm not sure that it makes much difference
in common cases, but with this we can say positively that we don't
scan the same block more than once per worker process.

I'm still unhappy about the proposed test case (0002), but now
I have a more concrete reason for that: it didn't catch this bug,
so the coverage is still pretty miserable.

Dump-and-restore-the-regression-database used to be a pretty common
manual test for pg_dump, but we never got around to automating it,
possibly because we figured that the pg_upgrade test script covers
that ground. It's becoming gruesomely clear that pg_upgrade is a
distinct operating mode that doesn't necessarily have the same bugs.
So I'm inclined to feel that what we ought to do is automate a test
of that sort; but first we'll have to fix the existing bugs described
at [1][2].
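
A bare-bones sketch of that automation (untested; it assumes the regression database already exists on $node, and the database and file names here are placeholders) might be no more than:

    # Untested sketch: round-trip the regression database through a
    # custom-format dump and a parallel restore into a fresh database.
    my $archive = "$tempdir/regression.dump";
    $node->command_ok(
        [ 'pg_dump', '-Fc', '--no-sync', '-f', $archive, 'regression' ],
        'dump the regression database');
    $node->safe_psql('postgres', 'CREATE DATABASE regression_restored;');
    $node->command_ok(
        [ 'pg_restore', '-j', '2', '-d', 'regression_restored', $archive ],
        'parallel restore of the regression database');

A real test would presumably also re-dump the restored database and compare, much as the secondary_dump_cmd machinery in 0002 does.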

Given the current state of affairs, I'm inclined to commit the
attached with no new test coverage, and then come back and look
at better testing after the other bugs are dealt with.

regards, tom lane

[1]: /messages/by-id/3169466.1594841366@sss.pgh.pa.us
[2]: /messages/by-id/3170626.1594842723@sss.pgh.pa.us

Attachments:

0000-remove-filePos-tracking.patch (text/x-diff)
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 6ab122242c..3a9881d601 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -70,14 +70,12 @@ typedef struct
 {
 	CompressorState *cs;
 	int			hasSeek;
-	pgoff_t		filePos;
-	pgoff_t		dataStart;
 } lclContext;
 
 typedef struct
 {
 	int			dataState;
-	pgoff_t		dataPos;
+	pgoff_t		dataPos;		/* valid only if dataState=K_OFFSET_POS_SET */
 } lclTocEntry;
 
 
@@ -144,8 +142,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	AH->lo_buf_size = LOBBUFSIZE;
 	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
 
-	ctx->filePos = 0;
-
 	/*
 	 * Now open the file
 	 */
@@ -185,7 +181,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 
 		ReadHead(AH);
 		ReadToc(AH);
-		ctx->dataStart = _getFilePos(AH, ctx);
 	}
 
 }
@@ -290,7 +285,8 @@ _StartData(ArchiveHandle *AH, TocEntry *te)
 	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
 
 	tctx->dataPos = _getFilePos(AH, ctx);
-	tctx->dataState = K_OFFSET_POS_SET;
+	if (tctx->dataPos >= 0)
+		tctx->dataState = K_OFFSET_POS_SET;
 
 	_WriteByte(AH, BLK_DATA);	/* Block type */
 	WriteInt(AH, te->dumpId);	/* For sanity check */
@@ -350,7 +346,8 @@ _StartBlobs(ArchiveHandle *AH, TocEntry *te)
 	lclTocEntry *tctx = (lclTocEntry *) te->formatData;
 
 	tctx->dataPos = _getFilePos(AH, ctx);
-	tctx->dataState = K_OFFSET_POS_SET;
+	if (tctx->dataPos >= 0)
+		tctx->dataState = K_OFFSET_POS_SET;
 
 	_WriteByte(AH, BLK_BLOBS);	/* Block type */
 	WriteInt(AH, te->dumpId);	/* For sanity check */
@@ -551,7 +548,6 @@ _skipBlobs(ArchiveHandle *AH)
 static void
 _skipData(ArchiveHandle *AH)
 {
-	lclContext *ctx = (lclContext *) AH->formatData;
 	size_t		blkLen;
 	char	   *buf = NULL;
 	int			buflen = 0;
@@ -575,8 +571,6 @@ _skipData(ArchiveHandle *AH)
 				fatal("could not read from input file: %m");
 		}
 
-		ctx->filePos += blkLen;
-
 		blkLen = ReadInt(AH);
 	}
 
@@ -594,12 +588,10 @@ _skipData(ArchiveHandle *AH)
 static int
 _WriteByte(ArchiveHandle *AH, const int i)
 {
-	lclContext *ctx = (lclContext *) AH->formatData;
 	int			res;
 
 	if ((res = fputc(i, AH->FH)) == EOF)
 		WRITE_ERROR_EXIT;
-	ctx->filePos += 1;
 
 	return 1;
 }
@@ -615,13 +607,11 @@ _WriteByte(ArchiveHandle *AH, const int i)
 static int
 _ReadByte(ArchiveHandle *AH)
 {
-	lclContext *ctx = (lclContext *) AH->formatData;
 	int			res;
 
 	res = getc(AH->FH);
 	if (res == EOF)
 		READ_ERROR_EXIT(AH->FH);
-	ctx->filePos += 1;
 	return res;
 }
 
@@ -635,11 +625,8 @@ _ReadByte(ArchiveHandle *AH)
 static void
 _WriteBuf(ArchiveHandle *AH, const void *buf, size_t len)
 {
-	lclContext *ctx = (lclContext *) AH->formatData;
-
 	if (fwrite(buf, 1, len, AH->FH) != len)
 		WRITE_ERROR_EXIT;
-	ctx->filePos += len;
 }
 
 /*
@@ -652,11 +639,8 @@ _WriteBuf(ArchiveHandle *AH, const void *buf, size_t len)
 static void
 _ReadBuf(ArchiveHandle *AH, void *buf, size_t len)
 {
-	lclContext *ctx = (lclContext *) AH->formatData;
-
 	if (fread(buf, 1, len, AH->FH) != len)
 		READ_ERROR_EXIT(AH->FH);
-	ctx->filePos += len;
 }
 
 /*
@@ -688,7 +672,6 @@ _CloseArchive(ArchiveHandle *AH)
 		if (tpos < 0 && ctx->hasSeek)
 			fatal("could not determine seek position in archive file: %m");
 		WriteToc(AH);
-		ctx->dataStart = _getFilePos(AH, ctx);
 		WriteDataChunks(AH, NULL);
 
 		/*
@@ -862,30 +845,24 @@ _WorkerJobRestoreCustom(ArchiveHandle *AH, TocEntry *te)
 
 /*
  * Get the current position in the archive file.
+ *
+ * With a non-seekable archive file, we may not be able to obtain the
+ * file position.  If so, just return -1.  It's not too important in
+ * that case because we won't be able to rewrite the TOC to fill in
+ * data block offsets anyway.
  */
 static pgoff_t
 _getFilePos(ArchiveHandle *AH, lclContext *ctx)
 {
 	pgoff_t		pos;
 
-	if (ctx->hasSeek)
+	pos = ftello(AH->FH);
+	if (pos < 0)
 	{
-		/*
-		 * Prior to 1.7 (pg7.3) we relied on the internally maintained
-		 * pointer.  Now we rely on ftello() always, unless the file has been
-		 * found to not support it.  For debugging purposes, print a warning
-		 * if the internal pointer disagrees, so that we're more likely to
-		 * notice if something's broken about the internal position tracking.
-		 */
-		pos = ftello(AH->FH);
-		if (pos < 0)
+		/* Not expected if we found we can seek. */
+		if (ctx->hasSeek)
 			fatal("could not determine seek position in archive file: %m");
-
-		if (pos != ctx->filePos)
-			pg_log_warning("ftell mismatch with expected position -- ftell used");
 	}
-	else
-		pos = ctx->filePos;
 	return pos;
 }
 
@@ -897,7 +874,6 @@ _getFilePos(ArchiveHandle *AH, lclContext *ctx)
 static void
 _readBlockHeader(ArchiveHandle *AH, int *type, int *id)
 {
-	lclContext *ctx = (lclContext *) AH->formatData;
 	int			byt;
 
 	/*
@@ -918,7 +894,6 @@ _readBlockHeader(ArchiveHandle *AH, int *type, int *id)
 			*id = 0;			/* don't return an uninitialized value */
 			return;
 		}
-		ctx->filePos += 1;
 	}
 
 	*id = ReadInt(AH);
0001-remember-seek-positions.patch (text/x-diff)
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 3a9881d601..971e6adf48 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -70,6 +70,8 @@ typedef struct
 {
 	CompressorState *cs;
 	int			hasSeek;
+	/* lastFilePos is used only when reading, and may be invalid if !hasSeek */
+	pgoff_t		lastFilePos;	/* position after last data block we've read */
 } lclContext;
 
 typedef struct
@@ -181,8 +183,13 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 
 		ReadHead(AH);
 		ReadToc(AH);
-	}
 
+		/*
+		 * Remember location of first data block (i.e., the point after TOC)
+		 * in case we have to search for desired data blocks.
+		 */
+		ctx->lastFilePos = _getFilePos(AH, ctx);
+	}
 }
 
 /*
@@ -418,13 +425,62 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 	{
 		/*
 		 * We cannot seek directly to the desired block.  Instead, skip over
-		 * block headers until we find the one we want.  This could fail if we
-		 * are asked to restore items out-of-order.
+		 * block headers until we find the one we want.  Remember the
+		 * positions of skipped-over blocks, so that if we later decide we
+		 * need to read one, we'll be able to seek to it.
+		 *
+		 * When our input file is seekable, we can do the search starting from
+		 * the point after the last data block we scanned in previous
+		 * iterations of this function.
 		 */
-		_readBlockHeader(AH, &blkType, &id);
+		if (ctx->hasSeek)
+		{
+			if (fseeko(AH->FH, ctx->lastFilePos, SEEK_SET) != 0)
+				fatal("error during file seek: %m");
+		}
 
-		while (blkType != EOF && id != te->dumpId)
+		for (;;)
 		{
+			pgoff_t		thisBlkPos = _getFilePos(AH, ctx);
+
+			_readBlockHeader(AH, &blkType, &id);
+
+			if (blkType == EOF || id == te->dumpId)
+				break;
+
+			/* Remember the block position, if we got one */
+			if (thisBlkPos >= 0)
+			{
+				TocEntry   *otherte = getTocEntryByDumpId(AH, id);
+
+				if (otherte && otherte->formatData)
+				{
+					lclTocEntry *othertctx = (lclTocEntry *) otherte->formatData;
+
+					/*
+					 * Note: on Windows, multiple threads might access/update
+					 * the same lclTocEntry concurrently, but that should be
+					 * safe as long as we update dataPos before dataState.
+					 * Ideally, we'd use pg_write_barrier() to enforce that,
+					 * but the needed infrastructure doesn't exist in frontend
+					 * code.  But Windows only runs on machines with strong
+					 * store ordering, so it should be okay for now.
+					 */
+					if (othertctx->dataState == K_OFFSET_POS_NOT_SET)
+					{
+						othertctx->dataPos = thisBlkPos;
+						othertctx->dataState = K_OFFSET_POS_SET;
+					}
+					else if (othertctx->dataPos != thisBlkPos ||
+							 othertctx->dataState != K_OFFSET_POS_SET)
+					{
+						/* sanity check */
+						pg_log_warning("data block %d has wrong seek position",
+									   id);
+					}
+				}
+			}
+
 			switch (blkType)
 			{
 				case BLK_DATA:
@@ -440,7 +496,6 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 						  blkType);
 					break;
 			}
-			_readBlockHeader(AH, &blkType, &id);
 		}
 	}
 	else
@@ -452,20 +507,18 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 		_readBlockHeader(AH, &blkType, &id);
 	}
 
-	/* Produce suitable failure message if we fell off end of file */
+	/*
+	 * If we reached EOF without finding the block we want, then either it
+	 * doesn't exist, or it does but we lack the ability to seek back to it.
+	 */
 	if (blkType == EOF)
 	{
-		if (tctx->dataState == K_OFFSET_POS_NOT_SET)
-			fatal("could not find block ID %d in archive -- "
-				  "possibly due to out-of-order restore request, "
-				  "which cannot be handled due to lack of data offsets in archive",
-				  te->dumpId);
-		else if (!ctx->hasSeek)
+		if (!ctx->hasSeek)
 			fatal("could not find block ID %d in archive -- "
 				  "possibly due to out-of-order restore request, "
 				  "which cannot be handled due to non-seekable input file",
 				  te->dumpId);
-		else					/* huh, the dataPos led us to EOF? */
+		else
 			fatal("could not find block ID %d in archive -- "
 				  "possibly corrupt archive",
 				  te->dumpId);
@@ -491,6 +544,20 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
 				  blkType);
 			break;
 	}
+
+	/*
+	 * If our input file is seekable but lacks data offsets, update our
+	 * knowledge of where to start future searches from.  (Note that we did
+	 * not update the current TE's dataState/dataPos.  We could have, but
+	 * there is no point since it will not be visited again.)
+	 */
+	if (ctx->hasSeek && tctx->dataState == K_OFFSET_POS_NOT_SET)
+	{
+		pgoff_t		curPos = _getFilePos(AH, ctx);
+
+		if (curPos > ctx->lastFilePos)
+			ctx->lastFilePos = curPos;
+	}
 }
 
 /*
@@ -548,6 +615,7 @@ _skipBlobs(ArchiveHandle *AH)
 static void
 _skipData(ArchiveHandle *AH)
 {
+	lclContext *ctx = (lclContext *) AH->formatData;
 	size_t		blkLen;
 	char	   *buf = NULL;
 	int			buflen = 0;
@@ -556,19 +624,27 @@ _skipData(ArchiveHandle *AH)
 	blkLen = ReadInt(AH);
 	while (blkLen != 0)
 	{
-		if (blkLen > buflen)
+		if (ctx->hasSeek)
 		{
-			if (buf)
-				free(buf);
-			buf = (char *) pg_malloc(blkLen);
-			buflen = blkLen;
+			if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
+				fatal("error during file seek: %m");
 		}
-		if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+		else
 		{
-			if (feof(AH->FH))
-				fatal("could not read from input file: end of file");
-			else
-				fatal("could not read from input file: %m");
+			if (blkLen > buflen)
+			{
+				if (buf)
+					free(buf);
+				buf = (char *) pg_malloc(blkLen);
+				buflen = blkLen;
+			}
+			if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+			{
+				if (feof(AH->FH))
+					fatal("could not read from input file: end of file");
+				else
+					fatal("could not read from input file: %m");
+			}
 		}
 
 		blkLen = ReadInt(AH);
@@ -804,6 +880,9 @@ _Clone(ArchiveHandle *AH)
 {
 	lclContext *ctx = (lclContext *) AH->formatData;
 
+	/*
+	 * Each thread must have private lclContext working state.
+	 */
 	AH->formatData = (lclContext *) pg_malloc(sizeof(lclContext));
 	memcpy(AH->formatData, ctx, sizeof(lclContext));
 	ctx = (lclContext *) AH->formatData;
@@ -813,10 +892,13 @@ _Clone(ArchiveHandle *AH)
 		fatal("compressor active");
 
 	/*
+	 * We intentionally do not clone TOC-entry-local state: it's useful to
+	 * share knowledge about where the data blocks are across threads.
+	 * _PrintTocData has to be careful about the order of operations on that
+	 * state, though.
+	 *
 	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.  Likewise,
-	 * TOC-entry-local state isn't an issue because any one TOC entry is
-	 * touched by just one worker child.
+	 * entry per archive, so no parallelism is possible.
 	 */
 }
 
#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#16)
Re: Warn when parallel restoring a custom dump without data offsets

I wrote:

Given the current state of affairs, I'm inclined to commit the
attached with no new test coverage, and then come back and look
at better testing after the other bugs are dealt with.

Pushed back to v12.

regards, tom lane