Inefficiency in parallel pg_restore with many tables

Started by Tom Lane over 2 years ago, 37 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

I looked into the performance gripe at [1] about pg_restore not making
effective use of parallel workers when there are a lot of tables.
I was able to reproduce that by dumping and parallel restoring 100K
tables made according to this script:

do $$
begin
for i in 1..100000 loop
execute format('create table t%s (f1 int unique, f2 int unique);', i);
execute format('insert into t%s select x, x from generate_series(1,1000) x',
i);
if i % 100 = 0 then commit; end if;
end loop;
end
$$;

Once pg_restore reaches the parallelizable part of the restore, what
I see is that the parent pg_restore process goes to 100% CPU while its
children (and the server) mostly sit idle; that is, the task dispatch
logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
to keep the children busy. A quick perf check showed most of the time
being eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase
with 100K TableData items that are ready to go (they are in the
ready_list) and 200K AddConstraint items that are pending, because
we make those have dependencies on the corresponding TableData so we
don't build an index until after its table is populated. Each time
one of the TableData items is completed by some worker, the two
AddConstraint items for its table are moved from the pending_list
to the ready_list --- and that means ready_list_insert marks the
ready_list as no longer sorted. When we go to pop the next task
from the ready_list, we re-sort that entire list first. So
we spend something like O(N^2 * log(N)) time just sorting, if
there are N tables. Clearly, this code is much less bright
than it thinks it is (and that's all my fault, if memory serves).
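
For illustration, here is a much-simplified standalone sketch of that
pattern (not the actual pg_backup_archiver.c code; "Item" stands in for
TocEntry): every insert clears the sorted flag, so nearly every pop then
re-sorts the whole array.

#include <stdbool.h>
#include <stdlib.h>

typedef struct Item { long dataLength; } Item;  /* stand-in for TocEntry */

typedef struct
{
    Item  **items;      /* ready-to-run entries */
    int     first;      /* index of first valid entry */
    int     last;       /* index of last valid entry */
    bool    sorted;     /* are the valid entries currently sorted? */
} ReadyList;

/* sort larger jobs first, as TocEntrySizeCompare does */
static int
size_compare(const void *p1, const void *p2)
{
    long    s1 = (*(Item *const *) p1)->dataLength;
    long    s2 = (*(Item *const *) p2)->dataLength;

    return (s1 > s2) ? -1 : (s1 < s2) ? 1 : 0;
}

static void
ready_list_insert(ReadyList *rl, Item *it)
{
    rl->items[++rl->last] = it;
    rl->sorted = false;         /* any insert invalidates the ordering */
}

static Item *
pop_next_work_item(ReadyList *rl)
{
    if (!rl->sorted)            /* true on almost every call ... */
    {
        qsort(rl->items + rl->first, rl->last - rl->first + 1,
              sizeof(Item *), size_compare);    /* O(N log N) each time */
        rl->sorted = true;
    }
    /* ... so draining N entries costs roughly O(N^2 * log(N)) overall */
    return rl->items[rl->first++];
}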

I'm not sure how big a deal this is in practice: in most situations
the individual jobs are larger than they are in this toy example,
plus the initial non-parallelizable part of the restore is a bigger
bottleneck anyway with this many tables. Still, we do have one
real-world complaint, so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the
largest remaining job. Another idea could be to be a little less eager
to sort the list every time; I think in practice scheduling wouldn't
get much worse if we only re-sorted every so often.

I don't have time to pursue this right now, but perhaps someone
else would like to.

regards, tom lane

[1]: /messages/by-id/CAEzn=HSPXi6OS-5KzGMcZeKzWKOOX1me2u2eCiGtMEZDz9Fqdg@mail.gmail.com

#2 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#1)
Re: Inefficiency in parallel pg_restore with many tables

Hi,

On 2023-07-15 13:47:12 -0400, Tom Lane wrote:

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the
largest remaining job. Another idea could be to be a little less eager
to sort the list every time; I think in practice scheduling wouldn't
get much worse if we only re-sorted every so often.

Perhaps we could keep track of where the newly inserted items are, and use
insertion sort or such when the number of new elements is much smaller than
the size of the already sorted elements?

As you say, a straight priority heap might not be easy. But we could just open
code using two sorted arrays, one large, one for recent additions that needs
to be newly sorted. And occasionally merge the small array into the big array,
once it has gotten large enough that sorting becomes expensive. We could go
for a heap of N>2 such arrays, but I doubt it would be worth much.
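
A rough, hypothetical sketch of that two-array scheme, purely for
illustration (nothing like this exists in the tree, and a real version
would do a linear merge instead of re-running qsort over the big array;
"Item" again stands in for TocEntry):

#include <stdlib.h>
#include <string.h>

typedef struct Item { long dataLength; } Item;

typedef struct
{
    Item  **big;        /* sorted ascending by size; pop from the end */
    int     nbig;
    Item  **small;      /* recent additions, sorted on demand */
    int     nsmall;
} TwoArrayList;

static int
size_compare_asc(const void *p1, const void *p2)
{
    long    s1 = (*(Item *const *) p1)->dataLength;
    long    s2 = (*(Item *const *) p2)->dataLength;

    return (s1 < s2) ? -1 : (s1 > s2) ? 1 : 0;
}

static void
two_array_insert(TwoArrayList *l, Item *it, int merge_threshold)
{
    l->small[l->nsmall++] = it;

    /* occasionally fold the recent additions into the big array */
    if (l->nsmall >= merge_threshold)
    {
        /* assumes big[] was allocated with room for all entries */
        memcpy(l->big + l->nbig, l->small, l->nsmall * sizeof(Item *));
        l->nbig += l->nsmall;
        l->nsmall = 0;
        qsort(l->big, l->nbig, sizeof(Item *), size_compare_asc);
    }
}

static Item *
two_array_pop_largest(TwoArrayList *l)
{
    /* sorting the small array stays cheap as long as it stays small */
    qsort(l->small, l->nsmall, sizeof(Item *), size_compare_asc);

    if (l->nbig == 0 && l->nsmall == 0)
        return NULL;
    if (l->nbig == 0 ||
        (l->nsmall > 0 &&
         size_compare_asc(&l->small[l->nsmall - 1],
                          &l->big[l->nbig - 1]) > 0))
        return l->small[--l->nsmall];
    return l->big[--l->nbig];
}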

Greetings,

Andres Freund

#3 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: Inefficiency in parallel pg_restore with many tables

On 2023-07-15 Sa 13:47, Tom Lane wrote:

I looked into the performance gripe at [1] about pg_restore not making
effective use of parallel workers when there are a lot of tables.
I was able to reproduce that by dumping and parallel restoring 100K
tables made according to this script:

do $$
begin
for i in 1..100000 loop
execute format('create table t%s (f1 int unique, f2 int unique);', i);
execute format('insert into t%s select x, x from generate_series(1,1000) x',
i);
if i % 100 = 0 then commit; end if;
end loop;
end
$$;

Once pg_restore reaches the parallelizable part of the restore, what
I see is that the parent pg_restore process goes to 100% CPU while its
children (and the server) mostly sit idle; that is, the task dispatch
logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
to keep the children busy. A quick perf check showed most of the time
being eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase
with 100K TableData items that are ready to go (they are in the
ready_list) and 200K AddConstraint items that are pending, because
we make those have dependencies on the corresponding TableData so we
don't build an index until after its table is populated. Each time
one of the TableData items is completed by some worker, the two
AddConstraint items for its table are moved from the pending_list
to the ready_list --- and that means ready_list_insert marks the
ready_list as no longer sorted. When we go to pop the next task
from the ready_list, we re-sort that entire list first. So
we spend something like O(N^2 * log(N)) time just sorting, if
there are N tables. Clearly, this code is much less bright
than it thinks it is (and that's all my fault, if memory serves).

I'm not sure how big a deal this is in practice: in most situations
the individual jobs are larger than they are in this toy example,
plus the initial non-parallelizable part of the restore is a bigger
bottleneck anyway with this many tables. Still, we do have one
real-world complaint, so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the
largest remaining job. Another idea could be to be a little less eager
to sort the list every time; I think in practice scheduling wouldn't
get much worse if we only re-sorted every so often.

Yeah, I think that last idea is reasonable. Something like if the number
added since the last sort is more than min(50, list_length/4) then sort.
That shouldn't be too invasive.
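
Purely as an illustration of that idea (hypothetical, and not taken from
any patch in this thread; it assumes ParallelReadyList grows an
n_unsorted counter that ready_list_insert bumps instead of just clearing
the sorted flag):

static void
ready_list_maybe_sort(ParallelReadyList *ready_list)
{
    int     n = ready_list->last_te - ready_list->first_te + 1;

    /* re-sort only once enough new entries have piled up */
    if (ready_list->n_unsorted > Min(50, n / 4))
    {
        if (n > 1)
            qsort(ready_list->tes + ready_list->first_te, n,
                  sizeof(TocEntry *), TocEntrySizeCompare);
        ready_list->n_unsorted = 0;
    }
}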

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#3)
Re: Inefficiency in parallel pg_restore with many tables

Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-07-15 Sa 13:47, Tom Lane wrote:

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the
largest remaining job. Another idea could be to be a little less eager
to sort the list every time; I think in practice scheduling wouldn't
get much worse if we only re-sorted every so often.

Yeah, I think that last idea is reasonable. Something like if the number
added since the last sort is more than min(50, list_length/4) then sort.
That shouldn't be too invasive.

Actually, as long as we're talking about approximately-correct behavior:
let's make the ready_list be a priority heap, and then just make
pop_next_work_item scan forward from the array start until it finds an
item that's runnable per the lock heuristic. If the heap root is
blocked, the next things we'll examine will be its two children.
We might pick the lower-priority of those two, but it's still known to
be higher priority than at least 50% of the remaining heap entries, so
it shouldn't be too awful as a choice. The argument gets weaker the
further you go into the heap, but we're not expecting that having most
of the top entries blocked will be a common case. (Besides which, the
priorities are pretty crude to begin with.) Once selected, pulling out
an entry that is not the heap root is no problem: you just start the
sift-down process from there.
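
In rough outline, the pop could then look something like this (a sketch
only, using the binaryheap helpers that the patches downthread end up
adding, with has_lock_conflict() standing in for the existing
lock-conflict checks against running workers):

static TocEntry *
pop_next_work_item(binaryheap *ready_list, ParallelState *pstate)
{
    /* scan outward from the heap root until something is runnable */
    for (int i = 0; i < binaryheap_size(ready_list); i++)
    {
        TocEntry   *te;

        te = (TocEntry *) DatumGetPointer(binaryheap_get_node(ready_list, i));
        if (has_lock_conflict(te, pstate))
            continue;           /* blocked; try the next-best entry */

        /* remove the (possibly non-root) entry, keeping the heap valid */
        binaryheap_remove_node(ready_list, i);
        return te;
    }
    return NULL;                /* nothing is currently runnable */
}

(One wrinkle: when a non-root entry is removed and the last array element
is moved into its slot, that element may need to sift up rather than down
to restore the heap property; the binaryheap_remove_node added in the
patches downthread handles both directions.)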

The main advantage of this over the only-sort-sometimes idea is that
we can guarantee that the largest ready item will always be dispatched
as soon as it can be (because it will be the heap root). So cases
involving one big table (with big indexes) and a lot of little ones
should get scheduled sanely, which is the main thing we want this
algorithm to ensure. With the other approach we can't really promise
much at all.

regards, tom lane

#5 Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#4)
Re: Inefficiency in parallel pg_restore with many tables

On Sun, Jul 16, 2023 at 09:45:54AM -0400, Tom Lane wrote:

Actually, as long as we're talking about approximately-correct behavior:
let's make the ready_list be a priority heap, and then just make
pop_next_work_item scan forward from the array start until it finds an
item that's runnable per the lock heuristic. If the heap root is
blocked, the next things we'll examine will be its two children.
We might pick the lower-priority of those two, but it's still known to
be higher priority than at least 50% of the remaining heap entries, so
it shouldn't be too awful as a choice. The argument gets weaker the
further you go into the heap, but we're not expecting that having most
of the top entries blocked will be a common case. (Besides which, the
priorities are pretty crude to begin with.) Once selected, pulling out
an entry that is not the heap root is no problem: you just start the
sift-down process from there.

The main advantage of this over the only-sort-sometimes idea is that
we can guarantee that the largest ready item will always be dispatched
as soon as it can be (because it will be the heap root). So cases
involving one big table (with big indexes) and a lot of little ones
should get scheduled sanely, which is the main thing we want this
algorithm to ensure. With the other approach we can't really promise
much at all.

This seems worth a try. IIUC you are suggesting making binaryheap.c
frontend-friendly and expanding its API a bit. If no one has volunteered,
I could probably hack something together.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#5)
1 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Sun, Jul 16, 2023 at 08:54:24PM -0700, Nathan Bossart wrote:

This seems worth a try. IIUC you are suggesting making binaryheap.c
frontend-friendly and expanding its API a bit. If no one has volunteered,
I could probably hack something together.

I spent some time on the binaryheap changes. I haven't had a chance to
plug it into the ready_list yet.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

expand_binaryheap_api.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/lib/binaryheap.c b/src/backend/lib/binaryheap.c
index 1737546757..36b6f5c51f 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/backend/lib/binaryheap.c
@@ -11,10 +11,17 @@
  *-------------------------------------------------------------------------
  */
 
+#ifndef FRONTEND
 #include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	binaryheap *heap;
 
 	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+#ifdef FRONTEND
+	heap = (binaryheap *) pg_malloc(sz);
+#else
 	heap = (binaryheap *) palloc(sz);
+#endif
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
@@ -109,7 +120,11 @@ void
 binaryheap_add_unordered(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -141,7 +156,11 @@ void
 binaryheap_add(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -196,6 +215,65 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove
+ *
+ * Removes the given datum from the heap.  The caller must ensure that the
+ * datum exists in the heap.  Always O(n).
+ */
+void
+binaryheap_remove(binaryheap *heap, Datum d)
+{
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	for (int i = 0; i < heap->bh_size; i++)
+	{
+		if (heap->bh_compare(heap->bh_nodes[i],
+							 d,
+							 heap->bh_arg) == 0)
+		{
+			binaryheap_remove_node(heap, &heap->bh_nodes[i]);
+			return;
+		}
+	}
+
+#ifdef FRONTEND
+	pg_fatal("datum not found in heap");
+#else
+	elog(ERROR, "datum not found in heap");
+#endif
+}
+
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the given node from the heap.  The caller must ensure that the node
+ * exists in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, Datum *n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= &heap->bh_nodes[0]);
+	Assert(n <= &heap->bh_nodes[heap->bh_size - 1]);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   *n,
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	*n = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n - &heap->bh_nodes[0]);
+	else if (cmp < 0)
+		sift_down(heap, n - &heap->bh_nodes[0]);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..411f7358ba 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -47,6 +47,8 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, Datum d);
 extern Datum binaryheap_first(binaryheap *heap);
 extern Datum binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove(binaryheap *heap, Datum d);
+extern void binaryheap_remove_node(binaryheap *heap, Datum *n);
 extern void binaryheap_replace_first(binaryheap *heap, Datum d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
#7 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Nathan Bossart (#6)
Re: Inefficiency in parallel pg_restore with many tables

On 2023-Jul-17, Nathan Bossart wrote:

@@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
binaryheap *heap;

sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+#ifdef FRONTEND
+	heap = (binaryheap *) pg_malloc(sz);
+#else
heap = (binaryheap *) palloc(sz);
+#endif

Hmm, as I recall fe_memutils.c provides you with palloc() in the
frontend environment, so you don't actually need this one.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there. Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html

#8 Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#7)
Re: Inefficiency in parallel pg_restore with many tables

On Tue, Jul 18, 2023 at 06:05:11PM +0200, Alvaro Herrera wrote:

On 2023-Jul-17, Nathan Bossart wrote:

@@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
binaryheap *heap;

sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+#ifdef FRONTEND
+	heap = (binaryheap *) pg_malloc(sz);
+#else
heap = (binaryheap *) palloc(sz);
+#endif

Hmm, as I recall fe_memutils.c provides you with palloc() in the
frontend environment, so you don't actually need this one.

Ah, yes it does. Thanks for the pointer.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#9 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#8)
4 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

Here is a work-in-progress patch set for converting ready_list to a
priority queue. On my machine, Tom's 100k-table example [0] takes 11.5
minutes without these patches and 1.5 minutes with them.

One item that requires more thought is binaryheap's use of Datum. AFAICT
the Datum definitions live in postgres.h and aren't available to frontend
code. I think we'll either need to move the Datum definitions to c.h or to
adjust binaryheap to use "void *".

[0]: /messages/by-id/3612876.1689443232@sss.pgh.pa.us

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-misc-binaryheap-fixes.patch (text/x-diff; charset=us-ascii)
From bbf7b32f0abd0e9e2f7a0759cc79fcccbb5f9281 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Jul 2023 15:54:00 -0700
Subject: [PATCH v2 1/4] misc binaryheap fixes

---
 src/backend/postmaster/pgarch.c                 | 12 ++++++------
 src/backend/replication/logical/reorderbuffer.c |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 46af349564..c28e6f2070 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -108,7 +108,7 @@ static ArchiveModuleState *archive_module_state;
  * archive_status.  Minimizing the number of directory scans when there are
  * many files to archive can significantly improve archival rate.
  *
- * arch_heap is a max-heap that is used during the directory scan to track
+ * arch_heap is a min-heap that is used during the directory scan to track
  * the highest-priority files to archive.  After the directory scan
  * completes, the file names are stored in ascending order of priority in
  * arch_files.  pgarch_readyXlog() returns files from arch_files until it
@@ -248,7 +248,7 @@ PgArchiverMain(void)
 	arch_files = palloc(sizeof(struct arch_files_state));
 	arch_files->arch_files_size = 0;
 
-	/* Initialize our max-heap for prioritizing files to archive. */
+	/* Initialize our min-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
 												ready_file_comparator, NULL);
 
@@ -624,7 +624,7 @@ pgarch_readyXlog(char *xlog)
 		basename[basenamelen] = '\0';
 
 		/*
-		 * Store the file in our max-heap if it has a high enough priority.
+		 * Store the file in our min-heap if it has a high enough priority.
 		 */
 		if (arch_files->arch_heap->bh_size < NUM_FILES_PER_DIRECTORY_SCAN)
 		{
@@ -644,15 +644,15 @@ pgarch_readyXlog(char *xlog)
 			 * Remove the lowest priority file and add the current one to the
 			 * heap.
 			 */
-			arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
+			arch_file = DatumGetCString(binaryheap_first(arch_files->arch_heap));
 			strcpy(arch_file, basename);
-			binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file));
+			binaryheap_replace_first(arch_files->arch_heap, CStringGetDatum(arch_file));
 		}
 	}
 	FreeDir(rldir);
 
 	/* If no files were found, simply return. */
-	if (arch_files->arch_heap->bh_size == 0)
+	if (binaryheap_empty(arch_files->arch_heap))
 		return false;
 
 	/*
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 26d252bd87..05bbcecd29 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1381,7 +1381,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	int32		off;
 
 	/* nothing there anymore */
-	if (state->heap->bh_size == 0)
+	if (binaryheap_empty(state->heap))
 		return NULL;
 
 	off = DatumGetInt32(binaryheap_first(state->heap));
-- 
2.25.1

v2-0002-make-binaryheap-available-to-frontend.patch (text/x-diff; charset=us-ascii)
From 8f0339263dd21bb04c1b96a0e850a9b57a2e3f8b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:45:03 -0700
Subject: [PATCH v2 2/4] make binaryheap available to frontend

---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 15 ++++++++++++
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 29 ++++++++++++++++++++++++
 6 files changed, 46 insertions(+), 2 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (96%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 96%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..7a590eec4a 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -11,10 +11,17 @@
  *-------------------------------------------------------------------------
  */
 
+#ifndef FRONTEND
 #include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -109,7 +116,11 @@ void
 binaryheap_add_unordered(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -141,7 +152,11 @@ void
 binaryheap_add(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..06e3878cf7 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,6 +11,35 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+/*
+ * XXX: This is obviously a hack.
+ */
+#ifdef FRONTEND
+
+typedef uintptr_t Datum;
+
+/*
+ * DatumGetPointer
+ *		Returns pointer value of a datum.
+ */
+static inline Pointer
+DatumGetPointer(Datum X)
+{
+	return (Pointer) X;
+}
+
+/*
+ * PointerGetDatum
+ *		Returns datum representation for a pointer.
+ */
+static inline Datum
+PointerGetDatum(const void *X)
+{
+	return (Datum) X;
+}
+
+#endif
+
 /*
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
-- 
2.25.1

v2-0003-expand-binaryheap-api.patch (text/x-diff; charset=us-ascii)
From 533f82b9ee4bfb5dee6cfc8e5c7e8091087e3e0b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v2 3/4] expand binaryheap api

---
 src/common/binaryheap.c      | 28 ++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 31 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7a590eec4a..526e56bbd0 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -211,6 +211,34 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the given node from the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 06e3878cf7..9f09aff29f 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -76,8 +76,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, Datum d);
 extern Datum binaryheap_first(binaryheap *heap);
 extern Datum binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, Datum d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v2-0004-use-priority-queue-for-pg_restore-ready_list.patch (text/x-diff; charset=us-ascii)
From 997999700689d0a8dc4d01efc59b168974362992 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v2 4/4] use priority queue for pg_restore ready_list

---
 src/bin/pg_dump/pg_backup_archiver.c | 155 ++++++---------------------
 1 file changed, 35 insertions(+), 120 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..7c0e50353d 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -110,16 +93,12 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
 static int	TocEntrySizeCompare(const void *p1, const void *p2);
+static int  TocEntrySizeCompareBinaryHeap(Datum p1, Datum p2, void *arg);
 static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+							   binaryheap *ready_list,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_list,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -134,7 +113,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_list);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -4023,13 +4002,15 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_list;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
 	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	ready_list = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryHeap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
@@ -4040,7 +4021,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_list(pending_list, ready_list, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4054,7 +4035,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_list, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4064,7 +4045,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_list);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4075,7 +4056,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_list);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4089,7 +4070,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_list(pending_list, ready_list, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4119,9 +4100,9 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	}
 
 	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	Assert(binaryheap_empty(ready_list));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_list);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4221,77 +4202,6 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
 TocEntrySizeCompare(const void *p1, const void *p2)
@@ -4314,6 +4224,15 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+static int
+TocEntrySizeCompareBinaryHeap(Datum p1, Datum p2, void *arg)
+{
+	const void *te1 = (const void *) DatumGetPointer(p1);
+	const void *te2 = (const void *) DatumGetPointer(p2);
+
+	return TocEntrySizeCompare(te1, te2);
+}
+
 
 /*
  * Move all immediately-ready items from pending_list to ready_list.
@@ -4324,7 +4243,7 @@ TocEntrySizeCompare(const void *p1, const void *p2)
  */
 static void
 move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+				   binaryheap *ready_list,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4341,7 +4260,7 @@ move_to_ready_list(TocEntry *pending_list,
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
 			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			binaryheap_add(ready_list, PointerGetDatum(te));
 		}
 	}
 }
@@ -4358,20 +4277,16 @@ move_to_ready_list(TocEntry *pending_list,
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_list,
 				   ParallelState *pstate)
 {
-	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
 	/*
 	 * Search the ready_list until we find a suitable item.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_list); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		Datum		ted = binaryheap_get_node(ready_list, i);
+		TocEntry   *te = (TocEntry *) DatumGetPointer(ted);
 		bool		conflicts = false;
 
 		/*
@@ -4397,7 +4312,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_list, i);
 		return te;
 	}
 
@@ -4443,7 +4358,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_list = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4708,7 +4623,7 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_list)
 {
 	int			i;
 
@@ -4737,7 +4652,7 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
 			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			binaryheap_add(ready_list, PointerGetDatum(otherte));
 		}
 	}
 }
-- 
2.25.1

#10 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#9)
4 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote:

Here is a work-in-progress patch set for converting ready_list to a
priority queue. On my machine, Tom's 100k-table example [0] takes 11.5
minutes without these patches and 1.5 minutes with them.

One item that requires more thought is binaryheap's use of Datum. AFAICT
the Datum definitions live in postgres.h and aren't available to frontend
code. I think we'll either need to move the Datum definitions to c.h or to
adjust binaryheap to use "void *".

In v3, I moved the Datum definitions to c.h. I first tried modifying
binaryheap to use "int" or "void *" instead, but that ended up requiring
some rather invasive changes in backend code, not to mention any extensions
that happen to be using it. I also looked into moving the definitions to a
separate datumdefs.h header that postgres.h would include, but that felt
awkward because 1) postgres.h clearly states that it is intended for things
"that never escape the backend" and 2) the definitions seem relatively
inexpensive. However, I think the latter option is still viable, so I'm
fine with switching to it if folks think that is a better approach.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-move-datum-definitions-to-c.h.patch (text/x-diff; charset=us-ascii)
From 56066f836f389ca0efe169c1ea8d769f70aaa817 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 14:53:49 -0700
Subject: [PATCH v3 1/4] move datum definitions to c.h

---
 src/include/c.h        | 512 ++++++++++++++++++++++++++++++++++++++++
 src/include/postgres.h | 519 +----------------------------------------
 2 files changed, 515 insertions(+), 516 deletions(-)

diff --git a/src/include/c.h b/src/include/c.h
index f69d739be5..daee2f50eb 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -35,6 +35,7 @@
  *		7)		widely useful macros
  *		8)		random stuff
  *		9)		system-specific hacks
+ *		10)		Datum type + support functions
  *
  * NOTE: since this file is included by both frontend and backend modules,
  * it's usually wrong to put an "extern" declaration here, unless it's
@@ -1374,4 +1375,515 @@ typedef intptr_t sigjmp_buf[5];
 /* /port compatibility functions */
 #include "port.h"
 
+/* ----------------------------------------------------------------
+ *				Section 10:	 Datum type + support functions
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * A Datum contains either a value of a pass-by-value type or a pointer to a
+ * value of a pass-by-reference type.  Therefore, we require:
+ *
+ * sizeof(Datum) == sizeof(void *) == 4 or 8
+ *
+ * The functions below and the analogous functions for other types should be used to
+ * convert between a Datum and the appropriate C type.
+ */
+
+typedef uintptr_t Datum;
+
+/*
+ * A NullableDatum is used in places where both a Datum and its nullness needs
+ * to be stored. This can be more efficient than storing datums and nullness
+ * in separate arrays, due to better spatial locality, even if more space may
+ * be wasted due to padding.
+ */
+typedef struct NullableDatum
+{
+#define FIELDNO_NULLABLE_DATUM_DATUM 0
+	Datum		value;
+#define FIELDNO_NULLABLE_DATUM_ISNULL 1
+	bool		isnull;
+	/* due to alignment padding this could be used for flags for free */
+} NullableDatum;
+
+#define SIZEOF_DATUM SIZEOF_VOID_P
+
+/*
+ * DatumGetBool
+ *		Returns boolean value of a datum.
+ *
+ * Note: any nonzero value will be considered true.
+ */
+static inline bool
+DatumGetBool(Datum X)
+{
+	return (X != 0);
+}
+
+/*
+ * BoolGetDatum
+ *		Returns datum representation for a boolean.
+ *
+ * Note: any nonzero value will be considered true.
+ */
+static inline Datum
+BoolGetDatum(bool X)
+{
+	return (Datum) (X ? 1 : 0);
+}
+
+/*
+ * DatumGetChar
+ *		Returns character value of a datum.
+ */
+static inline char
+DatumGetChar(Datum X)
+{
+	return (char) X;
+}
+
+/*
+ * CharGetDatum
+ *		Returns datum representation for a character.
+ */
+static inline Datum
+CharGetDatum(char X)
+{
+	return (Datum) X;
+}
+
+/*
+ * Int8GetDatum
+ *		Returns datum representation for an 8-bit integer.
+ */
+static inline Datum
+Int8GetDatum(int8 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetUInt8
+ *		Returns 8-bit unsigned integer value of a datum.
+ */
+static inline uint8
+DatumGetUInt8(Datum X)
+{
+	return (uint8) X;
+}
+
+/*
+ * UInt8GetDatum
+ *		Returns datum representation for an 8-bit unsigned integer.
+ */
+static inline Datum
+UInt8GetDatum(uint8 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetInt16
+ *		Returns 16-bit integer value of a datum.
+ */
+static inline int16
+DatumGetInt16(Datum X)
+{
+	return (int16) X;
+}
+
+/*
+ * Int16GetDatum
+ *		Returns datum representation for a 16-bit integer.
+ */
+static inline Datum
+Int16GetDatum(int16 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetUInt16
+ *		Returns 16-bit unsigned integer value of a datum.
+ */
+static inline uint16
+DatumGetUInt16(Datum X)
+{
+	return (uint16) X;
+}
+
+/*
+ * UInt16GetDatum
+ *		Returns datum representation for a 16-bit unsigned integer.
+ */
+static inline Datum
+UInt16GetDatum(uint16 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetInt32
+ *		Returns 32-bit integer value of a datum.
+ */
+static inline int32
+DatumGetInt32(Datum X)
+{
+	return (int32) X;
+}
+
+/*
+ * Int32GetDatum
+ *		Returns datum representation for a 32-bit integer.
+ */
+static inline Datum
+Int32GetDatum(int32 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetUInt32
+ *		Returns 32-bit unsigned integer value of a datum.
+ */
+static inline uint32
+DatumGetUInt32(Datum X)
+{
+	return (uint32) X;
+}
+
+/*
+ * UInt32GetDatum
+ *		Returns datum representation for a 32-bit unsigned integer.
+ */
+static inline Datum
+UInt32GetDatum(uint32 X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetObjectId
+ *		Returns object identifier value of a datum.
+ */
+static inline Oid
+DatumGetObjectId(Datum X)
+{
+	return (Oid) X;
+}
+
+/*
+ * ObjectIdGetDatum
+ *		Returns datum representation for an object identifier.
+ */
+static inline Datum
+ObjectIdGetDatum(Oid X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetTransactionId
+ *		Returns transaction identifier value of a datum.
+ */
+static inline TransactionId
+DatumGetTransactionId(Datum X)
+{
+	return (TransactionId) X;
+}
+
+/*
+ * TransactionIdGetDatum
+ *		Returns datum representation for a transaction identifier.
+ */
+static inline Datum
+TransactionIdGetDatum(TransactionId X)
+{
+	return (Datum) X;
+}
+
+/*
+ * MultiXactIdGetDatum
+ *		Returns datum representation for a multixact identifier.
+ */
+static inline Datum
+MultiXactIdGetDatum(MultiXactId X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetCommandId
+ *		Returns command identifier value of a datum.
+ */
+static inline CommandId
+DatumGetCommandId(Datum X)
+{
+	return (CommandId) X;
+}
+
+/*
+ * CommandIdGetDatum
+ *		Returns datum representation for a command identifier.
+ */
+static inline Datum
+CommandIdGetDatum(CommandId X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetPointer
+ *		Returns pointer value of a datum.
+ */
+static inline Pointer
+DatumGetPointer(Datum X)
+{
+	return (Pointer) X;
+}
+
+/*
+ * PointerGetDatum
+ *		Returns datum representation for a pointer.
+ */
+static inline Datum
+PointerGetDatum(const void *X)
+{
+	return (Datum) X;
+}
+
+/*
+ * DatumGetCString
+ *		Returns C string (null-terminated string) value of a datum.
+ *
+ * Note: C string is not a full-fledged Postgres type at present,
+ * but type input functions use this conversion for their inputs.
+ */
+static inline char *
+DatumGetCString(Datum X)
+{
+	return (char *) DatumGetPointer(X);
+}
+
+/*
+ * CStringGetDatum
+ *		Returns datum representation for a C string (null-terminated string).
+ *
+ * Note: C string is not a full-fledged Postgres type at present,
+ * but type output functions use this conversion for their outputs.
+ * Note: CString is pass-by-reference; caller must ensure the pointed-to
+ * value has adequate lifetime.
+ */
+static inline Datum
+CStringGetDatum(const char *X)
+{
+	return PointerGetDatum(X);
+}
+
+/*
+ * DatumGetName
+ *		Returns name value of a datum.
+ */
+static inline Name
+DatumGetName(Datum X)
+{
+	return (Name) DatumGetPointer(X);
+}
+
+/*
+ * NameGetDatum
+ *		Returns datum representation for a name.
+ *
+ * Note: Name is pass-by-reference; caller must ensure the pointed-to
+ * value has adequate lifetime.
+ */
+static inline Datum
+NameGetDatum(const NameData *X)
+{
+	return CStringGetDatum(NameStr(*X));
+}
+
+/*
+ * DatumGetInt64
+ *		Returns 64-bit integer value of a datum.
+ *
+ * Note: this function hides whether int64 is pass by value or by reference.
+ */
+static inline int64
+DatumGetInt64(Datum X)
+{
+#ifdef USE_FLOAT8_BYVAL
+	return (int64) X;
+#else
+	return *((int64 *) DatumGetPointer(X));
+#endif
+}
+
+/*
+ * Int64GetDatum
+ *		Returns datum representation for a 64-bit integer.
+ *
+ * Note: if int64 is pass by reference, this function returns a reference
+ * to palloc'd space.
+ */
+#ifdef USE_FLOAT8_BYVAL
+static inline Datum
+Int64GetDatum(int64 X)
+{
+	return (Datum) X;
+}
+#else
+extern Datum Int64GetDatum(int64 X);
+#endif
+
+
+/*
+ * DatumGetUInt64
+ *		Returns 64-bit unsigned integer value of a datum.
+ *
+ * Note: this function hides whether int64 is pass by value or by reference.
+ */
+static inline uint64
+DatumGetUInt64(Datum X)
+{
+#ifdef USE_FLOAT8_BYVAL
+	return (uint64) X;
+#else
+	return *((uint64 *) DatumGetPointer(X));
+#endif
+}
+
+/*
+ * UInt64GetDatum
+ *		Returns datum representation for a 64-bit unsigned integer.
+ *
+ * Note: if int64 is pass by reference, this function returns a reference
+ * to palloc'd space.
+ */
+static inline Datum
+UInt64GetDatum(uint64 X)
+{
+#ifdef USE_FLOAT8_BYVAL
+	return (Datum) X;
+#else
+	return Int64GetDatum((int64) X);
+#endif
+}
+
+/*
+ * Float <-> Datum conversions
+ *
+ * These have to be implemented as inline functions rather than macros, when
+ * passing by value, because many machines pass int and float function
+ * parameters/results differently; so we need to play weird games with unions.
+ */
+
+/*
+ * DatumGetFloat4
+ *		Returns 4-byte floating point value of a datum.
+ */
+static inline float4
+DatumGetFloat4(Datum X)
+{
+	union
+	{
+		int32		value;
+		float4		retval;
+	}			myunion;
+
+	myunion.value = DatumGetInt32(X);
+	return myunion.retval;
+}
+
+/*
+ * Float4GetDatum
+ *		Returns datum representation for a 4-byte floating point number.
+ */
+static inline Datum
+Float4GetDatum(float4 X)
+{
+	union
+	{
+		float4		value;
+		int32		retval;
+	}			myunion;
+
+	myunion.value = X;
+	return Int32GetDatum(myunion.retval);
+}
+
+/*
+ * DatumGetFloat8
+ *		Returns 8-byte floating point value of a datum.
+ *
+ * Note: this function hides whether float8 is pass by value or by reference.
+ */
+static inline float8
+DatumGetFloat8(Datum X)
+{
+#ifdef USE_FLOAT8_BYVAL
+	union
+	{
+		int64		value;
+		float8		retval;
+	}			myunion;
+
+	myunion.value = DatumGetInt64(X);
+	return myunion.retval;
+#else
+	return *((float8 *) DatumGetPointer(X));
+#endif
+}
+
+/*
+ * Float8GetDatum
+ *		Returns datum representation for an 8-byte floating point number.
+ *
+ * Note: if float8 is pass by reference, this function returns a reference
+ * to palloc'd space.
+ */
+#ifdef USE_FLOAT8_BYVAL
+static inline Datum
+Float8GetDatum(float8 X)
+{
+	union
+	{
+		float8		value;
+		int64		retval;
+	}			myunion;
+
+	myunion.value = X;
+	return Int64GetDatum(myunion.retval);
+}
+#else
+extern Datum Float8GetDatum(float8 X);
+#endif
+
+
+/*
+ * Int64GetDatumFast
+ * Float8GetDatumFast
+ *
+ * These macros are intended to allow writing code that does not depend on
+ * whether int64 and float8 are pass-by-reference types, while not
+ * sacrificing performance when they are.  The argument must be a variable
+ * that will exist and have the same value for as long as the Datum is needed.
+ * In the pass-by-ref case, the address of the variable is taken to use as
+ * the Datum.  In the pass-by-val case, these are the same as the non-Fast
+ * functions, except for asserting that the variable is of the correct type.
+ */
+
+#ifdef USE_FLOAT8_BYVAL
+#define Int64GetDatumFast(X) \
+	(AssertVariableIsOfTypeMacro(X, int64), Int64GetDatum(X))
+#define Float8GetDatumFast(X) \
+	(AssertVariableIsOfTypeMacro(X, double), Float8GetDatum(X))
+#else
+#define Int64GetDatumFast(X) \
+	(AssertVariableIsOfTypeMacro(X, int64), PointerGetDatum(&(X)))
+#define Float8GetDatumFast(X) \
+	(AssertVariableIsOfTypeMacro(X, double), PointerGetDatum(&(X)))
+#endif
+
 #endif							/* C_H */
diff --git a/src/include/postgres.h b/src/include/postgres.h
index 8a028ff789..4fcb207d0b 100644
--- a/src/include/postgres.h
+++ b/src/include/postgres.h
@@ -23,8 +23,7 @@
  *
  *	  section	description
  *	  -------	------------------------------------------------
- *		1)		Datum type + support functions
- *		2)		miscellaneous
+ *		1)		miscellaneous
  *
  *	 NOTES
  *
@@ -35,7 +34,7 @@
  *	postgres_fe.h.  We do that since those type definitions are needed by
  *	frontend modules that want to deal with binary data transmission to or
  *	from the backend.  Type definitions in this file should be for
- *	representations that never escape the backend, such as Datum.
+ *	representations that never escape the backend.
  *
  *----------------------------------------------------------------
  */
@@ -47,519 +46,7 @@
 #include "utils/palloc.h"
 
 /* ----------------------------------------------------------------
- *				Section 1:	Datum type + support functions
- * ----------------------------------------------------------------
- */
-
-/*
- * A Datum contains either a value of a pass-by-value type or a pointer to a
- * value of a pass-by-reference type.  Therefore, we require:
- *
- * sizeof(Datum) == sizeof(void *) == 4 or 8
- *
- * The functions below and the analogous functions for other types should be used to
- * convert between a Datum and the appropriate C type.
- */
-
-typedef uintptr_t Datum;
-
-/*
- * A NullableDatum is used in places where both a Datum and its nullness needs
- * to be stored. This can be more efficient than storing datums and nullness
- * in separate arrays, due to better spatial locality, even if more space may
- * be wasted due to padding.
- */
-typedef struct NullableDatum
-{
-#define FIELDNO_NULLABLE_DATUM_DATUM 0
-	Datum		value;
-#define FIELDNO_NULLABLE_DATUM_ISNULL 1
-	bool		isnull;
-	/* due to alignment padding this could be used for flags for free */
-} NullableDatum;
-
-#define SIZEOF_DATUM SIZEOF_VOID_P
-
-/*
- * DatumGetBool
- *		Returns boolean value of a datum.
- *
- * Note: any nonzero value will be considered true.
- */
-static inline bool
-DatumGetBool(Datum X)
-{
-	return (X != 0);
-}
-
-/*
- * BoolGetDatum
- *		Returns datum representation for a boolean.
- *
- * Note: any nonzero value will be considered true.
- */
-static inline Datum
-BoolGetDatum(bool X)
-{
-	return (Datum) (X ? 1 : 0);
-}
-
-/*
- * DatumGetChar
- *		Returns character value of a datum.
- */
-static inline char
-DatumGetChar(Datum X)
-{
-	return (char) X;
-}
-
-/*
- * CharGetDatum
- *		Returns datum representation for a character.
- */
-static inline Datum
-CharGetDatum(char X)
-{
-	return (Datum) X;
-}
-
-/*
- * Int8GetDatum
- *		Returns datum representation for an 8-bit integer.
- */
-static inline Datum
-Int8GetDatum(int8 X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetUInt8
- *		Returns 8-bit unsigned integer value of a datum.
- */
-static inline uint8
-DatumGetUInt8(Datum X)
-{
-	return (uint8) X;
-}
-
-/*
- * UInt8GetDatum
- *		Returns datum representation for an 8-bit unsigned integer.
- */
-static inline Datum
-UInt8GetDatum(uint8 X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetInt16
- *		Returns 16-bit integer value of a datum.
- */
-static inline int16
-DatumGetInt16(Datum X)
-{
-	return (int16) X;
-}
-
-/*
- * Int16GetDatum
- *		Returns datum representation for a 16-bit integer.
- */
-static inline Datum
-Int16GetDatum(int16 X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetUInt16
- *		Returns 16-bit unsigned integer value of a datum.
- */
-static inline uint16
-DatumGetUInt16(Datum X)
-{
-	return (uint16) X;
-}
-
-/*
- * UInt16GetDatum
- *		Returns datum representation for a 16-bit unsigned integer.
- */
-static inline Datum
-UInt16GetDatum(uint16 X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetInt32
- *		Returns 32-bit integer value of a datum.
- */
-static inline int32
-DatumGetInt32(Datum X)
-{
-	return (int32) X;
-}
-
-/*
- * Int32GetDatum
- *		Returns datum representation for a 32-bit integer.
- */
-static inline Datum
-Int32GetDatum(int32 X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetUInt32
- *		Returns 32-bit unsigned integer value of a datum.
- */
-static inline uint32
-DatumGetUInt32(Datum X)
-{
-	return (uint32) X;
-}
-
-/*
- * UInt32GetDatum
- *		Returns datum representation for a 32-bit unsigned integer.
- */
-static inline Datum
-UInt32GetDatum(uint32 X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetObjectId
- *		Returns object identifier value of a datum.
- */
-static inline Oid
-DatumGetObjectId(Datum X)
-{
-	return (Oid) X;
-}
-
-/*
- * ObjectIdGetDatum
- *		Returns datum representation for an object identifier.
- */
-static inline Datum
-ObjectIdGetDatum(Oid X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetTransactionId
- *		Returns transaction identifier value of a datum.
- */
-static inline TransactionId
-DatumGetTransactionId(Datum X)
-{
-	return (TransactionId) X;
-}
-
-/*
- * TransactionIdGetDatum
- *		Returns datum representation for a transaction identifier.
- */
-static inline Datum
-TransactionIdGetDatum(TransactionId X)
-{
-	return (Datum) X;
-}
-
-/*
- * MultiXactIdGetDatum
- *		Returns datum representation for a multixact identifier.
- */
-static inline Datum
-MultiXactIdGetDatum(MultiXactId X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetCommandId
- *		Returns command identifier value of a datum.
- */
-static inline CommandId
-DatumGetCommandId(Datum X)
-{
-	return (CommandId) X;
-}
-
-/*
- * CommandIdGetDatum
- *		Returns datum representation for a command identifier.
- */
-static inline Datum
-CommandIdGetDatum(CommandId X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetPointer
- *		Returns pointer value of a datum.
- */
-static inline Pointer
-DatumGetPointer(Datum X)
-{
-	return (Pointer) X;
-}
-
-/*
- * PointerGetDatum
- *		Returns datum representation for a pointer.
- */
-static inline Datum
-PointerGetDatum(const void *X)
-{
-	return (Datum) X;
-}
-
-/*
- * DatumGetCString
- *		Returns C string (null-terminated string) value of a datum.
- *
- * Note: C string is not a full-fledged Postgres type at present,
- * but type input functions use this conversion for their inputs.
- */
-static inline char *
-DatumGetCString(Datum X)
-{
-	return (char *) DatumGetPointer(X);
-}
-
-/*
- * CStringGetDatum
- *		Returns datum representation for a C string (null-terminated string).
- *
- * Note: C string is not a full-fledged Postgres type at present,
- * but type output functions use this conversion for their outputs.
- * Note: CString is pass-by-reference; caller must ensure the pointed-to
- * value has adequate lifetime.
- */
-static inline Datum
-CStringGetDatum(const char *X)
-{
-	return PointerGetDatum(X);
-}
-
-/*
- * DatumGetName
- *		Returns name value of a datum.
- */
-static inline Name
-DatumGetName(Datum X)
-{
-	return (Name) DatumGetPointer(X);
-}
-
-/*
- * NameGetDatum
- *		Returns datum representation for a name.
- *
- * Note: Name is pass-by-reference; caller must ensure the pointed-to
- * value has adequate lifetime.
- */
-static inline Datum
-NameGetDatum(const NameData *X)
-{
-	return CStringGetDatum(NameStr(*X));
-}
-
-/*
- * DatumGetInt64
- *		Returns 64-bit integer value of a datum.
- *
- * Note: this function hides whether int64 is pass by value or by reference.
- */
-static inline int64
-DatumGetInt64(Datum X)
-{
-#ifdef USE_FLOAT8_BYVAL
-	return (int64) X;
-#else
-	return *((int64 *) DatumGetPointer(X));
-#endif
-}
-
-/*
- * Int64GetDatum
- *		Returns datum representation for a 64-bit integer.
- *
- * Note: if int64 is pass by reference, this function returns a reference
- * to palloc'd space.
- */
-#ifdef USE_FLOAT8_BYVAL
-static inline Datum
-Int64GetDatum(int64 X)
-{
-	return (Datum) X;
-}
-#else
-extern Datum Int64GetDatum(int64 X);
-#endif
-
-
-/*
- * DatumGetUInt64
- *		Returns 64-bit unsigned integer value of a datum.
- *
- * Note: this function hides whether int64 is pass by value or by reference.
- */
-static inline uint64
-DatumGetUInt64(Datum X)
-{
-#ifdef USE_FLOAT8_BYVAL
-	return (uint64) X;
-#else
-	return *((uint64 *) DatumGetPointer(X));
-#endif
-}
-
-/*
- * UInt64GetDatum
- *		Returns datum representation for a 64-bit unsigned integer.
- *
- * Note: if int64 is pass by reference, this function returns a reference
- * to palloc'd space.
- */
-static inline Datum
-UInt64GetDatum(uint64 X)
-{
-#ifdef USE_FLOAT8_BYVAL
-	return (Datum) X;
-#else
-	return Int64GetDatum((int64) X);
-#endif
-}
-
-/*
- * Float <-> Datum conversions
- *
- * These have to be implemented as inline functions rather than macros, when
- * passing by value, because many machines pass int and float function
- * parameters/results differently; so we need to play weird games with unions.
- */
-
-/*
- * DatumGetFloat4
- *		Returns 4-byte floating point value of a datum.
- */
-static inline float4
-DatumGetFloat4(Datum X)
-{
-	union
-	{
-		int32		value;
-		float4		retval;
-	}			myunion;
-
-	myunion.value = DatumGetInt32(X);
-	return myunion.retval;
-}
-
-/*
- * Float4GetDatum
- *		Returns datum representation for a 4-byte floating point number.
- */
-static inline Datum
-Float4GetDatum(float4 X)
-{
-	union
-	{
-		float4		value;
-		int32		retval;
-	}			myunion;
-
-	myunion.value = X;
-	return Int32GetDatum(myunion.retval);
-}
-
-/*
- * DatumGetFloat8
- *		Returns 8-byte floating point value of a datum.
- *
- * Note: this function hides whether float8 is pass by value or by reference.
- */
-static inline float8
-DatumGetFloat8(Datum X)
-{
-#ifdef USE_FLOAT8_BYVAL
-	union
-	{
-		int64		value;
-		float8		retval;
-	}			myunion;
-
-	myunion.value = DatumGetInt64(X);
-	return myunion.retval;
-#else
-	return *((float8 *) DatumGetPointer(X));
-#endif
-}
-
-/*
- * Float8GetDatum
- *		Returns datum representation for an 8-byte floating point number.
- *
- * Note: if float8 is pass by reference, this function returns a reference
- * to palloc'd space.
- */
-#ifdef USE_FLOAT8_BYVAL
-static inline Datum
-Float8GetDatum(float8 X)
-{
-	union
-	{
-		float8		value;
-		int64		retval;
-	}			myunion;
-
-	myunion.value = X;
-	return Int64GetDatum(myunion.retval);
-}
-#else
-extern Datum Float8GetDatum(float8 X);
-#endif
-
-
-/*
- * Int64GetDatumFast
- * Float8GetDatumFast
- *
- * These macros are intended to allow writing code that does not depend on
- * whether int64 and float8 are pass-by-reference types, while not
- * sacrificing performance when they are.  The argument must be a variable
- * that will exist and have the same value for as long as the Datum is needed.
- * In the pass-by-ref case, the address of the variable is taken to use as
- * the Datum.  In the pass-by-val case, these are the same as the non-Fast
- * functions, except for asserting that the variable is of the correct type.
- */
-
-#ifdef USE_FLOAT8_BYVAL
-#define Int64GetDatumFast(X) \
-	(AssertVariableIsOfTypeMacro(X, int64), Int64GetDatum(X))
-#define Float8GetDatumFast(X) \
-	(AssertVariableIsOfTypeMacro(X, double), Float8GetDatum(X))
-#else
-#define Int64GetDatumFast(X) \
-	(AssertVariableIsOfTypeMacro(X, int64), PointerGetDatum(&(X)))
-#define Float8GetDatumFast(X) \
-	(AssertVariableIsOfTypeMacro(X, double), PointerGetDatum(&(X)))
-#endif
-
-
-/* ----------------------------------------------------------------
- *				Section 2:	miscellaneous
+ *				Section 1:	miscellaneous
  * ----------------------------------------------------------------
  */
 
-- 
2.25.1

v3-0002-make-binaryheap-available-to-frontend.patch (text/x-diff; charset=us-ascii)
From 47c69318022e80a7faf2db282ce8722db688a1bf Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v3 2/4] make binaryheap available to frontend

---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 17 ++++++++++++++++-
 src/common/meson.build                   |  1 +
 5 files changed, 18 insertions(+), 3 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (96%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 96%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..f21838b946 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -109,7 +116,11 @@ void
 binaryheap_add_unordered(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -141,7 +152,11 @@ void
 binaryheap_add(binaryheap *heap, Datum d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
-- 
2.25.1

v3-0003-expand-binaryheap-api.patch (text/x-diff; charset=us-ascii)
From b165634ae997e67e44fd643769d30a4645d4b710 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v3 3/4] expand binaryheap api

---
 src/common/binaryheap.c      | 29 +++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 32 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index f21838b946..fdd8c0024a 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -211,6 +211,35 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the nth (zero-based) node from the heap.  The caller must ensure
+ * that there are at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..20a7f71244 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -47,8 +47,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, Datum d);
 extern Datum binaryheap_first(binaryheap *heap);
 extern Datum binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, Datum d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v3-0004-use-priority-queue-for-pg_restore-ready_list.patch (text/x-diff; charset=us-ascii)
From c7950af95945433d67304670a60caa5ca32cb34a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v3 4/4] use priority queue for pg_restore ready_list

---
 src/bin/pg_dump/pg_backup_archiver.c | 196 ++++++++-------------------
 1 file changed, 56 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..48009b071b 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -110,16 +93,12 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(Datum p1, Datum p2, void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -134,7 +113,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2380,7 +2359,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3980,7 +3959,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4023,24 +4002,26 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4054,7 +4035,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4064,7 +4045,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4075,7 +4056,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4089,7 +4070,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4118,10 +4099,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4221,80 +4202,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4314,17 +4224,27 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(Datum p1, Datum p2, void *arg)
+{
+	const void *te1 = (const void *) DatumGetPointer(p1);
+	const void *te2 = (const void *) DatumGetPointer(p2);
+
+	return TocEntrySizeCompareQsort(te1, te2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4340,38 +4260,34 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, PointerGetDatum(te));
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		Datum		ted = binaryheap_get_node(ready_heap, i);
+		TocEntry   *te = (TocEntry *) DatumGetPointer(ted);
 		bool		conflicts = false;
 
 		/*
@@ -4397,7 +4313,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_heap, i);
 		return te;
 	}
 
@@ -4443,7 +4359,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4461,7 +4377,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4704,11 +4620,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4726,18 +4642,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, PointerGetDatum(otherte));
 		}
 	}
 }
-- 
2.25.1

#11Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#10)
Re: Inefficiency in parallel pg_restore with many tables

On Sat, Jul 22, 2023 at 04:19:41PM -0700, Nathan Bossart wrote:

In v3, I moved the Datum definitions to c.h. I first tried modifying
binaryheap to use "int" or "void *" instead, but that ended up requiring
some rather invasive changes in backend code, not to mention any extensions
that happen to be using it. I also looked into moving the definitions to a
separate datumdefs.h header that postgres.h would include, but that felt
awkward because 1) postgres.h clearly states that it is intended for things
"that never escape the backend" and 2) the definitions seem relatively
inexpensive. However, I think the latter option is still viable, so I'm
fine with switching to it if folks think that is a better approach.
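
For illustration, that separate header could be quite small.  A rough sketch,
where datumdefs.h is just the hypothetical name floated above and only the
pointer conversions that binaryheap users strictly need are shown:

/* datumdefs.h -- hypothetical minimal header, sketch only */
typedef uintptr_t Datum;	/* same representation postgres.h uses today */

static inline Datum
PointerGetDatum(const void *X)
{
	return (Datum) X;
}

static inline char *
DatumGetPointer(Datum X)
{
	return (char *) X;
}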

BTW we might be able to replace the open-coded heap in pg_dump_sort.c
(added by 79273cc) with a binaryheap, too.
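
If so, the call pattern on the pg_dump side would presumably mirror the
ready-list usage in the patches above.  A generic sketch against the
Datum-flavored API of the v3 patches, where nitems, items[], my_compare, and
Item are placeholders rather than anything that exists in pg_dump_sort.c
today:

binaryheap *h = binaryheap_allocate(nitems, my_compare, NULL);

for (int i = 0; i < nitems; i++)
	binaryheap_add_unordered(h, PointerGetDatum(&items[i]));
binaryheap_build(h);		/* heapify once after bulk insertion */

while (!binaryheap_empty(h))
{
	Item	   *next = (Item *) DatumGetPointer(binaryheap_remove_first(h));

	/* ... process next ... */
}

binaryheap_free(h);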

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#10)
Re: Inefficiency in parallel pg_restore with many tables

Nathan Bossart <nathandbossart@gmail.com> writes:

On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote:

One item that requires more thought is binaryheap's use of Datum. AFAICT
the Datum definitions live in postgres.h and aren't available to frontend
code. I think we'll either need to move the Datum definitions to c.h or to
adjust binaryheap to use "void *".

In v3, I moved the Datum definitions to c.h. I first tried modifying
binaryheap to use "int" or "void *" instead, but that ended up requiring
some rather invasive changes in backend code, not to mention any extensions
that happen to be using it.

I'm quite uncomfortable with putting Datum in c.h. I know that the
typedef is merely a uintptr_t, but this solution seems to me to be
blowing all kinds of holes in the abstraction, because exactly none
of the infrastructure that goes along with Datum is or is ever likely
to be in any frontend build. At the very least, frontend code that
refers to Datum will be misleading as hell.

I wonder whether we can't provide some alternate definition or "skin"
for binaryheap that preserves the Datum API for backend code that wants
that, while providing a void *-based API for frontend code to use.

regards, tom lane

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#12)
4 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I first tried modifying
binaryheap to use "int" or "void *" instead, but that ended up requiring
some rather invasive changes in backend code, not to mention any extensions
that happen to be using it.

I followed through with the "void *" approach (attached), and it wasn't as
bad as I expected.

I wonder whether we can't provide some alternate definition or "skin"
for binaryheap that preserves the Datum API for backend code that wants
that, while providing a void *-based API for frontend code to use.

I can give this a try next, but it might be rather #ifdef-heavy.
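
Roughly, the skin could be little more than a node typedef that flips on
FRONTEND, with the comparator and entry points phrased in terms of it.  A
sketch of the header side only (bh_node_type is a placeholder name, and none
of this is in the attached patches):

#ifdef FRONTEND
typedef void *bh_node_type;
#else
typedef Datum bh_node_type;
#endif

typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b,
									  void *arg);

extern void binaryheap_add(binaryheap *heap, bh_node_type d);
extern bh_node_type binaryheap_first(binaryheap *heap);
extern bh_node_type binaryheap_remove_first(binaryheap *heap);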

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-use-void-instead-of-Datum-in-binaryheap.patch (text/x-diff; charset=us-ascii)
From b570460b19c5955c91b1fc28e8fdc791f41998e2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 20:59:46 -0700
Subject: [PATCH v4 1/4] use "void *" instead of "Datum" in binaryheap

---
 src/backend/executor/nodeGatherMerge.c        | 16 +++++-----
 src/backend/executor/nodeMergeAppend.c        | 16 +++++-----
 src/backend/lib/binaryheap.c                  | 20 ++++++------
 src/backend/postmaster/pgarch.c               | 18 +++++------
 .../replication/logical/reorderbuffer.c       | 31 +++++++++----------
 src/backend/storage/buffer/bufmgr.c           | 11 +++----
 src/include/lib/binaryheap.h                  | 14 ++++-----
 7 files changed, 61 insertions(+), 65 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 9d5e1a46e9..021682d87c 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -52,7 +52,7 @@ typedef struct GMReaderTupleBuffer
 } GMReaderTupleBuffer;
 
 static TupleTableSlot *ExecGatherMerge(PlanState *pstate);
-static int32 heap_compare_slots(Datum a, Datum b, void *arg);
+static int32 heap_compare_slots(const void *a, const void *b, void *arg);
 static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state);
 static MinimalTuple gm_readnext_tuple(GatherMergeState *gm_state, int nreader,
 									  bool nowait, bool *done);
@@ -489,7 +489,7 @@ reread:
 				/* Don't have a tuple yet, try to get one */
 				if (gather_merge_readnext(gm_state, i, nowait))
 					binaryheap_add_unordered(gm_state->gm_heap,
-											 Int32GetDatum(i));
+											 (void *) (intptr_t) i);
 			}
 			else
 			{
@@ -565,10 +565,10 @@ gather_merge_getnext(GatherMergeState *gm_state)
 		 * the heap, because it might now compare differently against the
 		 * other elements of the heap.
 		 */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		i = (intptr_t) binaryheap_first(gm_state->gm_heap);
 
 		if (gather_merge_readnext(gm_state, i, false))
-			binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i));
+			binaryheap_replace_first(gm_state->gm_heap, (void *) (intptr_t) i);
 		else
 		{
 			/* reader exhausted, remove it from heap */
@@ -585,7 +585,7 @@ gather_merge_getnext(GatherMergeState *gm_state)
 	else
 	{
 		/* Return next tuple from whichever participant has the leading one */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		i = (intptr_t) binaryheap_first(gm_state->gm_heap);
 		return gm_state->gm_slots[i];
 	}
 }
@@ -750,11 +750,11 @@ typedef int32 SlotNumber;
  * Compare the tuples in the two given slots.
  */
 static int32
-heap_compare_slots(Datum a, Datum b, void *arg)
+heap_compare_slots(const void *a, const void *b, void *arg)
 {
 	GatherMergeState *node = (GatherMergeState *) arg;
-	SlotNumber	slot1 = DatumGetInt32(a);
-	SlotNumber	slot2 = DatumGetInt32(b);
+	SlotNumber	slot1 = (intptr_t) a;
+	SlotNumber	slot2 = (intptr_t) b;
 
 	TupleTableSlot *s1 = node->gm_slots[slot1];
 	TupleTableSlot *s2 = node->gm_slots[slot2];
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 21b5726e6e..b3e2c29401 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -52,7 +52,7 @@
 typedef int32 SlotNumber;
 
 static TupleTableSlot *ExecMergeAppend(PlanState *pstate);
-static int	heap_compare_slots(Datum a, Datum b, void *arg);
+static int	heap_compare_slots(const void *a, const void *b, void *arg);
 
 
 /* ----------------------------------------------------------------
@@ -229,7 +229,7 @@ ExecMergeAppend(PlanState *pstate)
 		{
 			node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
 			if (!TupIsNull(node->ms_slots[i]))
-				binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i));
+				binaryheap_add_unordered(node->ms_heap, (void *) (intptr_t) i);
 		}
 		binaryheap_build(node->ms_heap);
 		node->ms_initialized = true;
@@ -244,10 +244,10 @@ ExecMergeAppend(PlanState *pstate)
 		 * by doing this before returning from the prior call, but it's better
 		 * to not pull tuples until necessary.)
 		 */
-		i = DatumGetInt32(binaryheap_first(node->ms_heap));
+		i = (intptr_t) binaryheap_first(node->ms_heap);
 		node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
 		if (!TupIsNull(node->ms_slots[i]))
-			binaryheap_replace_first(node->ms_heap, Int32GetDatum(i));
+			binaryheap_replace_first(node->ms_heap, (void *) (intptr_t) i);
 		else
 			(void) binaryheap_remove_first(node->ms_heap);
 	}
@@ -259,7 +259,7 @@ ExecMergeAppend(PlanState *pstate)
 	}
 	else
 	{
-		i = DatumGetInt32(binaryheap_first(node->ms_heap));
+		i = (intptr_t) binaryheap_first(node->ms_heap);
 		result = node->ms_slots[i];
 	}
 
@@ -270,11 +270,11 @@ ExecMergeAppend(PlanState *pstate)
  * Compare the tuples in the two given slots.
  */
 static int32
-heap_compare_slots(Datum a, Datum b, void *arg)
+heap_compare_slots(const void *a, const void *b, void *arg)
 {
 	MergeAppendState *node = (MergeAppendState *) arg;
-	SlotNumber	slot1 = DatumGetInt32(a);
-	SlotNumber	slot2 = DatumGetInt32(b);
+	SlotNumber	slot1 = (intptr_t) a;
+	SlotNumber	slot2 = (intptr_t) b;
 
 	TupleTableSlot *s1 = node->ms_slots[slot1];
 	TupleTableSlot *s2 = node->ms_slots[slot2];
diff --git a/src/backend/lib/binaryheap.c b/src/backend/lib/binaryheap.c
index 1737546757..c7fcfc550b 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/backend/lib/binaryheap.c
@@ -34,7 +34,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(void *) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,7 +106,7 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
 		elog(ERROR, "out of binary heap slots");
@@ -138,7 +138,7 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
 		elog(ERROR, "out of binary heap slots");
@@ -154,7 +154,7 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
+void *
 binaryheap_first(binaryheap *heap)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
@@ -169,10 +169,10 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
+void *
 binaryheap_remove_first(binaryheap *heap)
 {
-	Datum		result;
+	void	   *result;
 
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -204,7 +204,7 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, void *d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -221,7 +221,7 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	void	   *node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -232,7 +232,7 @@ sift_up(binaryheap *heap, int node_off)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		void	   *parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
@@ -264,7 +264,7 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	void	   *node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 46af349564..9491b822db 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -145,7 +145,7 @@ static bool pgarch_readyXlog(char *xlog);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
-static int	ready_file_comparator(Datum a, Datum b, void *arg);
+static int	ready_file_comparator(const void *a, const void *b, void *arg);
 static void LoadArchiveLibrary(void);
 static void pgarch_call_module_shutdown_cb(int code, Datum arg);
 
@@ -631,22 +631,22 @@ pgarch_readyXlog(char *xlog)
 			/* If the heap isn't full yet, quickly add it. */
 			arch_file = arch_files->arch_filenames[arch_files->arch_heap->bh_size];
 			strcpy(arch_file, basename);
-			binaryheap_add_unordered(arch_files->arch_heap, CStringGetDatum(arch_file));
+			binaryheap_add_unordered(arch_files->arch_heap, arch_file);
 
 			/* If we just filled the heap, make it a valid one. */
 			if (arch_files->arch_heap->bh_size == NUM_FILES_PER_DIRECTORY_SCAN)
 				binaryheap_build(arch_files->arch_heap);
 		}
 		else if (ready_file_comparator(binaryheap_first(arch_files->arch_heap),
-									   CStringGetDatum(basename), NULL) > 0)
+									   basename, NULL) > 0)
 		{
 			/*
 			 * Remove the lowest priority file and add the current one to the
 			 * heap.
 			 */
-			arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
+			arch_file = binaryheap_remove_first(arch_files->arch_heap);
 			strcpy(arch_file, basename);
-			binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file));
+			binaryheap_add(arch_files->arch_heap, arch_file);
 		}
 	}
 	FreeDir(rldir);
@@ -668,7 +668,7 @@ pgarch_readyXlog(char *xlog)
 	 */
 	arch_files->arch_files_size = arch_files->arch_heap->bh_size;
 	for (int i = 0; i < arch_files->arch_files_size; i++)
-		arch_files->arch_files[i] = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
+		arch_files->arch_files[i] = binaryheap_remove_first(arch_files->arch_heap);
 
 	/* Return the highest priority file. */
 	arch_files->arch_files_size--;
@@ -686,10 +686,10 @@ pgarch_readyXlog(char *xlog)
  * If "a" and "b" have equivalent values, 0 will be returned.
  */
 static int
-ready_file_comparator(Datum a, Datum b, void *arg)
+ready_file_comparator(const void *a, const void *b, void *arg)
 {
-	char	   *a_str = DatumGetCString(a);
-	char	   *b_str = DatumGetCString(b);
+	const char *a_str = (const char *) a;
+	const char *b_str = (const char *) b;
 	bool		a_history = IsTLHistoryFileName(a_str);
 	bool		b_history = IsTLHistoryFileName(b_str);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 26d252bd87..191c0d1181 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1223,11 +1223,10 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  * Binary heap comparison function.
  */
 static int
-ReorderBufferIterCompare(Datum a, Datum b, void *arg)
+ReorderBufferIterCompare(const void *a, const void *b, void *arg)
 {
-	ReorderBufferIterTXNState *state = (ReorderBufferIterTXNState *) arg;
-	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
-	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+	XLogRecPtr	pos_a = ((ReorderBufferIterTXNEntry *) a)->lsn;
+	XLogRecPtr	pos_b = ((ReorderBufferIterTXNEntry *) b)->lsn;
 
 	if (pos_a < pos_b)
 		return 1;
@@ -1298,7 +1297,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
-									  state);
+									  NULL);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
 	*iter_state = state;
@@ -1330,7 +1329,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		state->entries[off].change = cur_change;
 		state->entries[off].txn = txn;
 
-		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		binaryheap_add_unordered(state->heap, &state->entries[off++]);
 	}
 
 	/* add subtransactions if they contain changes */
@@ -1359,7 +1358,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			state->entries[off].change = cur_change;
 			state->entries[off].txn = cur_txn;
 
-			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+			binaryheap_add_unordered(state->heap, &state->entries[off++]);
 		}
 	}
 
@@ -1378,14 +1377,12 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 {
 	ReorderBufferChange *change;
 	ReorderBufferIterTXNEntry *entry;
-	int32		off;
 
 	/* nothing there anymore */
 	if (state->heap->bh_size == 0)
 		return NULL;
 
-	off = DatumGetInt32(binaryheap_first(state->heap));
-	entry = &state->entries[off];
+	entry = binaryheap_first(state->heap);
 
 	/* free memory we might have "leaked" in the previous *Next call */
 	if (!dlist_is_empty(&state->old_change))
@@ -1411,10 +1408,10 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 			dlist_container(ReorderBufferChange, node, next);
 
 		/* txn stays the same */
-		state->entries[off].lsn = next_change->lsn;
-		state->entries[off].change = next_change;
+		entry->lsn = next_change->lsn;
+		entry->change = next_change;
 
-		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		binaryheap_replace_first(state->heap, entry);
 		return change;
 	}
 
@@ -1435,7 +1432,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 		 */
 		rb->totalBytes += entry->txn->size;
 		if (ReorderBufferRestoreChanges(rb, entry->txn, &entry->file,
-										&state->entries[off].segno))
+										&entry->segno))
 		{
 			/* successfully restored changes from disk */
 			ReorderBufferChange *next_change =
@@ -1448,9 +1445,9 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 
 			Assert(entry->txn->nentries_mem);
 			/* txn stays the same */
-			state->entries[off].lsn = next_change->lsn;
-			state->entries[off].change = next_change;
-			binaryheap_replace_first(state->heap, Int32GetDatum(off));
+			entry->lsn = next_change->lsn;
+			entry->change = next_change;
+			binaryheap_replace_first(state->heap, entry);
 
 			return change;
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a7e3b9bb1d..391d5f3dd8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -501,7 +501,7 @@ static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
 static inline int buffertag_comparator(const BufferTag *ba, const BufferTag *bb);
 static inline int ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b);
-static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
+static int	ts_ckpt_progress_comparator(const void *a, const void *b, void *arg);
 
 
 /*
@@ -2649,7 +2649,7 @@ BufferSync(int flags)
 
 		ts_stat->progress_slice = (float8) num_to_scan / ts_stat->num_to_scan;
 
-		binaryheap_add_unordered(ts_heap, PointerGetDatum(ts_stat));
+		binaryheap_add_unordered(ts_heap, ts_stat);
 	}
 
 	binaryheap_build(ts_heap);
@@ -2665,8 +2665,7 @@ BufferSync(int flags)
 	while (!binaryheap_empty(ts_heap))
 	{
 		BufferDesc *bufHdr = NULL;
-		CkptTsStatus *ts_stat = (CkptTsStatus *)
-			DatumGetPointer(binaryheap_first(ts_heap));
+		CkptTsStatus *ts_stat = (CkptTsStatus *) binaryheap_first(ts_heap);
 
 		buf_id = CkptBufferIds[ts_stat->index].buf_id;
 		Assert(buf_id != -1);
@@ -2713,7 +2712,7 @@ BufferSync(int flags)
 		else
 		{
 			/* update heap with the new progress */
-			binaryheap_replace_first(ts_heap, PointerGetDatum(ts_stat));
+			binaryheap_replace_first(ts_heap, ts_stat);
 		}
 
 		/*
@@ -5416,7 +5415,7 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
  * progress.
  */
 static int
-ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
+ts_ckpt_progress_comparator(const void *a, const void *b, void *arg)
 {
 	CkptTsStatus *sa = (CkptTsStatus *) a;
 	CkptTsStatus *sb = (CkptTsStatus *) b;
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..71a25db0f4 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -15,7 +15,7 @@
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (const void *a, const void *b, void *arg);
 
 /*
  * binaryheap
@@ -34,7 +34,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	void	   *bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
@@ -42,12 +42,12 @@ extern binaryheap *binaryheap_allocate(int capacity,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, void *d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, void *d);
+extern void *binaryheap_first(binaryheap *heap);
+extern void *binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_replace_first(binaryheap *heap, void *d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
-- 
2.25.1

v4-0002-make-binaryheap-available-to-frontend.patch (text/x-diff; charset=us-ascii)
From 4fb71fc94af597b9ec9b5b233fedc7bb9ca5f05c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v4 2/4] make binaryheap available to frontend

---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 17 ++++++++++++++++-
 src/common/meson.build                   |  1 +
 5 files changed, 18 insertions(+), 3 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (96%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 96%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index c7fcfc550b..400a730c85 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -109,7 +116,11 @@ void
 binaryheap_add_unordered(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -141,7 +152,11 @@ void
 binaryheap_add(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
-- 
2.25.1

v4-0003-expand-binaryheap-api.patch (text/x-diff; charset=us-ascii)
From b28739db1e5ef373110d3d76e250bf285ad856a4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v4 3/4] expand binaryheap api

---
 src/common/binaryheap.c      | 29 +++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 32 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 400a730c85..aaf529d6a9 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -211,6 +211,35 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the nth (zero-based) node from the heap.  The caller must ensure
+ * that there are at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 71a25db0f4..06fe2ff213 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -47,8 +47,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, void *d);
 extern void *binaryheap_first(binaryheap *heap);
 extern void *binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, void *d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v4-0004-use-priority-queue-for-pg_restore-ready_list.patch (text/x-diff; charset=us-ascii)
From 08b6f280ad46e159ca330109c82ebb41c5508e4b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v4 4/4] use priority queue for pg_restore ready_list

---
 src/bin/pg_dump/pg_backup_archiver.c | 193 ++++++++-------------------
 1 file changed, 53 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..d05266b341 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -110,16 +93,13 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(const void *p1, const void *p2,
+										  void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -134,7 +114,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2380,7 +2360,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3980,7 +3960,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4023,24 +4003,26 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4054,7 +4036,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4064,7 +4046,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4075,7 +4057,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4089,7 +4071,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4118,10 +4100,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4221,80 +4203,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4314,17 +4225,24 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(const void *p1, const void *p2, void *arg)
+{
+	return TocEntrySizeCompareQsort(p1, p2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4340,38 +4258,33 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, te);
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		TocEntry   *te = (TocEntry *) binaryheap_get_node(ready_heap, i);
 		bool		conflicts = false;
 
 		/*
@@ -4397,7 +4310,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_heap, i);
 		return te;
 	}
 
@@ -4443,7 +4356,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4461,7 +4374,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4704,11 +4617,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4726,18 +4639,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, otherte);
 		}
 	}
 }
-- 
2.25.1

#14Pierre Ducroquet
p.psql@pinaraf.info
In reply to: Tom Lane (#1)
Re: Inefficiency in parallel pg_restore with many tables

On Saturday, July 15, 2023 7:47:12 PM CEST Tom Lane wrote:

I'm not sure how big a deal this is in practice: in most situations
the individual jobs are larger than they are in this toy example,
plus the initial non-parallelizable part of the restore is a bigger
bottleneck anyway with this many tables. Still, we do have one
real-world complaint, so maybe we should look into improving it.

Hi

For what it's worth, at my current job it's kind of a big deal. I was going to
start looking at the poor performance I was getting from pg_restore on some
databases with over 50k tables (in 200 namespaces) when I found this thread.
The dump weighs in at about 2.8GB; the toc.dat file is 230MB, covering 50,120
tables, 142,069 constraints, and 73,669 indexes.

HEAD pg_restore duration: 30 minutes
pg_restore with latest patch from Nathan Bossart: 23 minutes

This is indeed better, but there is still a lot of room for improvement. For
use cases like this, I was able to go much faster by combining the patched
pg_restore with a script that parallelizes across schemas instead of relying
on the scheduling choices made by pg_restore. It seems that parallelizing only
the data loading misses significant speedup opportunities when there is a huge
number of objects.

patched pg_restore + parallel restore of schemas: 10 minutes

Anyway, the patch works well as is, and I will certainly keep testing future
iterations.

Regards

Pierre
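
For illustration, here is a minimal, hypothetical C sketch of the per-schema
approach described above: one pg_restore invocation per schema, each with its
own worker pool.  The schema names, dump file, and target database are
placeholders, a production driver would cap the number of concurrent children,
and this is not the script actually used above.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	/* illustrative schema list; a real driver would read it from the dump */
	const char *schemas[] = {"schema_a", "schema_b", "schema_c"};
	int			nschemas = sizeof(schemas) / sizeof(schemas[0]);

	for (int i = 0; i < nschemas; i++)
	{
		pid_t		pid = fork();

		if (pid < 0)
		{
			perror("fork");
			return 1;
		}
		if (pid == 0)
		{
			/* child: restore only this schema, with its own --jobs setting */
			execlp("pg_restore", "pg_restore",
				   "--dbname=target_db", "--jobs=4",
				   "--schema", schemas[i],
				   "dump.custom", (char *) NULL);
			perror("execlp");
			_exit(1);
		}
	}

	/* parent: wait for every per-schema restore to finish */
	while (wait(NULL) > 0)
		;

	return 0;
}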

#15Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#13)
3 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Sat, Jul 22, 2023 at 10:57:03PM -0700, Nathan Bossart wrote:

On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote:

I wonder whether we can't provide some alternate definition or "skin"
for binaryheap that preserves the Datum API for backend code that wants
that, while providing a void *-based API for frontend code to use.

I can give this a try next, but it might be rather #ifdef-heavy.

Here is a sketch of this approach. It required fewer #ifdefs than I was
expecting. At the moment, this one seems like the winner to me.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-make-binaryheap-available-to-frontend.patch (text/x-diff; charset=us-ascii)
From 13017359cc6dd35f3e465280373b4b82e87b7c5b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v5 1/3] make binaryheap available to frontend

---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 37 +++++++++++++++++-------
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 23 ++++++++++-----
 6 files changed, 44 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (91%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 91%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..a6be4b42ae 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_elem_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,14 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_elem_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -138,10 +149,14 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, bh_elem_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -154,7 +169,7 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
+bh_elem_type
 binaryheap_first(binaryheap *heap)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
@@ -169,10 +184,10 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
+bh_elem_type
 binaryheap_remove_first(binaryheap *heap)
 {
-	Datum		result;
+	bh_elem_type result;
 
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -204,7 +219,7 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, bh_elem_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -221,7 +236,7 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_elem_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -232,7 +247,7 @@ sift_up(binaryheap *heap, int node_off)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		bh_elem_type parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
@@ -264,7 +279,7 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_elem_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..ae0af68f0d 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,11 +11,20 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+/*
+ * XXX: This needs a big comment.
+ */
+#ifdef FRONTEND
+#define bh_elem_type void *
+#else
+#define bh_elem_type Datum
+#endif
+
 /*
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (bh_elem_type a, bh_elem_type b, void *arg);
 
 /*
  * binaryheap
@@ -34,7 +43,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_elem_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
@@ -42,12 +51,12 @@ extern binaryheap *binaryheap_allocate(int capacity,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, bh_elem_type d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, bh_elem_type d);
+extern bh_elem_type binaryheap_first(binaryheap *heap);
+extern bh_elem_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_replace_first(binaryheap *heap, bh_elem_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
-- 
2.25.1

v5-0002-expand-binaryheap-api.patch (text/x-diff; charset=us-ascii)
From 36eefefc9a9597fe9671aa084903a699d16a1144 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v5 2/3] expand binaryheap api

---
 src/common/binaryheap.c      | 29 +++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 32 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index a6be4b42ae..d24a61fd2b 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -211,6 +211,35 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index ae0af68f0d..84c08d7ebb 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -56,8 +56,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, bh_elem_type d);
 extern bh_elem_type binaryheap_first(binaryheap *heap);
 extern bh_elem_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, bh_elem_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v5-0003-use-priority-queue-for-pg_restore-ready_list.patch (text/x-diff; charset=us-ascii)
From 71d8c2f4fb3945d90d1e4af174ff8b40175449bf Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v5 3/3] use priority queue for pg_restore ready_list

---
 src/bin/pg_dump/pg_backup_archiver.c | 192 ++++++++-------------------
 1 file changed, 52 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..5cfddf3916 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -110,16 +93,12 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -134,7 +113,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2380,7 +2359,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3980,7 +3959,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4023,24 +4002,26 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4054,7 +4035,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4064,7 +4045,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4075,7 +4056,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4089,7 +4070,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4118,10 +4099,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4221,80 +4202,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4314,17 +4224,24 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
+{
+	return TocEntrySizeCompareQsort(p1, p2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4340,38 +4257,33 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, te);
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		TocEntry   *te = (TocEntry *) binaryheap_get_node(ready_heap, i);
 		bool		conflicts = false;
 
 		/*
@@ -4397,7 +4309,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_heap, i);
 		return te;
 	}
 
@@ -4443,7 +4355,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4461,7 +4373,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4704,11 +4616,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4726,18 +4638,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, otherte);
 		}
 	}
 }
-- 
2.25.1

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#15)
4 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote:

Here is a sketch of this approach. It required fewer #ifdefs than I was
expecting. At the moment, this one seems like the winner to me.

Here is a polished patch set for this approach. I've also added a 0004
that replaces the open-coded heap in pg_dump_sort.c with a binaryheap.
IMHO these patches are in decent shape.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
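
Before the diffs, here is a minimal usage sketch (not part of the patch set)
of the frontend-facing API: the void *-based comparator and entry points from
0001, plus binaryheap_remove_node and binaryheap_size from 0002.  It assumes a
frontend build with these patches applied; the "job" struct and the sizes are
made up for illustration.

#include "postgres_fe.h"

#include "lib/binaryheap.h"

typedef struct job
{
	int			size;
} job;

/* max-heap comparator: return <0 iff a < b, so the largest job is the root */
static int
job_size_cmp(void *a, void *b, void *arg)
{
	int			sa = ((job *) a)->size;
	int			sb = ((job *) b)->size;

	return (sa > sb) - (sa < sb);
}

int
main(void)
{
	job			jobs[] = {{10}, {500}, {42}};
	binaryheap *heap = binaryheap_allocate(3, job_size_cmp, NULL);

	for (int i = 0; i < 3; i++)
		binaryheap_add(heap, &jobs[i]);

	/* the root is the largest job, which pop_next_work_item will try first */
	printf("largest job: %d\n", ((job *) binaryheap_first(heap))->size);

	/* remove an arbitrary node by index, as the lock-conflict fallback does */
	binaryheap_remove_node(heap, binaryheap_size(heap) - 1);

	binaryheap_free(heap);
	return 0;
}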

Attachments:

v6-0001-Make-binaryheap-available-to-frontend-code.patch (text/x-diff; charset=iso-8859-1)
From def4af18e68cece22008863f664e9e0fd060f917 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v6 1/4] Make binaryheap available to frontend code.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Álvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 37 +++++++++++++++++-------
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 26 ++++++++++++-----
 6 files changed, 47 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (91%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 91%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..4836cd1c6d 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,14 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -138,10 +149,14 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -154,7 +169,7 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
+bh_node_type
 binaryheap_first(binaryheap *heap)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
@@ -169,10 +184,10 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
+bh_node_type
 binaryheap_remove_first(binaryheap *heap)
 {
-	Datum		result;
+	bh_node_type result;
 
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -204,7 +219,7 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -221,7 +236,7 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -232,7 +247,7 @@ sift_up(binaryheap *heap, int node_off)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		bh_node_type parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
@@ -264,7 +279,7 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..4756f43a6d 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,11 +11,23 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+/*
+ * We provide a Datum-based API for backend code and a void *-based API for
+ * frontend code (since the Datum definitions are not available to frontend
+ * code).  You should typically avoid using bh_node_type directly and instead
+ * use Datum or void * as appropriate.
+ */
+#ifdef FRONTEND
+#define bh_node_type void *
+#else
+#define bh_node_type Datum
+#endif
+
 /*
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
 /*
  * binaryheap
@@ -34,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
@@ -42,12 +54,12 @@ extern binaryheap *binaryheap_allocate(int capacity,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, bh_node_type d);
+extern bh_node_type binaryheap_first(binaryheap *heap);
+extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
-- 
2.25.1

v6-0002-Add-function-for-removing-arbitrary-nodes-in-bina.patch (text/x-diff; charset=us-ascii)
From 8e01f7140bd0c70cceb586b5fb4d199b38cf1956 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v6 2/4] Add function for removing arbitrary nodes in
 binaryheap.

This commit introduces binaryheap_remove_node(), which can be used
to remove any node from a binary heap.  The implementation is
straightforward.  The target node is replaced with the last node in
the heap, and then we sift as needed to preserve the heap property.
This new function is intended for use in a follow-up commit that
will improve the performance of pg_restore.

Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/common/binaryheap.c      | 29 +++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 32 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 4836cd1c6d..8c6cc6e5e2 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -211,6 +211,35 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 4756f43a6d..7060c77a52 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -59,8 +59,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v6-0003-Convert-pg_restore-s-ready_list-to-a-priority-que.patch (text/x-diff; charset=us-ascii)
From a00509da87960187313f1e269831563fdc375a78 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v6 3/4] Convert pg_restore's ready_list to a priority queue.

Presently, we spend a lot of time sorting this list so that we pick
the largest items first.  With many tables, this sorting can become
a significant bottleneck.  There are a couple of reports from the
field about this, and it is easily reproducible, so this is not a
hypothetical issue.

This commit improves the performance of pg_restore with many tables
by converting its ready_list to a priority queue, i.e., a binary
heap.  We will first try to run the highest priority item, but if
it cannot be chosen due to the lock heuristic, we'll do a
sequential scan through the heap nodes until we find one that is
runnable.  This means that we might end up picking an item with
much lower priority, but since we expect that we'll typically be
able to choose one of the first few nodes, we should usually pick
an item with a relatively high priority.

On my machine, a basic test with 100,000 tables takes 11.5 minutes
without this patch and 1.5 minutes with it.  Pierre Ducroquet
claims to see a speedup from 30 minutes to 23 minutes for a
real-world dump of over 50,000 tables.

Suggested-by: Tom Lane
Tested-by: Pierre Ducroquet
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_backup_archiver.c | 197 ++++++++-------------------
 1 file changed, 57 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..4c86f7b448 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -110,16 +93,12 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -134,7 +113,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2380,7 +2359,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3980,7 +3959,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4023,24 +4002,26 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4054,7 +4035,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4064,7 +4045,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4075,7 +4056,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4089,7 +4070,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4118,10 +4099,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4221,80 +4202,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4314,17 +4224,24 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
+{
+	return TocEntrySizeCompareQsort(p1, p2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4340,38 +4257,38 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, te);
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.  Note that we do a
+	 * sequential scan through the heap nodes, so even though we will first
+	 * try to choose the highest-priority item, we might end up picking
+	 * something with a much lower priority.  However, it is expected that we
+	 * will usually be able to pick one of the first few items, which will
+	 * generally be relatively high priority.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		TocEntry   *te = (TocEntry *) binaryheap_get_node(ready_heap, i);
 		bool		conflicts = false;
 
 		/*
@@ -4397,7 +4314,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_heap, i);
 		return te;
 	}
 
@@ -4443,7 +4360,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4461,7 +4378,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4704,11 +4621,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4726,18 +4643,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, otherte);
 		}
 	}
 }
-- 
2.25.1

v6-0004-Remove-open-coded-binary-heap-in-pg_dump_sort.c.patch (text/x-diff; charset=us-ascii)
From 63384e57b79b5d3fef1801720893886a6541656f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 25 Jul 2023 11:18:52 -0700
Subject: [PATCH v6 4/4] Remove open-coded binary heap in pg_dump_sort.c.

Thanks to commit XXXXXXXXXX, binaryheap is available to frontend
code.  This commit replaces the open-coded heap implementation in
pg_dump_sort.c with a binaryheap, saving a few lines of code.

Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_dump_sort.c | 105 +++++++--------------------------
 1 file changed, 22 insertions(+), 83 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 523a19c155..aad64aecc2 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -16,6 +16,7 @@
 #include "postgres_fe.h"
 
 #include "catalog/pg_class_d.h"
+#include "lib/binaryheap.h"
 #include "pg_backup_archiver.h"
 #include "pg_backup_utils.h"
 #include "pg_dump.h"
@@ -161,8 +162,6 @@ static bool TopoSort(DumpableObject **objs,
 					 int numObjs,
 					 DumpableObject **ordering,
 					 int *nOrdering);
-static void addHeapElement(int val, int *heap, int heapLength);
-static int	removeHeapElement(int *heap, int heapLength);
 static void findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs);
 static int	findLoop(DumpableObject *obj,
 					 DumpId startPoint,
@@ -174,6 +173,7 @@ static void repairDependencyLoop(DumpableObject **loop,
 								 int nLoop);
 static void describeDumpableObject(DumpableObject *obj,
 								   char *buf, int bufsize);
+static int	int_cmp(void *a, void *b, void *arg);
 
 
 /*
@@ -374,11 +374,10 @@ TopoSort(DumpableObject **objs,
 		 int *nOrdering)		/* output argument */
 {
 	DumpId		maxDumpId = getMaxDumpId();
-	int		   *pendingHeap;
+	binaryheap *pendingHeap;
 	int		   *beforeConstraints;
 	int		   *idMap;
 	DumpableObject *obj;
-	int			heapLength;
 	int			i,
 				j,
 				k;
@@ -403,7 +402,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = (int *) pg_malloc(numObjs * sizeof(int));
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
@@ -434,19 +433,16 @@ TopoSort(DumpableObject **objs,
 	 * Now initialize the heap of items-ready-to-output by filling it with the
 	 * indexes of items that already have beforeConstraints[id] == 0.
 	 *
-	 * The essential property of a heap is heap[(j-1)/2] >= heap[j] for each j
-	 * in the range 1..heapLength-1 (note we are using 0-based subscripts
-	 * here, while the discussion in Knuth assumes 1-based subscripts). So, if
-	 * we simply enter the indexes into pendingHeap[] in decreasing order, we
-	 * a-fortiori have the heap invariant satisfied at completion of this
-	 * loop, and don't need to do any sift-up comparisons.
+	 * We enter the objects into pendingHeap in decreasing order so that the
+	 * heap invariant is satisfied at the completion of this loop.  This
+	 * reduces the amount of work that binaryheap_build() must do.
 	 */
-	heapLength = 0;
 	for (i = numObjs; --i >= 0;)
 	{
 		if (beforeConstraints[objs[i]->dumpId] == 0)
-			pendingHeap[heapLength++] = i;
+			binaryheap_add_unordered(pendingHeap, (void *) (intptr_t) i);
 	}
+	binaryheap_build(pendingHeap);
 
 	/*--------------------
 	 * Now emit objects, working backwards in the output list.  At each step,
@@ -464,10 +460,10 @@ TopoSort(DumpableObject **objs,
 	 *--------------------
 	 */
 	i = numObjs;
-	while (heapLength > 0)
+	while (!binaryheap_empty(pendingHeap))
 	{
 		/* Select object to output by removing largest heap member */
-		j = removeHeapElement(pendingHeap, heapLength--);
+		j = (intptr_t) binaryheap_remove_first(pendingHeap);
 		obj = objs[j];
 		/* Output candidate to ordering[] */
 		ordering[--i] = obj;
@@ -477,7 +473,7 @@ TopoSort(DumpableObject **objs,
 			int			id = obj->dependencies[k];
 
 			if ((--beforeConstraints[id]) == 0)
-				addHeapElement(idMap[id], pendingHeap, heapLength++);
+				binaryheap_add(pendingHeap, (void *) (intptr_t) idMap[id]);
 		}
 	}
 
@@ -497,79 +493,13 @@ TopoSort(DumpableObject **objs,
 	}
 
 	/* Done */
-	free(pendingHeap);
+	binaryheap_free(pendingHeap);
 	free(beforeConstraints);
 	free(idMap);
 
 	return (i == 0);
 }
 
-/*
- * Add an item to a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for increasing
- * its value after the call.  There must be sufficient storage at *heap.
- */
-static void
-addHeapElement(int val, int *heap, int heapLength)
-{
-	int			j;
-
-	/*
-	 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
-	 * using 1-based array indexes, not 0-based.
-	 */
-	j = heapLength;
-	while (j > 0)
-	{
-		int			i = (j - 1) >> 1;
-
-		if (val <= heap[i])
-			break;
-		heap[j] = heap[i];
-		j = i;
-	}
-	heap[j] = val;
-}
-
-/*
- * Remove the largest item present in a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for decreasing
- * its value after the call.
- *
- * We remove and return heap[0], which is always the largest element of
- * the heap, and then "sift up" to maintain the heap invariant.
- */
-static int
-removeHeapElement(int *heap, int heapLength)
-{
-	int			result = heap[0];
-	int			val;
-	int			i;
-
-	if (--heapLength <= 0)
-		return result;
-	val = heap[heapLength];		/* value that must be reinserted */
-	i = 0;						/* i is where the "hole" is */
-	for (;;)
-	{
-		int			j = 2 * i + 1;
-
-		if (j >= heapLength)
-			break;
-		if (j + 1 < heapLength &&
-			heap[j] < heap[j + 1])
-			j++;
-		if (val >= heap[j])
-			break;
-		heap[i] = heap[j];
-		i = j;
-	}
-	heap[i] = val;
-	return result;
-}
-
 /*
  * findDependencyLoops - identify loops in TopoSort's failure output,
  *		and pass each such loop to repairDependencyLoop() for action
@@ -1559,3 +1489,12 @@ describeDumpableObject(DumpableObject *obj, char *buf, int bufsize)
 			 (int) obj->objType,
 			 obj->dumpId, obj->catId.oid);
 }
+
+/*
+ * Compares "a" and "b" as integers.
+ */
+static int
+int_cmp(void *a, void *b, void *arg)
+{
+	return (intptr_t) a - (intptr_t) b;
+}
-- 
2.25.1

#17Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#16)
Re: Inefficiency in parallel pg_restore with many tables

On Tue, Jul 25, 2023 at 11:53:36AM -0700, Nathan Bossart wrote:

Here is a polished patch set for this approach. I've also added a 0004
that replaces the open-coded heap in pg_dump_sort.c with a binaryheap.
IMHO these patches are in decent shape.

I'm hoping to commit these patches at some point in the current commitfest.
I don't sense anything tremendously controversial, and they provide a
pretty nice speedup in some cases. Are there any remaining concerns?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#17)
Re: Inefficiency in parallel pg_restore with many tables

Nathan Bossart <nathandbossart@gmail.com> writes:

I'm hoping to commit these patches at some point in the current commitfest.
I don't sense anything tremendously controversial, and they provide a
pretty nice speedup in some cases. Are there any remaining concerns?

I've not actually looked at any of these patchsets after the first one.
I have added myself as a reviewer and will hopefully get to it within
a week or so.

regards, tom lane

#19Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#18)
Re: Inefficiency in parallel pg_restore with many tables

On Fri, Sep 01, 2023 at 01:41:41PM -0400, Tom Lane wrote:

I've not actually looked at any of these patchsets after the first one.
I have added myself as a reviewer and will hopefully get to it within
a week or so.

Thanks!

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#20Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#16)
Re: Inefficiency in parallel pg_restore with many tables

On Tue, Jul 25, 2023 at 2:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote:

Here is a sketch of this approach. It required fewer #ifdefs than I was
expecting. At the moment, this one seems like the winner to me.

Here is a polished patch set for this approach. I've also added a 0004
that replaces the open-coded heap in pg_dump_sort.c with a binaryheap.
IMHO these patches are in decent shape.

[ drive-by comment that hopefully doesn't cause too much pain ]

In hindsight, I think that making binaryheap depend on Datum was a bad
idea. I think that was my idea, and I think it wasn't very smart.
Considering that people have coded to that decision up until now, it
might not be too easy to change at this point. But in principle I
guess you'd want to be able to make a heap out of any C data type,
rather than just Datum, or just Datum in the backend and just void *
in the frontend.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#20)
Re: Inefficiency in parallel pg_restore with many tables

On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote:

In hindsight, I think that making binaryheap depend on Datum was a bad
idea. I think that was my idea, and I think it wasn't very smart.
Considering that people have coded to that decision up until now, it
might not be too easy to change at this point. But in principle I
guess you'd want to be able to make a heap out of any C data type,
rather than just Datum, or just Datum in the backend and just void *
in the frontend.

Yeah, something similar to simplehash for binary heaps could be nice. That
being said, I don't know if there's a strong reason to specialize the
implementation for a given C data type in most cases. I suspect many
callers are just fine with dealing with pointers (e.g., I wouldn't store an
entire TocEntry in the array), and smaller types like integers are already
stored directly in the array thanks to the use of Datum. However, it
_would_ allow us to abandon this frontend/backend void */Datum kludge,
which is something.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
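
For reference, the "frontend/backend void */Datum kludge" being discussed is presumably an arrangement along these lines (a sketch of the idea, not a quote from the earlier patches):

/*
 * Sketch: keep the Datum-based node type for backend code, and substitute
 * void * for frontend code, where Datum is unavailable.
 */
#ifdef FRONTEND
typedef void *bh_node_type;
#else
typedef Datum bh_node_type;
#endif

typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);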

#22Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#21)
1 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:

On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote:

In hindsight, I think that making binaryheap depend on Datum was a bad
idea. I think that was my idea, and I think it wasn't very smart.
Considering that people have coded to that decision up until now, it
might not be too easy to change at this point. But in principle I
guess you'd want to be able to make a heap out of any C data type,
rather than just Datum, or just Datum in the backend and just void *
in the frontend.

Yeah, something similar to simplehash for binary heaps could be nice. That
being said, I don't know if there's a strong reason to specialize the
implementation for a given C data type in most cases. I suspect many
callers are just fine with dealing with pointers (e.g., I wouldn't store an
entire TocEntry in the array), and smaller types like integers are already
stored directly in the array thanks to the use of Datum. However, it
_would_ allow us to abandon this frontend/backend void */Datum kludge,
which is something.

I ended up hacking together a (nowhere near committable) patch to see how
hard it would be to allow using any type with binaryheap. It doesn't seem
too bad.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

0001-allow-binaryheap-to-use-any-type.patch (text/x-diff; charset=us-ascii)
From fa51a6b48d1a7cdcffbd5ccbe4274f39dfb0c1b5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 2 Sep 2023 11:48:36 -0700
Subject: [PATCH 1/1] allow binaryheap to use any type

---
 src/backend/executor/nodeGatherMerge.c        | 20 +++---
 src/backend/executor/nodeMergeAppend.c        | 20 +++---
 src/backend/lib/binaryheap.c                  | 71 ++++++++++---------
 src/backend/postmaster/pgarch.c               | 37 +++++-----
 .../replication/logical/reorderbuffer.c       | 22 +++---
 src/backend/storage/buffer/bufmgr.c           | 21 +++---
 src/include/lib/binaryheap.h                  | 17 ++---
 7 files changed, 112 insertions(+), 96 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 9d5e1a46e9..4c75d224c8 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -52,7 +52,7 @@ typedef struct GMReaderTupleBuffer
 } GMReaderTupleBuffer;
 
 static TupleTableSlot *ExecGatherMerge(PlanState *pstate);
-static int32 heap_compare_slots(Datum a, Datum b, void *arg);
+static int32 heap_compare_slots(void *a, void *b, void *arg);
 static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state);
 static MinimalTuple gm_readnext_tuple(GatherMergeState *gm_state, int nreader,
 									  bool nowait, bool *done);
@@ -428,7 +428,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	}
 
 	/* Allocate the resources for the merge */
-	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
+	gm_state->gm_heap = binaryheap_allocate(nreaders + 1, sizeof(int),
 											heap_compare_slots,
 											gm_state);
 }
@@ -489,7 +489,7 @@ reread:
 				/* Don't have a tuple yet, try to get one */
 				if (gather_merge_readnext(gm_state, i, nowait))
 					binaryheap_add_unordered(gm_state->gm_heap,
-											 Int32GetDatum(i));
+											 &i);
 			}
 			else
 			{
@@ -565,14 +565,14 @@ gather_merge_getnext(GatherMergeState *gm_state)
 		 * the heap, because it might now compare differently against the
 		 * other elements of the heap.
 		 */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		binaryheap_first(gm_state->gm_heap, &i);
 
 		if (gather_merge_readnext(gm_state, i, false))
-			binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i));
+			binaryheap_replace_first(gm_state->gm_heap, &i);
 		else
 		{
 			/* reader exhausted, remove it from heap */
-			(void) binaryheap_remove_first(gm_state->gm_heap);
+			binaryheap_remove_first(gm_state->gm_heap, &i);
 		}
 	}
 
@@ -585,7 +585,7 @@ gather_merge_getnext(GatherMergeState *gm_state)
 	else
 	{
 		/* Return next tuple from whichever participant has the leading one */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		binaryheap_first(gm_state->gm_heap, &i);
 		return gm_state->gm_slots[i];
 	}
 }
@@ -750,11 +750,11 @@ typedef int32 SlotNumber;
  * Compare the tuples in the two given slots.
  */
 static int32
-heap_compare_slots(Datum a, Datum b, void *arg)
+heap_compare_slots(void *a, void *b, void *arg)
 {
 	GatherMergeState *node = (GatherMergeState *) arg;
-	SlotNumber	slot1 = DatumGetInt32(a);
-	SlotNumber	slot2 = DatumGetInt32(b);
+	SlotNumber	slot1 = *((int *) a);
+	SlotNumber	slot2 = *((int *) b);
 
 	TupleTableSlot *s1 = node->gm_slots[slot1];
 	TupleTableSlot *s2 = node->gm_slots[slot2];
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 21b5726e6e..928b4b3719 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -52,7 +52,7 @@
 typedef int32 SlotNumber;
 
 static TupleTableSlot *ExecMergeAppend(PlanState *pstate);
-static int	heap_compare_slots(Datum a, Datum b, void *arg);
+static int	heap_compare_slots(void *a, void *b, void *arg);
 
 
 /* ----------------------------------------------------------------
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, sizeof(int), heap_compare_slots,
 											  mergestate);
 
 	/*
@@ -229,7 +229,7 @@ ExecMergeAppend(PlanState *pstate)
 		{
 			node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
 			if (!TupIsNull(node->ms_slots[i]))
-				binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i));
+				binaryheap_add_unordered(node->ms_heap, &i);
 		}
 		binaryheap_build(node->ms_heap);
 		node->ms_initialized = true;
@@ -244,12 +244,12 @@ ExecMergeAppend(PlanState *pstate)
 		 * by doing this before returning from the prior call, but it's better
 		 * to not pull tuples until necessary.)
 		 */
-		i = DatumGetInt32(binaryheap_first(node->ms_heap));
+		binaryheap_first(node->ms_heap, &i);
 		node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
 		if (!TupIsNull(node->ms_slots[i]))
-			binaryheap_replace_first(node->ms_heap, Int32GetDatum(i));
+			binaryheap_replace_first(node->ms_heap, &i);
 		else
-			(void) binaryheap_remove_first(node->ms_heap);
+			binaryheap_remove_first(node->ms_heap, &i);
 	}
 
 	if (binaryheap_empty(node->ms_heap))
@@ -259,7 +259,7 @@ ExecMergeAppend(PlanState *pstate)
 	}
 	else
 	{
-		i = DatumGetInt32(binaryheap_first(node->ms_heap));
+		binaryheap_first(node->ms_heap, &i);
 		result = node->ms_slots[i];
 	}
 
@@ -270,11 +270,11 @@ ExecMergeAppend(PlanState *pstate)
  * Compare the tuples in the two given slots.
  */
 static int32
-heap_compare_slots(Datum a, Datum b, void *arg)
+heap_compare_slots(void *a, void *b, void *arg)
 {
 	MergeAppendState *node = (MergeAppendState *) arg;
-	SlotNumber	slot1 = DatumGetInt32(a);
-	SlotNumber	slot2 = DatumGetInt32(b);
+	SlotNumber	slot1 = *((int *) a);
+	SlotNumber	slot2 = *((int *) b);
 
 	TupleTableSlot *s1 = node->ms_slots[slot1];
 	TupleTableSlot *s2 = node->ms_slots[slot2];
diff --git a/src/backend/lib/binaryheap.c b/src/backend/lib/binaryheap.c
index 1737546757..0dc0e18f48 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/backend/lib/binaryheap.c
@@ -20,6 +20,9 @@
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
+#define bh_node_ptr(n)	(&heap->bh_nodes[(n) * heap->bh_elem_size])
+#define bh_set(n, d)	(memmove(bh_node_ptr((n)), (d), heap->bh_elem_size))
+
 /*
  * binaryheap_allocate
  *
@@ -29,14 +32,16 @@ static void sift_up(binaryheap *heap, int node_off);
  * argument specified by 'arg'.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, size_t elem_size,
+					binaryheap_comparator compare, void *arg)
 {
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + elem_size * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
+	heap->bh_elem_size = elem_size;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
@@ -106,12 +111,12 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
 		elog(ERROR, "out of binary heap slots");
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set(heap->bh_size, d);
 	heap->bh_size++;
 }
 
@@ -138,11 +143,11 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
 		elog(ERROR, "out of binary heap slots");
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set(heap->bh_size, d);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -154,11 +159,11 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
-binaryheap_first(binaryheap *heap)
+void
+binaryheap_first(binaryheap *heap, void *result)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
-	return heap->bh_nodes[0];
+	memmove(result, heap->bh_nodes, heap->bh_elem_size);
 }
 
 /*
@@ -169,31 +174,27 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
-binaryheap_remove_first(binaryheap *heap)
+void
+binaryheap_remove_first(binaryheap *heap, void *result)
 {
-	Datum		result;
-
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
 	/* extract the root node, which will be the result */
-	result = heap->bh_nodes[0];
+	memmove(result, heap->bh_nodes, heap->bh_elem_size);
 
 	/* easy if heap contains one element */
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
-		return result;
+		return;
 	}
 
 	/*
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_set(0, bh_node_ptr(--heap->bh_size));
 	sift_down(heap, 0);
-
-	return result;
 }
 
 /*
@@ -204,11 +205,11 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, void *d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_set(0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -221,7 +222,9 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	char	   *node_val = palloc(heap->bh_elem_size);
+
+	memmove(node_val, bh_node_ptr(node_off), heap->bh_elem_size);
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -232,14 +235,14 @@ sift_up(binaryheap *heap, int node_off)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		void	   *parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
 		 * satisfied, and we're done.
 		 */
 		parent_off = parent_offset(node_off);
-		parent_val = heap->bh_nodes[parent_off];
+		parent_val = bh_node_ptr(parent_off);
 		cmp = heap->bh_compare(node_val,
 							   parent_val,
 							   heap->bh_arg);
@@ -250,11 +253,12 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set(node_off, parent_val);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set(node_off, node_val);
+	pfree(node_val);
 }
 
 /*
@@ -264,7 +268,9 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	char	   *node_val = palloc(heap->bh_elem_size);
+
+	memmove(node_val, bh_node_ptr(node_off), heap->bh_elem_size);
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -280,20 +286,20 @@ sift_down(binaryheap *heap, int node_off)
 		/* Is the left child larger than the parent? */
 		if (left_off < heap->bh_size &&
 			heap->bh_compare(node_val,
-							 heap->bh_nodes[left_off],
+							 bh_node_ptr(left_off),
 							 heap->bh_arg) < 0)
 			swap_off = left_off;
 
 		/* Is the right child larger than the parent? */
 		if (right_off < heap->bh_size &&
 			heap->bh_compare(node_val,
-							 heap->bh_nodes[right_off],
+							 bh_node_ptr(right_off),
 							 heap->bh_arg) < 0)
 		{
 			/* swap with the larger child */
 			if (!swap_off ||
-				heap->bh_compare(heap->bh_nodes[left_off],
-								 heap->bh_nodes[right_off],
+				heap->bh_compare(bh_node_ptr(left_off),
+								 bh_node_ptr(right_off),
 								 heap->bh_arg) < 0)
 				swap_off = right_off;
 		}
@@ -309,9 +315,10 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set(node_off, bh_node_ptr(swap_off));
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set(node_off, node_val);
+	pfree(node_val);
 }
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 46af349564..a11005708a 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -145,7 +145,7 @@ static bool pgarch_readyXlog(char *xlog);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
-static int	ready_file_comparator(Datum a, Datum b, void *arg);
+static int	ready_file_comparator(void *a, void *b, void *arg);
 static void LoadArchiveLibrary(void);
 static void pgarch_call_module_shutdown_cb(int code, Datum arg);
 
@@ -249,7 +249,7 @@ PgArchiverMain(void)
 	arch_files->arch_files_size = 0;
 
 	/* Initialize our max-heap for prioritizing files to archive. */
-	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
+	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN, sizeof(char *),
 												ready_file_comparator, NULL);
 
 	/* Load the archive_library. */
@@ -631,22 +631,27 @@ pgarch_readyXlog(char *xlog)
 			/* If the heap isn't full yet, quickly add it. */
 			arch_file = arch_files->arch_filenames[arch_files->arch_heap->bh_size];
 			strcpy(arch_file, basename);
-			binaryheap_add_unordered(arch_files->arch_heap, CStringGetDatum(arch_file));
+			binaryheap_add_unordered(arch_files->arch_heap, &arch_file);
 
 			/* If we just filled the heap, make it a valid one. */
 			if (arch_files->arch_heap->bh_size == NUM_FILES_PER_DIRECTORY_SCAN)
 				binaryheap_build(arch_files->arch_heap);
 		}
-		else if (ready_file_comparator(binaryheap_first(arch_files->arch_heap),
-									   CStringGetDatum(basename), NULL) > 0)
+		else
 		{
-			/*
-			 * Remove the lowest priority file and add the current one to the
-			 * heap.
-			 */
-			arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
-			strcpy(arch_file, basename);
-			binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file));
+			char	   *result;
+
+			binaryheap_first(arch_files->arch_heap, &result);
+			if (ready_file_comparator(&result, &basename, NULL) > 0)
+			{
+				/*
+				 * Remove the lowest priority file and add the current one to
+				 * the heap.
+				 */
+				binaryheap_remove_first(arch_files->arch_heap, &arch_file);
+				strcpy(arch_file, basename);
+				binaryheap_add(arch_files->arch_heap, &arch_file);
+			}
 		}
 	}
 	FreeDir(rldir);
@@ -668,7 +673,7 @@ pgarch_readyXlog(char *xlog)
 	 */
 	arch_files->arch_files_size = arch_files->arch_heap->bh_size;
 	for (int i = 0; i < arch_files->arch_files_size; i++)
-		arch_files->arch_files[i] = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
+		binaryheap_remove_first(arch_files->arch_heap, &arch_files->arch_files[i]);
 
 	/* Return the highest priority file. */
 	arch_files->arch_files_size--;
@@ -686,10 +691,10 @@ pgarch_readyXlog(char *xlog)
  * If "a" and "b" have equivalent values, 0 will be returned.
  */
 static int
-ready_file_comparator(Datum a, Datum b, void *arg)
+ready_file_comparator(void *a, void *b, void *arg)
 {
-	char	   *a_str = DatumGetCString(a);
-	char	   *b_str = DatumGetCString(b);
+	char	   *a_str = *((char **) a);
+	char	   *b_str = *((char **) b);
 	bool		a_history = IsTLHistoryFileName(a_str);
 	bool		b_history = IsTLHistoryFileName(b_str);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 12edc5772a..b0a84c5ed1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1223,11 +1223,11 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  * Binary heap comparison function.
  */
 static int
-ReorderBufferIterCompare(Datum a, Datum b, void *arg)
+ReorderBufferIterCompare(void *a, void *b, void *arg)
 {
 	ReorderBufferIterTXNState *state = (ReorderBufferIterTXNState *) arg;
-	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
-	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+	XLogRecPtr	pos_a = state->entries[*((int *) a)].lsn;
+	XLogRecPtr	pos_b = state->entries[*((int *) b)].lsn;
 
 	if (pos_a < pos_b)
 		return 1;
@@ -1296,7 +1296,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 
 	/* allocate heap */
-	state->heap = binaryheap_allocate(state->nr_txns,
+	state->heap = binaryheap_allocate(state->nr_txns, sizeof(int),
 									  ReorderBufferIterCompare,
 									  state);
 
@@ -1330,7 +1330,8 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		state->entries[off].change = cur_change;
 		state->entries[off].txn = txn;
 
-		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		binaryheap_add_unordered(state->heap, &off);
+		off++;
 	}
 
 	/* add subtransactions if they contain changes */
@@ -1359,7 +1360,8 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			state->entries[off].change = cur_change;
 			state->entries[off].txn = cur_txn;
 
-			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+			binaryheap_add_unordered(state->heap, &off);
+			off++;
 		}
 	}
 
@@ -1384,7 +1386,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	if (state->heap->bh_size == 0)
 		return NULL;
 
-	off = DatumGetInt32(binaryheap_first(state->heap));
+	binaryheap_first(state->heap, &off);
 	entry = &state->entries[off];
 
 	/* free memory we might have "leaked" in the previous *Next call */
@@ -1414,7 +1416,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 		state->entries[off].lsn = next_change->lsn;
 		state->entries[off].change = next_change;
 
-		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		binaryheap_replace_first(state->heap, &off);
 		return change;
 	}
 
@@ -1450,14 +1452,14 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 			/* txn stays the same */
 			state->entries[off].lsn = next_change->lsn;
 			state->entries[off].change = next_change;
-			binaryheap_replace_first(state->heap, Int32GetDatum(off));
+			binaryheap_replace_first(state->heap, &off);
 
 			return change;
 		}
 	}
 
 	/* ok, no changes there anymore, remove */
-	binaryheap_remove_first(state->heap);
+	binaryheap_remove_first(state->heap, &off);
 
 	return change;
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3bd82dbfca..5bd7176b3c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -501,7 +501,7 @@ static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
 static inline int buffertag_comparator(const BufferTag *ba, const BufferTag *bb);
 static inline int ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b);
-static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
+static int	ts_ckpt_progress_comparator(void *a, void *b, void *arg);
 
 
 /*
@@ -2639,7 +2639,7 @@ BufferSync(int flags)
 	 * and compute how large a portion of the total progress a single
 	 * processed buffer is.
 	 */
-	ts_heap = binaryheap_allocate(num_spaces,
+	ts_heap = binaryheap_allocate(num_spaces, sizeof(CkptTsStatus *),
 								  ts_ckpt_progress_comparator,
 								  NULL);
 
@@ -2649,7 +2649,7 @@ BufferSync(int flags)
 
 		ts_stat->progress_slice = (float8) num_to_scan / ts_stat->num_to_scan;
 
-		binaryheap_add_unordered(ts_heap, PointerGetDatum(ts_stat));
+		binaryheap_add_unordered(ts_heap, &ts_stat);
 	}
 
 	binaryheap_build(ts_heap);
@@ -2665,8 +2665,9 @@ BufferSync(int flags)
 	while (!binaryheap_empty(ts_heap))
 	{
 		BufferDesc *bufHdr = NULL;
-		CkptTsStatus *ts_stat = (CkptTsStatus *)
-			DatumGetPointer(binaryheap_first(ts_heap));
+		CkptTsStatus *ts_stat;
+
+		binaryheap_first(ts_heap, &ts_stat);
 
 		buf_id = CkptBufferIds[ts_stat->index].buf_id;
 		Assert(buf_id != -1);
@@ -2708,12 +2709,12 @@ BufferSync(int flags)
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
 		{
-			binaryheap_remove_first(ts_heap);
+			binaryheap_remove_first(ts_heap, &ts_stat);
 		}
 		else
 		{
 			/* update heap with the new progress */
-			binaryheap_replace_first(ts_heap, PointerGetDatum(ts_stat));
+			binaryheap_replace_first(ts_heap, &ts_stat);
 		}
 
 		/*
@@ -5416,10 +5417,10 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
  * progress.
  */
 static int
-ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
+ts_ckpt_progress_comparator(void *a, void *b, void *arg)
 {
-	CkptTsStatus *sa = (CkptTsStatus *) a;
-	CkptTsStatus *sb = (CkptTsStatus *) b;
+	CkptTsStatus *sa = *((CkptTsStatus **) a);
+	CkptTsStatus *sb = *((CkptTsStatus **) b);
 
 	/* we want a min-heap, so return 1 for the a < b */
 	if (sa->progress < sb->progress)
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..24228e5b4b 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -15,7 +15,7 @@
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (void *a, void *b, void *arg);
 
 /*
  * binaryheap
@@ -31,23 +31,24 @@ typedef struct binaryheap
 {
 	int			bh_size;
 	int			bh_space;
+	size_t		bh_elem_size;
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	char		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
-extern binaryheap *binaryheap_allocate(int capacity,
+extern binaryheap *binaryheap_allocate(int capacity, size_t elem_size,
 									   binaryheap_comparator compare,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, void *d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, void *d);
+extern void binaryheap_first(binaryheap *heap, void *result);
+extern void binaryheap_remove_first(binaryheap *heap, void *result);
+extern void binaryheap_replace_first(binaryheap *heap, void *d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
-- 
2.25.1
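
To see the revised calling convention in one place, here is a minimal usage sketch against the interface above (a hypothetical caller, not part of the patch):

#include "postgres.h"
#include "lib/binaryheap.h"

/* Max-heap comparator for int payloads under the void *-based API. */
static int
int_cmp_max(void *a, void *b, void *arg)
{
	int			ia = *(int *) a;
	int			ib = *(int *) b;

	return (ia > ib) - (ia < ib);
}

static void
binaryheap_example(void)
{
	binaryheap *h = binaryheap_allocate(3, sizeof(int), int_cmp_max, NULL);
	int			v;

	for (int i = 0; i < 3; i++)
		binaryheap_add_unordered(h, &i);	/* copies sizeof(int) bytes */
	binaryheap_build(h);

	binaryheap_first(h, &v);			/* peek at the root: v == 2 */
	binaryheap_remove_first(h, &v);		/* pop the root: v == 2 */
	binaryheap_free(h);
}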

#23Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Nathan Bossart (#22)
Re: Inefficiency in parallel pg_restore with many tables

On 2023-Sep-02, Nathan Bossart wrote:

On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:

Yeah, something similar to simplehash for binary heaps could be nice. That
being said, I don't know if there's a strong reason to specialize the
implementation for a given C data type in most cases.

I ended up hacking together a (nowhere near committable) patch to see how
hard it would be to allow using any type with binaryheap. It doesn't seem
too bad.

Yeah, using void * seems to lead to interfaces that are pretty much the
same as bsearch() or qsort(). (Why isn't your payload type const,
though?)

I do wonder why you changed _remove_first and _first to have a
'result' output argument instead of a return value. Does this change
actually buy you anything? simplehash.h doesn't do that either.

-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, void *d);
+extern void binaryheap_first(binaryheap *heap, void *result);
+extern void binaryheap_remove_first(binaryheap *heap, void *result);
+extern void binaryheap_replace_first(binaryheap *heap, void *d);

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#24Nathan Bossart
nathandbossart@gmail.com
In reply to: Alvaro Herrera (#23)
Re: Inefficiency in parallel pg_restore with many tables

On Sun, Sep 03, 2023 at 12:04:00PM +0200, Alvaro Herrera wrote:

On 2023-Sep-02, Nathan Bossart wrote:

I ended up hacking together a (nowhere near committable) patch to see how
hard it would be to allow using any type with binaryheap. It doesn't seem
too bad.

Yeah, using void * seems to lead to interfaces that are pretty much the
same as bsearch() or qsort().

Right. This is what I had in mind.

(Why isn't your payload type const,
though?)

It probably should be const. This patch was just a proof-of-concept and
still requires a bit of work.

I do wonder why you changed _remove_first and _first to have a
'result' output argument instead of a return value. Does this change
actually buy you anything? simplehash.h doesn't do that either.

-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, void *d);
+extern void binaryheap_first(binaryheap *heap, void *result);
+extern void binaryheap_remove_first(binaryheap *heap, void *result);
+extern void binaryheap_replace_first(binaryheap *heap, void *d);

_first could likely just return a pointer to the data in the binary heap's
array. However, _remove_first has to copy the data somewhere, so I think
the alternative would be to return a palloc'd value. Is there another way
that I'm not thinking of?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
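
To make that trade-off concrete, the pointer-returning variant of _first could look roughly like this (hypothetical name, assuming the bh_nodes layout from the proof-of-concept patch):

/*
 * Hypothetical peek that returns a pointer into the heap's own storage
 * rather than copying into a caller-supplied buffer.  The pointer is only
 * valid until the heap is next modified.  _remove_first cannot do the same,
 * since the root slot is overwritten while re-establishing the heap
 * property, hence the copy (or a palloc'd result) there.
 */
static inline void *
binaryheap_first_ptr(binaryheap *heap)
{
	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
	return &heap->bh_nodes[0];
}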

#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#22)
5 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Sat, Sep 02, 2023 at 11:55:21AM -0700, Nathan Bossart wrote:

I ended up hacking together a (nowhere near committable) patch to see how
hard it would be to allow using any type with binaryheap. It doesn't seem
too bad.

I spent some more time on this patch and made the relevant adjustments to
the rest of the set.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v7-0001-Allow-binaryheap-to-use-any-type.patch (text/x-diff; charset=us-ascii)
From 3a51f25ae712b792bdde90b133c09b0b940f4198 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 4 Sep 2023 15:04:40 -0700
Subject: [PATCH v7 1/5] Allow binaryheap to use any type.

We would like to make binaryheap available to frontend code in a
future commit, but presently the implementation uses Datum for the
node type, which is inaccessible in the frontend.  This commit
modifies the implementation to allow using any type for the nodes.
binaryheap_allocate() now requires callers to specify the size of
the node, and several of the other functions now use void pointers.
---
 src/backend/executor/nodeGatherMerge.c        | 19 ++--
 src/backend/executor/nodeMergeAppend.c        | 22 ++---
 src/backend/lib/binaryheap.c                  | 99 +++++++++++--------
 src/backend/postmaster/pgarch.c               | 36 ++++---
 .../replication/logical/reorderbuffer.c       | 21 ++--
 src/backend/storage/buffer/bufmgr.c           | 20 ++--
 src/include/lib/binaryheap.h                  | 21 ++--
 7 files changed, 137 insertions(+), 101 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 9d5e1a46e9..f76406b575 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -52,7 +52,7 @@ typedef struct GMReaderTupleBuffer
 } GMReaderTupleBuffer;
 
 static TupleTableSlot *ExecGatherMerge(PlanState *pstate);
-static int32 heap_compare_slots(Datum a, Datum b, void *arg);
+static int32 heap_compare_slots(const void *a, const void *b, void *arg);
 static TupleTableSlot *gather_merge_getnext(GatherMergeState *gm_state);
 static MinimalTuple gm_readnext_tuple(GatherMergeState *gm_state, int nreader,
 									  bool nowait, bool *done);
@@ -429,6 +429,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
+											sizeof(int),
 											heap_compare_slots,
 											gm_state);
 }
@@ -489,7 +490,7 @@ reread:
 				/* Don't have a tuple yet, try to get one */
 				if (gather_merge_readnext(gm_state, i, nowait))
 					binaryheap_add_unordered(gm_state->gm_heap,
-											 Int32GetDatum(i));
+											 &i);
 			}
 			else
 			{
@@ -565,14 +566,14 @@ gather_merge_getnext(GatherMergeState *gm_state)
 		 * the heap, because it might now compare differently against the
 		 * other elements of the heap.
 		 */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		binaryheap_first(gm_state->gm_heap, &i);
 
 		if (gather_merge_readnext(gm_state, i, false))
-			binaryheap_replace_first(gm_state->gm_heap, Int32GetDatum(i));
+			binaryheap_replace_first(gm_state->gm_heap, &i);
 		else
 		{
 			/* reader exhausted, remove it from heap */
-			(void) binaryheap_remove_first(gm_state->gm_heap);
+			binaryheap_remove_first(gm_state->gm_heap, NULL);
 		}
 	}
 
@@ -585,7 +586,7 @@ gather_merge_getnext(GatherMergeState *gm_state)
 	else
 	{
 		/* Return next tuple from whichever participant has the leading one */
-		i = DatumGetInt32(binaryheap_first(gm_state->gm_heap));
+		binaryheap_first(gm_state->gm_heap, &i);
 		return gm_state->gm_slots[i];
 	}
 }
@@ -750,11 +751,11 @@ typedef int32 SlotNumber;
  * Compare the tuples in the two given slots.
  */
 static int32
-heap_compare_slots(Datum a, Datum b, void *arg)
+heap_compare_slots(const void *a, const void *b, void *arg)
 {
 	GatherMergeState *node = (GatherMergeState *) arg;
-	SlotNumber	slot1 = DatumGetInt32(a);
-	SlotNumber	slot2 = DatumGetInt32(b);
+	SlotNumber	slot1 = *((const int *) a);
+	SlotNumber	slot2 = *((const int *) b);
 
 	TupleTableSlot *s1 = node->gm_slots[slot1];
 	TupleTableSlot *s2 = node->gm_slots[slot2];
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 21b5726e6e..e0ace294a0 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -52,7 +52,7 @@
 typedef int32 SlotNumber;
 
 static TupleTableSlot *ExecMergeAppend(PlanState *pstate);
-static int	heap_compare_slots(Datum a, Datum b, void *arg);
+static int	heap_compare_slots(const void *a, const void *b, void *arg);
 
 
 /* ----------------------------------------------------------------
@@ -125,8 +125,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
-											  mergestate);
+	mergestate->ms_heap = binaryheap_allocate(nplans, sizeof(int),
+											  heap_compare_slots, mergestate);
 
 	/*
 	 * Miscellaneous initialization
@@ -229,7 +229,7 @@ ExecMergeAppend(PlanState *pstate)
 		{
 			node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
 			if (!TupIsNull(node->ms_slots[i]))
-				binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i));
+				binaryheap_add_unordered(node->ms_heap, &i);
 		}
 		binaryheap_build(node->ms_heap);
 		node->ms_initialized = true;
@@ -244,12 +244,12 @@ ExecMergeAppend(PlanState *pstate)
 		 * by doing this before returning from the prior call, but it's better
 		 * to not pull tuples until necessary.)
 		 */
-		i = DatumGetInt32(binaryheap_first(node->ms_heap));
+		binaryheap_first(node->ms_heap, &i);
 		node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
 		if (!TupIsNull(node->ms_slots[i]))
-			binaryheap_replace_first(node->ms_heap, Int32GetDatum(i));
+			binaryheap_replace_first(node->ms_heap, &i);
 		else
-			(void) binaryheap_remove_first(node->ms_heap);
+			binaryheap_remove_first(node->ms_heap, NULL);
 	}
 
 	if (binaryheap_empty(node->ms_heap))
@@ -259,7 +259,7 @@ ExecMergeAppend(PlanState *pstate)
 	}
 	else
 	{
-		i = DatumGetInt32(binaryheap_first(node->ms_heap));
+		binaryheap_first(node->ms_heap, &i);
 		result = node->ms_slots[i];
 	}
 
@@ -270,11 +270,11 @@ ExecMergeAppend(PlanState *pstate)
  * Compare the tuples in the two given slots.
  */
 static int32
-heap_compare_slots(Datum a, Datum b, void *arg)
+heap_compare_slots(const void *a, const void *b, void *arg)
 {
 	MergeAppendState *node = (MergeAppendState *) arg;
-	SlotNumber	slot1 = DatumGetInt32(a);
-	SlotNumber	slot2 = DatumGetInt32(b);
+	SlotNumber	slot1 = *((const int *) a);
+	SlotNumber	slot2 = *((const int *) b);
 
 	TupleTableSlot *s1 = node->ms_slots[slot1];
 	TupleTableSlot *s2 = node->ms_slots[slot2];
diff --git a/src/backend/lib/binaryheap.c b/src/backend/lib/binaryheap.c
index 1737546757..6f63181c4c 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/backend/lib/binaryheap.c
@@ -29,16 +29,19 @@ static void sift_up(binaryheap *heap, int node_off);
  * argument specified by 'arg'.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, size_t elem_size,
+					binaryheap_comparator compare, void *arg)
 {
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + elem_size * (capacity + 1);
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
+	heap->bh_elem_size = elem_size;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
+	heap->bh_hole = &heap->bh_nodes[capacity * elem_size];
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
@@ -97,6 +100,27 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * This utility function returns a pointer to the nth node of the binary heap.
+ */
+static inline void *
+bh_node(binaryheap *heap, int n)
+{
+	return &heap->bh_nodes[n * heap->bh_elem_size];
+}
+
+/*
+ * This utility function sets the nth node of the binary heap to the value that
+ * d points to.
+ */
+static inline void
+bh_set(binaryheap *heap, int n, const void *d)
+{
+	void	   *node = bh_node(heap, n);
+
+	memmove(node, d, heap->bh_elem_size);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -106,12 +130,12 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
 		elog(ERROR, "out of binary heap slots");
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set(heap, heap->bh_size, d);
 	heap->bh_size++;
 }
 
@@ -138,11 +162,11 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
 		elog(ERROR, "out of binary heap slots");
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set(heap, heap->bh_size, d);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -154,11 +178,11 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
-binaryheap_first(binaryheap *heap)
+void
+binaryheap_first(binaryheap *heap, void *result)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
-	return heap->bh_nodes[0];
+	memmove(result, heap->bh_nodes, heap->bh_elem_size);
 }
 
 /*
@@ -169,31 +193,28 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
-binaryheap_remove_first(binaryheap *heap)
+void
+binaryheap_remove_first(binaryheap *heap, void *result)
 {
-	Datum		result;
-
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
 	/* extract the root node, which will be the result */
-	result = heap->bh_nodes[0];
+	if (result)
+		memmove(result, heap->bh_nodes, heap->bh_elem_size);
 
 	/* easy if heap contains one element */
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
-		return result;
+		return;
 	}
 
 	/*
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_set(heap, 0, bh_node(heap, --heap->bh_size));
 	sift_down(heap, 0);
-
-	return result;
 }
 
 /*
@@ -204,11 +225,11 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, void *d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_set(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -221,26 +242,26 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	memmove(heap->bh_hole, bh_node(heap, node_off), heap->bh_elem_size);
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
-	 * notionally holds node_val, but we don't actually store node_val there
-	 * till the end, saving some unnecessary data copying steps.
+	 * notionally holds the swapped value, but we don't actually store the
+	 * value there till the end, saving some unnecessary data copying steps.
 	 */
 	while (node_off != 0)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		void	   *parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
 		 * satisfied, and we're done.
 		 */
 		parent_off = parent_offset(node_off);
-		parent_val = heap->bh_nodes[parent_off];
-		cmp = heap->bh_compare(node_val,
+		parent_val = bh_node(heap, parent_off);
+		cmp = heap->bh_compare(heap->bh_hole,
 							   parent_val,
 							   heap->bh_arg);
 		if (cmp <= 0)
@@ -250,11 +271,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set(heap, node_off, parent_val);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set(heap, node_off, heap->bh_hole);
 }
 
 /*
@@ -264,12 +285,12 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	memmove(heap->bh_hole, bh_node(heap, node_off), heap->bh_elem_size);
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
-	 * notionally holds node_val, but we don't actually store node_val there
-	 * till the end, saving some unnecessary data copying steps.
+	 * notionally holds the swapped value, but we don't actually store the
+	 * value there till the end, saving some unnecessary data copying steps.
 	 */
 	while (true)
 	{
@@ -279,21 +300,21 @@ sift_down(binaryheap *heap, int node_off)
 
 		/* Is the left child larger than the parent? */
 		if (left_off < heap->bh_size &&
-			heap->bh_compare(node_val,
-							 heap->bh_nodes[left_off],
+			heap->bh_compare(heap->bh_hole,
+							 bh_node(heap, left_off),
 							 heap->bh_arg) < 0)
 			swap_off = left_off;
 
 		/* Is the right child larger than the parent? */
 		if (right_off < heap->bh_size &&
-			heap->bh_compare(node_val,
-							 heap->bh_nodes[right_off],
+			heap->bh_compare(heap->bh_hole,
+							 bh_node(heap, right_off),
 							 heap->bh_arg) < 0)
 		{
 			/* swap with the larger child */
 			if (!swap_off ||
-				heap->bh_compare(heap->bh_nodes[left_off],
-								 heap->bh_nodes[right_off],
+				heap->bh_compare(bh_node(heap, left_off),
+								 bh_node(heap, right_off),
 								 heap->bh_arg) < 0)
 				swap_off = right_off;
 		}
@@ -309,9 +330,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set(heap, node_off, bh_node(heap, swap_off));
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set(heap, node_off, heap->bh_hole);
 }
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 46af349564..f7ee95d8f9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -145,7 +145,7 @@ static bool pgarch_readyXlog(char *xlog);
 static void pgarch_archiveDone(char *xlog);
 static void pgarch_die(int code, Datum arg);
 static void HandlePgArchInterrupts(void);
-static int	ready_file_comparator(Datum a, Datum b, void *arg);
+static int	ready_file_comparator(const void *a, const void *b, void *arg);
 static void LoadArchiveLibrary(void);
 static void pgarch_call_module_shutdown_cb(int code, Datum arg);
 
@@ -250,6 +250,7 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
+												sizeof(char *),
 												ready_file_comparator, NULL);
 
 	/* Load the archive_library. */
@@ -631,22 +632,27 @@ pgarch_readyXlog(char *xlog)
 			/* If the heap isn't full yet, quickly add it. */
 			arch_file = arch_files->arch_filenames[arch_files->arch_heap->bh_size];
 			strcpy(arch_file, basename);
-			binaryheap_add_unordered(arch_files->arch_heap, CStringGetDatum(arch_file));
+			binaryheap_add_unordered(arch_files->arch_heap, &arch_file);
 
 			/* If we just filled the heap, make it a valid one. */
 			if (arch_files->arch_heap->bh_size == NUM_FILES_PER_DIRECTORY_SCAN)
 				binaryheap_build(arch_files->arch_heap);
 		}
-		else if (ready_file_comparator(binaryheap_first(arch_files->arch_heap),
-									   CStringGetDatum(basename), NULL) > 0)
+		else
 		{
-			/*
-			 * Remove the lowest priority file and add the current one to the
-			 * heap.
-			 */
-			arch_file = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
-			strcpy(arch_file, basename);
-			binaryheap_add(arch_files->arch_heap, CStringGetDatum(arch_file));
+			char	   *root;
+
+			binaryheap_first(arch_files->arch_heap, &root);
+			if (ready_file_comparator(&root, &basename, NULL) > 0)
+			{
+				/*
+				 * Remove the lowest priority file and add the current one to
+				 * the heap.
+				 */
+				binaryheap_remove_first(arch_files->arch_heap, &arch_file);
+				strcpy(arch_file, basename);
+				binaryheap_add(arch_files->arch_heap, &arch_file);
+			}
 		}
 	}
 	FreeDir(rldir);
@@ -668,7 +674,7 @@ pgarch_readyXlog(char *xlog)
 	 */
 	arch_files->arch_files_size = arch_files->arch_heap->bh_size;
 	for (int i = 0; i < arch_files->arch_files_size; i++)
-		arch_files->arch_files[i] = DatumGetCString(binaryheap_remove_first(arch_files->arch_heap));
+		binaryheap_remove_first(arch_files->arch_heap, &arch_files->arch_files[i]);
 
 	/* Return the highest priority file. */
 	arch_files->arch_files_size--;
@@ -686,10 +692,10 @@ pgarch_readyXlog(char *xlog)
  * If "a" and "b" have equivalent values, 0 will be returned.
  */
 static int
-ready_file_comparator(Datum a, Datum b, void *arg)
+ready_file_comparator(const void *a, const void *b, void *arg)
 {
-	char	   *a_str = DatumGetCString(a);
-	char	   *b_str = DatumGetCString(b);
+	const char *a_str = *((const char **) a);
+	const char *b_str = *((const char **) b);
 	bool		a_history = IsTLHistoryFileName(a_str);
 	bool		b_history = IsTLHistoryFileName(b_str);
 
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 12edc5772a..e22680da0b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1223,11 +1223,11 @@ ReorderBufferCommitChild(ReorderBuffer *rb, TransactionId xid,
  * Binary heap comparison function.
  */
 static int
-ReorderBufferIterCompare(Datum a, Datum b, void *arg)
+ReorderBufferIterCompare(const void *a, const void *b, void *arg)
 {
 	ReorderBufferIterTXNState *state = (ReorderBufferIterTXNState *) arg;
-	XLogRecPtr	pos_a = state->entries[DatumGetInt32(a)].lsn;
-	XLogRecPtr	pos_b = state->entries[DatumGetInt32(b)].lsn;
+	XLogRecPtr	pos_a = state->entries[*((const int *) a)].lsn;
+	XLogRecPtr	pos_b = state->entries[*((const int *) b)].lsn;
 
 	if (pos_a < pos_b)
 		return 1;
@@ -1297,6 +1297,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
+									  sizeof(int),
 									  ReorderBufferIterCompare,
 									  state);
 
@@ -1330,7 +1331,8 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		state->entries[off].change = cur_change;
 		state->entries[off].txn = txn;
 
-		binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+		binaryheap_add_unordered(state->heap, &off);
+		off++;
 	}
 
 	/* add subtransactions if they contain changes */
@@ -1359,7 +1361,8 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			state->entries[off].change = cur_change;
 			state->entries[off].txn = cur_txn;
 
-			binaryheap_add_unordered(state->heap, Int32GetDatum(off++));
+			binaryheap_add_unordered(state->heap, &off);
+			off++;
 		}
 	}
 
@@ -1384,7 +1387,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 	if (state->heap->bh_size == 0)
 		return NULL;
 
-	off = DatumGetInt32(binaryheap_first(state->heap));
+	binaryheap_first(state->heap, &off);
 	entry = &state->entries[off];
 
 	/* free memory we might have "leaked" in the previous *Next call */
@@ -1414,7 +1417,7 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 		state->entries[off].lsn = next_change->lsn;
 		state->entries[off].change = next_change;
 
-		binaryheap_replace_first(state->heap, Int32GetDatum(off));
+		binaryheap_replace_first(state->heap, &off);
 		return change;
 	}
 
@@ -1450,14 +1453,14 @@ ReorderBufferIterTXNNext(ReorderBuffer *rb, ReorderBufferIterTXNState *state)
 			/* txn stays the same */
 			state->entries[off].lsn = next_change->lsn;
 			state->entries[off].change = next_change;
-			binaryheap_replace_first(state->heap, Int32GetDatum(off));
+			binaryheap_replace_first(state->heap, &off);
 
 			return change;
 		}
 	}
 
 	/* ok, no changes there anymore, remove */
-	binaryheap_remove_first(state->heap);
+	binaryheap_remove_first(state->heap, NULL);
 
 	return change;
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3bd82dbfca..7c78a0d625 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -501,7 +501,7 @@ static void CheckForBufferLeaks(void);
 static int	rlocator_comparator(const void *p1, const void *p2);
 static inline int buffertag_comparator(const BufferTag *ba, const BufferTag *bb);
 static inline int ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b);
-static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
+static int	ts_ckpt_progress_comparator(const void *a, const void *b, void *arg);
 
 
 /*
@@ -2640,6 +2640,7 @@ BufferSync(int flags)
 	 * processed buffer is.
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
+								  sizeof(CkptTsStatus *),
 								  ts_ckpt_progress_comparator,
 								  NULL);
 
@@ -2649,7 +2650,7 @@ BufferSync(int flags)
 
 		ts_stat->progress_slice = (float8) num_to_scan / ts_stat->num_to_scan;
 
-		binaryheap_add_unordered(ts_heap, PointerGetDatum(ts_stat));
+		binaryheap_add_unordered(ts_heap, &ts_stat);
 	}
 
 	binaryheap_build(ts_heap);
@@ -2665,8 +2666,9 @@ BufferSync(int flags)
 	while (!binaryheap_empty(ts_heap))
 	{
 		BufferDesc *bufHdr = NULL;
-		CkptTsStatus *ts_stat = (CkptTsStatus *)
-			DatumGetPointer(binaryheap_first(ts_heap));
+		CkptTsStatus *ts_stat;
+
+		binaryheap_first(ts_heap, &ts_stat);
 
 		buf_id = CkptBufferIds[ts_stat->index].buf_id;
 		Assert(buf_id != -1);
@@ -2708,12 +2710,12 @@ BufferSync(int flags)
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
 		{
-			binaryheap_remove_first(ts_heap);
+			binaryheap_remove_first(ts_heap, NULL);
 		}
 		else
 		{
 			/* update heap with the new progress */
-			binaryheap_replace_first(ts_heap, PointerGetDatum(ts_stat));
+			binaryheap_replace_first(ts_heap, &ts_stat);
 		}
 
 		/*
@@ -5416,10 +5418,10 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
  * progress.
  */
 static int
-ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
+ts_ckpt_progress_comparator(const void *a, const void *b, void *arg)
 {
-	CkptTsStatus *sa = (CkptTsStatus *) a;
-	CkptTsStatus *sb = (CkptTsStatus *) b;
+	const CkptTsStatus *sa = *((const CkptTsStatus **) a);
+	const CkptTsStatus *sb = *((const CkptTsStatus **) b);
 
 	/* we want a min-heap, so return 1 for the a < b */
 	if (sa->progress < sb->progress)
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..b8c7ef81ef 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -15,7 +15,7 @@
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (const void *a, const void *b, void *arg);
 
 /*
  * binaryheap
@@ -25,29 +25,32 @@ typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
  *		bh_has_heap_property	no unordered operations since last heap build
  *		bh_compare		comparison function to define the heap property
  *		bh_arg			user data for comparison function
- *		bh_nodes		variable-length array of "space" nodes
+ *		bh_hole			extra node used during sifting operations
+ *		bh_nodes		variable-length array of "space + 1" nodes
  */
 typedef struct binaryheap
 {
 	int			bh_size;
 	int			bh_space;
+	size_t		bh_elem_size;
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	void	   *bh_hole;
+	char		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
-extern binaryheap *binaryheap_allocate(int capacity,
+extern binaryheap *binaryheap_allocate(int capacity, size_t elem_size,
 									   binaryheap_comparator compare,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, void *d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, void *d);
+extern void binaryheap_first(binaryheap *heap, void *result);
+extern void binaryheap_remove_first(binaryheap *heap, void *result);
+extern void binaryheap_replace_first(binaryheap *heap, void *d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
-- 
2.25.1
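
As a quick illustration of the element-size-based API that the patch above introduces, here is a minimal backend-style sketch of a caller (it assumes the backend environment and #include "lib/binaryheap.h"; the comparator name and the values are illustrative only, not taken from the patch set). Elements are copied into the heap by value, so even a plain int is passed in and read back through a pointer:

/* illustrative max-heap of plain ints under the v7 element-size API */
static int
int_cmp(const void *a, const void *b, void *arg)
{
	int			ia = *((const int *) a);
	int			ib = *((const int *) b);

	return (ia > ib) - (ia < ib);
}

static void
int_heap_demo(void)
{
	binaryheap *h = binaryheap_allocate(16, sizeof(int), int_cmp, NULL);
	int			v;

	for (int i = 0; i < 10; i++)
		binaryheap_add(h, &i);		/* copies sizeof(int) bytes into the heap */

	binaryheap_remove_first(h, &v); /* v is now 9, the largest element */
	binaryheap_free(h);
}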

v7-0002-Make-binaryheap-available-to-frontend-code.patch (text/x-diff)
From a0dad80775e9e1947d49bb1483cbdc2f2b86415a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 4 Sep 2023 15:17:34 -0700
Subject: [PATCH v7 2/5] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.
---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 17 ++++++++++++++++-
 src/common/meson.build                   |  1 +
 5 files changed, 18 insertions(+), 3 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (96%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 96%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 6f63181c4c..07021d10b8 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -133,7 +140,11 @@ void
 binaryheap_add_unordered(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	bh_set(heap, heap->bh_size, d);
 	heap->bh_size++;
@@ -165,7 +176,11 @@ void
 binaryheap_add(binaryheap *heap, void *d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	bh_set(heap, heap->bh_size, d);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
-- 
2.25.1

v7-0003-Add-function-for-removing-arbitrary-nodes-in-bina.patch (text/x-diff)
From 658f4c1899b4721275986e50f388dd5be88e8c20 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v7 3/5] Add function for removing arbitrary nodes in
 binaryheap.

This commit introduces binaryheap_remove_nth(), which can be used
to remove any node from a binary heap.  The implementation is
straightforward.  The target node is replaced with the last node in
the heap, and then we sift as needed to preserve the heap property.
This new function is intended for use in a follow-up commit that
will improve the performance of pg_restore.
---
 src/common/binaryheap.c      | 44 ++++++++++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 47 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 07021d10b8..f1ed0e157e 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -232,6 +232,50 @@ binaryheap_remove_first(binaryheap *heap, void *result)
 	sift_down(heap, 0);
 }
 
+/*
+ * binaryheap_nth
+ *
+ * Returns the nth node from the heap.  The caller must ensure that there are
+ * at least (n - 1) nodes in the heap.  Always O(1).
+ */
+void
+binaryheap_nth(binaryheap *heap, int n, void *result)
+{
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	memmove(result, bh_node(heap, n), heap->bh_elem_size);
+}
+
+/*
+ * binaryheap_remove_nth
+ *
+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n - 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_nth(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(bh_node(heap, --heap->bh_size),
+						   bh_node(heap, n),
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	bh_set(heap, n, bh_node(heap, heap->bh_size));
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index b8c7ef81ef..083821ddbd 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -50,8 +50,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, void *d);
 extern void binaryheap_first(binaryheap *heap, void *result);
 extern void binaryheap_remove_first(binaryheap *heap, void *result);
+extern void binaryheap_nth(binaryheap *heap, int n, void *result);
+extern void binaryheap_remove_nth(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, void *d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1
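
The two new functions are meant to be used together: scan the heap in index order with binaryheap_nth() until you find an element you can use, then delete that exact node with binaryheap_remove_nth(). A minimal sketch under the v7 void *-based API, assuming the heap was allocated with elem_size = sizeof(item_t *); item_t and is_runnable() are hypothetical stand-ins, and the follow-up pg_restore patch does essentially this in pop_next_work_item():

static item_t *
pop_first_usable(binaryheap *heap)
{
	for (int i = 0; i < binaryheap_size(heap); i++)
	{
		item_t	   *it;

		binaryheap_nth(heap, i, &it);	/* peek at the i'th node */
		if (is_runnable(it))			/* hypothetical predicate */
		{
			binaryheap_remove_nth(heap, i); /* O(log n) removal */
			return it;
		}
	}
	return NULL;						/* nothing usable right now */
}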

v7-0004-Convert-pg_restore-s-ready_list-to-a-priority-que.patch (text/x-diff)
From 65de4421ba130a5716fc226f6de05b97b30c2711 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 4 Sep 2023 15:35:08 -0700
Subject: [PATCH v7 4/5] Convert pg_restore's ready_list to a priority queue.

Presently, we spend a lot of time sorting this list so that we pick
the largest items first.  With many tables, this sorting can become
a significant bottleneck.  There are a couple of reports from the
field about this, and it is easily reproducible, so this is not a
hypothetical issue.

This commit improves the performance of pg_restore with many tables
by converting its ready_list to a priority queue, i.e., a binary
heap.  We will first try to run the highest priority item, but if
it cannot be chosen due to the lock heuristic, we'll do a
sequential scan through the heap nodes until we find one that is
runnable.  This means that we might end up picking an item with
much lower priority, but since we expect that we'll typically be
able to choose one of the first few nodes, we should usually pick
an item with a relatively high priority.

On my machine, a basic test with 100,000 tables takes 11.5 minutes
without this patch and 1.5 minutes with it.  Pierre Ducroquet
claims to see a speedup from 30 minutes to 23 minutes for a
real-world dump of over 50,000 tables.
---
 src/bin/pg_dump/pg_backup_archiver.c | 201 ++++++++-------------------
 1 file changed, 61 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..ff7349537e 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -110,16 +93,13 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(const void *p1, const void *p2,
+										  void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -134,7 +114,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2380,7 +2360,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3980,7 +3960,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4023,24 +4003,27 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 sizeof(TocEntry *),
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4054,7 +4037,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4064,7 +4047,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4075,7 +4058,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4089,7 +4072,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4118,10 +4101,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4221,80 +4204,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4314,17 +4226,24 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(const void *p1, const void *p2, void *arg)
+{
+	return TocEntrySizeCompareQsort(p1, p2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4340,40 +4259,42 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, &te);
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.  Note that we do a
+	 * sequential scan through the heap nodes, so even though we will first
+	 * try to choose the highest-priority item, we might end up picking
+	 * something with a much lower priority.  However, it is expected that we
+	 * will typically be able to pick one of the first few items, which should
+	 * usually have a relatively high priority.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		TocEntry   *te;
 		bool		conflicts = false;
 
+		binaryheap_nth(ready_heap, i, &te);
+
 		/*
 		 * Check to see if the item would need exclusive lock on something
 		 * that a currently running item also needs lock on, or vice versa. If
@@ -4397,7 +4318,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_nth(ready_heap, i);
 		return te;
 	}
 
@@ -4443,7 +4364,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4461,7 +4382,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4704,11 +4625,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4726,18 +4647,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, &otherte);
 		}
 	}
 }
-- 
2.25.1

v7-0005-Remove-open-coded-binary-heap-in-pg_dump_sort.c.patch (text/x-diff)
From 51ebcf055faadf2c174bbe533479843837c3945a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 25 Jul 2023 11:18:52 -0700
Subject: [PATCH v7 5/5] Remove open-coded binary heap in pg_dump_sort.c.

Thanks to commit XXXXXXXXXX, binaryheap is available to frontend
code.  This commit replaces the open-coded heap implementation in
pg_dump_sort.c with a binaryheap, saving a few lines of code.
---
 src/bin/pg_dump/pg_dump_sort.c | 103 +++++++--------------------------
 1 file changed, 20 insertions(+), 83 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 523a19c155..c9b161ed30 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -16,6 +16,7 @@
 #include "postgres_fe.h"
 
 #include "catalog/pg_class_d.h"
+#include "lib/binaryheap.h"
 #include "pg_backup_archiver.h"
 #include "pg_backup_utils.h"
 #include "pg_dump.h"
@@ -161,8 +162,6 @@ static bool TopoSort(DumpableObject **objs,
 					 int numObjs,
 					 DumpableObject **ordering,
 					 int *nOrdering);
-static void addHeapElement(int val, int *heap, int heapLength);
-static int	removeHeapElement(int *heap, int heapLength);
 static void findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs);
 static int	findLoop(DumpableObject *obj,
 					 DumpId startPoint,
@@ -174,6 +173,7 @@ static void repairDependencyLoop(DumpableObject **loop,
 								 int nLoop);
 static void describeDumpableObject(DumpableObject *obj,
 								   char *buf, int bufsize);
+static int	int_cmp(const void *a, const void *b, void *arg);
 
 
 /*
@@ -374,11 +374,10 @@ TopoSort(DumpableObject **objs,
 		 int *nOrdering)		/* output argument */
 {
 	DumpId		maxDumpId = getMaxDumpId();
-	int		   *pendingHeap;
+	binaryheap *pendingHeap;
 	int		   *beforeConstraints;
 	int		   *idMap;
 	DumpableObject *obj;
-	int			heapLength;
 	int			i,
 				j,
 				k;
@@ -403,7 +402,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = (int *) pg_malloc(numObjs * sizeof(int));
+	pendingHeap = binaryheap_allocate(numObjs, sizeof(int), int_cmp, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
@@ -434,19 +433,16 @@ TopoSort(DumpableObject **objs,
 	 * Now initialize the heap of items-ready-to-output by filling it with the
 	 * indexes of items that already have beforeConstraints[id] == 0.
 	 *
-	 * The essential property of a heap is heap[(j-1)/2] >= heap[j] for each j
-	 * in the range 1..heapLength-1 (note we are using 0-based subscripts
-	 * here, while the discussion in Knuth assumes 1-based subscripts). So, if
-	 * we simply enter the indexes into pendingHeap[] in decreasing order, we
-	 * a-fortiori have the heap invariant satisfied at completion of this
-	 * loop, and don't need to do any sift-up comparisons.
+	 * We enter the objects into pendingHeap in decreasing order so that the
+	 * heap invariant is satisfied at the completion of this loop.  This
+	 * reduces the amount of work that binaryheap_build() must do.
 	 */
-	heapLength = 0;
 	for (i = numObjs; --i >= 0;)
 	{
 		if (beforeConstraints[objs[i]->dumpId] == 0)
-			pendingHeap[heapLength++] = i;
+			binaryheap_add_unordered(pendingHeap, &i);
 	}
+	binaryheap_build(pendingHeap);
 
 	/*--------------------
 	 * Now emit objects, working backwards in the output list.  At each step,
@@ -464,10 +460,10 @@ TopoSort(DumpableObject **objs,
 	 *--------------------
 	 */
 	i = numObjs;
-	while (heapLength > 0)
+	while (!binaryheap_empty(pendingHeap))
 	{
 		/* Select object to output by removing largest heap member */
-		j = removeHeapElement(pendingHeap, heapLength--);
+		binaryheap_remove_first(pendingHeap, &j);
 		obj = objs[j];
 		/* Output candidate to ordering[] */
 		ordering[--i] = obj;
@@ -477,7 +473,7 @@ TopoSort(DumpableObject **objs,
 			int			id = obj->dependencies[k];
 
 			if ((--beforeConstraints[id]) == 0)
-				addHeapElement(idMap[id], pendingHeap, heapLength++);
+				binaryheap_add(pendingHeap, &idMap[id]);
 		}
 	}
 
@@ -497,79 +493,13 @@ TopoSort(DumpableObject **objs,
 	}
 
 	/* Done */
-	free(pendingHeap);
+	binaryheap_free(pendingHeap);
 	free(beforeConstraints);
 	free(idMap);
 
 	return (i == 0);
 }
 
-/*
- * Add an item to a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for increasing
- * its value after the call.  There must be sufficient storage at *heap.
- */
-static void
-addHeapElement(int val, int *heap, int heapLength)
-{
-	int			j;
-
-	/*
-	 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
-	 * using 1-based array indexes, not 0-based.
-	 */
-	j = heapLength;
-	while (j > 0)
-	{
-		int			i = (j - 1) >> 1;
-
-		if (val <= heap[i])
-			break;
-		heap[j] = heap[i];
-		j = i;
-	}
-	heap[j] = val;
-}
-
-/*
- * Remove the largest item present in a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for decreasing
- * its value after the call.
- *
- * We remove and return heap[0], which is always the largest element of
- * the heap, and then "sift up" to maintain the heap invariant.
- */
-static int
-removeHeapElement(int *heap, int heapLength)
-{
-	int			result = heap[0];
-	int			val;
-	int			i;
-
-	if (--heapLength <= 0)
-		return result;
-	val = heap[heapLength];		/* value that must be reinserted */
-	i = 0;						/* i is where the "hole" is */
-	for (;;)
-	{
-		int			j = 2 * i + 1;
-
-		if (j >= heapLength)
-			break;
-		if (j + 1 < heapLength &&
-			heap[j] < heap[j + 1])
-			j++;
-		if (val >= heap[j])
-			break;
-		heap[i] = heap[j];
-		i = j;
-	}
-	heap[i] = val;
-	return result;
-}
-
 /*
  * findDependencyLoops - identify loops in TopoSort's failure output,
  *		and pass each such loop to repairDependencyLoop() for action
@@ -1559,3 +1489,10 @@ describeDumpableObject(DumpableObject *obj, char *buf, int bufsize)
 			 (int) obj->objType,
 			 obj->dumpId, obj->catId.oid);
 }
+
+/* binaryheap comparator for integers */
+static int
+int_cmp(const void *a, const void *b, void *arg)
+{
+	return *((const int *) a) - *((const int *) b);
+}
-- 
2.25.1

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#25)
Re: Inefficiency in parallel pg_restore with many tables

Nathan Bossart <nathandbossart@gmail.com> writes:

I spent some more time on this patch and made the relevant adjustments to
the rest of the set.

Hmm ... I do not like v7 very much at all. It requires rather ugly
changes to all of the existing callers, and what are we actually
buying? If anything, it makes things slower for pass-by-value items
like integers. I'd stick with the Datum convention in the backend.
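
For a pass-by-value element like an int, the two conventions look roughly like this (just a sketch, not code from either patch set):

	/* Datum convention (v6): the value itself is the heap node */
	binaryheap_add(heap, Int32GetDatum(i));
	i = DatumGetInt32(binaryheap_remove_first(heap));

	/* void *-plus-element-size convention (v7): everything goes through
	 * pointers, even plain integers */
	binaryheap_add(heap, &i);
	binaryheap_remove_first(heap, &i);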

Instead, I took a closer look through the v6 patch set.
I think that's in pretty good shape and nearly committable,
but I have a few thoughts:

* I'm not sure about defining bh_node_type as a macro:

+#ifdef FRONTEND
+#define bh_node_type void *
+#else
+#define bh_node_type Datum
+#endif

rather than an actual typedef:

+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif

My concern here is that bh_node_type is effectively acting as a
typedef, so that pgindent might misbehave if it's not declared as a
typedef. On the other hand, there doesn't seem to be any indentation
problem in the patchset as it stands, and we don't expect any code
outside binaryheap.h/.c to refer to bh_node_type, so maybe it's fine.
(If you do choose to make it a typedef, remember to add it to
typedefs.list.)

* As a matter of style, I'd recommend adding braces in places
like this:

 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_nodes[heap->bh_size] = d;

It's not wrong as you have it, but I think it's more readable
and less easy to accidentally break with the extra braces.

* In 0002, isn't the comment for binaryheap_remove_node wrong?

+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n - 1) nodes in the heap.  O(log n) worst case.

Shouldn't that be "(n + 1)"? Also, I'd specify "n'th (zero based) node"
for clarity.

* I would say that this bit in 0004:

-		j = removeHeapElement(pendingHeap, heapLength--);
+		j = (intptr_t) binaryheap_remove_first(pendingHeap);

needs an explicit cast to int:

+ j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);

otherwise some compilers might complain about the result possibly
not fitting in "j".

Other than those nitpicks, I like v6. I'll mark this RfC.

regards, tom lane

#27Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#26)
4 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:

Hmm ... I do not like v7 very much at all. It requires rather ugly
changes to all of the existing callers, and what are we actually
buying? If anything, it makes things slower for pass-by-value items
like integers. I'd stick with the Datum convention in the backend.

Instead, I took a closer look through the v6 patch set.
I think that's in pretty good shape and nearly committable,
but I have a few thoughts:

Thanks for reviewing. I'm fine with proceeding with the v6 approach. Even
though the alternative approach makes the API consistent for the frontend
and backend, I'm also not a huge fan of the pointer gymnastics required in
the comparators. Granted, we still have to do some intptr_t conversions in
pg_dump_sort.c with the v6 approach, but that seems to be an exception.
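
Those conversions amount to stashing the index in the pointer-sized frontend node and pulling it back out, roughly like this (a sketch; the add side is shown only for illustration):

	/* frontend heap node is a void *, so a small integer rides inside it */
	binaryheap_add(pendingHeap, (void *) (intptr_t) i);

	/* and it comes back out the same way */
	j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);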

* I'm not sure about defining bh_node_type as a macro:

+#ifdef FRONTEND
+#define bh_node_type void *
+#else
+#define bh_node_type Datum
+#endif

rather than an actual typedef:

+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif

My concern here is that bh_node_type is effectively acting as a
typedef, so that pgindent might misbehave if it's not declared as a
typedef. On the other hand, there doesn't seem to be any indentation
problem in the patchset as it stands, and we don't expect any code
outside binaryheap.h/.c to refer to bh_node_type, so maybe it's fine.
(If you do choose to make it a typedef, remember to add it to
typedefs.list.)

I think a typedef makes more sense here.

* As a matter of style, I'd recommend adding braces in places
like this:

 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_nodes[heap->bh_size] = d;

It's not wrong as you have it, but I think it's more readable
and less easy to accidentally break with the extra braces.

Fair point.

* In 0002, isn't the comment for binaryheap_remove_node wrong?

+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n - 1) nodes in the heap.  O(log n) worst case.

Shouldn't that be "(n + 1)"? Also, I'd specify "n'th (zero based) node"
for clarity.

Yeah, that's a mistake.

* I would say that this bit in 0004:

-		j = removeHeapElement(pendingHeap, heapLength--);
+		j = (intptr_t) binaryheap_remove_first(pendingHeap);

needs an explicit cast to int:

+ j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);

otherwise some compilers might complain about the result possibly
not fitting in "j".

Sure. IMO it's a tad more readable, too.

Other than those nitpicks, I like v6. I'll mark this RfC.

Great. I've posted a v8 with your comments addressed in order to get one
more round of cfbot coverage. Assuming those tests pass and there is no
additional feedback, I'll plan on committing this in the next few days.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v8-0001-Make-binaryheap-available-to-frontend-code.patch (text/x-diff)
From 7dd09c72a44e5ba293c6e9b2d2afa608ba070f03 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v8 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +++++++++++++++++-------
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 26 +++++++++++----
 src/tools/pgindent/typedefs.list         |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 90%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..39a8243a6d 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,16 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -138,10 +151,16 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -154,7 +173,7 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
+bh_node_type
 binaryheap_first(binaryheap *heap)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
@@ -169,10 +188,10 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
+bh_node_type
 binaryheap_remove_first(binaryheap *heap)
 {
-	Datum		result;
+	bh_node_type result;
 
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -204,7 +223,7 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -221,7 +240,7 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -232,7 +251,7 @@ sift_up(binaryheap *heap, int node_off)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		bh_node_type parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
@@ -264,7 +283,7 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..3647aeae65 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,11 +11,23 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+/*
+ * We provide a Datum-based API for backend code and a void *-based API for
+ * frontend code (since the Datum definitions are not available to frontend
+ * code).  You should typically avoid using bh_node_type directly and instead
+ * use Datum or void * as appropriate.
+ */
+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif
+
 /*
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
 /*
  * binaryheap
@@ -34,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
@@ -42,12 +54,12 @@ extern binaryheap *binaryheap_allocate(int capacity,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, bh_node_type d);
+extern bh_node_type binaryheap_first(binaryheap *heap);
+extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f3d8a2a855..b5bbdd1608 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3198,6 +3198,7 @@ bbstreamer_tar_archiver
 bbstreamer_tar_parser
 bbstreamer_zstd_frame
 bgworker_main_type
+bh_node_type
 binaryheap
 binaryheap_comparator
 bitmapword
-- 
2.25.1

v8-0002-Add-function-for-removing-arbitrary-nodes-in-bina.patch (text/x-diff; charset=us-ascii)
From bab7e4721240bbd5f2269b09c32fa023519f5457 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v8 2/4] Add function for removing arbitrary nodes in
 binaryheap.

This commit introduces binaryheap_remove_node(), which can be used
to remove any node from a binary heap.  The implementation is
straightforward.  The target node is replaced with the last node in
the heap, and then we sift as needed to preserve the heap property.
This new function is intended for use in a follow-up commit that
will improve the performance of pg_restore.

Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/common/binaryheap.c      | 29 +++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 32 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 39a8243a6d..19e095f1fb 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -215,6 +215,35 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the nth (zero based) node from the heap.  The caller must ensure
+ * that there are at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 3647aeae65..9525dcaec4 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -59,8 +59,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v8-0003-Convert-pg_restore-s-ready_list-to-a-priority-que.patch (text/x-diff; charset=us-ascii)
From 9cff5d8dd57dbddf0e0fd5e08b2e0244cb913109 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v8 3/4] Convert pg_restore's ready_list to a priority queue.

Presently, we spend a lot of time sorting this list so that we pick
the largest items first.  With many tables, this sorting can become
a significant bottleneck.  There are a couple of reports from the
field about this, and it is easily reproducible, so this is not a
hypothetical issue.

This commit improves the performance of pg_restore with many tables
by converting its ready_list to a priority queue, i.e., a binary
heap.  We will first try to run the highest priority item, but if
it cannot be chosen due to the lock heuristic, we'll do a
sequential scan through the heap nodes until we find one that is
runnable.  This means that we might end up picking an item with
much lower priority, but since we expect that we'll typically be
able to choose one of the first few nodes, we should usually pick
an item with a relatively high priority.

On my machine, a basic test with 100,000 tables takes 11.5 minutes
without this patch and 1.5 minutes with it.  Pierre Ducroquet
claims to see a speedup from 30 minutes to 23 minutes for a
real-world dump of over 50,000 tables.

Suggested-by: Tom Lane
Tested-by: Pierre Ducroquet
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_backup_archiver.c | 197 ++++++++-------------------
 1 file changed, 57 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4d83381d84..95b8f69ecf 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -111,16 +94,12 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -135,7 +114,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2384,7 +2363,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3984,7 +3963,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4027,24 +4006,26 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4058,7 +4039,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4068,7 +4049,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4079,7 +4060,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4093,7 +4074,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4122,10 +4103,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4225,80 +4206,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4318,17 +4228,24 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
+{
+	return TocEntrySizeCompareQsort(p1, p2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4344,38 +4261,38 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, te);
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.  Note that we do a
+	 * sequential scan through the heap nodes, so even though we will first
+	 * try to choose the highest-priority item, we might end up picking
+	 * something with a much lower priority.  However, it is expected that we
+	 * will typically be able to pick one of the first few items, which should
+	 * usually have a relatively high priority.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		TocEntry   *te = (TocEntry *) binaryheap_get_node(ready_heap, i);
 		bool		conflicts = false;
 
 		/*
@@ -4401,7 +4318,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_heap, i);
 		return te;
 	}
 
@@ -4447,7 +4364,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4465,7 +4382,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4708,11 +4625,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4730,18 +4647,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, otherte);
 		}
 	}
 }
-- 
2.25.1

v8-0004-Remove-open-coded-binary-heap-in-pg_dump_sort.c.patch (text/x-diff; charset=us-ascii)
From c0d0492ec8c553807d790923f1be8bcd5d258450 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 25 Jul 2023 11:18:52 -0700
Subject: [PATCH v8 4/4] Remove open-coded binary heap in pg_dump_sort.c.

Thanks to commit XXXXXXXXXX, binaryheap is available to frontend
code.  This commit replaces the open-coded heap implementation in
pg_dump_sort.c with a binaryheap, saving a few lines of code.

Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_dump_sort.c | 105 +++++++--------------------------
 1 file changed, 22 insertions(+), 83 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 523a19c155..8436c1d757 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -16,6 +16,7 @@
 #include "postgres_fe.h"
 
 #include "catalog/pg_class_d.h"
+#include "lib/binaryheap.h"
 #include "pg_backup_archiver.h"
 #include "pg_backup_utils.h"
 #include "pg_dump.h"
@@ -161,8 +162,6 @@ static bool TopoSort(DumpableObject **objs,
 					 int numObjs,
 					 DumpableObject **ordering,
 					 int *nOrdering);
-static void addHeapElement(int val, int *heap, int heapLength);
-static int	removeHeapElement(int *heap, int heapLength);
 static void findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs);
 static int	findLoop(DumpableObject *obj,
 					 DumpId startPoint,
@@ -174,6 +173,7 @@ static void repairDependencyLoop(DumpableObject **loop,
 								 int nLoop);
 static void describeDumpableObject(DumpableObject *obj,
 								   char *buf, int bufsize);
+static int	int_cmp(void *a, void *b, void *arg);
 
 
 /*
@@ -374,11 +374,10 @@ TopoSort(DumpableObject **objs,
 		 int *nOrdering)		/* output argument */
 {
 	DumpId		maxDumpId = getMaxDumpId();
-	int		   *pendingHeap;
+	binaryheap *pendingHeap;
 	int		   *beforeConstraints;
 	int		   *idMap;
 	DumpableObject *obj;
-	int			heapLength;
 	int			i,
 				j,
 				k;
@@ -403,7 +402,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = (int *) pg_malloc(numObjs * sizeof(int));
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
@@ -434,19 +433,16 @@ TopoSort(DumpableObject **objs,
 	 * Now initialize the heap of items-ready-to-output by filling it with the
 	 * indexes of items that already have beforeConstraints[id] == 0.
 	 *
-	 * The essential property of a heap is heap[(j-1)/2] >= heap[j] for each j
-	 * in the range 1..heapLength-1 (note we are using 0-based subscripts
-	 * here, while the discussion in Knuth assumes 1-based subscripts). So, if
-	 * we simply enter the indexes into pendingHeap[] in decreasing order, we
-	 * a-fortiori have the heap invariant satisfied at completion of this
-	 * loop, and don't need to do any sift-up comparisons.
+	 * We enter the objects into pendingHeap in decreasing order so that the
+	 * heap invariant is satisfied at the completion of this loop.  This
+	 * reduces the amount of work that binaryheap_build() must do.
 	 */
-	heapLength = 0;
 	for (i = numObjs; --i >= 0;)
 	{
 		if (beforeConstraints[objs[i]->dumpId] == 0)
-			pendingHeap[heapLength++] = i;
+			binaryheap_add_unordered(pendingHeap, (void *) (intptr_t) i);
 	}
+	binaryheap_build(pendingHeap);
 
 	/*--------------------
 	 * Now emit objects, working backwards in the output list.  At each step,
@@ -464,10 +460,10 @@ TopoSort(DumpableObject **objs,
 	 *--------------------
 	 */
 	i = numObjs;
-	while (heapLength > 0)
+	while (!binaryheap_empty(pendingHeap))
 	{
 		/* Select object to output by removing largest heap member */
-		j = removeHeapElement(pendingHeap, heapLength--);
+		j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);
 		obj = objs[j];
 		/* Output candidate to ordering[] */
 		ordering[--i] = obj;
@@ -477,7 +473,7 @@ TopoSort(DumpableObject **objs,
 			int			id = obj->dependencies[k];
 
 			if ((--beforeConstraints[id]) == 0)
-				addHeapElement(idMap[id], pendingHeap, heapLength++);
+				binaryheap_add(pendingHeap, (void *) (intptr_t) idMap[id]);
 		}
 	}
 
@@ -497,79 +493,13 @@ TopoSort(DumpableObject **objs,
 	}
 
 	/* Done */
-	free(pendingHeap);
+	binaryheap_free(pendingHeap);
 	free(beforeConstraints);
 	free(idMap);
 
 	return (i == 0);
 }
 
-/*
- * Add an item to a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for increasing
- * its value after the call.  There must be sufficient storage at *heap.
- */
-static void
-addHeapElement(int val, int *heap, int heapLength)
-{
-	int			j;
-
-	/*
-	 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
-	 * using 1-based array indexes, not 0-based.
-	 */
-	j = heapLength;
-	while (j > 0)
-	{
-		int			i = (j - 1) >> 1;
-
-		if (val <= heap[i])
-			break;
-		heap[j] = heap[i];
-		j = i;
-	}
-	heap[j] = val;
-}
-
-/*
- * Remove the largest item present in a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for decreasing
- * its value after the call.
- *
- * We remove and return heap[0], which is always the largest element of
- * the heap, and then "sift up" to maintain the heap invariant.
- */
-static int
-removeHeapElement(int *heap, int heapLength)
-{
-	int			result = heap[0];
-	int			val;
-	int			i;
-
-	if (--heapLength <= 0)
-		return result;
-	val = heap[heapLength];		/* value that must be reinserted */
-	i = 0;						/* i is where the "hole" is */
-	for (;;)
-	{
-		int			j = 2 * i + 1;
-
-		if (j >= heapLength)
-			break;
-		if (j + 1 < heapLength &&
-			heap[j] < heap[j + 1])
-			j++;
-		if (val >= heap[j])
-			break;
-		heap[i] = heap[j];
-		i = j;
-	}
-	heap[i] = val;
-	return result;
-}
-
 /*
  * findDependencyLoops - identify loops in TopoSort's failure output,
  *		and pass each such loop to repairDependencyLoop() for action
@@ -1559,3 +1489,12 @@ describeDumpableObject(DumpableObject *obj, char *buf, int bufsize)
 			 (int) obj->objType,
 			 obj->dumpId, obj->catId.oid);
 }
+
+/*
+ * Compares "a" and "b" as integers.
+ */
+static int
+int_cmp(void *a, void *b, void *arg)
+{
+	return (int) (intptr_t) a - (int) (intptr_t) b;
+}
-- 
2.25.1

#28Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#27)
4 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Wed, Sep 13, 2023 at 11:34:50AM -0700, Nathan Bossart wrote:

On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:

Other than those nitpicks, I like v6. I'll mark this RfC.

Great. I've posted a v8 with your comments addressed in order to get one
more round of cfbot coverage. Assuming those tests pass and there is no
additional feedback, I'll plan on committing this in the next few days.

Upon closer inspection, I found a rather nasty problem. The qsort
comparator expects a TocEntry **, but the binaryheap comparator expects a
TocEntry *, and we simply pass the arguments through to the qsort
comparator. In v9, I added the requisite ampersands. I'm surprised this
worked at all. I'm planning to run some additional tests to make sure this
patch set works as expected.
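
To make the mismatch concrete, here is a minimal sketch of the two comparators
(using a simplified stand-in for TocEntry, not the real pg_dump definitions):
qsort hands its comparator pointers to the array elements, i.e. TocEntry **,
while the binaryheap stores TocEntry * values directly and passes those to its
comparator, so the shim has to take the addresses of its arguments before
delegating, as the v9 patch now does.

#include <stdlib.h>

/* Simplified stand-in for pg_dump's TocEntry, for illustration only. */
typedef struct TocEntry
{
	long		dataLength;
} TocEntry;

/* qsort comparator: p1/p2 point at array elements, so they are TocEntry **. */
static int
TocEntrySizeCompareQsort(const void *p1, const void *p2)
{
	const TocEntry *te1 = *(const TocEntry *const *) p1;
	const TocEntry *te2 = *(const TocEntry *const *) p2;

	/* sort larger entries first (simplified; the real comparator also breaks ties) */
	if (te1->dataLength > te2->dataLength)
		return -1;
	if (te1->dataLength < te2->dataLength)
		return 1;
	return 0;
}

/*
 * binaryheap comparator: the heap stores TocEntry * values directly, so
 * p1/p2 are already the entries.  Passing them straight through (as v8 did)
 * makes the qsort comparator dereference a TocEntry * as if it were a
 * TocEntry **; taking the addresses of the local parameters fixes that.
 */
static int
TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
{
	return TocEntrySizeCompareQsort(&p1, &p2);
}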

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v9-0001-Make-binaryheap-available-to-frontend-code.patch (text/x-diff; charset=us-ascii)
From df3d862907b29687503a8f362e0a3cc3dcbc3cc4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v9 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +++++++++++++++++-------
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 26 +++++++++++----
 src/tools/pgindent/typedefs.list         |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 90%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..39a8243a6d 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,16 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -138,10 +151,16 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -154,7 +173,7 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
+bh_node_type
 binaryheap_first(binaryheap *heap)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
@@ -169,10 +188,10 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
+bh_node_type
 binaryheap_remove_first(binaryheap *heap)
 {
-	Datum		result;
+	bh_node_type result;
 
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -204,7 +223,7 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -221,7 +240,7 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -232,7 +251,7 @@ sift_up(binaryheap *heap, int node_off)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		bh_node_type parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
@@ -264,7 +283,7 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..3647aeae65 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,11 +11,23 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+/*
+ * We provide a Datum-based API for backend code and a void *-based API for
+ * frontend code (since the Datum definitions are not available to frontend
+ * code).  You should typically avoid using bh_node_type directly and instead
+ * use Datum or void * as appropriate.
+ */
+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif
+
 /*
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
 /*
  * binaryheap
@@ -34,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
@@ -42,12 +54,12 @@ extern binaryheap *binaryheap_allocate(int capacity,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, bh_node_type d);
+extern bh_node_type binaryheap_first(binaryheap *heap);
+extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f3d8a2a855..b5bbdd1608 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3198,6 +3198,7 @@ bbstreamer_tar_archiver
 bbstreamer_tar_parser
 bbstreamer_zstd_frame
 bgworker_main_type
+bh_node_type
 binaryheap
 binaryheap_comparator
 bitmapword
-- 
2.25.1
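
[Editorial aside: for readers less familiar with the void *-based frontend API
that this patch exposes, a minimal usage sketch follows.  It is hypothetical
code, not part of the patch; the names heap_demo and int_node_cmp are made up
for illustration, though the comparator mirrors the int_cmp added to
pg_dump_sort.c later in this series.]

#include "postgres_fe.h"

#include "lib/binaryheap.h"

/* max-heap comparator for ints encoded as pointers (frontend bh_node_type is void *) */
static int
int_node_cmp(void *a, void *b, void *arg)
{
	return (int) (intptr_t) a - (int) (intptr_t) b;
}

static void
heap_demo(void)
{
	binaryheap *h = binaryheap_allocate(3, int_node_cmp, NULL);

	/* bulk-load without maintaining the heap property, then heapify once */
	binaryheap_add_unordered(h, (void *) (intptr_t) 2);
	binaryheap_add_unordered(h, (void *) (intptr_t) 7);
	binaryheap_add_unordered(h, (void *) (intptr_t) 5);
	binaryheap_build(h);

	/* removes the largest remaining element each time: prints 7, 5, 2 */
	while (!binaryheap_empty(h))
		printf("%d\n", (int) (intptr_t) binaryheap_remove_first(h));

	binaryheap_free(h);
}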

v9-0002-Add-function-for-removing-arbitrary-nodes-in-bina.patch (text/x-diff; charset=us-ascii)
From 020938f83ba2c6e8a121b4a92b52c25cc5a336a2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v9 2/4] Add function for removing arbitrary nodes in
 binaryheap.

This commit introduces binaryheap_remove_node(), which can be used
to remove any node from a binary heap.  The implementation is
straightforward.  The target node is replaced with the last node in
the heap, and then we sift as needed to preserve the heap property.
This new function is intended for use in a follow-up commit that
will improve the performance of pg_restore.

Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/common/binaryheap.c      | 29 +++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 32 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 39a8243a6d..19e095f1fb 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -215,6 +215,35 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the nth (zero based) node from the heap.  The caller must ensure
+ * that there are at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 3647aeae65..9525dcaec4 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -59,8 +59,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v9-0003-Convert-pg_restore-s-ready_list-to-a-priority-que.patch (text/x-diff; charset=us-ascii)
From 77fe22011ad412dffc960408e28994bb66d0b06a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v9 3/4] Convert pg_restore's ready_list to a priority queue.

Presently, we spend a lot of time sorting this list so that we pick
the largest items first.  With many tables, this sorting can become
a significant bottleneck.  There are a couple of reports from the
field about this, and it is easily reproducible, so this is not a
hypothetical issue.

This commit improves the performance of pg_restore with many tables
by converting its ready_list to a priority queue, i.e., a binary
heap.  We will first try to run the highest priority item, but if
it cannot be chosen due to the lock heuristic, we'll do a
sequential scan through the heap nodes until we find one that is
runnable.  This means that we might end up picking an item with
much lower priority, but since we expect that we'll typically be
able to choose one of the first few nodes, we should usually pick
an item with a relatively high priority.

On my machine, a basic test with 100,000 tables takes 11.5 minutes
without this patch and 1.5 minutes with it.  Pierre Ducroquet
claims to see a speedup from 30 minutes to 23 minutes for a
real-world dump of over 50,000 tables.

Suggested-by: Tom Lane
Tested-by: Pierre Ducroquet
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_backup_archiver.c | 197 ++++++++-------------------
 1 file changed, 57 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4d83381d84..ea17ca4559 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -111,16 +94,12 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -135,7 +114,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2384,7 +2363,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3984,7 +3963,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4027,24 +4006,26 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4058,7 +4039,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4068,7 +4049,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4079,7 +4060,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4093,7 +4074,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4122,10 +4103,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4225,80 +4206,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4318,17 +4228,24 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
+{
+	return TocEntrySizeCompareQsort(&p1, &p2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4344,38 +4261,38 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, te);
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.  Note that we do a
+	 * sequential scan through the heap nodes, so even though we will first
+	 * try to choose the highest-priority item, we might end up picking
+	 * something with a much lower priority.  However, it is expected that we
+	 * will typically be able to pick one of the first few items, which should
+	 * usually have a relatively high priority.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		TocEntry   *te = (TocEntry *) binaryheap_get_node(ready_heap, i);
 		bool		conflicts = false;
 
 		/*
@@ -4401,7 +4318,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_heap, i);
 		return te;
 	}
 
@@ -4447,7 +4364,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4465,7 +4382,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4708,11 +4625,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4730,18 +4647,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, otherte);
 		}
 	}
 }
-- 
2.25.1

v9-0004-Remove-open-coded-binary-heap-in-pg_dump_sort.c.patch (text/x-diff; charset=us-ascii)
From dc849256ad09acd7b85d8f39ad0a837c5f9fb0c1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 25 Jul 2023 11:18:52 -0700
Subject: [PATCH v9 4/4] Remove open-coded binary heap in pg_dump_sort.c.

Thanks to commit XXXXXXXXXX, binaryheap is available to frontend
code.  This commit replaces the open-coded heap implementation in
pg_dump_sort.c with a binaryheap, saving a few lines of code.

Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_dump_sort.c | 105 +++++++--------------------------
 1 file changed, 22 insertions(+), 83 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 523a19c155..8436c1d757 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -16,6 +16,7 @@
 #include "postgres_fe.h"
 
 #include "catalog/pg_class_d.h"
+#include "lib/binaryheap.h"
 #include "pg_backup_archiver.h"
 #include "pg_backup_utils.h"
 #include "pg_dump.h"
@@ -161,8 +162,6 @@ static bool TopoSort(DumpableObject **objs,
 					 int numObjs,
 					 DumpableObject **ordering,
 					 int *nOrdering);
-static void addHeapElement(int val, int *heap, int heapLength);
-static int	removeHeapElement(int *heap, int heapLength);
 static void findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs);
 static int	findLoop(DumpableObject *obj,
 					 DumpId startPoint,
@@ -174,6 +173,7 @@ static void repairDependencyLoop(DumpableObject **loop,
 								 int nLoop);
 static void describeDumpableObject(DumpableObject *obj,
 								   char *buf, int bufsize);
+static int	int_cmp(void *a, void *b, void *arg);
 
 
 /*
@@ -374,11 +374,10 @@ TopoSort(DumpableObject **objs,
 		 int *nOrdering)		/* output argument */
 {
 	DumpId		maxDumpId = getMaxDumpId();
-	int		   *pendingHeap;
+	binaryheap *pendingHeap;
 	int		   *beforeConstraints;
 	int		   *idMap;
 	DumpableObject *obj;
-	int			heapLength;
 	int			i,
 				j,
 				k;
@@ -403,7 +402,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = (int *) pg_malloc(numObjs * sizeof(int));
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
@@ -434,19 +433,16 @@ TopoSort(DumpableObject **objs,
 	 * Now initialize the heap of items-ready-to-output by filling it with the
 	 * indexes of items that already have beforeConstraints[id] == 0.
 	 *
-	 * The essential property of a heap is heap[(j-1)/2] >= heap[j] for each j
-	 * in the range 1..heapLength-1 (note we are using 0-based subscripts
-	 * here, while the discussion in Knuth assumes 1-based subscripts). So, if
-	 * we simply enter the indexes into pendingHeap[] in decreasing order, we
-	 * a-fortiori have the heap invariant satisfied at completion of this
-	 * loop, and don't need to do any sift-up comparisons.
+	 * We enter the objects into pendingHeap in decreasing order so that the
+	 * heap invariant is satisfied at the completion of this loop.  This
+	 * reduces the amount of work that binaryheap_build() must do.
 	 */
-	heapLength = 0;
 	for (i = numObjs; --i >= 0;)
 	{
 		if (beforeConstraints[objs[i]->dumpId] == 0)
-			pendingHeap[heapLength++] = i;
+			binaryheap_add_unordered(pendingHeap, (void *) (intptr_t) i);
 	}
+	binaryheap_build(pendingHeap);
 
 	/*--------------------
 	 * Now emit objects, working backwards in the output list.  At each step,
@@ -464,10 +460,10 @@ TopoSort(DumpableObject **objs,
 	 *--------------------
 	 */
 	i = numObjs;
-	while (heapLength > 0)
+	while (!binaryheap_empty(pendingHeap))
 	{
 		/* Select object to output by removing largest heap member */
-		j = removeHeapElement(pendingHeap, heapLength--);
+		j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);
 		obj = objs[j];
 		/* Output candidate to ordering[] */
 		ordering[--i] = obj;
@@ -477,7 +473,7 @@ TopoSort(DumpableObject **objs,
 			int			id = obj->dependencies[k];
 
 			if ((--beforeConstraints[id]) == 0)
-				addHeapElement(idMap[id], pendingHeap, heapLength++);
+				binaryheap_add(pendingHeap, (void *) (intptr_t) idMap[id]);
 		}
 	}
 
@@ -497,79 +493,13 @@ TopoSort(DumpableObject **objs,
 	}
 
 	/* Done */
-	free(pendingHeap);
+	binaryheap_free(pendingHeap);
 	free(beforeConstraints);
 	free(idMap);
 
 	return (i == 0);
 }
 
-/*
- * Add an item to a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for increasing
- * its value after the call.  There must be sufficient storage at *heap.
- */
-static void
-addHeapElement(int val, int *heap, int heapLength)
-{
-	int			j;
-
-	/*
-	 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
-	 * using 1-based array indexes, not 0-based.
-	 */
-	j = heapLength;
-	while (j > 0)
-	{
-		int			i = (j - 1) >> 1;
-
-		if (val <= heap[i])
-			break;
-		heap[j] = heap[i];
-		j = i;
-	}
-	heap[j] = val;
-}
-
-/*
- * Remove the largest item present in a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for decreasing
- * its value after the call.
- *
- * We remove and return heap[0], which is always the largest element of
- * the heap, and then "sift up" to maintain the heap invariant.
- */
-static int
-removeHeapElement(int *heap, int heapLength)
-{
-	int			result = heap[0];
-	int			val;
-	int			i;
-
-	if (--heapLength <= 0)
-		return result;
-	val = heap[heapLength];		/* value that must be reinserted */
-	i = 0;						/* i is where the "hole" is */
-	for (;;)
-	{
-		int			j = 2 * i + 1;
-
-		if (j >= heapLength)
-			break;
-		if (j + 1 < heapLength &&
-			heap[j] < heap[j + 1])
-			j++;
-		if (val >= heap[j])
-			break;
-		heap[i] = heap[j];
-		i = j;
-	}
-	heap[i] = val;
-	return result;
-}
-
 /*
  * findDependencyLoops - identify loops in TopoSort's failure output,
  *		and pass each such loop to repairDependencyLoop() for action
@@ -1559,3 +1489,12 @@ describeDumpableObject(DumpableObject *obj, char *buf, int bufsize)
 			 (int) obj->objType,
 			 obj->dumpId, obj->catId.oid);
 }
+
+/*
+ * Compares "a" and "b" as integers.
+ */
+static int
+int_cmp(void *a, void *b, void *arg)
+{
+	return (int) (intptr_t) a - (int) (intptr_t) b;
+}
-- 
2.25.1

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#28)
Re: Inefficiency in parallel pg_restore with many tables

Nathan Bossart <nathandbossart@gmail.com> writes:

Upon closer inspection, I found a rather nasty problem. The qsort
comparator expects a TocEntry **, but the binaryheap comparator expects a
TocEntry *, and we simply pass the arguments through to the qsort
comparator. In v9, I added the requisite ampersands.

Ooops :-(

I'm surprised this worked at all.

Probably it was not sorting things appropriately. Might be worth adding
some test scaffolding to check that bigger tasks are chosen before
smaller ones.
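
Something as simple as a temporary pg_log_info() in pop_next_work_item,
just before the chosen entry is returned, would let you eyeball whether
larger entries are generally dispatched ahead of smaller ones (a
throwaway sketch, not anything to commit):

	/* throwaway scaffolding: report the size of each item we pick */
	pg_log_info("popping item %d with dataLength %lld",
				te->dumpId, (long long) te->dataLength);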

regards, tom lane

#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#29)
4 attachment(s)
Re: Inefficiency in parallel pg_restore with many tables

On Wed, Sep 13, 2023 at 08:01:39PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

Upon closer inspection, I found a rather nasty problem. The qsort
comparator expects a TocEntry **, but the binaryheap comparator expects a
TocEntry *, and we simply pass the arguments through to the qsort
comparator. In v9, I added the requisite ampersands.

Ooops :-(

I'm surprised this worked at all.

Probably it was not sorting things appropriately. Might be worth adding
some test scaffolding to check that bigger tasks are chosen before
smaller ones.

Further testing revealed that the binaryheap comparator function was
actually generating a min-heap since the qsort comparator sorts by
decreasing dataLength. This is fixed in v10. And I am 0 for 2 today...
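
For reference, the comparator in v10-0003 now takes the addresses of its
arguments (so the underlying qsort comparator really does see a
TocEntry **) and negates the result, turning the descending-dataLength
qsort order into a max-heap:

/* binaryheap comparator for sorting TocEntries by dataLength */
static int
TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
{
	/* return opposite of qsort comparator for max-heap */
	return -TocEntrySizeCompareQsort(&p1, &p2);
}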

Now that this appears to be functioning as expected, I see that the larger
entries are typically picked up earlier, but we do sometimes pick entries
quite a bit further down the list, as anticipated. The case I was testing
(10k tables with the number of rows equal to the table number) was much
faster with this patch (just over a minute) than without it (over 16
minutes).

Sincerest apologies for the noise.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v10-0001-Make-binaryheap-available-to-frontend-code.patch (text/x-diff; charset=us-ascii)
From 59498e233d725d5c427e58f700402908fce2a10f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v10 1/4] Make binaryheap available to frontend code.

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile                 |  1 -
 src/backend/lib/meson.build              |  1 -
 src/common/Makefile                      |  1 +
 src/{backend/lib => common}/binaryheap.c | 41 +++++++++++++++++-------
 src/common/meson.build                   |  1 +
 src/include/lib/binaryheap.h             | 26 +++++++++++----
 src/tools/pgindent/typedefs.list         |  1 +
 7 files changed, 52 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (90%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 90%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..39a8243a6d 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-------------------------------------------------------------------------
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include <math.h>
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,16 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -138,10 +151,16 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+	{
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
+	}
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -154,7 +173,7 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that this
  * routine is not used on an empty heap. Always O(1).
  */
-Datum
+bh_node_type
 binaryheap_first(binaryheap *heap)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
@@ -169,10 +188,10 @@ binaryheap_first(binaryheap *heap)
  * that this routine is not used on an empty heap. O(log n) worst
  * case.
  */
-Datum
+bh_node_type
 binaryheap_remove_first(binaryheap *heap)
 {
-	Datum		result;
+	bh_node_type result;
 
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -204,7 +223,7 @@ binaryheap_remove_first(binaryheap *heap)
  * sifting the new node down.
  */
 void
-binaryheap_replace_first(binaryheap *heap, Datum d)
+binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
@@ -221,7 +240,7 @@ binaryheap_replace_first(binaryheap *heap, Datum d)
 static void
 sift_up(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
@@ -232,7 +251,7 @@ sift_up(binaryheap *heap, int node_off)
 	{
 		int			cmp;
 		int			parent_off;
-		Datum		parent_val;
+		bh_node_type parent_val;
 
 		/*
 		 * If this node is smaller than its parent, the heap condition is
@@ -264,7 +283,7 @@ sift_up(binaryheap *heap, int node_off)
 static void
 sift_down(binaryheap *heap, int node_off)
 {
-	Datum		node_val = heap->bh_nodes[node_off];
+	bh_node_type node_val = heap->bh_nodes[node_off];
 
 	/*
 	 * Within the loop, the node_off'th array entry is a "hole" that
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..3b97497d1a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
 common_sources = files(
   'archive.c',
   'base64.c',
+  'binaryheap.c',
   'checksum_helper.c',
   'compression.c',
   'controldata_utils.c',
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 52f7b06b25..3647aeae65 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,11 +11,23 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+/*
+ * We provide a Datum-based API for backend code and a void *-based API for
+ * frontend code (since the Datum definitions are not available to frontend
+ * code).  You should typically avoid using bh_node_type directly and instead
+ * use Datum or void * as appropriate.
+ */
+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif
+
 /*
  * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
  * and >0 iff a > b.  For a min-heap, the conditions are reversed.
  */
-typedef int (*binaryheap_comparator) (Datum a, Datum b, void *arg);
+typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
 /*
  * binaryheap
@@ -34,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	Datum		bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
@@ -42,12 +54,12 @@ extern binaryheap *binaryheap_allocate(int capacity,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
-extern void binaryheap_add_unordered(binaryheap *heap, Datum d);
+extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
 extern void binaryheap_build(binaryheap *heap);
-extern void binaryheap_add(binaryheap *heap, Datum d);
-extern Datum binaryheap_first(binaryheap *heap);
-extern Datum binaryheap_remove_first(binaryheap *heap);
-extern void binaryheap_replace_first(binaryheap *heap, Datum d);
+extern void binaryheap_add(binaryheap *heap, bh_node_type d);
+extern bh_node_type binaryheap_first(binaryheap *heap);
+extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f3d8a2a855..b5bbdd1608 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3198,6 +3198,7 @@ bbstreamer_tar_archiver
 bbstreamer_tar_parser
 bbstreamer_zstd_frame
 bgworker_main_type
+bh_node_type
 binaryheap
 binaryheap_comparator
 bitmapword
-- 
2.25.1

v10-0002-Add-function-for-removing-arbitrary-nodes-in-bin.patch (text/x-diff; charset=us-ascii)
From 303cd18beaaf4a4133a0988d71bfea0bd2b4444b Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 09:52:20 -0700
Subject: [PATCH v10 2/4] Add function for removing arbitrary nodes in
 binaryheap.

This commit introduces binaryheap_remove_node(), which can be used
to remove any node from a binary heap.  The implementation is
straightforward.  The target node is replaced with the last node in
the heap, and then we sift as needed to preserve the heap property.
This new function is intended for use in a follow-up commit that
will improve the performance of pg_restore.

Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/common/binaryheap.c      | 29 +++++++++++++++++++++++++++++
 src/include/lib/binaryheap.h |  3 +++
 2 files changed, 32 insertions(+)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 39a8243a6d..19e095f1fb 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -215,6 +215,35 @@ binaryheap_remove_first(binaryheap *heap)
 	return result;
 }
 
+/*
+ * binaryheap_remove_node
+ *
+ * Removes the nth (zero based) node from the heap.  The caller must ensure
+ * that there are at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+binaryheap_remove_node(binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
 /*
  * binaryheap_replace_first
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 3647aeae65..9525dcaec4 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -59,8 +59,11 @@ extern void binaryheap_build(binaryheap *heap);
 extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
+extern void binaryheap_remove_node(binaryheap *heap, int n);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
+#define binaryheap_size(h)			((h)->bh_size)
+#define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
 
 #endif							/* BINARYHEAP_H */
-- 
2.25.1

v10-0003-Convert-pg_restore-s-ready_list-to-a-priority-qu.patch (text/x-diff; charset=us-ascii)
From 06dd45a66995cd10db0b9a752b9d601b053cd244 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Jul 2023 10:19:08 -0700
Subject: [PATCH v10 3/4] Convert pg_restore's ready_list to a priority queue.

Presently, we spend a lot of time sorting this list so that we pick
the largest items first.  With many tables, this sorting can become
a significant bottleneck.  There are a couple of reports from the
field about this, and it is easily reproducible, so this is not a
hypothetical issue.

This commit improves the performance of pg_restore with many tables
by converting its ready_list to a priority queue, i.e., a binary
heap.  We will first try to run the highest priority item, but if
it cannot be chosen due to the lock heuristic, we'll do a
sequential scan through the heap nodes until we find one that is
runnable.  This means that we might end up picking an item with
much lower priority, but since we expect that we'll typically be
able to choose one of the first few nodes, we should usually pick
an item with a relatively high priority.

On my machine, a basic test with 100,000 tables takes 11.5 minutes
without this patch and 1.5 minutes with it.  Pierre Ducroquet
claims to see a speedup from 30 minutes to 23 minutes for a
real-world dump of over 50,000 tables.

Suggested-by: Tom Lane
Tested-by: Pierre Ducroquet
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_backup_archiver.c | 198 ++++++++-------------------
 1 file changed, 58 insertions(+), 140 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4d83381d84..b30b49702d 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -34,6 +34,7 @@
 #include "compress_io.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
+#include "lib/binaryheap.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
@@ -44,24 +45,6 @@
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
 
-/*
- * State for tracking TocEntrys that are ready to process during a parallel
- * restore.  (This used to be a list, and we still call it that, though now
- * it's really an array so that we can apply qsort to it.)
- *
- * tes[] is sized large enough that we can't overrun it.
- * The valid entries are indexed first_te .. last_te inclusive.
- * We periodically sort the array to bring larger-by-dataLength entries to
- * the front; "sorted" is true if the valid entries are known sorted.
- */
-typedef struct _parallelReadyList
-{
-	TocEntry  **tes;			/* Ready-to-dump TocEntrys */
-	int			first_te;		/* index of first valid entry in tes[] */
-	int			last_te;		/* index of last valid entry in tes[] */
-	bool		sorted;			/* are valid entries currently sorted? */
-} ParallelReadyList;
-
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 							   const pg_compress_specification compression_spec,
@@ -111,16 +94,12 @@ static void restore_toc_entries_postfork(ArchiveHandle *AH,
 static void pending_list_header_init(TocEntry *l);
 static void pending_list_append(TocEntry *l, TocEntry *te);
 static void pending_list_remove(TocEntry *te);
-static void ready_list_init(ParallelReadyList *ready_list, int tocCount);
-static void ready_list_free(ParallelReadyList *ready_list);
-static void ready_list_insert(ParallelReadyList *ready_list, TocEntry *te);
-static void ready_list_remove(ParallelReadyList *ready_list, int i);
-static void ready_list_sort(ParallelReadyList *ready_list);
-static int	TocEntrySizeCompare(const void *p1, const void *p2);
-static void move_to_ready_list(TocEntry *pending_list,
-							   ParallelReadyList *ready_list,
+static int	TocEntrySizeCompareQsort(const void *p1, const void *p2);
+static int	TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg);
+static void move_to_ready_heap(TocEntry *pending_list,
+							   binaryheap *ready_heap,
 							   RestorePass pass);
-static TocEntry *pop_next_work_item(ParallelReadyList *ready_list,
+static TocEntry *pop_next_work_item(binaryheap *ready_heap,
 									ParallelState *pstate);
 static void mark_dump_job_done(ArchiveHandle *AH,
 							   TocEntry *te,
@@ -135,7 +114,7 @@ static bool has_lock_conflicts(TocEntry *te1, TocEntry *te2);
 static void repoint_table_dependencies(ArchiveHandle *AH);
 static void identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te);
 static void reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-								ParallelReadyList *ready_list);
+								binaryheap *ready_heap);
 static void mark_create_done(ArchiveHandle *AH, TocEntry *te);
 static void inhibit_data_for_failed_table(ArchiveHandle *AH, TocEntry *te);
 
@@ -2384,7 +2363,7 @@ WriteDataChunks(ArchiveHandle *AH, ParallelState *pstate)
 		}
 
 		if (ntes > 1)
-			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompare);
+			qsort(tes, ntes, sizeof(TocEntry *), TocEntrySizeCompareQsort);
 
 		for (int i = 0; i < ntes; i++)
 			DispatchJobForTocEntry(AH, pstate, tes[i], ACT_DUMP,
@@ -3984,7 +3963,7 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
 
 			(void) restore_toc_entry(AH, next_work_item, false);
 
-			/* Reduce dependencies, but don't move anything to ready_list */
+			/* Reduce dependencies, but don't move anything to ready_heap */
 			reduce_dependencies(AH, next_work_item, NULL);
 		}
 		else
@@ -4027,24 +4006,26 @@ static void
 restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							 TocEntry *pending_list)
 {
-	ParallelReadyList ready_list;
+	binaryheap *ready_heap;
 	TocEntry   *next_work_item;
 
 	pg_log_debug("entering restore_toc_entries_parallel");
 
-	/* Set up ready_list with enough room for all known TocEntrys */
-	ready_list_init(&ready_list, AH->tocCount);
+	/* Set up ready_heap with enough room for all known TocEntrys */
+	ready_heap = binaryheap_allocate(AH->tocCount,
+									 TocEntrySizeCompareBinaryheap,
+									 NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
-	 * items that are available to process immediately into the ready_list.
+	 * items that are available to process immediately into the ready_heap.
 	 * After this setup, the pending list is everything that needs to be done
-	 * but is blocked by one or more dependencies, while the ready list
+	 * but is blocked by one or more dependencies, while the ready heap
 	 * contains items that have no remaining dependencies and are OK to
 	 * process in the current restore pass.
 	 */
 	AH->restorePass = RESTORE_PASS_MAIN;
-	move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+	move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 
 	/*
 	 * main parent loop
@@ -4058,7 +4039,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	for (;;)
 	{
 		/* Look for an item ready to be dispatched to a worker */
-		next_work_item = pop_next_work_item(&ready_list, pstate);
+		next_work_item = pop_next_work_item(ready_heap, pstate);
 		if (next_work_item != NULL)
 		{
 			/* If not to be restored, don't waste time launching a worker */
@@ -4068,7 +4049,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 							next_work_item->dumpId,
 							next_work_item->desc, next_work_item->tag);
 				/* Update its dependencies as though we'd completed it */
-				reduce_dependencies(AH, next_work_item, &ready_list);
+				reduce_dependencies(AH, next_work_item, ready_heap);
 				/* Loop around to see if anything else can be dispatched */
 				continue;
 			}
@@ -4079,7 +4060,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 
 			/* Dispatch to some worker */
 			DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
-								   mark_restore_job_done, &ready_list);
+								   mark_restore_job_done, ready_heap);
 		}
 		else if (IsEveryWorkerIdle(pstate))
 		{
@@ -4093,7 +4074,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 			/* Advance to next restore pass */
 			AH->restorePass++;
 			/* That probably allows some stuff to be made ready */
-			move_to_ready_list(pending_list, &ready_list, AH->restorePass);
+			move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
 			/* Loop around to see if anything's now ready */
 			continue;
 		}
@@ -4122,10 +4103,10 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 					   next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
 	}
 
-	/* There should now be nothing in ready_list. */
-	Assert(ready_list.first_te > ready_list.last_te);
+	/* There should now be nothing in ready_heap. */
+	Assert(binaryheap_empty(ready_heap));
 
-	ready_list_free(&ready_list);
+	binaryheap_free(ready_heap);
 
 	pg_log_info("finished main parallel loop");
 }
@@ -4225,80 +4206,9 @@ pending_list_remove(TocEntry *te)
 }
 
 
-/*
- * Initialize the ready_list with enough room for up to tocCount entries.
- */
-static void
-ready_list_init(ParallelReadyList *ready_list, int tocCount)
-{
-	ready_list->tes = (TocEntry **)
-		pg_malloc(tocCount * sizeof(TocEntry *));
-	ready_list->first_te = 0;
-	ready_list->last_te = -1;
-	ready_list->sorted = false;
-}
-
-/*
- * Free storage for a ready_list.
- */
-static void
-ready_list_free(ParallelReadyList *ready_list)
-{
-	pg_free(ready_list->tes);
-}
-
-/* Add te to the ready_list */
-static void
-ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
-{
-	ready_list->tes[++ready_list->last_te] = te;
-	/* List is (probably) not sorted anymore. */
-	ready_list->sorted = false;
-}
-
-/* Remove the i'th entry in the ready_list */
-static void
-ready_list_remove(ParallelReadyList *ready_list, int i)
-{
-	int			f = ready_list->first_te;
-
-	Assert(i >= f && i <= ready_list->last_te);
-
-	/*
-	 * In the typical case where the item to be removed is the first ready
-	 * entry, we need only increment first_te to remove it.  Otherwise, move
-	 * the entries before it to compact the list.  (This preserves sortedness,
-	 * if any.)  We could alternatively move the entries after i, but there
-	 * are typically many more of those.
-	 */
-	if (i > f)
-	{
-		TocEntry  **first_te_ptr = &ready_list->tes[f];
-
-		memmove(first_te_ptr + 1, first_te_ptr, (i - f) * sizeof(TocEntry *));
-	}
-	ready_list->first_te++;
-}
-
-/* Sort the ready_list into the desired order */
-static void
-ready_list_sort(ParallelReadyList *ready_list)
-{
-	if (!ready_list->sorted)
-	{
-		int			n = ready_list->last_te - ready_list->first_te + 1;
-
-		if (n > 1)
-			qsort(ready_list->tes + ready_list->first_te, n,
-				  sizeof(TocEntry *),
-				  TocEntrySizeCompare);
-		ready_list->sorted = true;
-	}
-}
-
 /* qsort comparator for sorting TocEntries by dataLength */
 static int
-TocEntrySizeCompare(const void *p1, const void *p2)
+TocEntrySizeCompareQsort(const void *p1, const void *p2)
 {
 	const TocEntry *te1 = *(const TocEntry *const *) p1;
 	const TocEntry *te2 = *(const TocEntry *const *) p2;
@@ -4318,17 +4228,25 @@ TocEntrySizeCompare(const void *p1, const void *p2)
 	return 0;
 }
 
+/* binaryheap comparator for sorting TocEntries by dataLength */
+static int
+TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
+{
+	/* return opposite of qsort comparator for max-heap */
+	return -TocEntrySizeCompareQsort(&p1, &p2);
+}
+
 
 /*
- * Move all immediately-ready items from pending_list to ready_list.
+ * Move all immediately-ready items from pending_list to ready_heap.
  *
  * Items are considered ready if they have no remaining dependencies and
  * they belong in the current restore pass.  (See also reduce_dependencies,
  * which applies the same logic one-at-a-time.)
  */
 static void
-move_to_ready_list(TocEntry *pending_list,
-				   ParallelReadyList *ready_list,
+move_to_ready_heap(TocEntry *pending_list,
+				   binaryheap *ready_heap,
 				   RestorePass pass)
 {
 	TocEntry   *te;
@@ -4344,38 +4262,38 @@ move_to_ready_list(TocEntry *pending_list,
 		{
 			/* Remove it from pending_list ... */
 			pending_list_remove(te);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, te);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, te);
 		}
 	}
 }
 
 /*
  * Find the next work item (if any) that is capable of being run now,
- * and remove it from the ready_list.
+ * and remove it from the ready_heap.
  *
  * Returns the item, or NULL if nothing is runnable.
  *
  * To qualify, the item must have no remaining dependencies
  * and no requirements for locks that are incompatible with
- * items currently running.  Items in the ready_list are known to have
+ * items currently running.  Items in the ready_heap are known to have
  * no remaining dependencies, but we have to check for lock conflicts.
  */
 static TocEntry *
-pop_next_work_item(ParallelReadyList *ready_list,
+pop_next_work_item(binaryheap *ready_heap,
 				   ParallelState *pstate)
 {
 	/*
-	 * Sort the ready_list so that we'll tackle larger jobs first.
-	 */
-	ready_list_sort(ready_list);
-
-	/*
-	 * Search the ready_list until we find a suitable item.
+	 * Search the ready_heap until we find a suitable item.  Note that we do a
+	 * sequential scan through the heap nodes, so even though we will first
+	 * try to choose the highest-priority item, we might end up picking
+	 * something with a much lower priority.  However, it is expected that we
+	 * will typically be able to pick one of the first few items, which should
+	 * usually have a relatively high priority.
 	 */
-	for (int i = ready_list->first_te; i <= ready_list->last_te; i++)
+	for (int i = 0; i < binaryheap_size(ready_heap); i++)
 	{
-		TocEntry   *te = ready_list->tes[i];
+		TocEntry   *te = (TocEntry *) binaryheap_get_node(ready_heap, i);
 		bool		conflicts = false;
 
 		/*
@@ -4401,7 +4319,7 @@ pop_next_work_item(ParallelReadyList *ready_list,
 			continue;
 
 		/* passed all tests, so this item can run */
-		ready_list_remove(ready_list, i);
+		binaryheap_remove_node(ready_heap, i);
 		return te;
 	}
 
@@ -4447,7 +4365,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 					  int status,
 					  void *callback_data)
 {
-	ParallelReadyList *ready_list = (ParallelReadyList *) callback_data;
+	binaryheap *ready_heap = (binaryheap *) callback_data;
 
 	pg_log_info("finished item %d %s %s",
 				te->dumpId, te->desc, te->tag);
@@ -4465,7 +4383,7 @@ mark_restore_job_done(ArchiveHandle *AH,
 		pg_fatal("worker process failed: exit code %d",
 				 status);
 
-	reduce_dependencies(AH, te, ready_list);
+	reduce_dependencies(AH, te, ready_heap);
 }
 
 
@@ -4708,11 +4626,11 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 /*
  * Remove the specified TOC entry from the depCounts of items that depend on
  * it, thereby possibly making them ready-to-run.  Any pending item that
- * becomes ready should be moved to the ready_list, if that's provided.
+ * becomes ready should be moved to the ready_heap, if that's provided.
  */
 static void
 reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
-					ParallelReadyList *ready_list)
+					binaryheap *ready_heap)
 {
 	int			i;
 
@@ -4730,18 +4648,18 @@ reduce_dependencies(ArchiveHandle *AH, TocEntry *te,
 		 * the current restore pass, and it is currently a member of the
 		 * pending list (that check is needed to prevent double restore in
 		 * some cases where a list-file forces out-of-order restoring).
-		 * However, if ready_list == NULL then caller doesn't want any list
+		 * However, if ready_heap == NULL then caller doesn't want any list
 		 * memberships changed.
 		 */
 		if (otherte->depCount == 0 &&
 			_tocEntryRestorePass(otherte) == AH->restorePass &&
 			otherte->pending_prev != NULL &&
-			ready_list != NULL)
+			ready_heap != NULL)
 		{
 			/* Remove it from pending list ... */
 			pending_list_remove(otherte);
-			/* ... and add to ready_list */
-			ready_list_insert(ready_list, otherte);
+			/* ... and add to ready_heap */
+			binaryheap_add(ready_heap, otherte);
 		}
 	}
 }
-- 
2.25.1

v10-0004-Remove-open-coded-binary-heap-in-pg_dump_sort.c.patch (text/x-diff; charset=us-ascii)
From 60ae7a331e6445f9630695b1a13d64dd4287ce62 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 25 Jul 2023 11:18:52 -0700
Subject: [PATCH v10 4/4] Remove open-coded binary heap in pg_dump_sort.c.

Thanks to commit XXXXXXXXXX, binaryheap is available to frontend
code.  This commit replaces the open-coded heap implementation in
pg_dump_sort.c with a binaryheap, saving a few lines of code.

Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/bin/pg_dump/pg_dump_sort.c | 105 +++++++--------------------------
 1 file changed, 22 insertions(+), 83 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 523a19c155..8436c1d757 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -16,6 +16,7 @@
 #include "postgres_fe.h"
 
 #include "catalog/pg_class_d.h"
+#include "lib/binaryheap.h"
 #include "pg_backup_archiver.h"
 #include "pg_backup_utils.h"
 #include "pg_dump.h"
@@ -161,8 +162,6 @@ static bool TopoSort(DumpableObject **objs,
 					 int numObjs,
 					 DumpableObject **ordering,
 					 int *nOrdering);
-static void addHeapElement(int val, int *heap, int heapLength);
-static int	removeHeapElement(int *heap, int heapLength);
 static void findDependencyLoops(DumpableObject **objs, int nObjs, int totObjs);
 static int	findLoop(DumpableObject *obj,
 					 DumpId startPoint,
@@ -174,6 +173,7 @@ static void repairDependencyLoop(DumpableObject **loop,
 								 int nLoop);
 static void describeDumpableObject(DumpableObject *obj,
 								   char *buf, int bufsize);
+static int	int_cmp(void *a, void *b, void *arg);
 
 
 /*
@@ -374,11 +374,10 @@ TopoSort(DumpableObject **objs,
 		 int *nOrdering)		/* output argument */
 {
 	DumpId		maxDumpId = getMaxDumpId();
-	int		   *pendingHeap;
+	binaryheap *pendingHeap;
 	int		   *beforeConstraints;
 	int		   *idMap;
 	DumpableObject *obj;
-	int			heapLength;
 	int			i,
 				j,
 				k;
@@ -403,7 +402,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = (int *) pg_malloc(numObjs * sizeof(int));
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
@@ -434,19 +433,16 @@ TopoSort(DumpableObject **objs,
 	 * Now initialize the heap of items-ready-to-output by filling it with the
 	 * indexes of items that already have beforeConstraints[id] == 0.
 	 *
-	 * The essential property of a heap is heap[(j-1)/2] >= heap[j] for each j
-	 * in the range 1..heapLength-1 (note we are using 0-based subscripts
-	 * here, while the discussion in Knuth assumes 1-based subscripts). So, if
-	 * we simply enter the indexes into pendingHeap[] in decreasing order, we
-	 * a-fortiori have the heap invariant satisfied at completion of this
-	 * loop, and don't need to do any sift-up comparisons.
+	 * We enter the objects into pendingHeap in decreasing order so that the
+	 * heap invariant is satisfied at the completion of this loop.  This
+	 * reduces the amount of work that binaryheap_build() must do.
 	 */
-	heapLength = 0;
 	for (i = numObjs; --i >= 0;)
 	{
 		if (beforeConstraints[objs[i]->dumpId] == 0)
-			pendingHeap[heapLength++] = i;
+			binaryheap_add_unordered(pendingHeap, (void *) (intptr_t) i);
 	}
+	binaryheap_build(pendingHeap);
 
 	/*--------------------
 	 * Now emit objects, working backwards in the output list.  At each step,
@@ -464,10 +460,10 @@ TopoSort(DumpableObject **objs,
 	 *--------------------
 	 */
 	i = numObjs;
-	while (heapLength > 0)
+	while (!binaryheap_empty(pendingHeap))
 	{
 		/* Select object to output by removing largest heap member */
-		j = removeHeapElement(pendingHeap, heapLength--);
+		j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);
 		obj = objs[j];
 		/* Output candidate to ordering[] */
 		ordering[--i] = obj;
@@ -477,7 +473,7 @@ TopoSort(DumpableObject **objs,
 			int			id = obj->dependencies[k];
 
 			if ((--beforeConstraints[id]) == 0)
-				addHeapElement(idMap[id], pendingHeap, heapLength++);
+				binaryheap_add(pendingHeap, (void *) (intptr_t) idMap[id]);
 		}
 	}
 
@@ -497,79 +493,13 @@ TopoSort(DumpableObject **objs,
 	}
 
 	/* Done */
-	free(pendingHeap);
+	binaryheap_free(pendingHeap);
 	free(beforeConstraints);
 	free(idMap);
 
 	return (i == 0);
 }
 
-/*
- * Add an item to a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for increasing
- * its value after the call.  There must be sufficient storage at *heap.
- */
-static void
-addHeapElement(int val, int *heap, int heapLength)
-{
-	int			j;
-
-	/*
-	 * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
-	 * using 1-based array indexes, not 0-based.
-	 */
-	j = heapLength;
-	while (j > 0)
-	{
-		int			i = (j - 1) >> 1;
-
-		if (val <= heap[i])
-			break;
-		heap[j] = heap[i];
-		j = i;
-	}
-	heap[j] = val;
-}
-
-/*
- * Remove the largest item present in a heap (priority queue)
- *
- * heapLength is the current heap size; caller is responsible for decreasing
- * its value after the call.
- *
- * We remove and return heap[0], which is always the largest element of
- * the heap, and then "sift up" to maintain the heap invariant.
- */
-static int
-removeHeapElement(int *heap, int heapLength)
-{
-	int			result = heap[0];
-	int			val;
-	int			i;
-
-	if (--heapLength <= 0)
-		return result;
-	val = heap[heapLength];		/* value that must be reinserted */
-	i = 0;						/* i is where the "hole" is */
-	for (;;)
-	{
-		int			j = 2 * i + 1;
-
-		if (j >= heapLength)
-			break;
-		if (j + 1 < heapLength &&
-			heap[j] < heap[j + 1])
-			j++;
-		if (val >= heap[j])
-			break;
-		heap[i] = heap[j];
-		i = j;
-	}
-	heap[i] = val;
-	return result;
-}
-
 /*
  * findDependencyLoops - identify loops in TopoSort's failure output,
  *		and pass each such loop to repairDependencyLoop() for action
@@ -1559,3 +1489,12 @@ describeDumpableObject(DumpableObject *obj, char *buf, int bufsize)
 			 (int) obj->objType,
 			 obj->dumpId, obj->catId.oid);
 }
+
+/*
+ * Compares "a" and "b" as integers.
+ */
+static int
+int_cmp(void *a, void *b, void *arg)
+{
+	return (int) (intptr_t) a - (int) (intptr_t) b;
+}
-- 
2.25.1

#31Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#30)
Re: Inefficiency in parallel pg_restore with many tables

For now, I've committed 0001 and 0002. I intend to commit the others soon.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#31)
Re: Inefficiency in parallel pg_restore with many tables

Nathan Bossart <nathandbossart@gmail.com> writes:

For now, I've committed 0001 and 0002. I intend to commit the others soon.

bowerbird is unhappy with this. I suppose you missed out updating
the src/tools/msvc/ scripts. (Weren't we about ready to nuke those?)

regards, tom lane

#33Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#32)
Re: Inefficiency in parallel pg_restore with many tables

On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

For now, I've committed 0001 and 0002. I intend to commit the others soon.

bowerbird is unhappy with this. I suppose you missed out updating
the src/tools/msvc/ scripts. (Weren't we about ready to nuke those?)

I saw that and have attempted to fix it with 83223f5. I'm still waiting
for an MSVC animal to report back.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#34Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#32)
Re: Inefficiency in parallel pg_restore with many tables

On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:

bowerbird is unhappy with this. I suppose you missed out updating
the src/tools/msvc/ scripts.
(Weren't we about ready to nuke those?)

hamerkop seems to be the only buildfarm member that would complain if
these were to be gone today, on top of bowerbird, of course.
--
Michael

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#33)
Re: Inefficiency in parallel pg_restore with many tables

Nathan Bossart <nathandbossart@gmail.com> writes:

On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:

bowerbird is unhappy with this. I suppose you missed out updating
the src/tools/msvc/ scripts. (Weren't we about ready to nuke those?)

I saw that and have attempted to fix it with 83223f5.

Ah, right, sorry for the noise.

But in any case, how long are we keeping src/tools/msvc/ ?

regards, tom lane

#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#35)
Re: Inefficiency in parallel pg_restore with many tables

On Mon, Sep 18, 2023 at 09:36:03PM -0400, Tom Lane wrote:

But in any case, how long are we keeping src/tools/msvc/ ?

From a skim of [0], it seems like it could be removed now. I see a couple
of work-in-progress patches from Andres [1] that would probably serve as a
good starting point. I won't have much time for this for the next few
weeks, so if someone else wants to pick it up, please feel free.

[0]: /messages/by-id/20230408191007.7lysd42euafwl74f@awork3.anarazel.de
[1]: https://github.com/anarazel/postgres/commits/drop-homegrown-msvc

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#37Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#31)
Re: Inefficiency in parallel pg_restore with many tables

On Mon, Sep 18, 2023 at 02:22:32PM -0700, Nathan Bossart wrote:

For now, I've committed 0001 and 0002. I intend to commit the others soon.

I've committed the rest of the patches.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com