Improve eviction algorithm in ReorderBuffer

Started by Masahiko Sawada, about 2 years ago. 93 messages.
#1Masahiko Sawada
sawada.mshk@gmail.com
1 attachment(s)

Hi all,

As the comment of ReorderBufferLargestTXN() says, it's very slow with
many subtransactions:

/*
* Find the largest transaction (toplevel or subxact) to evict (spill to disk).
*
* XXX With many subtransactions this might be quite slow, because we'll have
* to walk through all of them. There are some options how we could improve
* that: (a) maintain some secondary structure with transactions sorted by
* amount of changes, (b) not looking for the entirely largest transaction,
* but e.g. for transaction using at least some fraction of the memory limit,
* and (c) evicting multiple transactions at once, e.g. to free a given portion
* of the memory limit (e.g. 50%).
*/

This is because the reorderbuffer has transaction entries for each
top-level and sub transaction, and ReorderBufferLargestTXN() walks
through all transaction entries to pick the transaction to evict.
I've heard a report internally that replication lag became huge when
decoding transactions each consisting of 500k subtransactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.

Here is a test script for a many-subtransactions scenario. In my
environment, logical decoding took over 2 minutes to decode one
top-level transaction having 100k subtransactions.

-----
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select testfn(100000);
-- assumes a logical replication slot 's' already exists, e.g.:
-- select pg_create_logical_replication_slot('s', 'test_decoding');
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);
-----

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment: use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
like a bad idea to try to keep all transactions ordered by size, since
the size of each transaction can change frequently.

The attached patch uses a different approach that consists of three
strategies: (1) maintain a list of transactions whose size is larger
than 10% of logical_decoding_work_mem, and preferentially evict a
transaction from this list. If the list is empty, all transactions are
small enough, (2) so we evict the oldest top-level transaction from
the rb->toplevel_by_lsn list. Evicting older transactions helps free
memory blocks in the GenerationContext. Finally, if this is also
empty, (3) we evict a transaction whose size is > 0. Here, note that
even after a transaction is evicted, its ReorderBufferTXN entry is not
removed from rb->by_txn; its size is just 0. In the worst case, where
all (quite a few) transactions are smaller than 10% of the memory
limit, we might end up checking many transactions to find entries with
non-zero size to evict. So the patch adds a new list to maintain all
transactions that have at least one change in memory.

Summarizing the algorithm I've implemented in the patch,

1. pick a transaction from the list of large transactions (larger than
10% of memory limit).
2. pick a transaction from the top-level transaction list in LSN order.
3. pick a transaction from the list of transactions that have at least
one change in memory.

With the patch, the above test case completed within 3 seconds in my
environment.

As a side note, the idea (c) mentioned in the comment, evicting
multiple transactions at once to free a given portion of the memory,
would also help avoid bouncing back and forth across the memory
threshold. It's also worth considering.
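
A minimal sketch of what (c) could look like, reusing the picker from
the attached patch (the batch function itself is hypothetical and not
part of the patch):

/*
 * Hypothetical sketch of idea (c): keep evicting the picked transaction
 * until we have freed a given portion of logical_decoding_work_mem
 * (here, down to 50% of the limit), instead of stopping as soon as we
 * get just below the limit.
 */
static void
ReorderBufferEvictBatch(ReorderBuffer *rb)
{
	Size		target = logical_decoding_work_mem * 1024L / 2;

	while (rb->size > target)
	{
		ReorderBufferTXN *txn = ReorderBufferPickTXNToEvict(rb);

		ReorderBufferSerializeTXN(rb, txn);
		Assert(txn->size == 0);
	}
}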

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

improve_eviction_rb_poc.patch (application/octet-stream)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 12edc5772a..70068f6961 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -370,6 +370,8 @@ ReorderBufferAllocate(void)
 	dlist_init(&buffer->toplevel_by_lsn);
 	dlist_init(&buffer->txns_by_base_snapshot_lsn);
 	dclist_init(&buffer->catchange_txns);
+	dlist_init(&buffer->large_txns);
+	dlist_init(&buffer->mem_txns);
 
 	/*
 	 * Ensure there's no stale data from prior uses of this slot, in case some
@@ -1580,6 +1582,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	dlist_delete(&txn->node);
 	if (rbtxn_has_catalog_changes(txn))
 		dclist_delete_from(&rb->catchange_txns, &txn->catchange_node);
+	if (!dlist_node_is_detached(&txn->large_node))
+		dlist_delete_thoroughly(&txn->large_node);
+	if (!dlist_node_is_detached(&txn->mem_node))
+		dlist_delete_thoroughly(&txn->mem_node);
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn, &txn->xid, HASH_REMOVE, &found);
@@ -1710,6 +1716,11 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		txn->txn_flags |= RBTXN_IS_SERIALIZED_CLEAR;
 	}
 
+	if (!dlist_node_is_detached(&txn->large_node))
+		dlist_delete_thoroughly(&txn->large_node);
+	if (!dlist_node_is_detached(&txn->mem_node))
+		dlist_delete_thoroughly(&txn->mem_node);
+
 	/* also reset the number of entries in the transaction */
 	txn->nentries_mem = 0;
 	txn->nentries = 0;
@@ -3202,11 +3213,26 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	if (addition)
 	{
+		/* push the transaction to the in-memory transaction list */
+		if (txn->size == 0)
+		{
+			Assert(dlist_node_is_detached(&txn->mem_node));
+			dlist_push_tail(&rb->mem_txns, &txn->mem_node);
+		}
+
 		txn->size += sz;
 		rb->size += sz;
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/*
+		 * Push this transaction to the large-txn list if its size is at
+		 * least 10% of the logical_decoding_work_mem limit.
+		 */
+		if (dlist_node_is_detached(&txn->large_node) &&
+			txn->size >= (logical_decoding_work_mem * 1024L * 0.1))
+			dlist_push_tail(&rb->large_txns, &txn->large_node);
 	}
 	else
 	{
@@ -3214,6 +3240,13 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 		txn->size -= sz;
 		rb->size -= sz;
 
+		/* remove the transaction from the in-memory transaction list */
+		if (txn->size == 0)
+		{
+			Assert(!dlist_node_is_detached(&txn->mem_node));
+			dlist_delete_thoroughly(&txn->mem_node);
+		}
+
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
 	}
@@ -3472,38 +3505,43 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 }
 
 /*
- * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ * Find the transaction to evict (spill to disk).
  *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
+ * We use three strategies to pick the transaction to evict. (1) we try to pick
+ * a (top-level or sub) transaction from the large_txns list. If there is no
+ * such large transaction, (2) we try to get the oldest transaction. Since we're
+ * using a generational context to record changes (which usually represent 99%
+ * of the memory used during decoding), evicting transactions starting from the
+ * oldest actually helps in returning memory to the operating system. Since we
+ * use the toplevel_by_lsn list, only top-level transactions will be picked. If
+ * there is no such transaction either, (3) we pick a transaction that has at
+ * least one change in memory.
  */
 static ReorderBufferTXN *
-ReorderBufferLargestTXN(ReorderBuffer *rb)
+ReorderBufferPickTXNToEvict(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
-	ReorderBufferTXN *largest = NULL;
+	ReorderBufferTXN *txn = NULL;
+	dlist_iter iter;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	/* Pick a large (top-level or sub) transaction */
+	if (!dlist_is_empty(&rb->large_txns))
+		return dlist_head_element(ReorderBufferTXN, large_node, &rb->large_txns);
+
+	/* Pick the oldest top-level transaction */
+	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
-		ReorderBufferTXN *txn = ent->txn;
+		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		if (txn->size > 0)
+			return txn;
 	}
 
-	Assert(largest);
-	Assert(largest->size > 0);
-	Assert(largest->size <= rb->size);
+	/* Pick a transaction that has at least one change in memory */
+	txn = dlist_head_element(ReorderBufferTXN, mem_node, &rb->mem_txns);
 
-	return largest;
+	Assert(txn);
+	Assert(txn->size > 0);
+	return txn;
 }
 
 /*
@@ -3618,10 +3656,10 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		else
 		{
 			/*
-			 * Pick the largest transaction (or subtransaction) and evict it
-			 * from memory by serializing it to disk.
+			 * Pick the transaction (or subtransaction) and evict it from memory
+			 * by serializing it to disk.
 			 */
-			txn = ReorderBufferLargestTXN(rb);
+			txn = ReorderBufferPickTXNToEvict(rb);
 
 			/* we know there has to be one, because the size is not zero */
 			Assert(txn);
@@ -3637,6 +3675,7 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		 */
 		Assert(txn->size == 0);
 		Assert(txn->nentries_mem == 0);
+		Assert(dlist_node_is_detached(&txn->large_node));
 	}
 
 	/* We must be under the memory limit now. */
@@ -3726,6 +3765,15 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		UpdateDecodingStats((LogicalDecodingContext *) rb->private_data);
 	}
 
+	Assert(txn->size == 0);
+
+	if (!dlist_node_is_detached(&txn->large_node))
+		dlist_delete_thoroughly(&txn->large_node);
+
+	/* transaction does not have any changes in memory */
+	if (!dlist_node_is_detached(&txn->mem_node))
+		dlist_delete_thoroughly(&txn->mem_node);
+
 	Assert(spilled == txn->nentries_mem);
 	Assert(dlist_is_empty(&txn->changes));
 	txn->nentries_mem = 0;
@@ -4097,6 +4145,11 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/* update the decoding stats */
 	UpdateDecodingStats((LogicalDecodingContext *) rb->private_data);
 
+	if (!dlist_node_is_detached(&txn->large_node))
+		dlist_delete_thoroughly(&txn->large_node);
+	if (!dlist_node_is_detached(&txn->mem_node))
+		dlist_delete_thoroughly(&txn->mem_node);
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
@@ -4322,6 +4375,10 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		restored++;
 	}
 
+	/* If the transaction restored changes into memory, push it to the list */
+	if (txn->nentries_mem > 0 && dlist_node_is_detached(&txn->mem_node))
+		dlist_push_tail(&rb->mem_txns, &txn->mem_node);
+
 	return restored;
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index f986101e50..6ab8a398fd 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -420,6 +420,17 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	catchange_node;
 
+	/*
+	 * A node in the list of large transactions
+	 */
+	dlist_node	large_node;
+
+
+	/*
+	 * A node in the list of in-memory transactions
+	 */
+	dlist_node	mem_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -577,6 +588,18 @@ struct ReorderBuffer
 	 */
 	dclist_head catchange_txns;
 
+
+	/*
+	 * Transactions accounting for more than 10% of logical_decoding_work_mem
+	 * limit (*not* ordered by sizes).
+	 */
+	dlist_head	large_txns;
+
+	/*
+	 * Transactions having decoded changes in memory.
+	 */
+	dlist_head	mem_txns;
+
 	/*
 	 * one-entry sized cache for by_txn. Very frequently the same txn gets
 	 * looked up over and over again.
#2Dilip Kumar
dilipbalaut@gmail.com
In reply to: Masahiko Sawada (#1)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all,

As the comment of ReorderBufferLargestTXN() says, it's very slow with
many subtransactions:

/*
* Find the largest transaction (toplevel or subxact) to evict (spill to disk).
*
* XXX With many subtransactions this might be quite slow, because we'll have
* to walk through all of them. There are some options how we could improve
* that: (a) maintain some secondary structure with transactions sorted by
* amount of changes, (b) not looking for the entirely largest transaction,
* but e.g. for transaction using at least some fraction of the memory limit,
* and (c) evicting multiple transactions at once, e.g. to free a given portion
* of the memory limit (e.g. 50%).
*/

This is because the reorderbuffer has transaction entries for each
top-level and sub transaction, and ReorderBufferLargestTXN() walks
through all transaction entries to pick the transaction to evict.
I've heard the report internally that replication lag became huge when
decoding transactions each consisting of 500k sub transactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.

Here is a test script for a many subtransactions scenario. In my
environment, the logical decoding took over 2min to decode one top
transaction having 100k subtransactions.

-----
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select testfn(100000)
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null)
----

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment; use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
not a good idea to try maintaining all transactions by its size since
the size of each transaction could be changed frequently.

The attached patch uses a different approach that consists of three
strategies; (1) maintain the list of transactions whose size is larger
than 10% of logical_decoding_work_mem, and preferentially evict a
transaction from this list. If the list is empty, all transactions are
small enough, (2) so we evict the oldest top-level transaction from
rb->toplevel_by_lsn list. Evicting older transactions would help in
freeing memory blocks in GenerationContext. Finally, if this is also
empty, (3) we evict a transaction that size is > 0. Here, we need to
note the fact that even if a transaction is evicted the
ReorderBufferTXN entry is not removed from rb->by_txn but its size is
0. In the worst case where all (quite a few) transactions are smaller
than 10% of the memory limit, we might end up checking many
transactions to find non-zero size transaction entries to evict. So
the patch adds a new list to maintain all transactions that have at
least one change in memory.

Summarizing the algorithm I've implemented in the patch,

1. pick a transaction from the list of large transactions (larger than
10% of memory limit).
2. pick a transaction from the top-level transaction list in LSN order.
3. pick a transaction from the list of transactions that have at least
one change in memory.

With the patch, the above test case completed within 3 seconds in my
environment.

Thanks for working on this. I think it would be good to test other
scenarios as well where this might have some negative impact and see
where we stand. I mean:
1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now, to
come under the memory limit, instead of getting the 1 large
transaction evicted, we end up evicting multiple times.
2) Another scenario where all the transactions are under 10% of the
memory limit, but let's say some transactions are consuming around
8-9% of the memory limit each, and those are not very old
transactions, whereas there are certain old transactions which are
fairly small, consuming under 1% of the memory limit, and there are
many such transactions. So how would it affect things if we frequently
select many of these transactions to come under the memory limit
instead of selecting a couple of large transactions consuming 8-9%?

As a side note, the idea (c) mentioned in the comment, evicting
multiple transactions at once to free a given portion of the memory,
would also help in avoiding back and forth the memory threshold. It's
also worth considering.

Yes, I think it is worth considering.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#3Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Dilip Kumar (#2)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi all,

As the comment of ReorderBufferLargestTXN() says, it's very slow with
many subtransactions:

/*
* Find the largest transaction (toplevel or subxact) to evict (spill to disk).
*
* XXX With many subtransactions this might be quite slow, because we'll have
* to walk through all of them. There are some options how we could improve
* that: (a) maintain some secondary structure with transactions sorted by
* amount of changes, (b) not looking for the entirely largest transaction,
* but e.g. for transaction using at least some fraction of the memory limit,
* and (c) evicting multiple transactions at once, e.g. to free a given portion
* of the memory limit (e.g. 50%).
*/

This is because the reorderbuffer has transaction entries for each
top-level and sub transaction, and ReorderBufferLargestTXN() walks
through all transaction entries to pick the transaction to evict.
I've heard the report internally that replication lag became huge when
decoding transactions each consisting of 500k sub transactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.

Here is a test script for a many subtransactions scenario. In my
environment, the logical decoding took over 2min to decode one top
transaction having 100k subtransactions.

-----
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select testfn(100000)
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null)
----

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment; use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
not a good idea to try maintaining all transactions by its size since
the size of each transaction could be changed frequently.

The attached patch uses a different approach that consists of three
strategies; (1) maintain the list of transactions whose size is larger
than 10% of logical_decoding_work_mem, and preferentially evict a
transaction from this list. If the list is empty, all transactions are
small enough, (2) so we evict the oldest top-level transaction from
rb->toplevel_by_lsn list. Evicting older transactions would help in
freeing memory blocks in GenerationContext. Finally, if this is also
empty, (3) we evict a transaction that size is > 0. Here, we need to
note the fact that even if a transaction is evicted the
ReorderBufferTXN entry is not removed from rb->by_txn but its size is
0. In the worst case where all (quite a few) transactions are smaller
than 10% of the memory limit, we might end up checking many
transactions to find non-zero size transaction entries to evict. So
the patch adds a new list to maintain all transactions that have at
least one change in memory.

Summarizing the algorithm I've implemented in the patch,

1. pick a transaction from the list of large transactions (larger than
10% of memory limit).
2. pick a transaction from the top-level transaction list in LSN order.
3. pick a transaction from the list of transactions that have at least
one change in memory.

With the patch, the above test case completed within 3 seconds in my
environment.

Thanks for working on this, I think it would be good to test other
scenarios as well where this might have some negative impact and see
where we stand.

Agreed.

1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now for
coming under the memory limit instead of getting 1 large transaction
evicted out, we are evicting out multiple times.

Given the large transaction list will have up to 10 transactions, I
think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

2) Another scenario where all the transactions are under 10% of the
memory limit but let's say there are some transactions are consuming
around 8-9% of the memory limit each but those are not very old
transactions whereas there are certain old transactions which are
fairly small and consuming under 1% of memory limit and there are many
such transactions. So how it would affect if we frequently select
many of these transactions to come under memory limit instead of
selecting a couple of large transactions which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
more transactions are evicted, the less transactions the on-memory
transaction list has, unlike the current algorithm, we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists: transactions consuming 5% < and 5% >= of the memory
limit, and checking the 5% >= list preferably. The cost for
maintaining these lists could increase, though.

Do you have any ideas?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#4Dilip Kumar
dilipbalaut@gmail.com
In reply to: Masahiko Sawada (#3)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thanks for working on this, I think it would be good to test other
scenarios as well where this might have some negative impact and see
where we stand.

Agreed.

1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now for
coming under the memory limit instead of getting 1 large transaction
evicted out, we are evicting out multiple times.

Given the large transaction list will have up to 10 transactions, I
think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

Yeah, that makes sense.

2) Another scenario where all the transactions are under 10% of the
memory limit but let's say there are some transactions are consuming
around 8-9% of the memory limit each but those are not very old
transactions whereas there are certain old transactions which are
fairly small and consuming under 1% of memory limit and there are many
such transactions. So how it would affect if we frequently select
many of these transactions to come under memory limit instead of
selecting a couple of large transactions which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
more transactions are evicted, the less transactions the on-memory
transaction list has, unlike the current algorithm, we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists: transactions consuming 5% < and 5% >= of the memory
limit, and checking the 5% >= list preferably. The cost for
maintaining these lists could increase, though.

Do you have any ideas?

Yeah, something like what you mention might be good: we maintain 3
lists for large, medium, and small transactions. In the
large-transaction list, suppose we allow transactions that consume
more than 10%; there could be at most 10 such transactions, so we can
do a sequential search and spill the largest of all. Whereas in the
medium list, suppose we keep transactions ranging from e.g. 3-10%;
then it's just fine to pick from the head, because the size
differences between the largest and smallest transactions in this list
are not very significant. The rest go in the small-transaction list,
and from the small-transaction list we can choose to spill multiple
transactions at a time.
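
A rough sketch of such a bucketed pick, assuming three unordered lists
(the medium list and its dlist node are hypothetical names, not from
the patch):

static ReorderBufferTXN *
ReorderBufferPickFromBuckets(ReorderBuffer *rb)
{
	dlist_iter	iter;
	ReorderBufferTXN *largest = NULL;

	/* the large list has at most ~10 entries; a sequential scan is cheap */
	dlist_foreach(iter, &rb->large_txns)
	{
		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN,
												large_node, iter.cur);

		if (largest == NULL || txn->size > largest->size)
			largest = txn;
	}
	if (largest != NULL)
		return largest;

	/* medium-list entries have similar sizes; the head is good enough */
	if (!dlist_is_empty(&rb->medium_txns))
		return dlist_head_element(ReorderBufferTXN, medium_node,
								  &rb->medium_txns);

	/* otherwise, the caller may spill several small transactions at once */
	return dlist_head_element(ReorderBufferTXN, mem_node, &rb->mem_txns);
}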

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#3)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've heard the report internally that replication lag became huge when
decoding transactions each consisting of 500k sub transactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.

Can't you suggest them to use streaming mode to avoid this problem or
do you see some problem with that?

Here is a test script for a many subtransactions scenario. In my
environment, the logical decoding took over 2min to decode one top
transaction having 100k subtransactions.

-----
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select testfn(100000)
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null)
----

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment; use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
not a good idea to try maintaining all transactions by its size since
the size of each transaction could be changed frequently.

The attached patch uses a different approach that consists of three
strategies; (1) maintain the list of transactions whose size is larger
than 10% of logical_decoding_work_mem, and preferentially evict a
transaction from this list.

IIUC, you are giving preference to multiple list ideas as compared to
(a) because you don't need to adjust the list each time the
transaction size changes, is that right? If so, I think there is a
cost to keep that data structure up-to-date but it can help in
reducing the number of times we need to serialize.

If the list is empty, all transactions are

small enough, (2) so we evict the oldest top-level transaction from
rb->toplevel_by_lsn list. Evicting older transactions would help in
freeing memory blocks in GenerationContext. Finally, if this is also
empty, (3) we evict a transaction that size is > 0. Here, we need to
note the fact that even if a transaction is evicted the
ReorderBufferTXN entry is not removed from rb->by_txn but its size is
0. In the worst case where all (quite a few) transactions are smaller
than 10% of the memory limit, we might end up checking many
transactions to find non-zero size transaction entries to evict. So
the patch adds a new list to maintain all transactions that have at
least one change in memory.

Summarizing the algorithm I've implemented in the patch,

1. pick a transaction from the list of large transactions (larger than
10% of memory limit).
2. pick a transaction from the top-level transaction list in LSN order.
3. pick a transaction from the list of transactions that have at least
one change in memory.

With the patch, the above test case completed within 3 seconds in my
environment.

Thanks for working on this, I think it would be good to test other
scenarios as well where this might have some negative impact and see
where we stand.

Agreed.

1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now for
coming under the memory limit instead of getting 1 large transaction
evicted out, we are evicting out multiple times.

Given the large transaction list will have up to 10 transactions, I
think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

2) Another scenario where all the transactions are under 10% of the
memory limit but let's say there are some transactions are consuming
around 8-9% of the memory limit each but those are not very old
transactions whereas there are certain old transactions which are
fairly small and consuming under 1% of memory limit and there are many
such transactions. So how it would affect if we frequently select
many of these transactions to come under memory limit instead of
selecting a couple of large transactions which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
more transactions are evicted, the less transactions the on-memory
transaction list has, unlike the current algorithm, we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists: transactions consuming 5% < and 5% >= of the memory
limit, and checking the 5% >= list preferably.

Which memory limit are you referring to here? Is it logical_decoding_work_mem?

The cost for
maintaining these lists could increase, though.

Yeah, can't we maintain a single list of all xacts that are consuming
equal to or greater than the memory limit? Considering that the memory
limit is logical_decoding_work_mem, then I think just picking one
transaction to serialize would be sufficient.

--
With Regards,
Amit Kapila.

#6Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#5)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've heard the report internally that replication lag became huge when
decoding transactions each consisting of 500k sub transactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.

Can't you suggest them to use streaming mode to avoid this problem or
do you see some problem with that?

Yeah, that's one option. But I can suggest

Here is a test script for a many subtransactions scenario. In my
environment, the logical decoding took over 2min to decode one top
transaction having 100k subtransactions.

-----
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select testfn(100000)
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null)
----

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment; use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
not a good idea to try maintaining all transactions by its size since
the size of each transaction could be changed frequently.

The attached patch uses a different approach that consists of three
strategies; (1) maintain the list of transactions whose size is larger
than 10% of logical_decoding_work_mem, and preferentially evict a
transaction from this list.

IIUC, you are giving preference to multiple list ideas as compared to
(a) because you don't need to adjust the list each time the
transaction size changes, is that right?

Right.

If so, I think there is a
cost to keep that data structure up-to-date but it can help in
reducing the number of times we need to serialize.

Yes, there is a trade-off.

What I don't want to do is to keep all transactions ordered since it's
too costly. The proposed idea uses multiple lists to keep all
transactions roughly ordered. The maintenance cost would be cheap
since each list is unordered.

It might be a good idea to have a threshold to switch how to pick the
largest transaction based on the number of transactions in the
reorderbuffer. If there are many transactions, we can use the proposed
algorithm to find a possibly-largest transaction, otherwise use the
current way.
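
For instance, a rough sketch of such a switch (the threshold constant
is made up):

/*
 * Hypothetical switch: fall back to the current O(n) scan while the
 * number of transactions is small, and use the list-based picker from
 * the patch only beyond some threshold.
 */
#define EVICTION_SCAN_THRESHOLD		1024

static ReorderBufferTXN *
ReorderBufferChooseTXNToEvict(ReorderBuffer *rb)
{
	if (hash_get_num_entries(rb->by_txn) < EVICTION_SCAN_THRESHOLD)
		return ReorderBufferLargestTXN(rb);		/* current O(n) scan */

	return ReorderBufferPickTXNToEvict(rb);		/* proposed list-based pick */
}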

If the list is empty, all transactions are

small enough, (2) so we evict the oldest top-level transaction from
rb->toplevel_by_lsn list. Evicting older transactions would help in
freeing memory blocks in GenerationContext. Finally, if this is also
empty, (3) we evict a transaction that size is > 0. Here, we need to
note the fact that even if a transaction is evicted the
ReorderBufferTXN entry is not removed from rb->by_txn but its size is
0. In the worst case where all (quite a few) transactions are smaller
than 10% of the memory limit, we might end up checking many
transactions to find non-zero size transaction entries to evict. So
the patch adds a new list to maintain all transactions that have at
least one change in memory.

Summarizing the algorithm I've implemented in the patch,

1. pick a transaction from the list of large transactions (larger than
10% of memory limit).
2. pick a transaction from the top-level transaction list in LSN order.
3. pick a transaction from the list of transactions that have at least
one change in memory.

With the patch, the above test case completed within 3 seconds in my
environment.

Thanks for working on this, I think it would be good to test other
scenarios as well where this might have some negative impact and see
where we stand.

Agreed.

1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now for
coming under the memory limit instead of getting 1 large transaction
evicted out, we are evicting out multiple times.

Given the large transaction list will have up to 10 transactions, I
think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

2) Another scenario where all the transactions are under 10% of the
memory limit but let's say there are some transactions are consuming
around 8-9% of the memory limit each but those are not very old
transactions whereas there are certain old transactions which are
fairly small and consuming under 1% of memory limit and there are many
such transactions. So how it would affect if we frequently select
many of these transactions to come under memory limit instead of
selecting a couple of large transactions which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
more transactions are evicted, the less transactions the on-memory
transaction list has, unlike the current algorithm, we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists: transactions consuming 5% < and 5% >= of the memory
limit, and checking the 5% >= list preferably.

Which memory limit are you referring to here? Is it logical_decoding_work_mem?

logical_decoding_work_mem.

The cost for
maintaining these lists could increase, though.

Yeah, can't we maintain a single list of all xacts that are consuming
equal to or greater than the memory limit? Considering that the memory
limit is logical_decoding_work_mem, then I think just picking one
transaction to serialize would be sufficient.

IIUC we serialize a transaction when the sum of all transactions'
memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
In what cases are multiple transactions consuming equal to or greater
than the logical_decoding_work_mem?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#7Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Masahiko Sawada (#1)
Re: Improve eviction algorithm in ReorderBuffer

On 2023-Dec-12, Masahiko Sawada wrote:

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment; use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
not a good idea to try maintaining all transactions by its size since
the size of each transaction could be changed frequently.

Hmm, maybe you can just use binaryheap_add_unordered and just let the
sizes change, and do binaryheap_build() at the point where the eviction
is needed.
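
Something like this sketch, assuming the comparator conventions of
src/include/lib/binaryheap.h (names are illustrative):

/* compare transactions by current size; a max-heap puts the largest on top */
static int
txn_size_comparator(Datum a, Datum b, void *arg)
{
	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);

	if (ta->size < tb->size)
		return -1;
	if (ta->size > tb->size)
		return 1;
	return 0;
}

/* txns get binaryheap_add_unordered() once; sizes may then change freely */
static ReorderBufferTXN *
pick_largest_txn(binaryheap *heap)
{
	/* re-establish the heap property only when an eviction is needed */
	binaryheap_build(heap);

	return (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(heap));
}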

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No necesitamos banderas
No reconocemos fronteras" (Jorge González)

#8Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Alvaro Herrera (#7)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Dec 15, 2023 at 7:10 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2023-Dec-12, Masahiko Sawada wrote:

To deal with this problem, I initially thought of the idea (a)
mentioned in the comment; use a binary heap to maintain the
transactions sorted by the amount of changes or the size. But it seems
not a good idea to try maintaining all transactions by its size since
the size of each transaction could be changed frequently.

Hmm, maybe you can just use binaryheap_add_unordered and just let the
sizes change, and do binaryheap_build() at the point where the eviction
is needed.

I assume you mean to add ReorderBufferTXN entries to the binaryheap
and then build it by comparing their sizes (i.e. txn->size). But
ReorderBufferTXN entries are removed and deallocated once the
transaction finishes. How can we efficiently remove these entries from
the binaryheap? I guess it would be O(n) to find the entry among the
unordered entries, where n is the number of transactions in the
reorderbuffer.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#9Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#6)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Dec 15, 2023 at 2:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've heard the report internally that replication lag became huge when
decoding transactions each consisting of 500k sub transactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.

Can't you suggest them to use streaming mode to avoid this problem or
do you see some problem with that?

Yeah, that's one option. But I can suggest

Sorry, it was still in the middle of editing.

Yeah, that's one option. But since there is a trade-off I cannot
suggest using streaming mode for every user. Also, the logical
replication client (e.g., a third-party tool receiving the logical
change set) might not support streaming mode yet.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#10Euler Taveira
euler@eulerto.com
In reply to: Masahiko Sawada (#8)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote:

I assume you mean to add ReorderBufferTXN entries to the binaryheap
and then build it by comparing their sizes (i.e. txn->size). But
ReorderBufferTXN entries are removed and deallocated once the
transaction finished. How can we efficiently remove these entries from
binaryheap? I guess it would be O(n) to find the entry among the
unordered entries, where n is the number of transactions in the
reorderbuffer.

O(log n) for both functions: binaryheap_remove_first() and
binaryheap_remove_node(). I didn't read your patch but do you really need to
free entries one by one? If not, binaryheap_free().

--
Euler Taveira
EDB https://www.enterprisedb.com/

#11Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Euler Taveira (#10)
Re: Improve eviction algorithm in ReorderBuffer

On Sat, Dec 16, 2023 at 1:36 AM Euler Taveira <euler@eulerto.com> wrote:

On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote:

I assume you mean to add ReorderBufferTXN entries to the binaryheap
and then build it by comparing their sizes (i.e. txn->size). But
ReorderBufferTXN entries are removed and deallocated once the
transaction finished. How can we efficiently remove these entries from
binaryheap? I guess it would be O(n) to find the entry among the
unordered entries, where n is the number of transactions in the
reorderbuffer.

O(log n) for both functions: binaryheap_remove_first() and
binaryheap_remove_node().

Right. binaryheap_remove_first() removes the topmost entry in O(log
n), but the ReorderBufferTXN being removed is not necessarily the
topmost entry, since we remove the entry when the transaction
completes (committed or aborted). binaryheap_remove_node() removes the
entry at the given index in O(log n), but I'm not sure how we can know
the index of each entry. I think we can remember the index of a newly
added entry after calling binaryheap_add_unordered(), but once we call
binaryheap_build() the index is out of date. So I think that in the
worst case we would need to check all entries in order to remove an
arbitrary entry from the binaryheap, which is O(n). I might be missing
something though.
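
For example, with the binaryheap fields as declared in
src/include/lib/binaryheap.h, removing an arbitrary transaction would
look something like this (illustrative only):

static void
remove_txn_from_heap(binaryheap *heap, ReorderBufferTXN *txn)
{
	int			i;

	/* O(n): scan the unordered node array to find the entry's index */
	for (i = 0; i < heap->bh_size; i++)
	{
		if (DatumGetPointer(heap->bh_nodes[i]) == txn)
		{
			binaryheap_remove_node(heap, i);	/* O(log n) once found */
			return;
		}
	}
}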

I didn't read your patch but do you really need to
free entries one by one? If not, binaryheap_free().

The patch doesn't touch on how to free entries. ReorderBufferTXN
entries are freed one by one as each transaction completes (committed
or aborted).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#6)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

IIUC, you are giving preference to multiple list ideas as compared to
(a) because you don't need to adjust the list each time the
transaction size changes, is that right?

Right.

If so, I think there is a
cost to keep that data structure up-to-date but it can help in
reducing the number of times we need to serialize.

Yes, there is a trade-off.

What I don't want to do is to keep all transactions ordered since it's
too costly. The proposed idea uses multiple lists to keep all
transactions roughly ordered. The maintenance cost would be cheap
since each list is unordered.

It might be a good idea to have a threshold to switch how to pick the
largest transaction based on the number of transactions in the
reorderbuffer. If there are many transactions, we can use the proposed
algorithm to find a possibly-largest transaction, otherwise use the
current way.

Yeah, that makes sense.

1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now for
coming under the memory limit instead of getting 1 large transaction
evicted out, we are evicting out multiple times.

Given the large transaction list will have up to 10 transactions, I
think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

2) Another scenario where all the transactions are under 10% of the
memory limit but let's say there are some transactions are consuming
around 8-9% of the memory limit each but those are not very old
transactions whereas there are certain old transactions which are
fairly small and consuming under 1% of memory limit and there are many
such transactions. So how it would affect if we frequently select
many of these transactions to come under memory limit instead of
selecting a couple of large transactions which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
more transactions are evicted, the less transactions the on-memory
transaction list has, unlike the current algorithm, we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists: transactions consuming 5% < and 5% >= of the memory
limit, and checking the 5% >= list preferably.

Which memory limit are you referring to here? Is it logical_decoding_work_mem?

logical_decoding_work_mem.

The cost for
maintaining these lists could increase, though.

Yeah, can't we maintain a single list of all xacts that are consuming
equal to or greater than the memory limit? Considering that the memory
limit is logical_decoding_work_mem, then I think just picking one
transaction to serialize would be sufficient.

IIUC we serialize a transaction when the sum of all transactions'
memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
In what cases are multiple transactions consuming equal to or greater
than the logical_decoding_work_mem?

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

--
With Regards,
Amit Kapila.

#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#12)
Re: Improve eviction algorithm in ReorderBuffer

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

IIUC, you are giving preference to multiple list ideas as compared to
(a) because you don't need to adjust the list each time the
transaction size changes, is that right?

Right.

If so, I think there is a
cost to keep that data structure up-to-date but it can help in
reducing the number of times we need to serialize.

Yes, there is a trade-off.

What I don't want to do is to keep all transactions ordered since it's
too costly. The proposed idea uses multiple lists to keep all
transactions roughly ordered. The maintenance cost would be cheap
since each list is unordered.

It might be a good idea to have a threshold to switch how to pick the
largest transaction based on the number of transactions in the
reorderbuffer. If there are many transactions, we can use the proposed
algorithm to find a possibly-largest transaction, otherwise use the
current way.

Yeah, that makes sense.

1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now for
coming under the memory limit instead of getting 1 large transaction
evicted out, we are evicting out multiple times.

Given the large transaction list will have up to 10 transactions, I
think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

2) Another scenario where all the transactions are under 10% of the
memory limit but let's say there are some transactions are consuming
around 8-9% of the memory limit each but those are not very old
transactions whereas there are certain old transactions which are
fairly small and consuming under 1% of memory limit and there are many
such transactions. So how it would affect if we frequently select
many of these transactions to come under memory limit instead of
selecting a couple of large transactions which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
more transactions are evicted, the less transactions the on-memory
transaction list has, unlike the current algorithm, we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists: transactions consuming 5% < and 5% >= of the memory
limit, and checking the 5% >= list preferably.

Which memory limit are you referring to here? Is it logical_decoding_work_mem?

logical_decoding_work_mem.

The cost for
maintaining these lists could increase, though.

Yeah, can't we maintain a single list of all xacts that are consuming
equal to or greater than the memory limit? Considering that the memory
limit is logical_decoding_work_mem, then I think just picking one
transaction to serialize would be sufficient.

IIUC we serialize a transaction when the sum of all transactions'
memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
In what cases are multiple transactions consuming equal to or greater
than the logical_decoding_work_mem?

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seems to be the case.

I meant that there would be three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem, the second one maintains other transactions
consuming at least 5% of logical_decoding_work_mem, and the third one
maintains the remaining transactions consuming more than 0 and less
than 5% of logical_decoding_work_mem.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#13)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seems to be the case.

I wanted to mean that there are three lists in total: the first one
maintain the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

--
With Regards,
Amit Kapila.

#15Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#14)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seems to be the case.

I wanted to mean that there are three lists in total: the first one
maintain the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB; transactions
consuming more than 6.4MB are added to the list. So, for example, it's
possible that the list has three transactions, each consuming 10MB,
while the total memory usage in the reorderbuffer is still 30MB (less
than logical_decoding_work_mem).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#15)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seems to be the case.

I wanted to mean that there are three lists in total: the first one
maintain the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.

--
With Regards,
Amit Kapila.

#17Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#1)
Re: Improve eviction algorithm in ReorderBuffer

2024-01 Commitfest.

Hi, this patch has a CF status of "Needs Review" [1], but it seems
like there were some CFbot test failures the last time it was run [2].
Please have a look and post an updated version if necessary.

======
[1]: https://commitfest.postgresql.org/46/4699/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4699

Kind Regards,
Peter Smith.

#18Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#16)
4 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

What I meant is that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB; transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions, each consuming 10MB,
while the total memory usage in the reorderbuffer is still 30MB (less
than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead, unless we maintain them only once the number of open
transactions crosses a certain threshold.

On further analysis, I realized that the approach discussed here might
not be the way to go. The point of dividing transactions into several
subgroups is to split a large number of entries into smaller sets so
that we can reduce the complexity of searching for a particular entry.
Since we assume that there are no big differences in entry sizes
within a subgroup, we can pick the entry to evict in O(1). However,
what we really need to avoid here is increasing the number of
evictions, because serializing an entry to disk is generally more
costly than searching for an entry in memory.

I think that's not a problem for a large-entries subgroup, but for the
smallest-entries subgroup, e.g. entries consuming less than 5% of the
limit, it could end up evicting many entries. For example, there would
be a huge difference between serializing 1 entry consuming 5% of the
memory limit and serializing 5000 entries each consuming 0.001% of the
memory limit. Even if we can select those 5000 entries quickly, I
think the latter would be slower in total. And the more subgroups we
create, the more complex the algorithm gets and the more overhead it
could cause. So I think we need to search for the largest entry
anyway, in order to minimize the number of evictions.

Looking into data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove an arbitrary entry and to update an arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
know the entry's position just after binaryheap_add_unordered(), but
it might change after heapify, and searching for a node's position is
O(n). So the improvement idea is to add a hash table to the binaryheap
that tracks the position of each entry, so that we can remove an
arbitrary entry in O(log n) and also update an arbitrary entry's key
in O(log n). This is known as an indexed priority queue. I've attached
the patches for that (0001 and 0002).
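
To illustrate the mechanism, here is a minimal standalone sketch (this
is not the attached patch: for simplicity it stores the position
inside the node itself, whereas the patch tracks positions in a
separate hash table, since binaryheap cannot embed an index into the
caller's Datum):

/*
 * Indexed max-heap sketch: each node records its own position in the
 * heap array, so sifting after a key update needs no O(n) search.
 * No overflow checks, for brevity.
 */
#include <stddef.h>

typedef struct Txn
{
	size_t		size;		/* heap key: memory used by the transaction */
	int			heap_idx;	/* current position in heap[] */
} Txn;

#define MAX_TXNS 1024
static Txn *heap[MAX_TXNS];
static int	heap_size = 0;

static void
set_node(int i, Txn *t)
{
	heap[i] = t;
	t->heap_idx = i;		/* keep the index in sync on every move */
}

static void
sift_up(int i)
{
	while (i > 0 && heap[(i - 1) / 2]->size < heap[i]->size)
	{
		Txn		   *parent = heap[(i - 1) / 2];

		set_node((i - 1) / 2, heap[i]);
		set_node(i, parent);
		i = (i - 1) / 2;
	}
}

static void
heap_add(Txn *t)						/* O(log n) */
{
	set_node(heap_size++, t);
	sift_up(t->heap_idx);
}

static void
increase_key(Txn *t, size_t newsize)	/* O(log n), no O(n) search */
{
	t->size = newsize;
	sift_up(t->heap_idx);				/* position known via heap_idx */
}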

That way, in terms of the reorderbuffer, we can update and remove a
transaction's memory usage in O(log n) (worst case; O(1) on average)
and then pick the largest transaction in O(1). Since we might need to
call ReorderBufferSerializeTXN() even in the non-streaming case, we
need to maintain the binaryheap anyway. I've attached the patch for
that (0003).
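
The reorderbuffer-side bookkeeping then boils down to the following
(a condensed excerpt of the attached 0003 patch; the total_size
bookkeeping for streaming is omitted here):

/* in ReorderBufferChangeMemoryUpdate() */
if (addition)
{
	bool		init = (txn->size == 0);

	txn->size += sz;
	rb->size += sz;

	if (init)			/* first change of this txn: enter the max-heap */
		binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
	else				/* key grew: sift the node up, O(log n) */
		binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
}
else
{
	txn->size -= sz;
	rb->size -= sz;

	if (txn->size == 0)	/* no changes left: leave the max-heap */
		binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
	else				/* key shrank: sift the node down, O(log n) */
		binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
}

/* ReorderBufferLargestTXN() becomes a single O(1) lookup */
largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));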

Here is a test script for the many-subtransactions case:

create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(50000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

and here are the results:

* HEAD: 16877.281 ms
* HEAD w/ patches (0001 and 0002): 655.154 ms

There is a huge improvement in the many-subtransactions case.

Finally, we need to note that memory counter updates could happen
frequently, as we update the counter for each change. So even though
we update the binaryheap in O(log n), it could be a huge overhead if
it happens quite often. One idea is to batch the memory counter
updates where possible. I've attached the patch for that (0004). I'll
benchmark the overheads for normal cases.
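
The batching pattern looks like this (a condensed excerpt of the
attached 0004 patch; ReorderBufferCleanupTXN() and friends all follow
the same shape):

/* e.g. in ReorderBufferTruncateTXN() */
dlist_mutable_iter iter;
Size		freed_bytes = 0;

dlist_foreach_modify(iter, &txn->changes)
{
	ReorderBufferChange *change =
		dlist_container(ReorderBufferChange, node, iter.cur);

	dlist_delete(&change->node);
	freed_bytes += ReorderBufferChangeSize(change);
	ReorderBufferReturnChange(rb, change, false);	/* no per-change update */
}

/* one counter (and max-heap) adjustment for the whole transaction */
if (freed_bytes > 0)
	ReorderBufferTXNMemoryUpdate(rb, txn, false, freed_bytes);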

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0004-Batch-memory-counter-updates-in-ReorderBuffer.patch (application/octet-stream)
From 3cfce047ff0bbcdfddc7122a4b637f9d61f334a4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:40:03 +0900
Subject: [PATCH v1 4/4] Batch memory counter updates in ReorderBuffer.

Commit XXX improved the algorithm for selecting the largest
transaction among top-level and sub transactions by using a max-heap.
It in turn required memory counter updates to also update the
max-heap, which is O(log n), where n is the number of transactions.

In order to reduce the number of binaryheap updates, this commit
batches memory counter updates where possible. For instance, when
cleaning up a transaction, we sum up the total amount of changes we
freed, and then update the memory counter once.

XXX: we need this patch only if the performance tests on cases where
memory counter updates happen quite often show regressions.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 .../replication/logical/reorderbuffer.c       | 116 ++++++++++++------
 1 file changed, 80 insertions(+), 36 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1228e3e0d0..ca60e7b984 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -252,7 +252,7 @@ static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *tx
 										 int fd, ReorderBufferChange *change);
 static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 										TXNEntryFile *file, XLogSegNo *segno);
-static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
+static Size ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 									   char *data);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -295,6 +295,9 @@ static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
 											bool addition, Size sz);
+static void ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb,
+										 ReorderBufferTXN *txn,
+										 bool addition, Size sz);
 static int ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 /*
@@ -1509,6 +1512,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	bool		found;
 	dlist_mutable_iter iter;
+	Size		freed_bytes = 0;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1538,9 +1542,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		freed_bytes += ReorderBufferChangeSize(change);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update memory statistics for this txn entry */
+	if (freed_bytes > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, freed_bytes);
+
+
 	/*
 	 * Cleanup the tuplecids we stored for decoding catalog snapshot access.
 	 * They are always stored in the toplevel transaction.
@@ -1555,7 +1565,8 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		Assert(change->txn == txn);
 		Assert(change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID);
 
-		ReorderBufferReturnChange(rb, change, true);
+		/* Tuple CID changes are ignored for updating memory counter. */
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1616,6 +1627,7 @@ static void
 ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)
 {
 	dlist_mutable_iter iter;
+	Size		freed_bytes = 0;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1648,9 +1660,14 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change, true);
+		freed_bytes += ReorderBufferChangeSize(change);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update memory statistics for this txn entry */
+	if (freed_bytes > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, freed_bytes);
+
 	/*
 	 * Mark the transaction as streamed.
 	 *
@@ -1689,7 +1706,8 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 			/* Remove the change from its containing list. */
 			dlist_delete(&change->node);
 
-			ReorderBufferReturnChange(rb, change, true);
+			/* Tuple CID changes are ignored for updating memory counter. */
+			ReorderBufferReturnChange(rb, change, false);
 		}
 	}
 
@@ -3170,26 +3188,14 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 }
 
 /*
- * Update memory counters to account for the new or removed change.
- *
- * We update two counters - in the reorder buffer, and in the transaction
- * containing the change. The reorder buffer counter allows us to quickly
- * decide if we reached the memory limit, the transaction counter allows
- * us to quickly pick the largest transaction for eviction.
- *
- * When streaming is enabled, we need to update the toplevel transaction
- * counters instead - we don't really care about subtransactions as we
- * can't stream them individually anyway, and we only pick toplevel
- * transactions for eviction. So only toplevel transactions matter.
+ * A wrapper function for ReorderBufferTXNMemoryUpdate() to update memory
+ * counters to account for the new or removed change.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
-	ReorderBufferTXN *toptxn;
-
 	Assert(change->txn);
 
 	/*
@@ -3200,7 +3206,27 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
-	txn = change->txn;
+	ReorderBufferTXNMemoryUpdate(rb, change->txn, addition, sz);
+}
+
+/*
+ * Update memory counters of the given transaction.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
+ */
+static void
+ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 bool addition, Size sz)
+{
+	ReorderBufferTXN *toptxn;
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3656,6 +3682,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	int			fd = -1;
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
+	Size		freed_bytes = 0;
 	Size		size = txn->size;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
@@ -3710,11 +3737,15 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		freed_bytes += ReorderBufferChangeSize(change);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	if (freed_bytes > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, freed_bytes);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4197,6 +4228,8 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 							TXNEntryFile *file, XLogSegNo *segno)
 {
 	Size		restored = 0;
+	Size		restored_bytes = 0;
+	Size		freed_bytes = 0;
 	XLogSegNo	last_segno;
 	dlist_mutable_iter cleanup_iter;
 	File	   *fd = &file->vfd;
@@ -4210,12 +4243,16 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		ReorderBufferChange *cleanup =
 			dlist_container(ReorderBufferChange, node, cleanup_iter.cur);
 
+		freed_bytes += ReorderBufferChangeSize(cleanup);
 		dlist_delete(&cleanup->node);
-		ReorderBufferReturnChange(rb, cleanup, true);
+		ReorderBufferReturnChange(rb, cleanup, false);
 	}
 	txn->nentries_mem = 0;
 	Assert(dlist_is_empty(&txn->changes));
 
+	if (freed_bytes > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, freed_bytes);
+
 	XLByteToSeg(txn->final_lsn, last_segno, wal_segment_size);
 
 	while (restored < max_changes_in_memory && *segno <= last_segno)
@@ -4320,22 +4357,32 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 * ok, read a full change from disk, now restore it into proper
 		 * in-memory format
 		 */
-		ReorderBufferRestoreChange(rb, txn, rb->outbuf);
+		restored_bytes += ReorderBufferRestoreChange(rb, txn, rb->outbuf);
 		restored++;
 	}
 
+	/*
+	 * Update memory accounting for the restored change.  We need to do this
+	 * although we don't check the memory limit when restoring the changes in
+	 * this branch (we only do that when initially queueing the changes after
+	 * decoding), because we will release the changes later, and that will
+	 * update the accounting too (subtracting the size from the counters). And
+	 * we don't want to underflow there.
+	 */
+	ReorderBufferTXNMemoryUpdate(rb, txn, true, restored_bytes);
+
 	return restored;
 }
 
 /*
  * Convert change from its on-disk format to in-memory format and queue it onto
- * the TXN's ->changes list.
+ * the TXN's ->changes list. Return the size of the restored change.
  *
  * Note: although "data" is declared char*, at entry it points to a
  * maxalign'd buffer, making it safe in most of this function to assume
  * that the pointed-to data is suitably aligned for direct access.
  */
-static void
+static Size
 ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 						   char *data)
 {
@@ -4488,16 +4535,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	dlist_push_tail(&txn->changes, &change->node);
 	txn->nentries_mem++;
 
-	/*
-	 * Update memory accounting for the restored change.  We need to do this
-	 * although we don't check the memory limit when restoring the changes in
-	 * this branch (we only do that when initially queueing the changes after
-	 * decoding), because we will release the changes later, and that will
-	 * update the accounting too (subtracting the size from the counters). And
-	 * we don't want to underflow there.
-	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
-									ReorderBufferChangeSize(change));
+	return ReorderBufferChangeSize(change);
 }
 
 /*
@@ -4931,6 +4969,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	while ((ent = (ReorderBufferToastEnt *) hash_seq_search(&hstat)) != NULL)
 	{
 		dlist_mutable_iter it;
+		Size	freed_bytes = 0;
 
 		if (ent->reconstructed != NULL)
 			pfree(ent->reconstructed);
@@ -4940,9 +4979,14 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 			ReorderBufferChange *change =
 				dlist_container(ReorderBufferChange, node, it.cur);
 
+			freed_bytes += ReorderBufferChangeSize(change);
 			dlist_delete(&change->node);
-			ReorderBufferReturnChange(rb, change, true);
+			ReorderBufferReturnChange(rb, change, false);
 		}
+
+		/* Update memory statistics for this txn entry */
+		if (freed_bytes > 0)
+			ReorderBufferTXNMemoryUpdate(rb, txn, false, freed_bytes);
 	}
 
 	hash_destroy(txn->toast_hash);
-- 
2.39.3

v1-0001-Make-binaryheap-enlareable.patch (application/octet-stream)
From 6a3fec120d392f356c900cd8e547514804d5a90d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v1 1/4] Make binaryheap enlargeable.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/common/binaryheap.c      | 39 ++++++++++++++++++------------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..bc43aca093 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -104,6 +103,20 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	if (heap->bh_size < heap->bh_space)
+		return;
+
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +128,8 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
-	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+	bh_enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,8 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
-	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+	bh_enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

v1-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch (application/octet-stream)
From d43d33f40b0820f22bd5355edbd3f95338a6be0f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v1 3/4] Improve transaction eviction algorithm in
 ReorderBuffer.

Previously, when selecting the largest transaction to evict, we
scanned all transactions, which could be quite slow as it was O(n),
where n is the total number of (top-level and sub) transactions,
especially in cases where there are many subtransactions. It could
lead to a huge replication lag.

This commit changes the eviction algorithm in ReorderBuffer to use a
max-heap keyed by transaction size. Selecting the largest transaction
is now O(1). The performance test showed a significant improvement.

XXX: updating the transaction's memory counter and the max-heap is now
O(log n), so we need to evaluate it. If there are regressions, we
would need a follow-up patch that batches multiple memory counter
updates where possible.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 .../replication/logical/reorderbuffer.c       | 57 ++++++++++++-------
 src/include/replication/reorderbuffer.h       |  4 ++
 2 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index d1334ffb55..1228e3e0d0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -295,6 +295,7 @@ static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
 											bool addition, Size sz);
+static int ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -356,6 +357,13 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * We start with an arbitrary number, which should be enough for most
+	 * cases.
+	 */
+	buffer->txn_heap = binaryheap_allocate(1024, ReorderBufferTXNSizeCompare,
+										   NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -3202,11 +3210,18 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	if (addition)
 	{
+		bool init = (txn->size == 0);
+
 		txn->size += sz;
 		rb->size += sz;
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		if (init)
+			binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+		else
+			binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
 	}
 	else
 	{
@@ -3216,6 +3231,11 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		if (txn->size == 0)
+			binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+		else
+			binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3473,31 +3493,13 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
 	ReorderBufferTXN *largest = NULL;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
-	{
-		ReorderBufferTXN *txn = ent->txn;
-
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
-	}
+	largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
 
 	Assert(largest);
 	Assert(largest->size > 0);
@@ -5278,3 +5280,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare the sizes of two transactions. This is a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN	*ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN	*tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 3e232c6c27..e9eba186e9 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -650,6 +651,9 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	binaryheap	*txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

v1-0002-Add-functions-for-updating-keys-and-removing-node.patch (application/octet-stream)
From 439ec9295568bfa35933c1595f799e4fc3ad0030 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v1 2/4] Add functions for updating keys and removing nodes to
 binaryheap.

Previously, binaryheap didn't support key updates or removing nodes
in an efficient way. For example, in order to remove a node from the
binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally maintains. That operation itself
is O(log n), but searching for the node's position is O(n).

This commit adds a hash table to binaryheap to keep track of the
position of each node in the binaryheap. That way, by using the newly
added functions binaryheap_update_up, binaryheap_update_down and
binaryheap_remove_node_ptr, both updating a key and removing a node
can be done in O(1) on average and O(log n) in the worst case. This
is known as an indexed priority queue.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/common/binaryheap.c      | 134 ++++++++++++++++++++++++++++++++---
 src/include/lib/binaryheap.h |  28 ++++++++
 2 files changed, 153 insertions(+), 9 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index bc43aca093..9122654e15 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,28 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -49,6 +69,12 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
+#ifdef FRONTEND
+	heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+	heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity, NULL);
+#endif
+
 	return heap;
 }
 
@@ -63,6 +89,7 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +100,7 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	bh_nodeidx_destroy(heap->bh_nodeidx);
 	pfree(heap);
 }
 
@@ -117,6 +145,40 @@ bh_enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at 'idx' and update its position accordingly.
+ */
+static void
+bh_set_node(binaryheap *heap, bh_node_type d, int idx)
+{
+	bh_nodeidx_entry *ent;
+	bool	found;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[idx] = d;
+
+	/* Remember its index in the nodes array */
+	ent = bh_nodeidx_insert(heap->bh_nodeidx, d, &found);
+	ent->idx = idx;
+}
+
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+	bh_node_type	node = heap->bh_nodes[idx];
+
+	/* Remove overwritten node's index */
+	(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+
+	/* Replace it with the given new node */
+	if (idx < heap->bh_size)
+		bh_set_node(heap, replaced_by, idx);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +193,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 	bh_enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -162,7 +224,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	bh_enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -203,6 +265,7 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		bh_nodeidx_delete(heap->bh_nodeidx, result);
 		return result;
 	}
 
@@ -210,7 +273,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -236,7 +299,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	bh_replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -245,6 +308,59 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_up(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_down(heap, ent->idx);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -257,7 +373,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -299,11 +415,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
 
 /*
@@ -358,9 +474,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..8847abf863 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,6 +11,8 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+#include "utils/palloc.h"
+
 /*
  * We provide a Datum-based API for backend code and a void *-based API for
  * frontend code (since the Datum definitions are not available to frontend
@@ -29,6 +31,28 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type	key;
+	char			status;
+	int				idx;
+} bh_nodeidx_entry;
+
+/* define parameters necessary to generate the hash table interface */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,6 +71,7 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+	bh_nodeidx_hash	*bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
@@ -60,7 +85,10 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
-- 
2.39.3

#19Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#18)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

What I meant is that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB; transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions, each consuming 10MB,
while the total memory usage in the reorderbuffer is still 30MB (less
than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead, unless we maintain them only once the number of open
transactions crosses a certain threshold.

On further analysis, I realized that the approach discussed here might
not be the way to go. The point of dividing transactions into several
subgroups is to split a large number of entries into smaller sets so
that we can reduce the complexity of searching for a particular entry.
Since we assume that there are no big differences in entry sizes
within a subgroup, we can pick the entry to evict in O(1). However,
what we really need to avoid here is increasing the number of
evictions, because serializing an entry to disk is generally more
costly than searching for an entry in memory.

I think that's not a problem for a large-entries subgroup, but for the
smallest-entries subgroup, e.g. entries consuming less than 5% of the
limit, it could end up evicting many entries. For example, there would
be a huge difference between serializing 1 entry consuming 5% of the
memory limit and serializing 5000 entries each consuming 0.001% of the
memory limit. Even if we can select those 5000 entries quickly, I
think the latter would be slower in total. And the more subgroups we
create, the more complex the algorithm gets and the more overhead it
could cause. So I think we need to search for the largest entry
anyway, in order to minimize the number of evictions.

Looking into data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove an arbitrary entry and to update an arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
know the entry's position just after binaryheap_add_unordered(), but
it might change after heapify, and searching for a node's position is
O(n). So the improvement idea is to add a hash table to the binaryheap
that tracks the position of each entry, so that we can remove an
arbitrary entry in O(log n) and also update an arbitrary entry's key
in O(log n). This is known as an indexed priority queue. I've attached
the patches for that (0001 and 0002).

That way, in terms of the reorderbuffer, we can update and remove a
transaction's memory usage in O(log n) (worst case; O(1) on average)
and then pick the largest transaction in O(1). Since we might need to
call ReorderBufferSerializeTXN() even in the non-streaming case, we
need to maintain the binaryheap anyway.

Since updating the max-heap for each memory counter update could lead
to some regressions when the number of transactions being decoded is
small, I've measured it with a case where the memory counter is
updated frequently:

setup script:
create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 8000000);

benchmark script:
set work_mem to '3GB';
set logical_decoding_work_mem to '5GB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

Here are the results (the median of five executions):

* HEAD
5274.765 ms

* HEAD + 0001-0003 patch
5532.203 ms

There was an approximately 5% performance regression.

An improvement idea is to use two strategies for updating the
max-heap depending on the number of transactions. That is, if the
number of transactions being decoded is small, we add a transaction
to the max-heap with binaryheap_add_unordered(), which is O(1), and
heapify it just before picking the largest transaction, which is
O(n). That way, we can minimize the overhead of updating the memory
counter. Once the number of transactions being decoded exceeds a
threshold, say 1024, we switch to another strategy: we call
binaryheap_update_up/down() when updating the memory counter to
preserve the heap property, which is O(log n), and pick the largest
transaction in O(1). This strategy minimizes the cost of picking the
largest transaction, at the expense of some cost when updating the
memory counters.
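
In code, the switch boils down to the following (a condensed excerpt
of the attached v2 patches):

/* on a memory counter update, when a transaction gets its first change */
if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
	binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));	/* O(1) */
else
	binaryheap_add(rb->txn_heap, PointerGetDatum(txn));	/* O(log n) */

/* in ReorderBufferLargestTXN() */
if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
{
	binaryheap_build(rb->txn_heap);		/* lazy heapify, O(n) */

	if (binaryheap_size(rb->txn_heap) >= REORDER_BUFFER_MEM_TRACK_THRESHOLD)
		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
}
largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));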

I've experimented with this idea and ran the same tests:

* HEAD + new patches (0001 - 0003)
5277.524 ms

The numbers look good. I've attached these patches. Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch (application/octet-stream)
From 158a0037b897cdb0a6b267a3a393bb5b7a72bef0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v2 3/4] Improve transaction eviction algorithm in
 ReorderBuffer.

Previously, when selecting the largest transaction to evict, we
scanned all transactions, which could be quite slow as it was O(n),
where n is the total number of (top-level and sub) transactions,
especially in cases where there are many subtransactions. It could
lead to a huge replication lag.

This commit changes the eviction algorithm in ReorderBuffer to use a
max-heap keyed by transaction size, and uses two strategies depending
on the number of transactions being decoded.

It could be too expensive to update the max-heap while preserving the
heap property each time the transaction's memory counter is updated,
as it could happen very frequently. So when the number of transactions
being decoded is small, we add the transactions to the max-heap
without preserving the heap property, which is O(1). We heapify the
max-heap just before picking the largest transaction, which is O(n).
This strategy minimizes the overhead of updating the transaction's
memory counter.

On the other hand, when the number of transactions being decoded is
fairly large, such as when a transaction has many subtransactions,
selecting the largest transaction in O(n) is too expensive. Therefore,
once the number of transactions being decoded exceeds the
threshold (1024), each time we update the transaction's memory counter
we update the max-heap while preserving the heap property, which is
O(log n). Picking the largest transaction can then be done in O(1).
This strategy minimizes the cost of picking the largest transaction.

XXX: updating the transaction's memory counter and the max-heap is now
O(log n), so we need to evaluate it. If there are regressions, we
would need a follow-up patch that batches multiple memory counter
updates where possible.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 .../replication/logical/reorderbuffer.c       | 130 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  11 ++
 2 files changed, 125 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c390d96ac3..a114f57d3b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,27 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with the transaction size as the key to find the largest
+ *	  transaction, and use two strategies depending on the number of transactions
+ *	  being decoded:
+ *
+ *	  Since the transaction memory counter is updated frequently, it's expensive
+ *	  to update the max-heap while preserving the heap property each time the memory
+ *	  counter is updated. So when the number of transactions is small, transactions
+ *	  are added to the max-heap without preserving the heap property. We heapify
+ *	  it just before picking the largest transaction. In this case, updating the
+ *	  memory counter is done in O(1) whereas picking the largest transaction is
+ *	  done in O(n), where n is the total number of transactions being decoded.
+ *
+ *	  On the other hand, when the number of transactions being decoded is large,
+ *	  such as when a transaction has many subtransactions, selecting the largest
+ *	  transaction in O(1) is too costly. Therefore, each time the memory counter
+ *	  transaction in O(n) is too costly. Therefore, each time the memory counter
+ *	  heap property, and the largest transaction is picked at a low cost. In
+ *	  this case, updating the memory counter is done in O(log n) whereas picking
+ *	  the largest transaction is done in O(1). This minimizes the cost of choosing
+ *	  the largest transaction.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -295,6 +316,7 @@ static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
 											bool addition, Size sz);
+static int ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -356,6 +378,14 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * We start with an arbitrary number, which should be enough for most
+	 * cases.
+	 */
+	buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
+	buffer->txn_heap = binaryheap_allocate(1024, ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -3200,11 +3230,31 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	if (addition)
 	{
+		bool init = (txn->size == 0);
+
 		txn->size += sz;
 		rb->size += sz;
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the transaction in the max-heap */
+		if (init)
+		{
+			/* Add the transaction to the max-heap */
+			if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+				binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+		}
+		else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			/*
+			 * If we're maintaining the max-heap even while updating the memory counter,
+			 * we reflect the updates to the max-heap.
+			 */
+			binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3214,6 +3264,24 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Remove the transaction from the max-heap */
+		if (txn->size == 0)
+		{
+			/* Remove the transaction */
+			if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+				binaryheap_remove_node_ptr_unordered(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+		}
+		else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			/*
+			 * If we're maintaining the max-heap even while updating the memory counter,
+			 * we reflect the updates to the max-heap.
+			 */
+			binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3471,32 +3539,45 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
+	/*
+	 * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+	 * to switch the state.
+	 */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD 1024
+
 	ReorderBufferTXN *largest = NULL;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
 	{
-		ReorderBufferTXN *txn = ent->txn;
+		binaryheap_build(rb->txn_heap);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the number of transactions exceeds the threshold, switch to the
+		 * state where we maintain the max-heap even while updating the memory
+		 * counter.
+		 */
+		if (binaryheap_size(rb->txn_heap) >= REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+			rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
+	}
+	else
+	{
+		Assert(rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP);
+
+		/*
+		 * If the number of transactions falls below the threshold, switch
+		 * to the state where we heapify the max-heap right before picking the
+		 * largest transaction and do nothing at memory counter update time.
+		 */
+		if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+			rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
 	}
 
+	largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
+
 	Assert(largest);
 	Assert(largest->size > 0);
 	Assert(largest->size <= rb->size);
@@ -5276,3 +5357,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare the sizes of two transactions. This is a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN	*ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN	*tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..c9815d03f7 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -531,6 +532,12 @@ typedef void (*ReorderBufferUpdateProgressTxnCB) (
 												  ReorderBufferTXN *txn,
 												  XLogRecPtr lsn);
 
+typedef enum ReorderBufferMemTrackState
+{
+	REORDER_BUFFER_MEM_TRACK_NORMAL,
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
 struct ReorderBuffer
 {
 	/*
@@ -631,6 +638,10 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap	*txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

v2-0002-Add-functions-for-updating-keys-and-removing-node.patch (application/octet-stream)
From 94005e6f57690e12e8734ef2d0b22a09c28220d8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v2 2/4] Add functions for updating keys and removing nodes to
 binaryheap.

Previously, binaryheap didn't support key updates or removing nodes
in an efficient way. For example, in order to remove a node from the
binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally maintains. That operation itself
is O(log n), but searching for the node's position is O(n).

This commit adds a hash table to binaryheap to keep track of the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on average and O(log n)
in the worst case. This is known as an indexed priority queue.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 190 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  38 +++-
 9 files changed, 225 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 2d552f4224..250f226d5f 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -427,6 +427,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 0817868452..1980794cb7 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 67693b0580..f3ec0a8918 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -250,7 +250,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbf0966182..c390d96ac3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1295,6 +1295,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..e641ebaa40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2733,6 +2733,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 256d1e35a4..a044a684c8 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4032,6 +4032,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index f358dd22b9..63b1c3570d 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -404,7 +404,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index bc43aca093..a5bb3b148d 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,28 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -36,7 +56,8 @@ static void sift_up(binaryheap *heap, int node_off);
  * argument specified by 'arg'.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -49,6 +70,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
+	heap->bh_indexed = indexed;
+	if (heap->bh_indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
+
 	return heap;
 }
 
@@ -63,6 +95,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (heap->bh_indexed)
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +108,8 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (heap->bh_indexed)
+		bh_nodeidx_destroy(heap->bh_nodeidx);
 	pfree(heap);
 }
 
@@ -117,6 +154,44 @@ bh_enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at 'idx' and update its position accordingly.
+ */
+static void
+bh_set_node(binaryheap *heap, bh_node_type d, int idx)
+{
+	bh_nodeidx_entry *ent;
+	bool	found;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[idx] = d;
+
+	if (heap->bh_indexed)
+	{
+		/* Remember its index in the nodes array */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, d, &found);
+		ent->idx = idx;
+	}
+}
+
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+	bh_node_type	node = heap->bh_nodes[idx];
+
+	/* Remove overwritten node's index */
+	if (heap->bh_indexed)
+		(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+
+	/* Replace it with the given new node */
+	if (idx < heap->bh_size)
+		bh_set_node(heap, replaced_by, idx);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +206,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 	bh_enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -162,7 +237,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	bh_enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -203,6 +278,10 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+
+		if (heap->bh_indexed)
+			bh_nodeidx_delete(heap->bh_nodeidx, result);
+
 		return result;
 	}
 
@@ -210,7 +289,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -236,7 +315,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	bh_replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -245,6 +324,97 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->idx);
+}
+
+/*
+ * binaryheap_remove_node_ptr_unordered
+ *
+ * Remove the given datum from the binaryheap in O(1) without preserving the
+ * heap property. To obtain a valid heap, call binaryheap_build() afterwards.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr_unordered(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap));
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	heap->bh_has_heap_property = false;
+	bh_replace_node(heap, ent->idx, heap->bh_nodes[--heap->bh_size]);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_up(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_down(heap, ent->idx);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -257,7 +427,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -299,11 +469,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
 
 /*
@@ -358,9 +528,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..1070dcf48d 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -11,6 +11,8 @@
 #ifndef BINARYHEAP_H
 #define BINARYHEAP_H
 
+#include "utils/palloc.h"
+
 /*
  * We provide a Datum-based API for backend code and a void *-based API for
  * frontend code (since the Datum definitions are not available to frontend
@@ -29,6 +31,28 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type	key;
+	char			status;
+	int				idx;
+} bh_nodeidx_entry;
+
+/* define parameters necessary to generate the hash table interface */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +71,19 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_indexed is true, the bh_nodeidx is used to track each
+	 * node's index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+	 */
+	bool		bh_indexed;
+	bh_nodeidx_hash	*bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,7 +92,11 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
+extern void binaryheap_remove_node_ptr_unordered(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
-- 
2.39.3

v2-0001-Make-binaryheap-enlareable.patch (application/octet-stream)
From a20d72065d3662d0773742dbff4fce2af454ef17 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v2 1/4] Make binaryheap enlareable.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/common/binaryheap.c      | 39 ++++++++++++++++++------------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..bc43aca093 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -104,6 +103,20 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	if (heap->bh_size < heap->bh_space)
+		return;
+
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +128,8 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
-	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+	bh_enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,8 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
-	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+	bh_enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

#20Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#19)
RE: Improve eviction algorithm in ReorderBuffer

Dear Sawada-san,

I have started to read your patches. Here are my initial comments.
At least, all subscription tests passed on my env.

A comment for 0001:

01.
```
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+    if (heap->bh_size < heap->bh_space)
+        return;
+
+    heap->bh_space *= 2;
+    heap->bh_nodes = repalloc(heap->bh_nodes,
+                              sizeof(bh_node_type) * heap->bh_space);
+}
```

I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
structure is a public one, and arbitrary code and extensions can refer to
bh_nodes directly. But if the array is repalloc()'d, the pointer would be
updated, so their reference could become a dangling pointer.
I think the internals of the structure should be private in this case.
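
For instance, here is a contrived sketch of the hazard I have in mind
(my_compare and d1..d3 are placeholders):

```
binaryheap *heap = binaryheap_allocate(2, my_compare, NULL);
bh_node_type *cached = heap->bh_nodes;  /* caller caches the array pointer */

binaryheap_add_unordered(heap, d1);
binaryheap_add_unordered(heap, d2);
binaryheap_add_unordered(heap, d3);     /* may repalloc() and move bh_nodes */

/* 'cached' can now be a dangling pointer */
```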

Comments for 0002:

02.
```
+#include "utils/palloc.h"
```

Is it really needed? I'm not sure who refers to it.

03.
```
typedef struct bh_nodeidx_entry
{
    bh_node_type    key;
    char            status;
    int             idx;
} bh_nodeidx_entry;
```

Sorry if it is a stupid question. Can you tell me how "status" is used?
Neither the binaryheap nor the reorderbuffer components refer to it.

04.
```
 extern binaryheap *binaryheap_allocate(int capacity,
                                        binaryheap_comparator compare,
-                                       void *arg);
+                                       bool indexed, void *arg);
```

I felt the pre-existing API should not be changed. How about adding
binaryheap_allocate_extended() or something similar that can specify the
`bool indexed`, while binaryheap_allocate() keeps setting heap->bh_indexed to false?
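
That is, something like the following sketch, where
binaryheap_allocate_extended() would take over the body of the patched
binaryheap_allocate() (the name is tentative):

```
binaryheap *
binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
{
    /* keep the existing behavior: no node index */
    return binaryheap_allocate_extended(capacity, compare, false, arg);
}
```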

05.
```
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
```

IIUC, callers must consider whether the node should be shifted up or down and
use the appropriate function, right? I felt it may not be user-friendly.

Comments for 0003:

06.
```
This commit changes the eviction algorithm in ReorderBuffer to use
max-heap with transaction size,a nd use two strategies depending on
the number of transactions being decoded.
```

s/a nd/ and/

07.
```
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't
preserve the heap property, which is O(1). We heapify the max-heap
just before picking the largest transaction, which is O(n). This
strategy minimizes the overheads of updating the transaction's memory
counter.
```

s/pudate/update/

08.
IIUC, if more than 1024 transactions are running but they have a small amount
of changes, the performance may be degraded, right? Do you have a result in
such a case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

#21vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#19)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

I wanted to mean that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.

On further analysis, I realized that the approach discussed here might
not be the way to go. The idea of dividing transactions into several
subgroups is to divide a large number of entries into multiple
sub-groups so we can reduce the complexity to search for the
particular entry. Since we assume that there are no big differences in
entries' sizes within a sub-group, we can pick the entry to evict in
O(1). However, what we really need to avoid here is that we end up
increasing the number of times to evict entries because serializing an
entry to the disk is more costly than searching an entry on memory in
general.

I think that it's no problem in a large-entries subgroup but when it
comes to the smallest-entries subgroup, like for entries consuming
less than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries
consuming 0.001% of the memory limit. Even if we can select 5000
entries quickly, I think the latter would be slower in total. The more
subgroups we create, the more the algorithm gets complex and the
overheads could cause. So I think we need to search for the largest
entry in order to minimize the number of evictions anyway.

Looking for data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove the arbitrary entry and to update the arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
can know the entry's position just after binaryheap_add_unordered()
but it might be changed after heapify. Searching the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap so that it can track the positions for each entry so that
we can remove the arbitrary entry in O(log n) and also update the
arbitrary entry's key in O(log n). This is known as the indexed
priority queue. I've attached the patch for that (0001 and 0002).

That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) (in worst case and O(1) in
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in non-streaming case,
we need to maintain the binaryheap anyway.

Since if the number of transactions being decoded is small, updating
max-heap for each memory counter update could lead to some
regressions, I've measured it with the case where updating memory
counter happens frequently:

setup script:
create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 8000000);

benchmark script:
set work_mem to '3GB';
set logical_decoding_work_mem to '5GB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

Here are results (the median of five executions):

* HEAD
5274.765 ms

* HEAD + 0001-0003 patch
5532.203 ms

There was an approximately 5% performance regression.

An improvement idea is that we use two strategies for updating
max-heap depending on the number of transactions. That is, if the
number of transactions being decoded is small, we add a transaction to
max-heap by binaryheap_add_unordered(), which is O(1), and heapify it
just before picking the largest transactions, which is O(n). That way,
we can minimize the overhead of updating the memory counter. Once the
number of transactions being decoded exceeds the threshold, say 1024,
we use another strategy. We call binaryheap_update_up/down() when
updating the memory counter to preserve heap property, which is O(log
n), and pick the largest transaction in O(1). This strategy minimizes
the cost of picking the largest transactions instead of paying some
costs to update the memory counters.

I've experimented with this idea and run the same tests:

* HEAD + new patches (0001 - 0003)
5277.524 ms

The number looks good. I've attached these patches. Feedback is very welcome.

Few comments:
1) Here we are changing memtrack_state to
REORDER_BUFFER_MEM_TRACK_NORMAL immediately once the size is less than
REORDE_BUFFER_MEM_TRACK_THRESHOLD. In this scenario we will be
rebuilding the heap many times if transactions keep getting added and
removed. How about waiting for txn_heap to become less than 95% of
REORDE_BUFFER_MEM_TRACK_THRESHOLD before switching, to avoid building
the heap many times in this scenario (see the sketch after the quoted code)?
+       {
+               Assert(rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP);
+
+               /*
+                * If the number of transactions gets lowered than the threshold,
+                * switch to the state where we heapify the max-heap right before
+                * picking the largest transaction while doing nothing for memory
+                * counter update.
+                */
+               if (binaryheap_size(rb->txn_heap) < REORDE_BUFFER_MEM_TRACK_THRESHOLD)
+                       rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
        }
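
That is, something along these lines (just a sketch; the 95% cutoff is only
an example):

/* Switch back to NORMAL only once we are well below the threshold. */
if (binaryheap_size(rb->txn_heap) <
        REORDE_BUFFER_MEM_TRACK_THRESHOLD * 0.95)
        rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;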
2) I felt the init variable is not needed; we can directly check txn->size
instead, as is done in the else case:
+               bool init = (txn->size == 0);
+
                txn->size += sz;
                rb->size += sz;
                /* Update the total size in the top transaction. */
                toptxn->total_size += sz;
+
+               /* Update the transaction in the max-heap */
+               if (init)
+               {
+                       /* Add the transaction to the max-heap */
+                       if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+                               binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+                       else
+                               binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+               }
+               else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+               {
+                       /*
+                        * If we're maintaining the max-heap even while updating
+                        * the memory counter, we reflect the updates to the max-heap.
+                        */
+                       binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+               }
3) we can add some comments for this:
+typedef enum ReorderBufferMemTrackState
+{
+       REORDER_BUFFER_MEM_TRACK_NORMAL,
+       REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
4) This should be added to typedefs.list:
+typedef enum ReorderBufferMemTrackState
+{
+       REORDER_BUFFER_MEM_TRACK_NORMAL,
+       REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

5) Few typos:
5.a) enlareable should be enlargeable
[PATCH v2 1/4] Make binaryheap enlareable.

5.b) subtranasctions should be subtransactions:
On the other hand, when the number of transactions being decoded is
fairly large, such as when a transaction has many subtranasctions,

5.c) evaludate should be evaluate:
XXX: updating the transaction's memory counter and the max-heap is now
O(log n), so we need to evaludate it. If there are some regression, we

5.d) pudate should be update:
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it

5.e) frquently should be frequently:
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't

6) This should be added to typedefs.list:
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+       bh_node_type    key;
+       char                    status;
+       int                             idx;
+} bh_nodeidx_entry;

Regards,
Vignesh

#22Shubham Khanna
khannashubham1197@gmail.com
In reply to: Masahiko Sawada (#18)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

I wanted to mean that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.

On further analysis, I realized that the approach discussed here might
not be the way to go. The idea of dividing transactions into several
subgroups is to divide a large number of entries into multiple
sub-groups so we can reduce the complexity to search for the
particular entry. Since we assume that there are no big differences in
entries' sizes within a sub-group, we can pick the entry to evict in
O(1). However, what we really need to avoid here is that we end up
increasing the number of times to evict entries because serializing an
entry to the disk is more costly than searching an entry on memory in
general.

I think that it's no problem in a large-entries subgroup but when it
comes to the smallest-entries subgroup, like for entries consuming
less than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries
consuming 0.001% of the memory limit. Even if we can select 5000
entries quickly, I think the latter would be slower in total. The more
subgroups we create, the more the algorithm gets complex and the
overheads could cause. So I think we need to search for the largest
entry in order to minimize the number of evictions anyway.

Looking for data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove the arbitrary entry and to update the arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
can know the entry's position just after binaryheap_add_unordered()
but it might be changed after heapify. Searching the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap so that it can track the positions for each entry so that
we can remove the arbitrary entry in O(log n) and also update the
arbitrary entry's key in O(log n). This is known as the indexed
priority queue. I've attached the patch for that (0001 and 0002).

That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) (in worst case and O(1) in
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in non-streaming case,
we need to maintain the binaryheap anyway. I've attached the patch for
that (0003).

Here is a test script for the many-subtransactions case:

create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(50000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

and here are results:

* HEAD: 16877.281 ms
* HEAD w/ patches (0001 and 0002): 655.154 ms

There is a huge improvement in the many-subtransactions case.

I have run the same test and found around a 12.53x improvement (the
median of five executions):
HEAD | HEAD + v2-0001 + v2-0002 + v2-0003 patch
29197ms | 2329ms

I had also run the regression test that you had shared at [1]; there
was a very slight dip in this case: it takes around 0.31% more time:
HEAD | HEAD + v2-0001 + v2-0002 + v2-0003 patch
4459ms | 4473ms

The machine has 755.536 GB of total memory and 120 CPUs, and runs the
RHEL 7 operating system. Detailed information about the performance
machine is attached.

[1]: /messages/by-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv=jzmAp9ex7eyQ@mail.gmail.com

Thanks and Regards,
Shubham Khanna.

Attachments:

memory_info.txt (text/plain)
cpu_info.txt (text/plain)
os_info.txt (text/plain)
#23Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#20)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Jan 31, 2024 at 2:18 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Sawada-san,

I have started to read your patches. Here are my initial comments.
At least, all subscription tests were passed on my env.

Thank you for the review comments!

A comment for 0001:

01.
```
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+    if (heap->bh_size < heap->bh_space)
+        return;
+
+    heap->bh_space *= 2;
+    heap->bh_nodes = repalloc(heap->bh_nodes,
+                              sizeof(bh_node_type) * heap->bh_space);
+}
```

I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
structure is a public one, and arbitrary code and extensions can refer to
bh_nodes directly. But if the array is repalloc()'d, the pointer would be
updated, so their reference could become a dangling pointer.

Hmm, I'm not sure this is a case that we really need to worry about,
and I cannot come up with a good use case where extensions refer to
bh_nodes directly rather than through binaryheap. In the PostgreSQL
code, many Nodes already have pointers and are exposed.

I think the internals of the structure should be private in this case.

Comments for 0002:

02.
```
+#include "utils/palloc.h"
```

Is it really needed? I'm not sure who refers to it.

Seems not, will remove.

03.
```
typedef struct bh_nodeidx_entry
{
    bh_node_type    key;
    char            status;
    int             idx;
} bh_nodeidx_entry;
```

Sorry if it is a stupid question. Can you tell me how "status" is used?
Neither the binaryheap nor the reorderbuffer components refer to it.

It's required by simplehash.h; simplehash uses it internally to track
whether each entry is empty or in use.

04.
```
extern binaryheap *binaryheap_allocate(int capacity,
binaryheap_comparator compare,
-                                       void *arg);
+                                       bool indexed, void *arg);
```

I felt the pre-existing API should not be changed. How about adding
binaryheap_allocate_extended() or something similar that can specify the
`bool indexed`, while binaryheap_allocate() keeps setting heap->bh_indexed to false?

I'm really not sure it's worth inventing a
binaryheap_allocate_extended() function just for preserving API
compatibility. I think it's generally a good idea to have an
xxx_extended() function to increase readability and usability, for
example, for the case where the same (kind of default) arguments are
passed in most cases and the function is called from many places.
However, we have only a handful of binaryheap_allocate() callers, and I
believe that it would not hurt the existing callers.

05.
```
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
```

IIUC, callers must consider whether the node should be shifted up or down and
use the appropriate function, right? I felt it may not be user-friendly.

Right, I couldn't come up with a better interface.

Another idea I considered was to have the caller provide a callback
function that compares the old and new keys. For example, in the
reorderbuffer case, we would call it like:

binaryheap_update(rb->txn_heap, PointerGetDatum(txn),
ReorderBufferTXNUpdateCompare, (void *) &old_size);

Then in ReorderBufferTXNUpdateCompare():
ReorderBufferTXN *txn = (ReorderBufferTXN *) a;
Size old_size = *(Size *) b;
(compare txn->size to old_size ...)
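
Fleshed out a bit, the binaryheap side could look roughly like this (a very
rough sketch against the 0002 patch; all names are tentative):

typedef int (*binaryheap_update_comparator) (bh_node_type node, void *old_key);

void
binaryheap_update(binaryheap *heap, bh_node_type d,
                  binaryheap_update_comparator cmp, void *old_key)
{
    bh_nodeidx_entry *ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
    int         r = cmp(d, old_key);

    if (r > 0)          /* key grew; the node may need to move up */
        sift_up(heap, ent->idx);
    else if (r < 0)     /* key shrank; the node may need to move down */
        sift_down(heap, ent->idx);
}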

However it seems complicated...

Comments for 0003:

06.
```
This commit changes the eviction algorithm in ReorderBuffer to use
max-heap with transaction size,a nd use two strategies depending on
the number of transactions being decoded.
```

s/a nd/ and/

07.
```
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't
preserve the heap property, which is O(1). We heapify the max-heap
just before picking the largest transaction, which is O(n). This
strategy minimizes the overheads of updating the transaction's memory
counter.
```

s/pudate/update/

Will fix them.

08.
IIUC, if more than 1024 transactions are running but they have a small amount
of changes, the performance may be degraded, right? Do you have a result in
such a case?

I've run a benchmark test that I shared before[1]. Here are the results
of decoding a transaction that has 1M subtransactions, each of which has
1 INSERT:

HEAD:
1810.192 ms

HEAD w/ patch:
2001.094 ms

I set logical_decoding_work_mem to a large enough value not to evict
any transactions. I can see about a 10% performance regression in
this case.

Regards,

[1]: /messages/by-id/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#24Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#21)
Re: Improve eviction algorithm in ReorderBuffer

Hi,

On Wed, Jan 31, 2024 at 5:32 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

I wanted to mean that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.

On further analysis, I realized that the approach discussed here might
not be the way to go. The idea of dividing transactions into several
subgroups is to divide a large number of entries into multiple
sub-groups so we can reduce the complexity to search for the
particular entry. Since we assume that there are no big differences in
entries' sizes within a sub-group, we can pick the entry to evict in
O(1). However, what we really need to avoid here is that we end up
increasing the number of times to evict entries because serializing an
entry to the disk is more costly than searching an entry on memory in
general.

I think that it's no problem in a large-entries subgroup but when it
comes to the smallest-entries subgroup, like for entries consuming
less than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries
consuming 0.001% of the memory limit. Even if we can select 5000
entries quickly, I think the latter would be slower in total. The more
subgroups we create, the more the algorithm gets complex and the
overheads could cause. So I think we need to search for the largest
entry in order to minimize the number of evictions anyway.

Looking for data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove the arbitrary entry and to update the arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
can know the entry's position just after binaryheap_add_unordered()
but it might be changed after heapify. Searching the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap so that it can track the positions for each entry so that
we can remove the arbitrary entry in O(log n) and also update the
arbitrary entry's key in O(log n). This is known as the indexed
priority queue. I've attached the patch for that (0001 and 0002).

That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) (in worst case and O(1) in
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in non-streaming case,
we need to maintain the binaryheap anyway.

Since if the number of transactions being decoded is small, updating
max-heap for each memory counter update could lead to some
regressions, I've measured it with the case where updating memory
counter happens frequently:

setup script:
create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 8000000);

benchmark script:
set work_mem to '3GB';
set logical_decoding_work_mem to '5GB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

Here are results (the median of five executions):

* HEAD
5274.765 ms

* HEAD + 0001-0003 patch
5532.203 ms

There was an approximately 5% performance regression.

An improvement idea is that we use two strategies for updating
max-heap depending on the number of transactions. That is, if the
number of transactions being decoded is small, we add a transaction to
max-heap by binaryheap_add_unordered(), which is O(1), and heapify it
just before picking the largest transactions, which is O(n). That way,
we can minimize the overhead of updating the memory counter. Once the
number of transactions being decoded exceeds the threshold, say 1024,
we use another strategy. We call binaryheap_update_up/down() when
updating the memory counter to preserve heap property, which is O(log
n), and pick the largest transaction in O(1). This strategy minimizes
the cost of picking the largest transactions instead of paying some
costs to update the memory counters.

I've experimented with this idea and run the same tests:

* HEAD + new patches (0001 - 0003)
5277.524 ms

The number looks good. I've attached these patches. Feedback is very welcome.

Few comments:

Thank you for the review comments!

1) Here we are changing memtrack_state to
REORDER_BUFFER_MEM_TRACK_NORMAL immediately once the size is less than
REORDE_BUFFER_MEM_TRACK_THRESHOLD. In this scenario we will be
rebuilding the heap many times if transactions keep getting added and
removed. How about waiting for txn_heap to become less than 95% of
REORDE_BUFFER_MEM_TRACK_THRESHOLD before switching, to avoid building
the heap many times in this scenario?

But by the time we call ReorderBufferLargestTXN() next, we will have
decoded some changes and added or removed transactions, which modifies
the transaction sizes. Is it okay to do the memory counter updates in
O(log n) during that period? I guess your idea works well in the case
where only a few transactions' memory counters are updated before the
next call to ReorderBufferLargestTXN().

I realized that the state could never switch to NORMAL in cases where
the number of transactions drops below the threshold but the total
memory usage doesn't exceed the limit. I'll fix it.

2) I felt the init variable is not needed; we can directly check txn->size
instead, as is done in the else case:
+               bool init = (txn->size == 0);
+
                txn->size += sz;
                rb->size += sz;
                /* Update the total size in the top transaction. */
                toptxn->total_size += sz;
+
+               /* Update the transaction in the max-heap */
+               if (init)
+               {
+                       /* Add the transaction to the max-heap */
+                       if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+                               binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+                       else
+                               binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+               }
+               else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+               {
+                       /*
+                        * If we're maintaining the max-heap even while updating
+                        * the memory counter, we reflect the updates to the max-heap.
+                        */
+                       binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+               }

Okay, we can replace it with "(txn->size - sz) == 0".
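
That is, roughly (a sketch; the memtrack_state branches from the patch are
omitted for brevity):

txn->size += sz;
rb->size += sz;

/* txn->size - sz is the size before this update */
if (txn->size - sz == 0)
{
    /* the transaction just got its first change; add it to the max-heap */
    binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
}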

3) we can add some comments for this:
+typedef enum ReorderBufferMemTrackState
+{
+       REORDER_BUFFER_MEM_TRACK_NORMAL,
+       REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
4) This should be added to typedefs.list:
+typedef enum ReorderBufferMemTrackState
+{
+       REORDER_BUFFER_MEM_TRACK_NORMAL,
+       REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

Will add them.

5)Few typos:
5.a) enlareable should be enlargeable
[PATCH v2 1/4] Make binaryheap enlareable.

5.b) subtranasctions should be subtransactions:
On the other hand, when the number of transactions being decoded is
fairly large, such as when a transaction has many subtranasctions,

5.c) evaludate should be evaluate:
XXX: updating the transaction's memory counter and the max-heap is now
O(log n), so we need to evaludate it. If there are some regression, we

5.d) pudate should be update:
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it

5.e) frquently should be frequently:
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't

Thanks, will fix them.

6) This should be added to typedefs.list:
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+       bh_node_type    key;
+       char                    status;
+       int                             idx;
+} bh_nodeidx_entry;

Right, will add it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#25Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Shubham Khanna (#22)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna
<khannashubham1197@gmail.com> wrote:

On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

I wanted to mean that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.

On further analysis, I realized that the approach discussed here might
not be the way to go. The idea of dividing transactions into several
subgroups is to divide a large number of entries into multiple
sub-groups so we can reduce the complexity to search for the
particular entry. Since we assume that there are no big differences in
entries' sizes within a sub-group, we can pick the entry to evict in
O(1). However, what we really need to avoid here is that we end up
increasing the number of times to evict entries because serializing an
entry to the disk is more costly than searching an entry on memory in
general.

I think that it's no problem in a large-entries subgroup but when it
comes to the smallest-entries subgroup, like for entries consuming
less than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries
consuming 0.001% of the memory limit. Even if we can select 5000
entries quickly, I think the latter would be slower in total. The more
subgroups we create, the more the algorithm gets complex and the
overheads could cause. So I think we need to search for the largest
entry in order to minimize the number of evictions anyway.

Looking for data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove the arbitrary entry and to update the arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
can know the entry's position just after binaryheap_add_unordered()
but it might be changed after heapify. Searching the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap so that it can track the positions for each entry so that
we can remove the arbitrary entry in O(log n) and also update the
arbitrary entry's key in O(log n). This is known as the indexed
priority queue. I've attached the patch for that (0001 and 0002).

That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) (in worst case and O(1) in
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in non-streaming case,
we need to maintain the binaryheap anyway. I've attached the patch for
that (0003).

Here is a test script for the many-subtransactions case:

create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(50000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

and here are results:

* HEAD: 16877.281 ms
* HEAD w/ patches (0001 and 0002): 655.154 ms

There is a huge improvement in the many-subtransactions case.

I have run the same test and found around a 12.53x improvement (the
median of five executions):
HEAD | HEAD + v2-0001 + v2-0002 + v2-0003 patch
29197ms | 2329ms

I had also run the regression test that you had shared at [1]; there
was a very slight dip in this case: it takes around 0.31% more time:
HEAD | HEAD + v2-0001 + v2-0002 + v2-0003 patch
4459ms | 4473ms

Thank you for doing a benchmark test with the latest patches!

I'm going to submit the new version patches next week.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#26Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#23)
RE: Improve eviction algorithm in ReorderBuffer

Dear Sawada-san,

Thank you for the review comments!

A comment for 0001:

01.
```
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+    if (heap->bh_size < heap->bh_space)
+        return;
+
+    heap->bh_space *= 2;
+    heap->bh_nodes = repalloc(heap->bh_nodes,
+                              sizeof(bh_node_type) * heap->bh_space);
+}
```

I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
structure is a public one, and arbitrary code and extensions can refer to
bh_nodes directly. But if the array is repalloc()'d, the pointer would be
updated, so their reference could become a dangling pointer.

Hmm, I'm not sure this is a case that we really need to worry about,
and I cannot come up with a good use case where extensions refer to
bh_nodes directly rather than through binaryheap. In the PostgreSQL
code, many Nodes already have pointers and are exposed.

Actually, me neither. I could not come up with a use-case - I just
mentioned the possibility. If it is not a real issue, we can ignore it.

04.
```
extern binaryheap *binaryheap_allocate(int capacity,
binaryheap_comparator compare,
-                                       void *arg);
+                                       bool indexed, void *arg);
```

I felt the pre-existing API should not be changed. How about adding
binaryheap_allocate_extended() or something which can specify the
`bool indexed`?

binaryheap_allocate() sets heap->bh_indexed to false.

I'm really not sure it's worth inventing a
binaryheap_allocate_extended() function just for preserving API
compatibility. I think it's generally a good idea to have
xxx_extended() function to increase readability and usability, for
example, for the case where the same (kind of default) arguments are
passed in most cases and the function is called from many places.
However, we have only a handful of binaryheap_allocate() callers, and I
believe that it would not hurt the existing callers.

I kept (external) extensions which use the binaryheap APIs in mind.
I thought we could avoid raising the cost of updating their code. But I
understand the change is small, so ... up to you.
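
For the record, the suggestion would amount to a thin compatibility
wrapper, roughly like this (binaryheap_allocate_extended() is the
hypothetical new function, not something in the posted patches):

```
binaryheap *
binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
{
	/* keep the old signature; the new capability is opt-in */
	return binaryheap_allocate_extended(capacity, compare, false, arg);
}
```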

05.
```
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
```

IIUC, callers must consider whether the node should be sifted up or down
and use the appropriate function, right? I felt it may not be user-friendly.

Right, I couldn't come up with a better interface.

Another idea I've considered was that the caller provides a callback
function where it can compare the old and new keys. For example, in
reorderbuffer case, we call like:

binaryheap_update(rb->txn_heap, PointerGetDatum(txn),
ReorderBufferTXNUpdateCompare, (void *) &old_size);

Then in ReorderBufferTXNUpdateCompare(),
ReorderBufferTXN *txn = (ReorderBufferTXN *) a;
Size old_size = *(Size *) b;
(compare txn->size to old_size ...)

However it seems complicated...

I considered a similar approach: accept the old node as an argument of a
compare function. But it requires further memory allocation. Does someone
have a better idea?
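
To sketch the discussed (but not adopted) idea a bit more concretely -
the hypothetical binaryheap_update() would pass the old size wrapped as
the second Datum, and the callback would decide the sift direction;
none of this is in any posted patch:

```
/* Hypothetical: decide the sift direction by comparing old and new keys */
static int
ReorderBufferTXNUpdateCompare(Datum a, Datum b, void *arg)
{
	ReorderBufferTXN *txn = (ReorderBufferTXN *) DatumGetPointer(a);
	Size		old_size = *(Size *) DatumGetPointer(b);

	if (txn->size > old_size)
		return 1;				/* key grew: sift up */
	if (txn->size < old_size)
		return -1;				/* key shrank: sift down */
	return 0;
}
```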

08.
IIUC, if more than 1024 transactions are running but they have a small
amount of changes, the performance may be degraded, right? Do you have a
result in such a case?

I've run a benchmark test that I shared before[1]. Here are results of
decoding a transaction that has 1M subtransactions, each of which has 1
INSERT:

HEAD:
1810.192 ms

HEAD w/ patch:
2001.094 ms

I set logical_decoding_work_mem to a value large enough not to evict
any transactions. I can see about a 10% performance regression in
this case.

Thanks for running. I think this workload is the worst and an extreme case
which would not occur on a real system (such a system should be fixed), so
we can say that the regression is up to -10%. I felt it could be negligible,
but what do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

#27Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#25)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Feb 2, 2024 at 5:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna
<khannashubham1197@gmail.com> wrote:

On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.

I meant that there are three lists in total: the first one
maintains the transactions consuming more than 10% of
logical_decoding_work_mem,

How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?

Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).

Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.

On further analysis, I realized that the approach discussed here might
not be the way to go. The idea of dividing transactions into several
subgroups is to divide a large number of entries into multiple
sub-groups so we can reduce the complexity of searching for a
particular entry. Since we assume that there are no big differences in
entries' sizes within a sub-group, we can pick the entry to evict in
O(1). However, what we really need to avoid here is ending up
increasing the number of evictions, because serializing an entry to
disk is generally more costly than searching for an entry in memory.

I think that it's no problem in a large-entries subgroup but when it
comes to the smallest-entries subgroup, like for entries consuming
less than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries
consuming 0.001% of the memory limit. Even if we can select 5000
entries quickly, I think the latter would be slower in total. The more
subgroups we create, the more complex the algorithm becomes and the
more overhead it could incur. So I think we need to search for the
largest entry in order to minimize the number of evictions anyway.

Looking for data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove an arbitrary entry and to update an arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
can know the entry's position just after binaryheap_add_unordered(),
but it might change after heapify. Searching for the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap that tracks the position of each entry, so that we can
remove an arbitrary entry in O(log n) and also update an arbitrary
entry's key in O(log n). This is known as an indexed priority queue.
I've attached patches for that (0001 and 0002).

That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) in the worst case (and O(1) on
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in the non-streaming
case, we need to maintain the binaryheap anyway. I've attached the
patch for that (0003).

Here is a test script for the many-subtransactions case:

create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(50000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

and here are results:

* HEAD: 16877.281 ms
* HEAD w/ patches (0001 and 0002): 655.154 ms

There is a huge improvement in the many-subtransactions case.

I have run the same test and found around a 12.53x improvement (the
median of five executions):
HEAD | HEAD+ v2-0001+ v2-0002 + v2-0003 patch
29197ms | 2329ms

I had also run the regression test that you had shared at [1]; there
was a very slight dip in this case, taking around 0.31% more
time:
HEAD | HEAD + v2-0001+ v2-0002 + v2-0003 patch
4459ms | 4473ms

Thank you for doing a benchmark test with the latest patches!

I'm going to submit the new version patches next week.

I've attached the new version patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch
From 3d89473738752d991810b86701b2790a27e82734 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v3 3/3] Improve transaction eviction algorithm in
 ReorderBuffer.

Previously, when selecting the largest transaction to evict, we
scanned all transactions, which could be quite slow as it is O(n),
where n is the total number of (top-level and sub) transactions,
especially in cases where there are many subtransactions. It could
lead to a huge replication lag.

This commit changes the eviction algorithm in ReorderBuffer to use a
max-heap keyed by transaction size, and uses two strategies depending
on the number of transactions being decoded.

It could be too expensive to update the max-heap while preserving the
heap property each time the transaction's memory counter is updated, as
it could happen very frequently. So when the number of transactions
being decoded is small, we add the transactions to the max-heap but
don't preserve the heap property, which is O(1). We heapify the
max-heap just before picking the largest transaction, which is O(n).
This strategy minimizes the overhead of updating the transaction's
memory counter.

On the other hand, when the number of transactions being decoded is
fairly large, such as when a transaction has many subtransactions,
selecting the largest transaction in O(n) is too expensive. Therefore,
once the number of transactions being decoded exceeds the
threshold (1024), each time the transaction's memory counter is updated
we update the max-heap while preserving the heap property, which is
O(log n). Picking the largest transaction can then be done in O(1). This
strategy minimizes the cost of picking the largest transaction.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 .../replication/logical/reorderbuffer.c       | 136 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  12 ++
 2 files changed, 132 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c390d96ac3..bc6a8c0810 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,29 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use max-heap with transaction size as the key to find the largest
+ *	  transaction, and use two strategies depending on the number of transactions
+ *	  being decoded:
+ *
+ *	  Since the transaction memory counter is updated frequently, it's expensive
+ *	  to update max-heap while preserving the heap property each time the memory
+ *	  counter is updated. So when the number of transactions is small (i.e.
+ *	  in REORDER_BUFFER_MEM_TRACK_NORMAL state), transactions are added to the
+ *	  max-heap while not preserving the heap property. We heapify it just before
+ *	  picking the largest transaction. In this case, updating the memory counter
+ *	  is done in O(1) whereas picking the largest transaction is done in O(n),
+ *	  where n is the total number of transactions being decoded.
+ *
+ *	  On the other hand, when the number of transactions being decoded is large
+ *	  (i.e. in REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP), such as when a
+ *	  transaction has many subtransactions, selecting the largest transaction in
+ *	  O(n) is too costly. Therefore, each time the memory counter of a transaction
+ *	  is updated, the max-heap is updated while preserving the heap property,
+ *	  and the largest transaction is picked at a low cost. In this case,
+ *	  updating the memory counter is done in O(log n) whereas picking the
+ *	  largest transaction is done in O(1). This minimizes the cost of choosing
+ *	  the largest transaction.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -108,6 +131,11 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -295,6 +323,7 @@ static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
 											bool addition, Size sz);
+static int ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -356,6 +385,14 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * We start with an arbitrary number, which should be enough for most
+	 * cases.
+	 */
+	buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
+	buffer->txn_heap = binaryheap_allocate(1024, ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -3205,6 +3242,32 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		if ((txn->size - sz) == 0)
+		{
+			/* Add the transaction to the max-heap */
+			if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+				binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+
+			/*
+			 * Even if the number of transactions reached
+			 * REORDE_BUFFER_MEM_TRACK_THRESHOLD, we don't switch the state
+			 * immediately since it requires heapifying the max-heap and
+			 * some transactions could finish before reaching the memory
+			 * limit. We could switch the state when the total memory usage
+			 * exceeds the memory limit, in ReorderBufferLargestTXN().
+			 */
+		}
+		else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			/*
+			 * If we're maintaining max-heap even while updating the memory counter,
+			 * we reflect the updates to the max-heap.
+			 */
+			binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3214,6 +3277,35 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		if (txn->size == 0)
+		{
+			/* Remove the transaction from the max-heap */
+			if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+				binaryheap_remove_node_ptr_unordered(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+
+			/*
+			 * Even if the number of transactions falls below
+			 * REORDER_BUFFER_MEM_TRACK_THRESHOLD, it may exceed it and require
+			 * REORDER_BUFFER_MEM_TRACK_THRESHOLD, it may exceed it again and
+			 * require heapifying the max-heap. In this case, maintaining the max-heap
+			 * state, we have a small buffer; when the number of transactions falls
+			 * below 95% of REORDER_BUFFER_MEM_TRACK_THRESHOLD, we switch to the
+			 * normal state.
+			 */
+			if (binaryheap_size(rb->txn_heap) < (REORDE_BUFFER_MEM_TRACK_THRESHOLD * 0.95))
+				rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
+		}
+		else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			/*
+			 * If we're maintaining max-heap even while updating the memory counter,
+			 * we reflect the updates to the max-heap.
+			 */
+			binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3471,32 +3563,27 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
 	ReorderBufferTXN *largest = NULL;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
 	{
-		ReorderBufferTXN *txn = ent->txn;
+		binaryheap_build(rb->txn_heap);
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		/*
+		 * If the number of transactions exceeds the threshold, switch to the
+		 * state where we maintain the max-heap even while updating the memory
+		 * counter.
+		 */
+		if (binaryheap_size(rb->txn_heap) >= REORDE_BUFFER_MEM_TRACK_THRESHOLD)
+			rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
 	}
 
+	largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
+
 	Assert(largest);
 	Assert(largest->size > 0);
 	Assert(largest->size <= rb->size);
@@ -5276,3 +5363,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN	*ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN	*tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..967eb65cb3 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -531,6 +532,13 @@ typedef void (*ReorderBufferUpdateProgressTxnCB) (
 												  ReorderBufferTXN *txn,
 												  XLogRecPtr lsn);
 
+/* How to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+	REORDER_BUFFER_MEM_TRACK_NORMAL,
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
 struct ReorderBuffer
 {
 	/*
@@ -631,6 +639,10 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap	*txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

v3-0002-Add-functions-for-updating-keys-and-removing-node.patch
From c0a4ec9672a0c7da58d980452c232009aecbc5b0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v3 2/3] Add functions for updating keys and removing nodes to
 binaryheap.

Previously, binaryheap didn't support key updates and removing nodes
in an efficient way. For example, in order to remove a node from the
binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally has. This operation can be done
in O(log n) but searching for the key's position is O(n).

This commit adds a hash table to binaryheap to keep track of the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on average and in
O(log n) in the worst case. This is known as an indexed priority queue.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 190 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  36 +++-
 9 files changed, 223 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 2d552f4224..250f226d5f 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -427,6 +427,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 0817868452..1980794cb7 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 67693b0580..f3ec0a8918 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -250,7 +250,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbf0966182..c390d96ac3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1295,6 +1295,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb1ec3b86d..183b91394c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2725,6 +2725,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 256d1e35a4..a044a684c8 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4032,6 +4032,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index f358dd22b9..63b1c3570d 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -404,7 +404,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index bc43aca093..a5bb3b148d 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,28 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -36,7 +56,8 @@ static void sift_up(binaryheap *heap, int node_off);
  * argument specified by 'arg'.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -49,6 +70,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
+	heap->bh_indexed = indexed;
+	if (heap->bh_indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
+
 	return heap;
 }
 
@@ -63,6 +95,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (heap->bh_indexed)
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +108,8 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (heap->bh_indexed)
+		bh_nodeidx_destroy(heap->bh_nodeidx);
 	pfree(heap);
 }
 
@@ -117,6 +154,44 @@ bh_enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at 'idx' and update its position accordingly.
+ */
+static void
+bh_set_node(binaryheap *heap, bh_node_type d, int idx)
+{
+	bh_nodeidx_entry *ent;
+	bool	found;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[idx] = d;
+
+	if (heap->bh_indexed)
+	{
+		/* Remember its index in the nodes array */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, d, &found);
+		ent->idx = idx;
+	}
+}
+
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+	bh_node_type	node = heap->bh_nodes[idx];
+
+	/* Remove overwritten node's index */
+	if (heap->bh_indexed)
+		(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+
+	/* Replace it with the given new node */
+	if (idx < heap->bh_size)
+		bh_set_node(heap, replaced_by, idx);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +206,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 	bh_enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -162,7 +237,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	bh_enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -203,6 +278,10 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+
+		if (heap->bh_indexed)
+			bh_nodeidx_delete(heap->bh_nodeidx, result);
+
 		return result;
 	}
 
@@ -210,7 +289,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -236,7 +315,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	bh_replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -245,6 +324,97 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->idx);
+}
+
+/*
+ * binaryheap_remove_node_ptr_unordered
+ *
+ * Remove the given datum from binaryheap in O(1) without preserving the heap property.
+ * To obtain a valid heap, one must call binaryheap_build() afterwards.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr_unordered(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap));
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	heap->bh_has_heap_property = false;
+	bh_replace_node(heap, ent->idx, heap->bh_nodes[--heap->bh_size]);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_up(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_down(heap, ent->idx);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -257,7 +427,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -299,11 +469,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
 
 /*
@@ -358,9 +528,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..0f6a02573b 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,28 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type	key;
+	char			status;
+	int				idx;
+} bh_nodeidx_entry;
+
+/* define parameters necessary to generate the hash table interface */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +69,19 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_indexed is true, the bh_nodeidx is used to track each
+	 * node's index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+	 */
+	bool		bh_indexed;
+	bh_nodeidx_hash	*bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,7 +90,11 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
+extern void binaryheap_remove_node_ptr_unordered(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
-- 
2.39.3

v3-0001-Make-binaryheap-enlareable.patch
From a2de1140ec2ceb4b29efa907b6170aac04022902 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v3 1/3] Make binaryheap enlareable.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/common/binaryheap.c      | 39 ++++++++++++++++++------------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..bc43aca093 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -104,6 +103,20 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	if (heap->bh_size < heap->bh_space)
+		return;
+
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +128,8 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
-	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+	bh_enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,8 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
-	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+	bh_enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

#28Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#27)
RE: Improve eviction algorithm in ReorderBuffer

Dear Sawada-san,

Thanks for making the v3 patch set. I have also benchmarked the case [1].
The results below are the average of five runs; they are almost the same
even when the median is used for the comparison. On my env, the regression
cannot be seen.

HEAD (1e285a5) HEAD + v3 patches difference
10910.722 ms 10714.540 ms around 1.8%

Also, here are minor comments for the v3 set.

01.
bh_nodeidx_entry and ReorderBufferMemTrackState are missing from typedefs.list.

02. ReorderBufferTXNSizeCompare
Should we assert {ta, tb} are not NULL?

[1]: /messages/by-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv=jzmAp9ex7eyQ@mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

#29Ajin Cherian
itsajin@gmail.com
In reply to: Masahiko Sawada (#27)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

I've attached the new version patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Thanks for the patch. I reviewed it and did minimal testing, and it
seems to show the speed-up as claimed. Some minor comments:
patch 0001:

+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ if (heap->bh_size < heap->bh_space)
+ return;

why not check "if (heap->bh_size >= heap->bh_space)" outside this function,
to avoid calling it when not necessary? This check was there in the code
before the patch.
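
Something like the following at each call site would keep the common
path free of a function call (just a sketch of the suggested shape):

/* make sure there is enough space for a new node */
if (heap->bh_size >= heap->bh_space)
    bh_enlarge_node_array(heap);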

patch 0003:

+/*
+ * The threshold of the number of transactions in the max-heap
(rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024

Typo: I think you meant REORDER_ and not REORDE_

regards,
Ajin Cherian
Fujitsu Australia

#30Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#28)
5 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Feb 8, 2024 at 6:33 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Sawada-san,

Thanks for making the v3 patch set. I have also benchmarked the case [1].
The results below are the average of five runs; they are almost the same
even when the median is used for the comparison. On my env, the regression
cannot be seen.

HEAD (1e285a5) HEAD + v3 patches difference
10910.722 ms 10714.540 ms around 1.8%

Thank you for doing the performance test!

Also, here are minor comments for the v3 set.

01.
bh_nodeidx_entry and ReorderBufferMemTrackState are missing from typedefs.list.

Will add them.

02. ReorderBufferTXNSizeCompare
Should we assert {ta, tb} are not NULL?

Not sure we really need it as other binaryheap users don't have such checks.

On Tue, Feb 6, 2024 at 2:45 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

I've run a benchmark test that I shared before[1]. Here are results of
decoding a transaction that has 1M subtransactions, each of which has 1
INSERT:

HEAD:
1810.192 ms

HEAD w/ patch:
2001.094 ms

I set logical_decoding_work_mem to a value large enough not to evict
any transactions. I can see about a 10% performance regression in
this case.

Thanks for running. I think this workload is the worst and an extreme case
which would not occur on a real system (such a system should be fixed), so we
can say that the regression is up to -10%. I felt it could be negligible, but
what do others think?

I think this performance regression is not acceptable. In this
workload, one transaction has 10k subtransactions and the logical
decoding becomes quite slow if logical_decoding_work_mem is not big
enough. Therefore, it's a legitimate and common approach to increase
logical_decoding_work_mem to speed up the decoding. However, with this
patch, the decoding becomes slower than today. It's a bad idea in
general to optimize an extreme case while sacrificing the normal (or
more common) cases.

Therefore, I've improved the algorithm so that we don't touch the
max-heap at all if the number of transactions is small enough. I've
run a benchmark test with two workloads:

workload-1, decode single transaction with 800k tuples (normal.sql):

* without spill
HEAD: 13235.136 ms
v3 patch: 14320.082 ms
v4 patch: 13300.665 ms

* with spill
HEAD: 22970.204 ms
v3 patch: 23625.649 ms
v4 patch: 23304.366 ms

workload-2, decode one transaction with 100k subtransactions (many-subtxn.sql):

* without spill
HEAD: 345.718 ms
v3 patch: 409.686 ms
v4 patch: 353.026 ms

* with spill
HEAD: 136718.313 ms
v3 patch: 2675.539 ms
v4 patch: 2734.981 ms
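
For reference, the two-state logic boils down to the following
(condensed pseudo-C based on the v3-0003 hunks quoted earlier in the
thread; the removal path and assertions are omitted):

/* In ReorderBufferChangeMemoryUpdate() (the hot path), on addition: */
if ((txn->size - sz) == 0)      /* first change of this transaction */
{
    if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
        binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn)); /* O(1) */
    else
        binaryheap_add(rb->txn_heap, PointerGetDatum(txn)); /* O(log n) */
}
else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
    binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn)); /* O(log n) */

/* In ReorderBufferLargestTXN(), only when eviction is needed: */
if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
{
    binaryheap_build(rb->txn_heap); /* one-shot heapify, O(n) */
    if (binaryheap_size(rb->txn_heap) >= REORDE_BUFFER_MEM_TRACK_THRESHOLD)
        rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
}
largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));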

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Make-binaryheap-enlareable.patch
From 613f5d674bd0f7052fef02e7a10d9c2da930e596 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v4 1/3] Make binaryheap enlareable.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/common/binaryheap.c      | 36 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..6f16c83295 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -104,6 +103,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +125,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +159,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

v4-0002-Add-functions-to-binaryheap-to-efficiently-remove.patch
From 7f2d2e728d24aff04a24842c71be27488b5a62c9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v4 2/3] Add functions to binaryheap to efficiently
 remove/update keys.

Previously, binaryheap didn't support key updates and removing nodes
in an efficient way. For example, in order to remove a node from the
binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).

This commit adds a hash table to binaryheap to keep track of the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on average and in
O(log n) in the worst case. This is known as an indexed priority
queue. The caller can specify to use the indexed binaryheap by
passing indexed = true. There is no user of it yet, but it will be
used by an upcoming patch.

XXX: update typedef.list

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 167 ++++++++++++++++--
 src/include/lib/binaryheap.h                  |  35 +++-
 9 files changed, 199 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 2d552f4224..250f226d5f 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -427,6 +427,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 0817868452..1980794cb7 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 67693b0580..f3ec0a8918 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -250,7 +250,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bbf0966182..c390d96ac3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1295,6 +1295,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index eb1ec3b86d..183b91394c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2725,6 +2725,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 256d1e35a4..a044a684c8 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4032,6 +4032,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index f358dd22b9..63b1c3570d 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -404,7 +404,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 6f16c83295..ff03c477dc 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,28 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -36,7 +56,8 @@ static void sift_up(binaryheap *heap, int node_off);
  * argument specified by 'arg'.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -49,6 +70,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
+	heap->bh_indexed = indexed;
+	if (heap->bh_indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
+
 	return heap;
 }
 
@@ -63,6 +95,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (heap->bh_indexed)
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +108,8 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (heap->bh_indexed)
+		bh_nodeidx_destroy(heap->bh_nodeidx);
 	pfree(heap);
 }
 
@@ -114,6 +151,44 @@ bh_enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at 'idx' and update its position accordingly.
+ */
+static void
+bh_set_node(binaryheap *heap, bh_node_type d, int idx)
+{
+	bh_nodeidx_entry *ent;
+	bool	found;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[idx] = d;
+
+	if (heap->bh_indexed)
+	{
+		/* Remember its index in the nodes array */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, d, &found);
+		ent->idx = idx;
+	}
+}
+
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+	bh_node_type	node = heap->bh_nodes[idx];
+
+	/* Remove overwritten node's index */
+	if (heap->bh_indexed)
+		(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+
+	/* Replace it with the given new node */
+	if (idx < heap->bh_size)
+		bh_set_node(heap, replaced_by, idx);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -130,7 +205,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		bh_enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -163,7 +238,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		bh_enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -204,6 +279,10 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+
+		if (heap->bh_indexed)
+			bh_nodeidx_delete(heap->bh_nodeidx, result);
+
 		return result;
 	}
 
@@ -211,7 +290,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -237,7 +316,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	bh_replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -246,6 +325,74 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_up(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_down(heap, ent->idx);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -258,7 +405,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -300,11 +447,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
 
 /*
@@ -359,9 +506,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..48c2de33b4 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,28 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type	key;
+	char			status;
+	int				idx;
+} bh_nodeidx_entry;
+
+/* define parameters necessary to generate the hash table interface */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +69,19 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_indexed is true, the bh_nodeidx is used to track each
+	 * node's index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr() and binaryheap_update_up/down() in O(log n).
+	 */
+	bool		bh_indexed;
+	bh_nodeidx_hash	*bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,7 +90,10 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
-- 
2.39.3

v4-0003-Use-max-heap-to-evict-largest-transactions-in-Reo.patch (application/octet-stream)
From 5127df17b95e0f283c1cf31ee60b6b7a9ed35eb5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v4 3/3] Use max-heap to evict largest transactions in
 ReorderBuffer.

Previously, when selecting the transaction to evict, we checked all
transactions to find the largest one, which could lead to significant
replication lag, especially when there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer by using a
max-heap with transaction size as the key to find the largest
transaction. The max-heap is managed in two states.

Overall algorithm:

REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
not update the max-heap when updating the memory counter. We build the
max-heap just before selecting large transactions. Therefore, in this
state, we can update the memory counter with no additional costs but
need O(n) time to get the largest transaction, where n is the number of
transactions including top-level transactions and subtransactions.

Once we build the max-heap, we switch to
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
the max-heap when updating the memory counter. The intention is to
efficiently retrieve the largest transaction in O(1) time instead of
incurring the cost of memory counter updates (O(log n)). We remain in
this state as long as the number of transactions is larger than the
threshold, REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back
to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.

The performance benchmark results showed significant speed up (more
than x30 speed up on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

XXX: update typedef.list

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
---
 .../replication/logical/reorderbuffer.c       | 197 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  21 ++
 2 files changed, 189 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c390d96ac3..49923ed244 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,26 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. The max-heap state is managed in two states:
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.
+ *
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
+ *	  not update the max-heap when updating the memory counter. We build the
+ *	  max-heap just before selecting large transactions. Therefore, in this
+ *	  state, we can update the memory counter with no additional costs but
+ *	  need O(n) time to get the largest transaction, where n is the number of
+ *	  transactions including top-level transactions and subtransactions.
+ *
+ *	  Once we build the max-heap, we switch to
+ *	  REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
+ *	  the max-heap when updating the memory counter. The intention is to
+ *	  efficiently retrieve the largest transaction in O(1) time instead of
+ *	  incurring the cost of memory counter updates (O(log n)). We remain in
+ *	  this state as long as the number of transactions is larger than the
+ *	  threshold, REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back
+ *	  to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -108,6 +128,11 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD 1024
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -295,6 +320,9 @@ static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
 											bool addition, Size sz);
+static int ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
+static void ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+										 bool addition, Size sz);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -356,6 +384,15 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * Don't start with a lower number than REORDER_BUFFER_MEM_TRACK_THRESHOLD, since
+	 * we add at least REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+	 */
+	buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+	buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -1499,6 +1536,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	bool		found;
 	dlist_mutable_iter iter;
+	Size		mem_freed = 0;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1528,9 +1566,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		mem_freed += ReorderBufferChangeSize(change);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	Assert(mem_freed == txn->size);
+	if (mem_freed > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
+
 	/*
 	 * Cleanup the tuplecids we stored for decoding catalog snapshot access.
 	 * They are always stored in the toplevel transaction.
@@ -1589,6 +1633,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	/*
+	 * Check if the number of transactions get lower than the threshold. If
+	 * so, switch to NO_MAXHEAP state and reset the max-heap.
+	 *
+	 * XXX: If a new transaction is added and the memory usage reached the
+	 * limit soon, we will end up building the max-heap again. It might be
+	 * more efficient if we accept a certain amount of transactions to switch
+	 * back to the NO_MAXHEAP state, say 95% of the threshold.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
+		(binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD))
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
 }
 
 /*
@@ -3161,16 +3221,6 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Update memory counters to account for the new or removed change.
- *
- * We update two counters - in the reorder buffer, and in the transaction
- * containing the change. The reorder buffer counter allows us to quickly
- * decide if we reached the memory limit, the transaction counter allows
- * us to quickly pick the largest transaction for eviction.
- *
- * When streaming is enabled, we need to update the toplevel transaction
- * counters instead - we don't really care about subtransactions as we
- * can't stream them individually anyway, and we only pick toplevel
- * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -3178,7 +3228,6 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition, Size sz)
 {
 	ReorderBufferTXN *txn;
-	ReorderBufferTXN *toptxn;
 
 	Assert(change->txn);
 
@@ -3192,6 +3241,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	txn = change->txn;
 
+	ReorderBufferTXNMemoryUpdate(rb, txn, addition, sz);
+}
+
+/*
+ * Update memory counter of the given transaction.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
+ */
+static void
+ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 bool addition, Size sz)
+{
+	ReorderBufferTXN *toptxn;
+
 	/*
 	 * Update the total size in top level as well. This is later used to
 	 * compute the decoding stats.
@@ -3205,6 +3276,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3214,6 +3294,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3471,31 +3560,45 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
 	ReorderBufferTXN *largest = NULL;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	/*
+	 * Build the max-heap to pick the largest transaction if not built yet. We will
+	 * run a heap assembly step at the end, which is more efficient.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
 	{
-		ReorderBufferTXN *txn = ent->txn;
+		HASH_SEQ_STATUS hash_seq;
+		ReorderBufferTXNByIdEnt *ent;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		hash_seq_init(&hash_seq, rb->by_txn);
+		while ((ent = hash_seq_search(&hash_seq)) != NULL)
+		{
+			ReorderBufferTXN *txn = ent->txn;
+
+			if (txn->size == 0)
+				continue;
+
+			binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+		}
+
+		binaryheap_build(rb->txn_heap);
+
+		/*
+		 * The max-heap is ready now. We remain in this state at least until
+		 * we free up enough transactions to bring the total memory usage
+		 * below the limit.
+		 */
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
 	}
+	else
+		Assert(binaryheap_size(rb->txn_heap) > 0);
+
+	largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
 
 	Assert(largest);
 	Assert(largest->size > 0);
@@ -3637,6 +3740,18 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		Assert(txn->nentries_mem == 0);
 	}
 
+	/*
+	 * Check the number of transactions in max-heap after evicting large
+	 * transactions. If the number of transactions is small, we switch back
+	 * to the NO_MAXHEAP state, and reset the current max-heap.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
+		(binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD))
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
+
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
@@ -3653,6 +3768,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
 	Size		size = txn->size;
+	Size		mem_freed = 0;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -3706,11 +3822,17 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		mem_freed += ReorderBufferChangeSize(change);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	Assert(mem_freed == txn->size);
+	if (mem_freed > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -5276,3 +5398,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN	*ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN	*tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..f0d352cfcc 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -531,6 +532,22 @@ typedef void (*ReorderBufferUpdateProgressTxnCB) (
 												  ReorderBufferTXN *txn,
 												  XLogRecPtr lsn);
 
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+	/*
+	 * We don't update max-heap while updating the memory counter. The
+	 * max-heap is built before use.
+	 */
+	REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+	/*
+	 * We also update the max-heap when updating the memory counter so
+	 * the heap property is always preserved.
+	 */
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
 struct ReorderBuffer
 {
 	/*
@@ -631,6 +648,10 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap	*txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

normal.sql (application/octet-stream)
many-subtxn.sql (application/octet-stream)
#31Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Ajin Cherian (#29)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the new version patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Thanks for the patch. I reviewed that patch and did minimal testing and it seems to show the speed up as claimed. Some minor comments:

Thank you for the comments!

patch 0001:

+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	if (heap->bh_size < heap->bh_space)
+		return;

why not check "if (heap->bh_size >= heap->bh_space)" outside this function, to avoid calling it when not necessary? This check was there in the code before the patch.

Agreed.
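
So the callers will keep the check, and bh_enlarge_node_array() can then
assume the array is full; roughly like this (just a sketch):

	/* callers check the capacity before enlarging */
	if (heap->bh_size >= heap->bh_space)
		bh_enlarge_node_array(heap);

	bh_set_node(heap, d, heap->bh_size);
	heap->bh_size++;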

patch 0003:

+/*
+ * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024

Typo: I think you meant REORDER_ and not REORDE_

Fixed.

These comments are addressed in the v4 patch set I just shared[1].

Regards,

[1]: /messages/by-id/CAD21AoDhuybyryVkmVkgPY8uVrjGLYchL8EY8-rBm1hbZJpwLw@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#32Ajin Cherian
itsajin@gmail.com
In reply to: Masahiko Sawada (#31)
Re: Improve eviction algorithm in ReorderBuffer

On Sat, Feb 10, 2024 at 2:23 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the new version patch set.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Thanks for the patch. I reviewed that patch and did minimal testing and
it seems to show the speed up as claimed. Some minor comments:

Thank you for the comments!

patch 0001:

+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ if (heap->bh_size < heap->bh_space)
+ return;

why not check "if (heap->bh_size >= heap->bh_space)" outside this
function, to avoid calling it when not necessary? This check was
there in the code before the patch.

Agreed.

patch 0003:

+/*
+ * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024

Typo: I think you meant REORDER_ and not REORDE_

Fixed.

These comments are addressed in the v4 patch set I just shared[1].

These changes look good to me. I've done some tests with a few varying
levels of subtransactions and I could see that the patch was at least 5%
better in all of them.

regards,
Ajin Cherian
Fujitsu Australia

#33vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#30)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Feb 8, 2024 at 6:33 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Sawada-san,

Thanks for making the v3 patchset. I have also benchmarked the case [1].
The results below are the average of 5 runs; they are almost the same
even when the median is used for the comparison. On my env, the
regression cannot be seen.

HEAD (1e285a5)    HEAD + v3 patches    difference
10910.722 ms      10714.540 ms         around 1.8%

Thank you for doing the performance test!

Also, here are mino comments for v3 set.

01.
bh_nodeidx_entry and ReorderBufferMemTrackState are missing in typedefs.list.

Will add them.

02. ReorderBufferTXNSizeCompare
Should we assert {ta, tb} are not NULL?

Not sure we really need it as other binaryheap users don't have such checks.

On Tue, Feb 6, 2024 at 2:45 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

I've run a benchmark test that I shared before[1]. Here are the results
of decoding a transaction that has 1M subtransactions, each of which has
1 INSERT:

HEAD:
1810.192 ms

HEAD w/ patch:
2001.094 ms

I set logical_decoding_work_mem to a value large enough not to evict
any transactions. I can see about a 10% performance regression in
this case.

Thanks for running. I think this workload is the worst and an extreme case which
would not occur on a real system (such a system should be fixed), so we
can say that the regression is up to -10%. I felt it could be negligible, but what
do others think?

I think this performance regression is not acceptable. In this
workload, one transaction has 10k subtransactions and the logical
decoding becomes quite slow if logical_decoding_work_mem is not big
enough. Therefore, it's a legitimate and common approach to increase
logical_decoding_work_mem to speed up the decoding. However, with this
patch, the decoding becomes slower than today. It's a bad idea in
general to optimize an extreme case while sacrificing the normal (or
more common) cases.

Since this same function is also used by pg_dump's TopoSort function,
we should verify once that there is no performance impact when sorting
a large number of objects during a dump (a quick check is sketched
after the diff):
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
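
Something like this, for instance (hypothetical script; the object
count is arbitrary):

	-- create many objects so pg_dump's TopoSort has plenty to sort
	DO $$
	BEGIN
		FOR i IN 1..100000 LOOP
			EXECUTE format('CREATE TABLE IF NOT EXISTS dump_test_%s (a int)', i);
		END LOOP;
	END $$;

	-- then time a schema-only dump, with and without the patch:
	--   time pg_dump --schema-only -f /dev/null <dbname>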

Regards,
Vignesh

#34Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: vignesh C (#33)
6 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

Hi,

I did a basic review and testing of this patch today. Overall I think
the patch is in very good shape - I agree with the tradeoffs it makes,
and I like the approach in general. I do have a couple minor comments
about the code, and then maybe a couple thoughts about the approach.

First, some comments - I'll put them here, but I also kept them in
"review" commits, because that makes it easier to show the exact place
in the code the comment is about.

1) binaryheap_allocate got a new "indexed" argument, but the comment is
not updated to document it

2) I think it's preferable to use descriptive argument names for
bh_set_node. I don't think there's a good reason to keep it short.

3) In a couple places we have code like this:

	if (heap->bh_indexed)
		bh_nodeidx_delete(heap->bh_nodeidx, result);

Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
that it can be called without it.
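
Something along these lines, perhaps (a sketch - since bh_nodeidx_delete
itself is generated by simplehash, it'd have to be a small wrapper, name
made up):

	static void
	bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
	{
		/* no-op unless the caller asked for an indexed heap */
		if (heap->bh_indexed)
			(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
	}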

4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
either expect to find the node (update of an already tracked transaction)
or not (when inserting it). The life cycle may be non-trivial (node
added, updated and removed, ...), so it would be a useful assert, I think.
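
For example (a sketch, with bh_set_node returning the flag):

	static bool
	bh_set_node(binaryheap *heap, bh_node_type node, int index)
	{
		bool		found = false;

		/* set the node in the nodes array */
		heap->bh_nodes[index] = node;

		if (heap->bh_indexed)
		{
			bh_nodeidx_entry *ent;

			/* remember the node's index in the nodes array */
			ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
			ent->idx = index;
		}

		/* callers expecting an update (or an insert) can Assert on this */
		return found;
	}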

5) Do we actually need the various mem_freed local variables in a couple
of places, when we expect the value to be equal to txn->size (there's even
an assert enforcing that)?

6) ReorderBufferCleanupTXN has a comment about maybe not using the same
threshold both to enable & disable usage of the binaryheap. I agree with
that, otherwise we could easily end up "thrashing" if we add/remove
transactions right around the threshold. I think 90-95% for disabling
the heap would work fine.

7) The code disabling binaryheap (based on the threshold) is copied in a
couple places, perhaps it should be a separate function called from
those places.
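
Putting (6) and (7) together, maybe something like this (a sketch, names
made up, and the 90% is arbitrary):

	#define REORDER_BUFFER_MEM_TRACK_DISABLE_THRESHOLD \
		(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 9 / 10)

	/* switch back to NO_MAXHEAP state once the heap got small enough */
	static void
	ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb)
	{
		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
			binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_DISABLE_THRESHOLD)
		{
			rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
			binaryheap_reset(rb->txn_heap);
		}
	}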

8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
memory size check internally, to make the calls simpler.

9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
split maybe not very clear. It's not clear to me why it's divided like
this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.

performance
-----------

I did some benchmarks, to see the behavior in simple good/bad cases (see
the attached scripts.tgz). "large" is one large transaction inserting 1M
rows, small is 64k single-row inserts, and subxacts is the original case
with ~100k subxacts. Finally, subxacts-small is many transactions with
128 subxacts each (the main transactions are concurrent).

The results are pretty good, I think:

test              master    patched
-----------------------------------------------------
large               2587       2459      95%
small                956        856      89%
subxacts          138915       2911       2%
subxacts-small     13632      13187      97%

This is timing (ms) with logical_decoding_work_mem=4MB. I also tried with 64MB,
where the subxact timing goes way down, but the overall conclusions do
not change.

I was a bit surprised I haven't seen any clear regression, but in the
end that's a good thing, right? There are a couple of results in this thread
showing a ~10% regression, but I've been unable to reproduce those.
Perhaps the newer patch versions fix that, I guess.

Anyway, I think that at some point we'd have to accept that some cases
may have slight regression. I think that's inherent for almost any
heuristics - there's always going to be some rare case that defeats it.
What's important is that the case needs to be rare and/or the impact
very limited. And I think that's true here.

overall design
--------------

As for the design, I agree with the approach of using a binaryheap to
track transactions by size. When going over the thread history and the
initial approach of only keeping "large" transactions above some
threshold (e.g. 10%), I was really concerned that it would either lead
to abrupt changes in behavior (when transactions move just around the
10%), or wouldn't help with many common cases (with most transactions
being below the limit).

I was going to suggest some sort of "binning" - keeping lists for
transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
evicting transactions from a list, i.e. based on approximate size. But
if the indexed binary heap seems to be cheap enough, I think it's a
better solution.
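
For the record, the binning might have looked something like this
(hypothetical sketch, I'm not proposing it over the indexed heap):

	/* bucket transactions by log2 of their size */
	#define TXN_NUM_SIZE_BINS	32

	static inline int
	txn_size_bin(Size size)
	{
		Size	kb = size / 1024;
		int		bin = 0;

		/* bin 0: <1kB, bin 1: 1-2kB, bin 2: 2-4kB, ... */
		while (kb > 0 && bin < TXN_NUM_SIZE_BINS - 1)
		{
			kb >>= 1;
			bin++;
		}

		return bin;
	}

with each bin being a dlist of transactions, and eviction scanning from
the highest non-empty bin.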

The one thing I'm a bit concerned about is the threshold used to start
using binary heap - these thresholds with binary decisions may easily
lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
with significant runtime change (e.g. you add/remove one transaction and
the code takes a much more expensive path). The value (1024) seems
rather arbitrary, I wonder if there's something to justify that choice.

In any case, I agree it'd be good to have some dampening factor, to
reduce the risk of thrashing because of adding/removing a single
transaction to the decoding.

related stuff / GenerationContext
---------------------------------

It's not the fault of this patch, but this reminds me I have some doubts
about how the eviction interferes with using the GenerationContext for
some of the data. I suspect we can easily get into a situation where we
evict the largest transaction, but that doesn't actually reduce the
memory usage at all, because the memory context blocks are shared with
some other transactions and don't get 100% empty (so we can't release
them). But it's actually worse, because GenerationContext does not even
reuse this memory. So do we even gain anything by the eviction?

The earlier patch versions also considered the age of the transaction,
to try evicting the older ones first, and I think that was interesting. I
think we may want to do something like this even with the binary heap.
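
For example, the heap comparator might break ties in size by preferring
older transactions (a sketch, not measured):

	static int
	ReorderBufferTXNSizeAgeCompare(Datum a, Datum b, void *arg)
	{
		ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
		ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);

		if (ta->size != tb->size)
			return (ta->size < tb->size) ? -1 : 1;

		/* same size: treat the older transaction (smaller first_lsn) as larger */
		if (ta->first_lsn != tb->first_lsn)
			return (ta->first_lsn > tb->first_lsn) ? -1 : 1;

		return 0;
	}

Whether first_lsn is the right proxy for age is a separate question, of
course.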

related stuff / increase of logical_decoding_work_mem
-----------------------------------------------------

One of the "alternatives to spilling" suggested in the thread was to
enable streaming, but I think there's often a much more efficient
alternative - increase the amount of memory, so that we don't actually
need to spill.

For example, a system may be doing a lot of eviction / spilling with
logical_decoding_work_mem=64MB, but setting 128MB may completely
eliminate that. Of course, if there are large transactions, this may not
be possible (the GUC would have to exceed RAM). But I don't think that's
very common, the incidents that I've observed were often resolved by
bumping the logical_decoding_work_mem by a little bit.

I wonder if there's something we might do to help users tune this. We
should be able to measure the "peak" memory usage (how much memory we'd
need to not spill), so maybe we could log that as a WARNING, similarly
to checkpoints - there we only log "checkpoints too frequent, tune WAL
limits", but perhaps we might do more here? Or maybe we could add the
watermark to the system catalog?
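
FWIW the statistics we already collect get us part of the way. For
example (assuming PG14+ and pg_stat_replication_slots):

	-- growing spill_bytes suggests logical_decoding_work_mem is too low
	SELECT slot_name,
	       spill_txns,
	       pg_size_pretty(spill_bytes) AS spilled,
	       pg_size_pretty(total_bytes) AS decoded
	FROM pg_stat_replication_slots;

But that only tells us that we spilled, not how much memory would have
been needed to avoid it.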

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v5-0005-review.patch (text/x-patch)
From 6dfeb61ffddeedc8e00f8de5eb6b644b28ae1f62 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 23 Feb 2024 13:15:44 +0100
Subject: [PATCH v5 5/5] review

---
 .../replication/logical/reorderbuffer.c       | 32 ++++++++++++-------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f22cf2fb9b8..40fa2ba9843 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1537,7 +1537,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	bool		found;
 	dlist_mutable_iter iter;
-	Size		mem_freed = 0;
+	Size		mem_freed = 0;	/* XXX why don't we use txn->size directly? */
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1571,11 +1571,6 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		ReorderBufferReturnChange(rb, change, false);
 	}
 
-	/* Update the memory counter */
-	Assert(mem_freed == txn->size);
-	if (mem_freed > 0)
-		ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
-
 	/*
 	 * Cleanup the tuplecids we stored for decoding catalog snapshot access.
 	 * They are always stored in the toplevel transaction.
@@ -1635,14 +1630,21 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
 
+	/* Update the memory counter */
+	Assert(mem_freed == txn->size);
+	ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
+
 	/*
-	 * Check if the number of transactions get lower than the threshold. If
+	 * Check if the number of transactions got lower than the threshold. If
 	 * so, switch to NO_MAXHEAP state and reset the max-heap.
 	 *
-	 * XXX: If a new transaction is added and the memory usage reached the
+	 * XXX: If a new transaction is added and the memory usage reaches the
 	 * limit soon, we will end up building the max-heap again. It might be
 	 * more efficient if we accept a certain amount of transactions to switch
 	 * back to the NO_MAXHEAP state, say 95% of the threshold.
+	 *
+	 * XXX Yes, having the enable/disable threshold exactly the same can lead
+	 * to thrashing. Something like 90% would work, I think.
 	 */
 	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
 		(binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD))
@@ -3257,6 +3259,10 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
  * counters instead - we don't really care about subtransactions as we
  * can't stream them individually anyway, and we only pick toplevel
  * transactions for eviction. So only toplevel transactions matter.
+ *
+ * XXX Not sure the naming is great, it seems pretty similar to the earlier
+ * function, can be quite confusing. Why do we even need the separate function
+ * and can't simply call ReorderBufferChangeMemoryUpdate from everywhere?
  */
 static void
 ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3264,6 +3270,9 @@ ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
 {
 	ReorderBufferTXN *toptxn;
 
+	if (sz == 0)
+		return;
+
 	/*
 	 * Update the total size in top level as well. This is later used to
 	 * compute the decoding stats.
@@ -3745,6 +3754,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	 * Check the number of transactions in max-heap after evicting large
 	 * transactions. If the number of transactions is small, we switch back
 	 * to the NO_MAXHEAP state, and reset the current max-heap.
+	 *
+	 * XXX We already have this block elsewhere, maybe have a function?
 	 */
 	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
 		(binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD))
@@ -3769,7 +3780,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
 	Size		size = txn->size;
-	Size		mem_freed = 0;
+	Size		mem_freed = 0;	/* XXX why needed? can't we just use txn->size? */
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -3831,8 +3842,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* Update the memory counter */
 	Assert(mem_freed == txn->size);
-	if (mem_freed > 0)
-		ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
+	ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
 
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
-- 
2.43.0

v5-0004-Use-max-heap-to-evict-largest-transactions-in-Reo.patch (text/x-patch)
From 889d0dc3a3ff203fd382e5020029a78b9334c586 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v5 4/5] Use max-heap to evict largest transactions in
 ReorderBuffer.

Previously, when selecting the transaction to evict, we checked all
transactions to find the largest one, which could lead to significant
replication lag, especially when there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer by using a
max-heap with transaction size as the key to find the largest
transaction. The max-heap is managed in two states.

Overall algorithm:

REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
not update the max-heap when updating the memory counter. We build the
max-heap just before selecting large transactions. Therefore, in this
state, we can update the memory counter with no additional costs but
need O(n) time to get the largest transaction, where n is the number of
transactions including top-level transactions and subtransactions.

Once we build the max-heap, we switch to
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
the max-heap when updating the memory counter. The intention is to
efficiently retrieve the largest transaction in O(1) time instead of
incurring the cost of memory counter updates (O(log n)). We remain in
this state as long as the number of transactions is larger than the
threshold, REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back
to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.

The performance benchmark results showed significant speed up (more
than x30 speed up on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

XXX: update typedef.list

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
---
 .../replication/logical/reorderbuffer.c       | 197 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  21 ++
 2 files changed, 189 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91b9618d7ec..f22cf2fb9b8 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,26 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. The max-heap state is managed in two states:
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.
+ *
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
+ *	  not update the max-heap when updating the memory counter. We build the
+ *	  max-heap just before selecting large transactions. Therefore, in this
+ *	  state, we can update the memory counter with no additional costs but
+ *	  need O(n) time to get the largest transaction, where n is the number of
+ *	  transactions including top-level transactions and subtransactions.
+ *
+ *	  Once we build the max-heap, we switch to
+ *	  REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
+ *	  the max-heap when updating the memory counter. The intention is to
+ *	  efficiently retrieve the largest transaction in O(1) time instead of
+ *	  incurring the cost of memory counter updates (O(log n)). We remain in
+ *	  this state as long as the number of transactions is larger than the
+ *	  threshold, REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back
+ *	  to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -109,6 +129,11 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD 1024
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -296,6 +321,9 @@ static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
 											bool addition, Size sz);
+static int ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
+static void ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+										 bool addition, Size sz);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -357,6 +385,15 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * Don't start with a lower number than REORDER_BUFFER_MEM_TRACK_THRESHOLD, since
+	 * we add at least REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+	 */
+	buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+	buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -1500,6 +1537,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 {
 	bool		found;
 	dlist_mutable_iter iter;
+	Size		mem_freed = 0;
 
 	/* cleanup subtransactions & their changes */
 	dlist_foreach_modify(iter, &txn->subtxns)
@@ -1529,9 +1567,15 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		mem_freed += ReorderBufferChangeSize(change);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	Assert(mem_freed == txn->size);
+	if (mem_freed > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
+
 	/*
 	 * Cleanup the tuplecids we stored for decoding catalog snapshot access.
 	 * They are always stored in the toplevel transaction.
@@ -1590,6 +1634,22 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	/*
+	 * Check if the number of transactions get lower than the threshold. If
+	 * so, switch to NO_MAXHEAP state and reset the max-heap.
+	 *
+	 * XXX: If a new transaction is added and the memory usage reached the
+	 * limit soon, we will end up building the max-heap again. It might be
+	 * more efficient if we accept a certain amount of transactions to switch
+	 * back to the NO_MAXHEAP state, say 95% of the threshold.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
+		(binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD))
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
 }
 
 /*
@@ -3162,16 +3222,6 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 
 /*
  * Update memory counters to account for the new or removed change.
- *
- * We update two counters - in the reorder buffer, and in the transaction
- * containing the change. The reorder buffer counter allows us to quickly
- * decide if we reached the memory limit, the transaction counter allows
- * us to quickly pick the largest transaction for eviction.
- *
- * When streaming is enabled, we need to update the toplevel transaction
- * counters instead - we don't really care about subtransactions as we
- * can't stream them individually anyway, and we only pick toplevel
- * transactions for eviction. So only toplevel transactions matter.
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
@@ -3179,7 +3229,6 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								bool addition, Size sz)
 {
 	ReorderBufferTXN *txn;
-	ReorderBufferTXN *toptxn;
 
 	Assert(change->txn);
 
@@ -3193,6 +3242,28 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	txn = change->txn;
 
+	ReorderBufferTXNMemoryUpdate(rb, txn, addition, sz);
+}
+
+/*
+ * Update memory counter of the given transaction.
+ *
+ * We update two counters - in the reorder buffer, and in the transaction
+ * containing the change. The reorder buffer counter allows us to quickly
+ * decide if we reached the memory limit, the transaction counter allows
+ * us to quickly pick the largest transaction for eviction.
+ *
+ * When streaming is enabled, we need to update the toplevel transaction
+ * counters instead - we don't really care about subtransactions as we
+ * can't stream them individually anyway, and we only pick toplevel
+ * transactions for eviction. So only toplevel transactions matter.
+ */
+static void
+ReorderBufferTXNMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
+							 bool addition, Size sz)
+{
+	ReorderBufferTXN *toptxn;
+
 	/*
 	 * Update the total size in top level as well. This is later used to
 	 * compute the decoding stats.
@@ -3206,6 +3277,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3215,6 +3295,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3472,31 +3561,45 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
 	ReorderBufferTXN *largest = NULL;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	/*
+	 * Build the max-heap to pick the largest transaction if not built yet. We will
+	 * run a heap assembly step at the end, which is more efficient.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
 	{
-		ReorderBufferTXN *txn = ent->txn;
+		HASH_SEQ_STATUS hash_seq;
+		ReorderBufferTXNByIdEnt *ent;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		hash_seq_init(&hash_seq, rb->by_txn);
+		while ((ent = hash_seq_search(&hash_seq)) != NULL)
+		{
+			ReorderBufferTXN *txn = ent->txn;
+
+			if (txn->size == 0)
+				continue;
+
+			binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+		}
+
+		binaryheap_build(rb->txn_heap);
+
+		/*
+		 * The max-heap is ready now. We remain in this state at least until
+		 * we free up enough transactions to bring the total memory usage
+		 * below the limit.
+		 */
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
 	}
+	else
+		Assert(binaryheap_size(rb->txn_heap) > 0);
+
+	largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
 
 	Assert(largest);
 	Assert(largest->size > 0);
@@ -3638,6 +3741,18 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		Assert(txn->nentries_mem == 0);
 	}
 
+	/*
+	 * Check the number of transactions in max-heap after evicting large
+	 * transactions. If the number of transactions is small, we switch back
+	 * to the NO_MAXHEAP state, and reset the current max-heap.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP &&
+		(binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD))
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
+
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
@@ -3654,6 +3769,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	XLogSegNo	curOpenSegNo = 0;
 	Size		spilled = 0;
 	Size		size = txn->size;
+	Size		mem_freed = 0;
 
 	elog(DEBUG2, "spill %u changes in XID %u to disk",
 		 (uint32) txn->nentries_mem, txn->xid);
@@ -3707,11 +3823,17 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		mem_freed += ReorderBufferChangeSize(change);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	Assert(mem_freed == txn->size);
+	if (mem_freed > 0)
+		ReorderBufferTXNMemoryUpdate(rb, txn, false, mem_freed);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -5273,3 +5395,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN	*ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN	*tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa0..f0d352cfcc6 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -531,6 +532,22 @@ typedef void (*ReorderBufferUpdateProgressTxnCB) (
 												  ReorderBufferTXN *txn,
 												  XLogRecPtr lsn);
 
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+	/*
+	 * We don't update max-heap while updating the memory counter. The
+	 * max-heap is built before use.
+	 */
+	REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+	/*
+	 * We also update the max-heap when updating the memory counter so
+	 * the heap property is always preserved.
+	 */
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
 struct ReorderBuffer
 {
 	/*
@@ -631,6 +648,10 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap	*txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.43.0

v5-0003-review.patch (text/x-patch)
From f2b54fbb2bc0b6a74d10f46b086e238d76fe822f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Fri, 23 Feb 2024 13:32:04 +0100
Subject: [PATCH v5 3/5] review

---
 src/common/binaryheap.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index ff03c477dc9..f656c47524e 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -54,6 +54,8 @@ static void sift_up(binaryheap *heap, int node_off);
  * store the given number of nodes, with the heap property defined by
  * the given comparator function, which will be invoked with the additional
  * argument specified by 'arg'.
+ *
+ * XXX Should document the new "indexed" argument.
  */
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare,
@@ -110,6 +112,7 @@ binaryheap_free(binaryheap *heap)
 {
 	if (heap->bh_indexed)
 		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap);
 }
 
@@ -152,28 +155,34 @@ bh_enlarge_node_array(binaryheap *heap)
 }
 
 /*
- * Set the given node at the 'idx' and updates its position accordingly.
+ * Set the given node at the 'index' and updates its position accordingly.
+ *
+ * XXX No need to shorten the argument names, I think.
+ *
+ * XXX Should this return "found" maybe?
  */
 static void
-bh_set_node(binaryheap *heap, bh_node_type d, int idx)
+bh_set_node(binaryheap *heap, bh_node_type node, int index)
 {
 	bh_nodeidx_entry *ent;
 	bool	found;
 
 	/* Set the node to the nodes array */
-	heap->bh_nodes[idx] = d;
+	heap->bh_nodes[index] = node;
 
 	if (heap->bh_indexed)
 	{
 		/* Remember its index in the nodes array */
-		ent = bh_nodeidx_insert(heap->bh_nodeidx, d, &found);
-		ent->idx = idx;
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->idx = index;
 	}
 }
 
 /*
  * Replace the node at 'idx' with the given node 'replaced_by'. Also
  * update their positions accordingly.
+ *
+ * XXX can we do Assert(found) here? if bh_set_node returns it, ofc
  */
 static void
 bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
@@ -280,6 +289,8 @@ binaryheap_remove_first(binaryheap *heap)
 	{
 		heap->bh_size--;
 
+		/* XXX maybe it'd be good to make the check in bh_nodeidx_delete, so that
+		 * we don't need to do it everywhere. */
 		if (heap->bh_indexed)
 			bh_nodeidx_delete(heap->bh_nodeidx, result);
 
-- 
2.43.0

v5-0002-Add-functions-to-binaryheap-to-efficiently-remove.patch (text/x-patch)
From a2a7db6e02344982764b07ec4bf4d509d1dd7ae4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v5 2/5] Add functions to binaryheap to efficiently
 remove/update keys.

Previously, binaryheap didn't support key updates and removing nodes
in an efficient way. For example, in order to remove a node from the
binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).

This commit adds a hash table to binaryheap to track the position of
each node in the binaryheap. That way, by using newly added
functions such as binaryheap_update_up() etc., both updating a key and
removing a node can be done in O(1) on average and
O(log n) in the worst case. This is known as an indexed priority
queue. The caller can specify to use the indexed binaryheap by passing
indexed = true. There is no user of it yet, but it will be used by an
upcoming patch.

XXX: update typedef.list

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 167 ++++++++++++++++--
 src/include/lib/binaryheap.h                  |  35 +++-
 9 files changed, 199 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 2d552f42240..250f226d5f8 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -427,6 +427,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 08178684528..1980794cb7a 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 9c18e4b3efb..36522940dd4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -250,7 +250,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5446df3c647..91b9618d7ec 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1296,6 +1296,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4dc..69f071321dd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2725,6 +2725,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b8..6587a7b0814 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 8ee8a42781a..4d10af3a344 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 6f16c83295d..ff03c477dc9 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,28 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -36,7 +56,8 @@ static void sift_up(binaryheap *heap, int node_off);
  * argument specified by 'arg'.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -49,6 +70,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
+	heap->bh_indexed = indexed;
+	if (heap->bh_indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
+
 	return heap;
 }
 
@@ -63,6 +95,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (heap->bh_indexed)
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +108,8 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (heap->bh_indexed)
+		bh_nodeidx_destroy(heap->bh_nodeidx);
 	pfree(heap);
 }
 
@@ -114,6 +151,44 @@ bh_enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at the 'idx' and updates its position accordingly.
+ */
+static void
+bh_set_node(binaryheap *heap, bh_node_type d, int idx)
+{
+	bh_nodeidx_entry *ent;
+	bool	found;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[idx] = d;
+
+	if (heap->bh_indexed)
+	{
+		/* Remember its index in the nodes array */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, d, &found);
+		ent->idx = idx;
+	}
+}
+
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+	bh_node_type	node = heap->bh_nodes[idx];
+
+	/* Remove overwritten node's index */
+	if (heap->bh_indexed)
+		(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+
+	/* Replace it with the given new node */
+	if (idx < heap->bh_size)
+		bh_set_node(heap, replaced_by, idx);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -130,7 +205,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		bh_enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -163,7 +238,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		bh_enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -204,6 +279,10 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+
+		if (heap->bh_indexed)
+			bh_nodeidx_delete(heap->bh_nodeidx, result);
+
 		return result;
 	}
 
@@ -211,7 +290,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -237,7 +316,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	bh_replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -246,6 +325,74 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_up(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_down(heap, ent->idx);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -258,7 +405,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -300,11 +447,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
 
 /*
@@ -359,9 +506,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f208033..48c2de33b48 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,28 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type	key;
+	char			status;
+	int				idx;
+} bh_nodeidx_entry;
+
+/* define parameters necessary to generate the hash table interface */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +69,19 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_indexed is true, the bh_nodeidx is used to track each
+	 * node's index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+	 */
+	bool		bh_indexed;
+	bh_nodeidx_hash	*bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,7 +90,10 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
-- 
2.43.0

v5-0001-Make-binaryheap-enlareable.patch (text/x-patch)
From 540bfa5568ee07205bc3e18aaec78e02ef2051c0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v5 1/5] Make binaryheap enlargeable.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch-through:
---
 src/common/binaryheap.c      | 36 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf156..6f16c83295d 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -104,6 +103,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +125,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +159,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef1..1439f208033 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.43.0

test-scripts.tgz (application/x-compressed-tar)
#35Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#34)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

> Hi,
>
> I did a basic review and testing of this patch today. Overall I think
> the patch is in very good shape - I agree with the tradeoffs it makes,
> and I like the approach in general. I do have a couple minor comments
> about the code, and then maybe a couple thoughts about the approach.

Thank you for the review comments and tests!

> First, some comments - I'll put them here, but I also kept them in
> "review" commits, because that makes it easier to show the exact place
> in the code the comment is about.
>
> 1) binaryheap_allocate got a new "indexed" argument, but the comment is
> not updated to document it

Fixed.

> 2) I think it's preferable to use descriptive argument names for
> bh_set_node. I don't think there's a good reason to keep it short.

Agreed.

> 3) In a couple places we have code like this:
>
>     if (heap->bh_indexed)
>         bh_nodeidx_delete(heap->bh_nodeidx, result);
>
> Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
> that it can be called without it.

Fixed.
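
For reference, the resulting helper in v6 looks like this (verbatim from the
v6-0002 patch attached below):

    /*
     * Remove the node's index from the hash table if the heap is indexed.
     */
    static void
    bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
    {
        if (!heap->bh_indexed)
            return;

        (void) bh_nodeidx_delete(heap->bh_nodeidx, node);
    }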

> 4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
> either expect to find the node (update of already tracked transaction)
> or not (when inserting it). The life cycle may be non-trivial (node
> added, updated and removed, ...), so it would be a useful assert, I think.

Agreed.

> 5) Do we actually need the various mem_freed local variables in a couple
> places, when we expect the value to be equal to txn->size (there's even
> an assert enforcing that)?

You're right.

> 6) ReorderBufferCleanupTXN has a comment about maybe not using the same
> threshold both to enable & disable usage of the binaryheap. I agree with
> that, otherwise we could easily end up "thrashing" if we add/remove
> transactions right around the threshold. I think 90-95% for disabling
> the heap would work fine.

Agreed.

> 7) The code disabling binaryheap (based on the threshold) is copied in a
> couple places, perhaps it should be a separate function called from
> those places.

Fixed.

> 8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
> memory size check internally, to make the calls simpler.

Agreed.

> 9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
> split is maybe not very clear. It's not clear to me why it's divided like
> this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.

I think that now we have two use cases: updating the memory counter
after freeing an individual change, and updating the memory counter
after freeing all changes of the transaction (i.e., making the counter
0). In the former case, we need to check if the change is
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID, but we don't need to pass the
transaction as the change has its transaction. On the other hand, in
the latter case, we don't need the change but need to pass the
transaction. If we do both things in one function, the function would
have two arguments, change and txn, and the callers set whichever one
they know. I've updated the patch accordingly.
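
To make the two call shapes concrete, this is how the call sites end up
looking in the v6-0003 patch below (the last argument is the size delta):

    /* freeing an individual change: the change carries its transaction */
    ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
                                    ReorderBufferChangeSize(change));

    /* freeing all changes of a transaction: pass the txn directly */
    ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);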

BTW it might be worth considering creating a separate patch for the
updates around ReorderBufferChangeMemoryUpdate() that batches the
memory counter updates, as it seems an independent change from the
max-heap stuff.

> performance
> -----------
>
> I did some benchmarks, to see the behavior in simple good/bad cases (see
> the attached scripts.tgz). "large" is one large transaction inserting 1M
> rows, small is 64k single-row inserts, and subxacts is the original case
> with ~100k subxacts. Finally, subxacts-small is many transactions with
> 128 subxacts each (the main transactions are concurrent).
>
> The results are pretty good, I think:
>
>     test             master   patched
>     ------------------------------------------
>     large              2587      2459     95%
>     small               956       856     89%
>     subxacts         138915      2911      2%
>     subxacts-small    13632     13187     97%

Thank you for doing the performance test. I ran the same script you
shared on my machine just in case and got similar results:

                  master   patched
large:              2831      2827
small:              1226      1222
subxacts:         134076      2744
subxacts-small:    23384     23127

In my case, the differences seem to be within a noise range.

> This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB,
> where the subxact timing goes way down, but the overall conclusions do
> not change.
>
> I was a bit surprised I haven't seen any clear regression, but in the
> end that's a good thing, right? There's a couple results in this thread
> showing ~10% regression, but I've been unable to reproduce those.
> Perhaps the newer patch versions fix that, I guess.

Yes, the 10% regression is fixed in the v4 patch. We don't update the
max-heap at all until the number of transactions reaches the threshold,
so I think there is mostly zero overhead in normal cases.

> Anyway, I think that at some point we'd have to accept that some cases
> may have slight regression. I think that's inherent for almost any
> heuristics - there's always going to be some rare case that defeats it.
> What's important is that the case needs to be rare and/or the impact
> very limited. And I think that's true here.

Agreed.

> overall design
> --------------
>
> As for the design, I agree with the approach of using a binaryheap to
> track transactions by size. When going over the thread history,
> describing the initial approach with only keeping "large" transactions
> above some threshold (e.g. 10%), I was really concerned that'll either
> lead to abrupt changes in behavior (when transactions move just around
> the 10%), or won't help with many common cases (with most transactions
> being below the limit).
>
> I was going to suggest some sort of "binning" - keeping lists for
> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
> evicting transactions from a list, i.e. based on approximate size. But
> if the indexed binary heap seems to be cheap enough, I think it's a
> better solution.

I've also considered the binning idea. But it was not clear to me how
well it works in a case where all transactions belong to the same
class. For example, if we need to free up 1MB of memory, we could end
up evicting 2000 transactions consuming 50 bytes each instead of 100
transactions consuming 1000 bytes each, meaning we end up serializing
more transactions. Also, I'm concerned about the cost of maintaining
the binning lists.
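
Just to illustrate what I mean (a sketch only, not from any posted patch),
the binning approach would classify transactions by power-of-two size
class, roughly like:

    #define TXN_NBINS 24            /* <1kB, 1-2kB, 2-4kB, ... */

    /* hypothetical: map a transaction's size to its bin */
    static int
    txn_size_bin(Size size)
    {
        int     bin = 0;

        size >>= 10;                /* everything below 1kB goes to bin 0 */
        while (size > 0 && bin < TXN_NBINS - 1)
        {
            size >>= 1;
            bin++;
        }
        return bin;
    }

and the problem above is that all transactions within one bin are
indistinguishable, so eviction may pick many small transactions where one
large one would do.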

> The one thing I'm a bit concerned about is the threshold used to start
> using binary heap - these thresholds with binary decisions may easily
> lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
> with significant runtime change (e.g. you add/remove one transaction and
> the code takes a much more expensive path). The value (1024) seems
> rather arbitrary, I wonder if there's something to justify that choice.

True. 1024 seems small to me. In my environment, I started to see a
big difference from around 40000 transactions. But it varies depending
on the environment and workload.

I think that this performance problem we're addressing doesn't
normally happen as long as all transactions being decoded are
top-level transactions. Otherwise, we also need to improve
ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
max_connections = 1024 is a possible value in some systems, and I've
observed such systems sometimes. On the other hand, I've observed >
5000 in just a few cases, and having more than 5000 transactions in
ReorderBuffer seems unlikely to happen without subtransactions. I
think we can say it's an extreme case, though the number is still
somewhat arbitrary.

Or probably we can compute the threshold based on max_connections,
e.g., max_connections * 10. That way, we can ensure that users won't
incur the max-heap maintenance costs as long as they don't use
subtransactions.
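
As a sketch of that idea (hypothetical, not in the attached patches):

    /*
     * Hypothetical: derive the switch-over threshold from max_connections
     * instead of hard-coding it, so that workloads that don't use
     * subtransactions never pay the max-heap maintenance cost.
     */
    #define REORDER_BUFFER_MEM_TRACK_THRESHOLD  Max(1024, max_connections * 10)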

> In any case, I agree it'd be good to have some dampening factor, to
> reduce the risk of thrashing because of adding/removing a single
> transaction to the decoding.

> related stuff / GenerationContext
> ---------------------------------
>
> It's not the fault of this patch, but this reminds me I have some doubts
> about how the eviction interferes with using the GenerationContext for
> some of the data. I suspect we can easily get into a situation where we
> evict the largest transaction, but that doesn't actually reduce the
> memory usage at all, because the memory context blocks are shared with
> some other transactions and don't get 100% empty (so we can't release
> them). But it's actually worse, because GenerationContext does not even
> reuse this memory. So do we even gain anything by the eviction?
>
> When the earlier patch versions also considered age of the transaction,
> to try evicting the older ones first, I think that was interesting. I
> think we may want to do something like this even with the binary heap.

Thank you for raising this issue. This is one of the highest priority
items in my backlog. We've seen cases where the logical decoding uses
much more memory than the logical_decoding_work_mem value[1][2] (e.g. it
used 4GB memory even though the logical_decoding_work_mem was 256kB).
I think that the problem would still happen even with this improvement
on the eviction.

I believe these are separate problems we can address, and evicting
large transactions first would still be the right strategy. We might
want to improve how we store changes in memory contexts. For example,
it might be worth having per-transaction memory context so that we can
actually free memory blocks by the eviction. We can discuss it in a
separate thread.
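
As a rough sketch of that idea (hypothetical; the txn->context field and
the block sizes are made up and would need benchmarking):

    /* give each ReorderBufferTXN its own context for its changes */
    txn->context = GenerationContextCreate(rb->context,
                                           "reorderbuffer txn",
                                           SLAB_DEFAULT_BLOCK_SIZE,
                                           SLAB_DEFAULT_BLOCK_SIZE,
                                           SLAB_DEFAULT_BLOCK_SIZE);

    /* evicting or cleaning up the transaction can then free whole blocks */
    MemoryContextDelete(txn->context);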

> related stuff / increase of logical_decoding_work_mem
> -----------------------------------------------------
>
> One of the "alternatives to spilling" suggested in the thread was to
> enable streaming, but I think there's often a much more efficient
> alternative - increase the amount of memory, so that we don't
> actually need to spill.

Agreed.

> For example, a system may be doing a lot of eviction / spilling with
> logical_decoding_work_mem=64MB, but setting 128MB may completely
> eliminate that. Of course, if there are large transactions, this may not
> be possible (the GUC would have to exceed RAM). But I don't think that's
> very common, the incidents that I've observed were often resolved by
> bumping the logical_decoding_work_mem by a little bit.
>
> I wonder if there's something we might do to help users to tune this. We
> should be able to measure the "peak" memory usage (how much memory we'd
> need to not spill), so maybe we could log that as a WARNING, similarly
> to checkpoints - there we only log "checkpoints too frequent, tune WAL
> limits", but perhaps we might do more here? Or maybe we could add the
> watermark to the system catalog?

Interesting ideas.

The statistics such as spill_count shown in the pg_stat_replication_slots
view could already give users a hint to increase
logical_decoding_work_mem. In addition to that, it's an interesting
idea to have the high-water mark in the view.
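
For example (a hypothetical sketch; no such field or view column exists
today), the accounting side would be trivial:

    /* in the memory counter update path */
    rb->size += sz;
    rb->peakSize = Max(rb->peakSize, rb->size);   /* hypothetical high-water mark */

which could then be exposed in pg_stat_replication_slots next to spill_count.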

I've attached updated patches.

Regards,

[1]: https://postgr.es/m/CAMnUB3oYugXCBLSkih+qNsWQPciEwos6g_AMbnz_peNoxfHwyw@mail.gmail.com
[2]: https://postgr.es/m/17974-f8c9d353a62f414d@postgresql.org

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Make-binaryheap-enlargeable.patch (application/octet-stream)
From 65232ff8bbba85a69836b45360494fe590945b5b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v6 1/3] Make binaryheap enlargeable.

The node array space of the binaryheap is doubled when there is no
available space.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/common/binaryheap.c      | 36 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..6f16c83295 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -104,6 +103,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +125,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +159,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

v6-0003-Use-max-heap-to-efficiently-select-largest-transa.patch (application/octet-stream)
From fc203704634e9561680550fd7800e908e8816932 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v6 3/3] Use max-heap to efficiently select largest
 transactions in ReorderBuffer.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, when selecting the transaction to evict during logical
decoding, we checked all transactions to find the largest one, which
could lead to significant replication lag, especially in cases where
there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer by using
a max-heap with transaction size as the key to find the largest
transaction, depending on the number of transactions being decoded.

Overall algorithm:

There are two memory track states: REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP
and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
not update the max-heap when updating the memory counter. We build the
max-heap just before selecting large transactions. Therefore, in this
state, we can update the memory counter with no additional costs but
need O(n) time to get the largest transaction, where n is the number of
transactions including top-level transactions and subtransactions.

Once we build the max-heap, we switch to
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
the max-heap when updating the memory counter. The intention is to
efficiently retrieve the largest transaction in O(1) time instead of
incurring the cost of memory counter updates (O(log n)). We remain in
this state as long as the number of transactions is larger than the
threshold, REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back
to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.

The performance benchmark results showed a significant speedup (more
than 30x on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

XXX: update typedef.list

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Álvaro Herrera, Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 183 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  21 ++
 2 files changed, 176 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91b9618d7e..f077e998a3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,26 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. The max-heap state is managed in two states:
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.
+ *
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
+ *	  not update the max-heap when updating the memory counter. We build the
+ *	  max-heap just before selecting large transactions. Therefore, in this
+ *	  state, we can update the memory counter with no additional costs but
+ *	  need O(n) time to get the largest transaction, where n is the number of
+ *	  transactions including top-level transactions and subtransactions.
+ *
+ *	  Once we build the max-heap, we switch to
+ *	  REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
+ *	  the max-heap when updating the memory counter. The intention is to
+ *	  efficiently retrieve the largest transaction in O(1) time instead of
+ *	  incurring the cost of memory counter updates (O(log n)). We remain in
+ *	  this state as long as the number of transactions is larger than the
+ *	  threshold, REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back
+ *	  to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -109,6 +129,11 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD 1024
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -295,7 +320,10 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
+											ReorderBufferTXN *txn,
 											bool addition, Size sz);
+static int ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
+static void ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -357,6 +385,15 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * Don't start with a lower number than REORDER_BUFFER_MEM_TRACK_THRESHOLD, since
+	 * we add at least REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+	 */
+	buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+	buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -487,7 +524,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 {
 	/* update memory accounting info */
 	if (upd_mem)
-		ReorderBufferChangeMemoryUpdate(rb, change, false,
+		ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
 										ReorderBufferChangeSize(change));
 
 	/* free contained data */
@@ -818,7 +855,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries_mem++;
 
 	/* update memory accounting information */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 
 	/* process partial change */
@@ -1529,7 +1566,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1588,8 +1625,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	/* check the memory track state */
+	ReorderBufferMaybeChangeNoMaxHeap(rb);
 }
 
 /*
@@ -3172,26 +3215,32 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
  * counters instead - we don't really care about subtransactions as we
  * can't stream them individually anyway, and we only pick toplevel
  * transactions for eviction. So only toplevel transactions matter.
+ *
+ * XXX Not sure the naming is great, it seems pretty similar to the earlier
+ * function, can be quite confusing. Why do we even need the separate function
+ * and can't simply call ReorderBufferChangeMemoryUpdate from everywhere?
  */
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
+								ReorderBufferTXN *txn,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
 	ReorderBufferTXN *toptxn;
 
-	Assert(change->txn);
-
 	/*
 	 * Ignore tuple CID changes, because those are not evicted when reaching
 	 * memory limit. So we just don't count them, because it might easily
 	 * trigger a pointless attempt to spill.
 	 */
-	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	if (sz == 0)
 		return;
 
-	txn = change->txn;
+	txn = txn != NULL ? txn : change->txn;
+	Assert(txn != NULL);
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3206,6 +3255,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3215,11 +3273,43 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
 }
 
+/*
+ * Switch to NO_MAXHEAP state and reset the max-heap if the number of
+ * transactions got lower than the threshold.
+ */
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+		return;
+
+	/*
+	 * If we add and remove transactions right around the threshold,
+	 * we could easily end up "thrashing". It is more efficient if we
+	 * accept a certain amount, say 90%, of transactions to switch back
+	 * to the NO_MAXHEAP state.
+	 */
+	if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
+}
+
 /*
  * Add new (relfilelocator, tid) -> (cmin, cmax) mappings.
  *
@@ -3472,31 +3562,45 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
 	ReorderBufferTXN *largest = NULL;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	/*
+	 * Build the max-heap to pick the largest transaction if not built yet.
+	 * We will run a heap assembly step at the end, which is more efficient.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
 	{
-		ReorderBufferTXN *txn = ent->txn;
+		HASH_SEQ_STATUS hash_seq;
+		ReorderBufferTXNByIdEnt *ent;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		hash_seq_init(&hash_seq, rb->by_txn);
+		while ((ent = hash_seq_search(&hash_seq)) != NULL)
+		{
+			ReorderBufferTXN *txn = ent->txn;
+
+			if (txn->size == 0)
+				continue;
+
+			binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+		}
+
+		binaryheap_build(rb->txn_heap);
+
+		/*
+		 * The max-heap is ready now. We remain in this state at least until
+		 * we free up enough transactions to bring the total memory usage
+		 * below the limit.
+		 */
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
 	}
+	else
+		Assert(binaryheap_size(rb->txn_heap) > 0);
+
+	largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
 
 	Assert(largest);
 	Assert(largest->size > 0);
@@ -3638,6 +3742,9 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		Assert(txn->nentries_mem == 0);
 	}
 
+	/* check the memory track state */
+	ReorderBufferMaybeChangeNoMaxHeap(rb);
+
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
@@ -3707,11 +3814,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4493,7 +4603,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * update the accounting too (subtracting the size from the counters). And
 	 * we don't want to underflow there.
 	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -4905,9 +5015,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContextSwitchTo(oldcontext);
 
 	/* subtract the old change size */
-	ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, false, old_size);
 	/* now add the change back, with the correct size */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -5273,3 +5383,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN	*ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN	*tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..f0d352cfcc 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -531,6 +532,22 @@ typedef void (*ReorderBufferUpdateProgressTxnCB) (
 												  ReorderBufferTXN *txn,
 												  XLogRecPtr lsn);
 
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+	/*
+	 * We don't update max-heap while updating the memory counter. The
+	 * max-heap is built before use.
+	 */
+	REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+	/*
+	 * We also update the max-heap when updating the memory counter so
+	 * the heap property is always preserved.
+	 */
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
 struct ReorderBuffer
 {
 	/*
@@ -631,6 +648,10 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap	*txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

v6-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch (application/octet-stream)
From 79930bb8bdc3b78858b063ecc4c59f7a718733aa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v6 2/3] Add functions to binaryheap for efficient key removal
 and update.

Previously, binaryheap didn't support key updates and removing nodes
in an efficient way. For example, in order to remove a node from the
binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).

This commit adds a hash table to binaryheap to track the position of
each node in the binaryheap. That way, by using newly added functions
such as binaryheap_update_up() etc., both updating a key and removing
a node can be done in O(1) on average and O(log n) in the worst
case. This is known as an indexed binary heap. The caller can specify
to use the indexed binaryheap by passing indexed = true.

There is no user of it yet, but it will be used by an upcoming patch.

XXX: update typedef.list

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 190 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  35 +++-
 9 files changed, 222 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 2d552f4224..250f226d5f 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -427,6 +427,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 0817868452..1980794cb7 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 9c18e4b3ef..36522940dd 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -250,7 +250,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5446df3c64..91b9618d7e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1296,6 +1296,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4d..69f071321d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2725,6 +2725,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..6587a7b081 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 8ee8a42781..4d10af3a34 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 6f16c83295..c3d36b352d 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,28 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -34,9 +54,14 @@ static void sift_up(binaryheap *heap, int node_off);
  * store the given number of nodes, with the heap property defined by
  * the given comparator function, which will be invoked with the additional
  * argument specified by 'arg'.
+ *
+ * If 'indexed' is true, we create a hash table to track each node's
+ * index in the heap, enabling operations such as removing a given
+ * node from the heap.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -49,6 +74,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
+	heap->bh_indexed = indexed;
+	if (heap->bh_indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
+
 	return heap;
 }
 
@@ -63,6 +99,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (heap->bh_indexed)
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +112,9 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (heap->bh_indexed)
+		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap);
 }
 
@@ -114,6 +156,64 @@ bh_enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at 'index', and update its tracked position accordingly.
+ *
+ * Return true if the node's index was already tracked.
+ */
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)
+{
+	bh_nodeidx_entry *ent;
+	bool	found = false;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	if (heap->bh_indexed)
+	{
+		/* Remember its index in the nodes array */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->idx = index;
+	}
+
+	return found;
+}
+
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static void
+bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (!heap->bh_indexed)
+		return;
+
+	(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+}
+
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+	/* Remove overwritten node's index */
+	bh_delete_nodeidx(heap, heap->bh_nodes[idx]);
+
+	/* Replace it with the given new node */
+	if (idx < heap->bh_size)
+	{
+		bool found PG_USED_FOR_ASSERTS_ONLY;
+
+		found = bh_set_node(heap, replaced_by, idx);
+
+		/* The overwritten node's index must already be tracked */
+		Assert(!heap->bh_indexed || found);
+	}
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -130,7 +230,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		bh_enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -163,7 +263,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		bh_enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -204,6 +304,8 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		bh_delete_nodeidx(heap, result);
+
 		return result;
 	}
 
@@ -211,7 +313,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -237,7 +339,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	bh_replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -246,6 +348,74 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_up(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_down(heap, ent->idx);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -258,7 +428,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -300,11 +470,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
 
 /*
@@ -359,9 +529,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..48c2de33b4 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,28 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type	key;
+	char			status;
+	int				idx;
+} bh_nodeidx_entry;
+
+/* define parameters necessary to generate the hash table interface */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +69,19 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_indexed is true, bh_nodeidx is used to track each node's
+	 * index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+	 */
+	bool		bh_indexed;
+	bh_nodeidx_hash	*bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,7 +90,10 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
-- 
2.39.3

#36Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#33)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Feb 23, 2024 at 6:24 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think this performance regression is not acceptable. In this
workload, one transaction has 10k subtransactions and the logical
decoding becomes quite slow if logical_decoding_work_mem is not big
enough. Therefore, it's a legitimate and common approach to increase
logical_decoding_work_mem to speed up the decoding. However, with this
patch, the decoding becomes slower than today. It's a bad idea in
general to optimize an extreme case while sacrificing the normal (or
more common) cases.

Since this same function is also used by pg_dump's TopoSort function,
we can verify once that there is no performance impact with a large
number of objects during dump sorting:

Okay. I've run the pg_dump regression tests with the --timer flag (note
that pg_dump doesn't use the indexed binary heap):

master:
[16:00:25] t/001_basic.pl ................ ok 151 ms ( 0.00 usr
0.00 sys + 0.09 cusr 0.06 csys = 0.15 CPU)
[16:00:25] t/002_pg_dump.pl .............. ok 10157 ms ( 0.23 usr
0.01 sys + 1.48 cusr 0.37 csys = 2.09 CPU)
[16:00:36] t/003_pg_dump_with_server.pl .. ok 504 ms ( 0.00 usr
0.01 sys + 0.10 cusr 0.07 csys = 0.18 CPU)
[16:00:36] t/004_pg_dump_parallel.pl ..... ok 1044 ms ( 0.00 usr
0.00 sys + 0.12 cusr 0.08 csys = 0.20 CPU)
[16:00:37] t/005_pg_dump_filterfile.pl ... ok 2390 ms ( 0.00 usr
0.00 sys + 0.34 cusr 0.19 csys = 0.53 CPU)
[16:00:40] t/010_dump_connstr.pl ......... ok 4813 ms ( 0.01 usr
0.00 sys + 2.13 cusr 0.45 csys = 2.59 CPU)

patched:
[15:59:47] t/001_basic.pl ................ ok 150 ms ( 0.00 usr
0.00 sys + 0.08 cusr 0.07 csys = 0.15 CPU)
[15:59:47] t/002_pg_dump.pl .............. ok 10057 ms ( 0.23 usr
0.02 sys + 1.49 cusr 0.36 csys = 2.10 CPU)
[15:59:57] t/003_pg_dump_with_server.pl .. ok 509 ms ( 0.00 usr
0.00 sys + 0.09 cusr 0.08 csys = 0.17 CPU)
[15:59:58] t/004_pg_dump_parallel.pl ..... ok 1048 ms ( 0.01 usr
0.00 sys + 0.11 cusr 0.11 csys = 0.23 CPU)
[15:59:59] t/005_pg_dump_filterfile.pl ... ok 2398 ms ( 0.00 usr
0.00 sys + 0.34 cusr 0.20 csys = 0.54 CPU)
[16:00:01] t/010_dump_connstr.pl ......... ok 4762 ms ( 0.01 usr
0.00 sys + 2.15 cusr 0.42 csys = 2.58 CPU)

There is no noticeable difference between the two results.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#37Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Masahiko Sawada (#35)
Re: Improve eviction algorithm in ReorderBuffer

On 2/26/24 07:46, Masahiko Sawada wrote:

On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

...

overall design
--------------

As for the design, I agree with the approach of using a binaryheap to
track transactions by size. When going over the thread history,
describing the initial approach with only keeping "large" transactions
above some threshold (e.g. 10%), I was really concerned that'll either
lead to abrupt changes in behavior (when transactions move just around
the 10%), or won't help with many common cases (with most transactions
being below the limit).

I was going to suggest some sort of "binning" - keeping lists for
transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
evicting transactions from a list, i.e. based on approximate size. But
if the indexed binary heap seems to be cheap enough, I think it's a
better solution.

I've also considered the binning idea. But it was not clear to me how
well it works in a case where all transactions fall into the same
class. For example, if we need to free up 1MB of memory, we could end
up evicting 2000 transactions consuming 50 bytes each instead of 100
transactions consuming 1000 bytes each, with the result that we end up
serializing more transactions. Also, I'm concerned about the cost of
maintaining the binning lists.

I don't think the list maintenance would be very costly - in particular,
the lists would not need to be sorted by size. You're right that in
some extreme cases we might evict the smallest transactions in the list. I
think on average we'd evict transactions with average size, which seems
OK for this use case.
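
To illustrate the kind of maintenance I had in mind, here is a minimal
sketch (the names and bin boundaries are made up, this is not from any
patch version): bucket transactions into dlist bins by power-of-two
size class, and move a transaction between bins whenever its size
changes.

#define RB_NBINS	24	/* hypothetical number of size classes */

static dlist_head rb_bins[RB_NBINS];

static int
rb_size_class(Size sz)
{
	int			bin = 0;

	/* bin 0 holds transactions below 1kB, each further bin doubles */
	while (sz >= 1024 && bin < RB_NBINS - 1)
	{
		sz >>= 1;
		bin++;
	}
	return bin;
}

On every size change, dlist_delete(&txn->bin_node) and push the txn
onto rb_bins[rb_size_class(txn->size)] (bin_node being a hypothetical
dlist_node field in ReorderBufferTXN). Eviction then pops from the
largest non-empty bin, so no sorting within a bin is needed.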

Anyway, I don't think we need to be distracted with this. I mentioned it
merely to show it was considered, but the heap seems to work well
enough, and in the end is even simpler because the complexity is hidden
outside reorderbuffer.

The one thing I'm a bit concerned about is the threshold used to start
using binary heap - these thresholds with binary decisions may easily
lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
with significant runtime change (e.g. you add/remove one transaction and
the code takes a much more expensive path). The value (1024) seems
rather arbitrary, I wonder if there's something to justify that choice.

True. 1024 seems small to me. In my environment, I started to see a
big difference from around 40000 transactions. But it varies depending
on the environment and workload.

I think that this performance problem we're addressing doesn't
normally happen as long as all transactions being decoded are
top-level transactions. Otherwise, we also need to improve
ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
max_connections = 1024 is a plausible value on some systems, and I've
occasionally observed such systems. On the other hand, I've observed >
5000 in only a few cases, and having more than 5000 transactions in
the ReorderBuffer seems unlikely to happen without subtransactions. I
think we can call that an extreme case, though the number is still
arbitrary.

Or probably we can compute the threshold based on max_connections,
e.g., max_connections * 10. That way, we can ensure that users won't
incur the max-heap maintenance costs as long as they don't use
subtransactions.

Tying this to max_connections seems like an interesting option. It'd
make this adaptive to a system. I haven't thought about the exact value
(m_c * 10), but it seems better than arbitrary hard-coded values.

In any case, I agree it'd be good to have some dampening factor, to
reduce the risk of thrashing because of adding/removing a single
transaction to the decoding.
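
To make the dampening concrete, a minimal sketch (the names below are
made up): enter the expensive mode at the threshold, but only leave it
again once we are some margin below it.

#define RB_HEAP_ENTER_THRESHOLD	(MaxConnections * 10)
#define RB_HEAP_LEAVE_THRESHOLD	(RB_HEAP_ENTER_THRESHOLD * 9 / 10)

	if (!rb->heap_maintained && ntxns >= RB_HEAP_ENTER_THRESHOLD)
		rb->heap_maintained = true;		/* start maintaining the heap */
	else if (rb->heap_maintained && ntxns < RB_HEAP_LEAVE_THRESHOLD)
		rb->heap_maintained = false;	/* back to the cheap mode */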

related stuff / GenerationContext
---------------------------------

It's not the fault of this patch, but this reminds me I have some doubts
about how the eviction interferes with using the GenerationContext for
some of the data. I suspect we can easily get into a situation where we
evict the largest transaction, but that doesn't actually reduce the
memory usage at all, because the memory context blocks are shared with
some other transactions and don't get 100% empty (so we can't release
them). But it's actually worse, because GenerationContext does not even
reuse this memory. So do we even gain anything by the eviction?

When the earlier patch versions also considered age of the transaction,
to try evicting the older ones first, I think that was interesting. I
think we may want to do something like this even with the binary heap.

Thank you for raising this issue. This is one of the highest priority
items in my backlog. We've seen cases where the logical decoding uses
much more memory than the logical_decoding_work_mem value[1][2] (e.g. it
used 4GB of memory even though logical_decoding_work_mem was 256kB).
I think that the problem would still happen even with this improvement
on the eviction.

I believe these are separate problems we can address, and evicting
large transactions first would still be the right strategy. We might
want to improve how we store changes in memory contexts. For example,
it might be worth having per-transaction memory context so that we can
actually free memory blocks by the eviction. We can discuss it in a
separate thread.

Yes, I think using per-transaction context for large transactions might
work. I don't think we want too many contexts, so we'd start with the
shared context, and then at some point (when the transaction exceeds say
5% of the memory limit) we'd switch it to a separate one.

But that's a matter for a separate patch, so let's discuss elsewhere.
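
For illustration, a sketch of how the switch might look (the
txn->change_context field is hypothetical, nothing like this exists in
the posted patches):

	/* once a txn exceeds ~5% of the limit, give it its own context */
	if (txn->size > logical_decoding_work_mem * 1024L / 20 &&
		txn->change_context == rb->change_context)
		txn->change_context =
			GenerationContextCreate(rb->context, "txn changes",
									SLAB_DEFAULT_BLOCK_SIZE,
									SLAB_DEFAULT_BLOCK_SIZE,
									SLAB_DEFAULT_BLOCK_SIZE);

Evicting such a transaction could then free its memory blocks entirely.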

For example, a system may be doing a lot of eviction / spilling with
logical_decoding_work_mem=64MB, but setting 128MB may completely
eliminate that. Of course, if there are large transactions, this may not
be possible (the GUC would have to exceed RAM). But I don't think that's
very common, the incidents that I've observed were often resolved by
bumping the logical_decoding_work_mem by a little bit.

I wonder if there's something we might do to help users to tune this. We
should be able to measure the "peak" memory usage (how much memory we'd
need to not spill), so maybe we could log that as a WARNING, similarly
to checkpoints - there we only log "checkpoints too frequent, tune WAL
limits", but perhaps we might do more here? Or maybe we could add the
watermark to the system catalog?

Interesting ideas.

The statistics such as spill_count shown in pg_stat_replication_slots
view could already give hints to users to increase the
logical_decoding_work_mem. In addition to that, it's an interesting
idea to have the high water mark in the view.

The spill statistics are useful, but I'm not sure it can answer the main
question:

How high would the memory limit need to be to not spill?

Maybe there's something we can measure / log to help with this.
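
A rough sketch of what such a message could look like (peak_size is a
hypothetical value we would have to track):

	ereport(WARNING,
			(errmsg("logical decoding required up to %zu bytes of memory",
					(size_t) peak_size),
			 errhint("Consider increasing logical_decoding_work_mem (currently %d kB).",
					 logical_decoding_work_mem)));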

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#38Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#37)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Mon, Feb 26, 2024 at 6:43 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 2/26/24 07:46, Masahiko Sawada wrote:

On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

...

overall design
--------------

As for the design, I agree with the approach of using a binaryheap to
track transactions by size. When going over the thread history,
describing the initial approach with only keeping "large" transactions
above some threshold (e.g. 10%), I was really concerned that'll either
lead to abrupt changes in behavior (when transactions move just around
the 10%), or won't help with many common cases (with most transactions
being below the limit).

I was going to suggest some sort of "binning" - keeping lists for
transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
evicting transactions from a list, i.e. based on approximate size. But
if the indexed binary heap seems to be cheap enough, I think it's a
better solution.

I've also considered the binning idea. But it was not clear to me how
well it works in a case where all transactions fall into the same
class. For example, if we need to free up 1MB of memory, we could end
up evicting 2000 transactions consuming 50 bytes each instead of 100
transactions consuming 1000 bytes each, with the result that we end up
serializing more transactions. Also, I'm concerned about the cost of
maintaining the binning lists.

I don't think the list maintenance would be very costly - in particular,
the lists would not need to be sorted by size. You're right that in
some extreme cases we might evict the smallest transactions in the list. I
think on average we'd evict transactions with average size, which seems
OK for this use case.

Anyway, I don't think we need to be distracted with this. I mentioned it
merely to show it was considered, but the heap seems to work well
enough, and in the end is even simpler because the complexity is hidden
outside reorderbuffer.

The one thing I'm a bit concerned about is the threshold used to start
using binary heap - these thresholds with binary decisions may easily
lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
with significant runtime change (e.g. you add/remove one transaction and
the code takes a much more expensive path). The value (1024) seems
rather arbitrary, I wonder if there's something to justify that choice.

True. 1024 seems small to me. In my environment, I started to see a
big difference from around 40000 transactions. But it varies depending
on the environment and workload.

I think that this performance problem we're addressing doesn't
normally happen as long as all transactions being decoded are
top-level transactions. Otherwise, we also need to improve
ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
max_connections = 1024 is a plausible value on some systems, and I've
occasionally observed such systems. On the other hand, I've observed >
5000 in only a few cases, and having more than 5000 transactions in
the ReorderBuffer seems unlikely to happen without subtransactions. I
think we can call that an extreme case, though the number is still
arbitrary.

Or probably we can compute the threshold based on max_connections,
e.g., max_connections * 10. That way, we can ensure that users won't
incur the max-heap maintenance costs as long as they don't use
subtransactions.

Tying this to max_connections seems like an interesting option. It'd
make this adaptive to a system. I haven't thought about the exact value
(m_c * 10), but it seems better than arbitrary hard-coded values.

I've updated the patch accordingly, using MaxConnections for now. I've
also updated some comments and commit messages and added typedef.list
changes.

In any case, I agree it'd be good to have some dampening factor, to
reduce the risk of thrashing because of adding/removing a single
transaction to the decoding.

related stuff / GenerationContext
---------------------------------

It's not the fault of this patch, but this reminds me I have some doubts
about how the eviction interferes with using the GenerationContext for
some of the data. I suspect we can easily get into a situation where we
evict the largest transaction, but that doesn't actually reduce the
memory usage at all, because the memory context blocks are shared with
some other transactions and don't get 100% empty (so we can't release
them). But it's actually worse, because GenerationContext does not even
reuse this memory. So do we even gain anything by the eviction?

When the earlier patch versions also considered age of the transaction,
to try evicting the older ones first, I think that was interesting. I
think we may want to do something like this even with the binary heap.

Thank you for raising this issue. This is one of the highest priority
items in my backlog. We've seen cases where the logical decoding uses
much more memory than the logical_decoding_work_mem value[1][2] (e.g. it
used 4GB of memory even though logical_decoding_work_mem was 256kB).
I think that the problem would still happen even with this improvement
on the eviction.

I believe these are separate problems we can address, and evicting
large transactions first would still be the right strategy. We might
want to improve how we store changes in memory contexts. For example,
it might be worth having per-transaction memory context so that we can
actually free memory blocks by the eviction. We can discuss it in a
separate thread.

Yes, I think using per-transaction context for large transactions might
work. I don't think we want too many contexts, so we'd start with the
shared context, and then at some point (when the transaction exceeds say
5% of the memory limit) we'd switch it to a separate one.

But that's a matter for a separate patch, so let's discuss elsewhere.

+1

For example, a system may be doing a lot of eviction / spilling with
logical_decoding_work_mem=64MB, but setting 128MB may completely
eliminate that. Of course, if there are large transactions, this may not
be possible (the GUC would have to exceed RAM). But I don't think that's
very common, the incidents that I've observed were often resolved by
bumping the logical_decoding_work_mem by a little bit.

I wonder if there's something we might do to help users to tune this. We
should be able to measure the "peak" memory usage (how much memory we'd
need to not spill), so maybe we could log that as a WARNING, similarly
to checkpoints - there we only log "checkpoints too frequent, tune WAL
limits", but perhaps we might do more here? Or maybe we could add the
watermark to the system catalog?

Interesting ideas.

The statistics such as spill_count shown in pg_stat_replication_slots
view could already give hints to users to increase the
logical_decoding_work_mem. In addition to that, it's an interesting
idea to have the high water mark in the view.

The spill statistics are useful, but I'm not sure it can answer the main
question:

How high would the memory limit need to be to not spill?

Maybe there's something we can measure / log to help with this.

Right. I like the idea of the high watermark. The
pg_stat_replication_slots view would be the place to store such
information. Since the reorder buffer evicts or streams transactions
anyway based on logical_decoding_work_mem, we would probably need to
compute the maximum amount of data in the reorder buffer at any one
point in time, assuming no transactions were evicted or streamed.
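
For example (the field names below are made up), we could maintain a
second counter next to rb->size that is never decreased by spilling or
streaming, and remember its maximum:

	/* in ReorderBufferChangeMemoryUpdate(), next to rb->size */
	if (addition)
	{
		rb->logical_size += sz;		/* not reduced on spill/stream */
		if (rb->logical_size > rb->peak_size)
			rb->peak_size = rb->logical_size;	/* the high watermark */
	}
	else if (!spilling_or_streaming)	/* hypothetical flag */
		rb->logical_size -= sz;

The peak could then be exposed in the pg_stat_replication_slots view.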

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v7-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch (application/octet-stream)
From 037aac9afdc988608afb8cac231a11ce7d4d830b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v7 3/3] Improve eviction algorithm in Reorderbuffer using
 max-heap for many subtransactions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, when selecting the transaction to evict during logical
decoding, we checked all transactions to find the largest one, which
could lead to significant replication lag, especially when there are
many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer using the
max-heap with transaction size as the key to efficiently find the
largest transaction.

Overall algorithm:

There are two memory tracking states: REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP
and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
not update the max-heap when updating the memory counter. We build the
max-heap just before selecting large transactions. Therefore, in this
state, we can update the memory counter with no additional costs but
need O(n) time to get the largest transaction, where n is the number of
transactions including top-level transactions and subtransactions.

Once we build the max-heap, we switch to
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
the max-heap when updating the memory counter. The intention is to
efficiently retrieve the largest transaction in O(1) time, at the cost
of O(log n) memory counter updates. To minimize
the overhead of maintaining the max-heap, we batch memory updates when
cleaning up all changes. We remain in this state as long as the number
of transactions is larger than the threshold,
REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back to
REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.

The performance benchmark results showed a significant speedup (more
than 30x on my machine) in decoding a transaction with 100k
subtransactions, with no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Álvaro Herrera, Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 188 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |  21 ++
 src/tools/pgindent/typedefs.list              |   1 +
 3 files changed, 181 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 91b9618d7e..3bc40fd7b6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,26 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. The max-heap is managed in two states:
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.
+ *
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
+ *	  not update the max-heap when updating the memory counter. We build the
+ *	  max-heap just before selecting large transactions. Therefore, in this
+ *	  state, we can update the memory counter with no additional costs but
+ *	  need O(n) time to get the largest transaction, where n is the number of
+ *	  transactions including top-level transactions and subtransactions.
+ *
+ *	  Once we build the max-heap, we switch to
+ *	  REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
+ *	  the max-heap when updating the memory counter. The intention is to
+ *	  efficiently retrieve the largest transaction in O(1) time, at the
+ *	  cost of O(log n) memory counter updates. We remain in
+ *	  this state as long as the number of transactions is larger than the
+ *	  threshold, REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back
+ *	  to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -109,6 +129,15 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * Threshold of the total number of top-level and sub transactions that controls
+ * whether we switch the memory tracking state. While the MAINTAIN_MAXHEAP state
+ * is effective when there are many transactions being decoded, in many systems
+ * there is generally no need for it as long as all transactions being decoded
+ * are top-level transactions. Therefore, we use MaxConnections as the threshold
+ * so that we avoid switching to that state unless subtransactions are used.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD	MaxConnections
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -295,7 +324,10 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
+											ReorderBufferTXN *txn,
 											bool addition, Size sz);
+static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
+static void ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -357,6 +389,16 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * Don't start with a lower number than
+	 * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
+	 * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+	 */
+	buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+	buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -487,7 +529,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 {
 	/* update memory accounting info */
 	if (upd_mem)
-		ReorderBufferChangeMemoryUpdate(rb, change, false,
+		ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
 										ReorderBufferChangeSize(change));
 
 	/* free contained data */
@@ -818,7 +860,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries_mem++;
 
 	/* update memory accounting information */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 
 	/* process partial change */
@@ -1529,7 +1571,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1588,8 +1630,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	/* check the memory tracking state */
+	ReorderBufferMaybeChangeNoMaxHeap(rb);
 }
 
 /*
@@ -1639,9 +1687,12 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/*
 	 * Mark the transaction as streamed.
 	 *
@@ -3176,22 +3227,24 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
+								ReorderBufferTXN *txn,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
 	ReorderBufferTXN *toptxn;
 
-	Assert(change->txn);
-
 	/*
 	 * Ignore tuple CID changes, because those are not evicted when reaching
 	 * memory limit. So we just don't count them, because it might easily
 	 * trigger a pointless attempt to spill.
 	 */
-	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
-	txn = change->txn;
+	if (sz == 0)
+		return;
+
+	txn = txn != NULL ? txn : change->txn;
+	Assert(txn != NULL);
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3206,6 +3259,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3215,11 +3277,42 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
 }
 
+/*
+ * Switch to the NO_MAXHEAP state and reset the max-heap if the number of
+ * transactions has fallen below the threshold.
+ */
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+		return;
+
+	/*
+	 * If we add and remove transactions right around the threshold, we could
+	 * easily end up "thrashing". To avoid that, we require the transaction
+	 * count to fall 10% below the threshold before switching back.
+	 */
+	if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
+}
+
 /*
  * Add new (relfilelocator, tid) -> (cmin, cmax) mappings.
  *
@@ -3472,31 +3565,45 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
 	ReorderBufferTXN *largest = NULL;
 
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
+	/*
+	 * Build the max-heap to pick the largest transaction if we have not done
+	 * so yet; running a heap assembly step at the end is more efficient.
+	 */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
 	{
-		ReorderBufferTXN *txn = ent->txn;
+		HASH_SEQ_STATUS hash_seq;
+		ReorderBufferTXNByIdEnt *ent;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		hash_seq_init(&hash_seq, rb->by_txn);
+		while ((ent = hash_seq_search(&hash_seq)) != NULL)
+		{
+			ReorderBufferTXN *txn = ent->txn;
+
+			if (txn->size == 0)
+				continue;
+
+			binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+		}
+
+		binaryheap_build(rb->txn_heap);
+
+		/*
+		 * The max-heap is ready now. We remain in this state at least until
+		 * we free up enough transactions to bring the total memory usage
+		 * below the limit.
+		 */
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
 	}
+	else
+		Assert(binaryheap_size(rb->txn_heap) > 0);
+
+	largest = (ReorderBufferTXN *) DatumGetPointer(binaryheap_first(rb->txn_heap));
 
 	Assert(largest);
 	Assert(largest->size > 0);
@@ -3638,6 +3745,9 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		Assert(txn->nentries_mem == 0);
 	}
 
+	/* check the memory tracking state */
+	ReorderBufferMaybeChangeNoMaxHeap(rb);
+
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
@@ -3707,11 +3817,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4493,7 +4606,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * update the accounting too (subtracting the size from the counters). And
 	 * we don't want to underflow there.
 	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -4905,9 +5018,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContextSwitchTo(oldcontext);
 
 	/* subtract the old change size */
-	ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, false, old_size);
 	/* now add the change back, with the correct size */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -5273,3 +5386,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..1f0ad2b94e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -531,6 +532,22 @@ typedef void (*ReorderBufferUpdateProgressTxnCB) (
 												  ReorderBufferTXN *txn,
 												  XLogRecPtr lsn);
 
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+	/*
+	 * We don't update the max-heap while updating the memory counter. The
+	 * max-heap is built before use.
+	 */
+	REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+	/*
+	 * We also update the max-heap when updating the memory counter so the
+	 * heap property is always preserved.
+	 */
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
 struct ReorderBuffer
 {
 	/*
@@ -631,6 +648,10 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap *txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 82ee10afac..b672853858 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4056,3 +4056,4 @@ ws_options
 ws_file_info
 PathKeyInfo
 bh_nodeidx_entry
+ReorderBufferMemTrackState
-- 
2.39.3

v7-0001-Make-binaryheap-enlargeable.patch (application/octet-stream)
From 65232ff8bbba85a69836b45360494fe590945b5b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v7 1/3] Make binaryheap enlargeable.

The binaryheap's node array is now doubled in size whenever it runs
out of space.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/common/binaryheap.c      | 36 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..6f16c83295 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -104,6 +103,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +125,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure there is enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +159,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure there is enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		bh_enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

v7-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch (application/octet-stream)
From a6120d7f4fdac716e6bfe993d8caea61d4bbfadc Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v7 2/3] Add functions to binaryheap for efficient key removal
 and update.

Previously, binaryheap didn't support updating a key or removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller had to pass the node's position within the
array that the binaryheap maintains internally. Removing a node from
the binaryheap is done in O(log n), but searching for the key's
position is done in O(n).

This commit adds a hash table to binaryheap in order to track the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on average and O(log n)
in the worst case. This is known as an indexed binary heap. The
caller can request an indexed binaryheap by passing indexed = true.

There are no users of it yet, but it will be used by an upcoming patch.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 190 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  35 +++-
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 223 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 2d552f4224..250f226d5f 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -427,6 +427,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 0817868452..1980794cb7 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 9c18e4b3ef..36522940dd 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -250,7 +250,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5446df3c64..91b9618d7e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1296,6 +1296,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4d..69f071321d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2725,6 +2725,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..6587a7b081 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 8ee8a42781..4d10af3a34 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 6f16c83295..a4bc64589b 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,28 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -34,9 +54,14 @@ static void sift_up(binaryheap *heap, int node_off);
  * store the given number of nodes, with the heap property defined by
  * the given comparator function, which will be invoked with the additional
  * argument specified by 'arg'.
+ *
+ * If 'indexed' is true, we create a hash table to track each node's
+ * index in the heap, enabling operations such as removing a given
+ * node from the heap.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -49,6 +74,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
+	heap->bh_indexed = indexed;
+	if (heap->bh_indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
+
 	return heap;
 }
 
@@ -63,6 +99,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (heap->bh_indexed)
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +112,9 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (heap->bh_indexed)
+		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap);
 }
 
@@ -114,6 +156,64 @@ bh_enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at 'index', and update its tracked position accordingly.
+ *
+ * Return true if the node's index was already tracked.
+ */
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)
+{
+	bh_nodeidx_entry *ent;
+	bool		found = false;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	if (heap->bh_indexed)
+	{
+		/* Remember its index in the nodes array */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->idx = index;
+	}
+
+	return found;
+}
+
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static void
+bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (!heap->bh_indexed)
+		return;
+
+	(void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+}
+
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+	/* Remove overwritten node's index */
+	bh_delete_nodeidx(heap, heap->bh_nodes[idx]);
+
+	/* Replace it with the given new node */
+	if (idx < heap->bh_size)
+	{
+		bool		found PG_USED_FOR_ASSERTS_ONLY;
+
+		found = bh_set_node(heap, replaced_by, idx);
+
+		/* The overwritten node's index must already be tracked */
+		Assert(!heap->bh_indexed || found);
+	}
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -130,7 +230,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		bh_enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -163,7 +263,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		bh_enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	bh_set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -204,6 +304,8 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		bh_delete_nodeidx(heap, result);
+
 		return result;
 	}
 
@@ -211,7 +313,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	bh_replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -237,7 +339,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	bh_replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -246,6 +348,74 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_up(heap, ent->idx);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if bh_indexed is true.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(heap->bh_indexed);
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->idx >= 0 && ent->idx < heap->bh_size);
+
+	sift_down(heap, ent->idx);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -258,7 +428,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	bh_replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -300,11 +470,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		bh_set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
 
 /*
@@ -359,9 +529,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		bh_set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	bh_set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..bf85c24002 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,28 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for A hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	char		status;
+	int			idx;
+} bh_nodeidx_entry;
+
+/* define parameters necessary to generate the hash table interface */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +69,19 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_indexed is true, the bh_nodeidx is used to track of each node's
+	 * index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+	 */
+	bool		bh_indexed;
+	bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,7 +90,10 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf..82ee10afac 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4055,3 +4055,4 @@ rfile
 ws_options
 ws_file_info
 PathKeyInfo
+bh_nodeidx_entry
-- 
2.39.3

#39vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#36)
Re: Improve eviction algorithm in ReorderBuffer

On Mon, 26 Feb 2024 at 12:33, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Feb 23, 2024 at 6:24 PM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think this performance regression is not acceptable. In this
workload, one transaction has 10k subtransactions and the logical
decoding becomes quite slow if logical_decoding_work_mem is not big
enough. Therefore, it's a legitimate and common approach to increase
logical_decoding_work_mem to speed up the decoding. However, with this
patch, the decoding becomes slower than today. It's a bad idea in
general to optimize an extreme case while sacrificing the normal (or
more common) cases.

Since this same function is also used by pg_dump's TopoSort sorting
function, we should verify that there is no performance impact with a
large number of objects during dump sorting:

Okay. I've run the pg_dump regression tests with --timer flag (note
that pg_dump doesn't use indexed binary heap):

master:
[16:00:25] t/001_basic.pl ................ ok 151 ms ( 0.00 usr
0.00 sys + 0.09 cusr 0.06 csys = 0.15 CPU)
[16:00:25] t/002_pg_dump.pl .............. ok 10157 ms ( 0.23 usr
0.01 sys + 1.48 cusr 0.37 csys = 2.09 CPU)
[16:00:36] t/003_pg_dump_with_server.pl .. ok 504 ms ( 0.00 usr
0.01 sys + 0.10 cusr 0.07 csys = 0.18 CPU)
[16:00:36] t/004_pg_dump_parallel.pl ..... ok 1044 ms ( 0.00 usr
0.00 sys + 0.12 cusr 0.08 csys = 0.20 CPU)
[16:00:37] t/005_pg_dump_filterfile.pl ... ok 2390 ms ( 0.00 usr
0.00 sys + 0.34 cusr 0.19 csys = 0.53 CPU)
[16:00:40] t/010_dump_connstr.pl ......... ok 4813 ms ( 0.01 usr
0.00 sys + 2.13 cusr 0.45 csys = 2.59 CPU)

patched:
[15:59:47] t/001_basic.pl ................ ok 150 ms ( 0.00 usr
0.00 sys + 0.08 cusr 0.07 csys = 0.15 CPU)
[15:59:47] t/002_pg_dump.pl .............. ok 10057 ms ( 0.23 usr
0.02 sys + 1.49 cusr 0.36 csys = 2.10 CPU)
[15:59:57] t/003_pg_dump_with_server.pl .. ok 509 ms ( 0.00 usr
0.00 sys + 0.09 cusr 0.08 csys = 0.17 CPU)
[15:59:58] t/004_pg_dump_parallel.pl ..... ok 1048 ms ( 0.01 usr
0.00 sys + 0.11 cusr 0.11 csys = 0.23 CPU)
[15:59:59] t/005_pg_dump_filterfile.pl ... ok 2398 ms ( 0.00 usr
0.00 sys + 0.34 cusr 0.20 csys = 0.54 CPU)
[16:00:01] t/010_dump_connstr.pl ......... ok 4762 ms ( 0.01 usr
0.00 sys + 2.15 cusr 0.42 csys = 2.58 CPU)

There is no noticeable difference between the two results.

Thanks for verifying it. I have also run the tests in my environment
and found no noticeable difference between them:
Head:
[07:29:41] t/001_basic.pl ................ ok 332 ms
[07:29:41] t/002_pg_dump.pl .............. ok 11029 ms
[07:29:52] t/003_pg_dump_with_server.pl .. ok 705 ms
[07:29:53] t/004_pg_dump_parallel.pl ..... ok 1198 ms
[07:29:54] t/005_pg_dump_filterfile.pl ... ok 2822 ms
[07:29:57] t/010_dump_connstr.pl ......... ok 5582 ms

With Patch:
[07:42:16] t/001_basic.pl ................ ok 328 ms
[07:42:17] t/002_pg_dump.pl .............. ok 11044 ms
[07:42:28] t/003_pg_dump_with_server.pl .. ok 719 ms
[07:42:29] t/004_pg_dump_parallel.pl ..... ok 1188 ms
[07:42:30] t/005_pg_dump_filterfile.pl ... ok 2816 ms
[07:42:33] t/010_dump_connstr.pl ......... ok 5609 ms

Regards,
Vignesh

#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#38)
Re: Improve eviction algorithm in ReorderBuffer

On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

A few comments on 0003:
===================
1.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While the MAINTAIN_HEAP state is
+ * effective when there are many transactions being decoded, in many systems
+ * there is generally no need to use it as long as all transactions
being decoded
+ * are top-level transactions. Therefore, we use MaxConnections as
the threshold
+ * so we can prevent switch to the state unless we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

The comment seems to imply that MAINTAIN_HEAP is useful for a large
number of transactions, but ReorderBufferLargestTXN() switches to this
state even when there is one transaction. So, basically, we use the
binary_heap technique to get the largest transaction even when we have
only one, but we don't maintain that heap unless
REORDER_BUFFER_MEM_TRACK_THRESHOLD transactions are in progress. This
means there is some additional work (building and resetting the heap
each time we pick the largest xact) when we have fewer transactions in
the system, but that may not impact us because of other costs involved,
like serializing all the changes. I think we can try to stress-test
this by setting debug_logical_replication_streaming to 'immediate' to
see whether the new mechanism has any overhead.
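
(To make the trade-off concrete, here is a tiny self-contained sketch
in plain C. This is not patch code: the names are made up and the
threshold is 4 rather than the patch's MaxConnections-based
REORDER_BUFFER_MEM_TRACK_THRESHOLD. The idea is to scan linearly while
the transaction count is below the threshold, and to heapify once, in
O(N), only when it crosses it.)

#include <stdio.h>

#define THRESHOLD 4                 /* stand-in for the real threshold */
#define NTXNS 6

static size_t sizes[NTXNS] = {10, 700, 30, 250, 90, 410};
static int heap[NTXNS];             /* max-heap of txn ids, keyed by sizes[] */
static int heap_size;

static void
sift_down(int i)
{
    for (;;)
    {
        int     l = 2 * i + 1,
                r = 2 * i + 2,
                m = i;

        if (l < heap_size && sizes[heap[l]] > sizes[heap[m]])
            m = l;
        if (r < heap_size && sizes[heap[r]] > sizes[heap[m]])
            m = r;
        if (m == i)
            return;

        int     tmp = heap[i];

        heap[i] = heap[m];
        heap[m] = tmp;
        i = m;
    }
}

static void
build_heap(void)                    /* O(N) bottom-up heapify, done once */
{
    heap_size = NTXNS;
    for (int i = 0; i < NTXNS; i++)
        heap[i] = i;
    for (int i = NTXNS / 2 - 1; i >= 0; i--)
        sift_down(i);
}

int
main(void)
{
    if (NTXNS < THRESHOLD)
    {
        /* few transactions: a plain O(N) scan is cheap enough */
        int     best = 0;

        for (int i = 1; i < NTXNS; i++)
            if (sizes[i] > sizes[best])
                best = i;
        printf("largest txn: %d (%zu bytes)\n", best, sizes[best]);
    }
    else
    {
        /* many transactions: build the max-heap and read the root */
        build_heap();
        printf("largest txn: %d (%zu bytes)\n", heap[0], sizes[heap[0]]);
    }
    return 0;
}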

2. Can we somehow measure the additional memory that will be consumed
by each backend/walsender to maintain transactions? Because I think
this also won't be accounted for in logical_decoding_work_mem, so if
this is large, there could be more complaints about us not honoring
logical_decoding_work_mem.

3.
@@ -3707,11 +3817,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  ReorderBufferSerializeChange(rb, txn, fd, change);
  dlist_delete(&change->node);
- ReorderBufferReturnChange(rb, change, true);
+ ReorderBufferReturnChange(rb, change, false);

spilled++;
}

+ /* Update the memory counter */
+ ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);

In ReorderBufferSerializeTXN(), we already use a size variable for
txn->size, so we can probably use that for the sake of consistency.

--
With Regards,
Amit Kapila.

#41Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#40)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Feb 28, 2024 at 3:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for the comments!

A few comments on 0003:
===================
1.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While the MAINTAIN_HEAP state is
+ * effective when there are many transactions being decoded, in many systems
+ * there is generally no need to use it as long as all transactions
being decoded
+ * are top-level transactions. Therefore, we use MaxConnections as
the threshold
+ * so we can prevent switch to the state unless we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

The comment seems to imply that MAINTAIN_HEAP is useful for a large
number of transactions, but ReorderBufferLargestTXN() switches to this
state even when there is one transaction. So, basically, we use the
binary_heap technique to get the largest transaction even when we have
only one, but we don't maintain that heap unless
REORDER_BUFFER_MEM_TRACK_THRESHOLD transactions are in progress.

Right.

This means there is some additional work (building and resetting the
heap each time we pick the largest xact) when we have fewer
transactions in the system, but that may not impact us because of
other costs involved, like serializing all the changes. I think we can
try to stress-test this by setting debug_logical_replication_streaming
to 'immediate' to see whether the new mechanism has any overhead.

Agreed.

I've done performance tests that decode 10k small transactions
(pgbench transactions) with debug_logical_replication_streaming =
'immediate':

master: 6263.022 ms
patched: 6403.873 ms

I don't see any noticeable regression.

2. Can we somehow measure the additional memory that will be consumed
by each backend/walsender to maintain transactions? Because I think
this also won't be accounted for in logical_decoding_work_mem, so if
this is large, there could be more complaints about us not honoring
logical_decoding_work_mem.

Good point.

We initialize the binaryheap with MaxConnections * 2 entries and the
binaryheap entries are pointers. So we use an additional (8 * 100 * 2)
bytes with the default max_connections setting even when there is one
transaction, and could use more memory when adding more transactions.
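
(A back-of-the-envelope check of those numbers as a standalone C
snippet; the 100 and the 8 are assumptions matching the defaults
mentioned above, not measured values:)

#include <stdio.h>

int
main(void)
{
    int     max_connections = 100;  /* PostgreSQL's default setting */
    size_t  entry_size = 8;         /* pointer-sized binaryheap entry */

    /* node array pre-allocated with MaxConnections * 2 entries */
    printf("node array: %zu bytes\n", entry_size * max_connections * 2);

    /* the bh_nodeidx hash table entries come on top of this */
    return 0;
}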

I think there is still room to consider how to determine the
threshold and the number of initial entries. Using MaxConnections
seems to work, but it always uses the current MaxConnections value
instead of the value that was set at the time the WAL records were
written. As for the initial number of entries in the binaryheap, I
think we can use the threshold value as the initial number of entries
instead of (threshold * 2). Or we might want to use the same value,
1000, as the one we use for the buffer->by_txn hash table.

3.
@@ -3707,11 +3817,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

ReorderBufferSerializeChange(rb, txn, fd, change);
dlist_delete(&change->node);
- ReorderBufferReturnChange(rb, change, true);
+ ReorderBufferReturnChange(rb, change, false);

spilled++;
}

+ /* Update the memory counter */
+ ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);

In ReorderBufferSerializeTXN(), we already use a size variable for
txn->size, so we can probably use that for the sake of consistency.

Agreed, will fix it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#42Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#38)
Re: Improve eviction algorithm in ReorderBuffer

Hi, Here are some review comments for v7-0001

1.
/*
* binaryheap_free
*
* Releases memory used by the given binaryheap.
*/
void
binaryheap_free(binaryheap *heap)
{
pfree(heap);
}

Shouldn't the above function (not modified by the patch) also first
free the memory allocated for heap->bh_nodes?
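
(For reference, the fix could be as simple as freeing the array before
the struct, which is what a later version of the patch in this thread
ends up doing:)

void
binaryheap_free(binaryheap *heap)
{
    pfree(heap->bh_nodes);      /* release the node array first */
    pfree(heap);
}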

~~~

2.
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ heap->bh_space *= 2;
+ heap->bh_nodes = repalloc(heap->bh_nodes,
+   sizeof(bh_node_type) * heap->bh_space);
+}

Strictly speaking, this function doesn't really "Make sure" of
anything, because the caller checks whether we need more space.
All that happens here is allocating more space. Maybe this function
comment should say something like "Double the space allocated for
nodes."

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#43vignesh C
vignesh21@gmail.com
In reply to: Amit Kapila (#40)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, 28 Feb 2024 at 11:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

A few comments on 0003:
===================
1.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While the MAINTAIN_HEAP state is
+ * effective when there are many transactions being decoded, in many systems
+ * there is generally no need to use it as long as all transactions
being decoded
+ * are top-level transactions. Therefore, we use MaxConnections as
the threshold
+ * so we can prevent switch to the state unless we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

The comment seems to imply that MAINTAIN_HEAP is useful for a large
number of transactions, but ReorderBufferLargestTXN() switches to this
state even when there is one transaction. So, basically, we use the
binary_heap technique to get the largest transaction even when we have
only one, but we don't maintain that heap unless
REORDER_BUFFER_MEM_TRACK_THRESHOLD transactions are in progress. This
means there is some additional work (building and resetting the heap
each time we pick the largest xact) when we have fewer transactions in
the system, but that may not impact us because of other costs involved,
like serializing all the changes. I think we can try to stress-test
this by setting debug_logical_replication_streaming to 'immediate' to
see whether the new mechanism has any overhead.

I ran the test with a transaction having many inserts:

Inserts | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
--------|-------|--------|--------|---------|----------|----------
Head    | 26.31 | 48.84  | 93.65  | 480.05  | 4808.29  | 47020.16
Patch   | 26.35 | 50.8   | 97.99  | 484.8   | 4856.95  | 48108.89

The same test with debug_logical_replication_streaming = 'immediate':

Inserts | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
--------|-------|--------|--------|---------|----------|-----------
Head    | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
Patch   | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81

The execution times are in milliseconds, and the column headers give
the number of inserts in the transaction.
In this case I noticed that the test execution with the patch was
taking slightly more time.

Regards,
Vignesh

#44Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#38)
Re: Improve eviction algorithm in ReorderBuffer

Hi, here are some review comments for v7-0002

======
Commit Message

1.
This commit adds a hash table to binaryheap in order to track of
positions of each nodes in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on an average and O(log n)
in worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

~

/to track of positions of each nodes/to track the position of each node/

~~~

2.
There is no user of it but it will be used by a upcoming patch.

~

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.
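
(As an illustration of the intended use, a backend-side caller might
look like the sketch below. The binaryheap_* calls are the real API,
existing plus the ones this patch adds; cmp_by_datum and the txn
pointers are made up for the example.)

#include "postgres.h"
#include "lib/binaryheap.h"

static int
cmp_by_datum(Datum a, Datum b, void *arg)
{
    /* illustrative ordering on the raw Datum values */
    return (a > b) ? 1 : ((a < b) ? -1 : 0);
}

static void
indexed_heap_example(void *txn1, void *txn2)
{
    binaryheap *heap;

    /* indexed = true enables the node-index hash table */
    heap = binaryheap_allocate(16, cmp_by_datum, true, NULL);

    binaryheap_add(heap, PointerGetDatum(txn1));
    binaryheap_add(heap, PointerGetDatum(txn2));

    /* after txn1's key grows, restore the heap property in O(log n) */
    binaryheap_update_up(heap, PointerGetDatum(txn1));

    /* remove by node value rather than by array position */
    binaryheap_remove_node_ptr(heap, PointerGetDatum(txn2));

    binaryheap_free(heap);
}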

======
src/common/binaryheap.c

3.
+/*
+ * Define parameters for hash table code generation. The interface is *also*"
+ * declared in binaryheaph.h (to generate the types, which are externally
+ * visible).
+ */

Typo: *also*"

~~~

4.
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"

4a.
The comment in simplehash.h says
* The following parameters are only relevant when SH_DEFINE is defined:
* - SH_KEY - ...
* - SH_EQUAL(table, a, b) - ...
* - SH_HASH_KEY(table, key) - ...
* - SH_STORE_HASH - ...
* - SH_GET_HASH(tb, a) - ...

So maybe it is nicer to reorder the #defines in that same order?

SUGGESTION:
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_KEY key
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#include "lib/simplehash.h"

~~

4b.
The comment in simplehash.h says that "it's preferable, if possible,
to store the element's hash in the element's data type", so should
SH_STORE_HASH and SH_GET_HASH also be defined here?
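
(For reference, caching the hash per simplehash.h's conventions would
mean adding a field to the entry struct and defining the two macros,
along these lines:)

typedef struct bh_nodeidx_entry
{
    bh_node_type key;
    char        status;
    int         idx;
    uint32      hash;           /* cached hash value */
} bh_nodeidx_entry;

#define SH_STORE_HASH
#define SH_GET_HASH(tb, a) a->hash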

~~~

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

~~~

6.
+ heap->bh_indexed = indexed;
+ if (heap->bh_indexed)
+ {
+#ifdef FRONTEND
+ heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+ heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+ NULL);
+#endif
+ }
+

The heap allocation just uses palloc instead of palloc0 so it might be
better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will
never get a situation where bh_indexed is false but bh_nodeidx has
some (garbage) value.
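
(i.e. something along these lines, showing only the backend branch; the
FRONTEND case would call bh_nodeidx_create() without the memory
context:)

heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
heap->bh_nodeidx = NULL;        /* never left as a garbage pointer */

if (indexed)
    heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
                                         NULL);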

~~~

7.
+/*
+ * Set the given node at the 'index', and updates its position accordingly.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)

7a.
I felt the 1st sentence should be more like:

SUGGESTION
Set the given node at the 'index' and track it if required.

~

7b.
IMO the parameters would be better the other way around (e.g. 'index'
before the 'node') because that's what the assignments look like:

heap->bh_nodes[heap->bh_size] = d;

becomes:
bh_set_node(heap, heap->bh_size, d);

~~~

8.
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)
+{
+ bh_nodeidx_entry *ent;
+ bool found = false;
+
+ /* Set the node to the nodes array */
+ heap->bh_nodes[index] = node;
+
+ if (heap->bh_indexed)
+ {
+ /* Remember its index in the nodes array */
+ ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+ ent->idx = index;
+ }
+
+ return found;
+}

8a.
That 'ent' declaration can be moved to the inner block scope, so it is
closer to where it is needed.

~

8b.
+ /* Remember its index in the nodes array */

The comment is worded a bit ambiguously. IMO a simpler comment would
be: "/* Keep track of the node index. */"

~~~

9.
+static void
+bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!heap->bh_indexed)
+ return;
+
+ (void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

Since there is only 1 statement IMO it is simpler to write this
function like below:

if (heap->bh_indexed)
(void) bh_nodeidx_delete(heap->bh_nodeidx, node);

~~~

10.
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)

10a.
Would 'node' or 'new_node' or 'replacement' be a better name than 'replaced_by'?

~

10b.
I noticed that the index param is called 'idx' here but in other
functions, it is called 'index'. I think either is good (I prefer
'idx') but at least everywhere should use the same name for
consistency.

~~~

11.
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+ /* Remove overwritten node's index */
+ bh_delete_nodeidx(heap, heap->bh_nodes[idx]);
+
+ /* Replace it with the given new node */
+ if (idx < heap->bh_size)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ found = bh_set_node(heap, replaced_by, idx);
+
+ /* The overwritten node's index must already be tracked */
+ Assert(!heap->bh_indexed || found);
+ }
+}

I did not understand the condition.
e.g. Can you explain when is idx NOT less than heap->bh_size?
e.g. If this condition failed then nothing gets replaced (??)

~~~

======
src/include/lib/binaryheap.h

12.
+/*
+ * Struct for A hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry

/for A hash table/for a hash table/

~~~

13.
+/* define parameters necessary to generate the hash table interface */

Suggest uppercase "Define" and add a period.

~~~

14.
+
+ /*
+ * If bh_indexed is true, the bh_nodeidx is used to track of each node's
+ * index in bh_nodes. This enables the caller to perform
+ * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+ */
+ bool bh_indexed;
+ bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;

I'm wondering why the separate 'bh_indexed' is necessary at all. Can't
you just use the bh_nodeidx value? E.g. If bh_nodeidx == NULL then it
means there is no index tracking, otherwise there is.
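
(For instance, a macro test would suffice:)

#define binaryheap_indexed(h)   ((h)->bh_nodeidx != NULL)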

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#45Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#42)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Mar 5, 2024 at 11:25 AM Peter Smith <smithpb2250@gmail.com> wrote:

Hi, Here are some review comments for v7-0001

Thank you for reviewing the patch.

1.
/*
* binaryheap_free
*
* Releases memory used by the given binaryheap.
*/
void
binaryheap_free(binaryheap *heap)
{
pfree(heap);
}

Shouldn't the above function (not modified by the patch) also first
free the memory allocated for heap->bh_nodes?

~~~

2.
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ heap->bh_space *= 2;
+ heap->bh_nodes = repalloc(heap->bh_nodes,
+   sizeof(bh_node_type) * heap->bh_space);
+}

Strictly speaking, this function doesn't really "Make sure" of
anything, because the caller checks whether we need more space.
All that happens here is allocating more space. Maybe this function
comment should say something like "Double the space allocated for
nodes."

Agreed with the above two points. I'll fix them in the next version patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#46Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#43)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Mar 5, 2024 at 12:20 PM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 28 Feb 2024 at 11:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

A few comments on 0003:
===================
1.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While the MAINTAIN_HEAP state is
+ * effective when there are many transactions being decoded, in many systems
+ * there is generally no need to use it as long as all transactions
being decoded
+ * are top-level transactions. Therefore, we use MaxConnections as
the threshold
+ * so we can prevent switch to the state unless we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

The comment seems to imply that MAINTAIN_HEAP is useful for a large
number of transactions, but ReorderBufferLargestTXN() switches to this
state even when there is one transaction. So, basically, we use the
binary_heap technique to get the largest transaction even when we have
only one, but we don't maintain that heap unless
REORDER_BUFFER_MEM_TRACK_THRESHOLD transactions are in progress. This
means there is some additional work (building and resetting the heap
each time we pick the largest xact) when we have fewer transactions in
the system, but that may not impact us because of other costs involved,
like serializing all the changes. I think we can try to stress-test
this by setting debug_logical_replication_streaming to 'immediate' to
see whether the new mechanism has any overhead.

I ran the test with a transaction having many inserts:

Inserts | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
--------|-------|--------|--------|---------|----------|----------
Head    | 26.31 | 48.84  | 93.65  | 480.05  | 4808.29  | 47020.16
Patch   | 26.35 | 50.8   | 97.99  | 484.8   | 4856.95  | 48108.89

The same test with debug_logical_replication_streaming = 'immediate':

Inserts | 5000  | 10000  | 20000  | 100000  | 1000000  | 10000000
--------|-------|--------|--------|---------|----------|-----------
Head    | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
Patch   | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81

The execution times are in milliseconds, and the column headers give
the number of inserts in the transaction.
In this case I noticed that the test execution with the patch was
taking slightly more time.

Thank you for testing! With 10M records, I can see 2% regression in
the 'buffered' case and 5% regression in the 'immediate' case.

I think that in general it makes sense to postpone using a max-heap
until the number of transactions is higher than the threshold. I've
implemented this idea and here are the results in my environment (with
10M records and debug_logical_replication_streaming = 'immediate'):

HEAD:
68937.887 ms
69450.174 ms
68808.248 ms

v7 patch:
71280.783 ms
71673.101 ms
71330.898 ms

v8 patch:
68918.259 ms
68822.330 ms
68972.452 ms

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v8-0001-Make-binaryheap-enlargeable.patch
From 04e1719180c7dcf75d829b269e37b89f16fccba4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v8 1/3] Make binaryheap enlargeable.

The node array space of the binaryheap is doubled when there is no
available space.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/common/binaryheap.c      | 37 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..843e764bb6 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -74,6 +73,7 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	pfree(heap->bh_nodes);
 	pfree(heap);
 }
 
@@ -104,6 +104,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Double the space allocated for nodes.
+ */
+static void
+enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +126,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

v8-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch
From d5233f54bcd7bcf55bf43508886dd98fea11c619 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v8 2/3] Add functions to binaryheap for efficient key removal
 and update.

Previously, binaryheap didn't support updating a key and removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).

This commit adds a hash table to binaryheap in order to track the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on an average and O(log n)
in worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 201 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  36 +++-
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 235 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 45f6017c29..ce19e0837a 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -422,6 +422,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e1b9b984a7..3efebd537f 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f97035ca03..fee5955b13 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -248,7 +248,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 001f901ee6..393713af91 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1294,6 +1294,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..eee5021197 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2724,6 +2724,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..6587a7b081 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 4cb754caa5..7362f7c961 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 843e764bb6..a94861feaa 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,30 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheaph.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -34,9 +56,14 @@ static void sift_up(binaryheap *heap, int node_off);
  * store the given number of nodes, with the heap property defined by
  * the given comparator function, which will be invoked with the additional
  * argument specified by 'arg'.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -48,6 +75,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
+	heap->bh_nodeidx = NULL;
+
+	if (indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
 
 	return heap;
 }
@@ -63,6 +101,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +114,9 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap->bh_nodes);
 	pfree(heap);
 }
@@ -115,6 +159,73 @@ enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at the 'index' and track it if required.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+set_node(binaryheap *heap, bh_node_type node, int index)
+{
+	bool		found = false;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	if (binaryheap_indexed(heap))
+	{
+		bh_nodeidx_entry *ent;
+
+		/* Keep track of the node index */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->index = index;
+	}
+
+	return found;
+}
+
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (!binaryheap_indexed(heap))
+		return false;
+
+	return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}
+
+/*
+ * Replace the existing node at 'idx' with the given 'new_node'. Also
+ * update their positions accordingly. Note that we assume the new_node's
+ * position is already tracked if enabled, i.e. the new_node is already
+ * present in the heap.
+ */
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+	bool		found PG_USED_FOR_ASSERTS_ONLY;
+
+	/* Quick return if not necessary to move */
+	if (heap->bh_nodes[index] == new_node)
+		return;
+
+	/*
+	 * Remove overwritten node's index. The overwritten node's position must
+	 * have been tracked, if enabled.
+	 */
+	found = delete_nodeidx(heap, heap->bh_nodes[index]);
+	Assert(!binaryheap_indexed(heap) || found);
+
+	/*
+	 * Replace it with the given new node. This node's position must also be
+	 * tracked as we assume to replace the node by the existing node.
+	 */
+	found = set_node(heap, new_node, index);
+	Assert(!binaryheap_indexed(heap) || found);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +242,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -164,7 +275,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -205,6 +316,8 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		delete_nodeidx(heap, result);
+
 		return result;
 	}
 
@@ -212,7 +325,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -238,7 +351,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -247,6 +360,74 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	sift_up(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	sift_down(heap, ent->index);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -259,7 +440,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -301,11 +482,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
 
 /*
@@ -360,9 +541,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..a7240aa0c2 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,29 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;			/* entry's index within the node array */
+	char		status;			/* hash status */
+	uint32		hash;			/* hash values (cached) */
+} bh_nodeidx_entry;
+
+/* Define parameters necessary to generate the hash table interface. */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +70,18 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_nodeidx is not NULL, the bh_nodeidx is used to track of each
+	 * node's index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+	 */
+	bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,10 +90,14 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
 #define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
+#define binaryheap_indexed(h)		((h)->bh_nodeidx != NULL)
 
 #endif							/* BINARYHEAP_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 95ae7845d8..ba6baaf7db 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4054,3 +4054,4 @@ rfile
 ws_options
 ws_file_info
 PathKeyInfo
+bh_nodeidx_entry
-- 
2.39.3

v8-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch
From 153c59bb6712b5301f04fe88ce4d011351a09bc3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v8 3/3] Improve eviction algorithm in Reorderbuffer using
 max-heap for many subtransactions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, when selecting the transaction to evict during logical
decoding, we checked all transactions to find the largest one, which
could lead to significant replication lag, especially when there are
many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer using the
max-heap with transaction size as the key to efficiently find the
largest transaction.

Overall algorithm:

There are two memory track states: REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP
and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
not update the max-heap when updating the memory counter. We build the
max-heap just before selecting large transactions, if the number of
transactions being decoded is larger than the threshold,
REORDER_BUFFER_MEM_TRACK_THRESHOLD. Therefore, in this state, we can
update the memory counter with no additional costs but need O(N) time
to get the largest transaction, where N is the number of transactions
including top-level transactions and subtransactions.

Once we build the max-heap, we switch to
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
the max-heap when updating the memory counter. The intention is to
efficiently retrieve the largest transaction in O(1) time instead of
incurring the cost of memory counter updates (O(log n)). To minimize
the overhead of maintaining the max-heap, we batch memory updates when
cleaning up all changes. We remain in this state as long as the number
of transactions is larger than the threshold,
REORDER_BUFFER_MEM_TRACK_THRESHOLD. Otherwise, we switch back to
REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and reset the max-heap.

The performance benchmark results showed significant speed up (more
than x30 speed up on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Álvaro Herrera, Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 239 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |  21 ++
 src/tools/pgindent/typedefs.list              |   1 +
 3 files changed, 233 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 393713af91..4b64eb4264 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,28 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. The max-heap state is managed in two states:
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.
+ *
+ *	  REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP is the starting state, where we do
+ *	  not update the max-heap when updating the memory counter. We build the
+ *	  max-heap just before selecting large transactions if the number of
+ *	  transactions being decoded is larger than the threshold,
+ *	  REORDER_BUFFER_MEM_TRACK_THRESHOLD. Therefore, in this state, we can
+ *	  update the memory counter with no additional costs but need O(N) time
+ *	  to get the largest transaction, where N is the number of transactions
+ *	  including top-level transactions and subtransactions.
+ *
+ *	  Once we build the max-heap, we switch to
+ *	  REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP state, where we also update
+ *	  the max-heap when updating the memory counter. The intention is to
+ *	  efficiently retrieve the largest transaction in O(1) time instead of
+ *	  incurring the cost of memory counter updates (O(log N)). We remain in
+ *	  this state as long as the number of transactions is larger than the
+ *	  threshold. Otherwise, we switch back to REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP
+ *	  and reset the max-heap.
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -107,6 +129,16 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * Threshold of the total number of top-level and sub transactions that controls
+ * whether we switch the memory track state. While using max-heap to select
+ * the largest transaction is effective when there are many transactions being
+ * decoded, in many systems there is generally no need to use it as long as all
+ * transactions being decoded are top-level transactions. Therefore, we use
+ * MaxConnections as the threshold* so we can prevent switch to the state unless
+ * we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD	MaxConnections
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -259,6 +291,8 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
+static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
+static void ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -293,7 +327,9 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
+											ReorderBufferTXN *txn,
 											bool addition, Size sz);
+static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 /*
  * Allocate a new ReorderBuffer and clean out any old serialized state from
@@ -355,6 +391,16 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * Don't start with a lower number than
+	 * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
+	 * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+	 */
+	buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+	buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -485,7 +531,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 {
 	/* update memory accounting info */
 	if (upd_mem)
-		ReorderBufferChangeMemoryUpdate(rb, change, false,
+		ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
 										ReorderBufferChangeSize(change));
 
 	/* free contained data */
@@ -816,7 +862,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries_mem++;
 
 	/* update memory accounting information */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 
 	/* process partial change */
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1586,8 +1632,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	/* check the memory track state */
+	ReorderBufferMaybeChangeNoMaxHeap(rb);
 }
 
 /*
@@ -1637,9 +1689,12 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/*
 	 * Mark the transaction as streamed.
 	 *
@@ -3174,22 +3229,24 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
+								ReorderBufferTXN *txn,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
 	ReorderBufferTXN *toptxn;
 
-	Assert(change->txn);
-
 	/*
 	 * Ignore tuple CID changes, because those are not evicted when reaching
 	 * memory limit. So we just don't count them, because it might easily
 	 * trigger a pointless attempt to spill.
 	 */
-	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	if (sz == 0)
 		return;
 
-	txn = change->txn;
+	txn = txn != NULL ? txn : change->txn;
+	Assert(txn != NULL);
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3204,6 +3261,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3213,6 +3279,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3469,31 +3544,116 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 }
 
 /*
- * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
+ * Build the max-heap and switch the state. We will run a heap assembly step
+ * at the end, which is more efficient.
  */
-static ReorderBufferTXN *
-ReorderBufferLargestTXN(ReorderBuffer *rb)
+static void
+ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
 {
 	HASH_SEQ_STATUS hash_seq;
 	ReorderBufferTXNByIdEnt *ent;
-	ReorderBufferTXN *largest = NULL;
+
+	Assert(rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP);
 
 	hash_seq_init(&hash_seq, rb->by_txn);
 	while ((ent = hash_seq_search(&hash_seq)) != NULL)
 	{
 		ReorderBufferTXN *txn = ent->txn;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		if (txn->size == 0)
+			continue;
+
+		binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+	}
+
+	binaryheap_build(rb->txn_heap);
+
+	/* Switch to the new state */
+	rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP;
+}
+
+/*
+ * Switch to NO_MAXHEAP state and reset the max-heap if the number of
+ * transactions got lower than the threshold.
+ */
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+		return;
+
+	/*
+	 * If we add and remove transactions right around the threshold, we could
+	 * easily end up "thrashing". To avoid that, we only switch back to
+	 * the NO_MAXHEAP state once we fall 10% below the threshold.
+	 */
+	if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
+}
+
+/*
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ * We use a different way to find the largest transaction depending on the
+ * memory tracking state and the number of transactions being decoded. Refer
+ * to the comments atop this file for the algorithm details.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	ReorderBufferTXN *largest = NULL;
+
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+	{
+		/*
+		 * If the number of transactions is small, we scan all transactions
+		 * being decoded to get the largest transaction. This saves the cost
+		 * of building a max-heap with a small number of transactions.
+		 */
+		if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+		{
+			HASH_SEQ_STATUS hash_seq;
+			ReorderBufferTXNByIdEnt *ent;
+
+			hash_seq_init(&hash_seq, rb->by_txn);
+			while ((ent = hash_seq_search(&hash_seq)) != NULL)
+			{
+				ReorderBufferTXN *txn = ent->txn;
+
+				/* if the current transaction is larger, remember it */
+				if ((!largest) || (txn->size > largest->size))
+					largest = txn;
+			}
+
+			Assert(largest);
+		}
+		else
+		{
+			/*
+			 * There are a large number of transactions in ReorderBuffer. We
+			 * build the max-heap for efficiently selecting the largest
+			 * transactions.
+			 */
+			ReorderBufferBuildMaxHeap(rb);
+
+			/*
+			 * The max-heap is ready now. We remain in this state at least
+			 * until we free up enough transactions to bring the total memory
+			 * usage below the limit. The largest transaction is selected
+			 * below.
+			 */
+			Assert(rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP);
+		}
+	}
+
+	/* Get the largest transaction from the max-heap */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+	{
+		Assert(binaryheap_size(rb->txn_heap) > 0);
+		largest = (ReorderBufferTXN *)
+			DatumGetPointer(binaryheap_first(rb->txn_heap));
 	}
 
 	Assert(largest);
@@ -3636,6 +3796,9 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		Assert(txn->nentries_mem == 0);
 	}
 
+	/* check the memory track state */
+	ReorderBufferMaybeChangeNoMaxHeap(rb);
+
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
@@ -3705,11 +3868,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, size);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4491,7 +4657,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * update the accounting too (subtracting the size from the counters). And
 	 * we don't want to underflow there.
 	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -4903,9 +5069,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContextSwitchTo(oldcontext);
 
 	/* subtract the old change size */
-	ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, false, old_size);
 	/* now add the change back, with the correct size */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -5271,3 +5437,20 @@ restart:
 		*cmax = ent->cmax;
 	return true;
 }
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..1f0ad2b94e 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -531,6 +532,22 @@ typedef void (*ReorderBufferUpdateProgressTxnCB) (
 												  ReorderBufferTXN *txn,
 												  XLogRecPtr lsn);
 
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+	/*
+	 * We don't update max-heap while updating the memory counter. The
+	 * max-heap is built before use.
+	 */
+	REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+	/*
+	 * We also update the max-heap when updating the memory counter so the
+	 * heap property is always preserved.
+	 */
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+
 struct ReorderBuffer
 {
 	/*
@@ -631,6 +648,10 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap *txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba6baaf7db..f4209fc10c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4055,3 +4055,4 @@ ws_options
 ws_file_info
 PathKeyInfo
 bh_nodeidx_entry
+ReorderBufferMemTrackState
-- 
2.39.3

#47Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#44)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Mar 5, 2024 at 3:28 PM Peter Smith <smithpb2250@gmail.com> wrote:

Hi, here are some review comments for v7-0002

======
Commit Message

1.
This commit adds a hash table to binaryheap in order to track of
positions of each nodes in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on an average and O(log n)
in worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

~

/to track of positions of each nodes/to track the position of each node/

~~~

2.
There is no user of it but it will be used by a upcoming patch.

~

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

Fixed.

======
src/common/binaryheap.c

3.
+/*
+ * Define parameters for hash table code generation. The interface is *also*"
+ * declared in binaryheaph.h (to generate the types, which are externally
+ * visible).
+ */

Typo: *also*"

Fixed.

~~~

4.
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#include "lib/simplehash.h"

4a.
The comment in simplehash.h says
* The following parameters are only relevant when SH_DEFINE is defined:
* - SH_KEY - ...
* - SH_EQUAL(table, a, b) - ...
* - SH_HASH_KEY(table, key) - ...
* - SH_STORE_HASH - ...
* - SH_GET_HASH(tb, a) - ...

So maybe it is nicer to reorder the #defines in that same order?

SUGGESTION:
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_KEY key
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#include "lib/simplehash.h"

I'm really not sure it helps increase readability. For instance, for
me it's more readable if SH_DEFINE and SH_DECLARE come last, just
before the #include, since it's more obvious whether we want to
declare, define, or both. Other simplehash.h users also do so.

~~

4b.
The comment in simplehash.h says that "it's preferable, if possible,
to store the element's hash in the element's data type", so should
SH_STORE_HASH and SH_GET_HASH also be defined here?

Good catch. I've used these macros.
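To illustrate, here is a minimal sketch of how these two macros are
typically wired up for simplehash.h (the "hash" field name and its
placement in bh_nodeidx_entry are illustrative assumptions, not
necessarily what the final patch does):

/*
 * Store the computed hash in each entry so lookups and hash table
 * growth can reuse it instead of rehashing the key. This assumes a
 * "uint32 hash" member is added to bh_nodeidx_entry.
 */
#define SH_STORE_HASH
#define SH_GET_HASH(tb, a) ((a)->hash)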

~~~

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
*/
binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

~~~

6.
+ heap->bh_indexed = indexed;
+ if (heap->bh_indexed)
+ {
+#ifdef FRONTEND
+ heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+ heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+ NULL);
+#endif
+ }
+

The heap allocation just uses palloc instead of palloc0 so it might be
better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will
never get a situation where bh_indexed is false but bh_nodeidx has
some (garbage) value.

Fixed.

~~~

7.
+/*
+ * Set the given node at the 'index', and updates its position accordingly.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)

7a.
I felt the 1st sentence should be more like:

SUGGESTION
Set the given node at the 'index' and track it if required.

Fixed.

~

7b.
IMO the parameters would be better the other way around (e.g. 'index'
before the 'node') because that's what the assignments look like:

heap->bh_nodes[heap->bh_size] = d;

becomes:
bh_set_node(heap, heap->bh_size, d);

I think it assumes heap->bh_nodes is an array. But if we change it in
the future, it will no longer make sense. I think it would make more
sense if we define the parameters in an order like "we set the 'node'
at 'index'". What do you think?

~~~

8.
+static bool
+bh_set_node(binaryheap *heap, bh_node_type node, int index)
+{
+ bh_nodeidx_entry *ent;
+ bool found = false;
+
+ /* Set the node to the nodes array */
+ heap->bh_nodes[index] = node;
+
+ if (heap->bh_indexed)
+ {
+ /* Remember its index in the nodes array */
+ ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+ ent->idx = index;
+ }
+
+ return found;
+}

8a.
That 'ent' declaration can be moved to the inner block scope, so it is
closer to where it is needed.

~

8b.
+ /* Remember its index in the nodes array */

The comment is worded a bit ambiguously. IMO a simpler comment would
be: "/* Keep track of the node index. */"

~~~

Fixed.

9.
+static void
+bh_delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!heap->bh_indexed)
+ return;
+
+ (void) bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

Since there is only 1 statement IMO it is simpler to write this
function like below:

if (heap->bh_indexed)
(void) bh_nodeidx_delete(heap->bh_nodeidx, node);

Fixed.

~~~

10.
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)

10a.
Would 'node' or 'new_node' or 'replacement' be a better name than 'replaced_by'?

Fixed.

~

10b.
I noticed that the index param is called 'idx' here but in other
functions, it is called 'index'. I think either is good (I prefer
'idx') but at least everywhere should use the same name for
consistency.

Fixed.

~~~

11.
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+ /* Remove overwritten node's index */
+ bh_delete_nodeidx(heap, heap->bh_nodes[idx]);
+
+ /* Replace it with the given new node */
+ if (idx < heap->bh_size)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ found = bh_set_node(heap, replaced_by, idx);
+
+ /* The overwritten node's index must already be tracked */
+ Assert(!heap->bh_indexed || found);
+ }
+}

I did not understand the condition.
e.g. Can you explain when is idx NOT less than heap->bh_size?
e.g. If this condition failed then nothing gets replaced (??)

It was for a case where we call binaryheap_remove_node(heap, 0) while
the heap has only one entry, resulting in setting the root node again.
I updated bh_replace_node() to return early if the node doesn't need
to be moved.

~~~

======
src/include/lib/binaryheap.h

12.
+/*
+ * Struct for A hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry

/for A hash table/for a hash table/

~~~

13.
+/* define parameters necessary to generate the hash table interface */

Suggest uppercase "Define" and add a period.

Fixed.

~~~

14.
+
+ /*
+ * If bh_indexed is true, the bh_nodeidx is used to track of each node's
+ * index in bh_nodes. This enables the caller to perform
+ * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+ */
+ bool bh_indexed;
+ bh_nodeidx_hash *bh_nodeidx;
} binaryheap;

I'm wondering why the separate 'bh_indexed' is necessary at all. Can't
you just use the bh_nodeidx value? E.g. If bh_nodeidx == NULL then it
means there is no index tracking, otherwise there is.

Good point. I added a macro binaryheap_indexed() to check it for
better readability.
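
For reference, a minimal sketch of such a macro, assuming bh_nodeidx
is left NULL when index tracking is disabled:

/* Return whether the heap tracks each node's index in a hash table */
#define binaryheap_indexed(h) ((h)->bh_nodeidx != NULL)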

The above comments are incorporated into the latest v8 patch set that
I've just submitted[1].

Regards,

[1]: /messages/by-id/CAD21AoBYjJmz7q_=Z+eXJgm0FScyu3_iGFshPAvnq78B2KL3qQ@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#48Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#47)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 5, 2024 at 3:28 PM Peter Smith <smithpb2250@gmail.com> wrote:

4a.
The comment in simplehash.h says
* The following parameters are only relevant when SH_DEFINE is defined:
* - SH_KEY - ...
* - SH_EQUAL(table, a, b) - ...
* - SH_HASH_KEY(table, key) - ...
* - SH_STORE_HASH - ...
* - SH_GET_HASH(tb, a) - ...

So maybe it is nicer to reorder the #defines in that same order?

SUGGESTION:
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_KEY key
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#include "lib/simplehash.h"

I'm really not sure it helps increase readability. For instance, for
me it's more readable if SH_DEFINE and SH_DECLARE come last, just
before the #include, since it's more obvious whether we want to
declare, define, or both. Other simplehash.h users also do so.

OK.

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
*/
binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

7b.
IMO the parameters would be better the other way around (e.g. 'index'
before the 'node') because that's what the assignments look like:

heap->bh_nodes[heap->bh_size] = d;

becomes:
bh_set_node(heap, heap->bh_size, d);

I think it assumes heap->bh_nodes is an array. But if we change it in
the future, it will no longer make sense. I think it would make more
sense if we define the parameters in an order like "we set the 'node'
at 'index'". What do you think?

YMMV. The patch code is also OK by me if you prefer it.

//////////

And, here are some review comments for v8-0002.

======
1. delete_nodeidx

+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!binaryheap_indexed(heap))
+ return false;
+
+ return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

1a.
In v8 this function was changed to now return bool, so, I think the
function comment should explain the meaning of that return value.

~

1b.
I felt the function body is better expressed positively: "If this then
do that", instead of "If not this then do nothing otherwise do that"

SUGGESTION
if (binaryheap_indexed(heap))
return bh_nodeidx_delete(heap->bh_nodeidx, node);

return false;

~~~

2.
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ /* Quick return if not necessary to move */
+ if (heap->bh_nodes[index] == new_node)
+ return;
+
+ /*
+ * Remove overwritten node's index. The overwritten node's position must
+ * have been tracked, if enabled.
+ */
+ found = delete_nodeidx(heap, heap->bh_nodes[index]);
+ Assert(!binaryheap_indexed(heap) || found);
+
+ /*
+ * Replace it with the given new node. This node's position must also be
+ * tracked as we assume to replace the node by the existing node.
+ */
+ found = set_node(heap, new_node, index);
+ Assert(!binaryheap_indexed(heap) || found);
+}

2a.
/Remove overwritten/Remove the overwritten/
/replace the node by the existing node/replace the node with the existing node/

~

2b.
It might be helpful to declare another local var...
bh_node_type cur_node = heap->bh_nodes[index];

... because I think it will be more readable to say:
+ if (cur_node == new_node)
+ return;

and

+ found = delete_nodeidx(heap, cur_node);

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#49Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#46)
Re: Improve eviction algorithm in ReorderBuffer

Here are some review comments for v8-0003

======
0. GENERAL -- why the state enum?

This patch introduced a new ReorderBufferMemTrackState with 2 states
(REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)

It's the same as having a boolean flag OFF/ON, so I didn't see any
benefit of the enum instead of a simple boolean flag like
'track_txn_sizes'.

NOTE: Below in this post (see #11) I would like to propose another
idea, which can simplify much further, eliminating the need for the
state boolean. If adopted that will impact lots of these other review
comments.

======
Commit Message

1.
Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. Which could lead to a significant replication lag
especially in case where there are many subtransactions.

~

/Which could/This could/

/in case/in the case/

======
.../replication/logical/reorderbuffer.c

2.
* We use a max-heap with transaction size as the key to efficiently find
* the largest transaction. The max-heap state is managed in two states:
* REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

/The max-heap state is managed in two states:/The max-heap is managed
in two states:/

~~~

3.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While using max-heap to select
+ * the largest transaction is effective when there are many transactions being
+ * decoded, in many systems there is generally no need to use it as long as all
+ * transactions being decoded are top-level transactions. Therefore, we use
+ * MaxConnections as the threshold* so we can prevent switch to the
state unless
+ * we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

3a.
/memory track state./memory tracking state./

/While using max-heap/Although using max-heap/

"in many systems" (are these words adding anything?)

/threshold*/threshold/

/so we can prevent switch/so we can prevent switching/

~

3b.
There's nothing really in this name to indicate the units of the
threshold. Consider if there is some more informative name for this
macro: e.g.
MAXHEAP_TX_COUNT_THRESHOLD (?)

~~~

4.
+ /*
+ * Don't start with a lower number than
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+ */
+ buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+    ReorderBufferTXNSizeCompare,
+    true, NULL);
+

IIUC the comment intends to say:

Allocate the initial heap size greater than THRESHOLD because the
txn_heap will not be used until the threshold is exceeded.

Also, maybe the comment should make a point of saying "Note: the
binary heap is INDEXED for faster manipulations", or something
similar.

~~~

5.
static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
ReorderBufferChange *change,
+ ReorderBufferTXN *txn,
bool addition, Size sz)
{
- ReorderBufferTXN *txn;
ReorderBufferTXN *toptxn;

- Assert(change->txn);
-

There seems to be some trick now where the passed 'change' could be NULL,
which was not possible before. e.g., when change is NULL then 'txn' is
not NULL, and vice versa. Some explanation about this logic and the
meaning of these parameters should be written in this function
comment.

~

6.
+ txn = txn != NULL ? txn : change->txn;

IMO it's more natural to code the ternary using the same order as the
parameters:

e.g. txn = change ? change->txn : txn;

~~~

7.
/*
* Build the max-heap and switch the state. We will run a heap assembly step
* at the end, which is more efficient.
*/
static void
ReorderBufferBuildMaxHeap(ReorderBuffer *rb)

/We will run a heap assembly step at the end, which is more
efficient./The heap assembly step is deferred until the end, for
efficiency./

~~~

8. ReorderBufferLargestTXN

+ if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+ {
+ HASH_SEQ_STATUS hash_seq;
+ ReorderBufferTXNByIdEnt *ent;
+
+ hash_seq_init(&hash_seq, rb->by_txn);
+ while ((ent = hash_seq_search(&hash_seq)) != NULL)
+ {
+ ReorderBufferTXN *txn = ent->txn;
+
+ /* if the current transaction is larger, remember it */
+ if ((!largest) || (txn->size > largest->size))
+ largest = txn;
+ }
+
+ Assert(largest);
+ }

That Assert(largest) seems redundant because there is anyway another
Assert(largest) immediately after this code.

~~~

9.
+ /* Get the largest transaction from the max-heap */
+ if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+ {
+ Assert(binaryheap_size(rb->txn_heap) > 0);
+ largest = (ReorderBufferTXN *)
+ DatumGetPointer(binaryheap_first(rb->txn_heap));
+ }

The Assert(binaryheap_size(rb->txn_heap) > 0); seemed like a slightly
less readable way of saying:

Assert(!binaryheap_empty(rb->txn_heap));

~~~

10.
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)

10a.
/Compare between sizes of two transactions./Compare two transactions by size./

~~~

10b.
IMO this comparator function belongs just before the
ReorderBufferAllocate() function since that is the only place where it
is used.

======
src/include/replication/reorderbuffer.h

11.
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+ /*
+ * We don't update max-heap while updating the memory counter. The
+ * max-heap is built before use.
+ */
+ REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+ /*
+ * We also update the max-heap when updating the memory counter so the
+ * heap property is always preserved.
+ */
+ REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

In my GENERAL review comment #0, I suggested the removal of this
entire enum. e.g. It could be replaced with a boolean field
'track_txn_sizes'

TBH, I think there is a better way to handle this "state". IIUC
- the txn_heap is always allocated up-front.
- you only "build" it when > threshold and
- when it drops < 0.9 x threshold you reset it.

Therefore, AFAICT you do not need to maintain any “switch states” at
all; you simply need to check binaryheap_empty(txn_heap), right?
* If the heap is empty…. It means you are NOT tracking, so don’t use it
* If the heap is NOT empty …. It means you ARE tracking, so use it.

~

Using my idea to remove the state flag will have the side effect of
simplifying many other parts of this patch. For example

BEFORE
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+ if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+ return;
+
...
+ if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+ {
+ rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ binaryheap_reset(rb->txn_heap);
+ }
+}
AFTER
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+ if (binaryheap_empty(rb->txn_heap))
+ return;
+
...
+ if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+ binaryheap_reset(rb->txn_heap);
+}

~~~

12. struct ReorderBuffer

+ /* Max-heap for sizes of all top-level and sub transactions */
+ ReorderBufferMemTrackState memtrack_state;
+ binaryheap *txn_heap;
+

12a.
Why is this being referred to in the commit message and code comments
as "max-heap" when the field is not called by that same name? Won't it
be better to give the field a better name -- e.g. "txn_maxheap" or
similar?

~

12b.
This comment should also say that the heap is ordered by tx size --
(e.g. the comparator is ReorderBufferTXNSizeCompare)

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#50Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#48)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Mar 8, 2024 at 12:58 PM Peter Smith <smithpb2250@gmail.com> wrote:

On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Mar 5, 2024 at 3:28 PM Peter Smith <smithpb2250@gmail.com> wrote:

4a.
The comment in simplehash.h says
* The following parameters are only relevant when SH_DEFINE is defined:
* - SH_KEY - ...
* - SH_EQUAL(table, a, b) - ...
* - SH_HASH_KEY(table, key) - ...
* - SH_STORE_HASH - ...
* - SH_GET_HASH(tb, a) - ...

So maybe it is nicer to reorder the #defines in that same order?

SUGGESTION:
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_KEY key
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#include "lib/simplehash.h"

I'm really not sure it helps increase readability. For instance, for
me it's more readable if SH_DEFINE and SH_DECLARE come last, just
before the #include, since it's more obvious whether we want to
declare, define, or both. Other simplehash.h users also do so.

OK.

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
*/
binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now I understand your point. That's a valid point.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

If the 'index' is true, the caller can not only use the existing
functions but also newly added functions such as
binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
something like below?

* If 'indexed' is true, we create a hash table to track each node's
* index in the heap, enabling the caller to perform some operations
* such as binaryheap_remove_node_ptr() etc.

And, here are some review comments for v8-0002.

======
1. delete_nodeidx

+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!binaryheap_indexed(heap))
+ return false;
+
+ return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

1a.
In v8 this function was changed to now return bool, so, I think the
function comment should explain the meaning of that return value.

~

1b.
I felt the function body is better expressed positively: "If this then
do that", instead of "If not this then do nothing otherwise do that"

SUGGESTION
if (binaryheap_indexed(heap))
return bh_nodeidx_delete(heap->bh_nodeidx, node);

return false;

~~~

2.
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ /* Quick return if not necessary to move */
+ if (heap->bh_nodes[index] == new_node)
+ return;
+
+ /*
+ * Remove overwritten node's index. The overwritten node's position must
+ * have been tracked, if enabled.
+ */
+ found = delete_nodeidx(heap, heap->bh_nodes[index]);
+ Assert(!binaryheap_indexed(heap) || found);
+
+ /*
+ * Replace it with the given new node. This node's position must also be
+ * tracked as we assume to replace the node by the existing node.
+ */
+ found = set_node(heap, new_node, index);
+ Assert(!binaryheap_indexed(heap) || found);
+}

2a.
/Remove overwritten/Remove the overwritten/
/replace the node by the existing node/replace the node with the existing node/

~

2b.
It might be helpful to declare another local var...
bh_node_type cur_node = heap->bh_nodes[index];

... because I think it will be more readable to say:
+ if (cur_node == new_node)
+ return;

and

+ found = delete_nodeidx(heap, cur_node);

As for the changes around delete_nodeidx(), I've changed
delete_nodeidx() to return nothing, as the return value would not be
very helpful and seems confusing. I've simplified the replace_node()
logic accordingly.
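
For illustration, the simplified replace_node() might look roughly
like this (a sketch based on the v8 naming, with delete_nodeidx()
assumed to return void):

static void
replace_node(binaryheap *heap, int index, bh_node_type new_node)
{
	bool		found PG_USED_FOR_ASSERTS_ONLY;

	/* Quick return if not necessary to move */
	if (heap->bh_nodes[index] == new_node)
		return;

	/* Remove the overwritten node's index, if tracking is enabled */
	delete_nodeidx(heap, heap->bh_nodes[index]);

	/*
	 * Replace it with the given new node. The new node must already be
	 * tracked, since we only ever replace with an existing heap node.
	 */
	found = set_node(heap, new_node, index);
	Assert(!binaryheap_indexed(heap) || found);
}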

I'll update the 0003 patch to address your comments and submit the
updated patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#51Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#50)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 8, 2024 at 12:58 PM Peter Smith <smithpb2250@gmail.com> wrote:

...

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
*/
binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now I understand your point. That's a valid point.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

If the 'index' is true, the caller can not only use the existing
functions but also newly added functions such as
binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
something like below?

You said: "can not only use the existing functions but also..."

Hmm. Is that right? IIUC those existing "remove" functions should NOT
be called directly if the heap was "indexed" because they'll delete
the node from the heap OK, but any corresponding index for that
deleted node will be left lying around -- i.e. everything gets out of
sync. This was the reason for my original concern.

* If 'indexed' is true, we create a hash table to track each node's
* index in the heap, enabling the caller to perform some operations
* such as binaryheap_remove_node_ptr() etc.

Yeah, something like that... I'll wait for the next patch version
before commenting further.

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#52Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#51)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Mar 13, 2024 at 10:15 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 8, 2024 at 12:58 PM Peter Smith <smithpb2250@gmail.com> wrote:

...

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
*/
binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now I understand your point. That's a valid point.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

If the 'index' is true, the caller can not only use the existing
functions but also newly added functions such as
binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
something like below?

You said: "can not only use the existing functions but also..."

Hmm. Is that right? IIUC those existing "remove" functions should NOT
be called directly if the heap was "indexed" because they'll delete
the node from the heap OK, but any corresponding index for that
deleted node will be left lying around -- i.e. everything gets out of
sync. This was the reason for my original concern.

All existing binaryheap functions should be available even if the
binaryheap is 'indexed'. For instance, with the patch,
binaryheap_remove_node() is:

void
binaryheap_remove_node(binaryheap *heap, int n)
{
int cmp;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(n >= 0 && n < heap->bh_size);

/* compare last node to the one that is being removed */
cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
heap->bh_nodes[n],
heap->bh_arg);

/* remove the last node, placing it in the vacated entry */
replace_node(heap, n, heap->bh_nodes[heap->bh_size]);

/* sift as needed to preserve the heap property */
if (cmp > 0)
sift_up(heap, n);
else if (cmp < 0)
sift_down(heap, n);
}

The replace_node(), sift_up() and sift_down() update the node's index
as well if the binaryheap is indexed. When deleting a node from the
binaryheap, it will also delete its index from the hash table.
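
For example, sift_up() funnels every array write through set_node(),
which keeps the hash table in sync. Roughly (a sketch abbreviated from
the patch):

static void
sift_up(binaryheap *heap, int node_off)
{
	bh_node_type node_val = heap->bh_nodes[node_off];

	while (node_off != 0)
	{
		int			parent_off = parent_offset(node_off);
		bh_node_type parent_val = heap->bh_nodes[parent_off];

		/* If the node sorts at or below its parent, we're done. */
		if (heap->bh_compare(node_val, parent_val, heap->bh_arg) <= 0)
			break;

		/* Move the parent down; set_node() also updates its index. */
		set_node(heap, parent_val, node_off);
		node_off = parent_off;
	}

	/* Place the node at its final position, updating its index too. */
	set_node(heap, node_val, node_off);
}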

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#53Peter Smith
smithpb2250@gmail.com
In reply to: Masahiko Sawada (#52)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Mar 13, 2024 at 10:15 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 8, 2024 at 12:58 PM Peter Smith <smithpb2250@gmail.com> wrote:

...

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
*/
binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
bh_nodeidx_entry *ent;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(heap->bh_indexed);

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now I understand your point. That's a valid point.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

If the 'index' is true, the caller can not only use the existing
functions but also newly added functions such as
binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
something like below?

You said: "can not only use the existing functions but also..."

Hmm. Is that right? IIUC those existing "remove" functions should NOT
be called directly if the heap was "indexed" because they'll delete
the node from the heap OK, but any corresponding index for that
deleted node will be left lying around -- i.e. everything gets out of
sync. This was the reason for my original concern.

All existing binaryheap functions should be available even if the
binaryheap is 'indexed'. For instance, with the patch,
binaryheap_remove_node() is:

void
binaryheap_remove_node(binaryheap *heap, int n)
{
int cmp;

Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
Assert(n >= 0 && n < heap->bh_size);

/* compare last node to the one that is being removed */
cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
heap->bh_nodes[n],
heap->bh_arg);

/* remove the last node, placing it in the vacated entry */
replace_node(heap, n, heap->bh_nodes[heap->bh_size]);

/* sift as needed to preserve the heap property */
if (cmp > 0)
sift_up(heap, n);
else if (cmp < 0)
sift_down(heap, n);
}

The replace_node(), sift_up() and sift_down() update the node's index
as well if the binaryheap is indexed. When deleting a node from the
binaryheap, it will also delete its index from the hash table.

I see now. Thanks for the information.

~~~

Some more review comments for v8-0002

======

1.
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!binaryheap_indexed(heap))
+ return false;
+
+ return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

I wasn't sure if having this function was a good idea. Yes, it makes
code more readable, but I felt the heap code ought to be as efficient
as possible so maybe it is better for the index check to be done at
the caller, instead of incurring any overhead of function calls that
might do nothing.

SUGGESTION
if (binaryheap_indexed(heap))
found = bh_nodeidx_delete(heap->bh_nodeidx, node);

~~~

2.
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+ bh_nodeidx_entry *ent;
+
+ Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+ Assert(binaryheap_indexed(heap));
+
+ ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+ Assert(ent);
+ Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+ sift_up(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+ bh_nodeidx_entry *ent;
+
+ Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+ Assert(binaryheap_indexed(heap));
+
+ ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+ Assert(ent);
+ Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+ sift_down(heap, ent->index);
+}

Since those functions are almost identical, wouldn't it be better to
combine them, passing the sift direction?

SUGGESTION
binaryheap_resift(binaryheap *heap, bh_node_type d, bool sift_dir_up)
{
...

if (sift_dir_up)
sift_up(heap, ent->index);
else
sift_down(heap, ent->index);
}

----------
Kind Regards,
Peter Smith.
Fujitsu Australia

#54Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#49)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Mon, Mar 11, 2024 at 3:04 PM Peter Smith <smithpb2250@gmail.com> wrote:

Here are some review comments for v8-0003

======
0. GENERAL -- why the state enum?

This patch introduced a new ReorderBufferMemTrackState with 2 states
(REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)

It's the same as having a boolean flag OFF/ON, so I didn't see any
benefit of the enum instead of a simple boolean flag like
'track_txn_sizes'.

NOTE: Below in this post (see #11) I would like to propose another
idea, which can simplify much further, eliminating the need for the
state boolean. If adopted that will impact lots of these other review
comments.

Good point! We used to use three states in an earlier version of the
patch, but now that we have only two we don't necessarily need an
enum. I've used your idea.
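
With that, ReorderBufferLargestTXN() can treat an empty heap as "not
tracking" and a non-empty heap as "tracking". A rough sketch of the
idea (using your suggested MAXHEAP_TX_COUNT_THRESHOLD name from
comment 3b for the threshold):

static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)
{
	ReorderBufferTXN *largest = NULL;

	/* An empty max-heap means we are not tracking via the heap */
	if (binaryheap_empty(rb->txn_heap) &&
		hash_get_num_entries(rb->by_txn) < MAXHEAP_TX_COUNT_THRESHOLD)
	{
		HASH_SEQ_STATUS hash_seq;
		ReorderBufferTXNByIdEnt *ent;

		/* Few transactions: a sequential scan beats heap maintenance */
		hash_seq_init(&hash_seq, rb->by_txn);
		while ((ent = hash_seq_search(&hash_seq)) != NULL)
		{
			ReorderBufferTXN *txn = ent->txn;

			if (largest == NULL || txn->size > largest->size)
				largest = txn;
		}
	}
	else
	{
		/*
		 * Build the max-heap on first use; from then on it is kept up
		 * to date by the memory accounting code.
		 */
		if (binaryheap_empty(rb->txn_heap))
			ReorderBufferBuildMaxHeap(rb);

		largest = (ReorderBufferTXN *)
			DatumGetPointer(binaryheap_first(rb->txn_heap));
	}

	Assert(largest);
	return largest;
}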

======
Commit Message

1.
Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. Which could lead to a significant replication lag
especially in case where there are many subtransactions.

~

/Which could/This could/

/in case/in the case/

Fixed.

======
.../replication/logical/reorderbuffer.c

2.
* We use a max-heap with transaction size as the key to efficiently find
* the largest transaction. The max-heap state is managed in two states:
* REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and
REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.

/The max-heap state is managed in two states:/The max-heap is managed
in two states:/

This part is removed.

~~~

3.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While using max-heap to select
+ * the largest transaction is effective when there are many transactions being
+ * decoded, in many systems there is generally no need to use it as long as all
+ * transactions being decoded are top-level transactions. Therefore, we use
+ * MaxConnections as the threshold* so we can prevent switch to the
state unless
+ * we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

3a.
/memory track state./memory tracking state./

/While using max-heap/Although using max-heap/

"in many systems" (are these words adding anything?)

/threshold*/threshold/

/so we can prevent switch/so we can prevent switching/

Fixed.

~

3b.
There's nothing really in this name to indicate the units of the
threshold. Consider if there is some more informative name for this
macro: e.g.
MAXHEAP_TX_COUNT_THRESHOLD (?)

Fixed.

~~~

4.
+ /*
+ * Don't start with a lower number than
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
+ * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
+ */
+ buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2,
+    ReorderBufferTXNSizeCompare,
+    true, NULL);
+

IIUC the comment intends to say:

Allocate the initial heap size greater than THRESHOLD because the
txn_heap will not be used until the threshold is exceeded.

Also, maybe the comment should make a point of saying "Note: the
binary heap is INDEXED for faster manipulations", or something
similar.

Fixed.

~~~

5.
static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
ReorderBufferChange *change,
+ ReorderBufferTXN *txn,
bool addition, Size sz)
{
- ReorderBufferTXN *txn;
ReorderBufferTXN *toptxn;

- Assert(change->txn);
-

There seems to be some trick now where the passed 'change' could be NULL,
which was not possible before. e.g., when change is NULL then 'txn' is
not NULL, and vice versa. Some explanation about this logic and the
meaning of these parameters should be written in this function
comment.

Added comments.

~

6.
+ txn = txn != NULL ? txn : change->txn;

IMO it's more natural to code the ternary using the same order as the
parameters:

e.g. txn = change ? change->txn : txn;

I see your point. I changed it to:

if (txn == NULL)
	txn = change->txn;

so we don't change txn if it's not NULL.
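
For context, the two calling conventions that result look like this (a
condensed sketch; both call sites appear verbatim in the v9 patch
below):

/* Per-change accounting: the txn is derived from change->txn. */
ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
								ReorderBufferChangeSize(change));

/* Bulk accounting, e.g. after serializing or truncating a whole
 * transaction: there is no single change, so the txn is passed
 * explicitly. */
ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);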

~~~

7.
/*
* Build the max-heap and switch the state. We will run a heap assembly step
* at the end, which is more efficient.
*/
static void
ReorderBufferBuildMaxHeap(ReorderBuffer *rb)

/We will run a heap assembly step at the end, which is more
efficient./The heap assembly step is deferred until the end, for
efficiency./

Fixed.

~~~

8. ReorderBufferLargestTXN

+	if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
+	{
+		HASH_SEQ_STATUS hash_seq;
+		ReorderBufferTXNByIdEnt *ent;
+
+		hash_seq_init(&hash_seq, rb->by_txn);
+		while ((ent = hash_seq_search(&hash_seq)) != NULL)
+		{
+			ReorderBufferTXN *txn = ent->txn;
+
+			/* if the current transaction is larger, remember it */
+			if ((!largest) || (txn->size > largest->size))
+				largest = txn;
+		}
+
+		Assert(largest);
+	}

That Assert(largest) seems redundant because there is another
Assert(largest) immediately after this code anyway.

Removed.

~~~

9.
+	/* Get the largest transaction from the max-heap */
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+	{
+		Assert(binaryheap_size(rb->txn_heap) > 0);
+		largest = (ReorderBufferTXN *)
+			DatumGetPointer(binaryheap_first(rb->txn_heap));
+	}
Assert(binaryheap_size(rb->txn_heap) > 0); seemed like a slightly less
readable way of saying:

Assert(!binaryheap_empty(rb->txn_heap));

Fixed.

~~~

10.
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)

10a.
/Compare between sizes of two transactions./Compare two transactions by size./

Fixed.

~~~

10b.
IMO this comparator function belongs just before the
ReorderBufferAllocate() function since that is the only place where it
is used.

I think it's better to keep it close to the new max-heap-related functions.

======
src/include/replication/reorderbuffer.h

11.
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+	/*
+	 * We don't update max-heap while updating the memory counter. The
+	 * max-heap is built before use.
+	 */
+	REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+	/*
+	 * We also update the max-heap when updating the memory counter so the
+	 * heap property is always preserved.
+	 */
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

In my GENERAL review comment #0, I suggested the removal of this
entire enum. e.g. It could be replaced with a boolean field
'track_txn_sizes'

TBH, I think there is a better way to handle this "state". IIUC
- the txn_heap is always allocated up-front.
- you only "build" it when > threshold and
- when it drops < 0.9 x threshold you reset it.

Therefore, AFAICT you do not need to maintain any "switch states" at
all; you simply need to check binaryheap_empty(txn_heap), right?
* If the heap is empty, it means you are NOT tracking, so don't use it.
* If the heap is NOT empty, it means you ARE tracking, so use it.

~

Using my idea to remove the state flag will have the side effect of
simplifying many other parts of this patch. For example

BEFORE
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+	if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+		return;
+
...
+	if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+	{
+		rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+		binaryheap_reset(rb->txn_heap);
+	}
+}
AFTER
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+	if (binaryheap_empty(rb->txn_heap))
+		return;
+
...
+	if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+		binaryheap_reset(rb->txn_heap);
+}

Agreed. I removed the enum and changed the logic.
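
To see how the pieces fit together after this change, here is a
condensed sketch of the implicit-state life cycle using the v9 names
(not the verbatim patch code):

/* Build: done lazily, once the transaction count reaches the threshold. */
if (binaryheap_empty(rb->txn_heap) &&
	hash_get_num_entries(rb->by_txn) >= MAX_HEAP_TXN_COUNT_THRESHOLD)
	ReorderBufferBuildMaxHeap(rb);	/* add_unordered + binaryheap_build */

/* Consume: while the heap is non-empty, the largest txn is O(1). */
if (!binaryheap_empty(rb->txn_heap))
	largest = (ReorderBufferTXN *)
		DatumGetPointer(binaryheap_first(rb->txn_heap));

/* Reset: only once the count falls 10% below the threshold, avoiding
 * thrashing right around the boundary. */
if (!binaryheap_empty(rb->txn_heap) &&
	binaryheap_size(rb->txn_heap) < MAX_HEAP_TXN_COUNT_THRESHOLD * 0.9)
	binaryheap_reset(rb->txn_heap);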

~~~

12. struct ReorderBuffer

+	/* Max-heap for sizes of all top-level and sub transactions */
+	ReorderBufferMemTrackState memtrack_state;
+	binaryheap *txn_heap;
+

12a.
Why is this being referred to in the commit message and code comments
as "max-heap" when the field is not called by that same name? Won't it
be better to give the field a better name -- e.g. "txn_maxheap" or
similar?

I'm not sure it helps readability. Other code that uses binaryheap
includes neither "max" nor "min" in the field name.

~

12b.
This comment should also say that the heap is ordered by tx size --
(e.g. the comparator is ReorderBufferTXNSizeCompare)

It seems to me the comment "/* Max-heap for sizes of all top-level and
sub transactions */" already mentions that, no? I'm not sure we need
to refer to the actual function name here.

I've attached new version patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v9-0001-Make-binaryheap-enlargeable.patch (application/octet-stream)
From 00afd765c232bd00484316b497e921a9947cd1d3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v9 1/3] Make binaryheap enlargeable.

The node array space of the binaryheap is doubled when there is no
available space.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/common/binaryheap.c      | 37 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..843e764bb6 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -74,6 +73,7 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	pfree(heap->bh_nodes);
 	pfree(heap);
 }
 
@@ -104,6 +104,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Double the space allocated for nodes.
+ */
+static void
+enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +126,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

v9-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch (application/octet-stream)
From 3f70fd0292cb593f98f9e3511f877c8f1bbd36f5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v9 2/3] Add functions to binaryheap for efficient key removal
 and update.

Previously, binaryheap didn't support updating a key and removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller has to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).

This commit adds a hash table to binaryheap in order to track the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on average and O(log n)
in the worst case. This is known as an indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 198 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  36 +++-
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 232 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 45f6017c29..ce19e0837a 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -422,6 +422,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e1b9b984a7..3efebd537f 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f97035ca03..fee5955b13 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -248,7 +248,8 @@ PgArchiverMain(void)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 001f901ee6..393713af91 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1294,6 +1294,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..eee5021197 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2724,6 +2724,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..6587a7b081 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 4cb754caa5..7362f7c961 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 843e764bb6..0f8cf6fd51 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,30 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -34,9 +56,14 @@ static void sift_up(binaryheap *heap, int node_off);
  * store the given number of nodes, with the heap property defined by
  * the given comparator function, which will be invoked with the additional
  * argument specified by 'arg'.
+ *
+ * If 'indexed' is true, we create a hash table to track each node's
+ * index in the heap, enabling to perform some operations such as
+ * binaryheap_remove_node_ptr() etc.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -48,6 +75,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
+	heap->bh_nodeidx = NULL;
+
+	if (indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
 
 	return heap;
 }
@@ -63,6 +101,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +114,9 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap->bh_nodes);
 	pfree(heap);
 }
@@ -115,6 +159,67 @@ enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at the 'index' and track it if required.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+set_node(binaryheap *heap, bh_node_type node, int index)
+{
+	bool		found = false;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	if (binaryheap_indexed(heap))
+	{
+		bh_nodeidx_entry *ent;
+
+		/* Keep track of the node index */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->index = index;
+	}
+
+	return found;
+}
+
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static inline void
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_delete(heap->bh_nodeidx, node);
+}
+
+/*
+ * Replace the existing node at 'idx' with the given 'new_node'. Also
+ * update their positions accordingly. Note that we assume the new_node's
+ * position is already tracked if enabled, i.e. the new_node is already
+ * present in the heap.
+ */
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+	bool		found PG_USED_FOR_ASSERTS_ONLY;
+
+	/* Quick return if not necessary to move */
+	if (heap->bh_nodes[index] == new_node)
+		return;
+
+	/* Remove the overwritten node's index */
+	delete_nodeidx(heap, heap->bh_nodes[index]);
+
+	/*
+	 * Replace it with the given new node. This node's position must also be
+	 * tracked, as we assume the replacement node is already in the heap.
+	 */
+	found = set_node(heap, new_node, index);
+	Assert(!binaryheap_indexed(heap) || found);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +236,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -164,7 +269,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -205,6 +310,8 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		delete_nodeidx(heap, result);
+
 		return result;
 	}
 
@@ -212,7 +319,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -238,7 +345,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -247,6 +354,77 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->index);
+}
+
+/*
+ * Workhorse for binaryheap_update_up and binaryheap_update_down.
+ */
+static void
+resift_node(binaryheap *heap, bh_node_type node, bool sift_dir_up)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, node);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	if (sift_dir_up)
+		sift_up(heap, ent->index);
+	else
+		sift_down(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, true);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, false);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -259,7 +437,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -301,11 +479,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
 
 /*
@@ -360,9 +538,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..a7240aa0c2 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,29 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;			/* entry's index within the node array */
+	char		status;			/* hash status */
+	uint32		hash;			/* hash values (cached) */
+} bh_nodeidx_entry;
+
+/* Define parameters necessary to generate the hash table interface. */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +70,18 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_nodeidx is not NULL, it is used to track each
+	 * node's index in bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+	 */
+	bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,10 +90,14 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
 #define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
+#define binaryheap_indexed(h)		((h)->bh_nodeidx != NULL)
 
 #endif							/* BINARYHEAP_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index aa7a25b8f8..b82efa75ba 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4058,3 +4058,4 @@ rfile
 ws_options
 ws_file_info
 PathKeyInfo
+bh_nodeidx_entry
-- 
2.39.3

v9-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch (application/octet-stream)
From 6d4e5a520888d6385703a4539b33d003793102af Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v9 3/3] Improve eviction algorithm in Reorderbuffer using
 max-heap for many subtransactions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. This could lead to a significant replication lag
especially in the case where there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer using the
max-heap with transaction size as the key to efficiently find the
largest transaction.

The max-heap starts out empty. While the max-heap is empty, we don't
do anything for the max-heap when updating the memory
counter. Therefore, we get the largest transaction in O(N) time, where
N is the number of transactions including top-level transactions and
subtransactions.

We build the max-heap just before selecting the largest transactions
if the number of transactions being decoded is higher than the
threshold, MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap,
we also update the max-heap when updating the memory counter. The
intention is to efficiently find the largest transaction in O(1) time
instead of incurring the cost of memory counter updates (O(log
N)). Once the number of transactions goes below the threshold, we
reset the max-heap.

The performance benchmark results showed significant speed up (more
than x30 speed up on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Álvaro Herrera, Euler Taveira, Peter
Smith
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 226 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |   4 +
 2 files changed, 202 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 393713af91..aa961a924e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,21 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. While the max-heap is empty, we don't update
+ *	  the max-heap when updating the memory counter. Therefore, we can get
+ *	  the largest transaction in O(N) time, where N is the number of
+ *	  transactions including top-level transactions and subtransactions.
+ *
+ *	  We build the max-heap just before selecting the largest transactions
+ *	  if the number of transactions being decoded is higher than the threshold,
+ *	  MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
+ *	  update the max-heap when updating the memory counter. The intention is
+ *	  to efficiently find the largest transaction in O(1) time instead of
+ *	  incurring the cost of memory counter updates (O(log N)). Once the number
+ *	  of transactions goes below the threshold, we reset the max-heap
+ *	  (refer to ReorderBufferMaybeResetMaxHeap() for details).
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -107,6 +122,22 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * Threshold of the total number of top-level and sub transactions that controls
+ * whether we use the max-heap. Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent building the max-heap unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD	MaxConnections
+
+/*
+ * A macro to check if the max-heap is ready to use and needs to be updated
+ * accordingly.
+ */
+#define ReorderBufferMaxHeapIsReady(rb) !binaryheap_empty((rb)->txn_heap)
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -259,6 +290,9 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
+static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
+static void ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb);
+static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -293,6 +327,7 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
+											ReorderBufferTXN *txn,
 											bool addition, Size sz);
 
 /*
@@ -355,6 +390,17 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * The binaryheap is indexed for faster manipulations.
+	 *
+	 * We allocate the initial heap size greater than
+	 * MAX_HEAP_TXN_COUNT_THRESHOLD because the txn_heap will not be used
+	 * until the threshold is exceeded.
+	 */
+	buffer->txn_heap = binaryheap_allocate(MAX_HEAP_TXN_COUNT_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -485,7 +531,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 {
 	/* update memory accounting info */
 	if (upd_mem)
-		ReorderBufferChangeMemoryUpdate(rb, change, false,
+		ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
 										ReorderBufferChangeSize(change));
 
 	/* free contained data */
@@ -816,7 +862,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries_mem++;
 
 	/* update memory accounting information */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 
 	/* process partial change */
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1586,8 +1632,13 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	ReorderBufferMaybeResetMaxHeap(rb);
 }
 
 /*
@@ -1637,9 +1688,12 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/*
 	 * Mark the transaction as streamed.
 	 *
@@ -3166,6 +3220,9 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
  * decide if we reached the memory limit, the transaction counter allows
  * us to quickly pick the largest transaction for eviction.
  *
+ * At least one of txn or change must be non-NULL. We update the memory
+ * counter of txn if it's non-NULL, otherwise change->txn.
+ *
  * When streaming is enabled, we need to update the toplevel transaction
  * counters instead - we don't really care about subtransactions as we
  * can't stream them individually anyway, and we only pick toplevel
@@ -3174,22 +3231,25 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
+								ReorderBufferTXN *txn,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
 	ReorderBufferTXN *toptxn;
 
-	Assert(change->txn);
-
 	/*
 	 * Ignore tuple CID changes, because those are not evicted when reaching
 	 * memory limit. So we just don't count them, because it might easily
 	 * trigger a pointless attempt to spill.
 	 */
-	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	if (sz == 0)
 		return;
 
-	txn = change->txn;
+	if (txn == NULL)
+		txn = change->txn;
+	Assert(txn != NULL);
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3204,6 +3264,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3213,6 +3282,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3468,34 +3546,121 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 	}
 }
 
+
+/* Compare two transactions by size */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
+
 /*
- * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
+ * Build the max-heap. The heap assembly step is deferred until the end, for
+ * efficiency.
  */
-static ReorderBufferTXN *
-ReorderBufferLargestTXN(ReorderBuffer *rb)
+static void
+ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
 {
 	HASH_SEQ_STATUS hash_seq;
 	ReorderBufferTXNByIdEnt *ent;
-	ReorderBufferTXN *largest = NULL;
+
+	Assert(binaryheap_empty(rb->txn_heap));
 
 	hash_seq_init(&hash_seq, rb->by_txn);
 	while ((ent = hash_seq_search(&hash_seq)) != NULL)
 	{
 		ReorderBufferTXN *txn = ent->txn;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		if (txn->size == 0)
+			continue;
+
+		binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+	}
+
+	binaryheap_build(rb->txn_heap);
+}
+
+/*
+ * Reset the max-heap if the number of transactions got lower than the
+ * threshold.
+ */
+static void
+ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb)
+{
+	/*
+	 * If we add and remove transactions right around the threshold, we could
+	 * easily end up "thrashing". To avoid that, we reset the max-heap only
+	 * once the transaction count falls 10% below the threshold.
+	 */
+	if (ReorderBufferMaxHeapIsReady(rb) &&
+		binaryheap_size(rb->txn_heap) < MAX_HEAP_TXN_COUNT_THRESHOLD * 0.9)
+		binaryheap_reset(rb->txn_heap);
+}
+
+/*
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ * We use a different way to find the largest transaction depending on the
+ * memory tracking state and the number of transactions being decoded. Refer
+ * to the comments atop this file for the algorithm details.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	ReorderBufferTXN *largest = NULL;
+
+	if (!ReorderBufferMaxHeapIsReady(rb))
+	{
+		/*
+		 * If the number of transactions is small, we scan all transactions
+		 * being decoded to get the largest transaction. This saves the cost
+		 * of building a max-heap with a small number of transactions.
+		 */
+		if (hash_get_num_entries(rb->by_txn) < MAX_HEAP_TXN_COUNT_THRESHOLD)
+		{
+			HASH_SEQ_STATUS hash_seq;
+			ReorderBufferTXNByIdEnt *ent;
+
+			hash_seq_init(&hash_seq, rb->by_txn);
+			while ((ent = hash_seq_search(&hash_seq)) != NULL)
+			{
+				ReorderBufferTXN *txn = ent->txn;
+
+				/* if the current transaction is larger, remember it */
+				if ((!largest) || (txn->size > largest->size))
+					largest = txn;
+			}
+		}
+		else
+		{
+			/*
+			 * There are a large number of transactions in ReorderBuffer. We
+			 * build the max-heap for efficiently selecting the largest
+			 * transactions.
+			 */
+			ReorderBufferBuildMaxHeap(rb);
+
+			/*
+			 * The max-heap is ready now. We remain in this state at least
+			 * until we free up enough transactions to bring the total memory
+			 * usage below the limit. The largest transaction is selected
+			 * below.
+			 */
+			Assert(ReorderBufferMaxHeapIsReady(rb));
+		}
 	}
 
+	/* Get the largest transaction from the max-heap */
+	if (ReorderBufferMaxHeapIsReady(rb))
+		largest = (ReorderBufferTXN *)
+			DatumGetPointer(binaryheap_first(rb->txn_heap));
+
 	Assert(largest);
 	Assert(largest->size > 0);
 	Assert(largest->size <= rb->size);
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		Assert(txn->nentries_mem == 0);
 	}
 
+	ReorderBufferMaybeResetMaxHeap(rb);
+
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
@@ -3705,11 +3872,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, size);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4491,7 +4661,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * update the accounting too (subtracting the size from the counters). And
 	 * we don't want to underflow there.
 	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -4903,9 +5073,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContextSwitchTo(oldcontext);
 
 	/* subtract the old change size */
-	ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, false, old_size);
 	/* now add the change back, with the correct size */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..a5aec01c2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -631,6 +632,9 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	binaryheap *txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

#55 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Smith (#53)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Mar 13, 2024 at 11:23 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Mar 13, 2024 at 10:15 AM Peter Smith <smithpb2250@gmail.com> wrote:

On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 8, 2024 at 12:58 PM Peter Smith <smithpb2250@gmail.com> wrote:

...

5.
+ *
+ * If 'indexed' is true, we create a hash table to track of each node's
+ * index in the heap, enabling to perform some operations such as removing
+ * the node from the heap.
*/
binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+ bool indexed, void *arg)

BEFORE
... enabling to perform some operations such as removing the node from the heap.

SUGGESTION
... to help make operations such as removing nodes more efficient.

But these operations literally require the indexed binary heap as we
have an assertion:

void
binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
{
	bh_nodeidx_entry *ent;

	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
	Assert(heap->bh_indexed);

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now I understand your point. That's a valid point.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

If the 'index' is true, the caller can not only use the existing
functions but also newly added functions such as
binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
something like below?

You said: "can not only use the existing functions but also..."

Hmm. Is that right? IIUC those existing "remove" functions should NOT
be called directly if the heap was "indexed" because they'll delete
the node from the heap OK, but any corresponding index for that
deleted node will be left lying around -- i.e. everything gets out of
sync. This was the reason for my original concern.

All existing binaryheap functions should be available even if the
binaryheap is 'indexed'. For instance, with the patch,
binaryheap_remove_node() is:

void
binaryheap_remove_node(binaryheap *heap, int n)
{
	int			cmp;

	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
	Assert(n >= 0 && n < heap->bh_size);

	/* compare last node to the one that is being removed */
	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
						   heap->bh_nodes[n],
						   heap->bh_arg);

	/* remove the last node, placing it in the vacated entry */
	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);

	/* sift as needed to preserve the heap property */
	if (cmp > 0)
		sift_up(heap, n);
	else if (cmp < 0)
		sift_down(heap, n);
}

The replace_node(), sift_up() and sift_down() functions update the
node's index as well if the binaryheap is indexed. When a node is
deleted from the binaryheap, its index is also deleted from the hash
table.

I see now. Thanks for the information.
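
For reference, here is a minimal usage sketch of the indexed
binaryheap API (MyItem and my_item_compare are made-up names for
illustration, not part of the patch):

typedef struct MyItem
{
	int			key;
} MyItem;

/* max-heap on key: return >0 when a should be closer to the root */
static int
my_item_compare(Datum a, Datum b, void *arg)
{
	MyItem	   *ia = (MyItem *) DatumGetPointer(a);
	MyItem	   *ib = (MyItem *) DatumGetPointer(b);

	if (ia->key < ib->key)
		return -1;
	if (ia->key > ib->key)
		return 1;
	return 0;
}

...
binaryheap *heap = binaryheap_allocate(16, my_item_compare,
									   true /* indexed */, NULL);
MyItem	   *item = palloc0(sizeof(MyItem));

binaryheap_add(heap, PointerGetDatum(item));

/* After growing the key, re-sift just this node; the internal hash
 * table finds its array position without an O(n) scan. */
item->key += 100;
binaryheap_update_up(heap, PointerGetDatum(item));

/* Remove a specific node by value (here, the pointer), again via the
 * index. */
binaryheap_remove_node_ptr(heap, PointerGetDatum(item));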

~~~

Some more review comments for v8-0002

======

1.
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (!binaryheap_indexed(heap))
+		return false;
+
+	return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

I wasn't sure if having this function was a good idea. Yes, it makes
the code more readable, but I felt the heap code ought to be as
efficient as possible, so maybe it is better for the index check to be
done by the caller instead of incurring the overhead of function calls
that might do nothing.

SUGGESTION
if (binaryheap_indexed(heap))
	found = bh_nodeidx_delete(heap->bh_nodeidx, node);

I think we can have the function inlined instead of doing the same
thing in multiple places. I've changed it in the v9 patch.

~~~

2.
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	sift_up(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	sift_down(heap, ent->index);
+}

Since those functions are almost identical, wouldn't it be better to
combine them, passing the sift direction?

SUGGESTION
binaryheap_resift(binaryheap *heap, bh_node_type d, bool sift_dir_up)
{
...

	if (sift_dir_up)
		sift_up(heap, ent->index);
	else
		sift_down(heap, ent->index);
}

I'm not really sure binaryheap_resift() is a better API than
binaryheap_update_up() and _down(). Having different APIs for
different behaviors makes sense to me. On the other hand, I see your
point that these two functions have duplicated code, so I created a
common function for them to remove the duplication.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#56 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#54)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v10-0001-Make-binaryheap-enlargeable.patch (application/octet-stream)
From e57b1a9a1651399a2c5e4a365e2e7f113c361286 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v10 1/3] Make binaryheap enlargeable.

The node array space of the binaryheap is doubled when there is no
available space.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/common/binaryheap.c      | 37 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..843e764bb6 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -74,6 +73,7 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	pfree(heap->bh_nodes);
 	pfree(heap);
 }
 
@@ -104,6 +104,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Double the space allocated for nodes.
+ */
+static void
+enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +126,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3

v10-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch (application/octet-stream)
From e6bf00bdeca7d8e9914811b620c722961bae40e3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v10 3/3] Improve eviction algorithm in Reorderbuffer using
 max-heap for many subtransactions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. This could lead to a significant replication lag
especially in the case where there are many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer using the
max-heap with transaction size as the key to efficiently find the
largest transaction.

The max-heap starts out empty. While the max-heap is empty, we don't
do anything for the max-heap when updating the memory
counter. Therefore, we get the largest transaction in O(N) time, where
N is the number of transactions including top-level transactions and
subtransactions.

We build the max-heap just before selecting the largest transactions
if the number of transactions being decoded is higher than the
threshold, MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap,
we also update the max-heap when updating the memory counter. The
intention is to efficiently find the largest transaction in O(1) time
instead of incurring the cost of memory counter updates (O(log
N)). Once the number of transactions goes below the threshold, we
reset the max-heap.

The performance benchmark results showed significant speed up (more
than x30 speed up on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Álvaro Herrera, Euler Taveira, Peter
Smith
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 226 +++++++++++++++---
 src/include/replication/reorderbuffer.h       |   4 +
 2 files changed, 202 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 07eebedbac..7223abe958 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,21 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. While the max-heap is empty, we don't update
+ *	  the max-heap when updating the memory counter. Therefore, we can get
+ *	  the largest transaction in O(N) time, where N is the number of
+ *	  transactions including top-level transactions and subtransactions.
+ *
+ *	  We build the max-heap just before selecting the largest transactions
+ *	  if the number of transactions being decoded is higher than the threshold,
+ *	  MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
+ *	  update the max-heap when updating the memory counter. The intention is
+ *	  to efficiently find the largest transaction in O(1) time instead of
+ *	  incurring the cost of memory counter updates (O(log N)). Once the number
+ *	  of transactions goes below the threshold, we reset the max-heap
+ *	  (refer to ReorderBufferMaybeResetMaxHeap() for details).
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -107,6 +122,22 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * Threshold of the total number of top-level and sub transactions that controls
+ * whether we use the max-heap. Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent switching to the state unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD	MaxConnections
+
+/*
+ * A macro to check whether the max-heap is ready to use (and hence needs
+ * to be kept up to date).
+ */
+#define ReorderBufferMaxHeapIsReady(rb) !binaryheap_empty((rb)->txn_heap)
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -259,6 +290,9 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
+static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
+static void ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb);
+static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -293,6 +327,7 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
+											ReorderBufferTXN *txn,
 											bool addition, Size sz);
 
 /*
@@ -355,6 +390,17 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * The binaryheap is indexed for faster manipulations.
+	 *
+	 * We allocate an initial heap size greater than
+	 * MAX_HEAP_TXN_COUNT_THRESHOLD because the txn_heap will not be used
+	 * until the threshold is exceeded.
+	 */
+	buffer->txn_heap = binaryheap_allocate(MAX_HEAP_TXN_COUNT_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -485,7 +531,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 {
 	/* update memory accounting info */
 	if (upd_mem)
-		ReorderBufferChangeMemoryUpdate(rb, change, false,
+		ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
 										ReorderBufferChangeSize(change));
 
 	/* free contained data */
@@ -816,7 +862,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries_mem++;
 
 	/* update memory accounting information */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 
 	/* process partial change */
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1586,8 +1632,13 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	ReorderBufferMaybeResetMaxHeap(rb);
 }
 
 /*
@@ -1637,9 +1688,12 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/*
 	 * Mark the transaction as streamed.
 	 *
@@ -3166,6 +3220,9 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
  * decide if we reached the memory limit, the transaction counter allows
  * us to quickly pick the largest transaction for eviction.
  *
+ * At least one of txn or change must be non-NULL. We update the memory
+ * counter of txn if it's non-NULL, otherwise that of change->txn.
+ *
  * When streaming is enabled, we need to update the toplevel transaction
  * counters instead - we don't really care about subtransactions as we
  * can't stream them individually anyway, and we only pick toplevel
@@ -3174,22 +3231,25 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
+								ReorderBufferTXN *txn,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
 	ReorderBufferTXN *toptxn;
 
-	Assert(change->txn);
-
 	/*
 	 * Ignore tuple CID changes, because those are not evicted when reaching
 	 * memory limit. So we just don't count them, because it might easily
 	 * trigger a pointless attempt to spill.
 	 */
-	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	if (sz == 0)
 		return;
 
-	txn = change->txn;
+	if (txn == NULL)
+		txn = change->txn;
+	Assert(txn != NULL);
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3204,6 +3264,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3213,6 +3282,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3468,34 +3546,121 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 	}
 }
 
+
+/* Compare two transactions by size */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
+
 /*
- * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
+ * Build the max-heap. The heap assembly step is deferred until the end, for
+ * efficiency.
  */
-static ReorderBufferTXN *
-ReorderBufferLargestTXN(ReorderBuffer *rb)
+static void
+ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
 {
 	HASH_SEQ_STATUS hash_seq;
 	ReorderBufferTXNByIdEnt *ent;
-	ReorderBufferTXN *largest = NULL;
+
+	Assert(binaryheap_empty(rb->txn_heap));
 
 	hash_seq_init(&hash_seq, rb->by_txn);
 	while ((ent = hash_seq_search(&hash_seq)) != NULL)
 	{
 		ReorderBufferTXN *txn = ent->txn;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		if (txn->size == 0)
+			continue;
+
+		binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+	}
+
+	binaryheap_build(rb->txn_heap);
+}
+
+/*
+ * Reset the max-heap if the number of transactions got lower than the
+ * threshold.
+ */
+static void
+ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb)
+{
+	/*
+	 * If we add and remove transactions right around the threshold, we could
+	 * easily end up "thrashing". To avoid it, we reset the max-heap only once
+	 * the number of transactions falls 10% below the threshold.
+	 */
+	if (ReorderBufferMaxHeapIsReady(rb) &&
+		binaryheap_size(rb->txn_heap) < MAX_HEAP_TXN_COUNT_THRESHOLD * 0.9)
+		binaryheap_reset(rb->txn_heap);
+}
+
+/*
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ * How we find the largest transaction depends on whether the max-heap is
+ * ready and on the number of transactions being decoded. Refer to the
+ * comments atop this file for the algorithm details.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	ReorderBufferTXN *largest = NULL;
+
+	if (!ReorderBufferMaxHeapIsReady(rb))
+	{
+		/*
+		 * If the number of transactions is small, we scan all transactions
+		 * being decoded to find the largest one. This saves the cost of
+		 * building a max-heap for just a few transactions.
+		 */
+		if (hash_get_num_entries(rb->by_txn) < MAX_HEAP_TXN_COUNT_THRESHOLD)
+		{
+			HASH_SEQ_STATUS hash_seq;
+			ReorderBufferTXNByIdEnt *ent;
+
+			hash_seq_init(&hash_seq, rb->by_txn);
+			while ((ent = hash_seq_search(&hash_seq)) != NULL)
+			{
+				ReorderBufferTXN *txn = ent->txn;
+
+				/* if the current transaction is larger, remember it */
+				if ((!largest) || (txn->size > largest->size))
+					largest = txn;
+			}
+		}
+		else
+		{
+			/*
+			 * There are a large number of transactions in the ReorderBuffer.
+			 * We build the max-heap to select the largest transaction
+			 * efficiently.
+			 */
+			ReorderBufferBuildMaxHeap(rb);
+
+			/*
+			 * The max-heap is ready now. We remain in this state at least
+			 * until enough transactions are evicted to bring the total
+			 * memory usage below the limit. The largest transaction is
+			 * selected below.
+			 */
+			Assert(ReorderBufferMaxHeapIsReady(rb));
+		}
 	}
 
+	/* Get the largest transaction from the max-heap */
+	if (ReorderBufferMaxHeapIsReady(rb))
+		largest = (ReorderBufferTXN *)
+			DatumGetPointer(binaryheap_first(rb->txn_heap));
+
 	Assert(largest);
 	Assert(largest->size > 0);
 	Assert(largest->size <= rb->size);
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 		Assert(txn->nentries_mem == 0);
 	}
 
+	ReorderBufferMaybeResetMaxHeap(rb);
+
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 }
@@ -3705,11 +3872,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, size);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4491,7 +4661,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * update the accounting too (subtracting the size from the counters). And
 	 * we don't want to underflow there.
 	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -4903,9 +5073,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContextSwitchTo(oldcontext);
 
 	/* subtract the old change size */
-	ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, false, old_size);
 	/* now add the change back, with the correct size */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..a5aec01c2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -631,6 +632,9 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	binaryheap *txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

v10-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
From 89c9256be2d08bccc05f55fa8fcc8cb6d7a636a9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v10 2/3] Add functions to binaryheap for efficient key removal
 and update.

Previously, binaryheap didn't support updating a key or removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller had to pass the node's position within the
binaryheap's internal array. Removing a node from the binaryheap is
done in O(log n), but searching for the key's position is done in
O(n).

This commit adds a hash table to binaryheap in order to track the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up(), both updating a key
and removing a node can be done in O(1) on average and O(log n) in
the worst case. This is known as an indexed binary heap. The caller
can opt in to the indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.
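
For example, with an indexed binaryheap the caller can re-sift a node
after changing its key, or remove it, without knowing its position in
the internal array. A minimal sketch (here 'cmp' is a caller-supplied
max-heap comparator and 'node' a caller-defined struct; both are
assumed for illustration):

    binaryheap *heap = binaryheap_allocate(64, cmp, true /* indexed */, NULL);

    binaryheap_add(heap, PointerGetDatum(node));

    node->key += delta;		/* the key increased ... */
    binaryheap_update_up(heap, PointerGetDatum(node));	/* ... so sift it up */

    binaryheap_remove_node_ptr(heap, PointerGetDatum(node));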

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 198 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  36 +++-
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 232 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 45f6017c29..ce19e0837a 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -422,6 +422,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e1b9b984a7..3efebd537f 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index c266904b57..2b4e5a623c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -254,7 +254,8 @@ PgArchiverMain(char *startup_data, size_t startup_data_len)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 92cf39ff74..07eebedbac 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1294,6 +1294,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..eee5021197 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2724,6 +2724,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..6587a7b081 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 4cb754caa5..7362f7c961 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 843e764bb6..0f8cf6fd51 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,30 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheaph.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -34,9 +56,14 @@ static void sift_up(binaryheap *heap, int node_off);
  * store the given number of nodes, with the heap property defined by
  * the given comparator function, which will be invoked with the additional
  * argument specified by 'arg'.
+ *
+ * If 'indexed' is true, we create a hash table to track each node's
+ * index in the heap, enabling the caller to perform operations such as
+ * binaryheap_remove_node_ptr().
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -48,6 +75,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
+	heap->bh_nodeidx = NULL;
+
+	if (indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
 
 	return heap;
 }
@@ -63,6 +101,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +114,9 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap->bh_nodes);
 	pfree(heap);
 }
@@ -115,6 +159,67 @@ enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at the 'index' and track it if required.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+set_node(binaryheap *heap, bh_node_type node, int index)
+{
+	bool		found = false;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	if (binaryheap_indexed(heap))
+	{
+		bh_nodeidx_entry *ent;
+
+		/* Keep track of the node index */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->index = index;
+	}
+
+	return found;
+}
+
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static inline void
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_delete(heap->bh_nodeidx, node);
+}
+
+/*
+ * Replace the existing node at 'index' with the given 'new_node'. Also
+ * update their positions accordingly. Note that we assume the new_node's
+ * position is already tracked if indexing is enabled, i.e. the new_node
+ * is already present in the heap.
+ */
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+	bool		found PG_USED_FOR_ASSERTS_ONLY;
+
+	/* Quick return if not necessary to move */
+	if (heap->bh_nodes[index] == new_node)
+		return;
+
+	/* Remove the overwritten node's index */
+	delete_nodeidx(heap, heap->bh_nodes[index]);
+
+	/*
+	 * Replace it with the given new node. The new node's position must
+	 * already be tracked, as we assume it is an existing node in the heap.
+	 */
+	found = set_node(heap, new_node, index);
+	Assert(!binaryheap_indexed(heap) || found);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +236,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -164,7 +269,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -205,6 +310,8 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		delete_nodeidx(heap, result);
+
 		return result;
 	}
 
@@ -212,7 +319,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -238,7 +345,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -247,6 +354,77 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->index);
+}
+
+/*
+ * Workhorse for binaryheap_update_up and binaryheap_update_down.
+ */
+static void
+resift_node(binaryheap *heap, bh_node_type node, bool sift_dir_up)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, node);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	if (sift_dir_up)
+		sift_up(heap, ent->index);
+	else
+		sift_down(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, true);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, false);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -259,7 +437,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -301,11 +479,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
 
 /*
@@ -360,9 +538,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..a7240aa0c2 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,29 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;			/* entry's index within the node array */
+	char		status;			/* hash status */
+	uint32		hash;			/* hash values (cached) */
+} bh_nodeidx_entry;
+
+/* Define parameters necessary to generate the hash table interface. */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +70,18 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_nodeidx is not NULL, it is used to track each node's index in
+	 * bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr() and binaryheap_update_up/down() in O(log n).
+	 */
+	bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,10 +90,14 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
 #define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
+#define binaryheap_indexed(h)		((h)->bh_nodeidx != NULL)
 
 #endif							/* BINARYHEAP_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4679660837..2bfcdc70aa 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4084,3 +4084,4 @@ TidStoreIter
 TidStoreIterResult
 BlocktableEntry
 ItemArray
+bh_nodeidx_entry
-- 
2.39.3

#57Shubham Khanna
khannashubham1197@gmail.com
In reply to: vignesh C (#43)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Mar 5, 2024 at 8:50 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 28 Feb 2024 at 11:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

A few comments on 0003:
===================
1.
+/*
+ * Threshold of the total number of top-level and sub transactions
that controls
+ * whether we switch the memory track state. While the MAINTAIN_HEAP state is
+ * effective when there are many transactions being decoded, in many systems
+ * there is generally no need to use it as long as all transactions
being decoded
+ * are top-level transactions. Therefore, we use MaxConnections as
the threshold
+ * so we can prevent switch to the state unless we use subtransactions.
+ */
+#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections

The comment seems to imply that MAINTAIN_HEAP is useful for a large
number of transactions, but ReorderBufferLargestTXN() switches to this
state even when there is one transaction. So, basically we use the
binary_heap technique to get the largest transaction even when we have
one transaction, but we don't maintain that heap unless we have
REORDER_BUFFER_MEM_TRACK_THRESHOLD transactions in progress. This
means there is some additional work (building and resetting the heap
each time we pick the largest xact) when we have fewer transactions in
the system, but that may not be impacting us because of other costs
involved like serializing all the changes. I think we can try to
stress test this by setting debug_logical_replication_streaming to
'immediate' to see if the new mechanism has any overhead.

I ran the test with a transaction having many inserts:

      |  5000 | 10000 | 20000 | 100000 | 1000000 | 10000000
------|-------|-------|-------|--------|---------|---------
Head  | 26.31 | 48.84 | 93.65 | 480.05 | 4808.29 | 47020.16
Patch | 26.35 | 50.8  | 97.99 | 484.8  | 4856.95 | 48108.89

The same test with debug_logical_replication_streaming= 'immediate'

      |  5000 | 10000  | 20000  | 100000  | 1000000  | 10000000
------|-------|--------|--------|---------|----------|----------
Head  | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
Patch | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81

The execution times are in milliseconds; the column headers indicate
the number of inserts in the transaction.
In this case I noticed that the test execution with the patch was
taking slightly more time.

I have re-run the tests for which Vignesh had reported an issue; the
test results with the latest patch are given below:

Without debug_logical_replication_streaming = 'immediate'
Record | 10000000  | 1000000  | 100000  | 20000 | 10000  | 5000
-------|-----------|----------|---------|-------|--------|-------
Head   | 47563.759 | 4917.057 | 478.923 | 97.28 | 50.368 | 25.917
Patch  | 47445.733 | 4722.874 | 472.817 | 95.15 | 48.801 | 26.168
%imp   | 0.248     | 3.949    | 1.274   | 2.189 | 3.111  | -0.968

With debug_logical_replication_streaming = 'immediate'
Record | 10000000   | 1000000   | 100000   | 20000   | 10000   | 5000
-------|------------|-----------|----------|---------|---------|-------
Head   | 106281.236 | 10669.992 | 1073.815 | 214.287 | 107.62  | 54.947
Patch  | 103108.673 | 10603.139 | 1064.98  | 210.229 | 106.321 | 54.218
%imp   | 2.985      | 0.626     | 0.822    | 1.893   | 1.207   | 1.326

The execution times are in milliseconds; the column headers indicate
the number of inserts in the transaction. The test results show that
the issue has been resolved with the new patch.

Thanks and Regards,
Shubham Khanna.

#58vignesh C
vignesh21@gmail.com
In reply to: Masahiko Sawada (#56)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Thanks for the updated patch.
One comment:
I felt we could also mention in the commit message the improvement
where we update the memory accounting info at the transaction level
instead of per change, as done in ReorderBufferCleanupTXN,
ReorderBufferTruncateTXN, and ReorderBufferSerializeTXN:
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
/* Check we're not mixing changes from different
transactions. */
Assert(change->txn == txn);

-               ReorderBufferReturnChange(rb, change, true);
+               ReorderBufferReturnChange(rb, change, false);
        }

/*
@@ -1586,8 +1632,13 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
if (rbtxn_is_serialized(txn))
ReorderBufferRestoreCleanup(rb, txn);

+       /* Update the memory counter */
+       ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);

Regards,
Vignesh

#59Masahiko Sawada
sawada.mshk@gmail.com
In reply to: vignesh C (#58)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Mar 29, 2024 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Thanks for the updated patch.
One comment:
I felt we could also mention in the commit message the improvement
where we update the memory accounting info at the transaction level
instead of per change, as done in ReorderBufferCleanupTXN,
ReorderBufferTruncateTXN, and ReorderBufferSerializeTXN:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#60Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#59)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 29, 2024 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Thanks for the updated patch.
One comment:
I felt we can mention the improvement where we update memory
accounting info at transaction level instead of per change level which
is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
ReorderBufferSerializeTXN also in the commit message:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Few minor comments:
1.
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->nentries_mem == 0);
}

+ ReorderBufferMaybeResetMaxHeap(rb);
+

Can we write a comment about why this reset is required here?
Otherwise, the reason is not apparent.
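
For example, something along these lines could work (the wording is
just a suggestion):

+	/*
+	 * After evicting transactions, the number of transactions in the
+	 * max-heap could have fallen below the threshold; reset the max-heap
+	 * in that case so that we return to the cheap sequential scan.
+	 */
+	ReorderBufferMaybeResetMaxHeap(rb);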

2.
Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent switching to the state unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections

Isn't using the max-heap equally effective in finding the largest
transaction whether there are only top-level transactions or
top-level plus subtransactions? This comment indicates it is only
effective when there are subtransactions.

--
With Regards,
Amit Kapila.

#61Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#59)
RE: Improve eviction algorithm in ReorderBuffer

Dear Sawada-san,

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Thanks for working on this. Agreed it is committable.
Few minor comments:

```
+ * At least one of txn or change must be non-NULL. We update the memory
+ * counter of txn if it's non-NULL, otherwise that of change->txn.
```

IIUC no one checks this restriction. Should we add an Assert() for it, e.g.:
Assert(txn || change)?
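
That is, at the top of ReorderBufferChangeMemoryUpdate(), before either
pointer is dereferenced (just a sketch):

```
+	Assert(txn || change);
+
 	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
```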

```
+    /* make sure enough space for a new node */
...
+    /* make sure enough space for a new node */
```

Should these comments start with an upper-case letter?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

#62Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#60)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 29, 2024 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Thanks for the updated patch.
One comment:
I felt we could also mention in the commit message the improvement
where we update the memory accounting info at the transaction level
instead of per change, as done in ReorderBufferCleanupTXN,
ReorderBufferTruncateTXN, and ReorderBufferSerializeTXN:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Few minor comments:
1.
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->nentries_mem == 0);
}

+ ReorderBufferMaybeResetMaxHeap(rb);
+

Can we write a comment about why this reset is required here?
Otherwise, the reason is not apparent.

Yes, added.

2.
Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent switching to the state unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections

Isn't using the max-heap equally effective in finding the largest
transaction whether there are only top-level transactions or
top-level plus subtransactions? This comment indicates it is only
effective when there are subtransactions.

You're right. Updated the comment.

I've attached the updated patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v11-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
From e44863bcd3835a8d84caf4dba93228190b37877b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v11 2/3] Add functions to binaryheap for efficient key removal
 and update.

Previously, binaryheap didn't support updating a key or removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller had to pass the node's position within the
binaryheap's internal array. Removing a node from the binaryheap is
done in O(log n), but searching for the key's position is done in
O(n).

This commit adds a hash table to binaryheap in order to track the
position of each node in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up(), both updating a key
and removing a node can be done in O(1) on average and O(log n) in
the worst case. This is known as an indexed binary heap. The caller
can opt in to the indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 198 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  36 +++-
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 232 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 45f6017c29..ce19e0837a 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -422,6 +422,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e1b9b984a7..3efebd537f 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index c266904b57..2b4e5a623c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -254,7 +254,8 @@ PgArchiverMain(char *startup_data, size_t startup_data_len)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 92cf39ff74..07eebedbac 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1294,6 +1294,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..eee5021197 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2724,6 +2724,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..6587a7b081 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 4cb754caa5..7362f7c961 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 843e764bb6..0f8cf6fd51 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,30 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheaph.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -34,9 +56,14 @@ static void sift_up(binaryheap *heap, int node_off);
  * store the given number of nodes, with the heap property defined by
  * the given comparator function, which will be invoked with the additional
  * argument specified by 'arg'.
+ *
+ * If 'indexed' is true, we create a hash table to track each node's
+ * index in the heap, enabling the caller to perform operations such as
+ * binaryheap_remove_node_ptr().
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int capacity, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -48,6 +75,17 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
+	heap->bh_nodeidx = NULL;
+
+	if (indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity,
+											 NULL);
+#endif
+	}
 
 	return heap;
 }
@@ -63,6 +101,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +114,9 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap->bh_nodes);
 	pfree(heap);
 }
@@ -115,6 +159,67 @@ enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at the 'index' and track it if required.
+ *
+ * Return true if the node's index is already tracked.
+ */
+static bool
+set_node(binaryheap *heap, bh_node_type node, int index)
+{
+	bool		found = false;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	if (binaryheap_indexed(heap))
+	{
+		bh_nodeidx_entry *ent;
+
+		/* Keep track of the node index */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->index = index;
+	}
+
+	return found;
+}
+
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static inline void
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_delete(heap->bh_nodeidx, node);
+}
+
+/*
+ * Replace the existing node at 'index' with the given 'new_node'. Also
+ * update their positions accordingly. Note that we assume the new_node's
+ * position is already tracked if indexing is enabled, i.e. the new_node
+ * is already present in the heap.
+ */
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+	bool		found PG_USED_FOR_ASSERTS_ONLY;
+
+	/* Quick return if not necessary to move */
+	if (heap->bh_nodes[index] == new_node)
+		return;
+
+	/* Remove the overwritten node's index */
+	delete_nodeidx(heap, heap->bh_nodes[index]);
+
+	/*
+	 * Replace it with the given new node. The new node's position must
+	 * already be tracked, as we assume it is an existing node in the heap.
+	 */
+	found = set_node(heap, new_node, index);
+	Assert(!binaryheap_indexed(heap) || found);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +236,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -164,7 +269,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -205,6 +310,8 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		delete_nodeidx(heap, result);
+
 		return result;
 	}
 
@@ -212,7 +319,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -238,7 +345,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -247,6 +354,77 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->index);
+}
+
+/*
+ * Workhorse for binaryheap_update_up and binaryheap_update_down.
+ */
+static void
+resift_node(binaryheap *heap, bh_node_type node, bool sift_dir_up)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, node);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	if (sift_dir_up)
+		sift_up(heap, ent->index);
+	else
+		sift_down(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, true);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, false);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -259,7 +437,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -301,11 +479,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
 
 /*
@@ -360,9 +538,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 1439f20803..a7240aa0c2 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,29 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;			/* entry's index within the node array */
+	char		status;			/* hash status */
+	uint32		hash;			/* hash values (cached) */
+} bh_nodeidx_entry;
+
+/* Define parameters necessary to generate the hash table interface. */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +70,18 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_nodeidx is not NULL, it is used to track each node's index in
+	 * bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr() and binaryheap_update_up/down() in O(log n).
+	 */
+	bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,10 +90,14 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
 #define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
+#define binaryheap_indexed(h)		((h)->bh_nodeidx != NULL)
 
 #endif							/* BINARYHEAP_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8d7bed411..0c3a3aaccc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4084,3 +4084,4 @@ TidStoreIter
 TidStoreIterResult
 BlocktableEntry
 ItemArray
+bh_nodeidx_entry
-- 
2.39.3
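
As a minimal usage sketch of the indexed-heap API added by the 0002 patch above: the names demo_item, demo_size_cmp, and demo_indexed_heap are hypothetical and not part of the patch; only the binaryheap_* calls come from the patched API, and the comparator mirrors the style used by the 0003 patch below.

```c
#include "postgres.h"

#include "lib/binaryheap.h"

typedef struct demo_item
{
	Size		size;
} demo_item;

/* max-heap on size */
static int
demo_size_cmp(Datum a, Datum b, void *arg)
{
	demo_item  *ia = (demo_item *) DatumGetPointer(a);
	demo_item  *ib = (demo_item *) DatumGetPointer(b);

	if (ia->size < ib->size)
		return -1;
	if (ia->size > ib->size)
		return 1;
	return 0;
}

static void
demo_indexed_heap(demo_item *items, int nitems)
{
	binaryheap *heap;

	Assert(nitems >= 2);

	/* indexed = true creates the hash table tracking node positions */
	heap = binaryheap_allocate(nitems, demo_size_cmp, true, NULL);

	for (int i = 0; i < nitems; i++)
		binaryheap_add(heap, PointerGetDatum(&items[i]));

	/* grow one node's key, then restore the heap property for just it */
	items[0].size += 100;
	binaryheap_update_up(heap, PointerGetDatum(&items[0]));

	/* remove a node by value; the hash index finds its position */
	binaryheap_remove_node_ptr(heap, PointerGetDatum(&items[1]));

	Assert(binaryheap_size(heap) == nitems - 1);

	binaryheap_free(heap);
}
```

Neither of the last two operations requires the caller to know the node's offset in bh_nodes, which is exactly what the ReorderBuffer patch needs when a transaction's size changes.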

v11-0001-Make-binaryheap-enlargeable.patch (application/octet-stream)
From 530c313dbccb7e219351f4adb5d4954dc4efb6f2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 17:12:20 +0900
Subject: [PATCH v11 1/3] Make binaryheap enlargeable.

The node array space of the binaryheap is doubled when there is no
available space.

Reviewed-by: Hayato Kuroda, Vignesh C, Ajin Cherian, Tomas Vondra,
Shubham Khanna, Peter Smith
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/common/binaryheap.c      | 37 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  2 +-
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..843e764bb6 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -38,17 +38,16 @@ static void sift_up(binaryheap *heap, int node_off);
 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);
 
 	return heap;
 }
@@ -74,6 +73,7 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	pfree(heap->bh_nodes);
 	pfree(heap);
 }
 
@@ -104,6 +104,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Double the space allocated for nodes.
+ */
+static void
+enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +126,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..1439f20803 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,7 +46,7 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int capacity,
-- 
2.39.3
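
To illustrate the enlargeable heap from the 0001 patch above, here is a minimal sketch (hypothetical backend code; demo_int_cmp and demo_enlarge are made-up names, and it assumes only 0001 is applied, where binaryheap_allocate() still takes three arguments):

```c
#include "postgres.h"

#include "lib/binaryheap.h"

static int
demo_int_cmp(Datum a, Datum b, void *arg)
{
	int32		ia = DatumGetInt32(a);
	int32		ib = DatumGetInt32(b);

	return (ia > ib) - (ia < ib);	/* the root holds the maximum */
}

static void
demo_enlarge(void)
{
	binaryheap *heap = binaryheap_allocate(4, demo_int_cmp, NULL);

	/*
	 * 16 adds against an initial capacity of 4: the node array is
	 * repalloc'd from 4 to 8 to 16 slots instead of failing with
	 * "out of binary heap slots".
	 */
	for (int32 i = 0; i < 16; i++)
		binaryheap_add(heap, Int32GetDatum(i));

	Assert(DatumGetInt32(binaryheap_first(heap)) == 15);

	binaryheap_free(heap);
}
```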

v11-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch (application/octet-stream)
From 8712dde0c09e0bef24936816bdf857600f3553cf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v11 3/3] Improve eviction algorithm in Reorderbuffer using
 max-heap for many subtransactions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, when selecting the transaction to evict during logical
decoding, we checked all transactions to find the largest one. This
could lead to significant replication lag, especially when there are
many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer by using
a max-heap with transaction size as the key to efficiently find the
largest transaction.

The max-heap starts out empty. While it is empty, we don't touch it
when updating the memory counter, and we find the largest transaction
in O(N) time, where N is the number of transactions, including
top-level transactions and subtransactions.

We build the max-heap just before selecting the largest transaction
if the number of transactions being decoded exceeds the threshold,
MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
keep it up to date when updating the memory counter. The intention is
to find the largest transaction in O(1) time, at the cost of O(log N)
memory counter updates. Once the number of transactions falls below
the threshold, we reset the max-heap.

The performance benchmark results showed a significant speed-up (more
than 30x on my machine) in decoding a transaction with 100k
subtransactions, with no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Álvaro Herrera, Euler Taveira, Peter
Smith
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 235 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |   4 +
 2 files changed, 212 insertions(+), 27 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 07eebedbac..f7453cda12 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,21 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. While the max-heap is empty, we don't touch
+ *	  it when updating the memory counter, and we find the largest
+ *	  transaction in O(N) time, where N is the number of transactions,
+ *	  including top-level transactions and subtransactions.
+ *
+ *	  We build the max-heap just before selecting the largest transaction
+ *	  if the number of transactions being decoded exceeds the threshold,
+ *	  MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
+ *	  keep it up to date when updating the memory counter. The intention is
+ *	  to find the largest transaction in O(1) time, at the cost of O(log N)
+ *	  memory counter updates. Once the number of transactions falls below
+ *	  the threshold, we reset the max-heap (refer to
+ *	  ReorderBufferMaybeResetMaxHeap() for details).
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -107,6 +122,22 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * Threshold of the total number of top-level and sub transactions that
+ * controls whether we use the max-heap for tracking their sizes. Although
+ * using the max-heap to select the largest transaction is effective when
+ * there are many transactions being decoded, maintaining the max-heap while
+ * updating the memory statistics can be costly. Therefore, we use
+ * MaxConnections as the threshold so that the max-heap is used only when
+ * subtransactions are in use.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD	MaxConnections
+
+/*
+ * A macro to check if the max-heap is ready to use and needs to be updated
+ * accordingly.
+ */
+#define ReorderBufferMaxHeapIsReady(rb) !binaryheap_empty((rb)->txn_heap)
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -259,6 +290,9 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
+static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
+static void ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb);
+static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -293,6 +327,7 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
+											ReorderBufferTXN *txn,
 											bool addition, Size sz);
 
 /*
@@ -355,6 +390,17 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * The binaryheap is indexed for faster manipulations.
+	 *
+	 * We allocate an initial heap size greater than
+	 * MAX_HEAP_TXN_COUNT_THRESHOLD because the txn_heap will not be used
+	 * until the threshold is exceeded.
+	 */
+	buffer->txn_heap = binaryheap_allocate(MAX_HEAP_TXN_COUNT_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -485,7 +531,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 {
 	/* update memory accounting info */
 	if (upd_mem)
-		ReorderBufferChangeMemoryUpdate(rb, change, false,
+		ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
 										ReorderBufferChangeSize(change));
 
 	/* free contained data */
@@ -816,7 +862,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries_mem++;
 
 	/* update memory accounting information */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 
 	/* process partial change */
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1586,8 +1632,17 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	/*
+	 * After cleaning up one transaction, the number of transactions might
+	 * fall below the threshold for the max-heap.
+	 */
+	ReorderBufferMaybeResetMaxHeap(rb);
 }
 
 /*
@@ -1637,9 +1692,12 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/*
 	 * Mark the transaction as streamed.
 	 *
@@ -3166,6 +3224,9 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
  * decide if we reached the memory limit, the transaction counter allows
  * us to quickly pick the largest transaction for eviction.
  *
+ * At least one of txn or change must be non-NULL. We update the memory
+ * counter of txn if it's non-NULL, otherwise that of change->txn.
+ *
  * When streaming is enabled, we need to update the toplevel transaction
  * counters instead - we don't really care about subtransactions as we
  * can't stream them individually anyway, and we only pick toplevel
@@ -3174,22 +3235,27 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
+								ReorderBufferTXN *txn,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
 	ReorderBufferTXN *toptxn;
 
-	Assert(change->txn);
+	Assert(txn || change);
 
 	/*
 	 * Ignore tuple CID changes, because those are not evicted when reaching
 	 * memory limit. So we just don't count them, because it might easily
 	 * trigger a pointless attempt to spill.
 	 */
-	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+		return;
+
+	if (sz == 0)
 		return;
 
-	txn = change->txn;
+	if (txn == NULL)
+		txn = change->txn;
+	Assert(txn != NULL);
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3204,6 +3270,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3213,6 +3288,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3468,34 +3552,121 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 	}
 }
 
+
+/* Compare two transactions by size */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
+
 /*
- * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
+ * Build the max-heap. The heap assembly step is deferred until the end, for
+ * efficiency.
  */
-static ReorderBufferTXN *
-ReorderBufferLargestTXN(ReorderBuffer *rb)
+static void
+ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
 {
 	HASH_SEQ_STATUS hash_seq;
 	ReorderBufferTXNByIdEnt *ent;
-	ReorderBufferTXN *largest = NULL;
+
+	Assert(binaryheap_empty(rb->txn_heap));
 
 	hash_seq_init(&hash_seq, rb->by_txn);
 	while ((ent = hash_seq_search(&hash_seq)) != NULL)
 	{
 		ReorderBufferTXN *txn = ent->txn;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		if (txn->size == 0)
+			continue;
+
+		binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
 	}
 
+	binaryheap_build(rb->txn_heap);
+}
+
+/*
+ * Reset the max-heap if the number of transactions has fallen below the
+ * threshold.
+ */
+static void
+ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb)
+{
+	/*
+	 * If we add and remove transactions right around the threshold, we could
+	 * easily end up "thrashing". To avoid that, we reset the max-heap only
+	 * once the number of transactions falls below 90% of the threshold.
+	 */
+	if (ReorderBufferMaxHeapIsReady(rb) &&
+		binaryheap_size(rb->txn_heap) < MAX_HEAP_TXN_COUNT_THRESHOLD * 0.9)
+		binaryheap_reset(rb->txn_heap);
+}
+
+/*
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ * We use a different way to find the largest transaction depending on the
+ * memory tracking state and the number of transactions being decoded. Refer
+ * to the comments atop this file for the algorithm details.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	ReorderBufferTXN *largest = NULL;
+
+	if (!ReorderBufferMaxHeapIsReady(rb))
+	{
+		/*
+		 * If the number of transactions is small, we scan all transactions
+		 * being decoded to get the largest transaction. This saves the cost
+		 * of building a max-heap with a small number of transactions.
+		 */
+		if (hash_get_num_entries(rb->by_txn) < MAX_HEAP_TXN_COUNT_THRESHOLD)
+		{
+			HASH_SEQ_STATUS hash_seq;
+			ReorderBufferTXNByIdEnt *ent;
+
+			hash_seq_init(&hash_seq, rb->by_txn);
+			while ((ent = hash_seq_search(&hash_seq)) != NULL)
+			{
+				ReorderBufferTXN *txn = ent->txn;
+
+				/* if the current transaction is larger, remember it */
+				if ((!largest) || (txn->size > largest->size))
+					largest = txn;
+			}
+		}
+		else
+		{
+			/*
+			 * There are a large number of transactions in the ReorderBuffer.
+			 * We build the max-heap to efficiently select the largest
+			 * transaction.
+			 */
+			ReorderBufferBuildMaxHeap(rb);
+
+			/*
+			 * The max-heap is ready now. We remain in this state at least
+			 * until we have evicted enough transactions to bring the total memory
+			 * usage below the limit. The largest transaction is selected
+			 * below.
+			 */
+			Assert(ReorderBufferMaxHeapIsReady(rb));
+		}
+	}
+
+	/* Get the largest transaction from the max-heap */
+	if (ReorderBufferMaxHeapIsReady(rb))
+		largest = (ReorderBufferTXN *)
+			DatumGetPointer(binaryheap_first(rb->txn_heap));
+
 	Assert(largest);
 	Assert(largest->size > 0);
 	Assert(largest->size <= rb->size);
@@ -3638,6 +3809,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
+
+	/*
+	 * After evicting some transactions, the number of transactions might
+	 * fall below the threshold for the max-heap.
+	 */
+	ReorderBufferMaybeResetMaxHeap(rb);
+
 }
 
 /*
@@ -3705,11 +3883,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, size);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4491,7 +4672,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * update the accounting too (subtracting the size from the counters). And
 	 * we don't want to underflow there.
 	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -4903,9 +5084,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContextSwitchTo(oldcontext);
 
 	/* subtract the old change size */
-	ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, false, old_size);
 	/* now add the change back, with the correct size */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..a5aec01c2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -631,6 +632,9 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	binaryheap *txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3
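
The build/reset hysteresis in the 0003 patch above can be shown with a small standalone sketch. This is hypothetical illustration code, not the patch itself: the real implementation builds the heap lazily inside ReorderBufferLargestTXN() only once the memory limit is hit, and keys the reset off binaryheap_size() rather than a raw transaction count.

```c
#include <stdbool.h>
#include <stdio.h>

#define THRESHOLD 100			/* stands in for MaxConnections */

static bool heap_ready = false;

static void
track_txn_count(int ntxns)
{
	if (!heap_ready && ntxns >= THRESHOLD)
	{
		heap_ready = true;		/* one O(N) binaryheap_build() */
		printf("build heap at %d txns\n", ntxns);
	}
	else if (heap_ready && ntxns < THRESHOLD * 0.9)
	{
		heap_ready = false;		/* binaryheap_reset() */
		printf("reset heap at %d txns\n", ntxns);
	}
}

int
main(void)
{
	/* oscillating right at the threshold causes one build, one reset */
	int			counts[] = {99, 100, 99, 100, 95, 89, 99};

	for (int i = 0; i < 7; i++)
		track_txn_count(counts[i]);

	return 0;
}
```

Without the 10% headroom, the 100 -> 99 -> 100 sequence above would rebuild and reset the heap on every step; with it, the heap is built once and reset only at 89.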

#63 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#61)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Mar 29, 2024 at 8:48 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Sawada-san,

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Thanks for working on this. Agreed it is committable.
Few minor comments:

Thank you for the comments!

```
+ * Either txn or change must be non-NULL at least. We update the memory
+ * counter of txn if it's non-NULL, otherwise change->txn.
```

IIUC no one checks the restriction. Should we add Assert() for it, e.g.:
Assert(txn || change)?

Agreed to add it.

```
+    /* make sure enough space for a new node */
...
+    /* make sure enough space for a new node */
```

Should these start with upper case?

I don't think we need to change it. There are other comments in the
same file that are one line and start with lowercase.

I've just submitted the updated patches[1].

Regards,

[1]: /messages/by-id/CAD21AoA6=+tL=btB_s9N+cZK7tKz1W=PQyNq72nzjUcdyE+wZw@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#64 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#62)
3 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Mon, Apr 1, 2024 at 11:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 29, 2024 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached new version patches.

Since the previous patch conflicts with the current HEAD, I've
attached the rebased patches.

Thanks for the updated patch.
One comment:
I felt we could also mention in the commit message the improvement
where we update the memory accounting info at the transaction level
instead of at the per-change level, which is done in
ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
ReorderBufferSerializeTXN:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Few minor comments:
1.
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
Assert(txn->nentries_mem == 0);
}

+ ReorderBufferMaybeResetMaxHeap(rb);
+

Can we write a comment about why this reset is required here?
Otherwise, the reason is not apparent.

Yes, added.

2.
Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent switching to the state unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections

Isn't using max-heap equally effective in finding the largest
transaction whether there are top-level or top-level plus
subtransactions? This comment indicates it is only effective when
there are subtransactions.

You're right. Updated the comment.

I've attached the updated patches.

While reviewing the patches, I realized the comment of
binaryheap_allocate() should also be updated. So I've attached the
new patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v12-0001-Make-binaryheap-enlargeable.patch (application/octet-stream)
From e18ba536e96dda91020bf8ab1a543b5401a88ead Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.org>
Date: Mon, 1 Apr 2024 12:21:21 +0900
Subject: [PATCH v12 1/3] Make binaryheap enlargeable.

The node array space of the binaryheap is doubled when there is no
available space.

Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian,
Tomas Vondra, Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/common/binaryheap.c      | 49 +++++++++++++++++++-----------------
 src/include/lib/binaryheap.h |  4 +--
 2 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 7377ebdf15..2ffd656e87 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -30,25 +30,24 @@ static void sift_up(binaryheap *heap, int node_off);
 /*
  * binaryheap_allocate
  *
- * Returns a pointer to a newly-allocated heap that has the capacity to
- * store the given number of nodes, with the heap property defined by
- * the given comparator function, which will be invoked with the additional
- * argument specified by 'arg'.
+ * Returns a pointer to a newly-allocated heap with the given initial number
+ * of nodes, and with the heap property defined by the given comparator
+ * function, which will be invoked with the additional argument specified by
+ * 'arg'.
  */
 binaryheap *
-binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int num_nodes, binaryheap_comparator compare, void *arg)
 {
-	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-	heap = (binaryheap *) palloc(sz);
-	heap->bh_space = capacity;
+	heap = (binaryheap *) palloc(sizeof(binaryheap));
+	heap->bh_space = num_nodes;
 	heap->bh_compare = compare;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * num_nodes);
 
 	return heap;
 }
@@ -74,6 +73,7 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	pfree(heap->bh_nodes);
 	pfree(heap);
 }
 
@@ -104,6 +104,17 @@ parent_offset(int i)
 	return (i - 1) / 2;
 }
 
+/*
+ * Double the space allocated for nodes.
+ */
+static void
+enlarge_node_array(binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -115,14 +126,10 @@ parent_offset(int i)
 void
 binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -153,14 +160,10 @@ binaryheap_build(binaryheap *heap)
 void
 binaryheap_add(binaryheap *heap, bh_node_type d)
 {
+	/* make sure enough space for a new node */
 	if (heap->bh_size >= heap->bh_space)
-	{
-#ifdef FRONTEND
-		pg_fatal("out of binary heap slots");
-#else
-		elog(ERROR, "out of binary heap slots");
-#endif
-	}
+		enlarge_node_array(heap);
+
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 19025c08ef..9f6efb06e3 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -46,10 +46,10 @@ typedef struct binaryheap
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
-	bh_node_type bh_nodes[FLEXIBLE_ARRAY_MEMBER];
+	bh_node_type *bh_nodes;
 } binaryheap;
 
-extern binaryheap *binaryheap_allocate(int capacity,
+extern binaryheap *binaryheap_allocate(int num_nodes,
 									   binaryheap_comparator compare,
 									   void *arg);
 extern void binaryheap_reset(binaryheap *heap);
-- 
2.39.3

v12-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch (application/octet-stream)
From 6419205c7bc4c4e164965e8e41ff261a48631b4f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:20:23 +0900
Subject: [PATCH v12 2/3] Add functions to binaryheap for efficient key removal
 and update.

Previously, binaryheap didn't support updating a key or removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller had to pass the node's position within the
array that the binaryheap maintains internally. Removing a node from
the binaryheap is done in O(log n), but searching for the key's
position is done in O(n).

This commit adds a hash table to binaryheap in order to track the
position of each node in the binaryheap. That way, by using the newly
added functions such as binaryheap_update_up(), both updating a key
and removing a node can be done in O(1) on average and O(log n) in
the worst case. This is known as an indexed binary heap. The caller
can request an indexed binaryheap by passing indexed = true.

The current code does not use the new indexing logic, but it will be
used by an upcoming patch.

Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian,
Tomas Vondra, Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
---
 src/backend/executor/nodeGatherMerge.c        |   1 +
 src/backend/executor/nodeMergeAppend.c        |   2 +-
 src/backend/postmaster/pgarch.c               |   3 +-
 .../replication/logical/reorderbuffer.c       |   1 +
 src/backend/storage/buffer/bufmgr.c           |   1 +
 src/bin/pg_dump/pg_backup_archiver.c          |   1 +
 src/bin/pg_dump/pg_dump_sort.c                |   2 +-
 src/common/binaryheap.c                       | 198 +++++++++++++++++-
 src/include/lib/binaryheap.h                  |  36 +++-
 src/tools/pgindent/typedefs.list              |   1 +
 10 files changed, 232 insertions(+), 14 deletions(-)

diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 45f6017c29..ce19e0837a 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -422,6 +422,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
+											false,
 											gm_state);
 }
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e1b9b984a7..3efebd537f 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,7 +125,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
 											  mergestate);
 
 	/*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index c266904b57..2b4e5a623c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -254,7 +254,8 @@ PgArchiverMain(char *startup_data, size_t startup_data_len)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, NULL);
+												ready_file_comparator, false,
+												NULL);
 
 	/* Load the archive_library. */
 	LoadArchiveLibrary();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 92cf39ff74..07eebedbac 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1294,6 +1294,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
+									  false,
 									  state);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..eee5021197 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2724,6 +2724,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
+								  false,
 								  NULL);
 
 	for (i = 0; i < num_spaces; i++)
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..6587a7b081 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4033,6 +4033,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
+									 false,
 									 NULL);
 
 	/*
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 4cb754caa5..7362f7c961 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index 2ffd656e87..4039cb4ddc 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -22,8 +22,30 @@
 #ifdef FRONTEND
 #include "common/logging.h"
 #endif
+#include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Define parameters for hash table code generation. The interface is *also*
+ * declared in binaryheap.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -34,9 +56,14 @@ static void sift_up(binaryheap *heap, int node_off);
  * of nodes, and with the heap property defined by the given comparator
  * function, which will be invoked with the additional argument specified by
  * 'arg'.
+ *
+ * If 'indexed' is true, we create a hash table to track each node's
+ * index in the heap, enabling operations such as
+ * binaryheap_remove_node_ptr() to be performed.
  */
 binaryheap *
-binaryheap_allocate(int num_nodes, binaryheap_comparator compare, void *arg)
+binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
+					bool indexed, void *arg)
 {
 	binaryheap *heap;
 
@@ -48,6 +75,17 @@ binaryheap_allocate(int num_nodes, binaryheap_comparator compare, void *arg)
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * num_nodes);
+	heap->bh_nodeidx = NULL;
+
+	if (indexed)
+	{
+#ifdef FRONTEND
+		heap->bh_nodeidx = bh_nodeidx_create(num_nodes, NULL);
+#else
+		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, num_nodes,
+											 NULL);
+#endif
+	}
 
 	return heap;
 }
@@ -63,6 +101,9 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
+
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -73,6 +114,9 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_destroy(heap->bh_nodeidx);
+
 	pfree(heap->bh_nodes);
 	pfree(heap);
 }
@@ -115,6 +159,67 @@ enlarge_node_array(binaryheap *heap)
 							  sizeof(bh_node_type) * heap->bh_space);
 }
 
+/*
+ * Set the given node at 'index' and track it if required.
+ *
+ * Return true if the node's index was already tracked.
+ */
+static bool
+set_node(binaryheap *heap, bh_node_type node, int index)
+{
+	bool		found = false;
+
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	if (binaryheap_indexed(heap))
+	{
+		bh_nodeidx_entry *ent;
+
+		/* Keep track of the node index */
+		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
+		ent->index = index;
+	}
+
+	return found;
+}
+
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static inline void
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (binaryheap_indexed(heap))
+		bh_nodeidx_delete(heap->bh_nodeidx, node);
+}
+
+/*
+ * Replace the existing node at 'index' with the given 'new_node'. Also
+ * update their positions accordingly. Note that we assume the new_node's
+ * position is already tracked if enabled, i.e. the new_node is already
+ * present in the heap.
+ */
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+	bool		found PG_USED_FOR_ASSERTS_ONLY;
+
+	/* Quick return if not necessary to move */
+	if (heap->bh_nodes[index] == new_node)
+		return;
+
+	/* Remove the overwritten node's index */
+	delete_nodeidx(heap, heap->bh_nodes[index]);
+
+	 * Replace it with the given new node. The new node's position must
+	 * already be tracked, as we assume it is already present in the heap.
+	 * tracked as we assume to replace the node with the existing node.
+	 */
+	found = set_node(heap, new_node, index);
+	Assert(!binaryheap_indexed(heap) || found);
+}
+
 /*
  * binaryheap_add_unordered
  *
@@ -131,7 +236,7 @@ binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 		enlarge_node_array(heap);
 
 	heap->bh_has_heap_property = false;
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 }
 
@@ -164,7 +269,7 @@ binaryheap_add(binaryheap *heap, bh_node_type d)
 	if (heap->bh_size >= heap->bh_space)
 		enlarge_node_array(heap);
 
-	heap->bh_nodes[heap->bh_size] = d;
+	set_node(heap, d, heap->bh_size);
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
 }
@@ -205,6 +310,8 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
+		delete_nodeidx(heap, result);
+
 		return result;
 	}
 
@@ -212,7 +319,7 @@ binaryheap_remove_first(binaryheap *heap)
 	 * Remove the last node, placing it in the vacated root entry, and sift
 	 * the new root node down to its correct position.
 	 */
-	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
 	sift_down(heap, 0);
 
 	return result;
@@ -238,7 +345,7 @@ binaryheap_remove_node(binaryheap *heap, int n)
 						   heap->bh_arg);
 
 	/* remove the last node, placing it in the vacated entry */
-	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
 
 	/* sift as needed to preserve the heap property */
 	if (cmp > 0)
@@ -247,6 +354,77 @@ binaryheap_remove_node(binaryheap *heap, int n)
 		sift_down(heap, n);
 }
 
+/*
+ * binaryheap_remove_node_ptr
+ *
+ * Similar to binaryheap_remove_node() but removes the given node. The caller
+ * must ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
+	Assert(ent);
+
+	binaryheap_remove_node(heap, ent->index);
+}
+
+/*
+ * Workhorse for binaryheap_update_up and binaryheap_update_down.
+ */
+static void
+resift_node(binaryheap *heap, bh_node_type node, bool sift_dir_up)
+{
+	bh_nodeidx_entry *ent;
+
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(binaryheap_indexed(heap));
+
+	ent = bh_nodeidx_lookup(heap->bh_nodeidx, node);
+	Assert(ent);
+	Assert(ent->index >= 0 && ent->index < heap->bh_size);
+
+	if (sift_dir_up)
+		sift_up(heap, ent->index);
+	else
+		sift_down(heap, ent->index);
+}
+
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_up(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, true);
+}
+
+/*
+ * binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+binaryheap_update_down(binaryheap *heap, bh_node_type d)
+{
+	resift_node(heap, d, false);
+}
+
 /*
  * binaryheap_replace_first
  *
@@ -259,7 +437,7 @@ binaryheap_replace_first(binaryheap *heap, bh_node_type d)
 {
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
 
-	heap->bh_nodes[0] = d;
+	replace_node(heap, 0, d);
 
 	if (heap->bh_size > 1)
 		sift_down(heap, 0);
@@ -301,11 +479,11 @@ sift_up(binaryheap *heap, int node_off)
 		 * Otherwise, swap the parent value with the hole, and go on to check
 		 * the node's new parent.
 		 */
-		heap->bh_nodes[node_off] = parent_val;
+		set_node(heap, parent_val, node_off);
 		node_off = parent_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
 
 /*
@@ -360,9 +538,9 @@ sift_down(binaryheap *heap, int node_off)
 		 * Otherwise, swap the hole with the child that violates the heap
 		 * property; then go on to check its children.
 		 */
-		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		set_node(heap, heap->bh_nodes[swap_off], node_off);
 		node_off = swap_off;
 	}
 	/* Re-fill the hole */
-	heap->bh_nodes[node_off] = node_val;
+	set_node(heap, node_val, node_off);
 }
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 9f6efb06e3..4c1a1bb274 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,29 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;			/* entry's index within the node array */
+	char		status;			/* hash status */
+	uint32		hash;			/* hash value (cached) */
+} bh_nodeidx_entry;
+
+/* Define parameters necessary to generate the hash table interface. */
+#define SH_PREFIX bh_nodeidx
+#define SH_ELEMENT_TYPE bh_nodeidx_entry
+#define SH_KEY_TYPE bh_node_type
+#define SH_SCOPE extern
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /*
  * binaryheap
  *
@@ -47,11 +70,18 @@ typedef struct binaryheap
 	binaryheap_comparator bh_compare;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
+
+	/*
+	 * If bh_nodeidx is not NULL, it is used to track each node's index in
+	 * bh_nodes. This enables the caller to perform
+	 * binaryheap_remove_node_ptr() and binaryheap_update_up/down in O(log n).
+	 */
+	bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int num_nodes,
 									   binaryheap_comparator compare,
-									   void *arg);
+									   bool indexed, void *arg);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -60,10 +90,14 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
+extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
 #define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
+#define binaryheap_indexed(h)		((h)->bh_nodeidx != NULL)
 
 #endif							/* BINARYHEAP_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8d7bed411..0c3a3aaccc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4084,3 +4084,4 @@ TidStoreIter
 TidStoreIterResult
 BlocktableEntry
 ItemArray
+bh_nodeidx_entry
-- 
2.39.3
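
ReorderBufferBuildMaxHeap() in the 0003 patch below uses the bulk-build pattern of this API: insert everything with binaryheap_add_unordered(), then heapify once. A minimal sketch of that pattern (hypothetical backend code assuming the 0002 patch above is applied; demo_bulk_cmp and demo_bulk_build are made-up names):

```c
#include "postgres.h"

#include "lib/binaryheap.h"

static int
demo_bulk_cmp(Datum a, Datum b, void *arg)
{
	int32		ia = DatumGetInt32(a);
	int32		ib = DatumGetInt32(b);

	return (ia > ib) - (ia < ib);
}

static void
demo_bulk_build(const int32 *vals, int nvals)
{
	binaryheap *heap = binaryheap_allocate(nvals, demo_bulk_cmp,
										   false, NULL);

	/* no sifting per insert; each add is O(1) amortized */
	for (int i = 0; i < nvals; i++)
		binaryheap_add_unordered(heap, Int32GetDatum(vals[i]));

	/* a single O(N) pass establishes the heap property */
	binaryheap_build(heap);

	/* the heap is now usable; the root compares highest */
	(void) binaryheap_first(heap);

	binaryheap_free(heap);
}
```

Deferring the heapify step is what keeps the one-off heap construction in ReorderBufferLargestTXN() cheap enough to do on demand.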

v12-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch (application/octet-stream)
From 8aa0184049e3ddf7e5ded751c443f4b16eb50c55 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 26 Jan 2024 11:31:41 +0900
Subject: [PATCH v12 3/3] Improve eviction algorithm in Reorderbuffer using
 max-heap for many subtransactions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously, when selecting the transaction to evict during logical
decoding, we checked all transactions to find the largest one. This
could lead to significant replication lag, especially when there are
many subtransactions.

This commit improves the eviction algorithm in ReorderBuffer by using
a max-heap with transaction size as the key to efficiently find the
largest transaction.

The max-heap starts out empty. While it is empty, we don't touch it
when updating the memory counter, and we find the largest transaction
in O(N) time, where N is the number of transactions, including
top-level transactions and subtransactions.

We build the max-heap just before selecting the largest transaction
if the number of transactions being decoded exceeds the threshold,
MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
keep it up to date when updating the memory counter. The intention is
to find the largest transaction in O(1) time, at the cost of O(log N)
memory counter updates. Once the number of transactions falls below
the threshold, we reset the max-heap.

The performance benchmark results showed a significant speed-up (more
than 30x on my machine) in decoding a transaction with 100k
subtransactions, with no visible overhead in other cases.

Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Peter Smith, Álvaro Herrera,
Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
---
 .../replication/logical/reorderbuffer.c       | 235 ++++++++++++++++--
 src/include/replication/reorderbuffer.h       |   4 +
 2 files changed, 212 insertions(+), 27 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 07eebedbac..941ae310b0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -67,6 +67,21 @@
  *	  allocator, evicting the oldest changes would make it more likely the
  *	  memory gets actually freed.
  *
+ *	  We use a max-heap with transaction size as the key to efficiently find
+ *	  the largest transaction. While the max-heap is empty, we don't touch
+ *	  it when updating the memory counter, and we find the largest
+ *	  transaction in O(N) time, where N is the number of transactions,
+ *	  including top-level transactions and subtransactions.
+ *
+ *	  We build the max-heap just before selecting the largest transaction
+ *	  if the number of transactions being decoded exceeds the threshold,
+ *	  MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
+ *	  keep it up to date when updating the memory counter. The intention is
+ *	  to find the largest transaction in O(1) time, at the cost of O(log N)
+ *	  memory counter updates. Once the number of transactions falls below
+ *	  the threshold, we reset the max-heap (refer to
+ *	  ReorderBufferMaybeResetMaxHeap() for details).
+ *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
  *	  as we load the subxacts independently. One option to deal with this
@@ -107,6 +122,22 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
+/*
+ * Threshold of the total number of top-level and sub transactions that
+ * controls whether we use the max-heap for tracking their sizes. Although
+ * using the max-heap to select the largest transaction is effective when
+ * there are many transactions being decoded, maintaining the max-heap while
+ * updating the memory statistics can be costly. Therefore, we use
+ * MaxConnections as the threshold so that the max-heap is used only when
+ * subtransactions are in use.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD	MaxConnections
+
+/*
+ * A macro to check if the max-heap is ready to use and needs to be updated
+ * accordingly.
+ */
+#define ReorderBufferMaxHeapIsReady(rb) !binaryheap_empty((rb)->txn_heap)
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -259,6 +290,9 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
+static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
+static void ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb);
+static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -293,6 +327,7 @@ static void ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *t
 static Size ReorderBufferChangeSize(ReorderBufferChange *change);
 static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 											ReorderBufferChange *change,
+											ReorderBufferTXN *txn,
 											bool addition, Size sz);
 
 /*
@@ -355,6 +390,17 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
+	/*
+	 * The binaryheap is indexed for faster manipulations.
+	 *
+	 * We allocate an initial heap size greater than
+	 * MAX_HEAP_TXN_COUNT_THRESHOLD because the txn_heap will not be used
+	 * until the threshold is exceeded.
+	 */
+	buffer->txn_heap = binaryheap_allocate(MAX_HEAP_TXN_COUNT_THRESHOLD * 2,
+										   ReorderBufferTXNSizeCompare,
+										   true, NULL);
+
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
 	buffer->spillBytes = 0;
@@ -485,7 +531,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 {
 	/* update memory accounting info */
 	if (upd_mem)
-		ReorderBufferChangeMemoryUpdate(rb, change, false,
+		ReorderBufferChangeMemoryUpdate(rb, change, NULL, false,
 										ReorderBufferChangeSize(change));
 
 	/* free contained data */
@@ -816,7 +862,7 @@ ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
 	txn->nentries_mem++;
 
 	/* update memory accounting information */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 
 	/* process partial change */
@@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		/* Check we're not mixing changes from different transactions. */
 		Assert(change->txn == txn);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
 	/*
@@ -1586,8 +1632,17 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	if (rbtxn_is_serialized(txn))
 		ReorderBufferRestoreCleanup(rb, txn);
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
+
+	/*
+	 * After cleaning up one transaction, the number of transactions might
+	 * fall below the threshold for the max-heap.
+	 */
+	ReorderBufferMaybeResetMaxHeap(rb);
 }
 
 /*
@@ -1637,9 +1692,12 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		/* remove the change from it's containing list */
 		dlist_delete(&change->node);
 
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
+
 	/*
 	 * Mark the transaction as streamed.
 	 *
@@ -3166,6 +3224,9 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
  * decide if we reached the memory limit, the transaction counter allows
  * us to quickly pick the largest transaction for eviction.
  *
+ * At least one of txn or change must be non-NULL. We update the memory
+ * counter of txn if it's non-NULL, otherwise that of change->txn.
+ *
  * When streaming is enabled, we need to update the toplevel transaction
  * counters instead - we don't really care about subtransactions as we
  * can't stream them individually anyway, and we only pick toplevel
@@ -3174,22 +3235,27 @@ ReorderBufferAddNewCommandId(ReorderBuffer *rb, TransactionId xid,
 static void
 ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 								ReorderBufferChange *change,
+								ReorderBufferTXN *txn,
 								bool addition, Size sz)
 {
-	ReorderBufferTXN *txn;
 	ReorderBufferTXN *toptxn;
 
-	Assert(change->txn);
+	Assert(txn || change);
 
 	/*
 	 * Ignore tuple CID changes, because those are not evicted when reaching
 	 * memory limit. So we just don't count them, because it might easily
 	 * trigger a pointless attempt to spill.
 	 */
-	if (change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
+	if (change && change->action == REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID)
 		return;
 
-	txn = change->txn;
+	if (sz == 0)
+		return;
+
+	if (txn == NULL)
+		txn = change->txn;
+	Assert(txn != NULL);
 
 	/*
 	 * Update the total size in top level as well. This is later used to
@@ -3204,6 +3270,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if ((txn->size - sz) == 0)
+				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 	else
 	{
@@ -3213,6 +3288,15 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
+
+		/* Update the max-heap as well if necessary */
+		if (ReorderBufferMaxHeapIsReady(rb))
+		{
+			if (txn->size == 0)
+				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+			else
+				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+		}
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3468,34 +3552,121 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 	}
 }
 
+
+/* Compare two transactions by size */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+{
+	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
+	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);
+
+	if (ta->size < tb->size)
+		return -1;
+	if (ta->size > tb->size)
+		return 1;
+	return 0;
+}
+
 /*
- * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
- *
- * XXX With many subtransactions this might be quite slow, because we'll have
- * to walk through all of them. There are some options how we could improve
- * that: (a) maintain some secondary structure with transactions sorted by
- * amount of changes, (b) not looking for the entirely largest transaction,
- * but e.g. for transaction using at least some fraction of the memory limit,
- * and (c) evicting multiple transactions at once, e.g. to free a given portion
- * of the memory limit (e.g. 50%).
+ * Build the max-heap. The heap assembly step is deferred until the end, for
+ * efficiency.
  */
-static ReorderBufferTXN *
-ReorderBufferLargestTXN(ReorderBuffer *rb)
+static void
+ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
 {
 	HASH_SEQ_STATUS hash_seq;
 	ReorderBufferTXNByIdEnt *ent;
-	ReorderBufferTXN *largest = NULL;
+
+	Assert(binaryheap_empty(rb->txn_heap));
 
 	hash_seq_init(&hash_seq, rb->by_txn);
 	while ((ent = hash_seq_search(&hash_seq)) != NULL)
 	{
 		ReorderBufferTXN *txn = ent->txn;
 
-		/* if the current transaction is larger, remember it */
-		if ((!largest) || (txn->size > largest->size))
-			largest = txn;
+		if (txn->size == 0)
+			continue;
+
+		binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+	}
+
+	binaryheap_build(rb->txn_heap);
+}
+
+/*
+ * Reset the max-heap if the number of transactions got lower than the
+ * threshold.
+ */
+static void
+ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb)
+{
+	/*
+	 * If we add and remove transactions right around the threshold, we could
+	 * easily end up "thrashing". To avoid that, we allow 10% of slack: the
+	 * max-heap is reset only once its size falls below 90% of the threshold.
+	 */
+	if (ReorderBufferMaxHeapIsReady(rb) &&
+		binaryheap_size(rb->txn_heap) < MAX_HEAP_TXN_COUNT_THRESHOLD * 0.9)
+		binaryheap_reset(rb->txn_heap);
+}
+
+/*
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk)
+ * by doing a linear search or using the max-heap depending on the number of
+ * transactions in ReorderBuffer. Refer to the comments atop this file for the
+ * algorithm details.
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)
+{
+	ReorderBufferTXN *largest = NULL;
+
+	if (!ReorderBufferMaxHeapIsReady(rb))
+	{
+		/*
+		 * If the number of transactions is small, we scan all transactions
+		 * being decoded to get the largest transaction. This saves the cost
+		 * of building a max-heap with a small number of transactions.
+		 */
+		if (hash_get_num_entries(rb->by_txn) < MAX_HEAP_TXN_COUNT_THRESHOLD)
+		{
+			HASH_SEQ_STATUS hash_seq;
+			ReorderBufferTXNByIdEnt *ent;
+
+			hash_seq_init(&hash_seq, rb->by_txn);
+			while ((ent = hash_seq_search(&hash_seq)) != NULL)
+			{
+				ReorderBufferTXN *txn = ent->txn;
+
+				/* if the current transaction is larger, remember it */
+				if ((!largest) || (txn->size > largest->size))
+					largest = txn;
+			}
+		}
+		else
+		{
+			/*
+			 * There are a large number of transactions in ReorderBuffer. We
+			 * build the max-heap for efficiently selecting the largest
+			 * transactions.
+			 */
+			ReorderBufferBuildMaxHeap(rb);
+
+			/*
+			 * The max-heap is ready now. We retain the max-heap at least
+			 * until we free up enough transactions to bring the total memory
+			 * usage below the limit. The largest transaction is selected
+			 * below.
+			 */
+			Assert(ReorderBufferMaxHeapIsReady(rb));
+		}
 	}
 
+	/* Get the largest transaction from the max-heap */
+	if (ReorderBufferMaxHeapIsReady(rb))
+		largest = (ReorderBufferTXN *)
+			DatumGetPointer(binaryheap_first(rb->txn_heap));
+
 	Assert(largest);
 	Assert(largest->size > 0);
 	Assert(largest->size <= rb->size);
@@ -3638,6 +3809,13 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
+
+	/*
+	 * After evicting some transactions, the number of transactions might get
+	 * lower than the threshold for the max-heap.
+	 */
+	ReorderBufferMaybeResetMaxHeap(rb);
+
 }
 
 /*
@@ -3705,11 +3883,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);
 
 		spilled++;
 	}
 
+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, size);
+
 	/* update the statistics iff we have spilled anything */
 	if (spilled)
 	{
@@ -4491,7 +4672,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 * update the accounting too (subtracting the size from the counters). And
 	 * we don't want to underflow there.
 	 */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
@@ -4903,9 +5084,9 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	MemoryContextSwitchTo(oldcontext);
 
 	/* subtract the old change size */
-	ReorderBufferChangeMemoryUpdate(rb, change, false, old_size);
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, false, old_size);
 	/* now add the change back, with the correct size */
-	ReorderBufferChangeMemoryUpdate(rb, change, true,
+	ReorderBufferChangeMemoryUpdate(rb, change, NULL, true,
 									ReorderBufferChangeSize(change));
 }
 
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0b2c95f7aa..a5aec01c2f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H
 
 #include "access/htup_details.h"
+#include "lib/binaryheap.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -631,6 +632,9 @@ struct ReorderBuffer
 	/* memory accounting */
 	Size		size;
 
+	/* Max-heap for sizes of all top-level and sub transactions */
+	binaryheap *txn_heap;
+
 	/*
 	 * Statistics about transactions spilled to disk.
 	 *
-- 
2.39.3

#65Jeff Davis
pgsql@j-davis.com
In reply to: Masahiko Sawada (#64)
Re: Improve eviction algorithm in ReorderBuffer

On Mon, 2024-04-01 at 12:42 +0900, Masahiko Sawada wrote:

While reviewing the patches, I realized the comment of
binearyheap_allocate() should also be updated. So I've attached the
new patches.

In sift_{up|down}, each loop iteration calls set_node(), and each call
to set_node does a hash lookup. I didn't measure it, but that feels
wasteful.

I don't even think you really need the hash table. The key to the hash
table is a pointer, so it's not really doing anything that couldn't be
done more efficiently by just following the pointer.

I suggest that you add a "heap_index" field to ReorderBufferTXN that
would point to the index into the heap's array (the same as
bh_nodeidx_entry.index in your patch). Each time an element moves
within the heap array, just follow the pointer to the ReorderBufferTXN
object and update the heap_index -- no hash lookup required.

That's not easy to do with the current binaryheap API. But a binary
heap is not a terribly complex structure, so you can just do an inline
implementation of it where sift_{up|down} know to update the heap_index
field of the ReorderBufferTXN.
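
For illustration, a minimal standalone sketch of that idea (the
heap_index field and the set_node() shape here are hypothetical, not
actual PostgreSQL code):

#include <stddef.h>

typedef struct SketchTXN
{
    size_t      size;           /* heap key: transaction size */
    int         heap_index;     /* back-pointer into the heap's node array */
} SketchTXN;

/*
 * Place txn at nodes[index], updating the back-pointer directly instead
 * of doing a hash-table lookup.
 */
static inline void
set_node(SketchTXN **nodes, SketchTXN *txn, int index)
{
    nodes[index] = txn;
    txn->heap_index = index;
}

Each sift step then costs a single pointer store rather than a
simplehash lookup.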

Regards,
Jeff Davis

#66Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#65)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:

I suggest that you add a "heap_index" field to ReorderBufferTXN that
would point to the index into the heap's array (the same as
bh_nodeidx_entry.index in your patch). Each time an element moves
within the heap array, just follow the pointer to the
ReorderBufferTXN
object and update the heap_index -- no hash lookup required.

It looks like my email was slightly too late, as the work was already
committed.

My suggestion is not required for 17, and so it's fine if this waits
until the next CF. If it turns out to be a win we can consider
backporting to 17 just to keep the code consistent, otherwise it can go
in 18.

Regards,
Jeff Davis

#67Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Jeff Davis (#66)
Re: Improve eviction algorithm in ReorderBuffer

Hi,

On Thu, Apr 4, 2024 at 2:32 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:

I suggest that you add a "heap_index" field to ReorderBufferTXN that
would point to the index into the heap's array (the same as
bh_nodeidx_entry.index in your patch). Each time an element moves
within the heap array, just follow the pointer to the
ReorderBufferTXN
object and update the heap_index -- no hash lookup required.

It looks like my email was slightly too late, as the work was already
committed.

Thank you for the suggestions! I should have mentioned it earlier.

My suggestion is not required for 17, and so it's fine if this waits
until the next CF. If it turns out to be a win we can consider
backporting to 17 just to keep the code consistent, otherwise it can go
in 18.

IIUC, with your suggestion, sift_{up|down} needs to update the
heap_index field as well. Does it mean that the caller needs to pass
the address of heap_index down to sift_{up|down}?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#68Jeff Davis
pgsql@j-davis.com
In reply to: Masahiko Sawada (#67)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote:

IIUC, with your suggestion, sift_{up|down} needs to update the
heap_index field as well. Does it mean that the caller needs to pass
the address of heap_index down to sift_{up|down}?

I'm not sure quite how binaryheap should be changed. Bringing the heap
implementation into reorderbuffer.c would obviously work, but that
would be more code. Another option might be to make the API of
binaryheap look a little more like simplehash, where some #defines
control optional behavior and can tell the implementation where to find
fields in the structure.
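
For reference, this is roughly how such compile-time parameterization
looks with simplehash today (recalled from memory and simplified, so
details may be off, but it shows the pattern):

#define SH_PREFIX bh_nodeidx
#define SH_ELEMENT_TYPE bh_nodeidx_entry
#define SH_KEY_TYPE bh_node_type
#define SH_KEY key
#define SH_HASH_KEY(tb, key) \
    hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
#define SH_SCOPE extern
#define SH_DEFINE
#include "lib/simplehash.h"

A binaryheap analog would need similar #defines telling the generated
code which field of the element stores the heap index.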

Perhaps it's not worth the effort though, if performance is already
good enough?

Regards,
Jeff Davis

#69Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Jeff Davis (#68)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Apr 4, 2024 at 1:54 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote:

IIUC, with your suggestion, sift_{up|down} needs to update the
heap_index field as well. Does it mean that the caller needs to pass
the address of heap_index down to sift_{up|down}?

I'm not sure quite how binaryheap should be changed. Bringing the heap
implementation into reorderbuffer.c would obviously work, but that
would be more code.

Right.

Another option might be to make the API of
binaryheap look a little more like simplehash, where some #defines
control optional behavior and can tell the implementation where to find
fields in the structure.

Interesting idea.

Perhaps it's not worth the effort though, if performance is already
good enough?

Yeah, it would be better to measure the overhead first. I'll do that.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#70Jeff Davis
pgsql@j-davis.com
In reply to: Masahiko Sawada (#69)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote:

Perhaps it's not worth the effort though, if performance is already
good enough?

Yeah, it would be better to measure the overhead first. I'll do that.

I have some further comments and I believe changes are required for
v17.

An indexed binary heap API requires both a comparator and a hash
function to be specified, and has two different kinds of keys: the heap
key (mutable) and the hash key (immutable). It provides heap methods
and hashtable methods, and keep the two internal structures (heap and
HT) in sync.

The implementation in b840508644 uses the bh_node_type as the hash key,
which is just a Datum, and it just hashes the bytes. I believe the
implicit assumption is that the Datum is a pointer -- I'm not sure how
one would use that API if the Datum were a value. Hashing a pointer
seems strange to me and, while I see why you did it that way, I think
it reflects that the API boundaries are not quite right.

One consequence of using the pointer as the hash key is that you need
to find the pointer first: you can't change or remove elements based on
the transaction ID, you have to get the ReorderBufferTXN pointer by
finding it in another structure, first. Currently, that's being done by
searching ReorderBuffer->by_txn. So we actually have two hash tables
for essentially the same purpose: one with xid as the key, and the
other with the pointer as the key. That makes no sense -- let's have a
proper indexed binary heap to look things up by xid (the internal HT)
or by transaction size (using the internal heap).

I suggest:

* Make a proper indexed binary heap API that accepts a hash function
and provides both heap methods and HT methods that operate based on
values (transaction size and transaction ID, respectively).
* Get rid of ReorderBuffer->by_txn and use the indexed binary heap
instead.

This will be a net simplification in reorderbuffer.c, which is good,
because that file makes use of a *lot* of data structures.
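
To make the shape of that concrete, a purely hypothetical API sketch
(none of these names exist today; the hash key is the xid and the heap
key is the transaction size):

typedef struct indexed_binaryheap indexed_binaryheap;
typedef int (*ibh_comparator) (void *a, void *b, void *arg);
typedef uint32 (*ibh_hash_fn) (const void *elem, void *arg);

extern indexed_binaryheap *ibh_allocate(int num_nodes,
                                        ibh_comparator compare,
                                        ibh_hash_fn hash,
                                        void *arg);

/* hashtable-style methods, keyed by the immutable xid */
extern void *ibh_find(indexed_binaryheap *heap, TransactionId xid);
extern void  ibh_delete(indexed_binaryheap *heap, TransactionId xid);

/* heap-style methods, ordered by the mutable transaction size */
extern void *ibh_first(indexed_binaryheap *heap);
extern void  ibh_update(indexed_binaryheap *heap, TransactionId xid);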

Regards
Jeff Davis

#71Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#70)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote:

  * Make a proper indexed binary heap API that accepts a hash
function
and provides both heap methods and HT methods that operate based on
values (transaction size and transaction ID, respectively).
  * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
instead.

An alternative idea:

* remove the hash table from binaryheap.c

* supply a new callback to the binary heap with type like:

typedef void (*binaryheap_update_index)(
    bh_node_type node,
    int new_element_index);

* make the remove, update_up, and update_down methods take the element
index rather than the pointer

reorderbuffer.c would then do something like:

void
txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index)
{
    txn->heap_element_index = new_element_index;
}

...

txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...);

and then binaryheap.c would effectively maintain txn->heap_element_index,
so reorderbuffer.c can pass that to the APIs that require the element
index.

Another alternative is to keep the hash table in binaryheap.c, and
supply a hash function that hashes the xid. That leaves us with two
hash tables still, but it would be cleaner than hashing the pointer.
That might be best for right now, and we can consider these other ideas
later.
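
A sketch of such a hash callback, assuming the stored Datum is a
ReorderBufferTXN pointer (the callback signature binaryheap would
accept is of course up for discussion):

static uint32
txn_hash(Datum node, void *arg)
{
    ReorderBufferTXN *txn = (ReorderBufferTXN *) DatumGetPointer(node);

    /* hash the stable xid rather than the pointer's bytes */
    return hash_uint32((uint32) txn->xid);
}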

Regards,
Jeff Davis

#72Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Jeff Davis (#70)
1 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, Apr 5, 2024 at 2:55 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote:

Perhaps it's not worth the effort though, if performance is already
good enough?

Yeah, it would be better to measure the overhead first. I'll do that.

I have some further comments and I believe changes are required for
v17.

An indexed binary heap API requires both a comparator and a hash
function to be specified, and has two different kinds of keys: the heap
key (mutable) and the hash key (immutable). It provides heap methods
and hashtable methods, and keep the two internal structures (heap and
HT) in sync.

IIUC for example in ReorderBuffer, the heap key is transaction size
and the hash key is xid.

The implementation in b840508644 uses the bh_node_type as the hash key,
which is just a Datum, and it just hashes the bytes. I believe the
implicit assumption is that the Datum is a pointer -- I'm not sure how
one would use that API if the Datum were a value. Hashing a pointer
seems strange to me and, while I see why you did it that way, I think
it reflects that the API boundaries are not quite right.

I see your point. It assumes that the bh_node_type is a pointer or at
least unique. So it cannot work with Datum being a value.

One consequence of using the pointer as the hash key is that you need
to find the pointer first: you can't change or remove elements based on
the transaction ID, you have to get the ReorderBufferTXN pointer by
finding it in another structure, first. Currently, that's being done by
searching ReorderBuffer->by_txn. So we actually have two hash tables
for essentially the same purpose: one with xid as the key, and the
other with the pointer as the key. That makes no sense -- let's have a
proper indexed binary heap to look things up by xid (the internal HT)
or by transaction size (using the internal heap).

I suggest:

* Make a proper indexed binary heap API that accepts a hash function
and provides both heap methods and HT methods that operate based on
values (transaction size and transaction ID, respectively).
* Get rid of ReorderBuffer->by_txn and use the indexed binary heap
instead.

This will be a net simplification in reorderbuffer.c, which is good,
because that file makes use of a *lot* of data structures.

It sounds like a data structure that combines the hash table and the
binary heap, used as the main storage (e.g. for ReorderBufferTXN)
instead of using the binary heap as a secondary data structure. IIUC,
with your idea the indexed binary heap has a hash table to store
elements, each of which carries its index within the heap node array.
I guess it's better to create this as a new data structure rather than
extending the existing binaryheap, since the APIs could be very
different. I might be missing something, though.

On Fri, Apr 5, 2024 at 3:55 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote:

* Make a proper indexed binary heap API that accepts a hash
function
and provides both heap methods and HT methods that operate based on
values (transaction size and transaction ID, respectively).
* Get rid of ReorderBuffer->by_txn and use the indexed binary heap
instead.

An alternative idea:

* remove the hash table from binaryheap.c

* supply a new callback to the binary heap with type like:

typedef void (*binaryheap_update_index)(
    bh_node_type node,
    int new_element_index);

* make the remove, update_up, and update_down methods take the element
index rather than the pointer

reorderbuffer.c would then do something like:

void
txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index)
{
    txn->heap_element_index = new_element_index;
}

...

txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...);

and then binaryheap.c would effectively maintain txn->heap_element_index,
so reorderbuffer.c can pass that to the APIs that require the element
index.

Thank you for the idea. I was thinking along the same lines when
considering your previous comment. With this approach, we still use the
binaryheap for ReorderBuffer as a secondary data structure. Since it
requires only relatively small changes to the current binaryheap, I've
implemented it and measured performance.

I've attached a patch that adds an extension for benchmarking
binaryheap implementations. binaryheap_bench.c is the main test
module. To make the comparison between different binaryheap
implementations, the extension includes two different binaryheap
implementations. Therefore, binaryheap_bench.c uses three different
binaryheap implementation in total as the comment on top of the file
says:

/*
* This benchmark tool uses three binary heap implementations.
*
* "binaryheap" is the current binaryheap implementation in PostgreSQL. That
* is, it internally has a hash table to track each node index within the
* node array.
*
* "xx_binaryheap" is based on "binaryheap" but removes the hash table.
* Instead, each element carries its own index within the node array. The
* element's index is updated by the callback function, xx_binaryheap_update_index_fn,
* specified at xx_binaryheap_allocate().
*
* "old_binaryheap" is the binaryheap implementation before the "indexed" binary
* heap changes were made. It neither has an internal hash table nor tracks nodes'
* indexes.
*/

That is, xx_binaryheap is the binaryheap implementation suggested above.

The bench_load() function measures the time for adding elements (i.e.
using binaryheap_add() and similar). Here are results:

postgres(1:3882886)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(true, 10000000 * (1+x-x)));
 x |   cnt    | load_ms | xx_load_ms | old_load_ms
---+----------+---------+------------+-------------
 1 | 10000000 |    4372 |        582 |         429
 2 | 10000000 |    4371 |        582 |         429
 3 | 10000000 |    4373 |        582 |         429
(3 rows)

This shows that the current indexed binaryheap is much slower than the
other two implementations, and the xx_binaryheap has a good number in
spite of also being indexed.

Here is another run that disables indexing on the current binaryheap:

postgres(1:3882886)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(false, 10000000 * (1+x-x)));
 x |   cnt    | load_ms | xx_load_ms | old_load_ms
---+----------+---------+------------+-------------
 1 | 10000000 |     697 |        579 |         430
 2 | 10000000 |     704 |        582 |         430
 3 | 10000000 |     698 |        581 |         429
(3 rows)

This shows that there is still a performance regression in the current
binaryheap even when indexing is disabled. xx_binaryheap also shows
some regression. I haven't investigated the root cause yet, though.

Overall, we can say there is large room to improve the current
binaryheap performance, as you pointed out. Implementing the above
idea (i.e. changing binaryheap to xx_binaryheap) was simple: we just
replace the code that updates the hash table with a call to the
callback, assuming we get consensus on the API change.

Another alternative is to keep the hash table in binaryheap.c, and
supply a hash function that hashes the xid. That leaves us with two
hash tables still, but it would be cleaner than hashing the pointer.
That might be best for right now, and we can consider these other ideas
later.

The fact that we use simplehash for the internal hash table might make
this idea complex. If I understand your suggestion correctly, the
caller needs to supply the hash function when creating a binaryheap,
but simplehash requires the hash function to be specified at compile
time. We could use a dynahash instead, but that would make the
binaryheap even slower.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

0001-binaryheap_bench.patch
From 2a0ff8958f9ca6beae88dab6f9431210faa67c9d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 5 Apr 2024 16:08:55 +0900
Subject: [PATCH] binaryheap_bench.

---
 contrib/binaryheap_bench/.gitignore           |   4 +
 contrib/binaryheap_bench/Makefile             |  26 +
 .../binaryheap_bench--1.0.sql                 |  24 +
 contrib/binaryheap_bench/binaryheap_bench.c   | 208 ++++++++
 .../binaryheap_bench/binaryheap_bench.control |   5 +
 contrib/binaryheap_bench/meson.build          |  25 +
 contrib/binaryheap_bench/old_binaryheap.c     | 368 ++++++++++++++
 contrib/binaryheap_bench/old_binaryheap.h     |  69 +++
 contrib/binaryheap_bench/xx_binaryheap.c      | 463 ++++++++++++++++++
 contrib/binaryheap_bench/xx_binaryheap.h      |  76 +++
 10 files changed, 1268 insertions(+)
 create mode 100644 contrib/binaryheap_bench/.gitignore
 create mode 100644 contrib/binaryheap_bench/Makefile
 create mode 100644 contrib/binaryheap_bench/binaryheap_bench--1.0.sql
 create mode 100644 contrib/binaryheap_bench/binaryheap_bench.c
 create mode 100644 contrib/binaryheap_bench/binaryheap_bench.control
 create mode 100644 contrib/binaryheap_bench/meson.build
 create mode 100644 contrib/binaryheap_bench/old_binaryheap.c
 create mode 100644 contrib/binaryheap_bench/old_binaryheap.h
 create mode 100644 contrib/binaryheap_bench/xx_binaryheap.c
 create mode 100644 contrib/binaryheap_bench/xx_binaryheap.h

diff --git a/contrib/binaryheap_bench/.gitignore b/contrib/binaryheap_bench/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/contrib/binaryheap_bench/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/binaryheap_bench/Makefile b/contrib/binaryheap_bench/Makefile
new file mode 100644
index 0000000000..d5b1208430
--- /dev/null
+++ b/contrib/binaryheap_bench/Makefile
@@ -0,0 +1,26 @@
+# contrib/binaryheap_bench/Makefile
+
+MODULE_big = binaryheap_bench
+OBJS = \
+	$(WIN32RES) \
+	xx_binaryheap.o \
+	old_binaryheap.o \
+	binaryheap_bench.o
+
+EXTENSION = binaryheap_bench
+DATA = binaryheap_bench--1.0.sql
+PGFILEDESC = "binaryheap_bench"
+
+REGRESS = binaryheap_bench
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/binaryheap_bench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/binaryheap_bench/binaryheap_bench--1.0.sql b/contrib/binaryheap_bench/binaryheap_bench--1.0.sql
new file mode 100644
index 0000000000..bc10e0ed57
--- /dev/null
+++ b/contrib/binaryheap_bench/binaryheap_bench--1.0.sql
@@ -0,0 +1,24 @@
+/* contrib/binaryheap_bench/binaryheap_bench--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION binaryheap_bench" to load this file. \quit
+
+CREATE FUNCTION bench_load(
+indexed bool,
+cnt int8,
+OUT cnt int8,
+OUT load_ms int8,
+OUT xx_load_ms int8,
+OUT old_load_ms int8)
+RETURNS record
+AS 'MODULE_PATHNAME', 'bench_load'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION bench_sift_down(
+cnt int8,
+OUT cnt int8,
+OUT sift_ms int8,
+OUT xx_sift_ms int8)
+RETURNS record
+AS 'MODULE_PATHNAME', 'bench_sift_down'
+LANGUAGE C STRICT;
diff --git a/contrib/binaryheap_bench/binaryheap_bench.c b/contrib/binaryheap_bench/binaryheap_bench.c
new file mode 100644
index 0000000000..5de76fdbb7
--- /dev/null
+++ b/contrib/binaryheap_bench/binaryheap_bench.c
@@ -0,0 +1,208 @@
+/*-------------------------------------------------------------------------
+ *
+ * binaryheap_bench.c
+ *
+ * Copyright (c) 2016-2024, PostgreSQL Global Development Group
+ *
+ *	  contrib/binaryheap_bench/binaryheap_bench.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+
+/*
+ * This benchmark tool uses three binary heap implementations.
+ *
+ * "binaryheap" is the current binaryheap implementation in PostgreSQL. That
+ * is, it internally has a hash table to track each node index within the
+ * node array.
+ *
+ * "xx_binaryheap" is based on "binaryheap" but removes the hash table.
+ * Instead, each element carries its own index within the node array. The
+ * element's index is updated by the callback function, xx_binaryheap_update_index_fn,
+ * specified at xx_binaryheap_allocate().
+ *
+ * "old_binaryheap" is the binaryheap implementation before the "indexed" binary
+ * heap changes were made. It neither has an internal hash table nor tracks nodes'
+ * indexes.
+ */
+#include "lib/binaryheap.h"
+#include "xx_binaryheap.h"
+#include "old_binaryheap.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(bench_load);
+PG_FUNCTION_INFO_V1(bench_sift_down);
+
+typedef struct test_elem
+{
+	int64		key;
+	int			index; /* used only for xx_binaryheap */
+} test_elem;
+
+/* comparator for max-heap */
+static int
+test_elem_cmp(Datum a, Datum b, void *arg)
+{
+	test_elem *e1 = (test_elem *) DatumGetPointer(a);
+	test_elem *e2 = (test_elem *) DatumGetPointer(b);
+
+	if (e1->key < e2->key)
+		return -1;
+	else if (e1->key > e2->key)
+		return 1;
+	return 0;
+}
+
+static void
+test_update_index(Datum a, int new_element_index)
+{
+	test_elem *e = (test_elem *) DatumGetPointer(a);
+	e->index = new_element_index;
+}
+
+Datum
+bench_load(PG_FUNCTION_ARGS)
+{
+	bool	indexed = PG_GETARG_BOOL(0);
+	int64	cnt = PG_GETARG_INT64(1);
+	test_elem	*values;
+	binaryheap	*heap;
+	xx_binaryheap *xx_heap;
+	old_binaryheap *old_heap;
+	TupleDesc	tupdesc;
+	TimestampTz start_time, end_time;
+	long		secs;
+	int			usecs;
+	int64		load_ms, xx_load_ms, old_load_ms;
+	Datum	vals[4];
+	bool	nulls[4];
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/* generate test data */
+	values = (test_elem *) palloc(sizeof(test_elem) * cnt);
+	for (int64 i = 0; i < cnt; i++)
+		values[i].key = i;
+
+	heap = binaryheap_allocate(cnt, test_elem_cmp, indexed, NULL);
+	xx_heap = xx_binaryheap_allocate(cnt, test_elem_cmp, NULL, test_update_index);
+	old_heap = old_binaryheap_allocate(cnt, test_elem_cmp, NULL);
+
+	/* measure load time of binaryheap */
+	start_time = GetCurrentTimestamp();
+	for (int64 i = 0; i < cnt; i++)
+		binaryheap_add(heap, PointerGetDatum(&(values[i])));
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	load_ms = secs * 1000 + usecs / 1000;
+
+	/* measure load time of xx_binaryheap */
+	start_time = GetCurrentTimestamp();
+	for (int64 i = 0; i < cnt; i++)
+		xx_binaryheap_add(xx_heap, PointerGetDatum(&(values[i])));
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	xx_load_ms = secs * 1000 + usecs / 1000;
+
+	/* measure load time of old_binaryheap */
+	start_time = GetCurrentTimestamp();
+	for (int64 i = 0; i < cnt; i++)
+		old_binaryheap_add(old_heap, PointerGetDatum(&(values[i])));
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	old_load_ms = secs * 1000 + usecs / 1000;
+
+	MemSet(nulls, false, sizeof(nulls));
+	vals[0] = Int64GetDatum(cnt);
+	vals[1] = Int64GetDatum(load_ms);
+	vals[2] = Int64GetDatum(xx_load_ms);
+	vals[3] = Int64GetDatum(old_load_ms);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, vals, nulls)));
+}
+
+Datum
+bench_sift_down(PG_FUNCTION_ARGS)
+{
+	int64	cnt = PG_GETARG_INT64(0);
+	test_elem	*values;
+	binaryheap	*heap;
+	xx_binaryheap *xx_heap;
+	TupleDesc	tupdesc;
+	TimestampTz start_time, end_time;
+	long		secs;
+	int			usecs;
+	int64		sift_ms, xx_sift_ms;
+	Datum	vals[3];
+	bool	nulls[3];
+	test_elem * e;
+	int64	old_key;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/* generate test data */
+	values = (test_elem *) palloc(sizeof(test_elem) * cnt);
+	for (int64 i = 0; i < cnt; i++)
+		values[i].key = i;
+
+	heap = binaryheap_allocate(cnt, test_elem_cmp, true, NULL);
+	xx_heap = xx_binaryheap_allocate(cnt, test_elem_cmp, NULL, test_update_index);
+
+	/*
+	 * test for binaryheap.
+	 *
+	 * 1. load the test data.
+	 * 2. measure the time of sifting down the top node while decreasing the key
+	 */
+	for (int64 i = 0; i < cnt; i++)
+		binaryheap_add(heap, PointerGetDatum(&(values[i])));
+	e = (test_elem *) DatumGetPointer(binaryheap_first(heap));
+	old_key = e->key;
+	start_time = GetCurrentTimestamp();
+	for (int64 i = 0; i < cnt; i++)
+	{
+		e->key--;
+		binaryheap_update_down(heap, PointerGetDatum(e));
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	sift_ms = secs * 1000 + usecs / 1000;
+
+	/* restore the old key */
+	e->key = old_key;
+
+	/*
+	 * test for xx_binaryheap.
+	 *
+	 * 1. load the test data.
+	 * 2. measure the time of sifting down the top node while decreasing the key
+	 */
+	for (int64 i = 0; i < cnt; i++)
+		xx_binaryheap_add(xx_heap, PointerGetDatum(&(values[i])));
+	e = (test_elem *) DatumGetPointer(xx_binaryheap_first(xx_heap));
+	start_time = GetCurrentTimestamp();
+	for (int64 i = 0; i < cnt; i++)
+	{
+		e->key--;
+		xx_binaryheap_update_down(xx_heap, e->index);
+	}
+	end_time = GetCurrentTimestamp();
+	TimestampDifference(start_time, end_time, &secs, &usecs);
+	xx_sift_ms = secs * 1000 + usecs / 1000;
+
+	MemSet(nulls, false, sizeof(nulls));
+	vals[0] = Int64GetDatum(cnt);
+	vals[1] = Int64GetDatum(sift_ms);
+	vals[2] = Int64GetDatum(xx_sift_ms);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, vals, nulls)));
+}
diff --git a/contrib/binaryheap_bench/binaryheap_bench.control b/contrib/binaryheap_bench/binaryheap_bench.control
new file mode 100644
index 0000000000..e6a1e190be
--- /dev/null
+++ b/contrib/binaryheap_bench/binaryheap_bench.control
@@ -0,0 +1,5 @@
+# binaryheap_bench extension
+comment = 'benchmark tool for binary heap'
+default_version = '1.0'
+module_pathname = '$libdir/binaryheap_bench'
+relocatable = true
diff --git a/contrib/binaryheap_bench/meson.build b/contrib/binaryheap_bench/meson.build
new file mode 100644
index 0000000000..64cc7d3687
--- /dev/null
+++ b/contrib/binaryheap_bench/meson.build
@@ -0,0 +1,25 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+binaryheap_bench_sources = files(
+  'binaryheap_bench.c',
+  'xx_binaryheap.c',
+  'old_binaryheap.c',
+)
+
+if host_system == 'windows'
+  binaryheap_bench_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'binaryheap_bench',
+    '--FILEDESC', 'binaryheap_bench',])
+endif
+
+binaryheap_bench = shared_module('binaryheap_bench',
+  binaryheap_bench_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += binaryheap_bench
+
+install_data(
+  'binaryheap_bench--1.0.sql',
+  'binaryheap_bench.control',
+  kwargs: contrib_data_args,
+)
diff --git a/contrib/binaryheap_bench/old_binaryheap.c b/contrib/binaryheap_bench/old_binaryheap.c
new file mode 100644
index 0000000000..78bdcc63a7
--- /dev/null
+++ b/contrib/binaryheap_bench/old_binaryheap.c
@@ -0,0 +1,368 @@
+/*-------------------------------------------------------------------------
+ *
+ * old_binaryheap.c
+ *	  A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/common/old_binaryheap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
+#include "postgres.h"
+#endif
+
+#include <math.h>
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+#include "old_binaryheap.h"
+
+static void sift_down(old_binaryheap *heap, int node_off);
+static void sift_up(old_binaryheap *heap, int node_off);
+
+/*
+ * old_binaryheap_allocate
+ *
+ * Returns a pointer to a newly-allocated heap with the given initial number
+ * of nodes, and with the heap property defined by the given comparator
+ * function, which will be invoked with the additional argument specified by
+ * 'arg'.
+ */
+old_binaryheap *
+old_binaryheap_allocate(int num_nodes, old_binaryheap_comparator compare, void *arg)
+{
+	old_binaryheap *heap;
+
+	heap = (old_binaryheap *) palloc(sizeof(old_binaryheap));
+	heap->bh_space = num_nodes;
+	heap->bh_compare = compare;
+	heap->bh_arg = arg;
+
+	heap->bh_size = 0;
+	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * num_nodes);
+
+	return heap;
+}
+
+/*
+ * old_binaryheap_reset
+ *
+ * Resets the heap to an empty state, losing its data content but not the
+ * parameters passed at allocation.
+ */
+void
+old_binaryheap_reset(old_binaryheap *heap)
+{
+	heap->bh_size = 0;
+	heap->bh_has_heap_property = true;
+}
+
+/*
+ * old_binaryheap_free
+ *
+ * Releases memory used by the given old_binaryheap.
+ */
+void
+old_binaryheap_free(old_binaryheap *heap)
+{
+	pfree(heap->bh_nodes);
+	pfree(heap);
+}
+
+/*
+ * These utility functions return the offset of the left child, right
+ * child, and parent of the node at the given index, respectively.
+ *
+ * The heap is represented as an array of nodes, with the root node
+ * stored at index 0. The left child of node i is at index 2*i+1, and
+ * the right child at 2*i+2. The parent of node i is at index (i-1)/2.
+ */
+
+static inline int
+left_offset(int i)
+{
+	return 2 * i + 1;
+}
+
+static inline int
+right_offset(int i)
+{
+	return 2 * i + 2;
+}
+
+static inline int
+parent_offset(int i)
+{
+	return (i - 1) / 2;
+}
+
+/*
+ * Double the space allocated for nodes.
+ */
+static void
+enlarge_node_array(old_binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
+/*
+ * old_binaryheap_add_unordered
+ *
+ * Adds the given datum to the end of the heap's list of nodes in O(1) without
+ * preserving the heap property. This is a convenience to add elements quickly
+ * to a new heap. To obtain a valid heap, one must call old_binaryheap_build()
+ * afterwards.
+ */
+void
+old_binaryheap_add_unordered(old_binaryheap *heap, bh_node_type d)
+{
+	/* make sure enough space for a new node */
+	if (heap->bh_size >= heap->bh_space)
+		enlarge_node_array(heap);
+
+	heap->bh_has_heap_property = false;
+	heap->bh_nodes[heap->bh_size] = d;
+	heap->bh_size++;
+}
+
+/*
+ * old_binaryheap_build
+ *
+ * Assembles a valid heap in O(n) from the nodes added by
+ * old_binaryheap_add_unordered(). Not needed otherwise.
+ */
+void
+old_binaryheap_build(old_binaryheap *heap)
+{
+	int			i;
+
+	for (i = parent_offset(heap->bh_size - 1); i >= 0; i--)
+		sift_down(heap, i);
+	heap->bh_has_heap_property = true;
+}
+
+/*
+ * old_binaryheap_add
+ *
+ * Adds the given datum to the heap in O(log n) time, while preserving
+ * the heap property.
+ */
+void
+old_binaryheap_add(old_binaryheap *heap, bh_node_type d)
+{
+	/* make sure enough space for a new node */
+	if (heap->bh_size >= heap->bh_space)
+		enlarge_node_array(heap);
+
+	heap->bh_nodes[heap->bh_size] = d;
+	heap->bh_size++;
+	sift_up(heap, heap->bh_size - 1);
+}
+
+/*
+ * old_binaryheap_first
+ *
+ * Returns a pointer to the first (root, topmost) node in the heap
+ * without modifying the heap. The caller must ensure that this
+ * routine is not used on an empty heap. Always O(1).
+ */
+bh_node_type
+old_binaryheap_first(old_binaryheap *heap)
+{
+	Assert(!old_binaryheap_empty(heap) && heap->bh_has_heap_property);
+	return heap->bh_nodes[0];
+}
+
+/*
+ * old_binaryheap_remove_first
+ *
+ * Removes the first (root, topmost) node in the heap and returns a
+ * pointer to it after rebalancing the heap. The caller must ensure
+ * that this routine is not used on an empty heap. O(log n) worst
+ * case.
+ */
+bh_node_type
+old_binaryheap_remove_first(old_binaryheap *heap)
+{
+	bh_node_type result;
+
+	Assert(!old_binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	/* extract the root node, which will be the result */
+	result = heap->bh_nodes[0];
+
+	/* easy if heap contains one element */
+	if (heap->bh_size == 1)
+	{
+		heap->bh_size--;
+		return result;
+	}
+
+	/*
+	 * Remove the last node, placing it in the vacated root entry, and sift
+	 * the new root node down to its correct position.
+	 */
+	heap->bh_nodes[0] = heap->bh_nodes[--heap->bh_size];
+	sift_down(heap, 0);
+
+	return result;
+}
+
+/*
+ * old_binaryheap_remove_node
+ *
+ * Removes the nth (zero based) node from the heap.  The caller must ensure
+ * that there are at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+old_binaryheap_remove_node(old_binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!old_binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	heap->bh_nodes[n] = heap->bh_nodes[heap->bh_size];
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
+/*
+ * old_binaryheap_replace_first
+ *
+ * Replace the topmost element of a non-empty heap, preserving the heap
+ * property.  O(1) in the best case, or O(log n) if it must fall back to
+ * sifting the new node down.
+ */
+void
+old_binaryheap_replace_first(old_binaryheap *heap, bh_node_type d)
+{
+	Assert(!old_binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	heap->bh_nodes[0] = d;
+
+	if (heap->bh_size > 1)
+		sift_down(heap, 0);
+}
+
+/*
+ * Sift a node up to the highest position it can hold according to the
+ * comparator.
+ */
+static void
+sift_up(old_binaryheap *heap, int node_off)
+{
+	bh_node_type node_val = heap->bh_nodes[node_off];
+
+	/*
+	 * Within the loop, the node_off'th array entry is a "hole" that
+	 * notionally holds node_val, but we don't actually store node_val there
+	 * till the end, saving some unnecessary data copying steps.
+	 */
+	while (node_off != 0)
+	{
+		int			cmp;
+		int			parent_off;
+		bh_node_type parent_val;
+
+		/*
+		 * If this node is smaller than its parent, the heap condition is
+		 * satisfied, and we're done.
+		 */
+		parent_off = parent_offset(node_off);
+		parent_val = heap->bh_nodes[parent_off];
+		cmp = heap->bh_compare(node_val,
+							   parent_val,
+							   heap->bh_arg);
+		if (cmp <= 0)
+			break;
+
+		/*
+		 * Otherwise, swap the parent value with the hole, and go on to check
+		 * the node's new parent.
+		 */
+		heap->bh_nodes[node_off] = parent_val;
+		node_off = parent_off;
+	}
+	/* Re-fill the hole */
+	heap->bh_nodes[node_off] = node_val;
+}
+
+/*
+ * Sift a node down from its current position to satisfy the heap
+ * property.
+ */
+static void
+sift_down(old_binaryheap *heap, int node_off)
+{
+	bh_node_type node_val = heap->bh_nodes[node_off];
+
+	/*
+	 * Within the loop, the node_off'th array entry is a "hole" that
+	 * notionally holds node_val, but we don't actually store node_val there
+	 * till the end, saving some unnecessary data copying steps.
+	 */
+	while (true)
+	{
+		int			left_off = left_offset(node_off);
+		int			right_off = right_offset(node_off);
+		int			swap_off = 0;
+
+		/* Is the left child larger than the parent? */
+		if (left_off < heap->bh_size &&
+			heap->bh_compare(node_val,
+							 heap->bh_nodes[left_off],
+							 heap->bh_arg) < 0)
+			swap_off = left_off;
+
+		/* Is the right child larger than the parent? */
+		if (right_off < heap->bh_size &&
+			heap->bh_compare(node_val,
+							 heap->bh_nodes[right_off],
+							 heap->bh_arg) < 0)
+		{
+			/* swap with the larger child */
+			if (!swap_off ||
+				heap->bh_compare(heap->bh_nodes[left_off],
+								 heap->bh_nodes[right_off],
+								 heap->bh_arg) < 0)
+				swap_off = right_off;
+		}
+
+		/*
+		 * If we didn't find anything to swap, the heap condition is
+		 * satisfied, and we're done.
+		 */
+		if (!swap_off)
+			break;
+
+		/*
+		 * Otherwise, swap the hole with the child that violates the heap
+		 * property; then go on to check its children.
+		 */
+		heap->bh_nodes[node_off] = heap->bh_nodes[swap_off];
+		node_off = swap_off;
+	}
+	/* Re-fill the hole */
+	heap->bh_nodes[node_off] = node_val;
+}
diff --git a/contrib/binaryheap_bench/old_binaryheap.h b/contrib/binaryheap_bench/old_binaryheap.h
new file mode 100644
index 0000000000..23c564bfbd
--- /dev/null
+++ b/contrib/binaryheap_bench/old_binaryheap.h
@@ -0,0 +1,69 @@
+/*
+ * old_binaryheap.h
+ *
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * src/include/lib/old_binaryheap.h
+ */
+
+#ifndef OLD_BINARYHEAP_H
+#define OLD_BINARYHEAP_H
+
+/*
+ * We provide a Datum-based API for backend code and a void *-based API for
+ * frontend code (since the Datum definitions are not available to frontend
+ * code).  You should typically avoid using bh_node_type directly and instead
+ * use Datum or void * as appropriate.
+ */
+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif
+
+/*
+ * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
+ * and >0 iff a > b.  For a min-heap, the conditions are reversed.
+ */
+typedef int (*old_binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
+
+/*
+ * old_binaryheap
+ *
+ *		bh_size			how many nodes are currently in "nodes"
+ *		bh_space		how many nodes can be stored in "nodes"
+ *		bh_has_heap_property	no unordered operations since last heap build
+ *		bh_compare		comparison function to define the heap property
+ *		bh_arg			user data for comparison function
+ *		bh_nodes		variable-length array of "space" nodes
+ */
+typedef struct old_binaryheap
+{
+	int			bh_size;
+	int			bh_space;
+	bool		bh_has_heap_property;	/* debugging cross-check */
+	old_binaryheap_comparator bh_compare;
+	void	   *bh_arg;
+	bh_node_type *bh_nodes;
+} old_binaryheap;
+
+extern old_binaryheap *old_binaryheap_allocate(int num_nodes,
+									   old_binaryheap_comparator compare,
+									   void *arg);
+extern void old_binaryheap_reset(old_binaryheap *heap);
+extern void old_binaryheap_free(old_binaryheap *heap);
+extern void old_binaryheap_add_unordered(old_binaryheap *heap, bh_node_type d);
+extern void old_binaryheap_build(old_binaryheap *heap);
+extern void old_binaryheap_add(old_binaryheap *heap, bh_node_type d);
+extern bh_node_type old_binaryheap_first(old_binaryheap *heap);
+extern bh_node_type old_binaryheap_remove_first(old_binaryheap *heap);
+extern void old_binaryheap_remove_node(old_binaryheap *heap, int n);
+extern void old_binaryheap_replace_first(old_binaryheap *heap, bh_node_type d);
+
+#define old_binaryheap_empty(h)			((h)->bh_size == 0)
+#define old_binaryheap_size(h)			((h)->bh_size)
+#define old_binaryheap_get_node(h, n)	((h)->bh_nodes[n])
+
+#endif							/* OLD_BINARYHEAP_H */
diff --git a/contrib/binaryheap_bench/xx_binaryheap.c b/contrib/binaryheap_bench/xx_binaryheap.c
new file mode 100644
index 0000000000..41e4ed1549
--- /dev/null
+++ b/contrib/binaryheap_bench/xx_binaryheap.c
@@ -0,0 +1,463 @@
+/*-------------------------------------------------------------------------
+ *
+ * xx_binaryheap.c
+ *	  A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/common/xx_binaryheap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
+#include "postgres.h"
+#endif
+
+#include <math.h>
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+#include "common/hashfn.h"
+#include "xx_binaryheap.h"
+
+static void sift_down(xx_binaryheap *heap, int node_off);
+static void sift_up(xx_binaryheap *heap, int node_off);
+
+/*
+ * xx_binaryheap_allocate
+ *
+ * Returns a pointer to a newly-allocated heap with the given initial number
+ * of nodes, and with the heap property defined by the given comparator
+ * function, which will be invoked with the additional argument specified by
+ * 'arg'.
+ *
+ * If 'update_index' is non-NULL, it is called whenever a node's position
+ * within the node array changes, allowing the caller to track each node's
+ * index (see xx_binaryheap_update_up() and xx_binaryheap_update_down()).
+ */
+xx_binaryheap *
+xx_binaryheap_allocate(int num_nodes, xx_binaryheap_comparator compare,
+					   void *arg, xx_binaryheap_update_index_fn update_index)
+{
+	xx_binaryheap *heap;
+
+	heap = (xx_binaryheap *) palloc(sizeof(xx_binaryheap));
+	heap->bh_space = num_nodes;
+	heap->bh_compare = compare;
+	heap->bh_arg = arg;
+
+	heap->bh_size = 0;
+	heap->bh_has_heap_property = true;
+	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * num_nodes);
+	heap->bh_update_index = update_index;
+
+	return heap;
+}
+
+/*
+ * xx_binaryheap_reset
+ *
+ * Resets the heap to an empty state, losing its data content but not the
+ * parameters passed at allocation.
+ */
+void
+xx_binaryheap_reset(xx_binaryheap *heap)
+{
+	heap->bh_size = 0;
+	heap->bh_has_heap_property = true;
+}
+
+/*
+ * xx_binaryheap_free
+ *
+ * Releases memory used by the given xx_binaryheap.
+ */
+void
+xx_binaryheap_free(xx_binaryheap *heap)
+{
+	pfree(heap->bh_nodes);
+	pfree(heap);
+}
+
+/*
+ * These utility functions return the offset of the left child, right
+ * child, and parent of the node at the given index, respectively.
+ *
+ * The heap is represented as an array of nodes, with the root node
+ * stored at index 0. The left child of node i is at index 2*i+1, and
+ * the right child at 2*i+2. The parent of node i is at index (i-1)/2.
+ */
+
+static inline int
+left_offset(int i)
+{
+	return 2 * i + 1;
+}
+
+static inline int
+right_offset(int i)
+{
+	return 2 * i + 2;
+}
+
+static inline int
+parent_offset(int i)
+{
+	return (i - 1) / 2;
+}
+
+/*
+ * Double the space allocated for nodes.
+ */
+static void
+enlarge_node_array(xx_binaryheap *heap)
+{
+	heap->bh_space *= 2;
+	heap->bh_nodes = repalloc(heap->bh_nodes,
+							  sizeof(bh_node_type) * heap->bh_space);
+}
+
+/*
+ * Set the given node at 'index' and, if the heap is indexed, notify the
+ * caller of the node's new position via the update-index callback.
+ */
+static inline void
+set_node(xx_binaryheap *heap, bh_node_type node, int index)
+{
+	/* Set the node to the nodes array */
+	heap->bh_nodes[index] = node;
+
+	/* Keep track of the node index */
+	if (xx_binaryheap_indexed(heap))
+		heap->bh_update_index(node, index);
+}
+
+/*
+ * Invalidate the node's tracked index if the heap is indexed.
+ */
+static inline void
+delete_nodeidx(xx_binaryheap *heap, bh_node_type node)
+{
+	/* XXX: how should removal be reflected in the tracked index? */
+	if (xx_binaryheap_indexed(heap))
+		heap->bh_update_index(node, -1);
+}
+
+/*
+ * Replace the existing node at 'index' with the given 'new_node'. Also
+ * update their positions accordingly. Note that we assume the new_node's
+ * position is already tracked if enabled, i.e. the new_node is already
+ * present in the heap.
+ */
+static inline void
+replace_node(xx_binaryheap *heap, int index, bh_node_type new_node)
+{
+	/* Quick return if not necessary to move */
+	if (heap->bh_nodes[index] == new_node)
+		return;
+
+	/* Remove the overwritten node's index */
+	delete_nodeidx(heap, heap->bh_nodes[index]);
+
+	/*
+	 * Replace it with the given new node. This node's position must also be
+	 * tracked as we assume to replace the node with the existing node.
+	 */
+	set_node(heap, new_node, index);
+}
+
+/*
+ * xx_binaryheap_add_unordered
+ *
+ * Adds the given datum to the end of the heap's list of nodes in O(1) without
+ * preserving the heap property. This is a convenience to add elements quickly
+ * to a new heap. To obtain a valid heap, one must call xx_binaryheap_build()
+ * afterwards.
+ */
+void
+xx_binaryheap_add_unordered(xx_binaryheap *heap, bh_node_type d)
+{
+	/* make sure enough space for a new node */
+	if (heap->bh_size >= heap->bh_space)
+		enlarge_node_array(heap);
+
+	heap->bh_has_heap_property = false;
+	set_node(heap, d, heap->bh_size);
+	heap->bh_size++;
+}
+
+/*
+ * xx_binaryheap_build
+ *
+ * Assembles a valid heap in O(n) from the nodes added by
+ * xx_binaryheap_add_unordered(). Not needed otherwise.
+ */
+void
+xx_binaryheap_build(xx_binaryheap *heap)
+{
+	int			i;
+
+	for (i = parent_offset(heap->bh_size - 1); i >= 0; i--)
+		sift_down(heap, i);
+	heap->bh_has_heap_property = true;
+}
+
+/*
+ * xx_binaryheap_add
+ *
+ * Adds the given datum to the heap in O(log n) time, while preserving
+ * the heap property.
+ */
+void
+xx_binaryheap_add(xx_binaryheap *heap, bh_node_type d)
+{
+	/* make sure enough space for a new node */
+	if (heap->bh_size >= heap->bh_space)
+		enlarge_node_array(heap);
+
+	set_node(heap, d, heap->bh_size);
+	heap->bh_size++;
+	sift_up(heap, heap->bh_size - 1);
+}
+
+/*
+ * xx_binaryheap_first
+ *
+ * Returns a pointer to the first (root, topmost) node in the heap
+ * without modifying the heap. The caller must ensure that this
+ * routine is not used on an empty heap. Always O(1).
+ */
+bh_node_type
+xx_binaryheap_first(xx_binaryheap *heap)
+{
+	Assert(!xx_binaryheap_empty(heap) && heap->bh_has_heap_property);
+	return heap->bh_nodes[0];
+}
+
+/*
+ * xx_binaryheap_remove_first
+ *
+ * Removes the first (root, topmost) node in the heap and returns a
+ * pointer to it after rebalancing the heap. The caller must ensure
+ * that this routine is not used on an empty heap. O(log n) worst
+ * case.
+ */
+bh_node_type
+xx_binaryheap_remove_first(xx_binaryheap *heap)
+{
+	bh_node_type result;
+
+	Assert(!xx_binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	/* extract the root node, which will be the result */
+	result = heap->bh_nodes[0];
+
+	/* easy if heap contains one element */
+	if (heap->bh_size == 1)
+	{
+		heap->bh_size--;
+		delete_nodeidx(heap, result);
+
+		return result;
+	}
+
+	/*
+	 * Remove the last node, placing it in the vacated root entry, and sift
+	 * the new root node down to its correct position.
+	 */
+	replace_node(heap, 0, heap->bh_nodes[--heap->bh_size]);
+	sift_down(heap, 0);
+
+	return result;
+}
+
+/*
+ * xx_binaryheap_remove_node
+ *
+ * Removes the nth (zero based) node from the heap.  The caller must ensure
+ * that there are at least (n + 1) nodes in the heap.  O(log n) worst case.
+ */
+void
+xx_binaryheap_remove_node(xx_binaryheap *heap, int n)
+{
+	int			cmp;
+
+	Assert(!xx_binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(n >= 0 && n < heap->bh_size);
+
+	/* compare last node to the one that is being removed */
+	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
+						   heap->bh_nodes[n],
+						   heap->bh_arg);
+
+	/* remove the last node, placing it in the vacated entry */
+	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
+
+	/* sift as needed to preserve the heap property */
+	if (cmp > 0)
+		sift_up(heap, n);
+	else if (cmp < 0)
+		sift_down(heap, n);
+}
+
+/*
+ * xx_binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+xx_binaryheap_update_up(xx_binaryheap *heap, int index)
+{
+	Assert(!xx_binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(xx_binaryheap_indexed(heap));
+
+	sift_up(heap, index);
+}
+
+/*
+ * xx_binaryheap_update_down
+ *
+ * Sift the given node down after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
+xx_binaryheap_update_down(xx_binaryheap *heap, int index)
+{
+	Assert(!xx_binaryheap_empty(heap) && heap->bh_has_heap_property);
+	Assert(xx_binaryheap_indexed(heap));
+
+	sift_down(heap, index);
+}
+
+/*
+ * xx_binaryheap_replace_first
+ *
+ * Replace the topmost element of a non-empty heap, preserving the heap
+ * property.  O(1) in the best case, or O(log n) if it must fall back to
+ * sifting the new node down.
+ */
+void
+xx_binaryheap_replace_first(xx_binaryheap *heap, bh_node_type d)
+{
+	Assert(!xx_binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	replace_node(heap, 0, d);
+
+	if (heap->bh_size > 1)
+		sift_down(heap, 0);
+}
+
+/*
+ * Sift a node up to the highest position it can hold according to the
+ * comparator.
+ */
+static void
+sift_up(xx_binaryheap *heap, int node_off)
+{
+	bh_node_type node_val = heap->bh_nodes[node_off];
+
+	/*
+	 * Within the loop, the node_off'th array entry is a "hole" that
+	 * notionally holds node_val, but we don't actually store node_val there
+	 * till the end, saving some unnecessary data copying steps.
+	 */
+	while (node_off != 0)
+	{
+		int			cmp;
+		int			parent_off;
+		bh_node_type parent_val;
+
+		/*
+		 * If this node is smaller than its parent, the heap condition is
+		 * satisfied, and we're done.
+		 */
+		parent_off = parent_offset(node_off);
+		parent_val = heap->bh_nodes[parent_off];
+		cmp = heap->bh_compare(node_val,
+							   parent_val,
+							   heap->bh_arg);
+		if (cmp <= 0)
+			break;
+
+		/*
+		 * Otherwise, swap the parent value with the hole, and go on to check
+		 * the node's new parent.
+		 */
+		set_node(heap, parent_val, node_off);
+		node_off = parent_off;
+	}
+	/* Re-fill the hole */
+	set_node(heap, node_val, node_off);
+}
+
+/*
+ * Sift a node down from its current position to satisfy the heap
+ * property.
+ */
+static void
+sift_down(xx_binaryheap *heap, int node_off)
+{
+	bh_node_type node_val = heap->bh_nodes[node_off];
+
+	/*
+	 * Within the loop, the node_off'th array entry is a "hole" that
+	 * notionally holds node_val, but we don't actually store node_val there
+	 * till the end, saving some unnecessary data copying steps.
+	 */
+	while (true)
+	{
+		int			left_off = left_offset(node_off);
+		int			right_off = right_offset(node_off);
+		int			swap_off = 0;
+
+		/* Is the left child larger than the parent? */
+		if (left_off < heap->bh_size &&
+			heap->bh_compare(node_val,
+							 heap->bh_nodes[left_off],
+							 heap->bh_arg) < 0)
+			swap_off = left_off;
+
+		/* Is the right child larger than the parent? */
+		if (right_off < heap->bh_size &&
+			heap->bh_compare(node_val,
+							 heap->bh_nodes[right_off],
+							 heap->bh_arg) < 0)
+		{
+			/* swap with the larger child */
+			if (!swap_off ||
+				heap->bh_compare(heap->bh_nodes[left_off],
+								 heap->bh_nodes[right_off],
+								 heap->bh_arg) < 0)
+				swap_off = right_off;
+		}
+
+		/*
+		 * If we didn't find anything to swap, the heap condition is
+		 * satisfied, and we're done.
+		 */
+		if (!swap_off)
+			break;
+
+		/*
+		 * Otherwise, swap the hole with the child that violates the heap
+		 * property; then go on to check its children.
+		 */
+		set_node(heap, heap->bh_nodes[swap_off], node_off);
+		node_off = swap_off;
+	}
+	/* Re-fill the hole */
+	set_node(heap, node_val, node_off);
+}
diff --git a/contrib/binaryheap_bench/xx_binaryheap.h b/contrib/binaryheap_bench/xx_binaryheap.h
new file mode 100644
index 0000000000..ac7851557f
--- /dev/null
+++ b/contrib/binaryheap_bench/xx_binaryheap.h
@@ -0,0 +1,76 @@
+/*
+ * xx_binaryheap.h
+ *
+ * A simple binary heap implementation
+ *
+ * Portions Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * src/include/lib/xx_binaryheap.h
+ */
+
+#ifndef XX_BINARYHEAP_H
+#define XX_BINARYHEAP_H
+
+/*
+ * We provide a Datum-based API for backend code and a void *-based API for
+ * frontend code (since the Datum definitions are not available to frontend
+ * code).  You should typically avoid using bh_node_type directly and instead
+ * use Datum or void * as appropriate.
+ */
+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif
+
+/*
+ * For a max-heap, the comparator must return <0 iff a < b, 0 iff a == b,
+ * and >0 iff a > b.  For a min-heap, the conditions are reversed.
+ */
+typedef int (*xx_binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
+
+typedef void (*xx_binaryheap_update_index_fn) (bh_node_type node, int new_element_index);
+
+/*
+ * xx_binaryheap
+ *
+ *		bh_size			how many nodes are currently in "nodes"
+ *		bh_space		how many nodes can be stored in "nodes"
+ *		bh_has_heap_property	no unordered operations since last heap build
+ *		bh_compare		comparison function to define the heap property
+ *		bh_arg			user data for comparison function
+ *		bh_nodes		variable-length array of "space" nodes
+ */
+typedef struct xx_binaryheap
+{
+	int			bh_size;
+	int			bh_space;
+	bool		bh_has_heap_property;	/* debugging cross-check */
+	xx_binaryheap_comparator bh_compare;
+	xx_binaryheap_update_index_fn	bh_update_index;
+	void	   *bh_arg;
+	bh_node_type *bh_nodes;
+} xx_binaryheap;
+
+extern xx_binaryheap *xx_binaryheap_allocate(int num_nodes,
+											 xx_binaryheap_comparator compare,
+											 void *arg,
+											 xx_binaryheap_update_index_fn update_index);
+extern void xx_binaryheap_reset(xx_binaryheap *heap);
+extern void xx_binaryheap_free(xx_binaryheap *heap);
+extern void xx_binaryheap_add_unordered(xx_binaryheap *heap, bh_node_type d);
+extern void xx_binaryheap_build(xx_binaryheap *heap);
+extern void xx_binaryheap_add(xx_binaryheap *heap, bh_node_type d);
+extern bh_node_type xx_binaryheap_first(xx_binaryheap *heap);
+extern bh_node_type xx_binaryheap_remove_first(xx_binaryheap *heap);
+extern void xx_binaryheap_remove_node(xx_binaryheap *heap, int n);
+extern void xx_binaryheap_replace_first(xx_binaryheap *heap, bh_node_type d);
+extern void xx_binaryheap_update_up(xx_binaryheap *heap, int index);
+extern void xx_binaryheap_update_down(xx_binaryheap *heap, int index);
+
+#define xx_binaryheap_empty(h)			((h)->bh_size == 0)
+#define xx_binaryheap_size(h)			((h)->bh_size)
+#define xx_binaryheap_get_node(h, n)	((h)->bh_nodes[n])
+#define xx_binaryheap_indexed(h)		((h)->bh_update_index != NULL)
+
+#endif							/* XX_BINARYHEAP_H */
-- 
2.39.3

#73Jeff Davis
pgsql@j-davis.com
In reply to: Masahiko Sawada (#72)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:

IIUC for example in ReorderBuffer, the heap key is transaction size
and the hash key is xid.

Yes.

I see your point. It assumes that the bh_node_type is a pointer or at
least unique. So it cannot work with Datum being a value.

Right. One option might just be to add some comments explaining the API
and limitations, but in general I feel it's confusing to hash a pointer
without a good reason.

It sounds like a data structure that mixes the hash table and the
binary heap and we use it as the main storage (e.g. for
ReorderBufferTXN) instead of using the binary heap as the secondary
data structure. IIUC with your idea, the indexed binary heap has a
hash table to store elements each of which has its index within the
heap node array. I guess it's better to create it as a new data
structure rather than extending the existing binaryheap, since APIs
could be very different. I might be missing something, though.

You are right that this approach starts to feel like a new data
structure and is not v17 material.

I am interested in this for v18 though -- we could make the API more
like simplehash to be more flexible when using values (rather than
pointers) and to be able to inline the comparator.

* remove the hash table from binaryheap.c

* supply a new callback to the binary heap with type like:

  typedef void (*binaryheap_update_index)(
    bh_node_type node,
    int new_element_index);

* make the remove, update_up, and update_down methods take the element
  index rather than the pointer

...

This shows that the current indexed binaryheap is much slower than the
other two implementations, and the xx_binaryheap has a good number in
spite of also being indexed.

xx_binaryheap isn't indexed though, right?

When it comes to implementing the above idea (i.e. changing binaryheap
to xx_binaryheap), it was simple, since we just replace the code that
updates the hash table with a call to the callback, if we get
consensus on the API change.

That seems reasonable to me.

The fact that we use simplehash for the internal hash table might make
this idea complex. If I understand your suggestion correctly, the
caller needs to tell the hash table the hash function when creating a
binaryheap, but the hash function needs to be specified at compile
time. We could use a dynahash instead, but that would make the
binaryheap even slower.

simplehash.h supports private_data, which makes it easier to track a
callback.

In binaryheap.c, that would look something like:

static inline uint32
binaryheap_hash(bh_nodeidx_hash *tab, uint32 key)
{
	binaryheap_hashfunc hashfunc = tab->private_data;

	return hashfunc(key);
}

...
#define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key)
...

binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
					void *arg, binaryheap_hashfunc hashfunc)
{
	...
	if (hashfunc != NULL)
	{
		/* could have a new structure, but we only need to
		 * store one pointer, so don't bother with palloc/pfree */
		void	   *private_data = (void *) hashfunc;

		heap->bh_nodeidx = bh_nodeidx_create(..., private_data);
	...

And in reorderbuffer.c, define the callback like:

static uint32
reorderbuffer_xid_hash(TransactionId xid)
{
	/* fasthash32 is 'static inline' so may
	 * be faster than hash_bytes()? */
	return fasthash32(&xid, sizeof(TransactionId), 0);
}

In summary, there are two viable approaches for addressing the concerns
in v17:

1. Provide a callback to update ReorderBufferTXN->heap_element_index,
and use that index (rather than the pointer) for updating the heap key
(transaction size) or removing elements from the heap.

2. Provide a callback for hashing, so that binaryheap.c can hash the
xid value rather than the pointer.

I don't have a strong opinion about which one to use. I prefer
something closer to #1 for v18, but for v17 I suggest whichever one
comes out simpler.

Regards,
Jeff Davis

#74Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Jeff Davis (#73)
2 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Sat, Apr 6, 2024 at 5:44 AM Jeff Davis <pgsql@j-davis.com> wrote:

It sounds like a data structure that mixes the hash table and the
binary heap and we use it as the main storage (e.g. for
ReorderBufferTXN) instead of using the binary heap as the secondary
data structure. IIUC with your idea, the indexed binary heap has a
hash table to store elements each of which has its index within the
heap node array. I guess it's better to create it as a new data
structure rather than extending the existing binaryheap, since APIs
could be very different. I might be missing something, though.

You are right that this approach starts to feel like a new data
structure and is not v17 material.

I am interested in this for v18 though -- we could make the API more
like simplehash to be more flexible when using values (rather than
pointers) and to be able to inline the comparator.

Interesting project. It would be great if we could support increasing
and decreasing the key as APIs. The current
binaryheap_update_{up|down} APIs are not very user-friendly.
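
For example, a friendlier API could decide the sift direction itself
by comparing the updated node with its parent. A minimal sketch on top
of the index-based API in the attached 01 patch (binaryheap_update_key
is a hypothetical name, not an existing function):

/*
 * Hypothetical convenience wrapper: restore the heap property for the
 * node at 'index' after its key changed, without the caller having to
 * know whether the key increased or decreased.
 */
static void
binaryheap_update_key(binaryheap *heap, int index)
{
	/* If the node now sorts above its parent, it can only move up... */
	if (index > 0 &&
		heap->bh_compare(heap->bh_nodes[index],
						 heap->bh_nodes[(index - 1) / 2],	/* parent */
						 heap->bh_arg) > 0)
		binaryheap_update_up(heap, index);
	else
		/* ...otherwise it can only move down (or stay put). */
		binaryheap_update_down(heap, index);
}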

* remove the hash table from binaryheap.c

* supply a new callback to the binary heap with type like:

  typedef void (*binaryheap_update_index)(
    bh_node_type node,
    int new_element_index);

* make the remove, update_up, and update_down methods take the element
  index rather than the pointer

...

This shows that the current indexed binaryheap is much slower than the
other two implementations, and the xx_binaryheap has a good number in
spite of also being indexed.

xx_binaryheap isn't indexed though, right?

Well, yes. To be exact, xx_binaryheap isn't indexed, but each element's
index is stored in the element itself (see the test_elem struct), so the
caller can still update the key using xx_binaryheap_update_{up|down}.
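
Roughly, the arrangement looks like this (a sketch; the field names
and key type here are illustrative, not copied from the bench code):

typedef struct test_elem
{
	uint64		key;			/* the value the heap is ordered by */
	int			index;			/* current slot in bh_nodes, kept up to
								 * date by the update_index callback */
} test_elem;

/* Callback passed to xx_binaryheap_allocate(): record the new position. */
static void
test_elem_update_index(bh_node_type node, int new_element_index)
{
	((test_elem *) DatumGetPointer(node))->index = new_element_index;
}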

When it comes to implementing the above idea (i.e. changing binaryheap
to xx_binaryheap), it was simple, since we just replace the code that
updates the hash table with a call to the callback, if we get
consensus on the API change.

That seems reasonable to me.

The fact that we use simplehash for the internal hash table might make
this idea complex. If I understand your suggestion correctly, the
caller needs to tell the hash table the hash function when creating a
binaryheap, but the hash function needs to be specified at compile
time. We could use a dynahash instead, but that would make the
binaryheap even slower.

simplehash.h supports private_data, which makes it easier to track a
callback.

In binaryheap.c, that would look something like:

static inline uint32
binaryheap_hash(bh_nodeidx_hash *tab, uint32 key)
{
	binaryheap_hashfunc hashfunc = tab->private_data;

	return hashfunc(key);
}

...
#define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key)
...

binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
					void *arg, binaryheap_hashfunc hashfunc)
{
	...
	if (hashfunc != NULL)
	{
		/* could have a new structure, but we only need to
		 * store one pointer, so don't bother with palloc/pfree */
		void	   *private_data = (void *) hashfunc;

		heap->bh_nodeidx = bh_nodeidx_create(..., private_data);
	...

And in reorderbuffer.c, define the callback like:

static uint32
reorderbuffer_xid_hash(TransactionId xid)
{
	/* fasthash32 is 'static inline' so may
	 * be faster than hash_bytes()? */
	return fasthash32(&xid, sizeof(TransactionId), 0);
}

Thanks, that's a good idea.

In summary, there are two viable approaches for addressing the concerns
in v17:

1. Provide a callback to update ReorderBufferTXN->heap_element_index,
and use that index (rather than the pointer) for updating the heap key
(transaction size) or removing elements from the heap.

2. Provide a callback for hashing, so that binaryheap.c can hash the
xid value rather than the pointer.

I don't have a strong opinion about which one to use. I prefer
something closer to #1 for v18, but for v17 I suggest whichever one
comes out simpler.

I've implemented prototypes of both ideas, and attached the draft patches.

I agree with you that something closer to #1 is for v18. We could
probably implement the #1 idea while making the binaryheap code
templated like simplehash.h. For v17, the changes for #2 are smaller,
but I'm concerned that the new API, which requires a hash function in
order to use binaryheap_update_{up|down}, might not be user friendly.
In terms of APIs, I prefer the #1 idea. The changes for #1 also
simplify the binaryheap code, although they require adding a field to
ReorderBufferTXN instead. But overall, it removes the hash table and
some functions, so it looks better to me.

When it comes to performance overhead, I mentioned that there is some
regression in the current binaryheap even without indexing. Since
function call overhead contributed to the regression, inlining some
functions reduced it. For example, after inlining set_node() and
replace_node(), the same benchmark test I used earlier showed:

postgres(1:88476)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(false, 10000000 * (1+x-x)));
 x |   cnt    | load_ms | xx_load_ms | old_load_ms
---+----------+---------+------------+-------------
 1 | 10000000 |     502 |        624 |         427
 2 | 10000000 |     503 |        622 |         428
 3 | 10000000 |     502 |        621 |         427
(3 rows)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

01_use_update_index_func_in_binaryheap.patch (application/octet-stream)
diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index ce19e0837a..7f1c743d0c 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -422,8 +422,8 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
-											false,
-											gm_state);
+											gm_state,
+											NULL);
 }
 
 /*
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 3efebd537f..7493421331 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,8 +125,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
-											  mergestate);
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+											  mergestate, NULL);
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 251f75e91d..fe2091498f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -258,7 +258,7 @@ PgArchiverMain(char *startup_data, size_t startup_data_len)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, false,
+												ready_file_comparator, NULL,
 												NULL);
 
 	/* Initialize our memory context. */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cf28d4df4..17a44141c6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -293,6 +293,7 @@ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
 static void ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb);
 static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
+static void ReorderBufferTXNUpdateIndex(Datum d, int new_index);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -399,7 +400,7 @@ ReorderBufferAllocate(void)
 	 */
 	buffer->txn_heap = binaryheap_allocate(MAX_HEAP_TXN_COUNT_THRESHOLD * 2,
 										   ReorderBufferTXNSizeCompare,
-										   true, NULL);
+										   NULL, ReorderBufferTXNUpdateIndex);
 
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
@@ -1340,8 +1341,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
-									  false,
-									  state);
+									  state, NULL);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
 	*iter_state = state;
@@ -3277,7 +3277,7 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 			if ((txn->size - sz) == 0)
 				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
 			else
-				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+				binaryheap_update_up(rb->txn_heap, txn->heap_element_index);
 		}
 	}
 	else
@@ -3293,9 +3293,9 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 		if (ReorderBufferMaxHeapIsReady(rb))
 		{
 			if (txn->size == 0)
-				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
+				binaryheap_remove_node(rb->txn_heap, txn->heap_element_index);
 			else
-				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
+				binaryheap_update_down(rb->txn_heap, txn->heap_element_index);
 		}
 	}
 
@@ -3552,6 +3552,16 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 	}
 }
 
+/*
+ * Callback function for updating the transaction's index in the max-heap.
+ */
+static void
+ReorderBufferTXNUpdateIndex(Datum d, int new_index)
+{
+	ReorderBufferTXN *txn = (ReorderBufferTXN *) DatumGetPointer(d);
+
+	txn->heap_element_index = new_index;
+}
 
 /* Compare two transactions by size */
 static int
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 44836751b7..5c80ea4370 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3014,8 +3014,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
-								  false,
-								  NULL);
+								  NULL, NULL);
 
 	for (i = 0; i < num_spaces; i++)
 	{
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 465e9ce777..aed7ca8786 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4200,8 +4200,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
-									 false,
-									 NULL);
+									 NULL, NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 7362f7c961..9004704733 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index c20ed50acc..f24835e7c4 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -25,27 +25,6 @@
 #include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
-/*
- * Define parameters for hash table code generation. The interface is *also*
- * declared in binaryheap.h (to generate the types, which are externally
- * visible).
- */
-#define SH_PREFIX bh_nodeidx
-#define SH_ELEMENT_TYPE bh_nodeidx_entry
-#define SH_KEY_TYPE bh_node_type
-#define SH_KEY key
-#define SH_HASH_KEY(tb, key) \
-	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
-#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
-#define SH_SCOPE extern
-#ifdef FRONTEND
-#define SH_RAW_ALLOCATOR pg_malloc0
-#endif
-#define SH_STORE_HASH
-#define SH_GET_HASH(tb, a) a->hash
-#define SH_DEFINE
-#include "lib/simplehash.h"
-
 static void sift_down(binaryheap *heap, int node_off);
 static void sift_up(binaryheap *heap, int node_off);
 
@@ -57,35 +36,25 @@ static void sift_up(binaryheap *heap, int node_off);
  * function, which will be invoked with the additional argument specified by
  * 'arg'.
  *
- * If 'indexed' is true, we create a hash table to track each node's
- * index in the heap, enabling to perform some operations such as
- * binaryheap_remove_node_ptr() etc.
+ * The update_index callback function is optional. If it's set, the function
+ * is called whenever an individual node moves, to let the caller know the
+ * node's new index within bh_nodes.
  */
 binaryheap *
-binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
-					bool indexed, void *arg)
+binaryheap_allocate(int num_nodes, binaryheap_comparator compare, void *arg,
+					binaryheap_update_index_func update_index)
 {
 	binaryheap *heap;
 
 	heap = (binaryheap *) palloc(sizeof(binaryheap));
 	heap->bh_space = num_nodes;
 	heap->bh_compare = compare;
+	heap->bh_update_index = update_index;
 	heap->bh_arg = arg;
 
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * num_nodes);
-	heap->bh_nodeidx = NULL;
-
-	if (indexed)
-	{
-#ifdef FRONTEND
-		heap->bh_nodeidx = bh_nodeidx_create(num_nodes, NULL);
-#else
-		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, num_nodes,
-											 NULL);
-#endif
-	}
 
 	return heap;
 }
@@ -101,9 +70,6 @@ binaryheap_reset(binaryheap *heap)
 {
 	heap->bh_size = 0;
 	heap->bh_has_heap_property = true;
-
-	if (binaryheap_indexed(heap))
-		bh_nodeidx_reset(heap->bh_nodeidx);
 }
 
 /*
@@ -114,9 +80,6 @@ binaryheap_reset(binaryheap *heap)
 void
 binaryheap_free(binaryheap *heap)
 {
-	if (binaryheap_indexed(heap))
-		bh_nodeidx_destroy(heap->bh_nodeidx);
-
 	pfree(heap->bh_nodes);
 	pfree(heap);
 }
@@ -164,34 +127,15 @@ enlarge_node_array(binaryheap *heap)
  *
  * Return true if the node's index is already tracked.
  */
-static bool
+static void
 set_node(binaryheap *heap, bh_node_type node, int index)
 {
-	bool		found = false;
-
 	/* Set the node to the nodes array */
 	heap->bh_nodes[index] = node;
 
-	if (binaryheap_indexed(heap))
-	{
-		bh_nodeidx_entry *ent;
-
-		/* Keep track of the node index */
-		ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
-		ent->index = index;
-	}
-
-	return found;
-}
-
-/*
- * Remove the node's index from the hash table if the heap is indexed.
- */
-static inline void
-delete_nodeidx(binaryheap *heap, bh_node_type node)
-{
-	if (binaryheap_indexed(heap))
-		bh_nodeidx_delete(heap->bh_nodeidx, node);
+	/* Update the node index */
+	if (heap->bh_update_index != NULL)
+		heap->bh_update_index(node, index);
 }
 
 /*
@@ -203,21 +147,15 @@ delete_nodeidx(binaryheap *heap, bh_node_type node)
 static void
 replace_node(binaryheap *heap, int index, bh_node_type new_node)
 {
-	bool		found PG_USED_FOR_ASSERTS_ONLY;
-
 	/* Quick return if not necessary to move */
 	if (heap->bh_nodes[index] == new_node)
 		return;
 
-	/* Remove the overwritten node's index */
-	delete_nodeidx(heap, heap->bh_nodes[index]);
-
 	/*
 	 * Replace it with the given new node. This node's position must also be
 	 * tracked as we assume to replace the node with the existing node.
 	 */
-	found = set_node(heap, new_node, index);
-	Assert(!binaryheap_indexed(heap) || found);
+	set_node(heap, new_node, index);
 }
 
 /*
@@ -310,7 +248,6 @@ binaryheap_remove_first(binaryheap *heap)
 	if (heap->bh_size == 1)
 	{
 		heap->bh_size--;
-		delete_nodeidx(heap, result);
 
 		return result;
 	}
@@ -355,74 +292,33 @@ binaryheap_remove_node(binaryheap *heap, int n)
 }
 
 /*
- * binaryheap_remove_node_ptr
- *
- * Similar to binaryheap_remove_node() but removes the given node. The caller
- * must ensure that the given node is in the heap. O(log n) worst case.
+ * binaryheap_update_up
  *
- * This function can be used only if the heap is indexed.
+ * Sift the nth (zero based) node up after the node's key is updated. The
+ * caller must ensure that there are at least (n + 1) nodes in the heap.
+ * O(log n) worst case.
  */
 void
-binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
+binaryheap_update_up(binaryheap *heap, int n)
 {
-	bh_nodeidx_entry *ent;
-
-	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
-	Assert(binaryheap_indexed(heap));
-
-	ent = bh_nodeidx_lookup(heap->bh_nodeidx, d);
-	Assert(ent);
-
-	binaryheap_remove_node(heap, ent->index);
-}
-
-/*
- * Workhorse for binaryheap_update_up and binaryheap_update_down.
- */
-static void
-resift_node(binaryheap *heap, bh_node_type node, bool sift_dir_up)
-{
-	bh_nodeidx_entry *ent;
-
 	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
-	Assert(binaryheap_indexed(heap));
 
-	ent = bh_nodeidx_lookup(heap->bh_nodeidx, node);
-	Assert(ent);
-	Assert(ent->index >= 0 && ent->index < heap->bh_size);
-
-	if (sift_dir_up)
-		sift_up(heap, ent->index);
-	else
-		sift_down(heap, ent->index);
-}
-
-/*
- * binaryheap_update_up
- *
- * Sift the given node up after the node's key is updated. The caller must
- * ensure that the given node is in the heap. O(log n) worst case.
- *
- * This function can be used only if the heap is indexed.
- */
-void
-binaryheap_update_up(binaryheap *heap, bh_node_type d)
-{
-	resift_node(heap, d, true);
+	sift_up(heap, n);
 }
 
 /*
  * binaryheap_update_down
  *
- * Sift the given node down after the node's key is updated. The caller must
- * ensure that the given node is in the heap. O(log n) worst case.
- *
- * This function can be used only if the heap is indexed.
+ * Sift the nth (zero based) node down after the node's key is updated. The
+ * caller must ensure that there are at least (n + 1) nodes in the heap.
+ * O(log n) worst case.
  */
 void
-binaryheap_update_down(binaryheap *heap, bh_node_type d)
+binaryheap_update_down(binaryheap *heap, int n)
 {
-	resift_node(heap, d, false);
+	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
+
+	sift_down(heap, n);
 }
 
 /*
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 4c1a1bb274..f3093a6ec9 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -30,27 +30,11 @@ typedef Datum bh_node_type;
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
 /*
- * Struct for a hash table element to store the node's index in the bh_nodes
- * array.
+ * This callback function is called whenever the node's position within
+ * the node array (i.e. bh_nodes) changes.
  */
-typedef struct bh_nodeidx_entry
-{
-	bh_node_type key;
-	int			index;			/* entry's index within the node array */
-	char		status;			/* hash status */
-	uint32		hash;			/* hash values (cached) */
-} bh_nodeidx_entry;
-
-/* Define parameters necessary to generate the hash table interface. */
-#define SH_PREFIX bh_nodeidx
-#define SH_ELEMENT_TYPE bh_nodeidx_entry
-#define SH_KEY_TYPE bh_node_type
-#define SH_SCOPE extern
-#ifdef FRONTEND
-#define SH_RAW_ALLOCATOR pg_malloc0
-#endif
-#define SH_DECLARE
-#include "lib/simplehash.h"
+typedef void (*binaryheap_update_index_func) (bh_node_type d,
+											  int new_element_index);
 
 /*
  * binaryheap
@@ -59,6 +43,7 @@ typedef struct bh_nodeidx_entry
  *		bh_space		how many nodes can be stored in "nodes"
  *		bh_has_heap_property	no unordered operations since last heap build
  *		bh_compare		comparison function to define the heap property
+ *		bh_update_index	callback invoked when a node's array index changes
  *		bh_arg			user data for comparison function
  *		bh_nodes		variable-length array of "space" nodes
  */
@@ -68,20 +53,15 @@ typedef struct binaryheap
 	int			bh_space;
 	bool		bh_has_heap_property;	/* debugging cross-check */
 	binaryheap_comparator bh_compare;
+	binaryheap_update_index_func bh_update_index;
 	void	   *bh_arg;
 	bh_node_type *bh_nodes;
-
-	/*
-	 * If bh_nodeidx is not NULL, the bh_nodeidx is used to track of each
-	 * node's index in bh_nodes. This enables the caller to perform
-	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
-	 */
-	bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int num_nodes,
 									   binaryheap_comparator compare,
-									   bool indexed, void *arg);
+									   void *arg,
+									   binaryheap_update_index_func update_index);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
@@ -90,14 +70,12 @@ extern void binaryheap_add(binaryheap *heap, bh_node_type d);
 extern bh_node_type binaryheap_first(binaryheap *heap);
 extern bh_node_type binaryheap_remove_first(binaryheap *heap);
 extern void binaryheap_remove_node(binaryheap *heap, int n);
-extern void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d);
 extern void binaryheap_replace_first(binaryheap *heap, bh_node_type d);
-extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
-extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_up(binaryheap *heap, int n);
+extern void binaryheap_update_down(binaryheap *heap, int n);
 
 #define binaryheap_empty(h)			((h)->bh_size == 0)
 #define binaryheap_size(h)			((h)->bh_size)
 #define binaryheap_get_node(h, n)	((h)->bh_nodes[n])
-#define binaryheap_indexed(h)		((h)->bh_nodeidx != NULL)
 
 #endif							/* BINARYHEAP_H */
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a5aec01c2f..202e976b2d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -410,6 +410,9 @@ typedef struct ReorderBufferTXN
 	/* Size of top-transaction including sub-transactions. */
 	Size		total_size;
 
+	/* Index of this transaction in the rb->txn_heap max-heap */
+	int			heap_element_index;
+
 	/* If we have detected concurrent abort then ignore future changes. */
 	bool		concurrent_abort;
 
02_use_hashfunc_in_binaryheap.patch (application/octet-stream)
diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index ce19e0837a..6ed6aea7af 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -422,8 +422,7 @@ gather_merge_setup(GatherMergeState *gm_state)
 	/* Allocate the resources for the merge */
 	gm_state->gm_heap = binaryheap_allocate(nreaders + 1,
 											heap_compare_slots,
-											false,
-											gm_state);
+											gm_state, NULL);
 }
 
 /*
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 3efebd537f..7493421331 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -125,8 +125,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	mergestate->ms_nplans = nplans;
 
 	mergestate->ms_slots = (TupleTableSlot **) palloc0(sizeof(TupleTableSlot *) * nplans);
-	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots, false,
-											  mergestate);
+	mergestate->ms_heap = binaryheap_allocate(nplans, heap_compare_slots,
+											  mergestate, NULL);
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 251f75e91d..fe2091498f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -258,7 +258,7 @@ PgArchiverMain(char *startup_data, size_t startup_data_len)
 
 	/* Initialize our max-heap for prioritizing files to archive. */
 	arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
-												ready_file_comparator, false,
+												ready_file_comparator, NULL,
 												NULL);
 
 	/* Initialize our memory context. */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cf28d4df4..9940c14d48 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -106,6 +106,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
+#include "common/hashfn_unstable.h"
 #include "common/int.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
@@ -293,6 +294,7 @@ static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
 static void ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb);
 static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
+static uint32 ReorderBufferXidHash(Datum d);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -399,7 +401,7 @@ ReorderBufferAllocate(void)
 	 */
 	buffer->txn_heap = binaryheap_allocate(MAX_HEAP_TXN_COUNT_THRESHOLD * 2,
 										   ReorderBufferTXNSizeCompare,
-										   true, NULL);
+										   NULL, ReorderBufferXidHash);
 
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
@@ -1340,8 +1342,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	/* allocate heap */
 	state->heap = binaryheap_allocate(state->nr_txns,
 									  ReorderBufferIterCompare,
-									  false,
-									  state);
+									  state, NULL);
 
 	/* Now that the state fields are initialized, it is safe to return it. */
 	*iter_state = state;
@@ -3552,6 +3553,14 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 	}
 }
 
+/* Hash function for rb->txn_heap */
+static uint32
+ReorderBufferXidHash(Datum d)
+{
+	ReorderBufferTXN *txn = (ReorderBufferTXN *) DatumGetPointer(d);
+
+	return fasthash32((const char *) &(txn->xid), sizeof(TransactionId), 0);
+}
 
 /* Compare two transactions by size */
 static int
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 44836751b7..5c80ea4370 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3014,8 +3014,7 @@ BufferSync(int flags)
 	 */
 	ts_heap = binaryheap_allocate(num_spaces,
 								  ts_ckpt_progress_comparator,
-								  false,
-								  NULL);
+								  NULL, NULL);
 
 	for (i = 0; i < num_spaces; i++)
 	{
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 465e9ce777..aed7ca8786 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -4200,8 +4200,7 @@ restore_toc_entries_parallel(ArchiveHandle *AH, ParallelState *pstate,
 	/* Set up ready_heap with enough room for all known TocEntrys */
 	ready_heap = binaryheap_allocate(AH->tocCount,
 									 TocEntrySizeCompareBinaryheap,
-									 false,
-									 NULL);
+									 NULL, NULL);
 
 	/*
 	 * The pending_list contains all items that we need to restore.  Move all
diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index 7362f7c961..9004704733 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -405,7 +405,7 @@ TopoSort(DumpableObject **objs,
 		return true;
 
 	/* Create workspace for the above-described heap */
-	pendingHeap = binaryheap_allocate(numObjs, int_cmp, false, NULL);
+	pendingHeap = binaryheap_allocate(numObjs, int_cmp, NULL, NULL);
 
 	/*
 	 * Scan the constraints, and for each item in the input, generate a count
diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index c20ed50acc..e077905bd1 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -25,6 +25,12 @@
 #include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * This needs to be declared before the hash table code generation, as
+ * it is referenced during code generation.
+ */
+static inline uint32 binaryheap_hash(bh_nodeidx_hash *tab, bh_node_type key);
+
 /*
  * Define parameters for hash table code generation. The interface is *also*
  * declared in binaryheap.h (to generate the types, which are externally
@@ -34,8 +40,7 @@
 #define SH_ELEMENT_TYPE bh_nodeidx_entry
 #define SH_KEY_TYPE bh_node_type
 #define SH_KEY key
-#define SH_HASH_KEY(tb, key) \
-	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
+#define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key)
 #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
 #define SH_SCOPE extern
 #ifdef FRONTEND
@@ -63,7 +68,7 @@ static void sift_up(binaryheap *heap, int node_off);
  */
 binaryheap *
 binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
-					bool indexed, void *arg)
+					void *arg, binaryheap_hashfunc hashfunc)
 {
 	binaryheap *heap;
 
@@ -77,13 +82,15 @@ binaryheap_allocate(int num_nodes, binaryheap_comparator compare,
 	heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * num_nodes);
 	heap->bh_nodeidx = NULL;
 
-	if (indexed)
+	if (hashfunc != NULL)
 	{
+		void *private_data = (void *) hashfunc;
+
 #ifdef FRONTEND
-		heap->bh_nodeidx = bh_nodeidx_create(num_nodes, NULL);
+		heap->bh_nodeidx = bh_nodeidx_create(num_nodes, private_data);
 #else
 		heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, num_nodes,
-											 NULL);
+											 private_data);
 #endif
 	}
 
@@ -220,6 +227,18 @@ replace_node(binaryheap *heap, int index, bh_node_type new_node)
 	Assert(!binaryheap_indexed(heap) || found);
 }
 
+/*
+ * Return the hash value of the given key using the caller-specified hash
+ * function.
+ */
+static inline uint32
+binaryheap_hash(bh_nodeidx_hash *tab, bh_node_type key)
+{
+	binaryheap_hashfunc hashfunc = tab->private_data;
+
+	return hashfunc(key);
+}
+
 /*
  * binaryheap_add_unordered
  *
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 4c1a1bb274..91537da291 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,6 +29,12 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
+/*
+ * The hash function must return a hash value for the given key. The
+ * returned value is used by the bh_nodeidx hash table to bucket entries.
+ */
+typedef uint32 (*binaryheap_hashfunc) (bh_node_type key);
+
 /*
  * Struct for a hash table element to store the node's index in the bh_nodes
  * array.
@@ -81,7 +87,8 @@ typedef struct binaryheap
 
 extern binaryheap *binaryheap_allocate(int num_nodes,
 									   binaryheap_comparator compare,
-									   bool indexed, void *arg);
+									   void *arg,
+									   binaryheap_hashfunc hashfunc);
 extern void binaryheap_reset(binaryheap *heap);
 extern void binaryheap_free(binaryheap *heap);
 extern void binaryheap_add_unordered(binaryheap *heap, bh_node_type d);
#75Jeff Davis
pgsql@j-davis.com
In reply to: Masahiko Sawada (#74)
Re: Improve eviction algorithm in ReorderBuffer

On Mon, 2024-04-08 at 21:29 +0900, Masahiko Sawada wrote:

For v17, changes for #2 are smaller, but I'm concerned
that the new API that requires a hash function to be able to use
binaryheap_update_{up|down} might not be user friendly.

The only API change in 02 is accepting a hash callback rather than a
boolean in binaryheap_allocate(), so I don't see that as worse than
what's there now. It also directly fixes my complaint (hashing the
pointer) and does nothing more, so I think it's the right fix for now.

I do think that the API can be better (templated like simplehash), but
I don't think right now is a great time to change it.

When it comes to performance overhead, I mentioned that there is some
regression in the current binaryheap even without indexing.

As far as I can tell, you are just adding a single branch in that path,
and I would expect it to be a predictable branch, right?

Thank you for testing, but small differences in a microbenchmark aren't
terribly worrying for me. If other call sites are that sensitive to
binaryheap performance, the right answer is to have a templated version
that would not only avoid this unnecessary branch, but also inline the
comparator (which probably matters more).
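
To sketch what that could look like, an instantiation might follow the
simplehash.h conventions; every macro and header name below is
hypothetical (there is no binaryheap template header today):

/* Hypothetical templated instantiation, in the style of simplehash.h. */
#define BH_PREFIX txnheap
#define BH_ELEMENT_TYPE ReorderBufferTXN *
/* The comparator is a macro, so each call site can inline it. */
#define BH_COMPARE(a, b) \
	(((a)->size < (b)->size) ? -1 : ((a)->size == (b)->size) ? 0 : 1)
#define BH_SCOPE static inline
#define BH_DECLARE
#define BH_DEFINE
#include "lib/binaryheap_template.h"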

Regards,
Jeff Davis

#76Jeff Davis
pgsql@j-davis.com
In reply to: Masahiko Sawada (#72)
1 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:

I have some further comments and I believe changes are required for
v17.

I also noticed that the simplehash is declared in binaryheap.h with
"SH_SCOPE extern", which seems wrong. Attached is a rough patch to
bring the declarations into binaryheap.c.

Note that the attached patch uses "SH_SCOPE static", which makes sense
to me in this case, but causes a bunch of warnings in gcc. I will post
separately about silencing that warning, but for now you can either
use:

SH_SCOPE static inline

which is probably fine, but will encourage the compiler to inline more
code, when not all callers even use the hash table. Alternatively, you
can do:

SH_SCOPE static pg_attribute_unused()

which looks a bit wrong to me, but seems to silence the warnings, and
lets the compiler decide whether or not to inline.
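
For clarity, the difference between the two is a single line in
binaryheap.c, before the simplehash.h include:

#define SH_SCOPE static inline			/* may inline more than needed */
/* or */
#define SH_SCOPE static pg_attribute_unused()	/* just silences the warning */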

Also probably needs comment updates, etc.

Regards,
Jeff Davis

Attachments:

v1-0001-binaryheap-move-hash-table-out-of-header-into-bin.patch (text/x-patch)
From 08dcf21646e4ded22b10fd0ed536d3bbf6fc1328 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 9 Apr 2024 10:45:00 -0700
Subject: [PATCH v1] binaryheap: move hash table out of header into
 binaryheap.c.

Commit b840508644 declared the internal hash table in the header with
scope "extern", which was unnecessary.
---
 src/common/binaryheap.c      | 15 ++++++++++++++-
 src/include/lib/binaryheap.h | 25 +------------------------
 2 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index c20ed50acc..2501dad6f2 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -25,6 +25,18 @@
 #include "common/hashfn.h"
 #include "lib/binaryheap.h"
 
+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;			/* entry's index within the node array */
+	char		status;			/* hash status */
+	uint32		hash;			/* hash values (cached) */
+} bh_nodeidx_entry;
+
 /*
  * Define parameters for hash table code generation. The interface is *also*
  * declared in binaryheap.h (to generate the types, which are externally
@@ -37,12 +49,13 @@
 #define SH_HASH_KEY(tb, key) \
 	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
 #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
-#define SH_SCOPE extern
+#define SH_SCOPE static
 #ifdef FRONTEND
 #define SH_RAW_ALLOCATOR pg_malloc0
 #endif
 #define SH_STORE_HASH
 #define SH_GET_HASH(tb, a) a->hash
+#define SH_DECLARE
 #define SH_DEFINE
 #include "lib/simplehash.h"
 
diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 4c1a1bb274..8b47132fc3 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,29 +29,6 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);
 
-/*
- * Struct for a hash table element to store the node's index in the bh_nodes
- * array.
- */
-typedef struct bh_nodeidx_entry
-{
-	bh_node_type key;
-	int			index;			/* entry's index within the node array */
-	char		status;			/* hash status */
-	uint32		hash;			/* hash values (cached) */
-} bh_nodeidx_entry;
-
-/* Define parameters necessary to generate the hash table interface. */
-#define SH_PREFIX bh_nodeidx
-#define SH_ELEMENT_TYPE bh_nodeidx_entry
-#define SH_KEY_TYPE bh_node_type
-#define SH_SCOPE extern
-#ifdef FRONTEND
-#define SH_RAW_ALLOCATOR pg_malloc0
-#endif
-#define SH_DECLARE
-#include "lib/simplehash.h"
-
 /*
  * binaryheap
  *
@@ -76,7 +53,7 @@ typedef struct binaryheap
 	 * node's index in bh_nodes. This enables the caller to perform
 	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
 	 */
-	bh_nodeidx_hash *bh_nodeidx;
+	struct bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;
 
 extern binaryheap *binaryheap_allocate(int num_nodes,
-- 
2.34.1

#77Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jeff Davis (#76)
Re: Improve eviction algorithm in ReorderBuffer

On 09/04/2024 21:04, Jeff Davis wrote:

On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:

I have some further comments and I believe changes are required for
v17.

I also noticed that the simplehash is declared in binaryheap.h with
"SH_SCOPE extern", which seems wrong. Attached is a rough patch to
bring the declarations into binaryheap.c.

Note that the attached patch uses "SH_SCOPE static", which makes sense
to me in this case, but causes a bunch of warnings in gcc. I will post
separately about silencing that warning, but for now you can either
use:

SH_SCOPE static inline

which is probably fine, but will encourage the compiler to inline more
code, when not all callers even use the hash table. Alternatively, you
can do:

SH_SCOPE static pg_attribute_unused()

which looks a bit wrong to me, but seems to silence the warnings, and
lets the compiler decide whether or not to inline.

Also probably needs comment updates, etc.

Sorry I'm late to the party, I didn't pay attention to this thread
earlier. But I wonder why this doesn't use the existing pairing heap
implementation? I would assume that to be at least as good as the binary
heap + hash table. And it's cheap to insert into (O(1)), so we could
probably remove the MAX_HEAP_TXN_COUNT_THRESHOLD, and always keep the
heap up-to-date.
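
For reference, the setup would be along these lines with the existing
pairingheap API (just a sketch; the txn_node member is an assumption
about where the pairingheap_node would be embedded in ReorderBufferTXN):

/* Max-heap on transaction size: return >0 if 'a' sorts above 'b'. */
static int
ReorderBufferTXNSizeCompare(const pairingheap_node *a,
							const pairingheap_node *b, void *arg)
{
	const ReorderBufferTXN *ta =
		pairingheap_const_container(ReorderBufferTXN, txn_node, a);
	const ReorderBufferTXN *tb =
		pairingheap_const_container(ReorderBufferTXN, txn_node, b);

	if (ta->size > tb->size)
		return 1;
	if (ta->size < tb->size)
		return -1;
	return 0;
}

...
rb->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);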

--
Heikki Linnakangas
Neon (https://neon.tech)

#78Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#77)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, 2024-04-09 at 23:49 +0300, Heikki Linnakangas wrote:

I wonder why this doesn't use the existing pairing heap
implementation? I would assume that to be at least as good as the
binary
heap + hash table

I agree that an additional hash table is not needed -- there's already
a hash table to do a lookup based on xid in reorderbuffer.c.

I had suggested that the heap could track the element indexes for
efficient update/removal, but that would be a change to the
binaryheap.h API, which would require some discussion (and possibly not
be acceptable post-freeze).

But I think you're right: a pairing heap already solves the problem
without modification. (Note: our pairing heap API doesn't explicitly
support updating a key, so I think it would need to be done with
remove/add.) So we might as well just do that right now rather than
trying to fix the way the hash table is being used or trying to extend
the binaryheap API.
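
The remove/add dance for a size change would look roughly like this (a
sketch only; the helper name and txn_node member are assumptions):

/*
 * Sketch: "update key" as remove + re-add, since pairingheap has no
 * native increase/decrease-key operation. Assumes txn is currently in
 * the heap (i.e. its old size was > 0).
 */
static void
ReorderBufferTXNSizeChanged(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	pairingheap_remove(rb->txn_heap, &txn->txn_node);
	if (txn->size > 0)
		pairingheap_add(rb->txn_heap, &txn->txn_node);
}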

Of course, we should measure to be sure there aren't bad cases around
updating/removing a key, but I don't see a fundamental reason that it
should be worse.

Regards,
Jeff Davis

#79Michael Paquier
michael@paquier.xyz
In reply to: Jeff Davis (#78)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Apr 09, 2024 at 06:24:43PM -0700, Jeff Davis wrote:

I had suggested that the heap could track the element indexes for
efficient update/removal, but that would be a change to the
binaryheap.h API, which would require some discussion (and possibly not
be acceptable post-freeze).

But I think you're right: a pairing heap already solves the problem
without modification. (Note: our pairing heap API doesn't explicitly
support updating a key, so I think it would need to be done with
remove/add.) So we might as well just do that right now rather than
trying to fix the way the hash table is being used or trying to extend
the binaryheap API.

Of course, we should measure to be sure there aren't bad cases around
updating/removing a key, but I don't see a fundamental reason that it
should be worse.

This is going to require a rewrite of 5bec1d6bc5e3 with a new
performance study, which strikes me as something that we'd better not
do after feature freeze. Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

I have added an open item, for now.
--
Michael

#80Jeff Davis
pgsql@j-davis.com
In reply to: Michael Paquier (#79)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

That's a reasonable conclusion. Also consider commits b840508644 and
bcb14f4abc.

I had tried to come up with a narrower fix, and I think it's already
been implemented here in approach 2:

/messages/by-id/CAD21AoAtf12e9Z9NLBuaO1GjHMMo16_8R-yBu9Q9jrk2QLqMEA@mail.gmail.com

but it does feel wrong to introduce an unnecessary hash table in 17
when we know it's not the right solution.

Regards,
Jeff Davis

#81Michael Paquier
michael@paquier.xyz
In reply to: Jeff Davis (#80)
Re: Improve eviction algorithm in ReorderBuffer

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

Also consider commits b840508644 and bcb14f4abc.

Indeed. These are also linked.
--
Michael

#82Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#81)
Re: Improve eviction algorithm in ReorderBuffer

On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

Also consider commits b840508644 and bcb14f4abc.

Indeed. These are also linked.

I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch to the pairing heap,
performance test that, and revert the binary heap changes.

--
Heikki Linnakangas
Neon (https://neon.tech)

#83Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#82)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

Also consider commits b840508644 and bcb14f4abc.

Indeed. These are also linked.

I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch to the pairing heap,
performance test that, and revert the binary heap changes.

+1.

--
With Regards,
Amit Kapila.

#84Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#82)
Re: Improve eviction algorithm in ReorderBuffer

On Wed, 2024-04-10 at 08:30 +0300, Heikki Linnakangas wrote:

My #1 choice would be to write a patch to switch to the pairing heap,
performance test that, and revert the binary heap changes.

Sounds good to me. I would expect it to perform better than the extra
hash table, if anything.

It also has the advantage that we don't change the API for binaryheap
in 17.

Regards,
Jeff Davis

#85Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Amit Kapila (#83)
1 attachment(s)
Re: Improve eviction algorithm in ReorderBuffer

On 10/04/2024 08:31, Amit Kapila wrote:

On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

Also consider commits b840508644 and bcb14f4abc.

Indeed. These are also linked.

I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch the pairing heap,
performance test that, and revert the binary heap changes.

+1.

To move this forward, here's a patch to switch to a pairing heap. In my
very quick testing, with the performance test cases posted earlier in
this thread [1] [2], I'm seeing no meaningful performance difference
between this and what's in master currently.
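
With the pairing heap, finding the eviction victim is just peeking the
root. Roughly (a sketch; the txn_node member name is an assumption,
see the attached patch for the real code):

/* The largest transaction is the root of the max-heap: O(1). */
static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)
{
	ReorderBufferTXN *largest;

	Assert(!pairingheap_is_empty(rb->txn_heap));
	largest = pairingheap_container(ReorderBufferTXN, txn_node,
									pairingheap_first(rb->txn_heap));
	Assert(largest->size > 0);

	return largest;
}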

Sawada-san, what do you think of this? To be sure, if you could also
repeat the performance tests you performed earlier, that'd be great. If
you agree with this direction, and you're happy with this patch, feel
free to take it from here and commit it, and also revert commits
b840508644 and bcb14f4abc.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Replace-binaryheap-index-with-pairingheap.patch (text/x-patch)
From c883d03bd221341b0bbb315d376d9e125b84329a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 10 Apr 2024 08:09:38 +0300
Subject: [PATCH 1/1] Replace binaryheap + index with pairingheap

A pairing heap can perform the same operations as the binary heap +
index, with as good or better algorithmic complexity, and that's an
existing data structure so that we don't need to invent anything new
compared to v16. This commit makes the new binaryheap functionality
that was added in commits b840508644 and bcb14f4abc unnecessary, but
those commits will be reverted separately.

Remove the optimization to only build and maintain the heap when the
amount of memory used is close to the limit, because the bookkeeping
overhead with the pairing heap seems to be small enough that it
doesn't matter in practice.
---
 .../replication/logical/reorderbuffer.c       | 186 +++---------------
 src/include/replication/reorderbuffer.h       |   8 +-
 2 files changed, 29 insertions(+), 165 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cf28d4df42..ab2ddabfadd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -68,19 +68,9 @@
  *	  memory gets actually freed.
  *
  *	  We use a max-heap with transaction size as the key to efficiently find
- *	  the largest transaction. While the max-heap is empty, we don't update
- *	  the max-heap when updating the memory counter. Therefore, we can get
- *	  the largest transaction in O(N) time, where N is the number of
- *	  transactions including top-level transactions and subtransactions.
- *
- *	  We build the max-heap just before selecting the largest transactions
- *	  if the number of transactions being decoded is higher than the threshold,
- *	  MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
- *	  update the max-heap when updating the memory counter. The intention is
- *	  to efficiently find the largest transaction in O(1) time instead of
- *	  incurring the cost of memory counter updates (O(log N)). Once the number
- *	  of transactions got lower than the threshold, we reset the max-heap
- *	  (refer to ReorderBufferMaybeResetMaxHeap() for details).
+ *	  the largest transaction. We update the max-heap whenever the memory
+ *	  counter is updated; however transactions with size 0 are not stored in
+ *	  the heap, because they have no changes to evict.
  *
  *	  We still rely on max_changes_in_memory when loading serialized changes
  *	  back into memory. At that point we can't use the memory limit directly
@@ -122,23 +112,6 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"
 
-/*
- * Threshold of the total number of top-level and sub transactions that
- * controls whether we use the max-heap for tracking their sizes. Although
- * using the max-heap to select the largest transaction is effective when
- * there are many transactions being decoded, maintaining the max-heap while
- * updating the memory statistics can be costly. Therefore, we use
- * MaxConnections as the threshold so that we use the max-heap only when
- * using subtransactions.
- */
-#define MAX_HEAP_TXN_COUNT_THRESHOLD	MaxConnections
-
-/*
- * A macro to check if the max-heap is ready to use and needs to be updated
- * accordingly.
- */
-#define ReorderBufferMaxHeapIsReady(rb) !binaryheap_empty((rb)->txn_heap)
-
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
 {
@@ -290,9 +263,7 @@ static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 static void ReorderBufferCleanupSerializedTXNs(const char *slotname);
 static void ReorderBufferSerializedPath(char *path, ReplicationSlot *slot,
 										TransactionId xid, XLogSegNo segno);
-static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb);
-static void ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb);
-static int	ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg);
+static int	ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg);
 
 static void ReorderBufferFreeSnap(ReorderBuffer *rb, Snapshot snap);
 static Snapshot ReorderBufferCopySnap(ReorderBuffer *rb, Snapshot orig_snap,
@@ -390,16 +361,8 @@ ReorderBufferAllocate(void)
 	buffer->outbufsize = 0;
 	buffer->size = 0;
 
-	/*
-	 * The binaryheap is indexed for faster manipulations.
-	 *
-	 * We allocate the initial heap size greater than
-	 * MAX_HEAP_TXN_COUNT_THRESHOLD because the txn_heap will not be used
-	 * until the threshold is exceeded.
-	 */
-	buffer->txn_heap = binaryheap_allocate(MAX_HEAP_TXN_COUNT_THRESHOLD * 2,
-										   ReorderBufferTXNSizeCompare,
-										   true, NULL);
+	/* txn_heap is ordered by transaction size */
+	buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
 
 	buffer->spillTxns = 0;
 	buffer->spillCount = 0;
@@ -1637,12 +1600,6 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	/* deallocate */
 	ReorderBufferReturnTXN(rb, txn);
-
-	/*
-	 * After cleaning up one transaction, the number of transactions might get
-	 * lower than the threshold for the max-heap.
-	 */
-	ReorderBufferMaybeResetMaxHeap(rb);
 }
 
 /*
@@ -3265,20 +3222,18 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 
 	if (addition)
 	{
+		Size		oldsize = txn->size;
+
 		txn->size += sz;
 		rb->size += sz;
 
 		/* Update the total size in the top transaction. */
 		toptxn->total_size += sz;
 
-		/* Update the max-heap as well if necessary */
-		if (ReorderBufferMaxHeapIsReady(rb))
-		{
-			if ((txn->size - sz) == 0)
-				binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
-			else
-				binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
-		}
+		/* Update the max-heap */
+		if (oldsize != 0)
+			pairingheap_remove(rb->txn_heap, &txn->txn_node);
+		pairingheap_add(rb->txn_heap, &txn->txn_node);
 	}
 	else
 	{
@@ -3289,14 +3244,10 @@ ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
 		/* Update the total size in the top transaction. */
 		toptxn->total_size -= sz;
 
-		/* Update the max-heap as well if necessary */
-		if (ReorderBufferMaxHeapIsReady(rb))
-		{
-			if (txn->size == 0)
-				binaryheap_remove_node_ptr(rb->txn_heap, PointerGetDatum(txn));
-			else
-				binaryheap_update_down(rb->txn_heap, PointerGetDatum(txn));
-		}
+		/* Update the max-heap */
+		pairingheap_remove(rb->txn_heap, &txn->txn_node);
+		if (txn->size != 0)
+			pairingheap_add(rb->txn_heap, &txn->txn_node);
 	}
 
 	Assert(txn->size <= rb->size);
@@ -3555,10 +3506,10 @@ ReorderBufferSerializeReserve(ReorderBuffer *rb, Size sz)
 
 /* Compare two transactions by size */
 static int
-ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
+ReorderBufferTXNSizeCompare(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 {
-	ReorderBufferTXN *ta = (ReorderBufferTXN *) DatumGetPointer(a);
-	ReorderBufferTXN *tb = (ReorderBufferTXN *) DatumGetPointer(b);
+	const ReorderBufferTXN *ta = pairingheap_const_container(ReorderBufferTXN, txn_node, a);
+	const ReorderBufferTXN *tb = pairingheap_const_container(ReorderBufferTXN, txn_node, b);
 
 	if (ta->size < tb->size)
 		return -1;
@@ -3567,49 +3518,6 @@ ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)
 	return 0;
 }
 
-/*
- * Build the max-heap. The heap assembly step is deferred  until the end, for
- * efficiency.
- */
-static void
-ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
-{
-	HASH_SEQ_STATUS hash_seq;
-	ReorderBufferTXNByIdEnt *ent;
-
-	Assert(binaryheap_empty(rb->txn_heap));
-
-	hash_seq_init(&hash_seq, rb->by_txn);
-	while ((ent = hash_seq_search(&hash_seq)) != NULL)
-	{
-		ReorderBufferTXN *txn = ent->txn;
-
-		if (txn->size == 0)
-			continue;
-
-		binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
-	}
-
-	binaryheap_build(rb->txn_heap);
-}
-
-/*
- * Reset the max-heap if the number of transactions got lower than the
- * threshold.
- */
-static void
-ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb)
-{
-	/*
-	 * If we add and remove transactions right around the threshold, we could
-	 * easily end up "thrashing". To avoid it, we adapt 10% of transactions to
-	 * reset the max-heap.
-	 */
-	if (ReorderBufferMaxHeapIsReady(rb) &&
-		binaryheap_size(rb->txn_heap) < MAX_HEAP_TXN_COUNT_THRESHOLD * 0.9)
-		binaryheap_reset(rb->txn_heap);
-}
-
 /*
  * Find the largest transaction (toplevel or subxact) to evict (spill to disk)
  * by doing a linear search or using the max-heap depending on the number of
@@ -3619,55 +3527,11 @@ ReorderBufferMaybeResetMaxHeap(ReorderBuffer *rb)
 static ReorderBufferTXN *
 ReorderBufferLargestTXN(ReorderBuffer *rb)
 {
-	ReorderBufferTXN *largest = NULL;
-
-	if (!ReorderBufferMaxHeapIsReady(rb))
-	{
-		/*
-		 * If the number of transactions are small, we scan all transactions
-		 * being decoded to get the largest transaction. This saves the cost
-		 * of building a max-heap with a small number of transactions.
-		 */
-		if (hash_get_num_entries(rb->by_txn) < MAX_HEAP_TXN_COUNT_THRESHOLD)
-		{
-			HASH_SEQ_STATUS hash_seq;
-			ReorderBufferTXNByIdEnt *ent;
-
-			hash_seq_init(&hash_seq, rb->by_txn);
-			while ((ent = hash_seq_search(&hash_seq)) != NULL)
-			{
-				ReorderBufferTXN *txn = ent->txn;
-
-				/* if the current transaction is larger, remember it */
-				if ((!largest) || (txn->size > largest->size))
-					largest = txn;
-			}
-
-			Assert(largest);
-		}
-		else
-		{
-			/*
-			 * There are a large number of transactions in ReorderBuffer. We
-			 * build the max-heap for efficiently selecting the largest
-			 * transactions.
-			 */
-			ReorderBufferBuildMaxHeap(rb);
-
-			/*
-			 * The max-heap is ready now. We remain the max-heap at least
-			 * until we free up enough transactions to bring the total memory
-			 * usage below the limit. The largest transaction is selected
-			 * below.
-			 */
-			Assert(ReorderBufferMaxHeapIsReady(rb));
-		}
-	}
+	ReorderBufferTXN *largest;
 
 	/* Get the largest transaction from the max-heap */
-	if (ReorderBufferMaxHeapIsReady(rb))
-		largest = (ReorderBufferTXN *)
-			DatumGetPointer(binaryheap_first(rb->txn_heap));
+	largest = pairingheap_container(ReorderBufferTXN, txn_node,
+									pairingheap_first(rb->txn_heap));
 
 	Assert(largest);
 	Assert(largest->size > 0);
@@ -3812,12 +3676,6 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
 	/* We must be under the memory limit now. */
 	Assert(rb->size < logical_decoding_work_mem * 1024L);
 
-	/*
-	 * After evicting some transactions, the number of transactions might get
-	 * lower than the threshold for the max-heap.
-	 */
-	ReorderBufferMaybeResetMaxHeap(rb);
-
 }
 
 /*
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index a5aec01c2f0..5ef6be0385a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/binaryheap.h"
 #include "lib/ilist.h"
+#include "lib/pairingheap.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
@@ -402,6 +403,11 @@ typedef struct ReorderBufferTXN
 	 */
 	dlist_node	catchange_node;
 
+	/*
+	 * A node in txn_heap
+	 */
+	pairingheap_node txn_node;
+
 	/*
 	 * Size of this transaction (changes currently in memory, in bytes).
 	 */
@@ -633,7 +639,7 @@ struct ReorderBuffer
 	Size		size;
 
 	/* Max-heap for sizes of all top-level and sub transactions */
-	binaryheap *txn_heap;
+	pairingheap *txn_heap;
 
 	/*
 	 * Statistics about transactions spilled to disk.
-- 
2.39.2
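
The core trick the patch relies on is worth spelling out: pairingheap has no
update-key operation, so a node whose key changed is repositioned by removing
it and re-adding it, and size-0 entries are simply kept out of the heap. Below
is a minimal sketch of that pattern with a toy element type (ToyTxn,
toy_txn_size_cmp and toy_txn_resize are illustrative names, not part of the
patch):

```
#include "postgres.h"
#include "lib/pairingheap.h"

/* Toy element with an embedded heap node, mirroring ReorderBufferTXN */
typedef struct ToyTxn
{
	Size		size;			/* heap key: bytes of changes in memory */
	pairingheap_node ph_node;	/* embedded node, like txn_node in the patch */
} ToyTxn;

/* Max-heap on size: same comparator shape as ReorderBufferTXNSizeCompare */
static int
toy_txn_size_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
{
	const ToyTxn *ta = pairingheap_const_container(ToyTxn, ph_node, a);
	const ToyTxn *tb = pairingheap_const_container(ToyTxn, ph_node, b);

	if (ta->size < tb->size)
		return -1;
	if (ta->size > tb->size)
		return 1;
	return 0;
}

/*
 * Reposition a node after its key changes: remove the stale node (if it
 * was in the heap at all) and re-add it with the new key. Size-0 entries
 * are kept out of the heap, matching the patch.
 */
static void
toy_txn_resize(pairingheap *heap, ToyTxn *txn, Size newsize)
{
	if (txn->size != 0)
		pairingheap_remove(heap, &txn->ph_node);
	txn->size = newsize;
	if (newsize != 0)
		pairingheap_add(heap, &txn->ph_node);
}
```

pairingheap_add() is O(1); the removal is where the amortized O(log n) cost
lands.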

#86Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#85)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Apr 11, 2024 at 12:20:55AM +0300, Heikki Linnakangas wrote:

To move this forward, here's a patch to switch to a pairing heap. In my very
quick testing, with the performance test cases posted earlier in this thread
[1] [2], I'm seeing no meaningful performance difference between this and
what's in master currently.

Reading through the patch, that's a nice cleanup. It cuts quite some
code.

+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/binaryheap.h"
 #include "lib/ilist.h"
+#include "lib/pairingheap.h"

I'm slightly annoyed by the extra amount of information that gets
added to reorderbuffer.h for stuff that's only local to
reorderbuffer.c, but that's not something new in this area, so..
--
Michael

#87Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#86)
Re: Improve eviction algorithm in ReorderBuffer

On 11/04/2024 01:37, Michael Paquier wrote:

On Thu, Apr 11, 2024 at 12:20:55AM +0300, Heikki Linnakangas wrote:

To move this forward, here's a patch to switch to a pairing heap. In my very
quick testing, with the performance test cases posted earlier in this thread
[1] [2], I'm seeing no meaningful performance difference between this and
what's in master currently.

Reading through the patch, that's a nice cleanup. It cuts quite some
code.

+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
#include "access/htup_details.h"
#include "lib/binaryheap.h"
#include "lib/ilist.h"
+#include "lib/pairingheap.h"

I'm slightly annoyed by the extra amount of information that gets
added to reorderbuffer.h for stuff that's only local to
reorderbuffer.c, but that's not something new in this area, so..

We can actually remove the "lib/binaryheap.h" include in this patch; I
missed that. There are no other uses of binaryheap in the file.
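
With that include dropped, the header section shown in the patch's hunk
would read:

```
#include "access/htup_details.h"
#include "lib/ilist.h"
#include "lib/pairingheap.h"
#include "storage/sinval.h"
#include "utils/hsearch.h"
#include "utils/relcache.h"
```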

--
Heikki Linnakangas
Neon (https://neon.tech)

#88Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Heikki Linnakangas (#85)
Re: Improve eviction algorithm in ReorderBuffer

Hi,

Sorry for the late reply, I took two days off.

On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/04/2024 08:31, Amit Kapila wrote:

On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

Also consider commits b840508644 and bcb14f4abc.

Indeed. These are also linked.

I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch to the pairing heap,
performance test that, and revert the binary heap changes.

+1.

To move this forward, here's a patch to switch to a pairing heap. In my
very quick testing, with the performance test cases posted earlier in
this thread [1] [2], I'm seeing no meaningful performance difference
between this and what's in master currently.

Sawada-san, what do you think of this? To be sure, if you could also
repeat the performance tests you performed earlier, that'd be great. If
you agree with this direction, and you're happy with this patch, feel
free to take it from here and commit this, and also revert commits
b840508644 and bcb14f4abc.

Thank you for the patch!

I agree with the direction that we replace binaryheap + index with the
existing pairing heap and revert the changes for binaryheap. Regarding
the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
logic because otherwise we need to remove and add the txn node (i.e.
O(log n)) for every memory update. I'm concerned it could cause some
performance degradation in cases where there are not many
transactions being decoded.

I'll do performance tests, and share the results soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#89Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Heikki Linnakangas (#85)
1 attachment(s)
RE: Improve eviction algorithm in ReorderBuffer

Dear Heikki,

I also prototyped the idea, and it has almost the same shape.
I've attached it just in case, but you may not need to look at it.

A few comments based on the experiment:

```
+	/* txn_heap is ordered by transaction size */
+	buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
```

I think the pairing heap should be in the same MemoryContext as the buffer.
Can we add a MemoryContextSwitchTo()?

```
+		/* Update the max-heap */
+		if (oldsize != 0)
+			pairingheap_remove(rb->txn_heap, &txn->txn_node);
+		pairingheap_add(rb->txn_heap, &txn->txn_node);
...
+		/* Update the max-heap */
+		pairingheap_remove(rb->txn_heap, &txn->txn_node);
+		if (txn->size != 0)
+			pairingheap_add(rb->txn_heap, &txn->txn_node);
```

Since the number of stored transactions does not affect the insert operation, we may be able
to add the node while creating the ReorderBufferTXN and remove it while cleaning up. This could
reduce the branches in ReorderBufferChangeMemoryUpdate().
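
Roughly, I mean something like this (an untested sketch; the exact hook
points are just for illustration):

```
/* When the ReorderBufferTXN is created (e.g. in ReorderBufferGetTXN): */
txn->size = 0;
pairingheap_add(rb->txn_heap, &txn->txn_node);

/* When it is destroyed (e.g. in ReorderBufferReturnTXN): */
pairingheap_remove(rb->txn_heap, &txn->txn_node);

/*
 * ReorderBufferChangeMemoryUpdate() then repositions the node
 * unconditionally after txn->size has been adjusted, with no
 * emptiness checks:
 */
pairingheap_remove(rb->txn_heap, &txn->txn_node);
pairingheap_add(rb->txn_heap, &txn->txn_node);
```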

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

patchset.zipapplication/x-zip-compressed; name=patchset.zipDownload
#90Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#88)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Apr 11, 2024 at 10:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

Sorry for the late reply, I took two days off.

On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/04/2024 08:31, Amit Kapila wrote:

On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/04/2024 07:45, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:

On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:

Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

Also consider commits b840508644 and bcb14f4abc.

Indeed. These are also linked.

I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch to the pairing heap,
performance test that, and revert the binary heap changes.

+1.

To move this forward, here's a patch to switch to a pairing heap. In my
very quick testing, with the performance test cases posted earlier in
this thread [1] [2], I'm seeing no meaningful performance difference
between this and what's in master currently.

Sawada-san, what do you think of this? To be sure, if you could also
repeat the performance tests you performed earlier, that'd be great. If
you agree with this direction, and you're happy with this patch, feel
free to take it from here and commit this, and also revert commits
b840508644 and bcb14f4abc.

Thank you for the patch!

I agree with the direction that we replace binaryheap + index with the
existing pairing heap and revert the changes for binaryheap. Regarding
the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
logic because otherwise we need to remove and add the txn node (i.e.
O(log n)) for every memory update. I'm concerned it could cause some
performance degradation in cases where there are not many
transactions being decoded.

I'll do performance tests, and share the results soon.

Here are some performance test results.

* test case 1 (many subtransactions)

test script:

create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(1000000);
set logical_decoding_work_mem to '4MB';
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

43128.266 ms
40116.313 ms
38790.666 ms

Patched:

43626.316 ms
44456.234 ms
39899.753 ms

* test case 2 (single big insertion)

test script:

create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 10000000);
set logical_decoding_work_mem to '10GB'; -- avoid data spill
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

7996.476 ms
8034.022 ms
8005.583 ms

Patched:

8153.500 ms
8121.588 ms
8121.538 ms

* test case 3 (many small transactions)

test script:

pgbench -i -s 300
psql -c "select pg_create_logical_replication_slot('s', 'test_decoding');"
pgbench -t 100000 -c 32
psql -c "set logical_decoding_work_mem to '10GB'; select count(*) from
pg_logical_slot_peek_changes('s', null, null)"

HEAD:

22586.343 ms
22507.905 ms
22504.133 ms

Patched:

23365.142 ms
23110.651 ms
23102.170 ms

We can see 2% ~ 3% performance regressions compared to the current
HEAD, but it's much smaller than I expected. Given that we can make
the code simple, I think we can go with this direction.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#91Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#90)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

We can see 2% ~ 3% performance regressions compared to the current
HEAD, but it's much smaller than I expected. Given that we can make
the code simple, I think we can go with this direction.

Pushed the patch and reverted binaryheap changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#92Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Masahiko Sawada (#91)
Re: Improve eviction algorithm in ReorderBuffer

On 11/04/2024 11:20, Masahiko Sawada wrote:

On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

We can see 2% ~ 3% performance regressions compared to the current
HEAD, but it's much smaller than I expected. Given that we can make
the code simple, I think we can go with this direction.

Pushed the patch and reverted binaryheap changes.

Thank you!

--
Heikki Linnakangas
Neon (https://neon.tech)

#93Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#89)
Re: Improve eviction algorithm in ReorderBuffer

On Thu, Apr 11, 2024 at 10:46 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Heikki,

I also prototyped the idea, and it has almost the same shape.
I've attached it just in case, but you may not need to look at it.

A few comments based on the experiment:

Thank you for reviewing the patch. I didn't include the following
suggestions, since for now I wanted to just fix the binaryheap part while
keeping the other parts as they are. If we need these changes, we can make
them in separate commits as follow-up fixes.

```
+       /* txn_heap is ordered by transaction size */
+       buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
```

I think the pairing heap should be in the same MemoryContext as the buffer.
Can we add a MemoryContextSwitchTo()?

pairingheap_allocate() allocates only a tiny amount of memory for the
pairing heap, and its memory usage doesn't grow even when adding more
data. And since it's allocated in the logical decoding context, its
lifetime is also fine. So I'm not sure it's worth moving it into
rb->context for better memory accountability.
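
Just for reference, the change would be tiny, something like this in
ReorderBufferAllocate() (a sketch of the suggestion, not something I'm
proposing to commit):

```
MemoryContext oldcontext;

oldcontext = MemoryContextSwitchTo(buffer->context);
buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
MemoryContextSwitchTo(oldcontext);
```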

```
+               /* Update the max-heap */
+               if (oldsize != 0)
+                       pairingheap_remove(rb->txn_heap, &txn->txn_node);
+               pairingheap_add(rb->txn_heap, &txn->txn_node);
...
+               /* Update the max-heap */
+               pairingheap_remove(rb->txn_heap, &txn->txn_node);
+               if (txn->size != 0)
+                       pairingheap_add(rb->txn_heap, &txn->txn_node);
```

Since the number of stored transactions does not affect the insert operation, we may be able
to add the node while creating the ReorderBufferTXN and remove it while cleaning up. This could
reduce the branches in ReorderBufferChangeMemoryUpdate().

I think it also means that we would need to remove the entry during cleanup
even if the transaction doesn't have any changes, which is O(log n). I
feel the current approach of not storing transactions with size 0
in the heap is better, and I'm not sure that reducing these branches
really contributes to any performance improvement.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com