Improve eviction algorithm in ReorderBuffer
Hi all,
As the comment of ReorderBufferLargestTXN() says, it's very slow with
many subtransactions:
/*
* Find the largest transaction (toplevel or subxact) to evict (spill to disk).
*
* XXX With many subtransactions this might be quite slow, because we'll have
* to walk through all of them. There are some options how we could improve
* that: (a) maintain some secondary structure with transactions sorted by
* amount of changes, (b) not looking for the entirely largest transaction,
* but e.g. for transaction using at least some fraction of the memory limit,
* and (c) evicting multiple transactions at once, e.g. to free a given portion
* of the memory limit (e.g. 50%).
*/
This is because the reorderbuffer has a transaction entry for each
top-level and sub transaction, and ReorderBufferLargestTXN() walks
through all transaction entries to pick the transaction to evict.
I've heard a report internally that replication lag became huge when
decoding transactions each consisting of 500k subtransactions. Note
that ReorderBufferLargestTXN() is used only in non-streaming mode.
Here is a test script for a many-subtransactions scenario. In my
environment, the logical decoding took over 2 minutes to decode one
top-level transaction having 100k subtransactions.
-----
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select testfn(100000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);
-----
To deal with this problem, I initially thought of idea (a) mentioned
in the comment: use a binary heap to maintain the transactions sorted
by the amount of changes, i.e. the size. But it seems not a good idea
to maintain all transactions sorted by size, since the size of each
transaction can change frequently.
The attached patch uses a different approach consisting of three
strategies: (1) maintain a list of transactions whose size is larger
than 10% of logical_decoding_work_mem, and preferentially evict a
transaction from this list. If the list is empty, all transactions are
small enough, so (2) we evict the oldest top-level transaction from
the rb->toplevel_by_lsn list. Evicting older transactions helps free
memory blocks in the GenerationContext. Finally, if this list is also
empty, (3) we evict a transaction whose size is > 0. Here we need to
note that even after a transaction is evicted, its ReorderBufferTXN
entry is not removed from rb->by_txn; its size is just 0. In the worst
case, where all (quite a few) transactions are smaller than 10% of the
memory limit, we might end up checking many transactions to find
non-zero-size entries to evict. So the patch adds a new list that
maintains all transactions having at least one change in memory.
Summarizing the algorithm I've implemented in the patch:
1. pick a transaction from the list of large transactions (larger than
10% of memory limit).
2. pick a transaction from the top-level transaction list in LSN order.
3. pick a transaction from the list of transactions that have at least
one change in memory.
With the patch, the above test case completed within 3 seconds in my
environment.
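For illustration only, here is a minimal C sketch of the three-tier
candidate selection. The real code works on ReorderBufferTXN entries
kept in dlist-based lists; the Txn struct and function name below are
hypothetical stand-ins, not from the patch:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for ReorderBufferTXN; the real
 * struct has many more fields and lives in dlist-based lists. */
typedef struct Txn
{
    size_t      size;           /* bytes of changes currently in memory */
    struct Txn *next;           /* next entry in whichever list holds it */
} Txn;

/*
 * Pick an eviction candidate using the three strategies:
 * (1) a transaction above 10% of logical_decoding_work_mem, else
 * (2) the oldest top-level transaction (list head, in LSN order), else
 * (3) any transaction that still has at least one change in memory.
 */
static Txn *
pick_eviction_candidate(Txn *large_txns, Txn *toplevel_by_lsn,
                        Txn *with_changes)
{
    if (large_txns != NULL)
        return large_txns;      /* strategy (1) */
    if (toplevel_by_lsn != NULL)
        return toplevel_by_lsn; /* strategy (2) */
    return with_changes;        /* strategy (3); NULL if nothing to evict */
}
```

Each list is unordered, so every step is O(1) list-head access rather
than a scan over all transaction entries.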
As a side note, idea (c) mentioned in the comment, evicting multiple
transactions at once to free a given portion of the memory, would also
help avoid bouncing back and forth across the memory threshold. It's
also worth considering.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment: improve_eviction_rb_poc.patch
On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Summarizing the algorithm I've implemented in the patch,
> 1. pick a transaction from the list of large transactions (larger than
> 10% of memory limit).
> 2. pick a transaction from the top-level transaction list in LSN order.
> 3. pick a transaction from the list of transactions that have at least
> one change in memory.
>
> With the patch, the above test case completed within 3 seconds in my
> environment.
Thanks for working on this. I think it would be good to test other
scenarios as well, where this might have some negative impact, and see
where we stand. I mean:
1) A scenario where you have one very large transaction consuming ~40%
of the memory and 5-6 comparatively smaller transactions that are just
above 10% of the memory limit. Now, to come under the memory limit,
instead of evicting the one large transaction we end up evicting
multiple times.
2) Another scenario where all the transactions are under 10% of the
memory limit, but let's say some transactions are consuming around
8-9% of the memory limit each and those are not very old transactions,
whereas there are certain old transactions which are fairly small,
consuming under 1% of the memory limit, and there are many such
transactions. How would it play out if we frequently select many of
these small transactions to come under the memory limit instead of
selecting a couple of large transactions consuming 8-9%?
> As a side note, the idea (c) mentioned in the comment, evicting
> multiple transactions at once to free a given portion of the memory,
> would also help in avoiding back and forth the memory threshold. It's
> also worth considering.
Yes, I think it is worth considering.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thanks for working on this, I think it would be good to test other
> scenarios as well where this might have some negative impact and see
> where we stand.
Agreed.
> 1) A scenario where suppose you have one very large transaction that
> is consuming ~40% of the memory and 5-6 comparatively smaller
> transactions that are just above 10% of the memory limit. And now for
> coming under the memory limit instead of getting 1 large transaction
> evicted out, we are evicting out multiple times.
Given that the large-transaction list will have up to 10 transactions,
I think it's cheap to pick the largest transaction among them: it's
O(N), but N won't be large.
> 2) Another scenario where all the transactions are under 10% of the
> memory limit but let's say there are some transactions are consuming
> around 8-9% of the memory limit each but those are not very old
> transactions whereas there are certain old transactions which are
> fairly small and consuming under 1% of memory limit and there are many
> such transactions. So how it would affect if we frequently select
> many of these transactions to come under memory limit instead of
> selecting a couple of large transactions which are consuming 8-9%?
Yeah, probably we can do something for small transactions (i.e. small,
in-memory transactions). One idea is to pick the largest transaction
among them by iterating over all of them. Given that the more
transactions are evicted, the fewer transactions the in-memory
transaction list has (unlike the current algorithm), we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists, transactions consuming < 5% and >= 5% of the memory
limit, and preferably checking the >= 5% list. The cost of maintaining
these lists could increase, though.
Do you have any ideas?
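For what it's worth, the classification step of such a sub-list scheme
is trivial; here is a hedged C sketch with illustrative thresholds
(10% and 5% are just the numbers discussed above, and the names are
made up):

```c
#include <stddef.h>

/*
 * Classify a transaction by its share of logical_decoding_work_mem,
 * so it can be kept on the matching sub-list. Thresholds (10%, 5%)
 * are illustrative only.
 */
enum SizeClass { TXN_SMALL, TXN_MEDIUM, TXN_LARGE };

static enum SizeClass
txn_size_class(size_t txn_size, size_t work_mem)
{
    if (txn_size * 10 >= work_mem)      /* >= 10% of the limit */
        return TXN_LARGE;
    if (txn_size * 20 >= work_mem)      /* >= 5% of the limit */
        return TXN_MEDIUM;
    return TXN_SMALL;
}
```

The maintenance cost mentioned above shows up here: whenever a
transaction's size changes, its class must be recomputed, and the
entry may need to move between sub-lists.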
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Given the large transaction list will have up to 10 transactions, I
> think it's cheap to pick the largest transaction among them. It's O(N)
> but N won't be large.
Yeah, that makes sense.
> Or we could even split it into several sub-lists in order to
> reduce the number of transactions to check. For example, splitting it
> into two lists: transactions consuming < 5% and >= 5% of the memory
> limit, and checking the >= 5% list preferably. The cost for
> maintaining these lists could increase, though.
>
> Do you have any ideas?
Yeah, something like what you mention might be good: we maintain 3
lists, say for large, medium, and small transactions. In the
large-transaction list, suppose we allow transactions that consume
more than 10%; there could be at most 10 such transactions, so we can
do a sequential search and spill the largest of all. Whereas in the
medium list, suppose we keep transactions ranging from e.g. 3-10%;
then it's just fine to pick from the head, because the size
differences between the largest and smallest transactions in this list
are not very significant. The rest go in the small-transaction list,
and from the small-transaction list we can choose to spill multiple
transactions at a time.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've heard a report internally that replication lag became huge when
> decoding transactions each consisting of 500k subtransactions. Note
> that ReorderBufferLargestTXN() is used only in non-streaming mode.
Can't you suggest they use streaming mode to avoid this problem, or do
you see some problem with that?
> To deal with this problem, I initially thought of the idea (a)
> mentioned in the comment; use a binary heap to maintain the
> transactions sorted by the amount of changes or the size. But it seems
> not a good idea to try maintaining all transactions by its size since
> the size of each transaction could be changed frequently.
>
> The attached patch uses a different approach that consists of three
> strategies; (1) maintain the list of transactions whose size is larger
> than 10% of logical_decoding_work_mem, and preferentially evict a
> transaction from this list.
IIUC, you are giving preference to the multiple-lists idea as compared
to (a) because you don't need to adjust the list each time a
transaction's size changes, is that right? If so, I think there is a
cost to keep that data structure up to date, but it can help in
reducing the number of times we need to serialize.
> Or we could even split it into several sub-lists in order to
> reduce the number of transactions to check. For example, splitting it
> into two lists: transactions consuming < 5% and >= 5% of the memory
> limit, and checking the >= 5% list preferably.
Which memory limit are you referring to here? Is it logical_decoding_work_mem?
> The cost for
> maintaining these lists could increase, though.
Yeah, can't we maintain a single list of all xacts that are consuming
equal to or greater than the memory limit? Considering that the memory
limit is logical_decoding_work_mem, I think just picking one
transaction to serialize would be sufficient.
--
With Regards,
Amit Kapila.
On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've heard a report internally that replication lag became huge when
> > decoding transactions each consisting of 500k subtransactions. Note
> > that ReorderBufferLargestTXN() is used only in non-streaming mode.
>
> Can't you suggest them to use streaming mode to avoid this problem or
> do you see some problem with that?
Yeah, that's one option. But I can suggest
> > The attached patch uses a different approach that consists of three
> > strategies; (1) maintain the list of transactions whose size is larger
> > than 10% of logical_decoding_work_mem, and preferentially evict a
> > transaction from this list.
>
> IIUC, you are giving preference to multiple list ideas as compared to
> (a) because you don't need to adjust the list each time the
> transaction size changes, is that right?
Right.
> If so, I think there is a
> cost to keep that data structure up-to-date but it can help in
> reducing the number of times we need to serialize.
Yes, there is a trade-off.
What I don't want to do is keep all transactions fully ordered, since
that's too costly. The proposed idea uses multiple lists to keep all
transactions roughly ordered; the maintenance cost is cheap since each
list is unordered.
It might be a good idea to have a threshold for switching how we pick
the largest transaction, based on the number of transactions in the
reorderbuffer: if there are many transactions, use the proposed
algorithm to find a possibly-largest transaction; otherwise, use the
current way.
> > Or we could even split it into several sub-lists in order to
> > reduce the number of transactions to check. For example, splitting it
> > into two lists: transactions consuming < 5% and >= 5% of the memory
> > limit, and checking the >= 5% list preferably.
>
> Which memory limit are you referring to here? Is it logical_decoding_work_mem?
logical_decoding_work_mem.
> > The cost for
> > maintaining these lists could increase, though.
>
> Yeah, can't we maintain a single list of all xacts that are consuming
> equal to or greater than the memory limit? Considering that the memory
> limit is logical_decoding_work_mem, then I think just picking one
> transaction to serialize would be sufficient.
IIUC we serialize a transaction when the sum of all transactions'
memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
In what cases would multiple transactions each be consuming equal to
or greater than logical_decoding_work_mem?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2023-Dec-12, Masahiko Sawada wrote:
> To deal with this problem, I initially thought of the idea (a)
> mentioned in the comment; use a binary heap to maintain the
> transactions sorted by the amount of changes or the size. But it seems
> not a good idea to try maintaining all transactions by its size since
> the size of each transaction could be changed frequently.
Hmm, maybe you can just use binaryheap_add_unordered and just let the
sizes change, and do binaryheap_build() at the point where the eviction
is needed.
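Just to make the suggestion concrete, here is a toy sketch of the
add-unordered/lazy-build pattern. It mirrors the idea, not
PostgreSQL's actual binaryheap API; the names and fixed-size array are
made up for illustration:

```c
#include <stddef.h>

/* Minimal max-heap over transaction sizes. Additions (and size
 * changes) leave the array unordered; the heap property is
 * (re)established only when an eviction is needed. */
#define MAX_TXNS 1024

typedef struct SizeHeap
{
    size_t  sizes[MAX_TXNS];
    int     nelems;
} SizeHeap;

static void
heap_add_unordered(SizeHeap *h, size_t size)
{
    h->sizes[h->nelems++] = size;   /* O(1); no sifting */
}

static void
sift_down(SizeHeap *h, int i)
{
    for (;;)
    {
        int largest = i;
        int l = 2 * i + 1, r = 2 * i + 2;

        if (l < h->nelems && h->sizes[l] > h->sizes[largest]) largest = l;
        if (r < h->nelems && h->sizes[r] > h->sizes[largest]) largest = r;
        if (largest == i) break;
        size_t tmp = h->sizes[i];
        h->sizes[i] = h->sizes[largest];
        h->sizes[largest] = tmp;
        i = largest;
    }
}

/* O(n) bottom-up heapify, run lazily at eviction time */
static void
heap_build(SizeHeap *h)
{
    for (int i = h->nelems / 2 - 1; i >= 0; i--)
        sift_down(h, i);
}

/* Remove and return the largest element: O(log n) */
static size_t
heap_remove_first(SizeHeap *h)
{
    size_t top = h->sizes[0];

    h->sizes[0] = h->sizes[--h->nelems];
    sift_down(h, 0);
    return top;
}
```

The point of the pattern is that bookkeeping while sizes change stays
O(1), and the O(n) heapify cost is paid only when an eviction is
actually required.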
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No necesitamos banderas
No reconocemos fronteras" (Jorge González)
On Fri, Dec 15, 2023 at 7:10 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2023-Dec-12, Masahiko Sawada wrote:
>
> > To deal with this problem, I initially thought of the idea (a)
> > mentioned in the comment; use a binary heap to maintain the
> > transactions sorted by the amount of changes or the size. But it seems
> > not a good idea to try maintaining all transactions by its size since
> > the size of each transaction could be changed frequently.
>
> Hmm, maybe you can just use binaryheap_add_unordered and just let the
> sizes change, and do binaryheap_build() at the point where the eviction
> is needed.
I assume you mean to add ReorderBufferTXN entries to the binaryheap
and then build it by comparing their sizes (i.e. txn->size). But
ReorderBufferTXN entries are removed and deallocated once the
transaction finishes. How can we efficiently remove these entries from
the binaryheap? I guess it would be O(n) to find the entry among the
unordered entries, where n is the number of transactions in the
reorderbuffer.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 15, 2023 at 2:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Yeah, that's one option. But I can suggest
Sorry, it was still in the middle of editing.
Yeah, that's one option. But since there is a trade-off, I cannot
suggest using streaming mode for every user. Also, the logical
replication client (e.g. a third-party tool receiving the logical
change set) might not support streaming mode yet.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote:
> I assume you mean to add ReorderBufferTXN entries to the binaryheap
> and then build it by comparing their sizes (i.e. txn->size). But
> ReorderBufferTXN entries are removed and deallocated once the
> transaction finished. How can we efficiently remove these entries from
> binaryheap? I guess it would be O(n) to find the entry among the
> unordered entries, where n is the number of transactions in the
> reorderbuffer.
O(log n) for both functions: binaryheap_remove_first() and
binaryheap_remove_node(). I didn't read your patch, but do you really
need to free entries one by one? If not, binaryheap_free().
--
Euler Taveira
EDB https://www.enterprisedb.com/
On Sat, Dec 16, 2023 at 1:36 AM Euler Taveira <euler@eulerto.com> wrote:
> On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote:
>
> > I guess it would be O(n) to find the entry among the
> > unordered entries, where n is the number of transactions in the
> > reorderbuffer.
>
> O(log n) for both functions: binaryheap_remove_first() and
> binaryheap_remove_node().
Right. binaryheap_remove_first() removes the topmost
entry in O(log n), but the ReorderBufferTXN being removed is not
necessarily the topmost entry, since we remove an entry when its
transaction completes (committed or aborted). binaryheap_remove_node()
removes the entry at the given index in O(log n), but I'm not sure how
we can know the index of each entry. We can remember the index of a
newly added entry after calling binaryheap_add_unordered(), but once
we call binaryheap_build() that index is out-of-date. So I think that
in the worst case we would need to check all entries in order to
remove an arbitrary entry from the binaryheap, which is O(n). I might
be missing something, though.
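To illustrate the cost being described, here is a minimal sketch (the `MiniHeap` type and function names are hypothetical, not the actual PostgreSQL binaryheap API): without an auxiliary position index, locating an arbitrary value in the heap's node array requires an O(n) scan, even though the removal itself would then only cost O(log n).

```c
#include <assert.h>

/* Hypothetical miniature max-heap over ints, stored as an array. */
typedef struct MiniHeap
{
    int nodes[64];
    int size;
} MiniHeap;

/* Insert with the usual O(log n) sift-up. */
static void
miniheap_add(MiniHeap *heap, int value)
{
    int i = heap->size++;

    heap->nodes[i] = value;
    while (i > 0 && heap->nodes[(i - 1) / 2] < heap->nodes[i])
    {
        int parent = (i - 1) / 2;
        int tmp = heap->nodes[parent];

        heap->nodes[parent] = heap->nodes[i];
        heap->nodes[i] = tmp;
        i = parent;
    }
}

/*
 * Find the position of an arbitrary value.  The heap property gives no
 * help for arbitrary values, so this is an O(n) scan -- the very cost
 * being discussed when a non-topmost transaction finishes and must be
 * removed.
 */
static int
miniheap_find(const MiniHeap *heap, int value)
{
    for (int i = 0; i < heap->size; i++)
    {
        if (heap->nodes[i] == value)
            return i;
    }
    return -1;
}
```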
I didn't read your patch but do you really need to
free entries one by one? If not, binaryheap_free().
The patch doesn't touch on how to free entries. ReorderBufferTXN
entries are freed one by one as each transaction completes (committed
or aborted).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
IIUC, you are giving preference to multiple list ideas as compared to
(a) because you don't need to adjust the list each time the
transaction size changes, is that right?

Right.

If so, I think there is a
cost to keep that data structure up-to-date but it can help in
reducing the number of times we need to serialize.

Yes, there is a trade-off.

What I don't want to do is to keep all transactions ordered since it's
too costly. The proposed idea uses multiple lists to keep all
transactions roughly ordered. The maintenance cost would be cheap
since each list is unordered.

It might be a good idea to have a threshold to switch how to pick the
largest transaction based on the number of transactions in the
reorderbuffer. If there are many transactions, we can use the proposed
algorithm to find a possibly-largest transaction, otherwise use the
current way.
Yeah, that makes sense.
1) A scenario where suppose you have one very large transaction that
is consuming ~40% of the memory and 5-6 comparatively smaller
transactions that are just above 10% of the memory limit. And now for
coming under the memory limit instead of getting 1 large transaction
evicted out, we are evicting out multiple times.

Given the large transaction list will have up to 10 transactions, I
think it's cheap to pick the largest transaction among them. It's O(N)
but N won't be large.

2) Another scenario where all the transactions are under 10% of the
memory limit but let's say there are some transactions are consuming
around 8-9% of the memory limit each but those are not very old
transactions whereas there are certain old transactions which are
fairly small and consuming under 1% of memory limit and there are many
such transactions. So how it would affect if we frequently select
many of these transactions to come under memory limit instead of
selecting a couple of large transactions which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
more transactions are evicted, the less transactions the on-memory
transaction list has, unlike the current algorithm, we would still
win. Or we could even split it into several sub-lists in order to
reduce the number of transactions to check. For example, splitting it
into two lists: transactions consuming 5% < and 5% >= of the memory
limit, and checking the 5% >= list preferably.

Which memory limit are you referring to here? Is it logical_decoding_work_mem?
logical_decoding_work_mem.
The cost for
maintaining these lists could increase, though.

Yeah, can't we maintain a single list of all xacts that are consuming
equal to or greater than the memory limit? Considering that the memory
limit is logical_decoding_work_mem, then I think just picking one
transaction to serialize would be sufficient.

IIUC we serialize a transaction when the sum of all transactions'
memory usage in the reorderbuffer exceeds logical_decoding_work_mem.
In what cases are multiple transactions consuming equal to or greater
than the logical_decoding_work_mem?
The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seem to be the case.
--
With Regards,
Amit Kapila.
On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seems to be the case.
I wanted to mean that there are three lists in total: the first one
maintain the transactions consuming more than 10% of
logical_decoding_work_mem, the second one maintains other transactions
consuming more than or equal to 5% of logical_decoding_work_mem, and
the third one maintains other transactions consuming more than 0 and
less than 5% of logical_decoding_work_mem.
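For illustration, the three-list scheme could classify a transaction like this (a minimal sketch with hypothetical names; the 10% and 5% thresholds are the ones from the discussion above):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Size;

/*
 * Hypothetical classifier for the three-list scheme: list 0 holds
 * transactions using more than 10% of logical_decoding_work_mem,
 * list 1 those using at least 5%, and list 2 everything smaller.
 * Names and signature are illustrative only.
 */
static int
txn_list_index(Size txn_size, Size work_mem_limit)
{
    if (txn_size * 10 > work_mem_limit)     /* > 10% of the limit */
        return 0;
    if (txn_size * 20 >= work_mem_limit)    /* >= 5% of the limit */
        return 1;
    return 2;
}
```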
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
The individual transactions shouldn't cross
'logical_decoding_work_mem'. I got a bit confused by your proposal to
maintain the lists: "...splitting it into two lists: transactions
consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
list preferably.". In the previous sentence, what did you mean by
transactions consuming 5% >= of the memory limit? I got the impression
that you are saying to maintain them in a separate transaction list
which doesn't seems to be the case.I wanted to mean that there are three lists in total: the first one
maintain the transactions consuming more than 10% of
logical_decoding_work_mem,
How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?
--
With Regards,
Amit Kapila.
On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
How can we have multiple transactions in the list consuming more than
10% of logical_decoding_work_mem? Shouldn't we perform serialization
before any xact reaches logical_decoding_work_mem?
Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Well, suppose logical_decoding_work_mem is set to 64MB, transactions
consuming more than 6.4MB are added to the list. So for example, it's
possible that the list has three transactions each of which are
consuming 10MB while the total memory usage in the reorderbuffer is
still 30MB (less than logical_decoding_work_mem).
Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.
--
With Regards,
Amit Kapila.
2024-01 Commitfest.
Hi,

This patch has a CF status of "Needs Review" [1], but it seems
like there were some CFbot test failures last time it was run [2].
Please have a look and post an updated version if necessary.
======
[1]: https://commitfest.postgresql.org/46/4699/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4699
Kind Regards,
Peter Smith.
On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Thanks for the clarification. I misunderstood the list to have
transactions greater than 70.4 MB (64 + 6.4) in your example. But one
thing to note is that maintaining these lists by default can also have
some overhead unless the list of open transactions crosses a certain
threshold.
On further analysis, I realized that the approach discussed here might
not be the way to go. The idea of dividing transactions into several
subgroups is to divide a large number of entries into multiple
sub-groups so we can reduce the complexity to search for the
particular entry. Since we assume that there are no big differences in
entries' sizes within a sub-group, we can pick the entry to evict in
O(1). However, what we really need to avoid here is that we end up
increasing the number of evictions, because serializing an
entry to disk is generally more costly than searching for an entry in
memory.
I think that it's no problem in a large-entries subgroup but when it
comes to the smallest-entries subgroup, like for entries consuming
less than 5% of the limit, it could end up evicting many entries. For
example, there would be a huge difference between serializing 1 entry
consuming 5% of the memory limit and serializing 5000 entries
consuming 0.001% of the memory limit. Even if we can select 5000
entries quickly, I think the latter would be slower in total. The more
subgroups we create, the more complex the algorithm gets and the more
overhead it could cause. So I think we need to search for the largest
entry in order to minimize the number of evictions anyway.
Looking for data structures and algorithms, I think binaryheap with
some improvements could be promising. I mentioned before why we cannot
use the current binaryheap[1]. The missing pieces are efficient ways
to remove the arbitrary entry and to update the arbitrary entry's key.
The current binaryheap provides binaryheap_remove_node(), which is
O(log n), but it requires the entry's position in the binaryheap. We
can know the entry's position just after binaryheap_add_unordered()
but it might be changed after heapify. Searching the node's position
is O(n). So the improvement idea is to add a hash table to the
binaryheap so that it can track the positions for each entry so that
we can remove the arbitrary entry in O(log n) and also update the
arbitrary entry's key in O(log n). This is known as the indexed
priority queue. I've attached the patch for that (0001 and 0002).
That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) in the worst case (O(1) on
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in non-streaming case,
we need to maintain the binaryheap anyway. I've attached the patch for
that (0003).
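To make the idea concrete, here is a self-contained sketch of such an indexed max-heap (all names are hypothetical; the real patches extend the existing binaryheap with a hash table for the position index, whereas this toy version keys the index by a small integer transaction id). Tracking each element's position alongside the heap array is what makes removal of an arbitrary element O(log n):

```c
#include <assert.h>

#define MAX_TXNS 16

/*
 * Toy indexed max-heap: "ids" is the heap of transaction ids ordered
 * by size, "pos" records each id's current position in the heap, and
 * "size" is each transaction's memory usage.  All names hypothetical.
 */
typedef struct IdxHeap
{
    int ids[MAX_TXNS];      /* heap array of txn ids */
    int size[MAX_TXNS];     /* txn size, keyed by txn id */
    int pos[MAX_TXNS];      /* heap position, keyed by txn id */
    int nelems;
} IdxHeap;

/* Swap two heap slots and keep the position index up to date. */
static void
swap_nodes(IdxHeap *h, int i, int j)
{
    int tmp = h->ids[i];

    h->ids[i] = h->ids[j];
    h->ids[j] = tmp;
    h->pos[h->ids[i]] = i;
    h->pos[h->ids[j]] = j;
}

static void
sift_up(IdxHeap *h, int i)
{
    while (i > 0 && h->size[h->ids[(i - 1) / 2]] < h->size[h->ids[i]])
    {
        swap_nodes(h, (i - 1) / 2, i);
        i = (i - 1) / 2;
    }
}

static void
sift_down(IdxHeap *h, int i)
{
    for (;;)
    {
        int largest = i;

        for (int c = 2 * i + 1; c <= 2 * i + 2; c++)
        {
            if (c < h->nelems &&
                h->size[h->ids[c]] > h->size[h->ids[largest]])
                largest = c;
        }
        if (largest == i)
            break;
        swap_nodes(h, i, largest);
        i = largest;
    }
}

static void
idxheap_add(IdxHeap *h, int id, int size)
{
    h->ids[h->nelems] = id;
    h->size[id] = size;
    h->pos[id] = h->nelems;
    h->nelems++;
    sift_up(h, h->nelems - 1);
}

/* Remove an arbitrary transaction in O(log n) via the position index. */
static void
idxheap_remove(IdxHeap *h, int id)
{
    int i = h->pos[id];

    swap_nodes(h, i, h->nelems - 1);
    h->nelems--;
    if (i < h->nelems)
    {
        sift_down(h, i);
        sift_up(h, i);
    }
}
```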
Here is a test script for the many-subtransactions case:
create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
for i in 1..cnt loop
begin
insert into test values (i);
exception when division_by_zero then
raise notice 'caught error';
return;
end;
end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(50000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);
and here are results:
* HEAD: 16877.281 ms
* HEAD w/ patches (0001 and 0002): 655.154 ms
There is a huge improvement in the many-subtransactions case.
Finally, we need to note that memory counter updates could happen
frequently as we update it for each change. So even though we update
the binaryheap in O(log n), it could be a huge overhead if it happens
quite often. One idea is to batch the memory counter updates where
available. I've attached the patch for that (0004). I'll benchmark
overheads for normal cases.
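A minimal sketch of what such batching might look like (names and the threshold are hypothetical, not the actual 0004 patch): each transaction accumulates a pending delta and only touches the heap once the delta is large enough to matter for eviction decisions.

```c
#include <assert.h>
#include <stdint.h>

/* Flush the accumulated delta every 64kB of change (illustrative). */
#define MEM_UPDATE_THRESHOLD (64 * 1024)

typedef struct TxnMem
{
    int64_t total;      /* size the eviction heap knows about */
    int64_t pending;    /* delta not yet reflected in the heap */
    int     flushes;    /* how many times we updated the heap */
} TxnMem;

/*
 * Accumulate a memory-usage change; only when the pending delta
 * exceeds the threshold do we update the tracked total -- in the real
 * code this is where binaryheap_update_up/down() would be called.
 */
static void
txn_mem_add(TxnMem *t, int64_t delta)
{
    t->pending += delta;
    if (t->pending >= MEM_UPDATE_THRESHOLD ||
        t->pending <= -MEM_UPDATE_THRESHOLD)
    {
        t->total += t->pending;
        t->pending = 0;
        t->flushes++;
    }
}
```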
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Make-binaryheap-enlareable.patch
v1-0002-Add-functions-for-updating-keys-and-removing-node.patch
v1-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch
v1-0004-Batch-memory-counter-updates-in-ReorderBuffer.patch
On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
That way, in terms of reorderbuffer, we can update and remove the
transaction's memory usage in O(log n) (in worst case and O(1) in
average) and then pick the largest transaction in O(1). Since we might
need to call ReorderBufferSerializeTXN() even in non-streaming case,
we need to maintain the binaryheap anyway.
Since updating the max-heap for each memory counter update could lead
to some regression when the number of transactions being decoded is
small, I've measured a case where the memory counter is updated
frequently:
setup script:
create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 8000000);
benchmark script:
set work_mem to '3GB';
set logical_decoding_work_mem to '5GB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);
Here are results (the median of five executions):
* HEAD
5274.765 ms
* HEAD + 0001-0003 patch
5532.203 ms
There was an approximately 5% performance regression.
An improvement idea is that we use two strategies for updating
max-heap depending on the number of transactions. That is, if the
number of transactions being decoded is small, we add a transaction to
max-heap by binaryheap_add_unordered(), which is O(1), and heapify it
just before picking the largest transaction, which is O(n). That way,
we can minimize the overhead of updating the memory counter. Once the
number of transactions being decoded exceeds the threshold, say 1024,
we use another strategy. We call binaryheap_update_up/down() when
updating the memory counter to preserve heap property, which is O(log
n), and pick the largest transaction in O(1). This strategy minimizes
the cost of picking the largest transaction, at the expense of paying
some cost to update the memory counters.
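The strategy switch described above might be sketched as follows (hypothetical names; the 1024 threshold is the one mentioned in the text):

```c
#include <assert.h>

/* Threshold from the discussion: switch strategies at 1024 txns. */
#define EVICTION_HEAP_THRESHOLD 1024

typedef enum HeapUpdateMode
{
    HEAP_LAZY,  /* binaryheap_add_unordered(), heapify before picking */
    HEAP_EAGER  /* binaryheap_update_up/down() on every counter update */
} HeapUpdateMode;

/*
 * With few transactions, leave the array unordered on update (O(1))
 * and heapify lazily before picking a victim (O(n)); beyond the
 * threshold, keep the heap property on every update (O(log n)) so
 * picking the victim is O(1).
 */
static HeapUpdateMode
choose_update_mode(int ntxns)
{
    return (ntxns < EVICTION_HEAP_THRESHOLD) ? HEAP_LAZY : HEAP_EAGER;
}
```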
I've experimented with this idea and run the same tests:
* HEAD + new patches (0001 - 0003)
5277.524 ms
The number looks good. I've attached these patches. Feedback is very welcome.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Make-binaryheap-enlareable.patch
v2-0002-Add-functions-for-updating-keys-and-removing-node.patch
v2-0003-Improve-transaction-eviction-algorithm-in-Reorder.patch
Dear Sawada-san,
I have started to read your patches. Here are my initial comments.
At least, all subscription tests were passed on my env.
A comment for 0001:
01.
```
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ if (heap->bh_size < heap->bh_space)
+ return;
+
+ heap->bh_space *= 2;
+ heap->bh_nodes = repalloc(heap->bh_nodes,
+ sizeof(bh_node_type) * heap->bh_space);
+}
```
I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
structure is a public one, and arbitrary code and extensions can directly refer
to bh_nodes. But if the array is repalloc()'d, the pointer would be updated, so
their references would become dangling pointers.
I think the internals of the structure should be private in this case.
Comments for 0002:
02.
```
+#include "utils/palloc.h"
```
Is it really needed? I'm not sure what refers to it.
03.
```
typedef struct bh_nodeidx_entry
{
bh_node_type key;
char status;
int idx;
} bh_nodeidx_entry;
```
Sorry if it is a stupid question. Can you tell me how "status" is used?
Neither the binaryheap nor the reorderbuffer components refer to it.
04.
```
extern binaryheap *binaryheap_allocate(int capacity,
binaryheap_comparator compare,
- void *arg);
+ bool indexed, void *arg);
```
I felt the pre-existing API should not be changed. How about adding
binaryheap_allocate_extended() or something which can specify the `bool indexed`?
binaryheap_allocate() sets heap->bh_indexed to false.
05.
```
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
```
IIUC, callers must consider whether the node should be shifted up or down and
use the appropriate function, right? I felt it may not be user-friendly.
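One way such a direction-agnostic helper could look (a sketch over a plain int max-heap with hypothetical names, not the proposed API): after a node's key changes, compare it with its parent to decide whether to sift up, and otherwise sift down.

```c
#include <assert.h>

static void
heap_swap(int *a, int i, int j)
{
    int tmp = a[i];

    a[i] = a[j];
    a[j] = tmp;
}

/*
 * Restore the max-heap property after the key at index i changed,
 * without the caller having to know the direction.
 */
static void
heap_update(int *a, int n, int i)
{
    if (i > 0 && a[(i - 1) / 2] < a[i])
    {
        /* key grew past its parent: sift up */
        do
        {
            heap_swap(a, (i - 1) / 2, i);
            i = (i - 1) / 2;
        } while (i > 0 && a[(i - 1) / 2] < a[i]);
    }
    else
    {
        /* otherwise sift down as far as needed */
        for (;;)
        {
            int largest = i;

            if (2 * i + 1 < n && a[2 * i + 1] > a[largest])
                largest = 2 * i + 1;
            if (2 * i + 2 < n && a[2 * i + 2] > a[largest])
                largest = 2 * i + 2;
            if (largest == i)
                break;
            heap_swap(a, i, largest);
            i = largest;
        }
    }
}
```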
Comments for 0003:
06.
```
This commit changes the eviction algorithm in ReorderBuffer to use
max-heap with transaction size,a nd use two strategies depending on
the number of transactions being decoded.
```
s/a nd/ and/
07.
```
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't
preserve the heap property, which is O(1). We heapify the max-heap
just before picking the largest transaction, which is O(n). This
strategy minimizes the overheads of updating the transaction's memory
counter.
```
s/pudate/update/
08.
IIUC, if more than 1024 transactions are running but they have a small amount
of changes, the performance may be degraded, right? Do you have a result for
such a case?
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/