Using per-transaction memory contexts for storing decoded tuples
Hi all,
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because it used more
than 4GB of memory.
As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account for fragmentation,
block header size, or chunk header size in per-transaction memory
usage (i.e., txn->size), rb->size can stay below
logical_decoding_work_mem while the memory actually allocated in the
context is much higher.
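To make the failure mode concrete, here is a small self-contained simulation (illustrative Python, not PostgreSQL code; the block and chunk sizes and the interleaving pattern are assumptions): one surviving chunk from a long-running transaction per block is enough to keep every 8MB block allocated.

```python
# Illustrative simulation of generation-context behavior: a block can
# be released only once every chunk in it is freed, so one live chunk
# from a long-running transaction pins the whole 8MB block.

BLOCK_SIZE = 8 * 1024 * 1024   # 8MB, as in rb->tup_context
CHUNK_SIZE = 64 * 1024         # hypothetical decoded-tuple size

class GenerationContext:
    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []            # each block: list of live chunk owners
        self.space_left = 0         # space left in the newest block

    def alloc(self, owner, size):
        if not self.blocks or self.space_left < size:
            self.blocks.append([])            # "malloc" a new block
            self.space_left = self.block_size
        self.blocks[-1].append(owner)
        self.space_left -= size

    def free_owner(self, owner):
        # pfree() all chunks of one transaction; a block is released
        # only when no live chunks remain in it
        for b in self.blocks:
            b[:] = [o for o in b if o != owner]
        self.blocks = [b for b in self.blocks if b]

    def allocated(self):
        return len(self.blocks) * self.block_size

ctx = GenerationContext(BLOCK_SIZE)
accounted = 0
# Interleave one long-running-transaction chunk with 127 short-transaction
# chunks, so each 8MB block (128 chunks) contains exactly one long chunk.
for i in range(128 * 64):
    owner = "long" if i % 128 == 0 else f"short{i}"
    ctx.alloc(owner, CHUNK_SIZE)
    if owner == "long":
        accounted += CHUNK_SIZE   # only the long txn's chunks stay live

for i in range(128 * 64):         # all short transactions finish
    if i % 128 != 0:
        ctx.free_owner(f"short{i}")

print(accounted)        # 4194304   (~4MB, what txn->size-style accounting sees)
print(ctx.allocated())  # 536870912 (512MB actually held by the context)
```

The gap between the two numbers is exactly the fragmentation described above: accounting tracks live chunks, while the OS-visible footprint is whole blocks.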
We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds memory usage reporting and includes a test. After running the
TAP test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like the following:
LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
This means that logical decoding allocated 1GB of memory despite
logical_decoding_work_mem being 256MB.
One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting many
(possibly all) transactions, because a transaction whose decoded
tuple data are spread across memory context blocks could be evicted
only after all other, larger transactions are evicted (note that we
evict transactions from largest to smallest). Therefore, if we want to
do that, we would need to change the eviction algorithm to, for
example, LSN order instead of transaction-size order. Furthermore,
the reorderbuffer's changes that are counted in txn->size (and
therefore rb->size too) are stored in different memory contexts
depending on their kind. For example, decoded tuples are stored in
rb->context, ReorderBufferChange structs are stored in
rb->change_context, and snapshot data are stored in builder->context.
So we would need to sort out which data is stored in which memory
context and which memory contexts should be accounted for in the
reorderbuffer's memory usage, which could be a large amount of work.
Another idea is to have a memory context for storing decoded tuples
per transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think this idea is simpler and worth considering, so I attached the
PoC patch, use_tup_context_per_ctx_v1.patch. Since decoded tuple data
is created individually when the corresponding WAL record is decoded
but is released collectively (e.g., at transaction eviction), the bump
memory context fits this case best. One exception is that we
immediately free the decoded tuple data if the transaction is already
aborted; however, in that case I think it is possible to skip the WAL
decoding in the first place.
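As a rough sketch of why a per-transaction tuple context helps (illustrative Python, not the PoC patch itself; the fixed 8MB block size per context is a simplifying assumption, since real contexts start small and grow): evicting a transaction deletes its whole context, so no block can stay pinned by another transaction's chunks.

```python
# Sketch of per-transaction tuple contexts: deleting a transaction's
# context frees all of its blocks at once, so interleaved allocations
# from other transactions can no longer pin them.

BLOCK_SIZE = 8 * 1024 * 1024
CHUNK_SIZE = 64 * 1024

class PerTxnTupleContexts:
    def __init__(self):
        self.ctx = {}   # txn -> list of per-block used-byte counts

    def alloc(self, txn, size):
        blocks = self.ctx.setdefault(txn, [0])
        if blocks[-1] + size > BLOCK_SIZE:
            blocks.append(0)        # grow this txn's own context
        blocks[-1] += size

    def evict(self, txn):
        del self.ctx[txn]           # MemoryContextDelete(): all blocks freed

    def allocated(self):
        return sum(len(b) * BLOCK_SIZE for b in self.ctx.values())

# Interleaved workload: one long-running transaction plus many short ones.
mgr = PerTxnTupleContexts()
for i in range(128 * 64):
    mgr.alloc("long" if i % 128 == 0 else f"short{i}", CHUNK_SIZE)
for i in range(128 * 64):
    if i % 128 != 0:
        mgr.evict(f"short{i}")

# The long transaction's 64 chunks (4MB) fit in a single block, so only
# one 8MB block remains allocated once the short transactions are gone.
print(mgr.allocated())   # 8388608
```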
One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high,
since it has entries for both top-level transactions and
sub-transactions. To deal with that, the patch changes things so that
decoded tuples of a subtransaction are stored in its top-level
transaction's tuple context. Since we always evict sub-transactions
and their top-level transaction at the same time, I think this would
not be a problem. Checking for performance degradation due to using
many memory contexts still has to be done.
Even with this idea, we would still have a mismatch between the actual
amount of allocated memory and rb->size, but it would be much better
than today. Something like the first idea would still be necessary to
fully address the mismatch, which we can discuss in a separate thread.
Regards,
[1]: /messages/by-id/CAMnUB3oYugXCBLSkih+qNsWQPciEwos6g_AMbnz_peNoxfHwyw@mail.gmail.com
[2]: /messages/by-id/17974-f8c9d353a62f414d@postgresql.org
[3]: /messages/by-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252@DB9PR07MB7180.eurprd07.prod.outlook.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On 2024-09-12 07:32, Masahiko Sawada wrote:
Thanks a lot for working on this!
Hi all,
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.

As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.

We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds a memory usage reporting and includes the test. After running the
tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like follows:

LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264

Which means that the logical decoding allocated 1GB memory in spite of
logical_decoding_work_mem being 256MB.

One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions because the transaction whose decoded
tuples data are spread across memory context blocks could be evicted
after all other larger transactions are evicted (note that we evict
transactions from largest to smallest). Therefore, if we want to do
that, we would need to change the eviction algorithm to for example
LSN order instead of transaction size order. Furthermore,
reorderbuffer's changes that are counted in txn->size (and therefore
rb->size too) are stored in different memory contexts depending on the
kinds. For example, decoded tuples are stored in rb->context,
ReorderBufferChange are stored in rb->change_context, and snapshot
data are stored in builder->context. So we would need to sort out
which data is stored into which memory context and which memory
context should be accounted for in the reorderbuffer's memory usage.
Which could be a large amount of work.

Another idea is to have memory context for storing decoded tuples per
transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think that this idea would be simpler and worth considering, so I
attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
decoded tuple data is created individually when the corresponding WAL
record is decoded but is released collectively when it is released
(e.g., transaction eviction), the bump memory context would fit the
best this case. One exception is that we immediately free the decoded
tuple data if the transaction is already aborted. However, in this
case, I think it is possible to skip the WAL decoding in the first
place.
I haven't read the patch yet, but it seems a reasonable approach.
One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high
since it has entries for both top-level transaction entries and
sub-transactions. To deal with that, the patch changes that decoded
tuples of a subtransaction are stored into its top-level transaction's
tuple context. We always evict sub-transactions and its top-level
transaction at the same time, I think it would not be a problem.
Checking performance degradation due to using many memory contexts
would have to be done.
Yeah, and I imagine there would be cases where the current
implementation shows better performance, such as when there are no
long transactions, but compared to unexpected memory bloat and OOM
kills, I think this is far better.
Even with this idea, we would still have a mismatch between the actual
amount of allocated memory and rb->size, but it would be much better
than today. And something like the first idea would be necessary to
address this mismatch, and we can discuss it in a separate thread.Regards,
[1]: /messages/by-id/CAMnUB3oYugXCBLSkih+qNsWQPciEwos6g_AMbnz_peNoxfHwyw@mail.gmail.com
[2]: /messages/by-id/17974-f8c9d353a62f414d@postgresql.org
[3]: /messages/by-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252@DB9PR07MB7180.eurprd07.prod.outlook.com
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.

As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.
It is not clear to me how the fragmentation happens. Is it because of
some interleaved transactions that have already ended but whose
memory has not been released? Can we try reducing the size of the
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?
We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds a memory usage reporting and includes the test. After running the
tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like follows:

LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264

Which means that the logical decoding allocated 1GB memory in spite of
logical_decoding_work_mem being 256MB.

One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions because the transaction whose decoded
tuples data are spread across memory context blocks could be evicted
after all other larger transactions are evicted (note that we evict
transactions from largest to smallest). Therefore, if we want to do
that, we would need to change the eviction algorithm to for example
LSN order instead of transaction size order. Furthermore,
reorderbuffer's changes that are counted in txn->size (and therefore
rb->size too) are stored in different memory contexts depending on the
kinds. For example, decoded tuples are stored in rb->context,
ReorderBufferChange are stored in rb->change_context, and snapshot
data are stored in builder->context. So we would need to sort out
which data is stored into which memory context and which memory
context should be accounted for in the reorderbuffer's memory usage.
Which could be a large amount of work.

Another idea is to have memory context for storing decoded tuples per
transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think that this idea would be simpler and worth considering, so I
attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
decoded tuple data is created individually when the corresponding WAL
record is decoded but is released collectively when it is released
(e.g., transaction eviction), the bump memory context would fit the
best this case. One exception is that we immediately free the decoded
tuple data if the transaction is already aborted. However, in this
case, I think it is possible to skip the WAL decoding in the first
place.

One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high
since it has entries for both top-level transaction entries and
sub-transactions. To deal with that, the patch changes that decoded
tuples of a subtransaction are stored into its top-level transaction's
tuple context.
Won't that impact the calculation for ReorderBufferLargestTXN() which
can select either subtransaction or top-level xact?
We always evict sub-transactions and its top-level
transaction at the same time, I think it would not be a problem.
Checking performance degradation due to using many memory contexts
would have to be done.
The generation context was introduced in commit a4ccc1ce, which
claims that it significantly reduced logical decoding's memory usage
and improved its performance. Won't changing it require us to
validate all the cases which led us to use the generation context in
the first place?
--
With Regards,
Amit Kapila.
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.

As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.

It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which are even ended but the memory
corresponding to them is not released?
In a generation context, we can free a memory block only when all
memory chunks in it are freed. Therefore, the individual tuple buffers
may already be pfree()'ed while the underlying memory blocks are not
freed.
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?
Reducing the size of the 8MB memory blocks would be one solution, and
could be preferable as it would be back-patchable. It would mitigate
the problem but not resolve it. I agree with trying a reduced size and
doing some benchmark tests. If it reasonably makes the problem less
likely to happen, it would be a good solution.
We can reproduce this issue with the attached patch
rb_excessive_memory_reproducer.patch (not intended to commit) that
adds a memory usage reporting and includes the test. After running the
tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
usage report in the server logs like follows:

LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264

Which means that the logical decoding allocated 1GB memory in spite of
logical_decoding_work_mem being 256MB.

One idea to deal with this problem is that we use
MemoryContextMemAllocated() as the reorderbuffer's memory usage
instead of txn->total_size. That is, we evict transactions until the
value returned by MemoryContextMemAllocated() gets lower than
logical_decoding_work_mem. However, it could end up evicting a lot of
(possibly all) transactions because the transaction whose decoded
tuples data are spread across memory context blocks could be evicted
after all other larger transactions are evicted (note that we evict
transactions from largest to smallest). Therefore, if we want to do
that, we would need to change the eviction algorithm to for example
LSN order instead of transaction size order. Furthermore,
reorderbuffer's changes that are counted in txn->size (and therefore
rb->size too) are stored in different memory contexts depending on the
kinds. For example, decoded tuples are stored in rb->context,
ReorderBufferChange are stored in rb->change_context, and snapshot
data are stored in builder->context. So we would need to sort out
which data is stored into which memory context and which memory
context should be accounted for in the reorderbuffer's memory usage.
Which could be a large amount of work.

Another idea is to have memory context for storing decoded tuples per
transaction. That way, we can ensure that all memory blocks of the
context are freed when the transaction is evicted or completed. I
think that this idea would be simpler and worth considering, so I
attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
decoded tuple data is created individually when the corresponding WAL
record is decoded but is released collectively when it is released
(e.g., transaction eviction), the bump memory context would fit the
best this case. One exception is that we immediately free the decoded
tuple data if the transaction is already aborted. However, in this
case, I think it is possible to skip the WAL decoding in the first
place.

One thing we need to consider is that the number of transaction
entries and memory contexts in the reorderbuffer could be very high
since it has entries for both top-level transaction entries and
sub-transactions. To deal with that, the patch changes that decoded
tuples of a subtransaction are stored into its top-level transaction's
tuple context.

Won't that impact the calculation for ReorderBufferLargestTXN() which
can select either subtransaction or top-level xact?
Yeah, I missed that we could evict only sub-transactions when choosing
the largest transaction. We need to find a better solution.
We always evict sub-transactions and its top-level
transaction at the same time, I think it would not be a problem.
Checking performance degradation due to using many memory contexts
would have to be done.

The generation context has been introduced in commit a4ccc1ce which
claims that it has significantly reduced logical decoding's memory
usage and improved its performance. Won't changing it requires us to
validate all the cases which led us to use generation context in the
first place?
Agreed. Will investigate the thread and check the cases.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.
I appreciate your work on logical replication and am interested in the thread.
I've heard about this issue from others as well, and it has been a
barrier to using logical replication. Please let me know if I can help
with benchmarking, other measurements, etc.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.

As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.

It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which are even ended but the memory
corresponding to them is not released?

In a generation context, we can free a memory block only when all
memory chunks there are freed. Therefore, individual tuple buffers are
already pfree()'ed but the underlying memory blocks are not freed.
I understood this part but didn't understand the cases leading to this
problem. For example, if there is a large (and only) transaction in
the system that allocates many buffers for change records during
decoding, in the end, it should free memory for all the buffers
allocated in the transaction. So, wouldn't that free all the memory
chunks corresponding to the memory blocks allocated? My guess was that
we couldn't free all the chunks because there were small interleaving
transactions that allocated memory but didn't free it before the large
transaction ended.
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?

Reducing the size of the 8MB memory block would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.
Makes sense.
--
With Regards,
Amit Kapila.
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.

As per the discussion in these threads so far, what happened was that
there was huge memory fragmentation in rb->tup_context.
rb->tup_context uses GenerationContext with 8MB memory blocks. We
cannot free memory blocks until all memory chunks in the block are
freed. If there is a long-running transaction making changes, its
changes could be spread across many memory blocks and we end up not
being able to free memory blocks unless the long-running transaction
is evicted or completed. Since we don't account fragmentation, block
header size, nor chunk header size into per-transaction memory usage
(i.e. txn->size), rb->size could be less than
logical_decoding_work_mem but the actual allocated memory in the
context is much higher than logical_decoding_work_mem.

It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions which are even ended but the memory
corresponding to them is not released?

In a generation context, we can free a memory block only when all
memory chunks there are freed. Therefore, individual tuple buffers are
already pfree()'ed but the underlying memory blocks are not freed.

I understood this part but didn't understand the cases leading to this
problem. For example, if there is a large (and only) transaction in
the system that allocates many buffers for change records during
decoding, in the end, it should free memory for all the buffers
allocated in the transaction. So, wouldn't that free all the memory
chunks corresponding to the memory blocks allocated? My guess was that
we couldn't free all the chunks because there were small interleaving
transactions that allocated memory but didn't free it before the large
transaction ended.
We haven't actually checked with the person who reported the problem,
so this is just a guess, but I think there were concurrent
transactions, including long-running INSERT transactions. For example,
suppose a transaction that inserts 10 million rows and many OLTP-like
(short) transactions are running at the same time. The scenario I
thought of was that one 8MB Generation Context Block contains 1MB of
the large insert transaction changes, and the other 7MB contains short
OLTP transaction changes. If there are 256 such blocks, even after
all short transactions have completed, the Generation Context will
keep 2GB of memory allocated until we decode the commit record of the
large transaction, while rb->size will say only 256MB.
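A quick back-of-the-envelope check of this scenario (the 1MB/7MB split per block is the assumption stated above, not measured data):

```python
# Each 8MB block holds 1MB of live large-transaction changes and 7MB of
# already-freed short-transaction changes, so every block stays allocated.

BLOCK = 8 * 1024**2
PINNED_PER_BLOCK = 1 * 1024**2   # live chunks of the large transaction
blocks = 256

allocated = blocks * BLOCK              # what the process actually holds
accounted = blocks * PINNED_PER_BLOCK   # what rb->size reports

print(allocated // 1024**3)   # 2   (GB)
print(accounted // 1024**2)   # 256 (MB)
```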
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?

Reducing the size of the 8MB memory block would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.

makes sense.
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Here are three code bases that I used:
* head: current head code.
* per-tx-bump: the proposed idea (with a slight change: each sub- and
top-level transaction has its own bump memory context to store
decoded tuples).
* 8kb-mem-block: same as head except for changing the generation
memory block size from 8MB to 8kB.
And here are test cases and results:
1. Memory usage check
I've run the test that I shared before and checked the maximum amount
of memory allocated in the reorderbuffer context shown by
MemoryContextMemAllocated(). Here are results:
head: 2.1GB (while rb->size showing 43MB)
per-tx-bump: 50MB (while rb->size showing 43MB)
8kb-mem-block: 54MB (while rb->size showing 43MB)
I've confirmed that the excessive memory usage issue didn't happen in
the per-tx-bump case and the 8kb-mem-block cases.
2. Decoding many sub transactions
IIUC this kind of workload was a trigger to make us invent the
Generation Context for logical decoding[1]. A single top-level
transaction has 1M sub-transactions, each of which inserts a tuple.
Here are results:
head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)
The head and 8kb-mem-block cases showed similar results, whereas I see
a bit of regression with per-tx-bump. I think this is because of the
overhead of creating and deleting memory contexts for each
sub-transaction.
3. Decoding a big transaction
The next test case I did is to decode a single big transaction that
inserts 10M rows. I set logical_decoding_work_mem large enough to
avoid spilling behavior. Here are results:
head: 19859.113 ms (00:19.859)
per-tx-bump: 19422.308 ms (00:19.422)
8kb-mem-block: 19923.600 ms (00:19.924)
There were no big differences. FYI, I also checked the maximum memory
usage for this test case:
head: 1.53GB
per-tx-bump: 1.4GB
8kb-mem-block: 1.53GB
The per-tx-bump case used a bit less memory, probably thanks to the
bump memory contexts.
4. Decoding many short transactions.
The last test case I did is to decode a bunch of short pgbench
transactions (10k transactions). Here are results:
head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)
I see a trend similar to test case #2 above.
Overall, reducing the generation context memory block size to 8kB
seems to be promising. And using a bump memory context per transaction
didn't bring as much performance improvement as I expected in these
cases.
Regards,
[1]: /messages/by-id/20160706185502.1426.28143@wrigleys.postgresql.org
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Did you try any sizes between 8KB and 8MB? A 1000x reduction seems
like quite a large jump. There is additional overhead from having more
blocks: more malloc() work and more free() work when deleting a
context. It would be nice to see some numbers for all powers of 2
between 8KB and 8MB. I imagine the returns diminish as the block size
is reduced further.
Another alternative idea would be to defragment transactions with a
large number of changes after they grow to some predefined size.
Defragmentation would just be a matter of performing
palloc/memcpy/pfree for each change. If that's all done at once, all
the changes for that transaction would be contiguous in memory. If
you're smart about what the trigger point is for performing the
defragmentation then I imagine there's not much risk of performance
regression for the general case. For example, you might only want to
trigger it when MemoryContextMemAllocated() for the generation context
exceeds logical_decoding_work_mem by some factor and only do it for
transactions where the size of the changes exceeds some threshold.
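The palloc/memcpy/pfree idea above can be sketched with a standalone toy: walk a linked list of change records, copy each one into a fresh allocation, and free the old copy. This uses plain malloc/free on a hypothetical Change struct purely for illustration; in PostgreSQL it would operate on ReorderBufferChange entries inside the transaction's memory context.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a reorder buffer change record. */
typedef struct Change
{
    struct Change *next;
    size_t      len;
    char        data[];
} Change;

/*
 * Sketch of a defragmentation pass: reallocate every change so the
 * transaction's chunks become contiguous again, then free the old
 * copies.  In PostgreSQL this would be palloc/memcpy/pfree against
 * the reorder buffer's contexts rather than malloc/free.
 */
static Change *
defragment(Change *head)
{
    Change     *new_head = NULL;
    Change    **tail = &new_head;
    Change     *old = head;

    while (old != NULL)
    {
        Change     *next = old->next;
        Change     *copy = malloc(sizeof(Change) + old->len);

        memcpy(copy, old, sizeof(Change) + old->len);
        copy->next = NULL;
        *tail = copy;
        tail = &copy->next;
        free(old);
        old = next;
    }
    return new_head;
}
```

Done in one pass like this, the transaction's live changes end up packed into freshly allocated (and thus adjacent) chunks, which is the whole point of the trigger heuristics discussed above.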
Overall, reducing the generation context memory block size to 8kB
seems to be promising. And using the bump memory context per
transaction didn't bring performance improvement than I expected in
these cases.
Having a bump context per transaction would cause a malloc() every
time you create a new context and a free() each time you delete the
context when cleaning up the reorder buffer for the transaction.
GenerationContext has a "freeblock" field that it'll populate instead
of freeing a block. That block will be reused next time a new block is
required. For truly FIFO workloads that never need an oversized
block, all new blocks will come from the freeblock field once the
first block becomes unused. See the comments in GenerationFree(). I
expect this is why bump contexts are slower than the generation
context for your short transaction workload.
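As a rough illustration of the freeblock behaviour described above, here is a toy allocator that caches one emptied block for reuse instead of returning it to the OS. The names and structure are invented for the sketch; the real logic lives in PostgreSQL's generation.c and is more involved (keeper blocks, oversized chunks, etc.).

```c
#include <assert.h>
#include <stdlib.h>

/* Toy context: caches at most one empty block, like the "freeblock". */
typedef struct
{
    void       *freeblock;      /* one cached empty block, or NULL */
    size_t      blksize;
} ToyGenContext;

static void *
get_block(ToyGenContext *ctx)
{
    if (ctx->freeblock != NULL)
    {
        void       *blk = ctx->freeblock;

        ctx->freeblock = NULL;
        return blk;             /* reused: no malloc() needed */
    }
    return malloc(ctx->blksize);
}

static void
release_block(ToyGenContext *ctx, void *blk)
{
    if (ctx->freeblock == NULL)
        ctx->freeblock = blk;   /* keep one empty block for reuse */
    else
        free(blk);
}
```

In a strictly FIFO workload, once the oldest block empties it immediately becomes the freeblock and satisfies the next block request, so steady state needs no malloc() at all, which matches why a dedicated bump context per transaction (with its own malloc/free per context) can come out slower.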
David
On 2024/09/19 8:53, Masahiko Sawada wrote:
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Can we try reducing the size of
8MB memory blocks? The comment atop allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code. These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?

Reducing the size of the 8MB memory blocks would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.

Makes sense.
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.
Sounds good!
I believe the memory bloat issue in the reorder buffer should be
considered a bug. Since this solution isn't too invasive, I think
it's worth considering back-patching to older versions.
Then, if we find a better approach, we can apply it to v18 or later.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.

Did you try any sizes between 8KB and 8MB? 1000x reduction seems
quite large a jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.
Good idea.
Another alternative idea would be to defragment transactions with a
large number of changes after they grow to some predefined size.
Defragmentation would just be a matter of performing
palloc/memcpy/pfree for each change. If that's all done at once, all
the changes for that transaction would be contiguous in memory. If
you're smart about what the trigger point is for performing the
defragmentation then I imagine there's not much risk of performance
regression for the general case. For example, you might only want to
trigger it when MemoryContextMemAllocated() for the generation context
exceeds logical_decoding_work_mem by some factor and only do it for
transactions where the size of the changes exceeds some threshold.
After collecting the changes that exceed 'logical_decoding_work_mem',
one can choose to stream the transaction and free the changes to avoid
hitting this problem; however, we can use that or some other constant
to decide the point of defragmentation. The other point we need to
think about in this idea is whether we actually need any defragmentation at
all. This will depend on whether there are concurrent transactions
being decoded. This would require benchmarking to see the performance
impact.
--
With Regards,
Amit Kapila.
On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.

Did you try any sizes between 8KB and 8MB? 1000x reduction seems
quite large a jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.

Good idea.
Agreed.
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction, resulting in more malloc() calls with smaller memory
block sizes. Here are the results (an average of 3 executions):
8kB: 19747.870 ms
16kB: 19780.025 ms
32kB: 19760.575 ms
64kB: 19772.387 ms
128kB: 19825.385 ms
256kB: 19781.118 ms
512kB: 19808.138 ms
1MB: 19757.640 ms
2MB: 19801.429 ms
4MB: 19673.996 ms
8MB: 19643.547 ms
Interestingly, there were no noticeable differences in the execution
time. I've checked the number of allocated memory blocks in each case
and more blocks are allocated in smaller block size cases. For
example, when the logical decoding used the maximum memory (about
1.5GB), we allocated about 80k blocks in the 8kB memory block size
case and 80 blocks in the 8MB case.
The results could differ in other environments. I've attached the
patch that I used to change the memory block size via a GUC. It would
be appreciated if someone else could also run a similar test to see
the differences.
Another alternative idea would be to defragment transactions with a
large number of changes after they grow to some predefined size.
Defragmentation would just be a matter of performing
palloc/memcpy/pfree for each change. If that's all done at once, all
the changes for that transaction would be contiguous in memory. If
you're smart about what the trigger point is for performing the
defragmentation then I imagine there's not much risk of performance
regression for the general case. For example, you might only want to
trigger it when MemoryContextMemAllocated() for the generation context
exceeds logical_decoding_work_mem by some factor and only do it for
transactions where the size of the changes exceeds some threshold.
Interesting idea.
After collecting the changes that exceed 'logical_decoding_work_mem',
one can choose to stream the transaction and free the changes to avoid
hitting this problem, however, we can use that or some other constant
to decide the point of defragmentation. The other point we need to
think in this idea is whether we actually need any defragmentation at
all. This will depend on whether there are concurrent transactions
being decoded. This would require benchmarking to see the performance
impact.
The fact that we're using rb->size and txn->size to choose the
transactions to evict could make this idea less attractive. Even if we
defragment transactions, rb->size and txn->size don't change.
Therefore, it doesn't mean we can avoid evicting transactions when
logical_decoding_work_mem is exhausted; it just means the amount of
allocated memory might be reduced.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
rb_mem_block_size.patch
Hi,
On Mon, Sep 16, 2024 at 10:56 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Hi,
We have several reports that logical decoding uses memory much more
than logical_decoding_work_mem[1][2][3]. For instance in one of the
reports[1], even though users set logical_decoding_work_mem to
'256MB', a walsender process was killed by OOM because of using more
than 4GB memory.

I appreciate your work on logical replication and am interested in the thread.
I've heard this issue from others, and this has been the barrier to using logical
replication. Please let me know if I can help with benchmarking, other
measurements, etc.
Thank you for your interest in this patch. I've just shared some
benchmark results (with a patch) that could be different depending on
the environment[1]. It would be appreciated if you could also do
similar tests and share the results.
Regards,
[1]: /messages/by-id/CAD21AoAaN4jaJ=W+WprHvc0cGCf80SkiFQhRc6R+X3-05HAFqw@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):
I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation. You saw
bad results with 8MB blocks and good results with 8KB blocks. The
measure that's interesting here is the MemoryContextMemAllocated() for
the GenerationContext in question.
The fact that we're using rb->size and txn->size to choose the
transactions to evict could make this idea less attractive. Even if we
defragment transactions, rb->size and txn->size don't change.
Therefore, it doesn't mean we can avoid evicting transactions due to
full of logical_decoding_work_mem, but just mean the amount of
allocated memory might have been reduced.
I had roughly imagined that you'd add an extra field to store
txn->size when the last fragmentation was done and only defrag the
transaction when the ReorderBufferTXN->size is, say, double the last
size when the changes were last defragmented. Of course, if the first
defragmentation was enough to drop MemoryContextMemAllocated() below
the concerning threshold above logical_decoding_work_mem then further
defrags wouldn't be done, at least, until such times as the
MemoryContextMemAllocated() became concerning again. If you didn't
want to use a full Size variable for the defragmentation size, you could
just store a uint8 recording which power of 2 ReorderBufferTXN->size
was when it was last defragmented. There are plenty of holes in that
struct that could be filled without enlarging the struct.
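The uint8 idea could look roughly like this (a sketch of the trigger logic with invented names, not a patch): store floor(log2(txn->size)) at the last defragmentation, and re-defragment once the size has at least doubled.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(size)); returns 0 for size <= 1 */
static uint8_t
size_log2(uint64_t size)
{
    uint8_t     lg = 0;

    while (size >>= 1)
        lg++;
    return lg;
}

/*
 * Hypothetical trigger: defragment again only once the transaction
 * has at least doubled in size since the last defragmentation,
 * remembering just one byte rather than a full Size.
 */
static int
needs_defrag(uint64_t cur_size, uint8_t last_defrag_lg2)
{
    return size_log2(cur_size) >= (uint8_t) (last_defrag_lg2 + 1);
}
```

Using the power of 2 rather than the exact size coarsens the "doubled" test slightly, but it keeps the bookkeeping to a single spare byte in the struct.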
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue. Unfortunately, AllocSet could
suffer from fragmentation issues too if lots of chunks end up on
freelists that are never reused, so using another context type might
just create a fragmentation issue for a different workload.
David
On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.

Did you try any sizes between 8KB and 8MB? 1000x reduction seems
quite large a jump. There is additional overhead from having more
blocks. It means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers with all powers of 2
between 8KB and 8MB. I imagine the returns are diminishing as the
block size is reduced further.

Good idea.
Agreed.
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):

8kB: 19747.870 ms
16kB: 19780.025 ms
32kB: 19760.575 ms
64kB: 19772.387 ms
128kB: 19825.385 ms
256kB: 19781.118 ms
512kB: 19808.138 ms
1MB: 19757.640 ms
2MB: 19801.429 ms
4MB: 19673.996 ms
8MB: 19643.547 ms

Interestingly, there were no noticeable differences in the execution
time. I've checked the number of allocated memory blocks in each case
and more blocks are allocated in smaller block size cases. For
example, when the logical decoding used the maximum memory (about
1.5GB), we allocated about 80k blocks in 8kb memory block size case
and 80 blocks in 8MB memory block cases.
What exactly do these test results mean? Do you want to prove that
there is no regression by using smaller block sizes?
--
With Regards,
Amit Kapila.
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):

I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation.
+1. That test will be helpful.
The fact that we're using rb->size and txn->size to choose the
transactions to evict could make this idea less attractive. Even if we
defragment transactions, rb->size and txn->size don't change.
Therefore, it doesn't mean we can avoid evicting transactions due to
full of logical_decoding_work_mem, but just mean the amount of
allocated memory might have been reduced.

I had roughly imagined that you'd add an extra field to store
txn->size when the last fragmentation was done and only defrag the
transaction when the ReorderBufferTXN->size is, say, double the last
size when the changes were last defragmented. Of course, if the first
defragmentation was enough to drop MemoryContextMemAllocated() below
the concerning threshold above logical_decoding_work_mem then further
defrags wouldn't be done, at least, until such times as the
MemoryContextMemAllocated() became concerning again. If you didn't
want to a full Size variable for the defragmentation size, you could
just store a uint8 to store which power of 2 ReorderBufferTXN->size
was when it was last defragmented. There are plenty of holds in that
struct that could be filled without enlarging the struct.

In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.
Right, and I am also slightly afraid that this may cause some
regression in other cases where defrag wouldn't help.
--
With Regards,
Amit Kapila.
Dear Sawada-san,
Thank you for your interest in this patch. I've just shared some
benchmark results (with a patch) that could be different depending on
the environment[1]. I would be appreciated if you also do similar
tests and share the results.
Okay, I did similar tests; the attached script is the test runner.
rb_mem_block_size was changed from 8kB to 8MB. The table below shows
the results (in milliseconds). Each cell is the average of 5 runs.
==========
8kB 12877.4
16kB 12829.1
32kB 11793.3
64kB 13134.4
128kB 13353.1
256kB 11664.0
512kB 12603.4
1MB 13443.8
2MB 12469.0
4MB 12651.4
8MB 12381.4
==========
The standard deviation of measurements was 100-500 ms; there were no
noticeable differences in my environment either.
Also, I've checked the statistics of the generation context, and
confirmed the number of allocated blocks is about 1000x higher if the
block size is changed from 8MB to 8kB[1]. Part of the difference in
actual used space comes from the header of each block. Each block has
attributes for management, so the total usage becomes larger with the
number of blocks.

[1]:
8kB
Tuples: 724959232 total in 88496 blocks (10000000 chunks); 3328 free (0 chunks); 724955904 used
Grand total: 724959232 bytes in 88496 blocks; 3328 free (0 chunks); 724955904 used
8MB
Tuples: 721420288 total in 86 blocks (10000000 chunks); 1415344 free (0 chunks); 720004944 used
Grand total: 721420288 bytes in 86 blocks; 1415344 free (0 chunks); 720004944 used
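As a quick cross-check of those statistics, the block counts follow directly from dividing the context's total bytes by the block size (a minimal arithmetic sketch, ignoring any oversized-chunk blocks):

```c
#include <assert.h>
#include <stdint.h>

/* Expected number of fixed-size blocks for a given total allocation. */
static uint64_t
blocks_for(uint64_t total_bytes, uint64_t block_size)
{
    return total_bytes / block_size;
}
```

724959232 bytes at 8kB blocks gives exactly 88496 blocks, and 721420288 bytes at 8MB blocks gives exactly 86, matching the quoted context stats.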
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've done other benchmarking tests while changing the memory block
sizes from 8kB to 8MB. I measured the execution time of logical
decoding of one transaction that inserted 10M rows. I set
logical_decoding_work_mem large enough to avoid spilling behavior. In
this scenario, we allocate many memory chunks while decoding the
transaction and resulting in calling more malloc() in smaller memory
block sizes. Here are results (an average of 3 executions):I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation.

+1. That test will be helpful.
Sure. Here are results of peak memory usage in bytes reported by
MemoryContextMemAllocated() (when rb->size shows 43MB):
8kB: 52,371,328
16kB: 52,887,424
32kB: 53,428,096
64kB: 55,099,264
128kB: 86,163,328
256kB: 149,340,032
512kB: 273,334,144
1MB: 523,419,520
2MB: 1,021,493,120
4MB: 1,984,085,888
8MB: 2,130,886,528
Probably we can increase the size to 64kB?
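The steep growth with larger blocks matches the fragmentation mechanism discussed earlier in the thread: a block can only be released once every chunk in it is freed, so interleaved transactions can pin whole blocks with a single surviving chunk each. A toy simulation of that worst case (illustrative only, not the GenerationContext implementation):

```c
#include <assert.h>
#include <stdlib.h>

#define CHUNKS_PER_BLOCK 2

typedef struct
{
    int         live_chunks;    /* chunks still in use in this block */
} ToyBlock;

/*
 * Interleave chunk allocations from two transactions A and B across
 * nblocks blocks (one chunk from each per block), then evict A.
 * Returns how many blocks could actually be released.
 */
static int
blocks_freed_after_evicting_a(int nblocks)
{
    ToyBlock   *blocks = calloc(nblocks, sizeof(ToyBlock));
    int         freed = 0;
    int         i;

    /* each block receives one chunk from A and one from B */
    for (i = 0; i < nblocks; i++)
        blocks[i].live_chunks = CHUNKS_PER_BLOCK;

    /* evict A: one chunk per block goes away... */
    for (i = 0; i < nblocks; i++)
        blocks[i].live_chunks--;

    /* ...but a block is only releasable when it is completely empty */
    for (i = 0; i < nblocks; i++)
        if (blocks[i].live_chunks == 0)
            freed++;

    free(blocks);
    return freed;
}
```

Evicting A releases zero blocks here, so retained memory scales with blocks-touched times block size rather than with the live data, which is why rb->size of 43MB can pin ~2GB with 8MB blocks but only ~52MB with 8kB blocks.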
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.

Right, and I am also slightly afraid that this may cause some
regression in other cases where defrag wouldn't help.
Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
fragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
David
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue.Right, and I am also slightly afraid that this may not cause some
regression in other cases where defrag wouldn't help.Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
fragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.
With more information tracking, we could avoid some non-problematic
transactions but still, it would be difficult to predict that we
didn't harm many cases because to make the memory non-contiguous, we
only need a few interleaving small transactions. We can try to think
of ideas for implementing defragmentation in our code if we first can
prove that smaller block sizes cause problems.
With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.
Good point. We need extensive testing on different platforms, as you
suggest, to verify if smaller block sizes caused any regressions.
--
With Regards,
Amit Kapila.