Use-after-free in reorderbuffer.c for INSERT ON CONFLICT
Hi all,
Ethan Mertz (colleague, in CC) has found a bug in reorderbuffer.c when
processing a REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM change, based
on the data gathered from a customer issue. The bug is a
use-after-free, causing random crashes, that can be summarized like
that:
1) "specinsert" is saved as a new change to process:
change = specinsert;
change->action = REORDER_BUFFER_CHANGE_INSERT;
2) A bit later on the change and specinsert are freed when we are done
with the speculative insert:
change_done:
/*
* If speculative insertion was confirmed, the record
* isn't needed anymore.
*/
if (specinsert != NULL)
{
ReorderBufferFreeChange(rb, specinsert, true);
specinsert = NULL;
}
3) Finally, a couple of lines down, we then do the following things,
and attempt to use a reference to change->lsn that has been freed:
#define CHANGES_THRESHOLD 100
if (++changes_count >= CHANGES_THRESHOLD)
{
rb->update_progress_txn(rb, txn, change->lsn);
changes_count = 0;
}
This issue has been introduced in 8c58624df462, down to REL_16_STABLE.
Committer of the related change is added in CC (aka Amit K.).
Sawada-san (also in CC) has come up with an imaginative trick to
easily reproduce the issue, as of:
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2595,7 +2595,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
* overhead, but testing showed there is no noticeable overhead if
* we do it after every ~100 changes.
*/
-#define CHANGES_THRESHOLD 100
+#define CHANGES_THRESHOLD 1
if (++changes_count >= CHANGES_THRESHOLD)
{
When running the test "ddl" from test_decoding, an instance running
valgrind comes up with that, pointing immediately at the problem:
==28821== Invalid read of size 8
==28821== at 0x222B2D8: ReorderBufferProcessTXN (reorderbuffer.c:2602)
==28821== by 0x222C6D6: ReorderBufferReplay (reorderbuffer.c:2864)
==28821== by 0x222C754: ReorderBufferCommit (reorderbuffer.c:2888)
==28821== by 0x21EDA48: DecodeCommit (decode.c:743)
==28821== by 0x21EA54C: xact_decode (decode.c:242)
==28821== by 0x21E98AA: LogicalDecodingProcessRecord (decode.c:116)
==28821== by 0x22093A4: pg_logical_slot_get_changes_guts (logicalfuncs.c:266)
==28821== by 0x22097BC: pg_logical_slot_get_changes (logicalfuncs.c:333)
==28821== by 0x1AF25D3: ExecMakeTableFunctionResult (execSRF.c:234)
==28821== by 0x1B65606: FunctionNext (nodeFunctionscan.c:94)
==28821== by 0x1AF7306: ExecScanFetch (execScan.h:126)
==28821== by 0x1AF748D: ExecScanExtended (execScan.h:187)
==28821== Address 0x107fae08 is 7,656 bytes inside a recently re-allocated block of size 8,192 alloc'd
==28821== at 0x4844818: malloc (vg_replace_malloc.c:446)
==28821== by 0x2AD3573: SlabAllocFromNewBlock (slab.c:565)
==28821== by 0x2AD3EF7: SlabAlloc (slab.c:656)
==28821== by 0x2AC5282: MemoryContextAlloc (mcxt.c:1237)
==28821== by 0x221E17A: ReorderBufferAllocChange (reorderbuffer.c:511)
==28821== by 0x222F64B: ReorderBufferAddNewTupleCids (reorderbuffer.c:3444)
==28821== by 0x2248439: SnapBuildProcessNewCid (snapbuild.c:700)
==28821== by 0x21EB9D8: heap2_decode (decode.c:437)
==28821== by 0x21E98AA: LogicalDecodingProcessRecord (decode.c:116)
==28821== by 0x22093A4: pg_logical_slot_get_changes_guts (logicalfuncs.c:266)
==28821== by 0x22097BC: pg_logical_slot_get_changes (logicalfuncs.c:333)
==28821== by 0x1AF25D3: ExecMakeTableFunctionResult (execSRF.c:234)
A simple solution suggested by Ethan would be to use the "prev_lsn"
guessed from the change at the beginning of the loop, rather than the
problematic change->lsn. But that does not seem completely right to
me because we can switch to "specinsert" as the change to process,
hence wouldn't we want to call update_progress_txn() based on the LSN
of the actual change we are looking at? All that leads me to the
attached. Note that I am not the most familiar with this area of the
code, so please take my arguments with a grain of salt.
Comments and thoughts are welcome.
--
Michael
Attachments:
0001-Fix-use-after-free-in-reorderbuffer.c-with-ON-CONFLI.patchtext/x-diff; charset=us-asciiDownload+4-2
On Wed, Jul 30, 2025 at 11:44 PM Michael Paquier <michael@paquier.xyz> wrote:
A simple solution suggested by Ethan would be to use the "prev_lsn"
guessed from the change at the beginning of the loop, rather than the
problematic change->lsn. But that does not seem completely right to
me because we can switch to "specinsert" as the change to process,
hence wouldn't we want to call update_progress_txn() based on the LSN
of the actual change we are looking at? All that leads me to the
attached.
Thank you for preparing the patch!
Yes, I think it's sensible to keep the current behavior. So the patch
looks good to me.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 31, 2025 at 09:51:21AM -0700, Masahiko Sawada wrote:
Yes, I think it's sensible to keep the current behavior. So the patch
looks good to me.
Thanks for the review. I am planning to apply that at the beginning
of next week, in time for the next set of minor releases.
--
Michael
On Thu, Jul 31, 2025 at 12:14 PM Michael Paquier <michael@paquier.xyz> wrote:
Ethan Mertz (colleague, in CC) has found a bug in reorderbuffer.c when
processing a REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM change, based
on the data gathered from a customer issue. The bug is a
use-after-free, causing random crashes, that can be summarized like
that:
...
A simple solution suggested by Ethan would be to use the "prev_lsn"
guessed from the change at the beginning of the loop, rather than the
problematic change->lsn. But that does not seem completely right to
me because we can switch to "specinsert" as the change to process,
hence wouldn't we want to call update_progress_txn() based on the LSN
of the actual change we are looking at?
We still won't be able to capture the latest LSN in case of
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT. IIRC, update_progress_txn
is used to keep the client active so that when many changes are
skipped, the client doesn't timeout. In this case, it seems okay to
use prev_lsn as well.
--
With Regards,
Amit Kapila.
On Fri, Aug 01, 2025 at 10:03:14AM +0530, Amit Kapila wrote:
We still won't be able to capture the latest LSN in case of
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT. IIRC, update_progress_txn
is used to keep the client active so that when many changes are
skipped, the client doesn't timeout. In this case, it seems okay to
use prev_lsn as well.
I am not quite sure to follow your argument here. In the case of a
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT change, we would use
change->lsn, which is in the case of the patch and HEAD the same
thing: prev_lsn. So the logic is unchanged in the case, isn't it?
--
Michael
On Fri, Aug 1, 2025 at 10:22 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Aug 01, 2025 at 10:03:14AM +0530, Amit Kapila wrote:
We still won't be able to capture the latest LSN in case of
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT. IIRC, update_progress_txn
is used to keep the client active so that when many changes are
skipped, the client doesn't timeout. In this case, it seems okay to
use prev_lsn as well.I am not quite sure to follow your argument here. In the case of a
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT change, we would use
change->lsn, which is in the case of the patch and HEAD the same
thing: prev_lsn.
I mean to say we can use the same change LSN both for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM and
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT. Right now, for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we switch the change to
specinsert which would have a prior LSN value (say, if confirm/abort
record will have value, 1000, it will be 800 or so) but we should
still use 1000 for update_progress_txn. The update_progress_txn() is
helpful when such an insert is skipped by a plugin (in this case
pgouput) and in that case, we would require the latest LSN processed
by reorder buffer to pass to it. We use it to send a keep_alive to a
client with the last LSN processed.
--
With Regards,
Amit Kapila.
On Fri, Aug 01, 2025 at 03:30:17PM +0530, Amit Kapila wrote:
I mean to say we can use the same change LSN both for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM and
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT. Right now, for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we switch the change to
specinsert which would have a prior LSN value (say, if confirm/abort
record will have value, 1000, it will be 800 or so) but we should
still use 1000 for update_progress_txn. The update_progress_txn() is
helpful when such an insert is skipped by a plugin (in this case
pgouput) and in that case, we would require the latest LSN processed
by reorder buffer to pass to it. We use it to send a keep_alive to a
client with the last LSN processed.
Ah, OK, I've missed your point then. It's kind of an optimization in
itself because we would be a bit more aggressive with the updates, but
I agree to do that in the scope of this fix. The updated attached
uses prev_lsn for the job, for both the ABORT and CONFIRM cases,
meaning a one-liner.
--
Michael
Attachments:
v2-0001-Fix-use-after-free-in-reorderbuffer.c-with-ON-CON.patchtext/x-diff; charset=us-asciiDownload+1-2
On Fri, Aug 1, 2025 at 4:45 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Aug 01, 2025 at 03:30:17PM +0530, Amit Kapila wrote:
I mean to say we can use the same change LSN both for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM and
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT. Right now, for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we switch the change to
specinsert which would have a prior LSN value (say, if confirm/abort
record will have value, 1000, it will be 800 or so) but we should
still use 1000 for update_progress_txn. The update_progress_txn() is
helpful when such an insert is skipped by a plugin (in this case
pgouput) and in that case, we would require the latest LSN processed
by reorder buffer to pass to it. We use it to send a keep_alive to a
client with the last LSN processed.Ah, OK, I've missed your point then. It's kind of an optimization in
itself because we would be a bit more aggressive with the updates, but
I agree to do that in the scope of this fix. The updated attached
uses prev_lsn for the job, for both the ABORT and CONFIRM cases,
meaning a one-liner.
I assumed that behavior was intentional of the original patch but I'm
fine with the new version patch too if it's not.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Aug 1, 2025 at 5:15 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Aug 01, 2025 at 03:30:17PM +0530, Amit Kapila wrote:
I mean to say we can use the same change LSN both for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM and
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT. Right now, for
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we switch the change to
specinsert which would have a prior LSN value (say, if confirm/abort
record will have value, 1000, it will be 800 or so) but we should
still use 1000 for update_progress_txn. The update_progress_txn() is
helpful when such an insert is skipped by a plugin (in this case
pgouput) and in that case, we would require the latest LSN processed
by reorder buffer to pass to it. We use it to send a keep_alive to a
client with the last LSN processed.Ah, OK, I've missed your point then. It's kind of an optimization in
itself because we would be a bit more aggressive with the updates, but
I agree to do that in the scope of this fix. The updated attached
uses prev_lsn for the job, for both the ABORT and CONFIRM cases,
meaning a one-liner.
The proposed change looks good to me.
--
With Regards,
Amit Kapila.
On Sat, Aug 02, 2025 at 08:57:10AM +0530, Amit Kapila wrote:
The proposed change looks good to me.
Thanks. I've learnt that my schedule is going to be a bit unstable at
the beginning of the next week and I had a window today, so done now.
--
Michael