BUG #18369: logical decoding core on AssertTXNLsnOrder()

Started by PG Bug reporting form | about 2 years ago | 32 messages | bugs
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18369
Logged by: haiyang li
Email address: ocean_li_996@163.com
PostgreSQL version: 14.11
Operating system: centos7 5.10.84 x86_64
Description:

While testing the logical replication module, we encountered a core dump.
The stack trace from the core file is:

##
[New LWP 113877]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres(5432): normal_user dml_full0
11.164.97.22[37210]SELECT ".
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fe2074a9277 in raise () from /lib64/libc.so.6
#0 0x00007fe2074a9277 in raise () from /lib64/libc.so.6
#1 0x00007fe2074aa968 in abort () from /lib64/libc.so.6
#2 0x00000000010f8d67 in ExceptionalCondition (conditionName=0x17edb50
"!(prev_first_lsn < cur_txn->first_lsn)", errorType=0x17ed93c
"FailedAssertion", fileName=0x17ed990 "reorderbuffer.c", lineNumber=762) at
assert.c:46
#3 0x0000000000e6145c in AssertTXNLsnOrder (rb=0x4558060) at
reorderbuffer.c:762
#4 0x0000000000e60ead in ReorderBufferTXNByXid (rb=0x4558060, xid=19937,
create=true, is_new=0x0, lsn=12640708096, create_as_top=true) at
reorderbuffer.c:610
#5 0x0000000000e6415c in ReorderBufferXidSetCatalogChanges (rb=0x4558060,
xid=19937, lsn=12640708096) at reorderbuffer.c:2298
#6 0x0000000000e6bb4b in SnapBuildXidSetCatalogChanges (builder=0x456e160,
xid=19933, subxcnt=17, subxacts=0x44e08d8, lsn=12640708096) at
snapbuild.c:2172
#7 0x0000000000e54fb3 in DecodeCommit (ctx=0x452bc10, buf=0x7ffe4f03af70,
parsed=0x7ffe4f03ae20, xid=19933) at decode.c:631
#8 0x0000000000e54556 in DecodeXactOp (ctx=0x452bc10, buf=0x7ffe4f03af70)
at decode.c:268
#9 0x0000000000e54124 in LogicalDecodingProcessRecord (ctx=0x452bc10,
record=0x452bed0) at decode.c:120
#10 0x0000000000e5adc6 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffe4f03b2d0, confirm=true, binary=false) at logicalfuncs.c:329
#11 0x0000000000e5af76 in pg_logical_slot_get_changes
(fcinfo=0x7ffe4f03b2d0) at logicalfuncs.c:393
...
#33 0x0000000000f1c0b9 in exec_simple_query (query_string=0x425bb80 "SELECT
* FROM pg_logical_slot_get_changes("test_logical_decode_slot_0", NULL,
NULL)") at postgres.c:1570
...
(gdb) f 3
#3 0x0000000000e6145c in AssertTXNLsnOrder (rb=0x4558060) at
reorderbuffer.c:762
(gdb) p /x MyReplicationSlot->data.restart_lsn
$1 = 0x2f171b8a8
(gdb) p /x cur_txn->first_lsn
$2 = 0x2f171e600
(gdb) p NInitialRunningXacts
$3 = 1
(gdb) p *InitialRunningXacts
$4 = 19933
##

As the trace shows, the assertion failed in AssertTXNLsnOrder().
Moreover, since NInitialRunningXacts != 0, the failure occurred on a
repeated call of pg_logical_slot_get_changes().

1) The WAL records from restart_lsn to the corresponding lsn when the issue
occurred,
2) personal analysis of the problem,
3) the steps to reproduce the issue,
4) personal proposed solution
will be posted later under this thread.

#2ocean_li_996
ocean_li_996@163.com
In reply to: PG Bug reporting form (#1)
Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

At 2024-02-28 15:53:30, "PG Bug reporting form" <noreply@postgresql.org> wrote:

1) The WAL records from restart_lsn to the corresponding lsn when the issue
occurred,
2) personal analysis of the problem,
3) the steps to reproduce the issue,
4) personal proposed solution
will be posted later under this thread.

1) The WAL records from restart_lsn up to the LSN at which the issue occurred are provided in attachment 1.

2) As shown in 1), some invalidation messages are generated in top-level xact 19933. After decoding restarts, these invalidation messages cause xact 19933 and its subtransaction(s) to be marked as containing catalog changes while its commit record is processed (see SnapBuildXidSetCatalogChanges()). In this step, the subxacts that have never been processed before are added to the ReorderBuffer with the same first_lsn as the top-level xact. The check in AssertTXNLsnOrder() then fails if there is more than one such subxact.

3) A patch to reproduce the issue is provided in attachment 2. DML on a temporary table consumes an xid but does not write any WAL record, except when it is the first subxact of the top-level xact (which logs an ASSIGNMENT record). So we use DML on a temporary table to generate two "never processed before" subxacts in one top-level xact.

4) Since the xact is already known to be a subxact before it is added to the ReorderBuffer, I think an appropriate fix is to extend ReorderBufferXidSetCatalogChanges() with an is_top parameter indicating whether the xact is a top-level xact.
A subxact would then not be added to the toplevel_by_lsn list and would not undergo the AssertTXNLsnOrder() check. Of course, a check is then needed when removing a node from toplevel_by_lsn to verify that the node is actually in the list.
The fix patch is provided in attachment 3.
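
The idea in 4) can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not the attached patch and not the PostgreSQL source: `SketchTxn`, `SketchBuffer` and all `sketch_*` functions are hypothetical names, and an array kept in creation order stands in for the toplevel_by_lsn list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_TXNS 32

/* Hypothetical, heavily simplified stand-in for a ReorderBufferTXN. */
typedef struct SketchTxn
{
    uint32_t xid;
    uint64_t first_lsn;
    bool     in_toplevel_list;  /* is the entry linked into the by-LSN list? */
} SketchTxn;

typedef struct SketchBuffer
{
    SketchTxn txns[SKETCH_MAX_TXNS];  /* kept in creation order */
    size_t    ntxns;
} SketchBuffer;

/* Create an entry; only top-level xacts join the LSN-ordered list. */
static SketchTxn *
sketch_add_txn(SketchBuffer *rb, uint32_t xid, uint64_t lsn, bool is_top)
{
    SketchTxn *txn;

    if (rb->ntxns >= SKETCH_MAX_TXNS)
        return NULL;
    txn = &rb->txns[rb->ntxns++];
    txn->xid = xid;
    txn->first_lsn = lsn;
    txn->in_toplevel_list = is_top;
    return txn;
}

/* Strict ordering check, applied to list members only. */
static bool
sketch_lsn_order_ok(const SketchBuffer *rb)
{
    bool     have_prev = false;
    uint64_t prev_first_lsn = 0;

    for (size_t i = 0; i < rb->ntxns; i++)
    {
        const SketchTxn *cur = &rb->txns[i];

        if (!cur->in_toplevel_list)
            continue;
        if (have_prev && !(prev_first_lsn < cur->first_lsn))
            return false;
        prev_first_lsn = cur->first_lsn;
        have_prev = true;
    }
    return true;
}

/* Unlink on cleanup, but only if the entry is actually in the list. */
static bool
sketch_unlink_from_toplevel(SketchTxn *txn)
{
    if (!txn->in_toplevel_list)
        return false;
    txn->in_toplevel_list = false;
    return true;
}
```

In this model, two subxacts created with the same LSN pass the check when added with is_top = false, and fail it when both are added as top-level entries.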

Thanks
Haiyang Li

Attachments:

xid_19933_wal_record.txt (text/plain)
v1-0001-Testcase-Coredump-On-AssertTXNLsnOrder.patch (application/octet-stream, +11 -0)
v1-0002-Fix-Coredump-On-AssertTXNLsnOrder.patch (application/octet-stream, +12 -10)
#3ocean_li_996
ocean_li_996@163.com
In reply to: ocean_li_996 (#2)
Re:Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

This issue exists in PG 12-15.


#4Alexander Lakhin
exclusion@gmail.com
In reply to: ocean_li_996 (#3)
Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

Hello Haiyang Li,

28.02.2024 11:20, ocean_li_996 wrote:

This issue exists in PG 12 -15.

At 2024-02-28 15:57:37, "ocean_li_996" <ocean_li_996@163.com> wrote:

At 2024-02-28 15:53:30, "PG Bug reporting form" <noreply@postgresql.org> wrote:

1) The WAL records from restart_lsn to the corresponding lsn when the issue
occurred,
2) personal analysis of the problem,
3) the steps to reproduce the issue,
4) personal proposed solution
will be posted later under this thread.

Please see the similar issue discussed last year:
/messages/by-id/f158d9ca-2057-2836-a522-0b1278be5a53@gmail.com

With your patch applied (on REL_14_STABLE) I still get:
TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1623, PID: 92772)

when running make -s installcheck-force -C contrib/test_decoding...
as specified in that message.
(You might need to disable REGRESS tests in the Makefile to reach
isolation tests.)

Best regards,
Alexander

#5ocean_li_996
ocean_li_996@163.com
In reply to: Alexander Lakhin (#4)
Re:Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

Hi Alexander,

At 2024-02-28 17:00:00, "Alexander Lakhin" <exclusion@gmail.com> wrote:

Please see the similar issue discussed last year:
/messages/by-id/f158d9ca-2057-2836-a522-0b1278be5a53@gmail.com

Well, I have to say that the whole thread is a bit long. AFAIC, the two issues exhibit the same symptoms, but they occurred in different scenarios. The patch I provided may not solve the problem you're referring to.

With your patch applied (on REL_14_STABLE) I still get:
TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1623, PID: 92772)

when running make -s installcheck-force -C contrib/test_decoding...
as specified in that message.
(You might need to disable REGRESS tests in the Makefile to reach
isolation tests.)

I'm not sure I fully understand your point. I disabled the REGRESS tests in test_decoding/Makefile and then manually ran the command "make -s installcheck-force -C contrib/test_decoding" a few times. I didn't encounter the issue you mentioned. Is this problem consistently reproducible in your environment? BTW, the issue mentioned in [1] is more similar to your problem. And the patch was not applied to v14. Maybe it is another issue.

[1]: /messages/by-id/7e4d4a80-3e3c-231f-f886-6cada2aa582b@gmail.com

Thanks
Haiyang Li

#6Alexander Lakhin
exclusion@gmail.com
In reply to: ocean_li_996 (#5)
Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

Hi Haiyang,

29.02.2024 05:25, ocean_li_996 wrote:

With your patch applied (on REL_14_STABLE) I still get:
TRAP: FailedAssertion("builder->next_phase_at == InvalidTransactionId", File: "snapbuild.c", Line: 1623, PID: 92772)

when running make -s installcheck-force -C contrib/test_decoding...
as specified in that message.
(You might need to disable REGRESS tests in the Makefile to reach
isolation tests.)

I'm not sure I fully understand your point. I disabled the REGRESS tests in test_decoding/Makefile and then manually
ran the command "make -s installcheck-force -C contrib/test_decoding" a few times. I didn't encounter the issue you
mentioned. Is this problem consistently reproducible in your environment? BTW, the issue mentioned in [1] is more
similar to your problem. And the patch was not applied to v14. Maybe it is another issue.

You can try the following script (assuming a server with the test_decoding
module installed is running):
rm -rf contrib/test_decoding_*
numclients=5
for ((c=1;c<=numclients;c++)); do
   cp -r contrib/test_decoding contrib/test_decoding_$c
   # Disable the REGRESS tests so that the isolation tests are reached
   sed "s/REGRESS = /# REGRESS =/" -i contrib/test_decoding_$c/Makefile
   # Use independent slots
   sed "s/isolation_slot/isolation_slot_$c/" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec
   # Repeat the last permutation 50 times
   sed "$(printf '$p; %.0s' `seq 50`)" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec
done
for ((c=1;c<=numclients;c++)); do
   EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 >"installcheck-$c.log" 2>&1 &
done
wait

Though that's really not directly related to the current issue (sorry for
the wrong direction, my point was that there are still living bugs in this
area).

I've found that your added test case fails on REL_15_STABLE starting from
b793a416b (6b77048e5 on REL_14_STABLE), so it looks like this is a new
defect introduced in REL_14_STABLE, REL_15_STABLE recently with the fix for
bug #18280.

As to REL_13_STABLE/REL_12_STABLE the failure reproduced starting from
commits 38dbaaf27/02600886c, a result of the aforementioned discussion:
/messages/by-id/CAA4eK1Lx=g09z2k9Teq9ca1eRzfpfxJwFdjyHNwgEKv69KWhrQ@mail.gmail.com

Best regards,
Alexander

#7ocean_li_996
ocean_li_996@163.com
In reply to: Alexander Lakhin (#6)
Re:Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

At 2024-02-29 18:00:00, "Alexander Lakhin" <exclusion@gmail.com> wrote:

You can try the following script (assuming a server with the test_decoding
module installed is running):
rm -rf contrib/test_decoding_*
numclients=5
for ((c=1;c<=numclients;c++)); do
cp -r contrib/test_decoding contrib/test_decoding_$c
sed "s/REGRESS = /# REGRESS =/" -i contrib/test_decoding_$c/Makefile
sed "s/isolation_slot/isolation_slot_$c/" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Use independent slots
sed "$(printf '$p; %.0s' `seq 50`)" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Repeat the last permutation 50 times
done
for ((c=1;c<=numclients;c++)); do
EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 >"installcheck-$c.log" 2>&1 &
done
wait

Thanks! I ran the script (with numclients = 50) four times each, both before and after applying the changes on REL_14_STABLE.
Unfortunately, I wasn't able to reproduce the issue you mentioned.

Though that's really not directly related to the current issue (sorry for
the wrong direction, my point was that there are still living bugs in this
area).

Got it! I concur with your statement. OTOH, there is no evidence to indicate that the issue is a result of
v1-0002-Fix-Coredump-On-AssertTXNLsnOrder.patch.

I've found that your added test case fails on REL_15_STABLE starting from
b793a416b (6b77048e5 on REL_14_STABLE), so it looks like this is a new
defect introduced in REL_14_STABLE, REL_15_STABLE recently with the fix for
bug #18280.

Oops, I forgot to mention this in the email. Indeed, the test I provided cannot reproduce the issue before the fix for bug #18280.
While I haven't tested it, I believe we could get another reproducing test with a little more complexity (such as needing two transactions
in sequence).

As to REL_13_STABLE/REL_12_STABLE the failure reproduced starting from
commits 38dbaaf27/02600886c, a result of the aforementioned discussion:
/messages/by-id/CAA4eK1Lx=g09z2k9Teq9ca1eRzfpfxJwFdjyHNwgEKv69KWhrQ@mail.gmail.com

Indeed.

Back to the issue in this thread, are there any suggestions or discussion on the fix patch?

Best Regards
Haiyang Li.

#8Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Alexander Lakhin (#6)
RE: BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Alexander,

I also ran your script after applying the patches on REL_14_STABLE, but I got another failure.
I have not analyzed it yet, so I am sharing the result as-is.
Did I miss something?

```
#0 0x00007f4f033de387 in raise () from /lib64/libc.so.6
#1 0x00007f4f033dfa78 in abort () from /lib64/libc.so.6
#2 0x0000000000b08152 in ExceptionalCondition (conditionName=0xcc8228 "prev_first_lsn < cur_txn->first_lsn",
errorType=0xcc8024 "FailedAssertion", fileName=0xcc8070 "reorderbuffer.c", lineNumber=916) at assert.c:69
#3 0x00000000008e01ad in AssertTXNLsnOrder (rb=0x19a95c0) at reorderbuffer.c:916
#4 0x00000000008dfae8 in ReorderBufferTXNByXid (rb=0x19a95c0, xid=1753, create=true, is_new=0x0, lsn=37430336,
create_as_top=true) at reorderbuffer.c:669
#5 0x00000000008e3e04 in ReorderBufferAddNewTupleCids (rb=0x19a95c0, xid=1753, lsn=37430336, node=..., tid=..., cmin=1,
cmax=4294967295, combocid=4294967295) at reorderbuffer.c:3200
#6 0x00000000008e9018 in SnapBuildProcessNewCid (builder=0x19af5f0, xid=1754, lsn=37430336, xlrec=0x197db38)
at snapbuild.c:823
#7 0x00000000008d2616 in DecodeHeap2Op (ctx=0x1999550, buf=0x7ffcc39d7310) at decode.c:470
#8 0x00000000008d1df5 in LogicalDecodingProcessRecord (ctx=0x1999550, record=0x1999910) at decode.c:150
#9 0x00000000008d9782 in pg_logical_slot_get_changes_guts (fcinfo=0x19874b0, confirm=true, binary=false)
at logicalfuncs.c:296
#10 0x00000000008d98b7 in pg_logical_slot_get_changes (fcinfo=0x19874b0) at logicalfuncs.c:365
#11 0x0000000000740001 in ExecMakeTableFunctionResult (setexpr=0x1985a38, econtext=0x19858f0, argContext=0x1987390,
expectedDesc=0x19ae898, randomAccess=false) at execSRF.c:234
#12 0x000000000075be2b in FunctionNext (node=0x19856d8) at nodeFunctionscan.c:95
#13 0x000000000074184a in ExecScanFetch (node=0x19856d8, accessMtd=0x75bd7a <FunctionNext>,
recheckMtd=0x75c175 <FunctionRecheck>) at execScan.c:132
#14 0x00000000007418eb in ExecScan (node=0x19856d8, accessMtd=0x75bd7a <FunctionNext>, recheckMtd=0x75c175 <FunctionRecheck>)
at execScan.c:198
#15 0x000000000075c1bf in ExecFunctionScan (pstate=0x19856d8) at nodeFunctionscan.c:270
#16 0x000000000073db6e in ExecProcNodeFirst (node=0x19856d8) at execProcnode.c:464
#17 0x00000000007322ac in ExecProcNode (node=0x19856d8) at ../../../src/include/executor/executor.h:260
#18 0x0000000000734b36 in ExecutePlan (estate=0x19854a0, planstate=0x19856d8, use_parallel_mode=false, operation=CMD_SELECT,
sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x19c2008, execute_once=true) at execMain.c:1551
#19 0x0000000000732916 in standard_ExecutorRun (queryDesc=0x1975860, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:361
#20 0x0000000000732744 in ExecutorRun (queryDesc=0x1975860, direction=ForwardScanDirection, count=0, execute_once=true)
at execMain.c:305
#21 0x000000000097c752 in PortalRunSelect (portal=0x191dd40, forward=true, count=0, dest=0x19c2008) at pquery.c:921
#22 0x000000000097c411 in PortalRun (portal=0x191dd40, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x19c2008, altdest=0x19c2008, qc=0x7ffcc39d7b00) at pquery.c:765
#23 0x00000000009760d8 in exec_simple_query (
query_string=0x18bb7a0 "SELECT data FROM pg_logical_slot_get_changes('isolation_slot_5', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');") at postgres.c:1213
#24 0x000000000097a5eb in PostgresMain (argc=1, argv=0x7ffcc39d7d90, dbname=0x18e5c28 "regress_5",
username=0x18e5c08 "hayato") at postgres.c:4513
#25 0x00000000008b58b6 in BackendRun (port=0x18dd880) at postmaster.c:4540
#26 0x00000000008b5232 in BackendStartup (port=0x18dd880) at postmaster.c:4262
#27 0x00000000008b16ff in ServerLoop () at postmaster.c:1748
#28 0x00000000008b0fd1 in PostmasterMain (argc=5, argv=0x18b6200) at postmaster.c:1420
#29 0x00000000007b16f8 in main (argc=5, argv=0x18b6200) at main.c:209
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

#9Alexander Lakhin
exclusion@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#8)
Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

Hello Kuroda-san,

04.03.2024 15:52, Hayato Kuroda (Fujitsu) wrote:

Dear Alexander,

I also ran your script after applying patches on REL_14_STABLE, but I got another failure.
I have not analyzed yet so I share the result as-is.
Was I missing something?

```
#0 0x00007f4f033de387 in raise () from /lib64/libc.so.6
#1 0x00007f4f033dfa78 in abort () from /lib64/libc.so.6
#2 0x0000000000b08152 in ExceptionalCondition (conditionName=0xcc8228 "prev_first_lsn < cur_txn->first_lsn",
    errorType=0xcc8024 "FailedAssertion", fileName=0xcc8070 "reorderbuffer.c", lineNumber=916) at assert.c:69
#3 0x00000000008e01ad in AssertTXNLsnOrder (rb=0x19a95c0) at reorderbuffer.c:916
#4 0x00000000008dfae8 in ReorderBufferTXNByXid (rb=0x19a95c0, xid=1753, create=true, is_new=0x0, lsn=37430336,
    create_as_top=true) at reorderbuffer.c:669
```

Yes, you're right. That script produces another failure as I mentioned
upthread, but still in AssertTXNLsnOrder().
I had thought that the fix should cover that case too, but then I found out
that the defect in question is rather new, so maybe the more focused fix is
really preferable.

Best regards,
Alexander

#10Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: ocean_li_996 (#2)
RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Haiyang Li,

Thanks for reporting. I could reproduce the issue on PG12-15.
I think you already know the reason, but let me share my analysis for
confirmation. The root cause is that temporary tables were not taken into account.

## Premise

INSERT/UPDATE/DELETE operations on temporary tables are not recorded in WAL,
but an xid is still assigned for such operations.
So a transaction may have no related WAL records
if it modifies only temp tables.

Normally, such transactions are not decoded.

## Found issue

### Empty transaction is decoded on PG14 and PG15

However, there is room for creating a ReorderBufferTXN for empty transactions,
which was introduced by 6b77048e5. The conditions are:

1. There are sub transactions which modify only temp tables, and
2. the top transaction modifies the catalog.

The call stack leading to the creation is below.

```
ReorderBufferTXNByXid(create = true, create_as_top = true)
ReorderBufferXidSetCatalogChanges() // for sub transactions
SnapBuildXidSetCatalogChanges() // for top transaction
DecodeCommit() // for top transaction
```

This path was introduced by 6b77048e5.
Previously, the call to ReorderBufferXidSetCatalogChanges() for sub transactions
was skipped if they did not have catalog changes or had not been decoded yet.
However, the commit ensures that sub transactions are always marked as containing
catalog changes, and this also forces transactions to be decoded even if they are
empty.

### Assertion failure

The empty transactions are created as top transactions. At that time,
AssertTXNLsnOrder() is called to ensure that the first_lsn of each top transaction
is strictly higher than that of the previous one. But they can be the same if there
are two or more empty transactions, which leads to an assertion failure.

### Considerations on PG12 and PG13

The same failure can occur on PG12 and PG13, but the background is a bit different.
343afa967 removed a ReorderBufferAssignChild() call from SnapBuildXidSetCatalogChanges().
That call caused empty transactions to be marked as sub transactions, so there was
no problem in the past. After the commit, the assignments were removed, so the
empty transactions are created as top transactions.

## Possible solutions

I think there are several solutions.
Note that I assumed here that fixes for all the versions should be almost the same.

* Ease the condition in AssertTXNLsnOrder(). If the decoded transaction is empty,
its first_lsn can be allowed to be the same as the previous one.
Please see the attached file for my idea.
* Generate a ReorderBufferTXN as a sub transaction when we are in this path.
This approach has already been shared by you. However, note that it needs to
extend the ReorderBufferXidSetCatalogChanges function, which breaks ABI
compatibility [1].
* Avoid calling ReorderBufferXidSetCatalogChanges() if the target transaction
has not been decoded. A concern is that ReorderBuffer does not provide an API
for checking whether the transaction has already been decoded or not.
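
To make the third option concrete, here is a minimal standalone sketch of a lookup that never creates an entry. It is only an illustration under assumptions: `SketchTxn`, `SketchBuffer`, `sketch_find_txn` and `sketch_mark_catalog_changes_if_known` are hypothetical names, not existing ReorderBuffer APIs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_TXNS 32

/* Hypothetical, simplified transaction entry. */
typedef struct SketchTxn
{
    uint32_t xid;
    bool     has_catalog_changes;
} SketchTxn;

typedef struct SketchBuffer
{
    SketchTxn txns[SKETCH_MAX_TXNS];
    size_t    ntxns;
} SketchBuffer;

/* Lookup only: never creates an entry for an unknown xid. */
static SketchTxn *
sketch_find_txn(SketchBuffer *rb, uint32_t xid)
{
    for (size_t i = 0; i < rb->ntxns; i++)
    {
        if (rb->txns[i].xid == xid)
            return &rb->txns[i];
    }
    return NULL;
}

/* Mark xid only if it has been seen before; returns whether it was marked. */
static bool
sketch_mark_catalog_changes_if_known(SketchBuffer *rb, uint32_t xid)
{
    SketchTxn *txn = sketch_find_txn(rb, xid);

    if (txn == NULL)
        return false;           /* never decoded: do not create an entry */
    txn->has_catalog_changes = true;
    return true;
}
```

With such a check, a sub transaction that only touched temp tables would simply be skipped instead of being created as a new top-level entry.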

I will keep analyzing and share further updates if I find anything.
Thoughts?

[1]: https://wiki.postgresql.org/wiki/Committing_checklist#Maintaining_ABI_compatibility_while_backpatching

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

Attachments:

add_is_empty.diff (application/octet-stream, +31 -2)
#11ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#10)
Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Hayato Kuroda,

Thanks for your attention.

At 2024-03-05 17:24:05, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

Dear Haiyang Li,

...
## Found issue

### Empty transaction is decoded on PG14 and PG15

However, there is a room for generating ReorderBufferTxn for empty transactions,
which was introduced by 6b77048e5. Conditions are:

1. There are sub transactions which modify only temp tables, and
2. the top transaction modifies the catalog.

The call-stack toward the generation is below.

```
ReorderBufferTXNByXid(create = true, create_as_top = true)
ReorderBufferXidSetCatalogChanges() // for sub transactions
SnapBuildXidSetCatalogChanges() // for top transaction
DecodeCommit() // for top transaction
```

The path has been introduced by 6b77048e5.
Previously, calling ReorderBufferXidSetCatalogChanges() for sub transactions
would be skipped, if they do not have catalog changes or they have not decoded yet.
However, the commit ensures sub transactions must be marked as containing
catalog changes, and this also enforces to decode transactions even if it is
empty.

### Assertion failure

The empty transactions would be created as top transactions. At that time,
AssertTXNLsnOrder() is called so that we ensured that first_lsn of top-transactions
must be strictly higher than previous. But they can be the same if there are more
than two empty transactions. It led an assertion failure.

Your analysis looks correct to me. Actually, I mentioned in [1] that I could reproduce this issue before 6b77048e5.
After some attempts and analysis, I now also believe that the issue occurs only after 6b77048e5.

...

## Possible solutions

I think there are several solutions.
Note that I assumed here that fixes for all the versions should be almost the same.

* Ease the condition in AssertTXNLsnOrder(). If the decoded transaction is empty,
it can be allowed that the first_lsn is same as previous one.

PSA file to see my consideration.

LGTM. From my observation, most failures in AssertTXNLsnOrder() involve checking an empty
decoded transaction. Maybe we should pay more attention to reviewing ReorderBufferTXNIsEmpty.

* Generate a ReorderBufferTXN as sub transaction when we are in this path.
The approach has already been shared by you. However, note that this needs to
extend the ReorderBufferXidSetCatalogChanges function, and breaks ABI

compatibility [1].

Yes, It breaks ABI compatibility.

* Avoid calling ReorderBufferXidSetCatalogChanges() if the target transaction
has not been decoded. An concern is that ReorderBuffer does not provide an API
for checking whether the transaction has been already decoded or not.

I think this idea is a little complex, especially when considering version compatibility.

[1]: /messages/by-id/6444e39.131bc.18df5c0cae3.Coremail.ocean_li_996@163.com

Best Regards,
Haiyang Li

#12Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: ocean_li_996 (#11)
RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Haiyang, Alexander,

LGTM. From my observation, most failures in AssertTXNLsnOrder() involve checking an empty
decoded transaction. Maybe we should pay more attention to reviewing ReorderBufferTXNIsEmpty.

Looking at it with fresh eyes, I think the code might be wrong. The first_lsn
can be the same as the previous one when the *PREVIOUS transaction* (not cur_txn) is
empty. Thoughts?
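
The relaxed condition described here can be sketched as follows. This is a minimal standalone model, not the code in add_is_empty.diff; `SketchTxn` and `sketch_relaxed_order_ok` are hypothetical names, and the array is assumed to be in list order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified transaction entry. */
typedef struct SketchTxn
{
    uint64_t first_lsn;
    bool     is_empty;          /* no changes queued for this entry */
} SketchTxn;

/*
 * Relaxed ordering check: first_lsn must increase strictly, except that
 * an entry may share its first_lsn with the previous entry when that
 * previous entry is empty.
 */
static bool
sketch_relaxed_order_ok(const SketchTxn *txns, size_t ntxns)
{
    for (size_t i = 1; i < ntxns; i++)
    {
        const SketchTxn *prev = &txns[i - 1];
        const SketchTxn *cur = &txns[i];

        if (prev->first_lsn < cur->first_lsn)
            continue;
        if (prev->first_lsn == cur->first_lsn && prev->is_empty)
            continue;
        return false;
    }
    return true;
}
```

Note that in this model the emptiness of cur is irrelevant; an equal first_lsn after a non-empty previous entry is still rejected.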

Also, I found that the crash reported in [1] was not resolved by the patch. I'm
still analyzing it, but I have not yet found a good reproducer that can be expressed
as a spec file. Can someone find the workload?

[1]: /messages/by-id/TYCPR01MB12077573479C5A2471BDE8065F5232@TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

#13Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#12)
RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Haiyang, Alexander,

I analyzed the second failure reported in [1]. The failure happens on all
supported branches. The attached patches fix both failures [1] [2] on PG12-PG15,
and only the failure [1] on PG16-HEAD.

The part below describes the second failure. The issue occurs when:

1) Logical decoding starts from the middle of a sub-transaction, and
2) a NEW_CID record is the first record decoded in the sub-transaction, and
3) arbitrary changes are then decoded in the sub-transaction.

## Stack trace

Just in case, below is the stack trace I got.

```
#0 0x00007f85e64af387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f85e64b0a78 in __GI_abort () at abort.c:90
#2 0x0000000000b082d3 in ExceptionalCondition (conditionName=0xcc83e7 "prev_txn_is_empty",
errorType=0xcc8184 "FailedAssertion", fileName=0xcc81d0 "reorderbuffer.c", lineNumber=932) at assert.c:69
#3 0x00000000008e023d in AssertTXNLsnOrder (rb=0x1a69e00) at reorderbuffer.c:932
#4 0x00000000008e3cf5 in ReorderBufferSetBaseSnapshot (rb=0x1a69e00, xid=757, lsn=24749472, snap=0x1a660e8)
at reorderbuffer.c:3140
#5 0x00000000008e910a in SnapBuildProcessChange (builder=0x1a65de0, xid=757, lsn=24749472) at snapbuild.c:799
#6 0x00000000008d2740 in DecodeHeapOp (ctx=0x1a53d60, buf=0x7ffe24280460) at decode.c:521
#7 0x00000000008d1df7 in LogicalDecodingProcessRecord (ctx=0x1a53d60, record=0x1a54160) at decode.c:154
#8 0x00000000008d9766 in pg_logical_slot_get_changes_guts (fcinfo=0x1a41f20, confirm=true, binary=false)
at logicalfuncs.c:296
#9 0x00000000008d989b in pg_logical_slot_get_changes (fcinfo=0x1a41f20) at logicalfuncs.c:365
...
```

## Reproducer

Each attached patch contains the reproducer.
As I said above, logical decoding must start from the middle of the sub-transaction,
and the first decoded record must be NEW_CID. Below are the WAL records from the
decoding start point.

```
$ pg_waldump tmp_check_iso/data/pg_wal/000000010000000000000001 -s 0/179A4E8
rmgr: ..., tx: 0, lsn: 0/0179A4E8, prev 0/0179A4A8, desc: CHECKPOINT_ONLINE...
rmgr: ..., tx: 757, lsn: 0/0179A560, prev 0/0179A4E8, desc: NEW_CID ...
rmgr: ..., tx: 757, lsn: 0/0179A5A0, prev 0/0179A560, desc: INSERT+INIT...
rmgr: ..., tx: 756, lsn: 0/0179A5E0, prev 0/0179A5A0, desc: COMMIT 2024-03-07 13:54:56.243746 UTC; subxacts: 757
rmgr: ..., tx: 0, lsn: 0/0179A618, prev 0/0179A5E0, desc: RUNNING_XACTS ...
```

## Analysis

SnapBuildProcessNewCid() creates two transactions (xid = 757 and 756) based on
the same WAL record, so both entries have the same first_lsn in their ReorderBufferTXN.
However, they are not associated as top and sub transactions, so both are pushed to
the toplevel_by_lsn list. Since AssertTXNLsnOrder() does not expect two entries
to have the same first_lsn, an assertion failure occurs. Note that the
sub-transaction is created earlier than the top one and is not empty, so the
initial fix was not sufficient.

### Detailed analysis

I added a debug variable in AssertTXNLsnOrder() to preserve the previous entry in
the toplevel_by_lsn loop, and I found that two ReorderBufferTXNs have the same first_lsn.
Also, according to the above, 757 is a sub-transaction of 756.

```
(gdb) p *prev
$1 = {txn_flags = 1, xid = 757, toplevel_xid = 0, gid = 0x0, first_lsn = 24749408, ...nentries = 1, ..., ntuplecids = 0,...}
(gdb) p *cur_txn
$2 = {txn_flags = 0, xid = 756, toplevel_xid = 0, gid = 0x0, first_lsn = 24749408, ...nentries = 0, ..., ntuplecids = 1,...}
```

Based on the above and some debug output, I considered the following scenario. The
flow below shows what happens when the sub-transaction (xid = 757) is decoded.

```
DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID)
  SnapBuildProcessNewCid()
    ReorderBufferXidSetCatalogChanges(xid, lsn)
      ReorderBufferTXNByXid(xid, lsn)
      -> A ReorderBufferTXN for the subtransaction (757) was generated.
         The first_lsn was the head of the NEW_CID record.
    ReorderBufferAddNewCommandId(xid, lsn)
      ReorderBufferQueueChange(xid, lsn)
      -> A ReorderBufferChange was queued to the subtransaction (757).
    ...
    ReorderBufferAddNewTupleCids(xlrec->top_xid, lsn)
      ReorderBufferTXNByXid()
      -> A ReorderBufferTXN for the top transaction (756) was generated.
         The first_lsn was the head of the NEW_CID record.
...
DecodeHeapOp(info = XLOG_HEAP_INSERT)
  SnapBuildProcessChange(xid)
    ReorderBufferSetBaseSnapshot(xid, lsn)
      AssertTXNLsnOrder()
      -> A subtransaction was found and it was not an empty transaction.
      -> The next entry was a top transaction.
         The previous entry was not empty, and they had the same first_lsn. It caused an assertion failure.
```

## How to fix

I think the straightforward fix is to associate them in a top-sub relationship,
and the attached patch does that.

Thoughts?

[1]: /messages/by-id/TYCPR01MB12077573479C5A2471BDE8065F5232@TYCPR01MB12077.jpnprd01.prod.outlook.com
[2]: /messages/by-id/18369-ad61699bf91c5bc0@postgresql.org

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

REL_12_fix_assertion_failure.diff (application/octet-stream, +157 -4)
REL_13_fix_assertion_failure.diff (application/octet-stream, +156 -3)
REL_14_fix_assertion_failure.diff (application/octet-stream, +152 -3)
REL_15_fix_assertion_failure.diff (application/octet-stream, +152 -3)
REL_16_fix_assertion_failure.diff (application/octet-stream, +95 -2)
HEAD_fix_assertion_failure.diff (application/octet-stream, +95 -2)
#14Amit Kapila
amit.kapila16@gmail.com
In reply to: ocean_li_996 (#11)
Re: RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

On Wed, Mar 6, 2024 at 9:04 AM ocean_li_996 <ocean_li_996@163.com> wrote:

Thanks for your attention.

At 2024-03-05 17:24:05, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

Dear Haiyang Li,

...
## Found issue

### Empty transaction is decoded on PG14 and PG15

However, there is room for generating a ReorderBufferTXN for empty transactions,
which was introduced by 6b77048e5. The conditions are:

1. There are sub transactions which modify only temp tables, and
2. the top transaction modifies the catalog.

The call-stack toward the generation is below.

```
ReorderBufferTXNByXid(create = true, create_as_top = true)
ReorderBufferXidSetCatalogChanges() // for sub transactions
SnapBuildXidSetCatalogChanges() // for top transaction
DecodeCommit() // for top transaction
```

The path has been introduced by 6b77048e5.
Previously, calling ReorderBufferXidSetCatalogChanges() for sub transactions
would be skipped if they did not have catalog changes or had not been decoded yet.
However, the commit ensures that sub transactions are marked as containing
catalog changes, and this also forces such transactions to be decoded even if
they are empty.

### Assertion failure

The empty transactions would be created as top transactions. At that time,
AssertTXNLsnOrder() is called to ensure that the first_lsn of each top transaction
is strictly higher than that of the previous one. But they can be the same if there
are two or more empty transactions, which leads to an assertion failure.

Your analysis is correct for me. Actually, I mentioned in [1] that I can reproduce this issue before 6b77048e5.
After some attempts and analysis, I also believe that the issue will only occur after 6b77048e5.

...

## Possible solutions

I think there are several solutions.
Note that I assumed here that fixes for all the versions should be almost the same.

* Ease the condition in AssertTXNLsnOrder(). If the decoded transaction is empty,
it can be allowed that the first_lsn is same as previous one.
PSA file to see my consideration.

LGFM. From my observation, most failures in AssertTXNLsnOrder occur when checking
empty decoded transactions. Maybe we should pay more attention to reviewing ReorderBufferTXNIsEmpty.

* Generate a ReorderBufferTXN as sub transaction when we are in this path.
The approach has already been shared by you. However, note that this needs to
extend the ReorderBufferXidSetCatalogChanges function, and breaks ABI
compatibility [1].

Yes, It breaks ABI compatibility.

One possibility is introducing ReorderBufferXidSetCatalogChangesEx, a
new API with a bool parameter. I don't know if this is a good idea but
I prefer not to tinker with asserts as proposed by Kuroda-San in
another approach.

--
With Regards,
Amit Kapila.

#15Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#14)
RE: RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Amit,

OK, so let's consider the approach. Note that at this stage, another failure [1]
was ignored.

Here are patches for all supported versions. The failure on PG12-PG15 has been
fixed on my env, but you can confirm as well.

The approach is almost the same as what was initially shared by Haiyang [2]. However,
instead of extending ReorderBufferXidSetCatalogChanges(), a new function
ReorderBufferXidSetCatalogChangesEx() was added.

Note again that there were changes also in ReorderBufferAssignChild() and
ReorderBufferCleanupTXN(). The extended ReorderBufferXidSetCatalogChanges would
create the ReorderBufferTXN not as top; however, these transactions would not be
associated with the top one. So there is a possibility that txn->node is invalid.
IIUC, only ReorderBufferAssignChild() calls ReorderBufferTXNByXid with create = true
and create_as_top = false, and they are immediately associated in the code below.

```
/* add to subtransaction list */
dlist_push_tail(&txn->subtxns, &subtxn->node);
txn->nsubtxns++;
```

[1]: /messages/by-id/TYCPR01MB12077573479C5A2471BDE8065F5232@TYCPR01MB12077.jpnprd01.prod.outlook.com
[2]: /messages/by-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996@163.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

Attachments:

REL_15_fix_coredump.patch (application/octet-stream, +80 -3)
REL_12_fix_coredump.patch (application/octet-stream, +78 -3)
REL_13_fix_coredump.patch (application/octet-stream, +78 -3)
REL_14_fix_coredump.patch (application/octet-stream, +82 -3)
#16ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#13)
Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Hayato:

At 2024-03-08 15:35:16, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

Dear Haiyang, Alexander,

The part below describes the second failure. The issue occurs when:

1) Logical decoding starts from the middle of a sub-transaction, and
2) a NEW_CID record is the first record decoded in the sub-transaction, and
3) arbitrary changes are decoded in the sub-transaction.

...

## Issue 1
Thanks for your reproducer. I have investigated this issue. The scenario that caused the issue
is indeed as you described above. I had not realized before that a serialized snapshot file from
one replication slot could be utilized by another replication slot to achieve a consistent state.
This issue would not arise with only one replication slot, as SnapBuildXactNeedsSkip ensures that.

Using the spec test case you provided (without any fix patch), I got another stack trace on REL_14_STABLE:
```
#0 0x00007f93604e0277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f93604e1968 in __GI_abort () at abort.c:90
#2 0x0000000000b084da in ExceptionalCondition (conditionName=0xcc94c8 "prev_first_lsn < cur_txn->first_lsn", errorType=0xcc92c4 "FailedAssertion", fileName=0xcc9310 "reorderbuffer.c",
lineNumber=916) at assert.c:69
#3 0x00000000008df064 in AssertTXNLsnOrder (rb=0x2a48590) at reorderbuffer.c:916
#4 0x00000000008de9a1 in ReorderBufferTXNByXid (rb=0x2a48590, xid=756, create=true, is_new=0x0, lsn=24749416, create_as_top=true) at reorderbuffer.c:669
#5 0x00000000008e2d00 in ReorderBufferAddNewTupleCids (rb=0x2a48590, xid=756, lsn=24749416, node=..., tid=..., cmin=1, cmax=4294967295, combocid=4294967295) at reorderbuffer.c:3198
#6 0x00000000008e7ed7 in SnapBuildProcessNewCid (builder=0x2a44570, xid=757, lsn=24749416, xlrec=0x2a16d38) at snapbuild.c:823
#7 0x00000000008d14b3 in DecodeHeap2Op (ctx=0x2a324f0, buf=0x7ffd886a4bc0) at decode.c:471
#8 0x00000000008d0bfa in LogicalDecodingProcessRecord (ctx=0x2a324f0, record=0x2a328f0) at decode.c:151
#9 0x00000000008d8688 in pg_logical_slot_get_changes_guts (fcinfo=0x2a206b0, confirm=true, binary=false) at logicalfuncs.c:296
#10 0x00000000008d87bd in pg_logical_slot_get_changes (fcinfo=0x2a206b0) at logicalfuncs.c:365
```

Then, I think the simplest process would be:
```
ReorderBufferProcessXid(xid = 757, lsn)
-> A ReorderBufferTXN for the subtransaction (757) was generated.
   The first_lsn was the head of the NEW_CID record.
DecodeHeap2Op(info = XLOG_HEAP2_NEW_CID)
  SnapBuildProcessNewCid()
    ReorderBufferAddNewTupleCids(..., xlrec->top_xid = 756,...)
      ReorderBufferTXNByXid(xid = 756, lsn)
      -> A ReorderBufferTXN for the top transaction (756) was generated.
         The first_lsn was the head of the NEW_CID record. It caused an assertion failure.
```

I think the straightforward fix is to associate them to top-sub relationship,
and attached patch did it.

LGFM. I think it is suitable to assign the subtransaction after calling ReorderBufferAddNewTupleCids().

## Issue 2
Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in the attachment
is able to reproduce the issue mentioned in [1] even before commit 6b77048e5.

The approach in [3] is also LGFM.

[1]: /messages/by-id/18369-ad61699bf91c5bc0@postgresql.org
[2]: /messages/by-id/6d0e80d6.c1fc.18deeb8120a.Coremail.ocean_li_996@163.com
[3]: /messages/by-id/CAA4eK1KpW5pHMwMp9hfXYvOeEU5Rcbhoc7FxtBOGPgKeyYLDmA@mail.gmail.com

Best Regards,
Haiyang Li

Attachments:

v2-0001-TestCase-Coredump-On-Assert_TXNLsnOrder.patch (application/octet-stream, +22 -0)
#17Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: ocean_li_996 (#16)
RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Haiyang,

Thanks for checking! This reply was still focused only on "Issue 2" in your notation.

## Issue 2
Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
is able to reproduce the issue mentioned in [1] even before commit 6b77048e5.

Good finding. I also confirmed that the workload could fail after reverting 6b77048e5.
I also confirmed that the patch [1] could fix the workload as well.

permutation "s0_init" "s0_begin" "s0_savepoint" "s0_create_part1" "s0_savepoint_release"
"s2_init" "s1_checkpoint" "s1_get_changes" "s0_commit" "s2_get_changes"

## Analysis

The point is that a snapshot serialized by another replication slot can be reused.
When the first get_changes is called, a consistent snapshot can be serialized because
of the XLOG_RUNNING_XACTS record (see later).
The get_changes call for the second slot reuses that snapshot, so it can read WAL records properly.
(If the first slot does not exist, the status of the snapshot would be
SNAPBUILD_BUILDING_SNAPSHOT, so no records are read.)

In the second get_changes, the records below are read. The first (LOCK, RUNNING_XACTS)
pair is generated by the slot creation, and the second pair comes from the
CHECKPOINT. I.e., it reads all records since the slot creation.

```
...lsn: 0/01906DB8, prev 0/01906D58, desc: LOCK ...
...lsn: 0/01906DF0, prev 0/01906DB8, desc: RUNNING_XACTS ...
...lsn: 0/01906E30, prev 0/01906DF0, desc: LOCK ...
...lsn: 0/01906E68, prev 0/01906E30, desc: RUNNING_XACTS ...
...lsn: 0/01906EA8, prev 0/01906E68, desc: CHECKPOINT_ONLINE ...
...lsn: 0/01906F20, prev 0/01906EA8, desc: COMMIT ... subxacts: 728; ... inval msgs: ...
```

Also, the final COMMIT record contains the info for a subtransaction and the
XACT_XINFO_HAS_INVALS flag, so DecodeCommit()->SnapBuildXidSetCatalogChanges()
is called for these transactions.

After that, two ReorderBufferTXNs are created with the same LSN, which fails the
assertion in AssertTXNLsnOrder().

I will update the patch if above analysis is correct.

The approach in [3] is also LGFM.

Thanks. I agree that we should avoid easing the Assert() condition as much as possible.

[1]: /messages/by-id/TYCPR01MB1207790E98F0A563280CC39FCF5262@TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

#18ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#17)
Re:RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Hayato,

I will update the patch if above analysis is correct.

Your analysis is correct for me.

Best Regards
Haiyang Li

#19Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: ocean_li_996 (#18)
RE: Re:RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()

Dear Haiyang Li,

Your analysis is correct for me.

Thanks for the confirmation. Here are updated patches for REL12-REL15.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

Attachments:

REL_12-v2-0001-Make-the-decoded-transaction-as-subxact-while-dec.patch (application/octet-stream, +88 -6)
REL_13-v2-0001-Make-the-decoded-transaction-as-subxact-while-dec.patch (application/octet-stream, +88 -6)
REL_14-v2-0001-Make-the-decoded-transaction-as-subxact-while-dec.patch (application/octet-stream, +93 -6)
REL_15-v2-0001-Make-the-decoded-transaction-as-subxact-while-dec.patch (application/octet-stream, +93 -6)
#20feichanghong
feichanghong@qq.com
In reply to: Hayato Kuroda (Fujitsu) (#19)
Re: BUG #18369: logical decoding core on AssertTXNLsnOrder()

On Mar 13, 2024, at 11:24, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:

Dear Haiyang Li,

Your analysis is correct for me.

Thanks for the confirmation. Here are updated patch for REL12-REL15.

Would it be better for ReorderBufferXidSetCatalogChanges to directly call the
ReorderBufferXidSetCatalogChangesEx function? That would reduce the cost of
later maintenance.
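
As a rough illustration of that suggestion, the sketch below shows the usual wrapper pattern for keeping an exported signature stable while adding a parameter. It is a toy model, not the patch: `ToyTxn`, `toy_set_catalog_changes()` and `toy_set_catalog_changes_ex()` are hypothetical names with simplified signatures, and the real functions take a ReorderBuffer, an xid and an LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state the real functions would update. */
typedef struct ToyTxn
{
	uint32_t	xid;
	bool		has_catalog_changes;
	bool		created_as_top;
} ToyTxn;

/* New entry point carrying the extra flag (simplified, assumed signature). */
ToyTxn
toy_set_catalog_changes_ex(uint32_t xid, bool create_as_top)
{
	ToyTxn		txn;

	assert(xid != 0);			/* 0 models InvalidTransactionId */
	txn.xid = xid;
	txn.has_catalog_changes = true;
	txn.created_as_top = create_as_top;
	return txn;
}

/* Old entry point: same signature and behavior as before, logic shared. */
ToyTxn
toy_set_catalog_changes(uint32_t xid)
{
	return toy_set_catalog_changes_ex(xid, true);
}
```

Existing callers compiled against the old signature keep working, new callers can pass create_as_top = false, and there is only one body to maintain.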

Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

#21Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: feichanghong (#20)
#22ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#21)
#23Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#13)
#24ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#23)
#25Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: ocean_li_996 (#24)
#26ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#25)
#27Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#23)
#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#27)
#29Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#27)
#30ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#29)
#31ocean_li_996
ocean_li_996@163.com
In reply to: Hayato Kuroda (Fujitsu) (#29)
#32Masahiko Sawada
sawada.mshk@gmail.com
In reply to: ocean_li_996 (#31)