TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Started by Tomas Vondra over 3 years ago; 46 messages
#1 Tomas Vondra
tomas.vondra@enterprisedb.com
2 attachment(s)

Hi,

I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.

The crash always happens in the "ddl" test. The backtrace always looks
like this:

(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]

I'm unable to get a better backtrace from the valgrind-produced core
using gdb, for some reason.

However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:

WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================

The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.
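
For reference, the dump amounts to walking the toplevel_by_lsn list and
reporting each entry with elog(WARNING, ...) before the assert fires; a
minimal sketch (the exact message format used above differs slightly):

	dlist_iter	iter;

	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
												iter.cur);

		/* xid, its known top-level xid, first/final LSN of the entry */
		elog(WARNING, "txn xid %u top %u first %X/%X final %X/%X",
			 txn->xid, txn->toplevel_xid,
			 LSN_FORMAT_ARGS(txn->first_lsn),
			 LSN_FORMAT_ARGS(txn->final_lsn));
	}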

But the very first WAL record for 849 is

ASSIGNMENT xtop 848: subxacts: 849

so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.

Furthermore, the WAL is almost exactly the same in both cases. Attached
are two dumps from a failed and successful run (only the part related to
these two xids is included). There are very few differences - there is a
PRUNE in the failed case, and a LOCK / RUNNING_XACTS moved a bit.

Any ideas?

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

wal-crash.log.gz (application/gzip)
wal-ok.log.gz (application/gzip)
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#1)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.

The crash always happens in the "ddl" test. The backtrace always looks
like this:

(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]

I'm unable to get a better backtrace from the valgrind-produced core
using gdb, for some reason.

However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:

WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================

The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.

But the very first WAL record for 849 is

ASSIGNMENT xtop 848: subxacts: 849

so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.

There is no guarantee that toplevel_by_lsn won't have subxact. As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

[1]: /messages/by-id/CAA4eK1LK1nxOTL32OP=ejhPoBsUP4Bvwb3Ly=fethyJ-KbaXyw@mail.gmail.com

--
With Regards,
Amit Kapila.

#3 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#2)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 9/4/22 13:49, Amit Kapila wrote:

On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.

The crash always happens in the "ddl" test. The backtrace always looks
like this:

(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]

I'm unable to get a better backtrace from the valgrind-produced core
using gdb, for some reason.

However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:

WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================

The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.

But the very first WAL record for 849 is

ASSIGNMENT xtop 848: subxacts: 849

so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.

There is no guarantee that toplevel_by_lsn won't have subxact.

I don't think that's quite true - toplevel_by_lsn should not contain any
*known* subxacts. Yes, we may initially add a subxact to the list, but
as soon as we get the assignment record, it should be removed. See what
ReorderBufferAssignChild does.
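
For reference, the relevant part of ReorderBufferAssignChild() looks
roughly like this (condensed, comments added; not verbatim source):

	txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
	subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);

	if (!new_sub)
	{
		if (rbtxn_is_known_subxact(subtxn))
			return;				/* already associated, nothing to do */

		/*
		 * We've seen this xid before, but only as a "top-level" entry, so
		 * remove it from the toplevel_by_lsn list now that we know better.
		 */
		dlist_delete(&subtxn->node);
	}

	subtxn->txn_flags |= RBTXN_IS_SUBXACT;
	subtxn->toplevel_xid = xid;
	subtxn->toptxn = txn;
	dlist_push_tail(&txn->subtxns, &subtxn->node);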

And in this case the ASSIGNMENT is the first WAL record we get for 849
(in fact, isn't that guaranteed since 7259736a6e?), so we know from the
very beginning 849 is a subxact.

As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#3)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 9/4/22 14:24, Tomas Vondra wrote:

On 9/4/22 13:49, Amit Kapila wrote:

On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.

The crash always happens in the "ddl" test. The backtrace always looks
like this:

(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]

I'm unable to get a better backtrace from the valgrind-produced core
using gdb, for some reason.

However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:

WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================

The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.

But the very first WAL record for 849 is

ASSIGNMENT xtop 848: subxacts: 849

so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.

There is no guarantee that toplevel_by_lsn won't have subxact.

I don't think that's quite true - toplevel_by_lsn should not contain any
*known* subxacts. Yes, we may initially add a subxact to the list, but
as soon as we get assignment record, it should be removed. See what
ReorderBufferAssignChild does.

And in this case the ASSIGNMENT is the first WAL record we get for 849
(in fact, isn't that guaranteed since 7259736a6e?), so we know from the
very beginning 849 is a subxact.

As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6

I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.

However, when processing the NEW_CID record:

tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.
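
To make that concrete, the relevant call is roughly this (a condensed
sketch, argument names as in the backtrace above, field names approximate,
body abbreviated):

	void
	SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
						   XLogRecPtr lsn, xl_heap_new_cid *xlrec)
	{
		/* xid is the XID the record was logged under -- 849, the subxact */
		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

		/*
		 * ... but the CID mapping is attached to the record's top-level XID
		 * (xlrec->top_xid == 848), creating a second "top-level"
		 * ReorderBufferTXN at the very same LSN, which trips the assert.
		 */
		ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
									 xlrec->target_locator, xlrec->target_tid,
									 xlrec->cmin, xlrec->cmax,
									 xlrec->combocid);

		/* remainder omitted */
	}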

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#4)
1 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 9/4/22 16:08, Tomas Vondra wrote:

...

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

This fixes the crash for me, by adding a ReorderBufferAssignChild call
to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
ensure we don't try to create the top xact before updating the subxact
and removing it from the toplevel_by_lsn list.

Essentially, what's happening is this:

1) We read the NEW_CID record, which is logged with XID 849, i.e. the
subxact. But we don't know it's a subxact, so we create it as a
top-level xact with the LSN.

2) We start processing the contents of the NEW_CID, which however has info
that 849 is a subxact of 848, and calls ReorderBufferAddNewTupleCids, which
promptly does ReorderBufferTXNByXid() with the top-level XID; that
creates it with the same LSN, and crashes because of the assert.

I'm not sure what's the right/proper way to fix this ...

The problem is ReorderBufferAssignChild was coded in a way that did not
expect the subxact to be created first (as a top-level xact). And
indeed, if I add Assert(false) to the (!new_sub) branch that converts
top-level xact to subxact, check-world still passes. So we never test
this case, but the NEW_CID breaks this assumption and creates them in
the opposite order (i.e. subxact first).

So the patch "fixes" this by

(a) tweaking ReorderBufferAssignChild to first remove the subxact from
the list of top-level transactions

(b) call ReorderBufferAssignChild when processing NEW_CID

However, I wonder whether we even have to process these records? If the
restart_lsn is half-way through the xact, can we even decode it?
Maybe we can just skip all of this, somehow? We'd still need to remember
849 is a subxact of 848, at least, so that we know to skip it too.

Thread [1] suggested relaxing the assert to allow the same LSN, provided
it's a xact and its subxact. That goes directly against the expectation
that the toplevel_by_lsn list contains no known subxacts, and I don't think
we should be relaxing that. After all, just tweaking the LSN check does not
really fix the issue, because not remembering that it's a xact+subxact pair
is part of the issue. In principle, I think the issue is exactly the opposite,
i.e. that we don't realize 849 is a subxact, and leave it in the list.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

crash-wip-fix.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 89cf9f9389c..34ca52d6cc0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1029,7 +1029,6 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	bool		new_top;
 	bool		new_sub;
 
-	txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
 	subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);
 
 	if (!new_sub)
@@ -1054,6 +1053,8 @@ ReorderBufferAssignChild(ReorderBuffer *rb, TransactionId xid,
 	subtxn->toplevel_xid = xid;
 	Assert(subtxn->nsubtxns == 0);
 
+	txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
+
 	/* set the reference to top-level transaction */
 	subtxn->toptxn = txn;
 
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index bf72ad45ec7..00428e46d09 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -808,6 +808,10 @@ SnapBuildProcessNewCid(SnapBuild *builder, TransactionId xid,
 {
 	CommandId	cid;
 
+	/* we have a top XID, but it's not known to be subxact */
+	if (xid != xlrec->top_xid && xlrec->top_xid != InvalidTransactionId)
+		ReorderBufferAssignChild(builder->reorder, xlrec->top_xid, xid, lsn);
+
 	/*
 	 * we only log new_cid's if a catalog tuple was modified, so mark the
 	 * transaction as containing catalog modifications
#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#4)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 14:24, Tomas Vondra wrote:

As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6

I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

We had started logging assignments immediately in commit 0bead9af48
for streaming transactions in PG14. This issue exists prior to that. I
have tried and reproduced it in PG13 but I think it will be there even
before that. So, I am not sure if the spilling behavior is broken due
to this. I think if we don't get the assignment record before
processing changes while decoding the commit, then we could miss sending
the changes, which won't be the case here. Do you see any other
problem?

Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.

Right, but is that problematic? The restart LSN will be used as a
point to start reading the WAL and that helps in building a consistent
snapshot. However, for decoding to send the commit, we use the
start_decoding_at point, which ensures that we send complete
transactions.

However, when processing the NEW_CID record:

tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

As per my understanding, we can't skip them as they are used to build
the snapshot.

--
With Regards,
Amit Kapila.

#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#5)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 16:08, Tomas Vondra wrote:

...

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

This fixes the crash for me, by adding a ReorderBufferAssignChild call
to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
ensure we don't try to create the top xact before updating the subxact
and removing it from the toplevel_by_lsn list.

Essentially, what's happening is this:

1) We read the NEW_CID record, which is logged with XID 849, i.e. the
subxact. But we don't know it's a subxact, so we create it as a
top-level xact with the LSN.

2) We start processing contents of the NEW_CID, which however has info
that 849 is subxact of 848, calls ReorderBufferAddNewTupleCids which
promptly does ReorderBufferTXNByXid() with the top-level XID, which
creates it with the same LSN, and crashes because of the assert.

I'm not sure what's the right/proper way to fix this ...

The problem is ReorderBufferAssignChild was coded in a way that did not
expect the subxact to be created first (as a top-level xact).

I think there was previously a hard-coded way to detect that, and we
removed it in commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
I think it is possible that a subtransaction gets logged without a
previous top-level txn record, as shown in that commit.

--
With Regards,
Amit Kapila.

#8 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#6)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 9/5/22 06:32, Amit Kapila wrote:

On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 14:24, Tomas Vondra wrote:

As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6

I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

We had started logging assignments immediately in commit 0bead9af48
for streaming transactions in PG14. This issue exists prior to that. I
have tried and reproduced it in PG13 but I think it will be there even
before that. So, I am not sure if the spilling behavior is broken due
to this. I think if we don't get assignment recording before
processing changes during decoding commit then we could miss sending
the changes which won't be the case here. Do you see any other
problem?

I can't, but that's hardly proof of anything. You're right that spilling to
disk may not be broken by this, though. I forgot it predates assignments
being logged immediately, so it does not rely on that.

Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.

Right, but is that problematic? The restart LSN will be used as a
point to start reading the WAL and that helps in building a consistent
snapshot. However, for decoding to send the commit, we use
start_decoding_at point which will ensure that we send complete
transactions.

Which part would not be problematic? There's some sort of a bug, that's
for sure.

I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.

But maybe this is futile and there are other similar issues ...

However, when processing the NEW_CID record:

tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

As per my understanding, we can't skip them as they are used to build
the snapshot.

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it? Of course, the NEW_CID is
logged with xid 849 and we don't know it's a subxact of 848, but when
processing the NEW_CID content we can realize that (xl_heap_new_cid does
include top_xid).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#8)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/5/22 06:32, Amit Kapila wrote:

On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 14:24, Tomas Vondra wrote:

As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6

I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

We had started logging assignments immediately in commit 0bead9af48
for streaming transactions in PG14. This issue exists prior to that. I
have tried and reproduced it in PG13 but I think it will be there even
before that. So, I am not sure if the spilling behavior is broken due
to this. I think if we don't get assignment recording before
processing changes during decoding commit then we could miss sending
the changes which won't be the case here. Do you see any other
problem?

I can't, but that's hardly a proof of anything. You're right spilling to
disk may not be broken by this, though. I forgot it precedes assignments
being logged immediately, so it does not rely on that.

Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.

Right, but is that problematic? The restart LSN will be used as a
point to start reading the WAL and that helps in building a consistent
snapshot. However, for decoding to send the commit, we use
start_decoding_at point which will ensure that we send complete
transactions.

Which part would not be problematic? There's some sort of a bug, that's
for sure.

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.

I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.

Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.

But maybe this is futile and there are other similar issues ...

However, when processing the NEW_CID record:

tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

As per my understanding, we can't skip them as they are used to build
the snapshot.

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it?

But this transaction id can be part of the committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.
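
For reference, the relevant part of SnapBuildCommitTxn() is roughly the
following (condensed and simplified, not verbatim source):

	for (nxact = 0; nxact < nsubxacts; nxact++)
	{
		TransactionId subxid = subxacts[nxact];

		/* a catalog-modifying subxact goes into committed.xip on its own */
		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
		{
			sub_needs_timetravel = true;
			SnapBuildAddCommittedTxn(builder, subxid);
		}
	}

	if (needs_timetravel || sub_needs_timetravel)
		SnapBuildAddCommittedTxn(builder, xid);		/* the toplevel xid too */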

--
With Regards,
Amit Kapila.

#10 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#7)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 9/5/22 08:35, Amit Kapila wrote:

On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 16:08, Tomas Vondra wrote:

...

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

This fixes the crash for me, by adding a ReorderBufferAssignChild call
to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
ensure we don't try to create the top xact before updating the subxact
and removing it from the toplevel_by_lsn list.

Essentially, what's happening is this:

1) We read the NEW_CID record, which is logged with XID 849, i.e. the
subxact. But we don't know it's a subxact, so we create it as a
top-level xact with the LSN.

2) We start processing contents of the NEW_CID, which however has info
that 849 is subxact of 848, calls ReorderBufferAddNewTupleCids which
promptly does ReorderBufferTXNByXid() with the top-level XID, which
creates it with the same LSN, and crashes because of the assert.

I'm not sure what's the right/proper way to fix this ...

The problem is ReorderBufferAssignChild was coded in a way that did not
expect the subxact to be created first (as a top-level xact).

I think there was a previously hard-coded way to detect that and we
have removed it in commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
I think it is possible that subtransaction gets logged without
previous top-level txn record as shown in the commit shared.

Well, yes and no.

This wouldn't detect the issue, because the assert happens in the first
ReorderBufferTXNByXid(), so it still crashes (in an assert-enabled build,
at least).

Maybe removing the assumption was the wrong thing, and we should have
changed the code so that we don't violate it? That's kinda what my "fix"
does, in a way.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#9)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 9/5/22 12:12, Amit Kapila wrote:

On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/5/22 06:32, Amit Kapila wrote:

On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 14:24, Tomas Vondra wrote:

As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6

I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

We had started logging assignments immediately in commit 0bead9af48
for streaming transactions in PG14. This issue exists prior to that. I
have tried and reproduced it in PG13 but I think it will be there even
before that. So, I am not sure if the spilling behavior is broken due
to this. I think if we don't get assignment recording before
processing changes during decoding commit then we could miss sending
the changes which won't be the case here. Do you see any other
problem?

I can't, but that's hardly a proof of anything. You're right spilling to
disk may not be broken by this, though. I forgot it precedes assignments
being logged immediately, so it does not rely on that.

Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.

Right, but is that problematic? The restart LSN will be used as a
point to start reading the WAL and that helps in building a consistent
snapshot. However, for decoding to send the commit, we use
start_decoding_at point which will ensure that we send complete
transactions.

Which part would not be problematic? There's some sort of a bug, that's
for sure.

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.

I'm not sure I agree with that. I'm not convinced the assert is at
fault; it might easily be that it hints at a logic bug somewhere.

I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.

Isn't that self-contradictory? Either we form these assignments at
commit time, or we support streaming (in which case it clearly can't
happen at commit time). AFAICS that's exactly why we started logging
(and processing) assignments immediately, no?

Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.

But maybe this is futile and there are other similar issues ...

However, when processing the NEW_CID record:

tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

As per my understanding, we can't skip them as they are used to build
the snapshot.

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it?

But this transaction id can be part of committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.

Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.

However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#11)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Mon, Sep 5, 2022 at 5:24 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/5/22 12:12, Amit Kapila wrote:

On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.

I'm not sure I agree with that. I'm not convinced the assert is at
fault, it might easily be that it hints there's a logic bug somewhere.

It is possible but let's try to prove it. I am also keen to know if
this hints at a logic bug somewhere.

I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.

Isn't that self-contradictory? Either we form these assignments at
commit time, or we support streaming (in which case it clearly can't
happen at commit time).

I was talking about non-streaming cases, which also have this assert
problem, as seen in this thread. I am intentionally keeping streaming
cases out of this discussion, since the problem happens without them, and
including streaming in the discussion would add another angle to this
problem which may not be required.

AFAICS that's exactly why we started logging
(and processing) assignments immediately, no?

Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.

But maybe this is futile and there are other similar issues ...

However, when processing the NEW_CID record:

tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

As per my understanding, we can't skip them as they are used to build
the snapshot.

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it?

But this transaction id can be part of committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.

Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.

However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.

I think we will clean up such a subxact. Such a xact should be skipped
via DecodeTXNNeedSkip(), and then we will call ReorderBufferForget()
for each of the subxacts, and that will make sure that we clean up each
of the subtxns.
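
For reference, the skip path in DecodeCommit() looks roughly like this
(condensed sketch, not verbatim source):

	if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
	{
		for (i = 0; i < parsed->nsubxacts; i++)
			ReorderBufferForget(ctx->reorder, parsed->subxacts[i],
								buf->origptr);
		ReorderBufferForget(ctx->reorder, xid, buf->origptr);

		return;
	}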

--
With Regards,
Amit Kapila.

#13 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#12)
1 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Hi,

On Tue, Sep 6, 2022 at 3:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Sep 5, 2022 at 5:24 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/5/22 12:12, Amit Kapila wrote:

On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.

I'm not sure I agree with that. I'm not convinced the assert is at
fault, it might easily be that it hints there's a logic bug somewhere.

It is possible but let's try to prove it. I am also keen to know if
this hints at a logic bug somewhere.

I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.

Agreed.

Summarizing this issue: the assertion check in AssertTXNLsnOrder()
fails as reported because the current logical decoding cannot properly
handle the case where the decoding restarts from NEW_CID. Since we
don't make the association between a top-level transaction and its
subtransaction while decoding NEW_CID (i.e., in
SnapBuildProcessNewCid()), both transactions are created in the
ReorderBuffer as top-txns and have the same LSN. This failure happens
on all supported versions.

To fix the problem, one idea is to make the association between
top-txn and sub-txn at that point by calling ReorderBufferAssignChild(),
as Tomas proposed. On the other hand, since we don't guarantee to make
the association between the top-level transaction and its
sub-transactions until we try to decode the actual contents of the
transaction, it makes more sense to me that, instead of trying to solve
this by making the association, we change the code that assumes the
association has already been made.

I've attached a patch for this idea. With the patch, we skip the
assertion checks in AssertTXNLsnOrder() until we reach the LSN at
which we start decoding the contents of transactions, i.e.
start_decoding_at in SnapBuild. The minor concern is that, this way,
the assertion check could miss some faulty cases where two unrelated
top-transactions have the same LSN; with this patch, such a case would
pass. Therefore, for transactions that we skipped checking, we do the
check once we reach that LSN.
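
A minimal sketch of that idea, using the existing SnapBuildXactNeedsSkip()
helper that compares an LSN against start_decoding_at (the attached patch
may implement it differently):

	static void
	AssertTXNLsnOrder(ReorderBuffer *rb)
	{
	#ifdef USE_ASSERT_CHECKING
		LogicalDecodingContext *ctx = rb->private_data;

		/*
		 * Before start_decoding_at we may not have made the top-level /
		 * subxact association yet, so two entries can legitimately share
		 * an LSN; skip the ordering checks until then.
		 */
		if (SnapBuildXactNeedsSkip(ctx->snapshot_builder,
								   ctx->reader->EndRecPtr))
			return;

		/* ... existing per-entry first_lsn / base_snapshot checks ... */
	#endif
	}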

Please note that to pass the new regression tests, the fix proposed in
a related thread [1] is required. Particularly, we need:

@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
XLogRecPtr lsn, TransactionId xid,
        else if (sub_needs_timetravel)
        {
                /* track toplevel txn as well, subxact alone isn't meaningful */
+               elog(DEBUG2, "forced transaction %u to do timetravel
due to one of its subtransaction",
+                        xid);
+               needs_timetravel = true;
                SnapBuildAddCommittedTxn(builder, xid);
        }
        else if (needs_timetravel)

A side benefit of this approach is that we can also fix another
assertion failure that happens on REL14 and REL15 and was reported
here [2]. In the commits 68dcce247f1a (REL14) and 272248a0c1 (REL15),
the only reason we make the association between sub-txns and the
top-txn in SnapBuildXidSetCatalogChanges() is to avoid the assertion
failure in AssertTXNLsnOrder(). However, since the invalidation
messages are not transported from sub-txn to top-txn during the
assignment, another assertion check, in ReorderBufferForget(), fails
when forgetting the subtransaction. If we apply this idea of skipping
the assertion checks, we no longer need to make such an association in
SnapBuildXidSetCatalogChanges(), and we resolve this issue as well.

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it?

But this transaction id can be part of committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.

Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.

However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.

I think we will clean up such a subxact. Such a xact should be skipped
via DecodeTXNNeedSkip() and then it will call ReorderBufferForget()
for each of the subxacts and that will make sure that we clean up each
of subtxn's.

Right.

Regards,

[1]: /messages/by-id/TYAPR01MB58666BD6BE24853269624282F5419@TYAPR01MB5866.jpnprd01.prod.outlook.com
[2]: /messages/by-id/TYAPR01MB58660803BCAA7849C8584AA4F57E9@TYAPR01MB5866.jpnprd01.prod.outlook.com

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch (application/octet-stream)
From 0ccecb9ec572b2a46e69f9b9d8edcd84261ab4c3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 12 Oct 2022 11:49:04 +0900
Subject: [PATCH v1] Fix the assertion failure while processing NEW_CID in
 logical decoding.

When the logical decoding restarts from NEW_CID, since there is no
association between the top transaction and its sub transaction, both
are created as top transactions and have the same LSN. This caused
the assertion failure in AssertTXNLsnOrder().

With this change, we skip the assertion check until we reach the LSN
at which we start decoding the contents of transaction, specifically
start_decoding_at in SnapBuild. This is okay because we don't
guarantee to make the association between top transaction and sub
transaction until we try to decode the actual contents of
transactions.

By skipping the check in some cases, we could miss some faulty cases
where two unrelated top-level transactions have the same LSN. Therefore,
for the transactions we skipped, we do the assertion check when we reach
that LSN.
---
 .../expected/catalog_change_snapshot.out      | 44 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        | 11 +++++
 src/backend/replication/logical/decode.c      |  9 ++++
 .../replication/logical/reorderbuffer.c       | 19 ++++++--
 src/backend/replication/logical/snapbuild.c   |  9 ++++
 src/include/replication/reorderbuffer.h       |  2 +
 src/include/replication/snapbuild.h           |  1 +
 7 files changed, 92 insertions(+), 3 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index d2a4bdfcc1..d85552699f 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -96,3 +96,47 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_analyze s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_analyze: ANALYZE tbl1;
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index ff8f68489b..30673da61f 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -21,6 +21,7 @@ step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolatio
 step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_analyze" { ANALYZE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
 step "s0_commit" { COMMIT; }
 
@@ -57,3 +58,13 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # checkpoint record it prunes one of the xacts in that list and when decoding the
 # next checkpoint, it will completely prune that list.
 permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can handle the case where there is no association between top-level
+# transaction and its subtransactions. The last decoding restarts from the first
+# checkpoint, decodes NEW_CID generated by "s0_analyze", and marks the subtransaction
+# as containing catalog changes while adding tuple cids to its top-level transaction.
+# During that, both transaction entries are created in ReorderBuffer as top-level
+# transactions and have the same LSN. We check if the assertion check for the order
+# of transaction LSNs in AssertTXNLsnOrder() is skipped since we are still before the
+# LSN at which we start replaying the contents of transactions.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_analyze" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 2cc0ac9eb0..17349da1f0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 								 buf.origptr);
 	}
 
+#ifdef USE_ASSERT_CHECKING
+	/*
+	 * Check the order of transaction LSNs when we reached the start decoding
+	 * LSN. See the comments in AssertTXNLsnOrder() for details.
+	 */
+	if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+		AssertTXNLsnOrder(ctx->reorder);
+#endif
+
 	rmgr = GetRmgr(XLogRecGetRmid(record));
 
 	if (rmgr.rm_decode != NULL)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6dff9915a5..3a442e0d1e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -221,8 +221,6 @@ static ReorderBufferTXN *ReorderBufferTXNByXid(ReorderBuffer *rb,
 static void ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
 											  ReorderBufferTXN *subtxn);
 
-static void AssertTXNLsnOrder(ReorderBuffer *rb);
-
 /* ---------------------------------------
  * support functions for lsn-order iterating over the ->changes of a
  * transaction and its subtransactions
@@ -877,14 +875,29 @@ ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid,
  *
  * No-op if assertions are not in use.
  */
-static void
+void
 AssertTXNLsnOrder(ReorderBuffer *rb)
 {
 #ifdef USE_ASSERT_CHECKING
+	LogicalDecodingContext	*ctx = rb->private_data;
 	dlist_iter	iter;
 	XLogRecPtr	prev_first_lsn = InvalidXLogRecPtr;
 	XLogRecPtr	prev_base_snap_lsn = InvalidXLogRecPtr;
 
+	/*
+	 * Skip the verification if we don't reach the LSN at which we start
+	 * decoding the contents of transactions yet. This is okay because
+	 * until we reach the LSN, we could have transactions that don't have
+	 * the association between the top-level transaction and subtransaction
+	 * yet. We don't guarantee this association until we try to decode the
+	 * actual contents of transaction. Since we typically do this
+	 * verification when manipulating the transaction lists in the reorder
+	 * buffer, we may skip verifying some transactions. For those that we
+	 * skipped, we do that when we reached the LSN.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
+		return;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 54499c06fe..4640a6321c 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -431,6 +431,15 @@ SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
 	return ptr < builder->start_decoding_at;
 }
 
+/*
+ * Return the LSN at which the contents of transaction are first decoded.
+ */
+XLogRecPtr
+SnapBuildGetStartDecodingAt(SnapBuild *builder)
+{
+	return builder->start_decoding_at;
+}
+
 /*
  * Increase refcount of a snapshot.
  *
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 02b59a1931..97f6c26e99 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -703,4 +703,6 @@ extern void ReorderBufferSetRestartPoint(ReorderBuffer *rb, XLogRecPtr ptr);
 
 extern void StartupReorderBuffer(void);
 
+extern void AssertTXNLsnOrder(ReorderBuffer *rb);
+
 #endif
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 2a697e57c3..3b917ad3ec 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -76,6 +76,7 @@ extern SnapBuildState SnapBuildCurrentState(SnapBuild *builder);
 extern Snapshot SnapBuildGetOrBuildSnapshot(SnapBuild *builder);
 
 extern bool SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr);
+extern XLogRecPtr SnapBuildGetStartDecodingAt(SnapBuild *builder);
 extern XLogRecPtr SnapBuildGetTwoPhaseAt(SnapBuild *builder);
 extern void SnapBuildSetTwoPhaseAt(SnapBuild *builder, XLogRecPtr ptr);
 
-- 
2.31.1

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#13)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Summarizing this issue, the assertion check in AssertTXNLsnOrder()
fails as reported because the current logical decoding cannot properly
handle the case where the decoding restarts from NEW_CID. Since we
don't make the association between top-level transaction and its
subtransaction while decoding NEW_CID (ie, in
SnapBuildProcessNewCid()), two transactions are created in
ReorderBuffer as top-txn and have the same LSN. This failure happens
on all supported versions.

To fix the problem, one idea is that we make the association between
top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
as Tomas proposed. On the other hand, since we don't guarantee to make
the association between the top-level transaction and its
sub-transactions until we try to decode the actual contents of the
transaction, it makes sense to me that instead of trying to solve by
making association, we need to change the code which are assuming that
it is associated.

I've attached the patch for this idea. With the patch, we skip the
assertion checks in AssertTXNLsnOrder() until we reach the LSN at
which we start decoding the contents of transaction, ie.
start_decoding_at in SnapBuild. The minor concern is other way that
the assertion check could miss some faulty cases where two unrelated
top-transactions could have same LSN. With this patch, it will pass
for such a case. Therefore, for transactions that we skipped checking,
we do the check when we reach the LSN.

--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
  buf.origptr);
  }
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
  rmgr = GetRmgr(XLogRecGetRmid(record));

I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.

Now, say we have commit records for multiple transactions which are
after start_decoding_at but all their changes are before
start_decoding_at, then we won't check their ordering at commit time
but OTOH, we would have checked their ordering before restart. Isn't
that sufficient?

--
With Regards,
Amit Kapila.

#15Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#14)
1 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Summarizing this issue, the assertion check in AssertTXNLsnOrder()
fails as reported because the current logical decoding cannot properly
handle the case where the decoding restarts from NEW_CID. Since we
don't make the association between top-level transaction and its
subtransaction while decoding NEW_CID (ie, in
SnapBuildProcessNewCid()), two transactions are created in
ReorderBuffer as top-txn and have the same LSN. This failure happens
on all supported versions.

To fix the problem, one idea is that we make the association between
top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
as Tomas proposed. On the other hand, since we don't guarantee to make
the association between the top-level transaction and its
sub-transactions until we try to decode the actual contents of the
transaction, it makes sense to me that instead of trying to solve by
making association, we need to change the code which are assuming that
it is associated.

I've attached the patch for this idea. With the patch, we skip the
assertion checks in AssertTXNLsnOrder() until we reach the LSN at
which we start decoding the contents of transaction, ie.
start_decoding_at in SnapBuild. The minor concern is other way that
the assertion check could miss some faulty cases where two unrelated
top-transactions could have same LSN. With this patch, it will pass
for such a case. Therefore, for transactions that we skipped checking,
we do the check when we reach the LSN.

--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
buf.origptr);
}
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
rmgr = GetRmgr(XLogRecGetRmid(record));

I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.

Good point. I was considering the case where the client sets an LSN far
ahead, but it's not worth considering that case in this context. I've
updated the patch accordingly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch (application/octet-stream)
From be6e410c80901d9e78c1e0d523b775112031c491 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 12 Oct 2022 11:49:04 +0900
Subject: [PATCH v2] Fix the assertion failure while processing NEW_CID in
 logical decoding.

When the logical decoding restarts from NEW_CID, since there is no
association between the top transaction and its sub transaction, both
are created as top transactions and have the same LSN. This caused
the assertion failure in AssertTXNLsnOrder().

With this change, we skip the assertion check until we reach the LSN
at which we start decoding the contents of transaction, specifically
start_decoding_at in SnapBuild. This is okay because we don't
guarantee to make the association between top transaction and sub
transaction until we try to decode the actual contents of
transactions.

Reported-by: Tomas Vondra
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/a89b46b6-0239-2fd5-71a9-b19b1f7a7145%40enterprisedb.com
---
 .../expected/catalog_change_snapshot.out      | 44 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        | 11 +++++
 .../replication/logical/reorderbuffer.c       | 14 ++++++
 3 files changed, 69 insertions(+)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index d2a4bdfcc1..d85552699f 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -96,3 +96,47 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_analyze s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_analyze: ANALYZE tbl1;
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index ff8f68489b..30673da61f 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -21,6 +21,7 @@ step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolatio
 step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_analyze" { ANALYZE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
 step "s0_commit" { COMMIT; }
 
@@ -57,3 +58,13 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # checkpoint record it prunes one of the xacts in that list and when decoding the
 # next checkpoint, it will completely prune that list.
 permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can handle the case where there is no association between top-level
+# transaction and its subtransactions. The last decoding restarts from the first
+# checkpoint, decodes NEW_CID generated by "s0_analyze", and marks the subtransaction
+# as containing catalog changes while adding tuple cids to its top-level transaction.
+# During that, both transaction entries are created in ReorderBuffer as top-level
+# transactions and have the same LSN. We check if the assertion check for the order
+# of transaction LSNs in AssertTXNLsnOrder() is skipped since we are still before the
+# LSN at which we start replaying the contents of transactions.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_analyze" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6dff9915a5..bc81b107f3 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -881,10 +881,24 @@ static void
 AssertTXNLsnOrder(ReorderBuffer *rb)
 {
 #ifdef USE_ASSERT_CHECKING
+	LogicalDecodingContext	*ctx = rb->private_data;
 	dlist_iter	iter;
 	XLogRecPtr	prev_first_lsn = InvalidXLogRecPtr;
 	XLogRecPtr	prev_base_snap_lsn = InvalidXLogRecPtr;
 
+	/*
+	 * Skip the verification if we don't reach the LSN at which we start
+	 * decoding the contents of transactions yet because until we reach the
+	 * LSN, we could have transactions that don't have the association
+	 * between the top-level transaction and subtransaction yet and
+	 * consequently have the same LSN.  We don't guarantee this association
+	 * until we try to decode the actual contents of transaction. The ordering
+	 * of the records prior to the LSN, we should have been checked before
+	 * the restart.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
+		return;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
-- 
2.31.1

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#13)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Please note that to pass the new regression tests, the fix proposed in
a related thread[1] is required. Particularly, we need:

@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
XLogRecPtr lsn, TransactionId xid,
else if (sub_needs_timetravel)
{
/* track toplevel txn as well, subxact alone isn't meaningful */
+               elog(DEBUG2, "forced transaction %u to do timetravel
due to one of its subtransaction",
+                        xid);
+               needs_timetravel = true;
SnapBuildAddCommittedTxn(builder, xid);
}
else if (needs_timetravel)

A side benefit of this approach is that we can fix another assertion
failure too that happens on REL14 and REL15 and reported here[2]. In
the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
we make the association between sub-txns to top-txn in
SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
in AssertTXNLsnOrder(). However, since the invalidation messages are
not transported from sub-txn to top-txn during the assignment, another
assertion check in ReorderBufferForget() fails when forgetting the
subtransaction. If we apply this idea of skipping the assertion
checks, we no longer need to make the such association in
SnapBuildXidSetCatalogChanges() and resolve this issue as well.

IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.

AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem. Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.

Is my understanding correct?

--
With Regards,
Amit Kapila.

#17Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#16)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Please note that to pass the new regression tests, the fix proposed in
a related thread[1] is required. Particularly, we need:

@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
XLogRecPtr lsn, TransactionId xid,
else if (sub_needs_timetravel)
{
/* track toplevel txn as well, subxact alone isn't meaningful */
+               elog(DEBUG2, "forced transaction %u to do timetravel
due to one of its subtransaction",
+                        xid);
+               needs_timetravel = true;
SnapBuildAddCommittedTxn(builder, xid);
}
else if (needs_timetravel)

A side benefit of this approach is that we can fix another assertion
failure too that happens on REL14 and REL15 and reported here[2]. In
the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
we make the association between sub-txns to top-txn in
SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
in AssertTXNLsnOrder(). However, since the invalidation messages are
not transported from sub-txn to top-txn during the assignment, another
assertion check in ReorderBufferForget() fails when forgetting the
subtransaction. If we apply this idea of skipping the assertion
checks, we no longer need to make the such association in
SnapBuildXidSetCatalogChanges() and resolve this issue as well.

IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.

Yes.

AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem.

IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog changes while not doing
that for their top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".

Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.

Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.

Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue. Regarding
back-patching Change-3 to branches prior to v14, I think it may be okay
till v11, but I'd be hesitant for v10 as the final release comes in a
month.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#18kuroda.hayato@fujitsu.com
kuroda.hayato@fujitsu.com
In reply to: Masahiko Sawada (#17)
RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Dear Sawada-san, Amit,

IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog change while not doing
that for its top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".

Incidentally, I agreed that Change-2 is needed for HEAD (and v16), not v15 and v14.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#17)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.

Yes.

AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem.

IIUC Change-2 is required in v16 and HEAD

Why are you referring v16 and HEAD separately?

but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog change while not doing
that for its top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".

Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.

Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.

Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue.

Fair enough but then why to even backpatch it to v15 and v14?

Regarding
back-patching Change-3 to branches prior 14, I think it may be okay
til v11, but I'd be hesitant for v10 as the final release comes in a
month.

So to fix the issue in all branches, what we need to do is to
backpatch change-1: in all branches till v10, change-2: in HEAD, and
change-3: in V15 and V14. Additionally, we think, it is okay to
backpatch change-3 till v11 as it is mainly done to avoid the problem
fixed by change-1 and it makes code consistent in back branches.

I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?

--
With Regards,
Amit Kapila.

#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#19)
3 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tue, Oct 18, 2022 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.

Yes.

AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem.

IIUC Change-2 is required in v16 and HEAD

Why are you referring v16 and HEAD separately?

Sorry, my mistake, I was confused.

but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog change while not doing
that for its top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".

Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.

Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.

Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue.

Fair enough but then why to even backpatch it to v15 and v14?

Oops, it's a typo. I wanted to say Change-2 should be back-patched only to HEAD.

Regarding
back-patching Change-3 to branches prior 14, I think it may be okay
til v11, but I'd be hesitant for v10 as the final release comes in a
month.

So to fix the issue in all branches, what we need to do is to
backpatch change-1: in all branches till v10, change-2: in HEAD, and
change-3: in V15 and V14. Additionally, we think, it is okay to
backpatch change-3 till v11 as it is mainly done to avoid the problem
fixed by change-1 and it makes code consistent in back branches.

Right.

I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of a system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog changes
and don't make an association between the top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.
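
To make Change-2 concrete, the relevant decision in SnapBuildCommitTxn()
is roughly the following condensed sketch (elog() calls, xmax tracking,
and the remaining branches are omitted; not the verbatim source). Without
the marked line, a top-level transaction whose only catalog changes
happened in a subtransaction ends up with needs_snapshot set but
needs_timetravel unset, which is what trips the assertion above:

    for (nxact = 0; nxact < nsubxacts; nxact++)
    {
        TransactionId subxid = subxacts[nxact];

        if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
        {
            /* catalog-modifying subxact must be tracked in committed.xip */
            sub_needs_timetravel = true;
            needs_snapshot = true;
            SnapBuildAddCommittedTxn(builder, subxid);
        }
    }

    if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
    {
        needs_snapshot = true;
        needs_timetravel = true;
        SnapBuildAddCommittedTxn(builder, xid);
    }
    else if (sub_needs_timetravel)
    {
        /* track toplevel txn as well, subxact alone isn't meaningful */
        needs_timetravel = true;    /* <-- the line Change-2 adds */
        SnapBuildAddCommittedTxn(builder, xid);
    }

    ...

    Assert(!needs_snapshot || needs_timetravel);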

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, user_catalog_testcase.patch. The relationship
between assertion failures and test cases is very complex. I could
not find any test case that causes only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

The patch-1 doesn't include any test case but the user_catalog test
case can test both Change-1 and Change-2 on all branches. In v15 and
v14, the analyze test case causes both the assertions at
"Assert(txn->ninvalidations == 0);" and "Assert(prev_first_lsn <
cur_txn->first_lsn);" whereas the truncate test case causes the
assertion only at "Assert(txn->ninvalidations == 0);". Since the
patch-2 is applied on top of the patch-1, there is no difference in
terms of testing Change-2.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

user_catalog_testcase.patch (application/octet-stream)
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index ff8f68489b..d3120d6879 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -6,12 +6,14 @@ setup
     DROP TABLE IF EXISTS tbl2;
     CREATE TABLE tbl1 (val1 integer, val2 integer);
     CREATE TABLE tbl2 (val1 integer, val2 integer);
+    CREATE TABLE user_catalog (val1 integer) WITH (user_catalog_table = true);
 }
 
 teardown
 {
     DROP TABLE tbl1;
     DROP TABLE tbl2;
+    DROP TABLE user_catalog;
     SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
 }
 
@@ -22,6 +24,7 @@ step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_insert2" { INSERT INTO user_catalog VALUES (1); }
 step "s0_commit" { COMMIT; }
 
 session "s1"
@@ -57,3 +60,5 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # checkpoint record it prunes one of the xacts in that list and when decoding the
 # next checkpoint, it will completely prune that list.
 permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
analyze_testcase.patch (application/octet-stream)
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index ff8f68489b..b4c4b3b249 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -21,6 +21,7 @@ step "s0_init" { SELECT 'init' FROM pg_create_logical_replication_slot('isolatio
 step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
+step "s0_analyze" { ANALYZE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
 step "s0_commit" { COMMIT; }
 
@@ -57,3 +58,5 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # checkpoint record it prunes one of the xacts in that list and when decoding the
 # next checkpoint, it will completely prune that list.
 permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_analyze" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
truncate_testcase.patch (application/octet-stream)
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index ff8f68489b..5e6b5024d4 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -57,3 +57,5 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # checkpoint record it prunes one of the xacts in that list and when decoding the
 # next checkpoint, it will completely prune that list.
 permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#20)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog change
and don't make association between top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
between assertion failures and test cases are very complex. I could
not find any test case to cause only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

The patch-1 doesn't include any test case but the user_catalog test
case can test both Change-1 and Change-2 on all branches.

I was wondering if it makes sense to commit both Change-1 and Change-2
together as one patch? Both assertions are caused by a single test
case and are related to the general problem that the association of
top and sub transaction is only guaranteed to be formed before we
decode transaction changes. Also, it would be good to fix the problem
with a test case that can cause it. What do you think?

--
With Regards,
Amit Kapila.

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#15)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
buf.origptr);
}
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
rmgr = GetRmgr(XLogRecGetRmid(record));

I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.

Good point. I was considering the case where the client sets far ahead
LSN but it's not worth considering this case in this context. I've
updated the patch accoringly.

One minor comment:
Can we slightly change the comment: ". The ordering of the records
prior to the LSN, we should have been checked before the restart." to
". The ordering of the records prior to the start_decoding_at LSN
should have been checked before the restart."?

--
With Regards,
Amit Kapila.

#23Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#21)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog change
and don't make association between top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
between assertion failures and test cases are very complex. I could
not find any test case to cause only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

The patch-1 doesn't include any test case but the user_catalog test
case can test both Change-1 and Change-2 on all branches.

I was wondering if it makes sense to commit both Change-1 and Change-2
together as one patch? Both assertions are caused by a single test
case and are related to the general problem that the association of
top and sub transaction is only guaranteed to be formed before we
decode transaction changes. Also, it would be good to fix the problem
with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#24Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#22)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tue, Oct 18, 2022 at 7:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
buf.origptr);
}
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
rmgr = GetRmgr(XLogRecGetRmid(record));

I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.

Good point. I was considering the case where the client sets far ahead
LSN but it's not worth considering this case in this context. I've
updated the patch accoringly.

One minor comment:
Can we slightly change the comment: ". The ordering of the records
prior to the LSN, we should have been checked before the restart." to
". The ordering of the records prior to the start_decoding_at LSN
should have been checked before the restart."?

Agreed. I'll update the patch accordingly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#25Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#23)
2 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog change
and don't make association between top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
between assertion failures and test cases are very complex. I could
not find any test case to cause only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

The patch-1 doesn't include any test case but the user_catalog test
case can test both Change-1 and Change-2 on all branches.

I was wondering if it makes sense to commit both Change-1 and Change-2
together as one patch? Both assertions are caused by a single test
case and are related to the general problem that the association of
top and sub transaction is only guaranteed to be formed before we
decode transaction changes. Also, it would be good to fix the problem
with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

I've attached two patches that need to be back-patched to all branches
and include Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Regards,

[1]: /messages/by-id/TYCPR01MB83733C6CEAE47D0280814D5AED7A9@TYCPR01MB8373.jpnprd01.prod.outlook.com
[2]: /messages/by-id/TYAPR01MB5866B30A1439043B1FC3F21EF5229@TYAPR01MB5866.jpnprd01.prod.outlook.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v15_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch (application/octet-stream)
From c4b387b009f4f4bffd047cf2401f3ebce31b1c92 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 12 Oct 2022 11:49:04 +0900
Subject: [PATCH v3] Fix the assertion failure while processing NEW_CID in
 logical decoding.

When the logical decoding restarts from NEW_CID, since there is no
association between the top transaction and its sub transaction, both
are created as top transactions and have the same LSN. This caused
the assertion failure in AssertTXNLsnOrder(). Besides, when decoding
the commit record of top transaction, we missed setting
needs_timetravel = true although one of its subtransactions has been
marked as containing catalog change, which caused another assertion
failure in SnapBuildCommitTxn().

With this change, we skip the assertion check until we reach the LSN
at which we start decoding the contents of transaction, specifically
start_decoding_at in SnapBuild. This is okay because we don't
guarantee to make the association between top transaction and sub
transaction until we try to decode the actual contents of
transactions. The ordering of the records prior to the
start_decoding_at LSN should have been checked before the restart. In
addition, when decoding the commit record, we force top transaction to
do timetravel if one of its subtransactions has been marked as
containing catalog changes.

Back-patch to all supported branches.

Reported-by: Tomas Vondra, Osumi Takamichi
Author: Osumi Takamichi, Masahiko Sawada
Reviewed-by: Amit Kapila, Dilip Kumar, Kuroda Hayato, Kyotaro Horiguchi, Masahiko Sawada
Discussion: https://postgr.es/m/a89b46b6-0239-2fd5-71a9-b19b1f7a7145%40enterprisedb.com
Discussion: https://postgr.es/m/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
Backpatch-through: 10
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        | 16 +++++++
 .../replication/logical/reorderbuffer.c       | 14 ++++++
 src/backend/replication/logical/snapbuild.c   |  3 ++
 4 files changed, 78 insertions(+)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index dc4f9b7018..1d75cf5af0 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -42,3 +42,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_insert2 s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_insert2: INSERT INTO user_cat VALUES (1);
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.user_cat: INSERT: val1[integer]:1               
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2971ddc69c..2ad1edeaa8 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -4,11 +4,13 @@ setup
 {
     DROP TABLE IF EXISTS tbl1;
     CREATE TABLE tbl1 (val1 integer, val2 integer);
+    CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
 }
 
 teardown
 {
     DROP TABLE tbl1;
+    DROP TABLE user_cat;
     SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
 }
 
@@ -19,6 +21,7 @@ step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_insert2" { INSERT INTO user_cat VALUES (1); }
 step "s0_commit" { COMMIT; }
 
 session "s1"
@@ -37,3 +40,16 @@ step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_
 # record written by bgwriter.  One might think we can either stop the bgwriter or
 # increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can handle the case where there is no association between top-level
+# transaction and its subtransactions. The last decoding restarts from the first
+# checkpoint, decodes NEW_CID generated by "s0_insert2", and marks the subtransaction
+# as containing catalog changes while adding tuple cids to its top-level transaction.
+# During that, both transaction entries are created in ReorderBuffer as top-level
+# transactions and have the same LSN. We check if the assertion check for the order
+# of transaction LSNs in AssertTXNLsnOrder() is skipped since we are still before the
+# LSN at which we start replaying the contents of transactions. Besides, when decoding
+# the commit record of the top-level transaction, we must force the top-level
+# transaction to do timetravel since one of its subtransactions has been marked as
+# containing catalog changes.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8971df140b..e1f47b5208 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -878,10 +878,24 @@ static void
 AssertTXNLsnOrder(ReorderBuffer *rb)
 {
 #ifdef USE_ASSERT_CHECKING
+	LogicalDecodingContext	*ctx = rb->private_data;
 	dlist_iter	iter;
 	XLogRecPtr	prev_first_lsn = InvalidXLogRecPtr;
 	XLogRecPtr	prev_base_snap_lsn = InvalidXLogRecPtr;
 
+	/*
+	 * Skip the verification if we don't reach the LSN at which we start
+	 * decoding the contents of transactions yet because until we reach the
+	 * LSN, we could have transactions that don't have the association
+	 * between the top-level transaction and subtransaction yet and
+	 * consequently have the same LSN.  We don't guarantee this association
+	 * until we try to decode the actual contents of transaction. The ordering
+	 * of the records prior to the start_decoding_at LSN should have been checked
+	 * before the restart.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
+		return;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 385817e295..80b7d86e79 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1101,6 +1101,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	else if (sub_needs_timetravel)
 	{
 		/* track toplevel txn as well, subxact alone isn't meaningful */
+		elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
+			 xid);
+		needs_timetravel = true;
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
 	else if (needs_timetravel)
-- 
2.31.1

HEAD_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patchapplication/octet-stream; name=HEAD_v3-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patchDownload
From 65ebf68fa447e98233c08bdae3438d206d4584af Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 12 Oct 2022 11:49:04 +0900
Subject: [PATCH v3] Fix the assertion failure while processing NEW_CID in
 logical decoding.

When the logical decoding restarts from NEW_CID, since there is no
association between the top transaction and its sub transaction, both
are created as top transactions and have the same LSN. This caused
the assertion failure in AssertTXNLsnOrder(). Besides, when decoding
the commit record of top transaction, we missed setting
needs_timetravel = true although one of its subtransactions has been
marked as containing catalog change, which caused another assertion
failure in SnapBuildCommitTxn().

With this change, we skip the assertion check until we reach the LSN
at which we start decoding the contents of transaction, specifically
start_decoding_at in SnapBuild. This is okay because we don't
guarantee to make the association between top transaction and sub
transaction until we try to decode the actual contents of
transactions. The ordering of the records prior to the
start_decoding_at LSN should have been checked before the restart. In
addition, when decoding the commit record, we force top transaction to
do timetravel if one of its subtransactions has been marked as
containing catalog changes.

Back-patch to all supported branches.

Reported-by: Tomas Vondra, Osumi Takamichi
Author: Osumi Takamichi, Masahiko Sawada
Reviewed-by: Amit Kapila, Dilip Kumar, Kuroda Hayato, Kyotaro Horiguchi, Masahiko Sawada
Discussion: https://postgr.es/m/a89b46b6-0239-2fd5-71a9-b19b1f7a7145%40enterprisedb.com
Discussion: https://postgr.es/m/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
Backpatch-through: 10
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        | 16 +++++++
 .../replication/logical/reorderbuffer.c       | 14 ++++++
 src/backend/replication/logical/snapbuild.c   |  3 ++
 4 files changed, 78 insertions(+)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index d2a4bdfcc1..8722787e6e 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -96,3 +96,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_insert2 s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_insert2: INSERT INTO user_cat VALUES (1);
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.user_cat: INSERT: val1[integer]:1               
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index ff8f68489b..673bccf4b0 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -6,12 +6,14 @@ setup
     DROP TABLE IF EXISTS tbl2;
     CREATE TABLE tbl1 (val1 integer, val2 integer);
     CREATE TABLE tbl2 (val1 integer, val2 integer);
+    CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
 }
 
 teardown
 {
     DROP TABLE tbl1;
     DROP TABLE tbl2;
+    DROP TABLE user_cat;
     SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
 }
 
@@ -22,6 +24,7 @@ step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_insert2" { INSERT INTO user_cat VALUES (1); }
 step "s0_commit" { COMMIT; }
 
 session "s1"
@@ -57,3 +60,16 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # checkpoint record it prunes one of the xacts in that list and when decoding the
 # next checkpoint, it will completely prune that list.
 permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can handle the case where there is no association between top-level
+# transaction and its subtransactions. The last decoding restarts from the first
+# checkpoint, decodes NEW_CID generated by "s0_insert2", and marks the subtransaction
+# as containing catalog changes while adding tuple cids to its top-level transaction.
+# During that, both transaction entries are created in ReorderBuffer as top-level
+# transactions and have the same LSN. We check if the assertion check for the order
+# of transaction LSNs in AssertTXNLsnOrder() is skipped since we are still before the
+# LSN at which we start replaying the contents of transactions. Besides, when decoding
+# the commit record of the top-level transaction, we must force the top-level
+# transaction to do timetravel since one of its subtransactions has been marked as
+# containing catalog changes.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6dff9915a5..7be69f744e 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -881,10 +881,24 @@ static void
 AssertTXNLsnOrder(ReorderBuffer *rb)
 {
 #ifdef USE_ASSERT_CHECKING
+	LogicalDecodingContext	*ctx = rb->private_data;
 	dlist_iter	iter;
 	XLogRecPtr	prev_first_lsn = InvalidXLogRecPtr;
 	XLogRecPtr	prev_base_snap_lsn = InvalidXLogRecPtr;
 
+	/*
+	 * Skip the verification if we don't reach the LSN at which we start
+	 * decoding the contents of transactions yet because until we reach the
+	 * LSN, we could have transactions that don't have the association
+	 * between the top-level transaction and subtransaction yet and
+	 * consequently have the same LSN.  We don't guarantee this association
+	 * until we try to decode the actual contents of transaction. The ordering
+	 * of the records prior to the start_decoding_at LSN should have been checked
+	 * before the restart.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
+		return;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 54499c06fe..f892d19bfb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	else if (sub_needs_timetravel)
 	{
 		/* track toplevel txn as well, subxact alone isn't meaningful */
+		elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
+			 xid);
+		needs_timetravel = true;
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
 	else if (needs_timetravel)
-- 
2.31.1

#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#25)
3 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog changes
and don't make the association between top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.
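
To make the scenario concrete, the attached test case boils down to roughly the
following, condensed from the isolation spec (sessions s0 and s1 interleave as in
the attached permutation; table, slot, and step names are the ones used there,
so this is a sketch rather than the literal isolation steps):

    -- setup: a plain table and a user catalog table
    CREATE TABLE tbl1 (val1 integer, val2 integer);
    CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);

    -- s0: create the slot, then start a transaction whose changes all run
    -- under a savepoint, i.e. in a subtransaction
    SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
    BEGIN;
    SAVEPOINT sp1;
    INSERT INTO tbl1 VALUES (1);

    -- s1: checkpoint and an initial read while s0 is still in progress
    CHECKPOINT;
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL,
        'skip-empty-xacts', '1', 'include-xids', '0');

    -- s0: insert into the user catalog table (emits a NEW_CID record but no
    -- invalidation messages), then commit
    INSERT INTO user_cat VALUES (1);
    COMMIT;

    -- The remaining steps (another s0 INSERT and COMMIT, with a CHECKPOINT and
    -- pg_logical_slot_get_changes in between) make the final
    -- pg_logical_slot_get_changes() restart decoding from the first checkpoint,
    -- which is where an unpatched build hits the assertion failures.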

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
between assertion failures and test cases are very complex. I could
not find any test case to cause only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

The patch-1 doesn't include any test case but the user_catalog test
case can test both Change-1 and Change-2 on all branches.

I was wondering if it makes sense to commit both Change-1 and Change-2
together as one patch? Both assertions are caused by a single test
case and are related to the general problem that the association of
top and sub transaction is only guaranteed to be formed before we
decode transaction changes. Also, it would be good to fix the problem
with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

I've attached two patches that need to be back-patched to all branches
and include Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

I've attached patches for Change-3 with a test case. Please review them as well.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v12-13_v1-0001-Don-t-assign-subtransactions-to-top-transaction-i.patchapplication/x-patch; name=v12-13_v1-0001-Don-t-assign-subtransactions-to-top-transaction-i.patchDownload
From 763cb604fae97131b5be2da36deb10054b6dbf3a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v1] Don't assign subtransactions to top transaction in
 SnapBuildXidSetCatalogChanges().

Previously, when decoding the commit record of the transaction, we
mark both top-level transaction and its subtransactions as containing
catalog changes and assign the subtransactions to the top-level
transaction. However, if the subtransactions have invalidation
messages, we missed executing them when forgetting the
transactions. In commit 272248a0c1 where we introduced
SnapBuildXidSetCatalogChanges(), the reason why we assigned them is
just to avoid the assertion failure in AssertTXNLsnOrder() as they
have the same LSN. Now that with commit XXX we skip this assertion
check until we reach the LSN at which we start decoding the contents of the
transaction, there is no reason for that anymore.

SnapBuildXidSetCatalogChanges() was introduced in 15 or older but this
bug doesn't exist in branches prior to 14 since we don't add
invalidation messages to subtransactions. We decided to backpatch
through 11 for consistency but don't for 10 since its final release
will come soon.

Reported-by: Osumi Takamichi
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com
Backpatch-through: 11
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        |  8 ++++
 src/backend/replication/logical/snapbuild.c   |  7 +--
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index 1d75cf5af0..b33e49c0b1 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -87,3 +87,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_truncate s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_truncate: TRUNCATE tbl1;
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.tbl1: TRUNCATE: (no-flags)                      
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2ad1edeaa8..b3ccb9e0e9 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -53,3 +53,11 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # transaction to do timetravel since one of its subtransactions has been marked as
 # containing catalog changes.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# The last decoding restarts from the first checkpoint, and add invalidation messages
+# generated by "s0_truncate" to the subtransaction. When decoding the commit record of
+# the top-level transaction, we mark both top-level transaction and its subtransactions
+# as containing catalog changes. However, we check if we don't create the association
+# between top-level and subtransactions at this time. Otherwise, we miss executing
+# invalidation messages when forgetting the transaction.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 1e7c918bde..14ce0d0523 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2145,10 +2145,11 @@ SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt
 	if (bsearch(&xid, InitialRunningXacts, NInitialRunningXacts,
 				sizeof(TransactionId), xidComparator) != NULL)
 	{
+		/*
+		 * We will assign subtransactions to the top transaction before
+		 * replaying the contents of the transaction.
+		 */
 		for (int i = 0; i < subxcnt; i++)
-		{
-			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
 			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
-		}
 	}
 }
-- 
2.31.1

v14-v15_v1-0001-Don-t-assign-subtransactions-to-top-transaction-i.patchapplication/x-patch; name=v14-v15_v1-0001-Don-t-assign-subtransactions-to-top-transaction-i.patchDownload
From f1953aaf460e6c80c5f7680a291917b7d3c20ba4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v1] Don't assign subtransactions to top transaction in
 SnapBuildXidSetCatalogChanges().

Previously, when decoding the commit record of the transaction, we
mark both top-level transaction and its subtransactions as containing
catalog changes and assign the subtransactions to the top-level
transaction. However, if the subtransactions have invalidation
messages, we missed executing them when forgetting the
transactions. In commit 272248a0c1 where we introduced
SnapBuildXidSetCatalogChanges(), the reason why we assigned them is
just to avoid the assertion failure in AssertTXNLsnOrder() as they
have the same LSN. Now that with commit XXX we skip this assertion
check until we reach the LSN at which we start decoding the contents of the
transaction, there is no reason for that anymore.

SnapBuildXidSetCatalogChanges() was introduced in 15 or older but this
bug doesn't exist in branches prior to 14 since we don't add
invalidation messages to subtransactions. We decided to backpatch
through 11 for consistency but don't for 10 since its final release
will come soon.

Reported-by: Osumi Takamichi
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com
Backpatch-through: 11
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        |  8 ++++
 src/backend/replication/logical/snapbuild.c   |  7 +--
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index 1d75cf5af0..b33e49c0b1 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -87,3 +87,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_truncate s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_truncate: TRUNCATE tbl1;
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.tbl1: TRUNCATE: (no-flags)                      
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2ad1edeaa8..b3ccb9e0e9 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -53,3 +53,11 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # transaction to do timetravel since one of its subtransactions has been marked as
 # containing catalog changes.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# The last decoding restarts from the first checkpoint, and add invalidation messages
+# generated by "s0_truncate" to the subtransaction. When decoding the commit record of
+# the top-level transaction, we mark both top-level transaction and its subtransactions
+# as containing catalog changes. However, we check if we don't create the association
+# between top-level and subtransactions at this time. Otherwise, we miss executing
+# invalidation messages when forgetting the transaction.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 80b7d86e79..3204e1e3a1 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2119,10 +2119,11 @@ SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt
 	{
 		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
 
+		/*
+		 * We will assign subtransactions to the top transaction before
+		 * replaying the contents of the transaction.
+		 */
 		for (int i = 0; i < subxcnt; i++)
-		{
-			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
 			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
-		}
 	}
 }
-- 
2.31.1

v11_v1-0001-Don-t-assign-subtransactions-to-top-transaction-i.patchapplication/x-patch; name=v11_v1-0001-Don-t-assign-subtransactions-to-top-transaction-i.patchDownload
From d60b190a7a64600bb952c42a3565973d96bcf456 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v1] Don't assign subtransactions to top transaction in
 SnapBuildXidSetCatalogChanges().

Previously, when decoding the commit record of the transaction, we
mark both top-level transaction and its subtransactions as containing
catalog changes and assign the subtransactions to the top-level
transaction. However, if the subtransactions have invalidation
messages, we missed executing them when forgetting the
transactions. In commit 272248a0c1 where we introduced
SnapBuildXidSetCatalogChanges(), the reason why we assigned them is
just to avoid the assertion failure in AssertTXNLsnOrder() as they
have the same LSN. Now that with commit XXX we skip this assertion
check until we reach the LSN at which we start decoding the contents of the
transaction, there is no reason for that anymore.

SnapBuildXidSetCatalogChanges() was introduced in 15 or older but this
bug doesn't exist in branches prior to 14 since we don't add
invalidation messages to subtransactions. We decided to backpatch
through 11 for consistency but don't for 10 since its final release
will come soon.

Reported-by: Osumi Takamichi
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com
Backpatch-through: 11
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        |  8 ++++
 src/backend/replication/logical/snapbuild.c   |  7 +--
 3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index 1d75cf5af0..b33e49c0b1 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -87,3 +87,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_truncate s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_truncate: TRUNCATE tbl1;
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.tbl1: TRUNCATE: (no-flags)                      
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2ad1edeaa8..b3ccb9e0e9 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -53,3 +53,11 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # transaction to do timetravel since one of its subtransactions has been marked as
 # containing catalog changes.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# The last decoding restarts from the first checkpoint, and add invalidation messages
+# generated by "s0_truncate" to the subtransaction. When decoding the commit record of
+# the top-level transaction, we mark both top-level transaction and its subtransactions
+# as containing catalog changes. However, we check if we don't create the association
+# between top-level and subtransactions at this time. Otherwise, we miss executing
+# invalidation messages when forgetting the transaction.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index dfe9d017f2..f3f5aba6a3 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2113,10 +2113,11 @@ SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt
 	{
 		int		i;
 
+		/*
+		 * We will assign subtransactions to the top transaction before
+		 * replaying the contents of the transaction.
+		 */
 		for (i = 0; i < subxcnt; i++)
-		{
-			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
 			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
-		}
 	}
 }
-- 
2.31.1

#27Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#25)
2 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog changes
and don't make the association between top-level and sub transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.

There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
between assertion failures and test cases are very complex. I could
not find any test case to cause only one assertion failure on all
branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

The patch-1 doesn't include any test case but the user_catalog test
case can test both Change-1 and Change-2 on all branches.

I was wondering if it makes sense to commit both Change-1 and Change-2
together as one patch? Both assertions are caused by a single test
case and are related to the general problem that the association of
top and sub transaction is only guaranteed to be formed before we
decode transaction changes. Also, it would be good to fix the problem
with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

I've attached two patches that need to be back-patched to all branches
and include Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Amit pointed out offlist that the changes in reorderbuffer.c are not
pgindent'ed. I've run pgindent and attached updated patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

HEAD_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patchapplication/octet-stream; name=HEAD_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patchDownload
From 4fbb52a567c021c5abf862afba342e59c16d6bee Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 12 Oct 2022 11:49:04 +0900
Subject: [PATCH v4] Fix the assertion failure while processing NEW_CID in
 logical decoding.

When the logical decoding restarts from NEW_CID, since there is no
association between the top transaction and its sub transaction, both
are created as top transactions and have the same LSN. This caused
the assertion failure in AssertTXNLsnOrder(). Besides, when decoding
the commit record of top transaction, we missed setting
needs_timetravel = true although one of its subtransactions has been
marked as containing catalog change, which caused another assertion
failure in SnapBuildCommitTxn().

With this change, we skip the assertion check until we reach the LSN
at which we start decoding the contents of transaction, specifically
start_decoding_at in SnapBuild. This is okay because we don't
guarantee to make the association between top transaction and sub
transaction until we try to decode the actual contents of
transactions. The ordering of the records prior to the
start_decoding_at LSN should have been checked before the restart. In
addition, when decoding the commit record, we force top transaction to
do timetravel if one of its subtransactions has been marked as
containing catalog changes.

Back-patch to all supported branches.

Reported-by: Tomas Vondra, Osumi Takamichi
Author: Osumi Takamichi, Masahiko Sawada
Reviewed-by: Amit Kapila, Dilip Kumar, Kuroda Hayato, Kyotaro Horiguchi, Masahiko Sawada
Discussion: https://postgr.es/m/a89b46b6-0239-2fd5-71a9-b19b1f7a7145%40enterprisedb.com
Discussion: https://postgr.es/m/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
Backpatch-through: 10
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        | 16 +++++++
 .../replication/logical/reorderbuffer.c       | 14 ++++++
 src/backend/replication/logical/snapbuild.c   |  3 ++
 4 files changed, 78 insertions(+)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index d2a4bdfcc1..8722787e6e 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -96,3 +96,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_insert2 s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_insert2: INSERT INTO user_cat VALUES (1);
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.user_cat: INSERT: val1[integer]:1               
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index ff8f68489b..673bccf4b0 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -6,12 +6,14 @@ setup
     DROP TABLE IF EXISTS tbl2;
     CREATE TABLE tbl1 (val1 integer, val2 integer);
     CREATE TABLE tbl2 (val1 integer, val2 integer);
+    CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
 }
 
 teardown
 {
     DROP TABLE tbl1;
     DROP TABLE tbl2;
+    DROP TABLE user_cat;
     SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
 }
 
@@ -22,6 +24,7 @@ step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_insert2" { INSERT INTO user_cat VALUES (1); }
 step "s0_commit" { COMMIT; }
 
 session "s1"
@@ -57,3 +60,16 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # checkpoint record it prunes one of the xacts in that list and when decoding the
 # next checkpoint, it will completely prune that list.
 permutation "s0_init" "s0_begin" "s0_truncate" "s2_begin" "s2_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s2_commit" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can handle the case where there is no association between top-level
+# transaction and its subtransactions. The last decoding restarts from the first
+# checkpoint, decodes NEW_CID generated by "s0_insert2", and marks the subtransaction
+# as containing catalog changes while adding tuple cids to its top-level transaction.
+# During that, both transaction entries are created in ReorderBuffer as top-level
+# transactions and have the same LSN. We check if the assertion check for the order
+# of transaction LSNs in AssertTXNLsnOrder() is skipped since we are still before the
+# LSN at which we start replaying the contents of transactions. Besides, when decoding
+# the commit record of the top-level transaction, we must force the top-level
+# transaction to do timetravel since one of its subtransactions has been marked as
+# containing catalog changes.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 6dff9915a5..f55bf44290 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -881,10 +881,24 @@ static void
 AssertTXNLsnOrder(ReorderBuffer *rb)
 {
 #ifdef USE_ASSERT_CHECKING
+	LogicalDecodingContext *ctx = rb->private_data;
 	dlist_iter	iter;
 	XLogRecPtr	prev_first_lsn = InvalidXLogRecPtr;
 	XLogRecPtr	prev_base_snap_lsn = InvalidXLogRecPtr;
 
+	/*
+	 * Skip the verification if we don't reach the LSN at which we start
+	 * decoding the contents of transactions yet because until we reach the
+	 * LSN, we could have transactions that don't have the association between
+	 * the top-level transaction and subtransaction yet and consequently have
+	 * the same LSN.  We don't guarantee this association until we try to
+	 * decode the actual contents of transaction. The ordering of the records
+	 * prior to the start_decoding_at LSN should have been checked before the
+	 * restart.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
+		return;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 54499c06fe..f892d19bfb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	else if (sub_needs_timetravel)
 	{
 		/* track toplevel txn as well, subxact alone isn't meaningful */
+		elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
+			 xid);
+		needs_timetravel = true;
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
 	else if (needs_timetravel)
-- 
2.31.1

v10-v15_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patchapplication/octet-stream; name=v10-v15_v4-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patchDownload
From 0e61669c150a0ba0b2263959ef617113f0e161ed Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 12 Oct 2022 11:49:04 +0900
Subject: [PATCH v4] Fix the assertion failure while processing NEW_CID in
 logical decoding.

When the logical decoding restarts from NEW_CID, since there is no
association between the top transaction and its sub transaction, both
are created as top transactions and have the same LSN. This caused
the assertion failure in AssertTXNLsnOrder(). Besides, when decoding
the commit record of top transaction, we missed setting
needs_timetravel = true although one of its subtransactions has been
marked as containing catalog change, which caused another assertion
failure in SnapBuildCommitTxn().

With this change, we skip the assertion check until we reach the LSN
at which we start decoding the contents of transaction, specifically
start_decoding_at in SnapBuild. This is okay because we don't
guarantee to make the association between top transaction and sub
transaction until we try to decode the actual contents of
transactions. The ordering of the records prior to the
start_decoding_at LSN should have been checked before the restart. In
addition, when decoding the commit record, we force top transaction to
do timetravel if one of its subtransactions has been marked as
containing catalog changes.

Back-patch to all supported branches.

Reported-by: Tomas Vondra, Osumi Takamichi
Author: Osumi Takamichi, Masahiko Sawada
Reviewed-by: Amit Kapila, Dilip Kumar, Kuroda Hayato, Kyotaro Horiguchi, Masahiko Sawada
Discussion: https://postgr.es/m/a89b46b6-0239-2fd5-71a9-b19b1f7a7145%40enterprisedb.com
Discussion: https://postgr.es/m/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
Backpatch-through: 10
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        | 16 +++++++
 .../replication/logical/reorderbuffer.c       | 14 ++++++
 src/backend/replication/logical/snapbuild.c   |  3 ++
 4 files changed, 78 insertions(+)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index dc4f9b7018..1d75cf5af0 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -42,3 +42,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_insert2 s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_insert2: INSERT INTO user_cat VALUES (1);
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.user_cat: INSERT: val1[integer]:1               
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2971ddc69c..2ad1edeaa8 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -4,11 +4,13 @@ setup
 {
     DROP TABLE IF EXISTS tbl1;
     CREATE TABLE tbl1 (val1 integer, val2 integer);
+    CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
 }
 
 teardown
 {
     DROP TABLE tbl1;
+    DROP TABLE user_cat;
     SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
 }
 
@@ -19,6 +21,7 @@ step "s0_begin" { BEGIN; }
 step "s0_savepoint" { SAVEPOINT sp1; }
 step "s0_truncate" { TRUNCATE tbl1; }
 step "s0_insert" { INSERT INTO tbl1 VALUES (1); }
+step "s0_insert2" { INSERT INTO user_cat VALUES (1); }
 step "s0_commit" { COMMIT; }
 
 session "s1"
@@ -37,3 +40,16 @@ step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_
 # record written by bgwriter.  One might think we can either stop the bgwriter or
 # increase LOG_SNAPSHOT_INTERVAL_MS but it's not practical via tests.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s1_get_changes" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# Test that we can handle the case where there is no association between top-level
+# transaction and its subtransactions. The last decoding restarts from the first
+# checkpoint, decodes NEW_CID generated by "s0_insert2", and marks the subtransaction
+# as containing catalog changes while adding tuple cids to its top-level transaction.
+# During that, both transaction entries are created in ReorderBuffer as top-level
+# transactions and have the same LSN. We check if the assertion check for the order
+# of transaction LSNs in AssertTXNLsnOrder() is skipped since we are still before the
+# LSN at which we start replaying the contents of transactions. Besides, when decoding
+# the commit record of the top-level transaction, we must force the top-level
+# transaction to do timetravel since one of its subtransactions has been marked as
+# containing catalog changes.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8971df140b..5d42283b56 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -878,10 +878,24 @@ static void
 AssertTXNLsnOrder(ReorderBuffer *rb)
 {
 #ifdef USE_ASSERT_CHECKING
+	LogicalDecodingContext *ctx = rb->private_data;
 	dlist_iter	iter;
 	XLogRecPtr	prev_first_lsn = InvalidXLogRecPtr;
 	XLogRecPtr	prev_base_snap_lsn = InvalidXLogRecPtr;
 
+	/*
+	 * Skip the verification if we don't reach the LSN at which we start
+	 * decoding the contents of transactions yet because until we reach the
+	 * LSN, we could have transactions that don't have the association between
+	 * the top-level transaction and subtransaction yet and consequently have
+	 * the same LSN.  We don't guarantee this association until we try to
+	 * decode the actual contents of transaction. The ordering of the records
+	 * prior to the start_decoding_at LSN should have been checked before the
+	 * restart.
+	 */
+	if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
+		return;
+
 	dlist_foreach(iter, &rb->toplevel_by_lsn)
 	{
 		ReorderBufferTXN *cur_txn = dlist_container(ReorderBufferTXN, node,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 385817e295..80b7d86e79 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1101,6 +1101,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	else if (sub_needs_timetravel)
 	{
 		/* track toplevel txn as well, subxact alone isn't meaningful */
+		elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransactions",
+			 xid);
+		needs_timetravel = true;
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
 	else if (needs_timetravel)
-- 
2.31.1

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#27)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached two patches that need to be back-patched to all branches
and include Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Amit pointed out offlist that the changes in reorderbuffer.c are not
pgindent'ed. I've run pgindent and attached updated patches.

Thanks, I have tested these across all branches till v10 and it works
as expected. I am planning to push this tomorrow unless I see any
further comments.

--
With Regards,
Amit Kapila.

#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#26)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Oct 19, 2022 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached patches for Change-3 with a test case. Please review them as well.

The patch looks mostly good to me apart from a few minor comments, which
are as follows:
1.
+# The last decoding restarts from the first checkpoint, and add
invalidation messages
+# generated by "s0_truncate" to the subtransaction. When decoding the
commit record of
+# the top-level transaction, we mark both top-level transaction and
its subtransactions
+# as containing catalog changes. However, we check if we don't create
the association
+# between top-level and subtransactions at this time. Otherwise, we
miss executing
+# invalidation messages when forgetting the transaction.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert"
"s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin"
"s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit"
"s1_get_changes"

The second part of this comment seems to say more than required, which
makes it less clear. How about something like: "The last
decoding restarts from the first checkpoint and adds invalidation
messages generated by "s0_truncate" to the subtransaction. While
processing the commit record for the top-level transaction, we decide
to skip this xact but ensure that corresponding invalidation messages
get processed."?

2.
+ /*
+ * We will assign subtransactions to the top transaction before
+ * replaying the contents of the transaction.
+ */

I don't think we need this comment.

--
With Regards,
Amit Kapila.

#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#28)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Oct 19, 2022 at 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached two patches that need to be back-patched to all branches
and includes Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Amit pointed out offlist that the changes in reorderbuffer.c is not
pgindent'ed. I've run pgindent and attached updated patches.

Thanks, I have tested these across all branches till v10 and it works
as expected. I am planning to push this tomorrow unless I see any
further comments.

Pushed.

--
With Regards,
Amit Kapila.

#31Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#30)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Thu, Oct 20, 2022 at 8:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 19, 2022 at 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached two patches that need to be back-patched to all branches
and includes Change-1, Change-2, and a test case for them. FYI this
patch resolves the assertion failure reported in this thread as well
as one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Amit pointed out offlist that the changes in reorderbuffer.c is not
pgindent'ed. I've run pgindent and attached updated patches.

Thanks, I have tested these across all branches till v10 and it works
as expected. I am planning to push this tomorrow unless I see any
further comments.

Pushed.

Thank you!

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#32Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#29)
2 attachment(s)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Thu, Oct 20, 2022 at 6:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Oct 19, 2022 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached patches for Change-3 with a test case. Please review them as well.

The patch looks mostly good to me apart from few minor comments which
are as follows:
1.
+# The last decoding restarts from the first checkpoint, and add
invalidation messages
+# generated by "s0_truncate" to the subtransaction. When decoding the
commit record of
+# the top-level transaction, we mark both top-level transaction and
its subtransactions
+# as containing catalog changes. However, we check if we don't create
the association
+# between top-level and subtransactions at this time. Otherwise, we
miss executing
+# invalidation messages when forgetting the transaction.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert"
"s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin"
"s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit"
"s1_get_changes"

The second part of this comment seems to say things more than required
which makes it less clear. How about something like: "The last
decoding restarts from the first checkpoint and adds invalidation
messages generated by "s0_truncate" to the subtransaction. While
processing the commit record for the top-level transaction, we decide
to skip this xact but ensure that corresponding invalidation messages
get processed."?

2.
+ /*
+ * We will assign subtransactions to the top transaction before
+ * replaying the contents of the transaction.
+ */

I don't think we need this comment.

Thank you for the comments! I agree with all of them and have updated the
patches accordingly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v12-v15_v2-0001-Fix-executing-invalidation-messages-generated-by-.patch (application/octet-stream)
From e8b7888d6b2bf11522f9bdcb6472e7cc87aa9ebf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v2] Fix executing invalidation messages generated by
 subtransactions during decoding.

This problem has been introduced by commit 272248a0c1 where we started
assigning the subtransactions to the top-level transaction when we mark
both the top-level transaction and its subtransactions as containing
catalog changes. After we assign subtransactions to the top-level
transaction, we were not allowed to execute any invalidations associated
with it when we decide to skip the transaction.

The reason to assign the subtransactions to the top-level transaction was
to avoid the assertion failure in AssertTXNLsnOrder() as they have the
same LSN when we sometimes start accumulating transaction changes for
partial transactions after the restart. Now that with commit 64ff0fe4e8,
we skip this assertion check until we reach the LSN at which we start
decoding the contents of the transaction, so, there is no reason for such
an assignment anymore.

The assignment change was introduced in 15 and prior versions but this bug
doesn't exist in branches prior to 14 since we don't add invalidation
messages to subtransactions. We decided to backpatch through 11 for
consistency but not for 10 since its final release is near.

Reported-by: Osumi Takamichi
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com
Backpatch: Backpatch-through: 11
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        |  7 +++
 src/backend/replication/logical/snapbuild.c   |  3 --
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index 1d75cf5af0..b33e49c0b1 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -87,3 +87,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_truncate s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_truncate: TRUNCATE tbl1;
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.tbl1: TRUNCATE: (no-flags)                      
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2ad1edeaa8..770dbd642d 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -53,3 +53,10 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # transaction to do timetravel since one of its subtransactions has been marked as
 # containing catalog changes.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# The last decoding restarts from the first checkpoint and adds invalidation
+# messages generated by "s0_truncate" to the subtransaction. While
+# processing the commit record for the top-level transaction, we decide
+# to skip this xact but ensure that corresponding invalidation messages
+# get processed.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 80b7d86e79..cdf4aa01e9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2120,9 +2120,6 @@ SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt
 		ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
 
 		for (int i = 0; i < subxcnt; i++)
-		{
-			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
 			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
-		}
 	}
 }
-- 
2.31.1

v11_v2-0001-Fix-executing-invalidation-messages-generated-by-.patch (application/octet-stream)
From 3883432d397d5e1a269d737aeec8657f985e8eef Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 19 Oct 2022 12:16:43 +0900
Subject: [PATCH v2] Fix executing invalidation messages generated by
 subtransactions during decoding.

This problem has been introduced by commit 272248a0c1 where we started
assigning the subtransactions to the top-level transaction when we mark
both the top-level transaction and its subtransactions as containing
catalog changes. After we assign subtransactions to the top-level
transaction, we were not allowed to execute any invalidations associated
with it when we decide to skip the transaction.

The reason to assign the subtransactions to the top-level transaction was
to avoid the assertion failure in AssertTXNLsnOrder() as they have the
same LSN when we sometimes start accumulating transaction changes for
partial transactions after the restart. Now that with commit 64ff0fe4e8,
we skip this assertion check until we reach the LSN at which we start
decoding the contents of the transaction, so, there is no reason for such
an assignment anymore.

The assignment change was introduced in 15 and prior versions but this bug
doesn't exist in branches prior to 14 since we don't add invalidation
messages to subtransactions. We decided to backpatch through 11 for
consistency but not for 10 since its final release is near.

Reported-by: Osumi Takamichi
Author: Masahiko Sawada
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com
Backpatch: Backpatch-through: 11
---
 .../expected/catalog_change_snapshot.out      | 45 +++++++++++++++++++
 .../specs/catalog_change_snapshot.spec        |  7 +++
 src/backend/replication/logical/snapbuild.c   |  3 --
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/contrib/test_decoding/expected/catalog_change_snapshot.out b/contrib/test_decoding/expected/catalog_change_snapshot.out
index 1d75cf5af0..b33e49c0b1 100644
--- a/contrib/test_decoding/expected/catalog_change_snapshot.out
+++ b/contrib/test_decoding/expected/catalog_change_snapshot.out
@@ -87,3 +87,48 @@ COMMIT
 stop    
 (1 row)
 
+
+starting permutation: s0_init s0_begin s0_savepoint s0_insert s1_checkpoint s1_get_changes s0_truncate s0_commit s0_begin s0_insert s1_checkpoint s1_get_changes s0_commit s1_get_changes
+step s0_init: SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+?column?
+--------
+init    
+(1 row)
+
+step s0_begin: BEGIN;
+step s0_savepoint: SAVEPOINT sp1;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data
+----
+(0 rows)
+
+step s0_truncate: TRUNCATE tbl1;
+step s0_commit: COMMIT;
+step s0_begin: BEGIN;
+step s0_insert: INSERT INTO tbl1 VALUES (1);
+step s1_checkpoint: CHECKPOINT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+table public.tbl1: TRUNCATE: (no-flags)                      
+COMMIT                                                       
+(4 rows)
+
+step s0_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
+data                                                         
+-------------------------------------------------------------
+BEGIN                                                        
+table public.tbl1: INSERT: val1[integer]:1 val2[integer]:null
+COMMIT                                                       
+(3 rows)
+
+?column?
+--------
+stop    
+(1 row)
+
diff --git a/contrib/test_decoding/specs/catalog_change_snapshot.spec b/contrib/test_decoding/specs/catalog_change_snapshot.spec
index 2ad1edeaa8..770dbd642d 100644
--- a/contrib/test_decoding/specs/catalog_change_snapshot.spec
+++ b/contrib/test_decoding/specs/catalog_change_snapshot.spec
@@ -53,3 +53,10 @@ permutation "s0_init" "s0_begin" "s0_savepoint" "s0_truncate" "s1_checkpoint" "s
 # transaction to do timetravel since one of its subtransactions has been marked as
 # containing catalog changes.
 permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_insert2" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
+
+# The last decoding restarts from the first checkpoint and adds invalidation
+# messages generated by "s0_truncate" to the subtransaction. While
+# processing the commit record for the top-level transaction, we decide
+# to skip this xact but ensure that corresponding invalidation messages
+# get processed.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin" "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index dfe9d017f2..aa2e9f303e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -2114,9 +2114,6 @@ SnapBuildXidSetCatalogChanges(SnapBuild *builder, TransactionId xid, int subxcnt
 		int		i;
 
 		for (i = 0; i < subxcnt; i++)
-		{
-			ReorderBufferAssignChild(builder->reorder, xid, subxacts[i], lsn);
 			ReorderBufferXidSetCatalogChanges(builder->reorder, subxacts[i], lsn);
-		}
 	}
 }
-- 
2.31.1

#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#32)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Fri, Oct 21, 2022 at 8:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for the comment! I agreed with all comments and I've updated
patches accordingly.

Pushed after removing the test case from v11-13 branches as it is not
relevant to those branches and the test-1 in
catalog_change_snapshot.spec already tests the same case for those
branches.

--
With Regards,
Amit Kapila.

#34Alexander Lakhin
exclusion@gmail.com
In reply to: Amit Kapila (#33)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Hello,

21.10.2022 08:49, Amit Kapila wrote:

On Fri, Oct 21, 2022 at 8:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for the comment! I agreed with all comments and I've updated
patches accordingly.

Pushed after removing the test case from v11-13 branches as it is not
relevant to those branches and the test-1 in
catalog_change_snapshot.spec already tests the same case for those
branches.

I've managed to get that assertion failure again (on master) while playing
with the concurrent installcheck. This can be easily reproduced with the
following script:
numclients=5
for ((c=1;c<=numclients;c++)); do
  cp -r contrib/test_decoding contrib/test_decoding_$c
  sed "s/isolation_slot/isolation_slot_$c/" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Use
independent slots
  sed "$(printf '$p; %.0s' `seq 50`)" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Repeat the last
permutation 50 times
done
for ((c=1;c<=numclients;c++)); do
  EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1

"installcheck-$c.log" 2>&1 &

done
wait
grep 'TRAP:' server.log

Produces for me:
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794105
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794104
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794099
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794105
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794104
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794099

With the debug logging added inside AssertTXNLsnOrder() I see:
ctx->snapshot_builder->start_decoding_at: 209807224, ctx->reader->EndRecPtr: 210043072,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
and inside the loop:
cur_txn->first_lsn: 209792872
cur_txn->first_lsn: 209975744
cur_txn->first_lsn: 210043008
cur_txn->first_lsn: 210043008
and it triggers the Assert.

Best regards,
Alexander

#35Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Alexander Lakhin (#34)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 6/6/23 11:00, Alexander Lakhin wrote:

Hello,
...> With the debug logging added inside AssertTXNLsnOrder() I see:
ctx->snapshot_builder->start_decoding_at: 209807224,
ctx->reader->EndRecPtr: 210043072,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
and inside the loop:
cur_txn->first_lsn: 209792872
cur_txn->first_lsn: 209975744
cur_txn->first_lsn: 210043008
cur_txn->first_lsn: 210043008
and it triggers the Assert.

So what's the prev_first_lsn value for these first_lsn values? How does
it change over time? Did you try looking at the pg_waldump for these
positions?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#36Alexander Lakhin
exclusion@gmail.com
In reply to: Tomas Vondra (#35)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Hello Tomas,

06.06.2023 12:56, Tomas Vondra wrote:

On 6/6/23 11:00, Alexander Lakhin wrote:

Hello,
...> With the debug logging added inside AssertTXNLsnOrder() I see:
ctx->snapshot_builder->start_decoding_at: 209807224,
ctx->reader->EndRecPtr: 210043072,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
and inside the loop:
cur_txn->first_lsn: 209792872
cur_txn->first_lsn: 209975744
cur_txn->first_lsn: 210043008
cur_txn->first_lsn: 210043008
and it triggers the Assert.

So what's the prev_first_lsn value for these first_lsn values? How does
it change over time? Did you try looking at the pg_waldump for these
positions?

With more logging I've got (for another run):
ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80

ctx->snapshot_builder->start_decoding_at: c1f2cc0, ctx->reader->EndRecPtr: c1fbcc0,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...

waldump for 00000001000000000000000C shows:
grep c1fbc80:
rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn: 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel:
1663/18763/19987, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn: 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off:
1, flags: 0x08, blkref #0: rel 1663/18763/19987 blk 0

grep '( 3397| 3398)'
rmgr: Transaction len (rec/tot):     43/    43, tx:       3398, lsn: 0/0C1F2B20, prev 0/0C1F2688, desc: ASSIGNMENT xtop
3397: subxacts: 3398
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn: 0/0C1F2B50, prev 0/0C1F2B20, desc: INSERT+INIT off:
1, flags: 0x08, blkref #0: rel 1663/18763/19981 blk 0
rmgr: Standby     len (rec/tot):     62/    62, tx:          0, lsn: 0/0C1F2BD0, prev 0/0C1F2B90, desc: RUNNING_XACTS
nextXid 3400 latestCompletedXid 3396 oldestRunningXid 3397; 2 xacts: 3399 3397; 1 subxacts: 3398
rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn: 0/0C1F2C80, prev 0/0C1F2C50, desc: RUNNING_XACTS
nextXid 3400 latestCompletedXid 3399 oldestRunningXid 3397; 1 xacts: 3397; 1 subxacts: 3398
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/0C1F2CC0, prev 0/0C1F2C80, desc:
CHECKPOINT_ONLINE redo 0/C1F2C10; tli 1; prev tli 1; fpw true; xid 0:3400; oid 24576; multi 13; offset 29; oldest xid
722 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 3397; online
rmgr: Standby     len (rec/tot):     62/    62, tx:          0, lsn: 0/0C1FBAD0, prev 0/0C1FBAA0, desc: RUNNING_XACTS
nextXid 3401 latestCompletedXid 3399 oldestRunningXid 3397; 2 xacts: 3400 3397; 1 subxacts: 3398
rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn: 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel:
1663/18763/19987, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn: 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off:
1, flags: 0x08, blkref #0: rel 1663/18763/19987 blk 0
rmgr: Transaction len (rec/tot):     54/    54, tx:       3397, lsn: 0/0C1FBD00, prev 0/0C1FBCC0, desc: COMMIT
2023-06-06 13:55:26.955268 MSK; subxacts: 3398

Best regards,
Alexander

#37Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Alexander Lakhin (#36)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 6/6/23 14:00, Alexander Lakhin wrote:

Hello Tomas,

06.06.2023 12:56, Tomas Vondra wrote:

On 6/6/23 11:00, Alexander Lakhin wrote:

Hello,
...> With the debug logging added inside AssertTXNLsnOrder() I see:
ctx->snapshot_builder->start_decoding_at: 209807224,
ctx->reader->EndRecPtr: 210043072,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
and inside the loop:
cur_txn->first_lsn: 209792872
cur_txn->first_lsn: 209975744
cur_txn->first_lsn: 210043008
cur_txn->first_lsn: 210043008
and it triggers the Assert.

So what's the prev_first_lsn value for these first_lsn values? How does
it change over time? Did you try looking at the pg_waldump for these
positions?

With more logging I've got (for another run):
ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80

ctx->snapshot_builder->start_decoding_at: c1f2cc0,
ctx->reader->EndRecPtr: c1fbcc0,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...

waldump for 00000001000000000000000C shows:
grep c1fbc80:
rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn:
0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid:
0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn:
0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08,
blkref #0: rel 1663/18763/19987 blk 0

grep '( 3397| 3398)'

I've been able to reproduce this, after messing with the script a little
bit (I had to skip the test_decoding regression tests, because that was
complaining about slots already existing etc).

Anyway, AssertTXNLsnOrder sees these two transactions (before aborting):

26662 0/6462E6F0 (first 0/0)
26661 0/6462E6F0 (first 0/6462E6F0)

where 26661 is the top xact, 26662 is a subxact of 26661. This is
clearly a problem, because we really should not have subxact in this
list once the assignment gets applied.

And the relevant WAL looks like this:

---------------------------------------------------------------------
26662, lsn: 0/645EDAA0, prev 0/645EDA60, desc: ASSIGNMENT xtop 26661:
subxacts: 26662
26662, lsn: 0/645EDAD0, prev 0/645EDAA0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126835 blk 0
...
0, lsn: 0/6462E5D8, prev 0/6462E2A0, desc: RUNNING_XACTS nextXid
26673 latestCompletedXid 26672 oldestRunningXid 26661; 3 xacts: 26667
26661 26664; 3 subxacts: 26668 26662 26665
...
26662, lsn: 0/6462E6F0, prev 0/6462E678, desc: NEW_CID rel:
1663/125447/126841, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
26662, lsn: 0/6462E730, prev 0/6462E6F0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126841 blk 0
26661, lsn: 0/6462E770, prev 0/6462E730, desc: COMMIT 2023-06-06
16:41:24.442870 CEST; subxacts: 26662
---------------------------------------------------------------------

so the assignment is the *first* thing that happens for these xacts.

However, we skip the assignment, because the log for this call of
get_changes says this:

LOG: logical decoding found consistent point at 0/6462E5D8

so we fail to realize the 26662 is a subxact.

Then when processing the NEW_CID, SnapBuildProcessNewCid chimes in and
does this:

ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
xlrec->target_locator, xlrec->target_tid,
xlrec->cmin, xlrec->cmax,
xlrec->combocid);

and ReorderBufferAddNewTupleCids() proceeds to enter an entry for the
passed XID (which is xlrec->top_xid, 26661), but with LSN of the WAL
record. But ReorderBufferXidSetCatalogChanges() already did the same
thing for the subxact 26662, as it has no idea it's a subxact (due to
the skipped assignment).

I haven't figured out what exactly is happening / what it should be
doing instead. But it seems wrong to skip the assignment - I wonder if
SnapBuildProcessRunningXacts might be doing that too eagerly.
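
To make this concrete, here is a minimal standalone sketch (not actual
PostgreSQL code - the struct and the array are simplified stand-ins for
ReorderBufferTXN and the toplevel_by_lsn list, with the XIDs/LSNs taken
from the dump above) of how the missed assignment leaves two entries with
the same first_lsn and trips the ordering check:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* simplified stand-in for ReorderBufferTXN entries kept in toplevel_by_lsn */
typedef struct SketchTXN
{
    uint32_t    xid;
    uint64_t    first_lsn;      /* LSN of the first record seen for this xact */
} SketchTXN;

/* model of the ordering check done by AssertTXNLsnOrder() */
static void
check_lsn_order(const SketchTXN *txns, int ntxns)
{
    uint64_t    prev_first_lsn = 0;

    for (int i = 0; i < ntxns; i++)
    {
        /* top-level transactions must have strictly increasing first_lsn */
        assert(prev_first_lsn < txns[i].first_lsn);
        prev_first_lsn = txns[i].first_lsn;
    }
}

int
main(void)
{
    /*
     * The ASSIGNMENT (26662 -> 26661) is skipped as described above, so
     * 26662 is never marked as a subxact.  The NEW_CID record at
     * 0/6462E6F0 then creates entries for both the subxact (26662) and
     * the top-level xact (26661), and both get that record's LSN as
     * their first_lsn.
     */
    SketchTXN   toplevel_by_lsn[] = {
        {26662, 0x6462E6F0},    /* really a subxact, but the assignment was missed */
        {26661, 0x6462E6F0},    /* the actual top-level transaction */
    };

    check_lsn_order(toplevel_by_lsn, 2);    /* fails, mirroring the TRAP */
    printf("not reached\n");
    return 0;
}

With the assignment applied, 26662 would never be in that list at all, so
the check would hold - which is why the WAL itself looks fine and only the
point where decoding starts matters.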

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#38Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#37)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 6/6/23 17:42, Tomas Vondra wrote:

On 6/6/23 14:00, Alexander Lakhin wrote:

Hello Tomas,

06.06.2023 12:56, Tomas Vondra wrote:

On 6/6/23 11:00, Alexander Lakhin wrote:

Hello,
...> With the debug logging added inside AssertTXNLsnOrder() I see:
ctx->snapshot_builder->start_decoding_at: 209807224,
ctx->reader->EndRecPtr: 210043072,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
and inside the loop:
cur_txn->first_lsn: 209792872
cur_txn->first_lsn: 209975744
cur_txn->first_lsn: 210043008
cur_txn->first_lsn: 210043008
and it triggers the Assert.

So what's the prev_first_lsn value for these first_lsn values? How does
it change over time? Did you try looking at the pg_waldump for these
positions?

With more logging I've got (for another run):
ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80

ctx->snapshot_builder->start_decoding_at: c1f2cc0,
ctx->reader->EndRecPtr: c1fbcc0,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...

waldump for 00000001000000000000000C shows:
grep c1fbc80:
rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn:
0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid:
0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn:
0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08,
blkref #0: rel 1663/18763/19987 blk 0

grep '( 3397| 3398)'

I've been able to reproduce this, after messing with the script a little
bit (I had to skip the test_decoding regression tests, because that was
complaining about slots already existing etc).

Anyway, AssertTXNLsnOrder sees these two transactions (before aborting):

26662 0/6462E6F0 (first 0/0)
26661 0/6462E6F0 (first 0/6462E6F0)

where 26661 is the top xact, 26662 is a subxact of 26661. This is
clearly a problem, because we really should not have subxact in this
list once the assignment gets applied.

And the relevant WAL looks like this:

---------------------------------------------------------------------
26662, lsn: 0/645EDAA0, prev 0/645EDA60, desc: ASSIGNMENT xtop 26661:
subxacts: 26662
26662, lsn: 0/645EDAD0, prev 0/645EDAA0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126835 blk 0
...
0, lsn: 0/6462E5D8, prev 0/6462E2A0, desc: RUNNING_XACTS nextXid
26673 latestCompletedXid 26672 oldestRunningXid 26661; 3 xacts: 26667
26661 26664; 3 subxacts: 26668 26662 26665
...
26662, lsn: 0/6462E6F0, prev 0/6462E678, desc: NEW_CID rel:
1663/125447/126841, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
26662, lsn: 0/6462E730, prev 0/6462E6F0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126841 blk 0
26661, lsn: 0/6462E770, prev 0/6462E730, desc: COMMIT 2023-06-06
16:41:24.442870 CEST; subxacts: 26662
---------------------------------------------------------------------

so the assignment is the *first* thing that happens for these xacts.

However, we skip the assignment, because the log for this call of
get_changes says this:

LOG: logical decoding found consistent point at 0/6462E5D8

so we fail to realize the 26662 is a subxact.

Then when processing the NEW_CID, SnapBuildProcessNewCid chimes in and
does this:

ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
xlrec->target_locator, xlrec->target_tid,
xlrec->cmin, xlrec->cmax,
xlrec->combocid);

and ReorderBufferAddNewTupleCids() proceeds to enter an entry for the
passed XID (which is xlrec->top_xid, 26661), but with LSN of the WAL
record. But ReorderBufferXidSetCatalogChanges() already did the same
thing for the subxact 26662, as it has no idea it's a subxact (due to
the skipped assignment).

I haven't figured out what exactly is happening / what it should be
doing instead. But it seems wrong to skip the assignment - I wonder if
SnapBuildProcessRunningXacts might be doing that too eagerly.

I investigated this a bit more, and the problem actually seems to be
more like this:

1) we create a new logical replication slot

2) while building the initial snapshot, we start with current insert
location, and then process records

3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
SnapBuildFindSnapshot

4) SnapBuildFindSnapshot does this:

else if (!builder->building_full_snapshot &&
SnapBuildRestore(builder, lsn))
{
/* there won't be any state to cleanup */
return false;
}

5) because create_logical_replication_slot and get_changes both call
CreateInitDecodingContext with needs_full_snapshot=false, we end up
calling SnapBuildRestore()

6) once in a while this likely hits a snapshot created by a concurrent
session (for another logical slot) with SNAPBUILD_CONSISTENT state

I don't know what's the correct fix for this. Maybe we should set
needs_full_snapshot=true in create_logical_replication_slot when
creating the initial snapshot? Maybe we should use true even in
pg_logical_slot_get_changes_guts? This seems to fix the crashes ...

That'll prevent reading the serialized snapshots like this, but how
could that ever work? It seems pretty much guaranteed to ignore any
assignments that happened right before the snapshot?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#38)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Jun 7, 2023 at 6:18 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/6/23 17:42, Tomas Vondra wrote:

In investigated this a bit more, and the problem actually seems to be
more like this:

1) we create a new logical replication slot

2) while building the initial snapshot, we start with current insert
location, and then process records

3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
SnapBuildFindSnapshot

4) SnapBuildFindSnapshot does this:

else if (!builder->building_full_snapshot &&
SnapBuildRestore(builder, lsn))
{
/* there won't be any state to cleanup */
return false;
}

5) because create_logical_replication_slot and get_changes both call
CreateInitDecodingContext with needs_full_snapshot=false, we end up
calling SnapBuildRestore()

6) once in a while this likely hits a snapshot created by a concurrent
session (for another logical slot) with SNAPBUILD_CONSISTENT state

I think this analysis doesn't seem to match what you mentioned in the
previous email which is as follows:

However, we skip the assignment, because the log for this call of
get_changes says this:

LOG: logical decoding found consistent point at 0/6462E5D8

so we fail to realize the 26662 is a subxact.

This is because the above LOG is printed when
"running->oldestRunningXid == running->nextXid" not when we restore
the snapshot. Am I missing something?

I don't know what's the correct fix for this. Maybe we should set
needs_full_snapshot=true in create_logical_replication_slot when
creating the initial snapshot? Maybe we should use true even in
pg_logical_slot_get_changes_guts? This seems to fix the crashes ...

I don't think that is advisable because setting "needs_full_snapshot"
to true for decoding means the snapshot will start tracking
non-catalog committed xacts as well which is costly and is not
required for this case.

That'll prevent reading the serialized snapshots like this, but how
could that ever work? It seems pretty much guaranteed to ignore any
assignments that happened right before the snapshot?

This part needs some analysis/thoughts. BTW, do you mean that it skips
the assignment (a) because the assignment record is before we reach a
consistent point, or (b) because we start reading WAL after the
assignment, or (c) something else? If you intend to say (a) then can
you please point me to the code you are referring to for the same?

--
With Regards,
Amit Kapila.

#40Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#39)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 6/7/23 07:18, Amit Kapila wrote:

On Wed, Jun 7, 2023 at 6:18 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/6/23 17:42, Tomas Vondra wrote:

In investigated this a bit more, and the problem actually seems to be
more like this:

1) we create a new logical replication slot

2) while building the initial snapshot, we start with current insert
location, and then process records

3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
SnapBuildFindSnapshot

4) SnapBuildFindSnapshot does this:

else if (!builder->building_full_snapshot &&
SnapBuildRestore(builder, lsn))
{
/* there won't be any state to cleanup */
return false;
}

5) because create_logical_replication_slot and get_changes both call
CreateInitDecodingContext with needs_full_snapshot=false, we end up
calling SnapBuildRestore()

6) once in a while this likely hits a snapshot created by a concurrent
session (for another logical slot) with SNAPBUILD_CONSISTENT state

I think this analysis doesn't seem to match what you mentioned in the
previous email which is as follows:

However, we skip the assignment, because the log for this call of
get_changes says this:

LOG: logical decoding found consistent point at 0/6462E5D8

so we fail to realize the 26662 is a subxact.

This is because the above LOG is printed when
"running->oldestRunningXid == running->nextXid" not when we restore
the snapshot. Am, I missing something?

There are multiple places in snapbuild.c with the same message. Two in
SnapBuildFindSnapshot (one of them being the one you mentioned) and one
in SnapBuildRestore (which is the one actually triggered).

I don't know what's the correct fix for this. Maybe we should set
needs_full_snapshot=true in create_logical_replication_slot when
creating the initial snapshot? Maybe we should use true even in
pg_logical_slot_get_changes_guts? This seems to fix the crashes ...

I don't think that is advisable because setting "needs_full_snapshot"
to true for decoding means the snapshot will start tracking
non-catalog committed xacts as well which is costly and is not
required for this case.

True. TBH I managed to forget most of these details, so I meant it more
like a data point that it seems to fix the issue for me.

That'll prevent reading the serialized snapshots like this, but how
could that ever work? It seems pretty much guaranteed to ignore any
assignments that happened right before the snapshot?

This part needs some analysis/thoughts. BTW, do you mean that it skips
the assignment (a) because the assignment record is before we reach a
consistent point, or (b) because we start reading WAL after the
assignment, or (c) something else? If you intend to say (a) then can
you please point me to the code you are referring to for the same?

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID has both xact and subxact XID it fails because we add two TXNs
with the same LSN, not realizing one of them is subxact.

That's obviously wrong, although somewhat benign in production because
it only fails because of hitting an assert. Regular builds are likely to
just ignore it, although I haven't checked the COMMIT cleanup (I
wonder if we remove the subxact from the toplevel list on commit).

I think the problem is we just grab an existing snapshot, before all
running xacts complete. Maybe we should fix that, and leave the
needs_full_snapshot alone. Haven't tried that, though.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#40)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID has both xact and subxact XID it fails because we add two TXNs
with the same LSN, not realizing one of them is subxact.

That's obviously wrong, although somewhat benign in production because
it only fails because of hitting an assert.

Doesn't this indicate that we can end up decoding a partial
transaction when we restore a snapshot? Won't that be a problem even
for production?

Regular builds are likely to
just ignore it, although I haven't checked if the COMMIT cleanup (I
wonder if we remove the subxact from the toplevel list on commit).

I think the problem is we just grab an existing snapshot, before all
running xacts complete. Maybe we should fix that, and leave the
needs_full_snapshot alone.

It is not clear what exactly you have in mind to fix this because if
there is no running xact, we don't even need to restore the snapshot
because of a prior check "if (running->oldestRunningXid ==
running->nextXid)". I think the main problem is that we started
decoding immediately from the point where we restored a snapshot as at
that point we could have some partial running xacts.

--
With Regards,
Amit Kapila.

#42Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#41)
RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID has both xact and subxact XID it fails because we add two TXNs
with the same LSN, not realizing one of them is subxact.

That's obviously wrong, although somewhat benign in production because
it only fails because of hitting an assert.

Doesn't this indicate that we can end up decoding a partial transaction when
we restore a snapshot? Won't that be a problem even for production?

Yes, I think it can cause the problem that only partial changes of a transaction are streamed.
I tried to reproduce this and here are the steps. Note, to make sure the test
won't be affected by other running_xact WALs, I changed LOG_SNAPSHOT_INTERVAL_MS
to a bigger number.

session 1:
-----
create table test(a int);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Create another slot isolation_slot_2, it should choose a restart_point which is
- after the changes that happened in session 2. Note, to let the current slot
- restore another snapshot, we need to use gdb to block the current backend at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts WAL
- before reaching SnapBuildFindSnapshot.

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

session 1:
-----
- Since there is a running_xacts which session 3 logged, the current backend will
- serialize the snapshot when decoding the running_xacts WAL, and the snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----

session 2:
-----
- Insert some different data and commit the transaction.

INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
COMMIT
-----

session 3:
-----
- Release the process and try to stream the changes, since the restart point is
- at the middle of the transaction, it will stream partial changes of the
- transaction which was committed in session 2:

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----

Results (partial streamed changes):
postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
data
-----------------------------------------
BEGIN
table public.test: INSERT: a[integer]:2
table public.test: INSERT: a[integer]:3
table public.test: INSERT: a[integer]:4
COMMIT
(5 rows)

Regular builds are likely to
just ignore it, although I haven't checked if the COMMIT cleanup (I
wonder if we remove the subxact from the toplevel list on commit).

I think the problem is we just grab an existing snapshot, before all
running xacts complete. Maybe we should fix that, and leave the
needs_full_snapshot alone.

It is not clear what exactly you have in mind to fix this because if there is no
running xact, we don't even need to restore the snapshot because of a prior
check "if (running->oldestRunningXid ==
running->nextXid)". I think the main problem is that we started
decoding immediately from the point where we restored a snapshot as at that
point we could have some partial running xacts.

Best Regards,
Hou zj

#43Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Zhijie Hou (Fujitsu) (#42)
RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Tuesday, June 13, 2023 12:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID has both xact and subxact XID it fails because we add two
TXNs with the same LSN, not realizing one of them is subxact.

That's obviously wrong, although somewhat benign in production
because it only fails because of hitting an assert.

Doesn't this indicate that we can end up decoding a partial
transaction when we restore a snapshot? Won't that be a problem even for

production?

Yes, I think it can cause the problem that only partial changes of a transaction
are streamed.
I tried to reproduce this and here are the steps. Note, to make sure the test
won't be affected by other running_xact WALs, I changed
LOG_SNAPSHOT_INTERVAL_MS to a bigger number.

session 1:
-----
create table test(a int);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1',
'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Create another slot isolation_slot_2, it should choose a restart_point which is
- after the changes that happened in session 2. Note, to let the current slot
- restore another snapshot, we need to use gdb to block the current backend
at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts
WAL
- before reaching SnapBuildFindSnapshot.

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2',
'test_decoding');
-----

session 1:
-----
- Since there is a running_xacts which session 3 logged, the current backend
will
- serialize the snapshot when decoding the running_xacts WAL, and the
snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL,
'skip-empty-xacts', '1', 'include-xids', '0');
-----

session 2:
-----
- Insert some different data and commit the transaction.

INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
COMMIT
-----

session 3:
-----
- Release the process and try to stream the changes, since the restart point is
- at the middle of the transaction, it will stream partial changes of the
- transaction which was committed in session 2:

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL,
'skip-empty-xacts', '1', 'include-xids', '0');
-----

Results (partial streamed changes):
postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2',
NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
data
-----------------------------------------
BEGIN
table public.test: INSERT: a[integer]:2 table public.test: INSERT: a[integer]:3
table public.test: INSERT: a[integer]:4 COMMIT
(5 rows)

One idea to fix the partial change stream problem would be to record all
the running transactions' xids when restoring the snapshot in
SnapBuildFindSnapshot(), and then skip decoding changes for those recorded
transactions during the following decoding. Or we could do something
similar to 7f13ac8 (serialize the information of running xacts, if any).
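
A minimal standalone sketch of that first idea (this is not snapbuild.c
code; running_at_restore and skip_change are made-up names, and the
fixed-size array is just a stand-in for whatever bookkeeping a real patch
would use):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* xids that were still running when the serialized snapshot was restored */
static const TransactionId running_at_restore[] = {1001, 1002, 1005};
static const int nrunning_at_restore = 3;

/*
 * Changes belonging to such a transaction would be skipped: decoding that
 * starts at the restored snapshot can only have seen a suffix of it, so
 * streaming it would produce a partial transaction like the one shown in
 * the earlier reproduction.
 */
static bool
skip_change(TransactionId xid)
{
    for (int i = 0; i < nrunning_at_restore; i++)
    {
        if (running_at_restore[i] == xid)
            return true;
    }
    return false;
}

int
main(void)
{
    printf("xid 1002 skipped: %d\n", skip_change(1002)); /* 1: in progress at restore */
    printf("xid 1010 skipped: %d\n", skip_change(1010)); /* 0: started after the restore */
    return 0;
}

A real fix would of course have to carry this set along with the restored
snapshot (or derive it from serialized running-xacts information, as with
the 7f13ac8 approach).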

But one point I am not very sure about is that we might restore a snapshot
in SnapBuildSerializationPoint() as well, where we don't have the running
transactions information. SnapBuildSerializationPoint() is invoked for
XLOG_END_OF_RECOVERY and XLOG_CHECKPOINT_SHUTDOWN records, where it seems
no transaction should be running at the time the record is logged, but I am
not 100% sure the problem cannot happen in this case as well.

Thoughts ?

Best Regards,
Hou zj

#44Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Zhijie Hou (Fujitsu) (#43)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 6/14/23 05:15, Zhijie Hou (Fujitsu) wrote:

On Tuesday, June 13, 2023 12:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID has both xact and subxact XID it fails because we add two
TXNs with the same LSN, not realizing one of them is subxact.

That's obviously wrong, although somewhat benign in production
because it only fails because of hitting an assert.

Doesn't this indicate that we can end up decoding a partial
transaction when we restore a snapshot? Won't that be a problem even for

production?

Yes, I think it can cause the problem that only partial changes of a transaction
are streamed.
I tried to reproduce this and here are the steps. Note, to make sure the test
won't be affected by other running_xact WALs, I changed
LOG_SNAPSHOT_INTERVAL_MS to a bigger number.

session 1:
-----
create table test(a int);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1',
'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Create another slot isolation_slot_2, it should choose a restart_point which is
- after the changes that happened in session 2. Note, to let the current slot
- restore another snapshot, we need to use gdb to block the current backend
at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts
WAL
- before reaching SnapBuildFindSnapshot.

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2',
'test_decoding');
-----

session 1:
-----
- Since there is a running_xacts which session 3 logged, the current backend
will
- serialize the snapshot when decoding the running_xacts WAL, and the
snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL,
'skip-empty-xacts', '1', 'include-xids', '0');
-----

session 2:
-----
- Insert some different data and commit the transaction.

INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
COMMIT
-----

session 3:
-----
- Release the process and try to stream the changes, since the restart point is
- at the middle of the transaction, it will stream partial changes of the
- transaction which was committed in session 2:

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL,
'skip-empty-xacts', '1', 'include-xids', '0');
-----

Results (partial streamed changes):
postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2',
NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
data
-----------------------------------------
BEGIN
table public.test: INSERT: a[integer]:2 table public.test: INSERT: a[integer]:3
table public.test: INSERT: a[integer]:4 COMMIT
(5 rows)

One idea to fix the partial change stream problem would be that we record all
the running transaction's xid when restoring the snapshot in
SnapBuildFindSnapshot(), and in the following decoding, we skip decoding
changes for the recorded transaction. Or we can do similar to 7f13ac8(serialize the
information of running xacts if any)

We need to think about how to fix this in backbranches, and the idea
with serializing running transactions seems rather unbackpatchable (as
it changes on-disk state).

But one point I am not very sure is that we might retore snapshot in
SnapBuildSerializationPoint() as well where we don't have running transactions
information. Although SnapBuildSerializationPoint() is invoked for
XLOG_END_OF_RECOVERY and XLOG_CHECKPOINT_SHUTDOWN records which seems no
running transaction will be there when logging. But I am not 100% sure if the
problem can happen in this case as well.

So, is the problem that we grab an existing snapshot in SnapBuildRestore
when called from SnapBuildFindSnapshot? If so, would it be possible to
just skip this while building the initial snapshot?

I tried that (by commenting out the block in SnapBuildFindSnapshot), but
it causes some output changes in test_decoding regression tests. I
haven't investigated why exactly.

Also, can you try if we still stream the partial transaction with
create_logical_replication_slot building a full snapshot?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#45Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Tomas Vondra (#44)
RE: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On Wednesday, June 14, 2023 5:05 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 6/14/23 05:15, Zhijie Hou (Fujitsu) wrote:

On Tuesday, June 13, 2023 12:19 PM Zhijie Hou (Fujitsu)

<houzj.fnst@fujitsu.com> wrote:

On Tuesday, June 13, 2023 12:04 PM Amit Kapila
<amit.kapila16@gmail.com>
wrote:

On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID has both xact and subxact XID it fails because we add two
TXNs with the same LSN, not realizing one of them is subxact.

That's obviously wrong, although somewhat benign in production
because it only fails because of hitting an assert.

Doesn't this indicate that we can end up decoding a partial
transaction when we restore a snapshot? Won't that be a problem even
for production?

Yes, I think it can cause the problem that only partial changes of a
transaction are streamed.
I tried to reproduce this and here are the steps. Note, to make sure
the test won't be affected by other running_xact WALs, I changed
LOG_SNAPSHOT_INTERVAL_MS to a bigger number.

session 1:
-----
create table test(a int);
SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot_1',
'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Create another slot isolation_slot_2, it should choose a restart_point which is
- after the changes that happened in session 2. Note, to let the current slot
- restore another snapshot, we need to use gdb to block the current backend at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts WAL
- before reaching SnapBuildFindSnapshot.

SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot_2',
'test_decoding');
-----

session 1:
-----
- Since there is a running_xacts which session 3 logged, the current backend will
- serialize the snapshot when decoding the running_xacts WAL, and the snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1',
NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----

session 2:
-----
- Insert some different data and commit the transaction.

INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
COMMIT
-----

session 3:
-----
- Release the process and try to stream the changes. Since the restart point is
- in the middle of the transaction, it will stream partial changes of the
- transaction which was committed in session 2:

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2',
NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-----

Results (partial streamed changes):
postgres=# SELECT data FROM
pg_logical_slot_get_changes('isolation_slot_2',
NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
                  data
-----------------------------------------
 BEGIN
 table public.test: INSERT: a[integer]:2
 table public.test: INSERT: a[integer]:3
 table public.test: INSERT: a[integer]:4
 COMMIT
(5 rows)

One idea to fix the partial change stream problem would be to record all
the running transactions' xids when restoring the snapshot in
SnapBuildFindSnapshot(), and then skip decoding changes for the recorded
transactions in the subsequent decoding. Or we could do something similar
to 7f13ac8 (serialize the information of running xacts, if any).

We need to think about how to fix this in backbranches, and the idea with
serializing running transactions seems rather unbackpatchable (as it changes
on-disk state).

But one point I am not very sure about is that we might restore a snapshot
in SnapBuildSerializationPoint() as well, where we don't have running
transactions information. SnapBuildSerializationPoint() is invoked for
XLOG_END_OF_RECOVERY and XLOG_CHECKPOINT_SHUTDOWN records, where it seems
no transaction should still be running at the time the record is logged,
but I am not 100% sure whether the problem can happen in this case as well.

...

Also, can you try if we still stream the partial transaction with
create_logical_replication_slot building a full snapshot?

Yes, it can fix this problem, because forcing create_logical_replication_slot
to build a full snapshot avoids restoring the snapshot. But I am not sure if
this is the best fix, for the same reason (it's costly) mentioned by Amit[1].

[1]: /messages/by-id/CAA4eK1Kro0rej=ZXMhcdjs+aYsZvNywu3-cqdRUtyAp4zqpVWw@mail.gmail.com

Best Regards,
Hou zj

#46Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Zhijie Hou (Fujitsu) (#45)
Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

On 6/14/23 15:39, Zhijie Hou (Fujitsu) wrote:

On Wednesday, June 14, 2023 5:05 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

...

Also, can you try if we still stream the partial transaction with
create_logical_replication_slot building a full snapshot?

Yes, it can fix this problem, because forcing create_logical_replication_slot
to build a full snapshot avoids restoring the snapshot. But I am not sure if
this is the best fix, for the same reason (it's costly) mentioned by Amit[1].

Costly compared to the current behavior? Sure, but considering the
current behavior is incorrect/broken, that seems rather irrelevant.
Incorrect behavior can be infinitely faster.

I doubt it's significantly costlier than just setting the "full
snapshot" flag when building the initial snapshot - sure, it will take
more time than now, but that's kinda the whole point. It seems to me the
problem is exactly that it *doesn't* wait long enough.

I may be misunderstanding the solution you proposed, but this:

One idea to fix the partial change stream problem would be that we
record all the running transaction's xid when restoring the snapshot
in SnapBuildFindSnapshot(), and in the following decoding, we skip
decoding changes for the recorded transaction

sounds pretty close to what building a correct snapshot actually does.

But maybe I'm wrong - ultimately, the way to compare those approaches
seems to be to prototype this proposal, and do some tests.
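
To be concrete about what I mean by a prototype, the "skip" part could
look roughly like this - purely a hypothetical sketch with stand-in
types, not actual PostgreSQL code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t TransactionId;     /* stand-in for the real typedef */

    /*
     * Hypothetical: xids that were running when the serialized snapshot
     * was written, recorded while restoring it in SnapBuildFindSnapshot().
     */
    typedef struct RestoredRunningXids
    {
        uint32_t       nxids;
        TransactionId *xids;            /* sorted ascending */
    } RestoredRunningXids;

    static int
    xid_cmp(const void *a, const void *b)
    {
        TransactionId xa = *(const TransactionId *) a;
        TransactionId xb = *(const TransactionId *) b;

        return (xa > xb) - (xa < xb);
    }

    /*
     * Checked before queuing a change for "xid" - if true, the change is
     * skipped, because we'd only ever see a suffix of that transaction.
     */
    static bool
    skip_partially_decoded_xact(const RestoredRunningXids *r, TransactionId xid)
    {
        if (r->nxids == 0)
            return false;
        return bsearch(&xid, r->xids, r->nxids,
                       sizeof(TransactionId), xid_cmp) != NULL;
    }

The lookup is trivial; the interesting part is getting a reliable list of
running xids at that point, which is why this sounds so close to just
building a proper snapshot in the first place.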

There's also the question of back branches, and it seems way simpler to
just flip a flag and disable the broken optimization than to do fairly
complex stuff to save it.

I'd also point out that (a) this only affects the initial snapshot, not
every time we start the decoding context, and (b) the slots created from
walsender already do that (unless copy_data=false).

So if needs_full_snapshot=true fixes the issue, I'd just do that as the
first step - in master and back branches. And then if we want to salvage
the optimization, we can try fixing it in master.
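
Concretely, the first step would be little more than this at the
slot-creation call site (sketch only - I'm writing the surrounding
arguments from memory, so the exact call in create_logical_replication_slot()
may differ a bit between branches):

    /* in create_logical_replication_slot(), slotfuncs.c - sketch */
    ctx = CreateInitDecodingContext(plugin, NIL,
                                    true,   /* need_full_snapshot, was false */
                                    restart_lsn,
                                    XL_ROUTINE(.page_read = read_local_xlog_page,
                                               .segment_open = wal_segment_open,
                                               .segment_close = wal_segment_close),
                                    NULL, NULL, NULL);

i.e. just flipping the need_full_snapshot argument, so the initial
snapshot build has to reach a consistent point on its own instead of
short-circuiting via a serialized snapshot that may fall in the middle
of a running transaction.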

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company