TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)
Hi,
I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.
The crash always happens in the "ddl" test. The backtrace always looks
like this:
(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]
I'm unable to get a better backtrace from the valgrind-produced core
using gdb, for some reason.
However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:
WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================
The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.
But the very first WAL record for 849 is
ASSIGNMENT xtop 848: subxacts: 849
so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.
Furthermore, the WAL is almost exactly the same in both cases. Attached
are two dumps from a failed and successful run (only the part related to
these two xids is included). There are very few differences - there is a
PRUNE in the failed case, and a LOCK / RUNNING_XACTS moved a bit.
Any ideas?
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I've been running some valgrind tests on rpi4/aarch64, and I get a
> crash in test_decoding ddl test in ~50% runs. [...]
> so it's strange 849 is in the toplevel_by_lsn list at all, because it
> clearly is a subxact of 848.
There is no guarantee that toplevel_by_lsn won't contain a subxact. As
per my understanding, the problem I reported in the email [1] is the
same, and we have seen it in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.
[1]: /messages/by-id/CAA4eK1LK1nxOTL32OP=ejhPoBsUP4Bvwb3Ly=fethyJ-KbaXyw@mail.gmail.com
--
With Regards,
Amit Kapila.
On 9/4/22 13:49, Amit Kapila wrote:
On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>> so it's strange 849 is in the toplevel_by_lsn list at all, because it
>> clearly is a subxact of 848.
>
> There is no guarantee that toplevel_by_lsn won't have subxact.
I don't think that's quite true - toplevel_by_lsn should not contain
any *known* subxacts. Yes, we may initially add a subxact to the list,
but as soon as we get the assignment record, it should be removed. See
what ReorderBufferAssignChild does.
And in this case the ASSIGNMENT is the first WAL record we get for 849
(in fact, isn't that guaranteed since 7259736a6e?), so we know from the
very beginning 849 is a subxact.
> As per my understanding, the problem I reported in the email [1] is
> the same and we have seen this in BF failures as well. I posted a way
> to reproduce it in that email. It seems this is possible if the
> decoding gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
> subtransaction) after XLOG_RUNNING_XACTS.
Interesting. That's certainly true for WAL in the crashing case:
rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9/4/22 14:24, Tomas Vondra wrote:
On 9/4/22 13:49, Amit Kapila wrote:
On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.The crash always happens in the "ddl" test. The backtrace always looks
like this:(ExceptionalCondition+0x98)[0x8f6f7c]
(+0x57a7ec)[0x6827ec]
(+0x579edc)[0x681edc]
(ReorderBufferAddNewTupleCids+0x60)[0x686758]
(SnapBuildProcessNewCid+0x94)[0x68b920]
(heap2_decode+0x17c)[0x671584]
(LogicalDecodingProcessRecord+0xbc)[0x670cd0]
(+0x570f88)[0x678f88]
(pg_logical_slot_get_changes+0x1c)[0x6790fc]
(ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
(+0x3be638)[0x4c6638]
(+0x3a2c14)[0x4aac14]
(ExecScan+0x8c)[0x4aaca8]
(+0x3bea14)[0x4c6a14]
(+0x39ea60)[0x4a6a60]
(+0x392378)[0x49a378]
(+0x39520c)[0x49d20c]
(standard_ExecutorRun+0x214)[0x49aad8]
(ExecutorRun+0x64)[0x49a8b8]
(+0x62f53c)[0x73753c]
(PortalRun+0x27c)[0x737198]
(+0x627e78)[0x72fe78]
(PostgresMain+0x9a0)[0x73512c]
(+0x547be8)[0x64fbe8]
(+0x547540)[0x64f540]
(+0x542d30)[0x64ad30]
(PostmasterMain+0x1460)[0x64a574]
(+0x418888)[0x520888]I'm unable to get a better backtrace from the valgrind-produces core
usign gdb, for some reason.However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:WARNING: ==============================================
WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
WARNING: ==============================================The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.But the very first WAL record for 849 is
ASSIGNMENT xtop 848: subxacts: 849
so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.There is no guarantee that toplevel_by_lsn won't have subxact.
I don't think that's quite true - toplevel_by_lsn should not contain any
*known* subxacts. Yes, we may initially add a subxact to the list, but
as soon as we get assignment record, it should be removed. See what
ReorderBufferAssignChild does.And in this case the ASSIGNMENT is the first WAL record we get for 849
(in fact, isn't that guaranteed since 7259736a6e?), so we know from the
very beginning 849 is a subxact.As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.Interesting. That's certainly true for WAL in the crashing case:
rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6
I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.
That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.
Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.
However, when processing the NEW_CID record:
tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6
we ultimately do this in SnapBuildProcessNewCid:
#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818
so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:
1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which carries the top-level xid 848
So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9/4/22 16:08, Tomas Vondra wrote:
...
> so in fact we *know* 849 is a subxact of 848, but we don't call
> ReorderBufferAssignChild in this case. [...]
> So IMHO we need to figure out what to do for WAL records that create
> both the toplevel and subxact - either we need to skip them, or
> rethink how we create the ReorderBufferTXN structs.
This fixes the crash for me, by adding a ReorderBufferAssignChild call
to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
ensure we don't try to create the top xact before updating the subxact
and removing it from the toplevel_by_lsn list.
Essentially, what's happening is this:
1) We read the NEW_CID record, which is logged with XID 849, i.e. the
subxact. But we don't know it's a subxact, so we create it as a
top-level xact with the LSN.
2) We start processing the contents of the NEW_CID record, which says
849 is a subxact of 848. That calls ReorderBufferAddNewTupleCids, which
promptly does ReorderBufferTXNByXid() with the top-level XID, creating
it with the same LSN and crashing because of the assert.
I'm not sure what's the right/proper way to fix this ...
The problem is ReorderBufferAssignChild was coded in a way that did not
expect the subxact to be created first (as a top-level xact). And
indeed, if I add Assert(false) to the (!new_sub) branch that converts
top-level xact to subxact, check-world still passes. So we never test
this case, but the NEW_CID breaks this assumption and creates them in
the opposite order (i.e. subxact first).
So the patch "fixes" this by
(a) tweaking ReorderBufferAssignChild to first remove the subxact from
the list of top-level transactions
(b) call ReorderBufferAssignChild when processing NEW_CID
However, I wonder whether we even have to process these records? If the
restart_lsn is half-way through the xact, can we even decode it at all?
Maybe we can just skip all of this, somehow? We'd still need to remember
849 is a subxact of 848, at least, so that we know to skip it too.
Thread [1] suggested relaxing the assert to allow the same LSN,
provided it's a xact and its subxact. That goes directly against the
expectation that the toplevel_by_lsn list contains no known subxacts,
and I don't think we should be relaxing that. After all, just tweaking
the LSN does not really fix the issue, because not remembering it's
xact+subxact is part of the issue. In principle, I think the issue is
exactly the opposite, i.e. that we don't realize 849 is a subxact, and
leave it in the list.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
crash-wip-fix.patch (text/x-patch; charset=UTF-8; +6 -1)
On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> That seems broken - if we skip the assignment like this, doesn't that
> break spill-to-disk and/or streaming? IIRC that's exactly why we had
> to start logging assignments immediately with wal_level=logical.
We had started logging assignments immediately in commit 0bead9af48,
for streaming transactions, in PG14. This issue exists prior to that: I
have tried and reproduced it in PG13, and I think it will be there even
before that. So I am not sure the spilling behavior is broken by this.
I think if we don't get the assignment record before processing changes
during decoding of the commit, we could miss sending the changes, which
won't be the case here. Do you see any other problem?
> Or maybe we're not dealing with the restart_lsn properly, and we
> should have ignored those records. Both xacts started long before the
> restart LSN, so we're not seeing the whole xact anyway.
Right, but is that problematic? The restart LSN will be used as a
point to start reading the WAL and that helps in building a consistent
snapshot. However, for decoding to send the commit, we use
start_decoding_at point which will ensure that we send complete
transactions.
> So IMHO we need to figure out what to do for WAL records that create
> both the toplevel and subxact - either we need to skip them, or
> rethink how we create the ReorderBufferTXN structs.
As per my understanding, we can't skip them as they are used to build
the snapshot.
--
With Regards,
Amit Kapila.
On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 9/4/22 16:08, Tomas Vondra wrote:
> The problem is ReorderBufferAssignChild was coded in a way that did
> not expect the subxact to be created first (as a top-level xact).
I think there was previously a hard-coded way to detect that, and we
removed it in commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
I think it is possible that a subtransaction gets logged without a
previous top-level txn record, as shown in that commit.
--
With Regards,
Amit Kapila.
On 9/5/22 06:32, Amit Kapila wrote:
On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>> That seems broken - if we skip the assignment like this, doesn't
>> that break spill-to-disk and/or streaming? [...]
>
> We had started logging assignments immediately in commit 0bead9af48
> for streaming transactions in PG14. This issue exists prior to that.
> [...] I think if we don't get assignment recording before processing
> changes during decoding commit then we could miss sending the changes
> which won't be the case here. Do you see any other problem?
I can't, but that's hardly proof of anything. You're right that
spilling to disk may not be broken by this, though. I forgot it
predates assignments being logged immediately, so it does not rely on
that.
> Right, but is that problematic? The restart LSN will be used as a
> point to start reading the WAL and that helps in building a
> consistent snapshot. However, for decoding to send the commit, we use
> start_decoding_at point which will ensure that we send complete
> transactions.
Which part would not be problematic? There's some sort of a bug, that's
for sure.
I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).
Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.
But maybe this is futile and there are other similar issues ...
> As per my understanding, we can't skip them as they are used to build
> the snapshot.
Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it? Of course, the NEW_CID is
logged with xid 849 and we don't know it's subxact of 848, but when
processing the NEW_CID content we can realize that (xl_heap_new_cid does
include top_xid).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Which part would not be problematic? There's some sort of a bug,
> that's for sure.
It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.
> I think it's mostly clear we won't output this transaction, because
> the restart LSN is half-way through. We can either ignore it at
> commit time, and then we have to make everything work in case we miss
> assignments (or any other part of the transaction).
Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.
> Don't we know 848 (the top-level xact) won't be decoded? In that case
> we won't need the snapshot, so why build it?
But this transaction id can be part of committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.
--
With Regards,
Amit Kapila.
On 9/5/22 08:35, Amit Kapila wrote:
On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 16:08, Tomas Vondra wrote:
...
so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

This fixes the crash for me, by adding a ReorderBufferAssignChild call
to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
ensure we don't try to create the top xact before updating the subxact
and removing it from the toplevel_by_lsn list.

Essentially, what's happening is this:
1) We read the NEW_CID record, which is logged with XID 849, i.e. the
subxact. But we don't know it's a subxact, so we create it as a
top-level xact with the LSN.

2) We start processing contents of the NEW_CID, which however has info
that 849 is subxact of 848, calls ReorderBufferAddNewTupleCids which
promptly does ReorderBufferTXNByXid() with the top-level XID, which
creates it with the same LSN, and crashes because of the assert.

I'm not sure what's the right/proper way to fix this ...
The problem is ReorderBufferAssignChild was coded in a way that did not
expect the subxact to be created first (as a top-level xact).

I think there was a previously hard-coded way to detect that and we
have removed it in commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
I think it is possible that a subtransaction gets logged without a
previous top-level txn record, as shown in the commit shared above.
Well, yes and no.
This wouldn't detect the issue, because the assert happens in the first
ReorderBufferTXNByXid(), so it still crashes (in an assert-enabled
build, at least).
Maybe removing the assumption was the wrong thing, and we should have
changed the code so that we don't violate it? That's kinda what my "fix"
does, in a way.
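To make the failing invariant concrete, here is a standalone model in
plain C (invented names, not the actual reorderbuffer.c code) of the
check AssertTXNLsnOrder() performs over toplevel_by_lsn, which the two
same-LSN transactions above violate:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Minimal stand-in for ReorderBufferTXN: only the field the assert reads. */
typedef struct ModelTXN
{
	uint32_t	xid;
	XLogRecPtr	first_lsn;
} ModelTXN;

/*
 * Model of the invariant checked by AssertTXNLsnOrder(): entries in
 * toplevel_by_lsn must be strictly increasing by first_lsn, mirroring
 * Assert(prev_first_lsn < cur_txn->first_lsn).
 */
static bool
lsn_order_ok(const ModelTXN *txns, int ntxns)
{
	for (int i = 1; i < ntxns; i++)
	{
		/* equal or decreasing first_lsn breaks the invariant */
		if (!(txns[i - 1].first_lsn < txns[i].first_lsn))
			return false;
	}
	return true;
}
```

With xids 848 and 849 both entered as top-level at LSN 30264752
(0/1CDCDB0), the check fails, which is exactly the TRAP reported.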
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9/5/22 12:12, Amit Kapila wrote:
On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/5/22 06:32, Amit Kapila wrote:
On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/4/22 14:24, Tomas Vondra wrote:
As per
my understanding, the problem I reported in the email [1] is the same
and we have seen this in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

Interesting. That's certainly true for WAL in the crashing case:
rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2 len (rec/tot): 60/ 60, tx: 849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6

I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

We had started logging assignments immediately in commit 0bead9af48
for streaming transactions in PG14. This issue exists prior to that. I
have tried and reproduced it in PG13 but I think it will be there even
before that. So, I am not sure if the spilling behavior is broken due
to this. I think if we don't get assignment recording before
processing changes during decoding commit then we could miss sending
the changes which won't be the case here. Do you see any other
problem?

I can't, but that's hardly a proof of anything. You're right spilling to
disk may not be broken by this, though. I forgot it precedes assignments
being logged immediately, so it does not rely on that.

Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.

Right, but is that problematic? The restart LSN will be used as a
point to start reading the WAL and that helps in building a consistent
snapshot. However, for decoding to send the commit, we use
start_decoding_at point which will ensure that we send complete
transactions.

Which part would not be problematic? There's some sort of a bug, that's
for sure.

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.
I'm not sure I agree with that. I'm not convinced the assert is at
fault, it might easily be that it hints there's a logic bug somewhere.
I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.
Isn't that self-contradictory? Either we form these assignments at
commit time, or we support streaming (in which case it clearly can't
happen at commit time). AFAICS that's exactly why we started logging
(and processing) assignments immediately, no?
Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.

But maybe this is futile and there are other similar issues ...
However, when processing the NEW_CID record:
tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:
#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

As per my understanding, we can't skip them as they are used to build
the snapshot.

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it?

But this transaction id can be part of committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.
Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.
However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 5, 2022 at 5:24 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 9/5/22 12:12, Amit Kapila wrote:
On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.

I'm not sure I agree with that. I'm not convinced the assert is at
fault, it might easily be that it hints there's a logic bug somewhere.
It is possible but let's try to prove it. I am also keen to know if
this hints at a logic bug somewhere.
I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.

Isn't that self-contradictory? Either we form these assignments at
commit time, or we support streaming (in which case it clearly can't
happen at commit time).
I was talking about non-streaming cases which also have this assert
problem as seen in this thread. I am intentionally keeping streaming
cases out of this discussion as it happens without those and by
including streaming in the discussion, we will add another angle to
this problem which may not be required.
AFAICS that's exactly why we started logging
(and processing) assignments immediately, no?

Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.

But maybe this is futile and there are other similar issues ...
However, when processing the NEW_CID record:
tx: 849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:
#1 0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2 0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first, so
that the crash happens right when we attempt to create the toplevel one,
and we never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848

So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.

As per my understanding, we can't skip them as they are used to build
the snapshot.

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it?

But this transaction id can be part of committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.

Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.

However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.
I think we will clean up such a subxact. Such a xact should be skipped
via DecodeTXNNeedSkip() and then it will call ReorderBufferForget()
for each of the subxacts and that will make sure that we clean up each
of the subxacts.
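That skip-and-forget path can be sketched as a standalone model
(invented names, not the actual DecodeTXNNeedSkip()/ReorderBufferForget()
code): when the commit is skipped, every subxact is forgotten along with
the top-level xact, so no TXN lingers in the buffer.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TXNS 8

/* Toy stand-in for the set of ReorderBufferTXN entries kept in memory. */
typedef struct ModelBuffer
{
	uint32_t	xids[MAX_TXNS];
	int			ntxns;
} ModelBuffer;

/* Drop one xid from the buffer, like ReorderBufferForget() for one xact. */
static void
model_forget(ModelBuffer *buf, uint32_t xid)
{
	for (int i = 0; i < buf->ntxns; i++)
	{
		if (buf->xids[i] == xid)
		{
			buf->xids[i] = buf->xids[--buf->ntxns];
			return;
		}
	}
}

/*
 * Model of skipping a commit: forget each subxact, then the top-level
 * xact, leaving nothing behind.
 */
static void
model_skip_commit(ModelBuffer *buf, uint32_t top_xid,
				  const uint32_t *subxids, int nsubxids)
{
	for (int i = 0; i < nsubxids; i++)
		model_forget(buf, subxids[i]);
	model_forget(buf, top_xid);
}
```

In the scenario from this thread, a buffer holding 848 and 849 ends up
empty after the skipped commit of 848 with subxact 849.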
--
With Regards,
Amit Kapila.
Hi,
On Tue, Sep 6, 2022 at 3:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Sep 5, 2022 at 5:24 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/5/22 12:12, Amit Kapila wrote:
On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.

I'm not sure I agree with that. I'm not convinced the assert is at
fault, it might easily be that it hints there's a logic bug somewhere.

It is possible but let's try to prove it. I am also keen to know if
this hints at a logic bug somewhere.

I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.
Agreed.
Summarizing this issue, the assertion check in AssertTXNLsnOrder()
fails as reported because the current logical decoding cannot properly
handle the case where the decoding restarts from NEW_CID. Since we
don't make the association between top-level transaction and its
subtransaction while decoding NEW_CID (ie, in
SnapBuildProcessNewCid()), two transactions are created in
ReorderBuffer as top-txn and have the same LSN. This failure happens
on all supported versions.
To fix the problem, one idea is that we make the association between
top-txn and sub-txn at that point by calling ReorderBufferAssignChild(),
as Tomas proposed. On the other hand, since we don't guarantee to make
the association between the top-level transaction and its
sub-transactions until we try to decode the actual contents of the
transaction, it makes sense to me that instead of trying to solve this
by making the association, we change the code that assumes it has been
made.
I've attached the patch for this idea. With the patch, we skip the
assertion checks in AssertTXNLsnOrder() until we reach the LSN at
which we start decoding the contents of transactions, i.e.
start_decoding_at in SnapBuild. The minor concern is that the
assertion check could then miss some faulty cases where two unrelated
top-transactions have the same LSN; with this patch, such a case would
pass. Therefore, for transactions that we skipped checking, we do the
check once we reach that LSN.
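The gating described above can be sketched as a standalone predicate
(invented function name, not the patch itself): the order check is
suppressed for records before start_decoding_at and performed once
decoding reaches exactly that LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Model of the deferred-check condition: AssertTXNLsnOrder() is run for
 * the record whose LSN equals start_decoding_at, and skipped for any
 * record before it (those transactions are checked at that point instead).
 */
static bool
run_deferred_lsn_order_check(XLogRecPtr record_lsn,
							 XLogRecPtr start_decoding_at)
{
	return record_lsn == start_decoding_at;
}
```

So a NEW_CID record read while catching up to start_decoding_at does not
trip the assert, and the ordering is still validated once decoding of
transaction contents actually begins.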
Please note that to pass the new regression tests, the fix proposed in
a related thread[1] is required. Particularly, we need:
@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
XLogRecPtr lsn, TransactionId xid,
else if (sub_needs_timetravel)
{
/* track toplevel txn as well, subxact alone isn't meaningful */
+ elog(DEBUG2, "forced transaction %u to do timetravel
due to one of its subtransaction",
+ xid);
+ needs_timetravel = true;
SnapBuildAddCommittedTxn(builder, xid);
}
else if (needs_timetravel)
A side benefit of this approach is that we can fix another assertion
failure too that happens on REL14 and REL15 and reported here[2]. In
the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
we make the association between sub-txns to top-txn in
SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
in AssertTXNLsnOrder(). However, since the invalidation messages are
not transported from sub-txn to top-txn during the assignment, another
assertion check in ReorderBufferForget() fails when forgetting the
subtransaction. If we apply this idea of skipping the assertion
checks, we no longer need to make such an association in
SnapBuildXidSetCatalogChanges() and resolve this issue as well.
Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it?

But this transaction id can be part of committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.

Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.

However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.

I think we will clean up such a subxact. Such a xact should be skipped
via DecodeTXNNeedSkip() and then it will call ReorderBufferForget()
for each of the subxacts and that will make sure that we clean up each
of subtxn's.
Right.
Regards,
[1]: /messages/by-id/TYAPR01MB58666BD6BE24853269624282F5419@TYAPR01MB5866.jpnprd01.prod.outlook.com
[2]: /messages/by-id/TYAPR01MB58660803BCAA7849C8584AA4F57E9@TYAPR01MB5866.jpnprd01.prod.outlook.com
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Summarizing this issue, the assertion check in AssertTXNLsnOrder()
fails as reported because the current logical decoding cannot properly
handle the case where the decoding restarts from NEW_CID. Since we
don't make the association between top-level transaction and its
subtransaction while decoding NEW_CID (ie, in
SnapBuildProcessNewCid()), two transactions are created in
ReorderBuffer as top-txn and have the same LSN. This failure happens
on all supported versions.

To fix the problem, one idea is that we make the association between
top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
as Tomas proposed. On the other hand, since we don't guarantee to make
the association between the top-level transaction and its
sub-transactions until we try to decode the actual contents of the
transaction, it makes sense to me that instead of trying to solve by
making association, we need to change the code which are assuming that
it is associated.

I've attached the patch for this idea. With the patch, we skip the
assertion checks in AssertTXNLsnOrder() until we reach the LSN at
which we start decoding the contents of transaction, ie.
start_decoding_at in SnapBuild. The minor concern is other way that
the assertion check could miss some faulty cases where two unrelated
top-transactions could have same LSN. With this patch, it will pass
for such a case. Therefore, for transactions that we skipped checking,
we do the check when we reach the LSN.
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
buf.origptr);
}
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
rmgr = GetRmgr(XLogRecGetRmid(record));
I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.
Now, say we have commit records for multiple transactions which are
after start_decoding_at but all their changes are before
start_decoding_at, then we won't check their ordering at commit time
but OTOH, we would have checked their ordering before restart. Isn't
that sufficient?
--
With Regards,
Amit Kapila.
On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Summarizing this issue, the assertion check in AssertTXNLsnOrder()
fails as reported because the current logical decoding cannot properly
handle the case where the decoding restarts from NEW_CID. Since we
don't make the association between top-level transaction and its
subtransaction while decoding NEW_CID (ie, in
SnapBuildProcessNewCid()), two transactions are created in
ReorderBuffer as top-txn and have the same LSN. This failure happens
on all supported versions.To fix the problem, one idea is that we make the association between
top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
as Tomas proposed. On the other hand, since we don't guarantee to make
the association between the top-level transaction and its
sub-transactions until we try to decode the actual contents of the
transaction, it makes sense to me that instead of trying to solve by
making association, we need to change the code which are assuming that
it is associated.I've attached the patch for this idea. With the patch, we skip the
assertion checks in AssertTXNLsnOrder() until we reach the LSN at
which we start decoding the contents of transaction, ie.
start_decoding_at in SnapBuild. The minor concern is other way that
the assertion check could miss some faulty cases where two unrelated
top-transactions could have same LSN. With this patch, it will pass
for such a case. Therefore, for transactions that we skipped checking,
we do the check when we reach the LSN.

--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
buf.origptr);
}
+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
rmgr = GetRmgr(XLogRecGetRmid(record));

I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.
Good point. I was considering the case where the client sets a
far-ahead LSN, but it's not worth considering that case in this
context. I've updated the patch accordingly.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please note that to pass the new regression tests, the fix proposed in
a related thread[1] is required. Particularly, we need:

@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
XLogRecPtr lsn, TransactionId xid,
else if (sub_needs_timetravel)
{
/* track toplevel txn as well, subxact alone isn't meaningful */
+ elog(DEBUG2, "forced transaction %u to do timetravel
due to one of its subtransaction",
+ xid);
+ needs_timetravel = true;
SnapBuildAddCommittedTxn(builder, xid);
}
else if (needs_timetravel)

A side benefit of this approach is that we can fix another assertion
failure too that happens on REL14 and REL15 and reported here[2]. In
the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
we make the association between sub-txns to top-txn in
SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
in AssertTXNLsnOrder(). However, since the invalidation messages are
not transported from sub-txn to top-txn during the assignment, another
assertion check in ReorderBufferForget() fails when forgetting the
subtransaction. If we apply this idea of skipping the assertion
checks, we no longer need to make such an association in
SnapBuildXidSetCatalogChanges() and resolve this issue as well.
IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.
AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem. Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.
Is my understanding correct?
--
With Regards,
Amit Kapila.
On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Please note that to pass the new regression tests, the fix proposed in
a related thread[1] is required. Particularly, we need:@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, else if (sub_needs_timetravel) { /* track toplevel txn as well, subxact alone isn't meaningful */ + elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransaction", + xid); + needs_timetravel = true; SnapBuildAddCommittedTxn(builder, xid); } else if (needs_timetravel)A side benefit of this approach is that we can fix another assertion
failure too that happens on REL14 and REL15 and reported here[2]. In
the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
we make the association between sub-txns to top-txn in
SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
in AssertTXNLsnOrder(). However, since the invalidation messages are
not transported from sub-txn to top-txn during the assignment, another
assertion check in ReorderBufferForget() fails when forgetting the
subtransaction. If we apply this idea of skipping the assertion
checks, we no longer need to make such an association in
SnapBuildXidSetCatalogChanges() and resolve this issue as well.

IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.
Yes.
AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem.
IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog change while not doing
that for its top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".
Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.
Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.
Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue. Regarding
back-patching Change-3 to branches prior 14, I think it may be okay
til v11, but I'd be hesitant for v10 as the final release comes in a
month.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san, Amit,
IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog change while not doing
that for its top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".
Incidentally, I agreed that Change-2 is needed for HEAD (and v16), not v15 and v14.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.

Yes.
AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem.

IIUC Change-2 is required in v16 and HEAD
Why are you referring v16 and HEAD separately?
but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog change while not doing
that for its top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".
Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.
Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.
Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue.
Fair enough, but then why even backpatch it to v15 and v14?
Regarding
back-patching Change-3 to branches prior to v14, I think it may be okay
till v11, but I'd be hesitant for v10 as the final release comes in a
month.
So to fix the issue in all branches, what we need to do is to
backpatch change-1: in all branches till v10, change-2: in HEAD, and
change-3: in v15 and v14. Additionally, we think it is okay to
backpatch change-3 till v11 as it is mainly done to avoid the problem
fixed by change-1 and it makes code consistent in back branches.
I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?
--
With Regards,
Amit Kapila.
On Tue, Oct 18, 2022 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.
Yes.
AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem.
IIUC Change-2 is required in v16 and HEAD
Why are you referring v16 and HEAD separately?
Sorry, my wrong, I was confused.
but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only subtransactions as containing catalog change while not doing
that for its top-level transaction. In v15 and v14, since we mark both
subtransactions and top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".
Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.
Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.
Right. I don't think it's a good idea to back-patch Change-2 in
branches prior to v14 as it's not a relevant issue.
Fair enough, but then why even backpatch it to v15 and v14?
Oops, it's a typo. I wanted to say Change-2 should be back-patched only to HEAD.
Regarding
back-patching Change-3 to branches prior to v14, I think it may be okay
till v11, but I'd be hesitant for v10 as the final release comes in a
month.
So to fix the issue in all branches, what we need to do is to
backpatch change-1: in all branches till v10, change-2: in HEAD, and
change-3: in v15 and v14. Additionally, we think it is okay to
backpatch change-3 till v11 as it is mainly done to avoid the problem
fixed by change-1 and it makes code consistent in back branches.
Right.
I think because the test case proposed needs all three changes, we can
push the change-1 without a test case and then as a second patch have
change-2 for HEAD and change-3 for back branches with the test case.
Do you have any other ideas to proceed here?
I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog changes
and don't make an association between the top-level and sub-transactions. The
assertion failure happens on all supported branches. If we need to fix
this (I believe so), Change-2 needs to be backpatched to all supported
branches.
There are three changes as Amit mentioned, and regarding the test
case, we have three test cases I've attached: truncate_testcase.patch,
analyze_testcase.patch, user_catalog_testcase.patch. The relationship
between the assertion failures and the test cases is quite complex. I could
not find any test case to cause only one assertion failure on all
branches. One idea to proceed is:
Patch-1 includes Change-1 and is applied to all branches.
Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.
Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).
Patch-1 doesn't include any test case, but the user_catalog test
case can test both Change-1 and Change-2 on all branches. In v15 and
v14, the analyze test case causes both the assertions at
"Assert(txn->ninvalidations == 0);" and "Assert(prev_first_lsn <
cur_txn->first_lsn);" whereas the truncate test case causes the
assertion only at "Assert(txn->ninvalidations == 0);". Since
patch-2 is applied on top of patch-1, there is no difference in
terms of testing Change-2.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com