Assertion failure with epoch when replaying standby records for 2PC
Hi all,
rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31
This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]
This refers to the following in twophase.c with
AdjustToFullTransactionId():
nextXid = XidFromFullTransactionId(nextFullXid);
epoch = EpochFromFullTransactionId(nextFullXid);
if (unlikely(xid > nextXid))
{
/* Wraparound occurred, must be from a prev epoch. */
Assert(epoch > 0);
epoch--;
}
This would mean that we've found a way to get a negative epoch, which
should not be possible.
Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds. Could you check please?
--
Michael
Hi, Michael!
On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote:
rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]This refers to the following in twophase.c with
AdjustToFullTransactionId():
nextXid = XidFromFullTransactionId(nextFullXid);
epoch = EpochFromFullTransactionId(nextFullXid);if (unlikely(xid > nextXid))
{
/* Wraparound occurred, must be from a prev epoch. */
Assert(epoch > 0);
epoch--;
}This would mean that we've found a way to get a negative epoch, which
should not be possible.Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds. Could you check please?
Thank you for reporting! I'm going to look at this in the next couple of days.
------
Regards,
Alexander Korotkov
On Wed, Jan 17, 2024 at 11:08 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote:
rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]This refers to the following in twophase.c with
AdjustToFullTransactionId():
nextXid = XidFromFullTransactionId(nextFullXid);
epoch = EpochFromFullTransactionId(nextFullXid);if (unlikely(xid > nextXid))
{
/* Wraparound occurred, must be from a prev epoch. */
Assert(epoch > 0);
epoch--;
}This would mean that we've found a way to get a negative epoch, which
should not be possible.Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds. Could you check please?Thank you for reporting! I'm going to look at this in the next couple of days.
Oh, that is a forgotten piece I've already discovered.
/messages/by-id/CAPpHfdv=VahovNqJHBqr0ejHvx=eDuGYySC48Wcvp+GDxYLCJg@mail.gmail.com
I'm going to do some additional checks and push.
------
Regards,
Alexander Korotkov