Assertion failure with epoch when replaying standby records for 2PC

Started by Michael Paquierabout 2 years ago3 messageshackers
Jump to latest
#1Michael Paquier
michael@paquier.xyz

Hi all,

rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-01-17%2005%3A06%3A31

This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]

This refers to the following in twophase.c with
AdjustToFullTransactionId():
nextXid = XidFromFullTransactionId(nextFullXid);
epoch = EpochFromFullTransactionId(nextFullXid);

if (unlikely(xid > nextXid))
{
/* Wraparound occurred, must be from a prev epoch. */
Assert(epoch > 0);
epoch--;
}

This would mean that we've found a way to get a negative epoch, which
should not be possible.

Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds. Could you check please?
--
Michael

#2Alexander Korotkov
aekorotkov@gmail.com
In reply to: Michael Paquier (#1)
Re: Assertion failure with epoch when replaying standby records for 2PC

Hi, Michael!

On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote:

rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&amp;dt=2024-01-17%2005%3A06%3A31

This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]

This refers to the following in twophase.c with
AdjustToFullTransactionId():
nextXid = XidFromFullTransactionId(nextFullXid);
epoch = EpochFromFullTransactionId(nextFullXid);

if (unlikely(xid > nextXid))
{
/* Wraparound occurred, must be from a prev epoch. */
Assert(epoch > 0);
epoch--;
}

This would mean that we've found a way to get a negative epoch, which
should not be possible.

Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds. Could you check please?

Thank you for reporting! I'm going to look at this in the next couple of days.

------
Regards,
Alexander Korotkov

#3Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#2)
Re: Assertion failure with epoch when replaying standby records for 2PC

On Wed, Jan 17, 2024 at 11:08 PM Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier <michael@paquier.xyz> wrote:

rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&amp;dt=2024-01-17%2005%3A06%3A31

This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was at log time 2024-01-17 05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 00000001000000000000000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 00000001000000000000000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 00000001000000000000000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 00000001000000000000000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 00000001000000000000000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 00000001000000000000000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 00000001000000000000000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 00000001000000000000000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 00000001000000000000000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 00000001000000000000000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 00000001000000000000000C(_start+0x21)[0x55746c2b61d1]

This refers to the following in twophase.c with
AdjustToFullTransactionId():
nextXid = XidFromFullTransactionId(nextFullXid);
epoch = EpochFromFullTransactionId(nextFullXid);

if (unlikely(xid > nextXid))
{
/* Wraparound occurred, must be from a prev epoch. */
Assert(epoch > 0);
epoch--;
}

This would mean that we've found a way to get a negative epoch, which
should not be possible.

Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds. Could you check please?

Thank you for reporting! I'm going to look at this in the next couple of days.

Oh, that is a forgotten piece I've already discovered.
/messages/by-id/CAPpHfdv=VahovNqJHBqr0ejHvx=eDuGYySC48Wcvp+GDxYLCJg@mail.gmail.com
I'm going to do some additional checks and push.

------
Regards,
Alexander Korotkov