Assertion failure when promoting node by deleting recovery.conf and restart node

Started by Michael Paquieralmost 13 years ago4 messages
#1Michael Paquier
michael.paquier@gmail.com
1 attachment(s)

Hi,

When trying to *promote* a slave as master by removing recovery.conf and
restarting node, I found an assertion failure on master branch:
LOG: database system was shut down in recovery at 2013-03-15 10:22:27 JST
TRAP: FailedAssertion("!(ControlFile->minRecoveryPointTLI != 1)", File:
"xlog.c", Line: 4954)
(gdb) bt
#0 0x00007f95af03b2c5 in raise () from /usr/lib/libc.so.6
#1 0x00007f95af03c748 in abort () from /usr/lib/libc.so.6
#2 0x000000000086ce71 in ExceptionalCondition (conditionName=0x8f2af0
"!(ControlFile->minRecoveryPointTLI != 1)", errorType=0x8f0813
"FailedAssertion", fileName=0x8f076b "xlog.c",
lineNumber=4954) at assert.c:54
#3 0x00000000004fe499 in StartupXLOG () at xlog.c:4954
#4 0x00000000006f9d34 in StartupProcessMain () at startup.c:224
#5 0x000000000050ef92 in AuxiliaryProcessMain (argc=2,
argv=0x7fffa6fc3d20) at bootstrap.c:423
#6 0x00000000006f8816 in StartChildProcess (type=StartupProcess) at
postmaster.c:4956
#7 0x00000000006f39e9 in PostmasterMain (argc=6, argv=0x1c950a0) at
postmaster.c:1237
#8 0x000000000065d59b in main (argc=6, argv=0x1c950a0) at main.c:197
Ok, this is not the cleanest way to promote a node as it doesn't do any
safety checks relation at promotion but 9.2 and previous versions allowed
to do that properly.

The assertion has been introduced by commit 3f0ab05 in order to record
properly minRecoveryPointTLI in control file at the end of recovery in the
case of a crash.
However, in the case of a slave node properly shutdown in recovery which is
then restarted as a master, the code path of this assertion is taken.
What do you think of the patch attached? It avoids the update of
recoveryTargetTLI and recoveryTargetIsLatest if the node has been shutdown
while in recovery.
Another possibility could be to add in the assertion some conditions based
on the state of controlFile but I think it is more consistent simply not to
update those fields.

Regards,
--
Michael

Attachments:

20130315_crash_tli.patchapplication/octet-stream; name=20130315_crash_tli.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..eaee05c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4939,7 +4939,8 @@ StartupXLOG(void)
 			ereport(LOG,
 					(errmsg("starting archive recovery")));
 	}
-	else if (ControlFile->minRecoveryPointTLI > 0)
+	else if (ControlFile->minRecoveryPointTLI > 0 &&
+			 ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
 	{
 		/*
 		 * If the minRecoveryPointTLI is set when not in Archive Recovery
@@ -6833,7 +6834,7 @@ CreateCheckPoint(int flags)
 		XLogRecPtr	curInsert;
 
 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
-		if (curInsert == ControlFile->checkPoint + 
+		if (curInsert == ControlFile->checkPoint +
 			MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
 			ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
 		{
#2Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#1)
Re: Assertion failure when promoting node by deleting recovery.conf and restart node

On 15.03.2013 04:25, Michael Paquier wrote:

Hi,

When trying to *promote* a slave as master by removing recovery.conf and
restarting node, I found an assertion failure on master branch:
LOG: database system was shut down in recovery at 2013-03-15 10:22:27 JST
TRAP: FailedAssertion("!(ControlFile->minRecoveryPointTLI != 1)", File:
"xlog.c", Line: 4954)
(gdb) bt
#0 0x00007f95af03b2c5 in raise () from /usr/lib/libc.so.6
#1 0x00007f95af03c748 in abort () from /usr/lib/libc.so.6
#2 0x000000000086ce71 in ExceptionalCondition (conditionName=0x8f2af0
"!(ControlFile->minRecoveryPointTLI != 1)", errorType=0x8f0813
"FailedAssertion", fileName=0x8f076b "xlog.c",
lineNumber=4954) at assert.c:54
#3 0x00000000004fe499 in StartupXLOG () at xlog.c:4954
#4 0x00000000006f9d34 in StartupProcessMain () at startup.c:224
#5 0x000000000050ef92 in AuxiliaryProcessMain (argc=2,
argv=0x7fffa6fc3d20) at bootstrap.c:423
#6 0x00000000006f8816 in StartChildProcess (type=StartupProcess) at
postmaster.c:4956
#7 0x00000000006f39e9 in PostmasterMain (argc=6, argv=0x1c950a0) at
postmaster.c:1237
#8 0x000000000065d59b in main (argc=6, argv=0x1c950a0) at main.c:197
Ok, this is not the cleanest way to promote a node as it doesn't do any
safety checks relation at promotion but 9.2 and previous versions allowed
to do that properly.

The assertion has been introduced by commit 3f0ab05 in order to record
properly minRecoveryPointTLI in control file at the end of recovery in the
case of a crash.
However, in the case of a slave node properly shutdown in recovery which is
then restarted as a master, the code path of this assertion is taken.
What do you think of the patch attached? It avoids the update of
recoveryTargetTLI and recoveryTargetIsLatest if the node has been shutdown
while in recovery.
Another possibility could be to add in the assertion some conditions based
on the state of controlFile but I think it is more consistent simply not to
update those fields.

Simon, can you comment on this? ISTM we could just remove the assertion
and update the comment to mention that this can happen. If there is a
min recovery point, surely we always need to recover to the timeline
containing that point, so setting recoveryTargetTLI to
minRecoveryPointTLI seems sensible.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#2)
Re: Assertion failure when promoting node by deleting recovery.conf and restart node

On 25 March 2013 19:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Simon, can you comment on this?

Yes, will do.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#2)
Re: Assertion failure when promoting node by deleting recovery.conf and restart node

On 25 March 2013 19:14, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 15.03.2013 04:25, Michael Paquier wrote:

Hi,

When trying to *promote* a slave as master by removing recovery.conf and
restarting node, I found an assertion failure on master branch:
LOG: database system was shut down in recovery at 2013-03-15 10:22:27 JST
TRAP: FailedAssertion("!(ControlFile->minRecoveryPointTLI != 1)", File:
"xlog.c", Line: 4954)
(gdb) bt
#0 0x00007f95af03b2c5 in raise () from /usr/lib/libc.so.6
#1 0x00007f95af03c748 in abort () from /usr/lib/libc.so.6
#2 0x000000000086ce71 in ExceptionalCondition (conditionName=0x8f2af0
"!(ControlFile->minRecoveryPointTLI != 1)", errorType=0x8f0813
"FailedAssertion", fileName=0x8f076b "xlog.c",
lineNumber=4954) at assert.c:54
#3 0x00000000004fe499 in StartupXLOG () at xlog.c:4954
#4 0x00000000006f9d34 in StartupProcessMain () at startup.c:224
#5 0x000000000050ef92 in AuxiliaryProcessMain (argc=2,
argv=0x7fffa6fc3d20) at bootstrap.c:423
#6 0x00000000006f8816 in StartChildProcess (type=StartupProcess) at
postmaster.c:4956
#7 0x00000000006f39e9 in PostmasterMain (argc=6, argv=0x1c950a0) at
postmaster.c:1237
#8 0x000000000065d59b in main (argc=6, argv=0x1c950a0) at main.c:197
Ok, this is not the cleanest way to promote a node as it doesn't do any
safety checks relation at promotion but 9.2 and previous versions allowed
to do that properly.

The assertion has been introduced by commit 3f0ab05 in order to record
properly minRecoveryPointTLI in control file at the end of recovery in the
case of a crash.
However, in the case of a slave node properly shutdown in recovery which
is
then restarted as a master, the code path of this assertion is taken.
What do you think of the patch attached? It avoids the update of
recoveryTargetTLI and recoveryTargetIsLatest if the node has been shutdown
while in recovery.
Another possibility could be to add in the assertion some conditions based
on the state of controlFile but I think it is more consistent simply not
to
update those fields.

Simon, can you comment on this? ISTM we could just remove the assertion and
update the comment to mention that this can happen. If there is a min
recovery point, surely we always need to recover to the timeline containing
that point, so setting recoveryTargetTLI to minRecoveryPointTLI seems
sensible.

Fixed using the latest TLI available and removing the assertion.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers