Failed assertion during recovery of partial WAL file

Started by Martin Pihlakalmost 16 years ago3 messages
#1Martin Pihlak
martin.pihlak@gmail.com

Encountered the following FailedAssertion while testing database
recovery (actually this would be HS) with partial WAL file:

LOG: restored log file "000000010000000000000003" from archive
LOG: consistent recovery state reached at 0/3001EEC
LOG: record with zero length at 0/3001EEC
LOG: redo done at 0/3001D68
LOG: last completed transaction was at log time 2010-02-05 11:02:49.695544+02
LOG: database system is ready to accept read only connections
LOG: selected new timeline ID: 3
TRAP: FailedAssertion("!(readFile >= 0)", File: "xlog.c", Line: 5117)

and the assorted backtrace:
#0 0xb7f2e410 in __kernel_vsyscall ()
#1 0xb7dc9085 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7dcaa01 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x0834926e in ExceptionalCondition (conditionName=0x8389a82 "!(readFile >= 0)",
errorType=0x837d5b4 "FailedAssertion",
fileName=0x8390338 "xlog.c", lineNumber=5117) at assert.c:57
#4 0x080dde34 in exitArchiveRecovery (endTLI=1, endLogId=0, endLogSeg=18) at xlog.c:5117
#5 0x080e3eab in StartupXLOG () at xlog.c:6029
#6 0x080e6055 in StartupProcessMain () at xlog.c:8666
#7 0x08106a30 in AuxiliaryProcessMain (argc=2, argv=0xbfebdd64) at bootstrap.c:412
#8 0x08253f0c in StartChildProcess (type=StartupProcess) at postmaster.c:4340
#9 0x0825669e in PostmasterMain (argc=3, argv=0x8516e08) at postmaster.c:1078
#10 0x081f7919 in main (argc=3, argv=0x8516e08) at main.c:188

The crash happened on a HS slave which was fed a partial WAL file. The
partial was constructed by extracting data up to pg_current_xlog_location()
and zero padding it up to 16MB.

This only seems to be happening on HEAD - quick tests indicate that both
8.3 and 8.4 are not affected (or maybe I didn't try hard enough).

regards,
Martin

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Martin Pihlak (#1)
1 attachment(s)
Re: Failed assertion during recovery of partial WAL file

On Fri, Feb 5, 2010 at 10:46 PM, Martin Pihlak <martin.pihlak@gmail.com> wrote:

Encountered the following FailedAssertion while testing database
recovery (actually this would be HS) with partial WAL file:

LOG:  restored log file "000000010000000000000003" from archive
LOG:  consistent recovery state reached at 0/3001EEC
LOG:  record with zero length at 0/3001EEC
LOG:  redo done at 0/3001D68
LOG:  last completed transaction was at log time 2010-02-05 11:02:49.695544+02
LOG:  database system is ready to accept read only connections
LOG:  selected new timeline ID: 3
TRAP: FailedAssertion("!(readFile >= 0)", File: "xlog.c", Line: 5117)

and the assorted backtrace:
#0  0xb7f2e410 in __kernel_vsyscall ()
#1  0xb7dc9085 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7dcaa01 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0x0834926e in ExceptionalCondition (conditionName=0x8389a82 "!(readFile >= 0)",
errorType=0x837d5b4 "FailedAssertion",
   fileName=0x8390338 "xlog.c", lineNumber=5117) at assert.c:57
#4  0x080dde34 in exitArchiveRecovery (endTLI=1, endLogId=0, endLogSeg=18) at xlog.c:5117
#5  0x080e3eab in StartupXLOG () at xlog.c:6029
#6  0x080e6055 in StartupProcessMain () at xlog.c:8666
#7  0x08106a30 in AuxiliaryProcessMain (argc=2, argv=0xbfebdd64) at bootstrap.c:412
#8  0x08253f0c in StartChildProcess (type=StartupProcess) at postmaster.c:4340
#9  0x0825669e in PostmasterMain (argc=3, argv=0x8516e08) at postmaster.c:1078
#10 0x081f7919 in main (argc=3, argv=0x8516e08) at main.c:188

The crash happened on a HS slave which was fed a partial WAL file. The
partial was constructed by extracting data up to pg_current_xlog_location()
and zero padding it up to 16MB.

This only seems to be happening on HEAD - quick tests indicate that both
8.3 and 8.4 are not affected (or maybe I didn't try hard enough).

Thanks for the report!

This assertion failure derives from the recent refactoring of ReadRecord().
Which changed the startup process so as to re-fetch the last applied WAL
record at the end of recovery from the buffer instead of the WAL file if it's
stored in the buffer. In this case, the last applied WAL file remains closed.
So readFile (which ought to have been the file descriptor of that WAL file)
might be -1 in exitArchiveRecovery().

In the now, that assertion is obsolete. So I attached the patch that removes
the assert() from exitArchiveRecovery().

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

remove_obsolete_assertion_0206.patchtext/x-diff; charset=US-ASCII; name=remove_obsolete_assertion_0206.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5110,5125 **** exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
  	UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
  
  	/*
! 	 * We should have the ending log segment currently open.  Verify, and then
! 	 * close it (to avoid problems on Windows with trying to rename or delete
! 	 * an open file).
  	 */
! 	Assert(readFile >= 0);
! 	Assert(readId == endLogId);
! 	Assert(readSeg == endLogSeg);
! 
! 	close(readFile);
! 	readFile = -1;
  
  	/*
  	 * If the segment was fetched from archival storage, we want to replace
--- 5110,5123 ----
  	UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
  
  	/*
! 	 * If the ending log segment is currently open, close it (to avoid
! 	 * problems on Windows with trying to rename or delete an open file).
  	 */
! 	if (readFile >= 0)
! 	{
! 		close(readFile);
! 		readFile = -1;
! 	}
  
  	/*
  	 * If the segment was fetched from archival storage, we want to replace
***************
*** 5974,5979 **** StartupXLOG(void)
--- 5972,5980 ----
  	EndOfLog = EndRecPtr;
  	XLByteToPrevSeg(EndOfLog, endLogId, endLogSeg);
  
+ 	Assert(readId == endLogId);
+ 	Assert(readSeg == endLogSeg);
+ 
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
  	 * dump consistent.  Note: it is indeed okay to look at the local variable
#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#2)
Re: Failed assertion during recovery of partial WAL file

Fujii Masao wrote:

This assertion failure derives from the recent refactoring of ReadRecord().
Which changed the startup process so as to re-fetch the last applied WAL
record at the end of recovery from the buffer instead of the WAL file if it's
stored in the buffer. In this case, the last applied WAL file remains closed.
So readFile (which ought to have been the file descriptor of that WAL file)
might be -1 in exitArchiveRecovery().

In the now, that assertion is obsolete. So I attached the patch that removes
the assert() from exitArchiveRecovery().

Committed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com