[BUG?] lag of minRecoveryPoint in archive recovery
Hello, I have a problem with PostgreSQL 9.2 with Pacemaker.
The HA standby sometimes fails to start under normal operation.
Testing with a bare replication pair showed that the standby fails
startup recovery under the operation sequence shown below. 9.3dev is
also affected, but 9.1 does not show this problem. The problem became
apparent because of the invalid-page check in xlog, but 9.1 potentially
has the same glitch.

After investigation, the lag of minRecoveryPoint behind EndRecPtr in
the redo loop seems to be the cause. The lag causes repeated redo of
unrepeatable xlog sequences such as XLOG_HEAP2_VISIBLE ->
SMGR_TRUNCATE on the same page. So I applied the same remedy as in
xact_redo_commit_internal to smgr_redo. While doing this, I noticed
that CheckRecoveryConsistency() in the redo apply loop should come
after redoing the record, so I moved it.

The attached patch (for 9.3dev) fixes this failure, also works for
9.2, and seems harmless and possibly even beneficial for 9.1.
What do you think about this?
The details follow.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Operation sequence to cause the problem.
=======
Mst$export PGDATA=$HOME/pgdatam
Mst$rm -rf $PGDATA
Mst$initdb
Mst$vi $PGDATA/postgresql.conf
| wal_level = hot_standby
| checkpoint_segments = 300
| checkpoint_timeout = 1h
| archive_mode = on
| archive_command = 'cp %p <ARCHIVE_DIR>/%f'
| synchronous_standby_names = '*'
| hot_standby = on
| log_checkpoints = on
Mst$vi $PGDATA/pg_hba.conf
| local replication horiguti trust
Mst$pg_ctl start
Stb$export PGDATA=$HOME/pgdatas
Stb$export PGPORT=5433
Stb$rm -rf $PGDATA
Stb$pg_basebackup -X s -F p -D $PGDATA -h /tmp -p 5432
Stb$vi $PGDATA/postgresql.conf
| checkpoint_segments = 1
| checkpoint_timeout = 30s
Stb$vi $PGDATA/recovery.conf
| standby_mode=yes
| primary_conninfo='host=/tmp port=5432'
| restore_command='cp <ARCHIVE_DIR>/%f %p'
Stb$pg_ctl start
Mst$createdb
Mst$psql
Mst_psql=#create table t (a int);
Mst_psql=#insert into t values (1);
Mst_psql=#checkpoint;
Mst_psql=#vacuum;
!!! Wait for the standby's restartpoint to complete !!!
Mst_psql=#delete from t;
Mst_psql=#vacuum;
Stb$pg_ctl stop -m i
Stb$vi $PGDATA/recovery.conf
| restore_command='cp <ARCHIVE_DIR>/%f %p'
Stb$pg_ctl start
LOG: database system was interrupted while in recovery at log time 2012-12-05 09:55:18 JST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: redo starts at 0/A020890
WARNING: page 0 of relation base/16384/16385 does not exist
PANIC: WAL contains references to invalid pages
LOG: startup process (PID 18422) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
======
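A minimal standalone C sketch of the failure mode described above (illustrative only; every name below is invented rather than taken from PostgreSQL): replaying "mark page 0 all-visible" followed by "truncate the relation" succeeds once, but when recovery restarts from a stale minRecoveryPoint and replays the pair again, the visibility record references a page that the first pass already truncated away.

#include <stdio.h>
#include <stdbool.h>

static int rel_nblocks = 1;          /* the relation starts with one page */

/* replay "mark page 0 all-visible"; fails if the page no longer exists */
static bool redo_visible(void)
{
    if (rel_nblocks < 1)
    {
        printf("PANIC: WAL contains references to invalid pages\n");
        return false;
    }
    printf("page 0 marked all-visible\n");
    return true;
}

/* replay "truncate relation to 0 blocks"; this is not repeatable */
static void redo_truncate(void)
{
    rel_nblocks = 0;
    printf("relation truncated to 0 blocks\n");
}

int main(void)
{
    /* first replay pass: both records apply cleanly */
    redo_visible();
    redo_truncate();

    /*
     * Crash and restart.  Because minRecoveryPoint lagged behind what was
     * already applied, replay starts again before the VISIBLE record,
     * although the truncation has already changed the data files.
     */
    printf("-- restart replay from a stale minRecoveryPoint --\n");
    if (!redo_visible())
        return 1;
    redo_truncate();
    return 0;
}

The second pass prints the same "invalid pages" symptom that appears in the standby log above.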
On Thursday, December 06, 2012 9:35 AM Kyotaro HORIGUCHI wrote:
Hello, I have a problem with PostgreSQL 9.2 with Pacemaker.

The HA standby sometimes fails to start under normal operation.
Testing with a bare replication pair showed that the standby fails
startup recovery under the operation sequence shown below. 9.3dev is
also affected, but 9.1 does not show this problem. The problem became
apparent because of the invalid-page check in xlog, but 9.1 potentially
has the same glitch.

After investigation, the lag of minRecoveryPoint behind EndRecPtr in
the redo loop seems to be the cause. The lag causes repeated redo of
unrepeatable xlog sequences such as XLOG_HEAP2_VISIBLE ->
SMGR_TRUNCATE on the same page. So I applied the same remedy as in
xact_redo_commit_internal to smgr_redo. While doing this, I noticed
that CheckRecoveryConsistency() in the redo apply loop should come
after redoing the record, so I moved it.

I think moving CheckRecoveryConsistency() after the redo apply loop
might cause a problem. Currently it is done before the
recoveryStopsHere() function, which can allow connections in hot
standby. But if recovery pauses or stops in that function for some
reason, connections might not be allowed because
CheckRecoveryConsistency() has not been called.
With Regards,
Amit Kapila.
On Thu, Dec 6, 2012 at 8:39 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
I think moving CheckRecoveryConsistency() after the redo apply loop
might cause a problem. Currently it is done before the
recoveryStopsHere() function, which can allow connections in hot
standby. But if recovery pauses or stops in that function for some
reason, connections might not be allowed because
CheckRecoveryConsistency() has not been called.
Yes, so we should just add the CheckRecoveryConsistency() call after
rm_redo rather than moving it? This issue is related to the old discussion:
http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php
Regards,
--
Fujii Masao
On Thu, Dec 6, 2012 at 1:04 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 993bc49..d34ab65 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -519,6 +519,12 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
         visibilitymap_truncate(rel, xlrec->blkno);

         FreeFakeRelcacheEntry(rel);
+
+        /*
+         * Xlogs before this record is unrepeatable, so winding
+         * minRecoveryPoint to here.
+         */
+        XLogFlush(lsn);
     }
     else
         elog(PANIC, "smgr_redo: unknown op code %u", info);
I think that minRecoveryPoint should be updated before the data-file
changes, so XLogFlush() should be called before smgrtruncate(). No?
Regards,
--
Fujii Masao
Thank you.
I think moving CheckRecoveryConsistency() after the redo apply loop
might cause a problem. Currently it is done before the
recoveryStopsHere() function, which can allow connections in hot
standby. But if recovery pauses or stops in that function for some
reason, connections might not be allowed because
CheckRecoveryConsistency() has not been called.
It depends on the precise meaning of minRecoveryPoint. I've not found
an explicit explanation of it.

Currently minRecoveryPoint is updated only in XLogFlush; other updates
of it seem to reset it to InvalidXLogRecPtr. XLogFlush seems to be
called AFTER the core redo operation has been done, at least in
xact_redo_commit_internal. This suggests to me that the meaning of
minRecoveryPoint is: "just AFTER applying the XLog record at the
current LSN, the database files can be assumed to be consistent."
At Mon, 10 Dec 2012 00:36:31 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwG4W5QZ7+LJimg8xxuevwz0bYniHmZLZmWf0j6kBiuRCg@mail.gmail.com>
Yes, so we should just add the CheckRecoveryConsistency() call after
rm_redo rather than moving it? This issue is related to the old discussion:
http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php
Since I've not clearly understood the problem with omitting it before
redo, and it also seems to have no performance impact, I have no
objection to placing it both before and after the redo.
--
Kyotaro Horiguchi
NTT Open Source Software Center
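As a small illustration of this reading of minRecoveryPoint (a standalone C sketch with invented names, not PostgreSQL code): the value only ever moves forward, and recovery counts as consistent only once replay has progressed at least that far.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogPtr;

static XLogPtr min_recovery_point = 0;

/* stand-in for advancing minRecoveryPoint during recovery; forward only */
static void advance_min_recovery_point(XLogPtr lsn)
{
    if (lsn > min_recovery_point)
    {
        min_recovery_point = lsn;
        printf("minRecoveryPoint advanced to %llu\n",
               (unsigned long long) lsn);
    }
}

/* consistent once replay has covered everything the data files may reflect */
static int consistent(XLogPtr last_replayed)
{
    return last_replayed >= min_recovery_point;
}

int main(void)
{
    advance_min_recovery_point(100);   /* e.g. before an unrepeatable change */
    advance_min_recovery_point(80);    /* never moves backwards */

    printf("replayed to  90: %s\n", consistent(90) ? "consistent" : "not yet");
    printf("replayed to 100: %s\n", consistent(100) ? "consistent" : "not yet");
    return 0;
}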
Hello,
I think that minRecoveryPoint should be updated before the data-file
changes, so XLogFlush() should be called before smgrtruncate(). No?
Mmm... As far as I saw in xact_redo_commit_internal, it seems XLogFlush
should be done AFTER the substantial redo operation. Is that wrong? If so,
I suppose xact_redo_commit_internal should be fixed in the same way.
At Mon, 10 Dec 2012 00:41:34 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwEUpEvK=s0xRFinWnoasjhy3poD3HiE4namoWHK0a-d_w@mail.gmail.com>
On Thu, Dec 6, 2012 at 1:04 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 993bc49..d34ab65 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -519,6 +519,12 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
         visibilitymap_truncate(rel, xlrec->blkno);

         FreeFakeRelcacheEntry(rel);
+
+        /*
+         * Xlogs before this record is unrepeatable, so winding
+         * minRecoveryPoint to here.
+         */
+        XLogFlush(lsn);
     }
     else
         elog(PANIC, "smgr_redo: unknown op code %u", info);

I think that minRecoveryPoint should be updated before the data-file
changes, so XLogFlush() should be called before smgrtruncate(). No?
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Monday, December 10, 2012 7:16 AM Kyotaro HORIGUCHI wrote:
Thank you.
I think moving CheckRecoveryConsistency() after the redo apply loop
might cause a problem. Currently it is done before the
recoveryStopsHere() function, which can allow connections in hot
standby. But if recovery pauses or stops in that function for some
reason, connections might not be allowed because
CheckRecoveryConsistency() has not been called.

It depends on the precise meaning of minRecoveryPoint. I've not found
an explicit explanation of it.

Currently minRecoveryPoint is updated only in XLogFlush; other updates
of it seem to reset it to InvalidXLogRecPtr. XLogFlush seems to be
called AFTER the core redo operation has been done, at least in
xact_redo_commit_internal. This suggests to me that the meaning of
minRecoveryPoint is: "just AFTER applying the XLog record at the
current LSN, the database files can be assumed to be consistent."

At Mon, 10 Dec 2012 00:36:31 +0900, Fujii Masao <masao.fujii@gmail.com>
wrote in <CAHGQGwG4W5QZ7+LJimg8xxuevwz0bYniHmZLZmWf0j6kBiuRCg@mail.gmail.com>

Yes, so we should just add the CheckRecoveryConsistency() call after
rm_redo rather than moving it? This issue is related to the old discussion:
http://archives.postgresql.org/pgsql-bugs/2012-09/msg00101.php

Since I've not clearly understood the problem with omitting it before
redo, and it also seems to have no performance impact, I have no
objection to placing it both before and after the redo.
I have tried that way as well, but it didn't completely resolve the problem
reported in the above link, because the defect arises the first time
ReadRecord() is called. To resolve the defect mentioned in the link by
Fujii Masao, we actually need to check and try to reset backupStartPoint
before each ReadRecord().

The reason is that ReadRecord() can go and start waiting for records from
the master. So if backupStartPoint has not been reset and
CheckRecoveryConsistency() has not been done, it can keep waiting for
records from the master and no connections will be allowed on the standby.
I have modified the code of XLogPageRead(...) such that before it calls
WaitForWALToBecomeAvailable(), the following code is added:
if (!XLogRecPtrIsInvalid(ControlFile->backupEndPoint) &&
    XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
{
    /*
     * We have reached the end of base backup, the point where
     * the minimum recovery point in pg_control indicates. The
     * data on disk is now consistent. Reset backupStartPoint
     * and backupEndPoint.
     */
    elog(DEBUG1, "end of backup reached");

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
    MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
    ControlFile->backupEndRequired = false;
    UpdateControlFile();

    LWLockRelease(ControlFileLock);
}

CheckRecoveryConsistency();
This completely resolved the problem reported in the above link for me.
With Regards,
Amit Kapila.
On 10.12.2012 03:52, Kyotaro HORIGUCHI wrote:
I think that minRecoveryPoint should be updated before the data-file
changes, so XLogFlush() should be called before smgrtruncate(). No?

Mmm... As far as I saw in xact_redo_commit_internal, it seems XLogFlush
should be done AFTER the substantial redo operation. Is that wrong? If so,
I suppose xact_redo_commit_internal should be fixed in the same way.
So, two options:
1. Redo truncation, then XLogFlush()
There's a window where the original problem could still occur: The file
is truncated, and you crash before XLogFlush() finishes.
2. XLogFlush(), then redo truncation.
If the truncation fails for some reason (disk failure?) and you have
already updated minRecoveryPoint, you cannot start up until you somehow
fix the problem so that the truncation can succeed, and then finish
recovery.
Both options have their merits. The window in 1. is very small, and in
2., the probability that an unlink() or truncation fails is very small
as well.
We've chosen 1. in xact_redo_commit_internal(), but I don't think that
was a carefully made decision, it just imitates what happens in the
corresponding non-redo commit path. In xact_redo_commit_internal(), it
makes sense to do XLogFlush() afterwards, for CREATE DATABASE and CREATE
TABLESPACE. Those create files, and if you e.g. run out of disk space or hit
a non-existent path, you don't want minRecoveryPoint to be updated yet.
Otherwise you can no longer recover to the point just before the
creation of those files. But in case of DROP DATABASE, you have the
exact same situation as with truncation: if you have already deleted
some files, you must not be able to stop recovery at a point before that.
So I'd say we should update minRecoveryPoint first, then
truncate/delete. But we should still keep the XLogFlush() at the end of
xact_redo_commit_internal(), for the case where files/directories are
created. Patch attached.
- Heikki
Attachments:
xlogflush-before-truncation-1.patch (text/x-diff)
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 4617,4639 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
}
/* Make sure files supposed to be dropped are dropped */
! for (i = 0; i < nrels; i++)
{
! SMgrRelation srel = smgropen(xnodes[i], InvalidBackendId);
! ForkNumber fork;
! for (fork = 0; fork <= MAX_FORKNUM; fork++)
! XLogDropRelation(xnodes[i], fork);
! smgrdounlink(srel, true);
! smgrclose(srel);
}
/*
* We issue an XLogFlush() for the same reason we emit ForceSyncCommit()
! * in normal operation. For example, in DROP DATABASE, we delete all the
! * files belonging to the database, and then commit the transaction. If we
! * crash after all the files have been deleted but before the commit, you
! * have an entry in pg_database without any files. To minimize the window
* for that, we use ForceSyncCommit() to rush the commit record to disk as
* quick as possible. We have the same window during recovery, and forcing
* an XLogFlush() (which updates minRecoveryPoint during recovery) helps
--- 4617,4660 ----
}
/* Make sure files supposed to be dropped are dropped */
! if (nrels > 0)
{
! /*
! * First update minimum recovery point to cover this WAL record. Once
! * a relation is deleted, there's no going back. The buffer manager
! * enforces the WAL-first rule for normal updates to relation files,
! * so that the minimum recovery point is always updated before the
! * corresponding change in the data file is flushed to disk, but we
! * have to do the same here since we're bypassing the buffer manager.
! *
! * Doing this before the deleting the files means that if a deletion
! * fails for some reason, you cannot start up the system even after
! * restart, until you fix the underlying situation so that the
! * deletion will succeed. Alternatively, we could update the minimum
! * recovery point after deletion, but that would leave a small window
! * where the WAL-first rule would be violated.
! */
! XLogFlush(lsn);
! for (i = 0; i < nrels; i++)
! {
! SMgrRelation srel = smgropen(xnodes[i], InvalidBackendId);
! ForkNumber fork;
!
! for (fork = 0; fork <= MAX_FORKNUM; fork++)
! XLogDropRelation(xnodes[i], fork);
! smgrdounlink(srel, true);
! smgrclose(srel);
! }
}
/*
* We issue an XLogFlush() for the same reason we emit ForceSyncCommit()
! * in normal operation. For example, in CREATE DATABASE, we copy all the
! * files from the template database, and then commit the transaction. If we
! * crash after all the files have been copied but before the commit, you
! * have files in the data directory without an entry in pg_database.
! * To minimize the window
* for that, we use ForceSyncCommit() to rush the commit record to disk as
* quick as possible. We have the same window during recovery, and forcing
* an XLogFlush() (which updates minRecoveryPoint during recovery) helps
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 482,487 **** smgr_redo(XLogRecPtr lsn, XLogRecord *record)
--- 482,504 ----
*/
smgrcreate(reln, MAIN_FORKNUM, true);
+ /*
+ * Before we perform the truncation, update minimum recovery point
+ * to cover this WAL record. Once the relation is truncated, there's
+ * no going back. The buffer manager enforces the WAL-first rule
+ * for normal updates to relation files, so that the minimum recovery
+ * point is always updated before the corresponding change in the
+ * data file is flushed to disk.
+ *
+ * Doing this before the truncation means that if the truncation fails
+ * for some reason, you cannot start up the system even after restart,
+ * until you fix the underlying situation so that the truncation will
+ * succeed. Alternatively, we could update the minimum recovery point
+ * after truncation, but that would leave a small window where the
+ * WAL-first rule could be violated.
+ */
+ XLogFlush(lsn);
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
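A condensed standalone sketch of the ordering rule the patch above applies (illustrative C only; the function names are invented, not PostgreSQL's): for unrepeatable redo actions such as dropping or truncating files, the minimum recovery point is advanced before the files are touched, while for file-creating actions it is advanced only after they have succeeded, so a failure still leaves a usable stopping point just before the record.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogPtr;

static XLogPtr min_recovery_point = 0;

static void flush_min_recovery_point(XLogPtr lsn)
{
    if (lsn > min_recovery_point)
        min_recovery_point = lsn;
}

/* unrepeatable redo: advance minRecoveryPoint first, then change the files */
static void redo_drop_or_truncate(XLogPtr lsn)
{
    flush_min_recovery_point(lsn);   /* WAL-first: no going back once files change */
    printf("files dropped/truncated (record ending at %llu)\n",
           (unsigned long long) lsn);
}

/* file-creating redo: change the files first, then advance */
static void redo_create_files(XLogPtr lsn)
{
    printf("files created (record ending at %llu)\n",
           (unsigned long long) lsn);
    flush_min_recovery_point(lsn);   /* only after creation has succeeded */
}

int main(void)
{
    redo_create_files(100);
    redo_drop_or_truncate(200);
    printf("minRecoveryPoint = %llu\n",
           (unsigned long long) min_recovery_point);
    return 0;
}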
On 10.12.2012 13:50, Heikki Linnakangas wrote:
So I'd say we should update minRecoveryPoint first, then
truncate/delete. But we should still keep the XLogFlush() at the end of
xact_redo_commit_internal(), for the case where files/directories are
created. Patch attached.
Committed and backpatched that. Attached is a script I used to reproduce
this problem, going back to 8.4.
- Heikki
Attachments:
On Tue, Dec 11, 2012 at 1:33 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 10.12.2012 13:50, Heikki Linnakangas wrote:
So I'd say we should update minRecoveryPoint first, then
truncate/delete. But we should still keep the XLogFlush() at the end of
xact_redo_commit_internal(), for the case where files/directories are
created. Patch attached.
Sounds reasonable.
Committed and backpatched that. Attached is a script I used to reproduce
this problem, going back to 8.4.
Thanks!
Unfortunately I could reproduce the problem even after that commit.
Attached is the script I used to reproduce the problem.
The cause is that CheckRecoveryConsistency() is called before rm_redo(),
as Horiguchi-san suggested upthread. Imagine the case where
minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
record. When restarting the server with that minRecoveryPoint,
the following would happen, and then PANIC occurs.
1. XLOG_SMGR_TRUNCATE record is read.
2. CheckRecoveryConsistency() is called, and database is marked as
consistent since we've reached minRecoveryPoint (i.e., the location
of XLOG_SMGR_TRUNCATE).
3. XLOG_SMGR_TRUNCATE record is replayed, and invalid page is
found.
4. Since the database has already been marked as consistent, an invalid
page leads to PANIC.
Regards,
--
Fujii Masao
Attachments:
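A standalone sketch of the four-step sequence above (illustrative C; all names are invented): because the consistency check runs before redo and compares against the end of the record just read, the standby is marked consistent one record too early, and the invalid page found while replaying that very record is treated as PANIC instead of WARNING.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogPtr;

static XLogPtr min_recovery_point   = 300;  /* end of the SMGR_TRUNCATE record */
static int     reached_consistency  = 0;

static void check_recovery_consistency(XLogPtr end_of_record_read)
{
    if (!reached_consistency && end_of_record_read >= min_recovery_point)
    {
        reached_consistency = 1;
        printf("consistent recovery state reached\n");
    }
}

static void redo_truncate(void)
{
    /* the invalid page is noticed while replaying this record */
    printf("%s: WAL contains references to invalid pages\n",
           reached_consistency ? "PANIC" : "WARNING");
}

int main(void)
{
    XLogPtr end_rec_ptr = 300;               /* SMGR_TRUNCATE has just been read */

    check_recovery_consistency(end_rec_ptr); /* step 2: consistent too early */
    redo_truncate();                         /* steps 3-4: invalid page -> PANIC */
    return 0;
}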
Hello, I've also found that this does not fix the problem.
So I'd say we should update minRecoveryPoint first, then
truncate/delete. But we should still keep the XLogFlush() at the end of
xact_redo_commit_internal(), for the case where files/directories are
created. Patch attached.

Sounds reasonable.

It makes perfect sense.
Committed and backpatched that. Attached is a script I used to reproduce
this problem, going back to 8.4.

Thanks!
Unfortunately I could reproduce the problem even after that commit.
Attached is the script I used to reproduce the problem.
Me too.
The cause is that CheckRecoveryConsistency() is called before rm_redo(),
as Horiguchi-san suggested upthread. Imagine the case where
minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
record. When restarting the server with that minRecoveryPoint,
the following would happen, and then PANIC occurs.

1. XLOG_SMGR_TRUNCATE record is read.
2. CheckRecoveryConsistency() is called, and database is marked as
consistent since we've reached minRecoveryPoint (i.e., the location
of XLOG_SMGR_TRUNCATE).
3. XLOG_SMGR_TRUNCATE record is replayed, and invalid page is
found.
4. Since the database has already been marked as consistent, an invalid
page leads to PANIC.
Exactly.
In smgr_redo, EndRecPtr, which points to the record following
SMGR_TRUNCATE, is used as the new minRecoveryPoint. On the other
hand, during the second startup of the standby,
CheckRecoveryConsistency checks for consistency with
XLByteLE(minRecoveryPoint, EndRecPtr), which is already true just
BEFORE the SMGR_TRUNCATE record is applied. So reachedConsistency
becomes true just before the SMGR_TRUNCATE record is applied. Bang!
I said in my previous message that I had no objection to placing
CheckRecoveryConsistency both before and after rm_redo, but that was
wrong. Given a minRecoveryPoint value, it should be placed after
rm_redo from the point of view of when the database should be
considered consistent.

Actually, simply moving CheckRecoveryConsistency after rm_redo
resulted in a successful startup, ignoring whatever other reason there
may be for placing it before, which is unknown to me.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 11.12.2012 08:07, Kyotaro HORIGUCHI wrote:
The cause is that CheckRecoveryConsistency() is called before rm_redo(),
as Horiguchi-san suggested upthread. Imagine the case where
minRecoveryPoint is set to the location of the XLOG_SMGR_TRUNCATE
record. When restarting the server with that minRecoveryPoint,
the following would happen, and then PANIC occurs.

1. XLOG_SMGR_TRUNCATE record is read.
2. CheckRecoveryConsistency() is called, and database is marked as
consistent since we've reached minRecoveryPoint (i.e., the location
of XLOG_SMGR_TRUNCATE).
3. XLOG_SMGR_TRUNCATE record is replayed, and invalid page is
found.
4. Since the database has already been marked as consistent, an invalid
page leads to PANIC.

Exactly.

In smgr_redo, EndRecPtr, which points to the record following
SMGR_TRUNCATE, is used as the new minRecoveryPoint. On the other
hand, during the second startup of the standby,
CheckRecoveryConsistency checks for consistency with
XLByteLE(minRecoveryPoint, EndRecPtr), which is already true just
BEFORE the SMGR_TRUNCATE record is applied. So reachedConsistency
becomes true just before the SMGR_TRUNCATE record is applied. Bang!
Ah, I see. I thought moving CheckRecoveryConsistency was just a
nice-to-have thing, so that we'd notice that we're consistent earlier,
so didn't pay attention to that part.
I think the real issue here is that CheckRecoveryConsistency should
be comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr
as it currently does. EndRecPtr points to the end of the last record
*read*, while recoveryLastRecPtr points to the end of the last record
*replayed*. The latter is what CheckRecoveryConsistency should be
concerned with.
BTW, it occurs to me that we have two variables in the shared struct
that are almost the same thing: recoveryLastRecPtr and replayEndRecPtr.
The former points to the end of the last record successfully replayed,
while replayEndRecPtr points to the end of the last record successfully
replayed, or being replayed at the moment. I find that confusing, so I
suggest that we rename recoveryLastRecPtr to make that more clear.
Included in the attached patch.
- Heikki
Attachments:
fix-check-recovery-consistency-1.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9bd7f03..6961a4a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -445,11 +445,15 @@ typedef struct XLogCtlData
XLogRecPtr lastCheckPointRecPtr;
CheckPoint lastCheckPoint;
- /* end+1 of the last record replayed (or being replayed) */
+ /*
+ * lastReplayedEndRecPtr points to the end+1 of the last record
+ * successfully replayed. When we're currently replaying a record, ie.
+ * in a redo function, replayEndRecPtr points to the end+1 of the record
+ * being replayed, otherwise it's equal to lastReplayedEndRecPtr.
+ */
+ XLogRecPtr lastReplayedEndRecPtr;
XLogRecPtr replayEndRecPtr;
TimeLineID replayEndTLI;
- /* end+1 of the last record replayed */
- XLogRecPtr recoveryLastRecPtr;
/* timestamp of last COMMIT/ABORT record replayed (or being replayed) */
TimestampTz recoveryLastXTime;
/* current effective recovery target timeline */
@@ -5759,7 +5763,7 @@ StartupXLOG(void)
SpinLockAcquire(&xlogctl->info_lck);
xlogctl->replayEndRecPtr = ReadRecPtr;
xlogctl->replayEndTLI = ThisTimeLineID;
- xlogctl->recoveryLastRecPtr = EndRecPtr;
+ xlogctl->lastReplayedEndRecPtr = EndRecPtr;
xlogctl->recoveryLastXTime = 0;
xlogctl->currentChunkStartTime = 0;
xlogctl->recoveryPause = false;
@@ -5983,11 +5987,11 @@ StartupXLOG(void)
}
/*
- * Update shared recoveryLastRecPtr after this record has been
- * replayed.
+ * Update lastReplayedEndRecPtr after this record has been
+ * successfully replayed.
*/
SpinLockAcquire(&xlogctl->info_lck);
- xlogctl->recoveryLastRecPtr = EndRecPtr;
+ xlogctl->lastReplayedEndRecPtr = EndRecPtr;
SpinLockRelease(&xlogctl->info_lck);
/* Remember this record as the last-applied one */
@@ -6383,10 +6387,11 @@ CheckRecoveryConsistency(void)
* Have we passed our safe starting point? Note that minRecoveryPoint
* is known to be incorrectly set if ControlFile->backupEndRequired,
* until the XLOG_BACKUP_RECORD arrives to advise us of the correct
- * minRecoveryPoint. All we prior to that is its not consistent yet.
+ * minRecoveryPoint. All we know prior to that is that we're not
+ * consistent yet.
*/
if (!reachedConsistency && !ControlFile->backupEndRequired &&
- XLByteLE(minRecoveryPoint, EndRecPtr) &&
+ XLByteLE(minRecoveryPoint, XLogCtl->lastReplayedEndRecPtr) &&
XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
{
/*
@@ -6398,7 +6403,8 @@ CheckRecoveryConsistency(void)
reachedConsistency = true;
ereport(LOG,
(errmsg("consistent recovery state reached at %X/%X",
- (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr)));
+ (uint32) (XLogCtl->lastReplayedEndRecPtr >> 32),
+ (uint32) XLogCtl->lastReplayedEndRecPtr)));
}
/*
@@ -9094,7 +9100,7 @@ GetXLogReplayRecPtr(TimeLineID *targetTLI)
XLogRecPtr recptr;
SpinLockAcquire(&xlogctl->info_lck);
- recptr = xlogctl->recoveryLastRecPtr;
+ recptr = xlogctl->lastReplayedEndRecPtr;
if (targetTLI)
*targetTLI = xlogctl->RecoveryTargetTLI;
SpinLockRelease(&xlogctl->info_lck);
On Tue, Dec 11, 2012 at 6:14 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I think the real issue here is that CheckRecoveryConsistency should
be comparing minRecoveryPoint against recoveryLastRecPtr, not EndRecPtr as it
currently does. EndRecPtr points to the end of the last record *read*, while
recoveryLastRecPtr points to the end of the last record *replayed*. The
latter is what CheckRecoveryConsistency should be concerned with.

BTW, it occurs to me that we have two variables in the shared struct that
are almost the same thing: recoveryLastRecPtr and replayEndRecPtr. The
former points to the end of the last record successfully replayed, while
replayEndRecPtr points to the end of the last record successfully replayed,
or being replayed at the moment. I find that confusing, so I suggest that we
rename recoveryLastRecPtr to make that more clear. Included in the attached
patch.

The patch looks good. I confirmed that the PANIC error didn't happen,
The patch looks good. I confirmed that the PANIC error didn't happen,
with the patch.
In xlog.c:
* Initialize shared replayEndRecPtr, recoveryLastRecPtr, and
s/recoveryLastRecPtr/lastReplayedEndRecPtr
Regards,
--
Fujii Masao
On 11.12.2012 17:04, Fujii Masao wrote:
The patch looks good. I confirmed that the PANIC error didn't happen,
with the patch.
Thanks. After thinking about this more, I think it's still not quite
right. Now that we fix the check in CheckRecoveryConsistency, we have to
move the call to CheckRecoveryConsistency to after the rm_redo call.
Otherwise you don't enter hot standby mode after replaying the last
record, the one ending at minRecoveryPoint. You have to wait for one
more record to be read (but not replayed), so that
CheckRecoveryConsistency gets called and we let backends in.
As the code stands, the bad placement of CheckRecoveryConsistency is
compensated by the bug that we open up for hot standby one record too
early. The net result is that we open up for hot standby just before
replaying the last record that makes us consistent. The window for
someone to see the inconsistent state is really small, postmaster will
have to receive the signal, start accepting connections, and you have
connect and run a query, all before the startup process gets around to
replay the record its already read into memory.
- Heikki
On 11.12.2012 17:04, Fujii Masao wrote:
The patch looks good. I confirmed that the PANIC error didn't happen,
with the patch.
Ok, committed, and also moved the CheckRecoveryConsistency call. Please
take a look to double-check that I didn't miss anything.
- Heikki
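For reference, a standalone sketch of the replay-loop ordering after these changes (illustrative C; the variable names only mirror, and are not, the ones in xlog.c): the consistency check compares minRecoveryPoint against the end of the last record actually replayed and runs after rm_redo, so the record ending at minRecoveryPoint has been applied before hot standby is opened.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogPtr;

static XLogPtr min_recovery_point    = 300;
static XLogPtr last_replayed_end_ptr = 0;
static int     reached_consistency   = 0;

static void check_recovery_consistency(void)
{
    if (!reached_consistency && last_replayed_end_ptr >= min_recovery_point)
    {
        reached_consistency = 1;
        printf("consistent recovery state reached at %llu\n",
               (unsigned long long) last_replayed_end_ptr);
    }
}

int main(void)
{
    XLogPtr records[] = {100, 200, 300};   /* end LSNs of three WAL records */
    int     i;

    for (i = 0; i < 3; i++)
    {
        XLogPtr end_rec_ptr = records[i];          /* record has been read */

        printf("replaying record ending at %llu\n",
               (unsigned long long) end_rec_ptr);  /* rm_redo() would run here */

        last_replayed_end_ptr = end_rec_ptr;       /* updated only after redo */
        check_recovery_consistency();              /* checked against replayed, not read */
    }
    return 0;
}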
On Wed, Dec 12, 2012 at 2:03 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 11.12.2012 17:04, Fujii Masao wrote:
The patch looks good. I confirmed that the PANIC error didn't happen,
with the patch.

Ok, committed, and also moved the CheckRecoveryConsistency call. Please take
a look to double-check that I didn't miss anything.
Looks good to me. Thanks a lot!
Regards,
--
Fujii Masao