PANIC during crash recovery of a recently promoted standby

Started by Pavan Deolasee, almost 8 years ago · 21 messages · pgsql-hackers
#1 Pavan Deolasee
pavan.deolasee@gmail.com

Hello,

I recently investigated a problem where a standby is promoted to be the new
master. The promoted standby crashes shortly thereafter for whatever
reason. Upon running crash recovery, the promoted standby (now master)
PANICs with a message such as:

PANIC,XX000,"WAL contains references to invalid pages",,,,,,,,"XLogCheckInvalidPages, xlogutils.c:242",""

After investigation, I was able to construct a reproduction scenario for
this problem. The attached TAP test (thanks Alvaro for converting my bash
script to a TAP test) demonstrates the problem. The test is probably
sensitive to timing, but it reproduces the problem consistently, at least
at my end. While the original report was for 9.6, I can reproduce it on
master, and thus it probably affects all supported releases.

Investigations point to a possible bug where we fail to update
minRecoveryPoint after completing the ongoing restart point upon promotion.
IMV after promotion the new master must always recover to the end of the
WAL to ensure that all changes are applied correctly. But what we have
instead is that minRecoveryPoint remains set to a prior location because of
this:

    /*
     * Update pg_control, using current time.  Check that it still shows
     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
     * this is a quick hack to make sure nothing really bad happens if somehow
     * we get here after the end-of-recovery checkpoint.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        ControlFile->time = (pg_time_t) time(NULL);

        /*
         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
         * this will have happened already while writing out dirty buffers,
         * but not necessarily - e.g. because no buffers were dirtied.  We do
         * this because a non-exclusive base backup uses minRecoveryPoint to
         * determine which WAL files must be included in the backup, and the
         * file (or files) containing the checkpoint record must be included,
         * at a minimum.  Note that for an ordinary restart of recovery there's
         * no value in having the minimum recovery point any earlier than this
         * anyway, because redo will begin just after the checkpoint record.
         */
        if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
        {
            ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
            ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;

            /* update local copy */
            minRecoveryPoint = ControlFile->minRecoveryPoint;
            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
        }
        if (flags & CHECKPOINT_IS_SHUTDOWN)
            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
        UpdateControlFile();
    }
    LWLockRelease(ControlFileLock);

After promotion, minRecoveryPoint is only updated (cleared) when the
first regular checkpoint completes. If a crash happens before that, we will
run crash recovery with a stale minRecoveryPoint, which results in the
PANIC that we diagnosed. The test case was written to reproduce the issue
as reported to us. Thus the test case TRUNCATEs and extends the table at
hand after promotion. The crash shortly thereafter leaves the pages in an
uninitialised state because the shared buffers have not yet been flushed to
disk.

During crash recovery, we see uninitialised pages for WAL records written
before the promotion. These pages are remembered, and we expect to see
either a DROP TABLE or a TRUNCATE WAL record before minRecoveryPoint is
reached. But since minRecoveryPoint still points to a WAL location prior to
the TRUNCATE operation, crash recovery hits minRecoveryPoint before seeing
the TRUNCATE WAL record, which results in the PANIC.

I propose that we should always clear minRecoveryPoint after promotion
to ensure that crash recovery always runs to the end if a just-promoted
standby crashes before completing its first regular checkpoint. A WIP patch
is attached.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-A-new-TAP-test-to-test-a-recovery-bug.patch (application/octet-stream, +72 -1)
0002-Ensure-recovery-is-run-to-the-end-upon-promotion-of-.patch (application/octet-stream, +7 -2)
#2 Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#1)
Re: PANIC during crash recovery of a recently promoted standby

On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:

I propose that we should always clear the minRecoveryPoint after promotion
to ensure that crash recovery always run to the end if a just-promoted
standby crashes before completing its first regular checkpoint. A WIP patch
is attached.

I have been playing with your patch and upgraded the test to check as
well for cascading standbys. We could use that in the final patch.
That's definitely something to add in the recovery test suite, and the
sleep phases should be replaced by waits on replay and/or flush.

Still, that approach looks sensitive to me. A restart point could be
running while the end-of-recovery record is inserted, so your patch
could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
point would happily update again the control file's minRecoveryPoint to
lastCheckPointEndPtr because it would see that the former is older than
lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
you could still crash on invalid pages?

I need to think a bit more about that stuff, but one idea would be to
use a special state in the control file to mark it as ending recovery,
this way we would control race conditions with restart points.
--
Michael

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#2)
Re: PANIC during crash recovery of a recently promoted standby

Michael Paquier wrote:

On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:

I propose that we should always clear the minRecoveryPoint after promotion
to ensure that crash recovery always run to the end if a just-promoted
standby crashes before completing its first regular checkpoint. A WIP patch
is attached.

I have been playing with your patch and upgraded the test to check as
well for cascading standbys. We could use that in the final patch.
That's definitely something to add in the recovery test suite, and the
sleep phases should be replaced by waits on replay and/or flush.

Still, that approach looks sensitive to me. A restart point could be
running while the end-of-recovery record is inserted, so your patch
could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
point would happily update again the control file's minRecoveryPoint to
lastCheckPointEndPtr because it would see that the former is older than
lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
you could still crash on invalid pages?

Yeah, I had this exact comment, but I was unable to come up with a test
case that would cause a problem.

I need to think a bit more about that stuff, but one idea would be to
use a special state in the control file to mark it as ending recovery,
this way we would control race conditions with restart points.

Hmm. Can we change the control file in released branches? (It should
be possible to make the new server understand both old and new formats,
but I think this is breaking new ground and it looks easy to introduce
more bugs there.)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#3)
Re: PANIC during crash recovery of a recently promoted standby

On Fri, May 11, 2018 at 12:09:58PM -0300, Alvaro Herrera wrote:

Yeah, I had this exact comment, but I was unable to come up with a test
case that would cause a problem.

pg_ctl promote would wait for the control file to be updated, so you
cannot use it in the TAP tests to trigger the promotion. Still I think
I found one after waking up? Please note I have not tested it:
- Use a custom trigger file and then trigger promotion with a signal.
- Use a sleep command in recovery_end_command to increase the window, as
what matters is sleeping after CreateEndOfRecoveryRecord updates the
control file.
- Issue a restart point on the standby, which will update the control
file.
- Stop the standby with immediate mode.
- Start the standby, it should see unreferenced pages.

Hmm. Can we change the control file in released branches? (It should
be possible to make the new server understand both old and new formats,
but I think this is breaking new ground and it looks easy to introduce
more bugs there.)

We definitely can't, even if the new value is added at the end of
DBState :(

A couple of wild ideas, not tested, again after waking up:
1) We could also abuse existing values by using the existing
DB_IN_CRASH_RECOVERY or DB_STARTUP. Still that's not completely accurate, as
the cluster may be open for business as a hot standby.
2) Invent a new special value for XLogRecPtr, normally impossible to
reach, which uses high bits.
--
Michael

#5 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#4)
Re: PANIC during crash recovery of a recently promoted standby

On Sat, May 12, 2018 at 07:41:33AM +0900, Michael Paquier wrote:

pg_ctl promote would wait for the control file to be updated, so you
cannot use it in the TAP tests to trigger the promotion. Still I think
I found one after waking up? Please note I have not tested it:
- Use a custom trigger file and then trigger promotion with a signal.
- Use a sleep command in recovery_end_command to increase the window, as
what matters is sleeping after CreateEndOfRecoveryRecord updates the
control file.
- Issue a restart point on the standby, which will update the control
file.
- Stop the standby with immediate mode.
- Start the standby, it should see unreferenced pages.

I have been looking at that this morning, and actually I have been able
to create a test case where minRecoveryPoint goes backwards using
Pavan's patch. Using a sleep in recovery_end_command has proved to not
be enough so I had to patch the backend with a couple of sleeps and some
processing, mainly:
- One sleep in CreateRestartPoint to make a restart point wait before
updating the control file, which I set at 5s.
- One sleep just after calling CreateEndOfRecoveryRecord, which has been
set at 20s.
- Trigger an asynchronous checkpoint using IPC::Run::start.
- Trigger a promotion with a trigger file and SIGUSR2 to the
postmaster.

The rest of the test is crafted with some magic wait number and adds
some logs to ease the monitoring of the issue. In order to get a crash,
I think that you would need to crash the backend after creating the last
WAL records which generate the invalid page references, and also slow
down the last restart point which makes minRecoveryPoint go backwards,
which is err... Complicated to make deterministic. Still if you apply
the patch attached you would see log entries on the standby as follows:
2018-05-14 12:24:15.065 JST [17352] LOG: selected new timeline ID: 2
2018-05-14 12:24:15.074 JST [17352] LOG: archive recovery complete
2018-05-14 12:24:15.074 JST [17352] WARNING: CreateEndOfRecoveryRecord: minRecoveryPoint is 0/32C4258 before update
2018-05-14 12:24:15.074 JST [17352] WARNING: CreateEndOfRecoveryRecord: minRecoveryPoint is 0/0 after update
2018-05-14 12:24:17.067 JST [17353] WARNING: checkPointCopy.redo = 0/30B3D70, lastCheckPoint.redo = 0/31BC208
2018-05-14 12:24:17.067 JST [17353] WARNING: CreateRestartPoint: minRecoveryPoint is 0/0 before update, lastCheckPointEndPtr is 0/31BC2B0
2018-05-14 12:24:17.067 JST [17353] WARNING: CreateRestartPoint: minRecoveryPoint is 0/31BC2B0 after update

So minRecoveryPoint can definitely go backwards here, which is not
good. Attached is a patch which includes Pavan's fix on top of the test
case I crafted with what is upthread. You just need to apply it on
HEAD.
--
Michael

Attachments:

min-recovery-point-backwards.patch (text/x-diff, +189 -2)
#6 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Alvaro Herrera (#3)
Re: PANIC during crash recovery of a recently promoted standby

On Fri, May 11, 2018 at 8:39 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Michael Paquier wrote:

On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:

I propose that we should always clear the minRecoveryPoint after promotion
to ensure that crash recovery always run to the end if a just-promoted
standby crashes before completing its first regular checkpoint. A WIP
patch is attached.

I have been playing with your patch and upgraded the test to check as
well for cascading standbys. We could use that in the final patch.
That's definitely something to add in the recovery test suite, and the
sleep phases should be replaced by waits on replay and/or flush.

Still, that approach looks sensitive to me. A restart point could be
running while the end-of-recovery record is inserted, so your patch
could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
point would happily update again the control file's minRecoveryPoint to
lastCheckPointEndPtr because it would see that the former is older than
lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
you could still crash on invalid pages?

Yeah, I had this exact comment, but I was unable to come up with a test
case that would cause a problem.

Looks like I didn't understand Alvaro's comment when he mentioned it to me
off-list. But I now see what Michael and Alvaro mean, and that indeed seems
like a problem. I was thinking that the test for (ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY) would ensure that minRecoveryPoint can't be updated
after the standby is promoted. While that's true once the state is
DB_IN_PRODUCTION, a restart point may finish after we have written the
end-of-recovery record but before we're in production, and thus
minRecoveryPoint may again be set.

I need to think a bit more about that stuff, but one idea would be to
use a special state in the control file to mark it as ending recovery,
this way we would control race conditions with restart points.

Hmm. Can we change the control file in released branches? (It should
be possible to make the new server understand both old and new formats,
but I think this is breaking new ground and it looks easy to introduce
more bugs there.)

Can't we just remember that in shared memory state instead of writing it to
the control file? Then, if we've already performed end-of-recovery, we
don't update minRecoveryPoint when the restart point completes.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#6)
Re: PANIC during crash recovery of a recently promoted standby

On Mon, May 14, 2018 at 01:14:22PM +0530, Pavan Deolasee wrote:

Looks like I didn't understand Alvaro's comment when he mentioned it to me
off-list. But I now see what Michael and Alvaro mean and that indeed seems
like a problem. I was thinking that the test for (ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated
after the standby is promoted. While that's true for a DB_IN_PRODUCTION, the
RestartPoint may finish after we have written end-of-recovery record, but
before we're in production and thus the minRecoveryPoint may again be set.

Yeah, this has been something I considered as well at first, but I was not
confident enough that setting minRecoveryPoint to InvalidXLogRecPtr
was actually a safe thing for timeline switches.

So I have spent a good portion of today testing and playing with it to
be confident enough that this was right, and I have finished with the
attached. The patch adds a new flag to XLogCtl which marks whether the
control file has been updated after the end-of-recovery record has been
written, so that minRecoveryPoint does not get updated because of a
restart point running in parallel.

I have also reworked the test case you sent, removing the manual sleeps
and replacing them with correct wait points. There is also no point in
waiting after promotion, as pg_ctl promote implies a wait. Another
important thing is that you need to use wal_log_hints = off to see a
crash, which is something that allows_streaming actually enables.

Comments are welcome.
--
Michael

Attachments:

recovery-panic-michael.patch (text/x-diff, +127 -2)
#8 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#7)
Re: PANIC during crash recovery of a recently promoted standby

Hello.

At Thu, 24 May 2018 16:57:07 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180524075707.GE15445@paquier.xyz>

On Mon, May 14, 2018 at 01:14:22PM +0530, Pavan Deolasee wrote:

Looks like I didn't understand Alvaro's comment when he mentioned it to me
off-list. But I now see what Michael and Alvaro mean and that indeed seems
like a problem. I was thinking that the test for (ControlFile->state ==
DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated
after the standby is promoted. While that's true for a DB_IN_PRODUCTION, the
RestartPoint may finish after we have written end-of-recovery record, but
before we're in production and thus the minRecoveryPoint may again be set.

Yeah, this has been something I considered as well first, but I was not
confident enough that setting up minRecoveryPoint to InvalidXLogRecPtr
was actually a safe thing for timeline switches.

So I have spent a good portion of today testing and playing with it to
be confident enough that this was right, and I have finished with the
attached. The patch adds a new flag to XLogCtl which marks if the
control file has been updated after the end-of-recovery record has been
written, so as minRecoveryPoint does not get updated because of a
restart point running in parallel.

I have also reworked the test case you sent, removing the manuals sleeps
and replacing them with correct wait points. There is also no point to
wait after promotion as pg_ctl promote implies a wait. Another
important thing is that you need to use wal_log_hints = off to see a
crash, which is something that allows_streaming actually enables.

Comments are welcome.

As a result of some poking around, my diagnosis is different
from yours.

(I believe that) By definition, recovery doesn't end until the
end-of-recovery checkpoint ends, so from that viewpoint I think it
is wrong to clear ControlFile->minRecoveryPoint before the end.

Invalid-page checking during crash recovery is harmful rather than
useless. It is done by CheckRecoveryConsistency even in crash
recovery, against its expectation, because there is a case where
minRecoveryPoint is valid but InArchiveRecovery is false. The two
variables there are in contradiction with each other.

The immediate cause of the contradiction is that StartupXLOG wrongly
initializes the local minRecoveryPoint from the control file's value
even under crash recovery. updateMinRecoveryPoint should also be
turned off during crash recovery. The case of a crash after the end
of the last checkpoint is handled in UpdateMinRecoveryPoint, but this
case, where crash recovery starts while the control file still holds
a valid minRecoveryPoint, was overlooked.

Finally, the attached patch seems to fix the issue. It passes
015_promotion.pl, and the problem doesn't happen with
014_promotion_bug.pl. It also passes the existing tests
002-014.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Fix-wrong-behavior-during-crash-recovery.patch (text/x-patch, +13 -15)
#9 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#8)
Re: PANIC during crash recovery of a recently promoted standby

On Thu, Jun 07, 2018 at 07:58:29PM +0900, Kyotaro HORIGUCHI wrote:

Invalid-page checking during crash recovery is hamful rather than
useless. It is done by CheckRecoveryConsistency even in crash
recovery against its expectation because there's a case where
minRecoveryPoint is valid but InArchiveRecovery is false. The two
variable there seems to be in contradiction with each other.

The immediate cause of the contradition is that StartXLOG wrongly
initializes local minRecoveryPoint from control file's value even
under crash recovery. updateMinRecoveryPoint also should be
turned off during crash recovery. The case of crash after last
checkpoint end has been treated in UpdateMinRecoveryPoint but
forgot to consider this case, where crash recovery has been
started while control file is still holding valid
minRecoveryPoint.

Hmm. This patch looks interesting, and those issues need a very careful
look. This is one of those things which should be fixed as part of
the extra CF, so I am planning to look at it in detail very soon,
perhaps by the end of this week...

I have by the way noticed that nothing was registered in the CF app:
https://commitfest.postgresql.org/18/1680/
I have added as well a bullet point in the open item page of v11 for
older bugs.
--
Michael

#10 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#8)
Re: PANIC during crash recovery of a recently promoted standby

On Thu, Jun 07, 2018 at 07:58:29PM +0900, Kyotaro HORIGUCHI wrote:

(I believe that) By definition recovery doesn't end until the
end-of-recovery check point ends so from the viewpoint I think it
is wrong to clear ControlFile->minRecoveryPoint before the end.

Invalid-page checking during crash recovery is hamful rather than
useless. It is done by CheckRecoveryConsistency even in crash
recovery against its expectation because there's a case where
minRecoveryPoint is valid but InArchiveRecovery is false. The two
variable there seems to be in contradiction with each other.

The immediate cause of the contradition is that StartXLOG wrongly
initializes local minRecoveryPoint from control file's value even
under crash recovery. updateMinRecoveryPoint also should be
turned off during crash recovery. The case of crash after last
checkpoint end has been treated in UpdateMinRecoveryPoint but
forgot to consider this case, where crash recovery has been
started while control file is still holding valid
minRecoveryPoint.

I have been digesting your proposal and I reviewed it, and I think that
what you are proposing here is on the right track. However, the updates
around minRecoveryPoint and minRecoveryPointTLI ought to be more
consistent, because inconsistency there could cause future issues. I have
spotted two bugs where I think the problem is not fixed: when replaying a
XLOG_PARAMETER_CHANGE WAL record, minRecoveryPoint and minRecoveryPointTLI
would still get updated from the control file values, which can still lead
to failures, as CheckRecoveryConsistency could still happily trigger a
PANIC, so I think that we had better keep those values consistent as
long as crash recovery runs. And XLogNeedsFlush() has a similar
problem.

Note that there is also the case where the startup process switches
from crash recovery to archive recovery, in which case the update of the
local copies has to happen once the switch is done. Your patch covers
that by just updating updateMinRecoveryPoint each time crash recovery
happens, but that's not completely consistent, and it seems that it also
missed the fact that updateMinRecoveryPoint needs to be switched back to
true in that case, as the startup process can update the control file.
Otherwise the startup process would not be able to update minRecoveryPoint
when it calls XLogFlush, for example.

The thought of trying to get rid of updateMinRecoveryPoint entirely has
crossed my mind, but that's not wise, as its default value allows the
checkpointer to update minRecoveryPoint when started, which also has to
happen once the startup process gets out of recovery and tells the
postmaster to start the checkpointer. For backends as well that's a sane
default.

There is also no point in taking ControlFileLock when checking if the
local copy of minRecoveryPoint is valid or not, so this can be
bypassed.

The assertion in CheckRecoveryConsistency is definitely a good idea as
this should only be called by the startup process, so we can keep it.

In the attached, I have fixed the issues I found and added the test case
which should be included in the final commit. Its concept is pretty
simple: keep the local copy of minRecoveryPoint at InvalidXLogRecPtr as
long as crash recovery runs, and switch it back to normal if there is a
switch to archive recovery after crash recovery.

This is not really a complicated patch, and it took a lot of energy from
me the last couple of days per the nature of the many scenarios to think
about... So an extra pair of eyes from another committer would be
welcome. I am letting that cool down for a couple of days now.
--
Michael

Attachments:

recovery-panic-michael-v2.patch (text/x-diff, +151 -28)
#11 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#10)
Re: PANIC during crash recovery of a recently promoted standby

On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier <michael@paquier.xyz>
wrote:

This is not really a complicated patch, and it took a lot of energy from
me the last couple of days per the nature of the many scenarios to think
about...

Thanks for the efforts. It wasn't an easy bug to chase to begin with, so I
am not surprised there were additional problems that I missed.

So an extra pair of eyes from another committer would be
welcome. I am letting that cool down for a couple of days now.

I am not a committer, so I don't know if my pair of eyes counts, but FWIW
the patch looks good to me except for a couple of minor points.

+/*
+ * Local copies of equivalent fields in the control file.  When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
 static XLogRecPtr minRecoveryPoint; /* local copy of ControlFile->minRecoveryPoint */

The inline comment looks unnecessary now that we have the comment at the top.

@@ -4266,6 +4276,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 minRecoveryPoint = ControlFile->minRecoveryPoint;
 minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;

+ /*
+ * The startup process can update its local copy of
+ * minRecoveryPoint from that point.
+ */

s/that/this

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12 Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#11)
Re: PANIC during crash recovery of a recently promoted standby

On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote:

On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier <michael@paquier.xyz>
wrote:

So an extra pair of eyes from another committer would be
welcome. I am letting that cool down for a couple of days now.

I am not a committer, so don't know if my pair of eyes count, but FWIW the
patch looks good to me except couple of minor points.

Thanks for grabbing some time, Pavan. Any help is welcome!

+/*
+ * Local copies of equivalent fields in the control file.  When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
static XLogRecPtr minRecoveryPoint; /* local copy of ControlFile->minRecoveryPoint */

The inline comment looks unnecessary now that we have comment at the
top.

I'll fix that.

@@ -4266,6 +4276,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 minRecoveryPoint = ControlFile->minRecoveryPoint;
 minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;

+ /*
+ * The startup process can update its local copy of
+ * minRecoveryPoint from that point.
+ */

s/that/this

This one too.
--
Michael

#13 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#12)
Re: PANIC during crash recovery of a recently promoted standby

Hello, sorry for the absence; I have looked at the second patch.

At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180622044521.GC5215@paquier.xyz>

On Fri, Jun 22, 2018 at 10:08:24AM +0530, Pavan Deolasee wrote:

On Fri, Jun 22, 2018 at 9:28 AM, Michael Paquier <michael@paquier.xyz>
wrote:

So an extra pair of eyes from another committer would be
welcome. I am letting that cool down for a couple of days now.

I am not a committer, so don't know if my pair of eyes count, but FWIW the
patch looks good to me except couple of minor points.

Thanks for grabbing some time, Pavan. Any help is welcome!

in previous mail:

I have spotted two
bug where I think the problem is not fixed: when replaying a WAL record
XLOG_PARAMETER_CHANGE, minRecoveryPoint and minRecoveryPointTLI would
still get updated from the control file values which can still lead to
failures as CheckRecoveryConsistency could still happily trigger a
PANIC, so I think that we had better maintain those values consistent as

The fixes in StartupXLOG, CheckRecoveryConsistency, ReadRecord and
xlog_redo look fine (functionally; see below).

long as crash recovery runs. And XLogNeedsFlush() also has a similar
problem.

Here, on the other hand, this patch turns off
updateMinRecoveryPoint if minRecoveryPoint is invalid when
RecoveryInProgress() == true. However, RecoveryInProgress() == true
means archive recovery has already started, and
minRecoveryPoint *should* be updated in that condition.
Actually, minRecoveryPoint is updated just below. If
this is really the right thing, I think some explanation of the
reason is required here.

In xlog_redo there is still a "minRecoveryPoint != 0", which ought
to be written as "!XLogRecPtrIsInvalid(minRecoveryPoint)". (It
seems to be the only one. Double negation is a bit uneasy, but there are
many instances of this kind of coding.)

# I'll go all-out next week.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#13)
Re: PANIC during crash recovery of a recently promoted standby

On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote:

Hello, sorry for the absense and I looked the second patch.

Thanks for the review!

At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier
<michael@paquier.xyz> wrote in <20180622044521.GC5215@paquier.xyz>

long as crash recovery runs. And XLogNeedsFlush() also has a similar
problem.

Here, on the other hand, this patch turns off
updateMinRecoverypoint if minRecoverPoint is invalid when
RecoveryInProgress() == true. Howerver RecovInProg() == true
means archive recovery is already started and and
minRecoveryPoint *should* be updated t for the
condition. Actually minRecoverypoint is updated just below. If
this is really right thing, I think that some explanation for the
reason is required here.

LocalRecoveryInProgress is just a local copy of SharedRecoveryInProgress
so RecoveryInProgress also returns true if crash recovery is running.
But perhaps I am missing what you mean? The point here is that redo can
call XLogNeedsFlush, no?

In xlog_redo there is still a "minRecoveryPoint != 0", which ought
to be written as "!XLogRecPtrIsInvalid(minRecoveryPoint)". (It
seems to be the only one. The double negation is a bit uneasy, but
there are many instances of this kind of coding.)

It is possible to use a direct comparison with InvalidXLogRecPtr
instead of a double negation.

# I'll be out all of next week.

Enjoy your vacations!
--
Michael

#15Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#14)
Re: PANIC during crash recovery of a recently promoted standby

On Fri, Jun 22, 2018 at 03:25:48PM +0900, Michael Paquier wrote:

On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote:

Hello, sorry for the absence, and I have looked at the second patch.

Thanks for the review!

I have been spending some time testing and torturing the patch for all
stable branches, and I have finished with the set of patches attached.

My testing has involved the TAP suite, where I have roughly
backported the v10 infrastructure down to the older versions, which
required tweaking Makefile.global.in and finding out again that
pg_ctl start switched to wait mode by default in 10.

I have spent a bit of time testing this on HEAD, 10 and 9.6. For 9.5,
9.4 and 9.3 I have reproduced the failure and tested the patch, but I
lacked time to perform more tests. The patch set for 9.3~9.5 applies
without conflict across the 3 branches. 9.6 has a conflict in a
comment, and v10 had an extra comment conflict.

Feel free to have a look, I am not completely done with this stuff and
I'll work more tomorrow on checking 9.3~9.5.
--
Michael

Attachments:

promote-panic-96.patch (text/x-diff; charset=iso-8859-1, +70/-32)
promote-panic-10.patch (text/x-diff; charset=iso-8859-1, +157/-32)
promote-panic-master.patch (text/x-diff; charset=iso-8859-1, +157/-32)
promote-panic-95.patch (text/x-diff; charset=iso-8859-1, +70/-32)
promote-panic-94.patch (text/x-diff; charset=iso-8859-1, +70/-32)
promote-panic-93.patch (text/x-diff; charset=iso-8859-1, +70/-32)
#16Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#15)
Re: PANIC during crash recovery of a recently promoted standby

Adding Heikki and Andres in CC here for awareness.

On Wed, Jun 27, 2018 at 05:29:38PM +0900, Michael Paquier wrote:

I have spent a bit of time testing this on HEAD, 10 and 9.6. For 9.5,
9.4 and 9.3 I have reproduced the failure and tested the patch, but I
lacked time to perform more tests. The patch set for 9.3~9.5 applies
without conflict across the 3 branches. 9.6 has a conflict in a
comment, and v10 had an extra comment conflict.

Feel free to have a look, I am not completely done with this stuff and
I'll work more tomorrow on checking 9.3~9.5.

And I have been able to spend the time I wanted to spend on this patch
series with testing for 9.3 to 9.5. Attached are a couple of patches
you can use to reproduce the failures for all the branches:
- For master and 10, the tests are included in the patch and are
proposed for commit.
- On 9.6, I had to tweak the TAP scripts as pg_ctl start has switched to
use the wait mode by default.
- On 9.5, there is a tweak to src/Makefile.global.in which cleans up
tmp_check, and a couple of GUCs are not compatible.
- On 9.4, I had to tweak src/Makefile.global.in so that the temporary
installation path is correct. Again, some GUCs had to be tweaked.
- On 9.3, there is no TAP infrastructure, so I tweaked
src/test/recovery/Makefile to be able to run the tests.

I have also created a bash script, attached, which emulates what the
TAP test does. Apparently for timing reasons, I have not been able to
reproduce the problem with it. Anyway, running (and actually sort of
back-porting) the TAP suite so that the problematic test case can be
run is possible with the sets attached, and it shows the failure, so
we can use that.

Thoughts? I would love more input about the patch concept.
--
Michael

Attachments:

promote_panic.bash (text/plain; charset=us-ascii)
promote-panic-test-93.tar.gz (application/gzip)
promote-panic-test-94.tar.gz (application/gzip, +0/-1)
promote-panic-test-95.tar.gz (application/gzip)
promote-panic-test-96.tar.gz (application/gzip)
#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#14)
Re: PANIC during crash recovery of a recently promoted standby

Hello.

At Fri, 22 Jun 2018 15:25:48 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180622062548.GE5215@paquier.xyz>

On Fri, Jun 22, 2018 at 02:34:02PM +0900, Kyotaro HORIGUCHI wrote:

Hello, sorry for the absence, and I have looked at the second patch.

Thanks for the review!

At Fri, 22 Jun 2018 13:45:21 +0900, Michael Paquier
<michael@paquier.xyz> wrote in <20180622044521.GC5215@paquier.xyz>

long as crash recovery runs. And XLogNeedsFlush() also has a similar
problem.

Here, on the other hand, this patch turns off
updateMinRecoveryPoint if minRecoveryPoint is invalid when
RecoveryInProgress() == true. However, RecoveryInProgress() == true
means archive recovery has already started, and
minRecoveryPoint *should* be updated under that
condition. Actually, minRecoveryPoint is updated just below. If
this is really the right thing, I think some explanation of the
reason is required here.

LocalRecoveryInProgress is just a local copy of SharedRecoveryInProgress
so RecoveryInProgress also returns true if crash recovery is running.
But perhaps I am missing what you mean? The point here is that redo can
call XLogNeedsFlush, no?

My concern at the time was the necessity to turn off
updateMinRecoveryPoint on the fly. (The previous comment seems a
bit confused, sorry.)

When minRecoveryPoint is invalid, there are only two possible
cases: we may be at the very beginning of archive recovery, or we
may be running crash recovery. In the latter case, we have detected
crash recovery before redo starts, so we can turn off
updateMinRecoveryPoint immediately; no further check is needed,
and it is (I think) easier to understand.
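A toy sketch of that idea: decide once, before redo starts, instead of re-checking on every update. All names and state here are simplified stand-ins for the real startup-process globals in xlog.c, not the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Simplified stand-ins for the startup process's state in xlog.c. */
static bool       updateMinRecoveryPoint = true;
static XLogRecPtr minRecoveryPoint = 0;   /* invalid until archive recovery sets it */

/*
 * Sketch of the suggestion: plain crash recovery is known before redo
 * starts, so the flag can be turned off once, up front.
 */
static void begin_redo(bool in_archive_recovery)
{
    if (!in_archive_recovery)
        updateMinRecoveryPoint = false;   /* crash recovery: never advance it */
}

static void update_min_recovery_point(XLogRecPtr lsn)
{
    if (updateMinRecoveryPoint && lsn > minRecoveryPoint)
        minRecoveryPoint = lsn;           /* would be flushed to pg_control */
}
```

With the flag settled before redo, the per-call invalid-LSN check becomes unnecessary in this model.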

In xlog_redo there is still a "minRecoveryPoint != 0", which ought
to be written as "!XLogRecPtrIsInvalid(minRecoveryPoint)". (It
seems to be the only one. The double negation is a bit uneasy, but
there are many instances of this kind of coding.)

It is possible to use a direct comparison with InvalidXLogRecPtr
instead of a double negation.

I'm not sure whether InvalidXLogRecPtr is an abstraction of the
invalid value or just a shortcut for the value itself. That's right
if it's the latter. (There are several places in the patch where the
invalid LSN is assumed to be smaller than any valid value.)

The second diff is the difference between the first patch and
promote_panic_master.diff.

On further thought, as we confirmed upthread (and as the existing
comments say), (minRecoveryPoint == 0) and
!InArchiveRecovery are always equivalent, and
updateMinRecoveryPoint becomes equivalent to them with the v3
patch. That is, we can just remove the variable, and the attached
third patch does that. It also passes all recovery tests, including
the new 015.
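A toy model of that equivalence claim, with the flag removed and InArchiveRecovery tested directly. The names are simplified stand-ins, not the actual v3 patch, and note that the thread later raises a caveat about other processes calling the update function.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Toy model: if minRecoveryPoint is only ever set once archive recovery
 * starts, then (minRecoveryPoint == 0) and !InArchiveRecovery remain
 * equivalent, and a separate updateMinRecoveryPoint flag is redundant.
 */
static bool       InArchiveRecovery = false;
static XLogRecPtr minRecoveryPoint = 0;

static void enter_archive_recovery(XLogRecPtr initial_min)
{
    InArchiveRecovery = true;
    minRecoveryPoint = initial_min;   /* taken from the control file */
}

static void update_min_recovery_point(XLogRecPtr lsn)
{
    /* Checking InArchiveRecovery replaces the removed flag. */
    if (InArchiveRecovery && lsn > minRecoveryPoint)
        minRecoveryPoint = lsn;
}
```

During crash recovery the update is a no-op; once archive recovery begins, the point advances normally.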

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

promote-panic_v3.patch (text/x-patch; charset=us-ascii, +54/-32)
diff_from_promote_panic_master.diff (text/x-patch; charset=us-ascii, +6/-23)
promote-panic_horiguti_v4.patch (text/x-patch; charset=us-ascii, +58/-37)
#18Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#17)
Re: PANIC during crash recovery of a recently promoted standby

On Mon, Jul 02, 2018 at 04:25:13PM +0900, Kyotaro HORIGUCHI wrote:

When minRecoveryPoint is invalid, there are only two possible
cases: we may be at the very beginning of archive recovery, or we
may be running crash recovery. In the latter case, we have detected
crash recovery before redo starts, so we can turn off
updateMinRecoveryPoint immediately; no further check is needed,
and it is (I think) easier to understand.

Er, you are missing the point that updateMinRecoveryPoint is also used
by processes other than the startup process, like the checkpointer,
which actually need to update minRecoveryPoint and rely on the default
value of updateMinRecoveryPoint, which is true...
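The crux is that updateMinRecoveryPoint is a process-local static in xlog.c, so every backend carries its own copy, defaulting to true. A purely illustrative two-process toy model (none of these names are the real layout; minRecoveryPoint really lives in pg_control):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Stand-in for the shared state ultimately kept in pg_control. */
static XLogRecPtr shared_minRecoveryPoint = 0;

/* Toy model of the process-local flag: each backend has its own copy. */
typedef struct
{
    bool updateMinRecoveryPoint;   /* defaults to true in every process */
} Process;

static void update_min_recovery_point(Process *p, XLogRecPtr lsn)
{
    if (p->updateMinRecoveryPoint && lsn > shared_minRecoveryPoint)
        shared_minRecoveryPoint = lsn;
}
```

Disabling the flag in the startup process leaves the checkpointer's copy untouched, so the checkpointer can still advance the shared value.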

I am planning to finish wrapping up this patch, hopefully on Wednesday
JST time, or in the worst case on Thursday. I have had this problem on
my mind for a couple of days now, and I could not find a case where the
approach taken would cause a problem. Opinions are welcome.
--
Michael

#19Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#18)
Re: PANIC during crash recovery of a recently promoted standby

On Mon, Jul 02, 2018 at 10:41:05PM +0900, Michael Paquier wrote:

I am planning to finish wrapping up this patch, hopefully on Wednesday
JST time, or in the worst case on Thursday. I have had this problem on
my mind for a couple of days now, and I could not find a case where the
approach taken would cause a problem. Opinions are welcome.

Okay, pushed and back-patched. Thanks to all who participated in the
thread!
--
Michael

#20Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#19)
Re: PANIC during crash recovery of a recently promoted standby

On Thu, Jul 5, 2018 at 7:20 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jul 02, 2018 at 10:41:05PM +0900, Michael Paquier wrote:

I am planning to finish wrapping up this patch, hopefully on Wednesday
JST time, or in the worst case on Thursday. I have had this problem on
my mind for a couple of days now, and I could not find a case where the
approach taken would cause a problem. Opinions are welcome.

Okay, pushed and back-patched. Thanks to all who participated in the
thread!

Many thanks, Michael, for doing the gruelling work of coming up with a
more complete fix and verifying all the cases in the various back
branches.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#20)