recovery_min_apply_delay in archive recovery causes assertion failure in latch
Hi,
I got the following assertion failure when I enabled recovery_min_apply_delay
and started archive recovery (i.e., I put only recovery.signal not
standby.signal).
TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File:
"latch.c", Line: 522)
Here is the example to reproduce the issue:
----------------------------
initdb -D data
pg_ctl -D data start
psql -c "alter system set recovery_min_apply_delay to '60s'"
psql -c "alter system set archive_mode to on"
psql -c "alter system set archive_command to 'cp %p ../arch/%f'"
psql -c "alter system set restore_command to 'cp ../arch/%f %p'"
mkdir arch
pg_basebackup -D bkp -c fast
pgbench -i
pgbench -t 1000
pg_ctl -D data -m i stop
rm -rf bkp/pg_wal
mv data/pg_wal bkp
rm -rf data
mv bkp data
touch data/recovery.signal
pg_ctl -D data -W start
----------------------------
The latch that causes this assertion failure is recoveryWakeupLatch.
The ownership of this latch is taken only when standby mode is
requested. But this latch can be used when starting archive recovery
with recovery_min_apply_delay set even though it's unowned.
So the assertion failure happened.
Attached patch fixes this issue by making archive recovery always ignore
recovery_min_apply_delay. This change is OK because
recovery_min_apply_delay was introduced for standby mode, I think.
This issue is not new in v12. I observed that the issue was reproduced
in v11. So the back-patch is necessary.
Regards,
--
Fujii Masao
Attachments:
fix-assertion-failure-in-latch.patchapplication/octet-stream; name=fix-assertion-failure-in-latch.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c69eb6dd7..65220de600 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5975,5980 **** recoveryApplyDelay(XLogReaderState *record)
--- 5975,5984 ----
if (!reachedConsistency)
return false;
+ /* nothing to do if standby mode is not requested */
+ if (!StandbyModeRequested)
+ return false;
+
/*
* Is it a COMMIT record?
*
On Mon, Sep 30, 2019 at 12:49:03AM +0900, Fujii Masao wrote:
Attached patch fixes this issue by making archive recovery always ignore
recovery_min_apply_delay. This change is OK because
recovery_min_apply_delay was introduced for standby mode, I think.This issue is not new in v12. I observed that the issue was reproduced
in v11. So the back-patch is necessary.
I have not directly tested, but from a lookup at the code I think
that you are right. Perhaps we'd want more safeguards in
WaitForWALToBecomeAvailable(), like an assert within the
XLOG_FROM_STREAM part similar to the check you are adding? My point
is that we should switch to XLOG_FROM_STREAM only if we are in standby
mode, and that's the only place where the startup process looks at
recoveryWakeupLatch.
--
Michael
On Mon, Sep 30, 2019 at 12:42 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Sep 30, 2019 at 12:49:03AM +0900, Fujii Masao wrote:
Attached patch fixes this issue by making archive recovery always ignore
recovery_min_apply_delay. This change is OK because
recovery_min_apply_delay was introduced for standby mode, I think.This issue is not new in v12. I observed that the issue was reproduced
in v11. So the back-patch is necessary.I have not directly tested, but from a lookup at the code I think
that you are right. Perhaps we'd want more safeguards in
WaitForWALToBecomeAvailable(), like an assert within the
XLOG_FROM_STREAM part similar to the check you are adding? My point
is that we should switch to XLOG_FROM_STREAM only if we are in standby
mode, and that's the only place where the startup process looks at
recoveryWakeupLatch.
Thanks for the review! OK, attached is the patch which also added
two assertion checks as you described.
Regards,
--
Fujii Masao
Attachments:
fix-assertion-failure-in-latch_v2.patchapplication/octet-stream; name=fix-assertion-failure-in-latch_v2.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61ba6b852e..b5a426c535 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5989,5994 **** recoveryApplyDelay(XLogReaderState *record)
--- 5989,5998 ----
if (!reachedConsistency)
return false;
+ /* nothing to do if standby mode is not requested */
+ if (!StandbyModeRequested)
+ return false;
+
/*
* Is it a COMMIT record?
*
***************
*** 11869,11874 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
--- 11873,11884 ----
* hope...
*/
+ /*
+ * We should be able to move to XLOG_FROM_STREAM
+ * only in standby mode.
+ */
+ Assert(StandbyMode);
+
/*
* Before we leave XLOG_FROM_STREAM state, make sure that
* walreceiver is not active, so that it won't overwrite
***************
*** 11982,11987 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
--- 11992,12003 ----
{
bool havedata;
+ /*
+ * We should be able to move to XLOG_FROM_STREAM
+ * only in standby mode.
+ */
+ Assert(StandbyMode);
+
/*
* Check if WAL receiver is still active.
*/
On Mon, Sep 30, 2019 at 05:50:03PM +0900, Fujii Masao wrote:
Thanks for the review! OK, attached is the patch which also added
two assertion checks as you described.
Thanks, looks fine. The indentation looks a bit wrong for the
comments, but that's a nit.
--
Michael
On Mon, Sep 30, 2019 at 12:49 AM Fujii Masao <masao.fujii@gmail.com> wrote:
Hi,
I got the following assertion failure when I enabled recovery_min_apply_delay
and started archive recovery (i.e., I put only recovery.signal not
standby.signal).TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File:
"latch.c", Line: 522)Here is the example to reproduce the issue:
----------------------------
initdb -D data
pg_ctl -D data start
psql -c "alter system set recovery_min_apply_delay to '60s'"
psql -c "alter system set archive_mode to on"
psql -c "alter system set archive_command to 'cp %p ../arch/%f'"
psql -c "alter system set restore_command to 'cp ../arch/%f %p'"
mkdir arch
pg_basebackup -D bkp -c fast
pgbench -i
pgbench -t 1000
pg_ctl -D data -m i stop
rm -rf bkp/pg_wal
mv data/pg_wal bkp
rm -rf data
mv bkp data
touch data/recovery.signal
pg_ctl -D data -W start
----------------------------The latch that causes this assertion failure is recoveryWakeupLatch.
The ownership of this latch is taken only when standby mode is
requested. But this latch can be used when starting archive recovery
with recovery_min_apply_delay set even though it's unowned.
So the assertion failure happened.Attached patch fixes this issue by making archive recovery always ignore
recovery_min_apply_delay. This change is OK because
recovery_min_apply_delay was introduced for standby mode, I think.
No, I found the following description in the doc.
------------------------------
This parameter is intended for use with streaming replication deployments;
however, if the parameter is specified it will be honored in all cases
------------------------------
So archive recovery with recovery_min_apply_delay enabled would be
intended configuration. My patch that changes archive recovery so that
it always ignores thesetting might be in wrong direction. Maybe we should
make recovery_min_apply_delay work properly even in archive recovery.
Thought?
Regards,
--
Fujii Masao
On Fri, Oct 4, 2019 at 9:03 PM Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Sep 30, 2019 at 12:49 AM Fujii Masao <masao.fujii@gmail.com> wrote:
Hi,
I got the following assertion failure when I enabled recovery_min_apply_delay
and started archive recovery (i.e., I put only recovery.signal not
standby.signal).TRAP: FailedAssertion("latch->owner_pid == MyProcPid", File:
"latch.c", Line: 522)Here is the example to reproduce the issue:
----------------------------
initdb -D data
pg_ctl -D data start
psql -c "alter system set recovery_min_apply_delay to '60s'"
psql -c "alter system set archive_mode to on"
psql -c "alter system set archive_command to 'cp %p ../arch/%f'"
psql -c "alter system set restore_command to 'cp ../arch/%f %p'"
mkdir arch
pg_basebackup -D bkp -c fast
pgbench -i
pgbench -t 1000
pg_ctl -D data -m i stop
rm -rf bkp/pg_wal
mv data/pg_wal bkp
rm -rf data
mv bkp data
touch data/recovery.signal
pg_ctl -D data -W start
----------------------------The latch that causes this assertion failure is recoveryWakeupLatch.
The ownership of this latch is taken only when standby mode is
requested. But this latch can be used when starting archive recovery
with recovery_min_apply_delay set even though it's unowned.
So the assertion failure happened.Attached patch fixes this issue by making archive recovery always ignore
recovery_min_apply_delay. This change is OK because
recovery_min_apply_delay was introduced for standby mode, I think.No, I found the following description in the doc.
------------------------------
This parameter is intended for use with streaming replication deployments;
however, if the parameter is specified it will be honored in all cases
------------------------------So archive recovery with recovery_min_apply_delay enabled would be
intended configuration. My patch that changes archive recovery so that
it always ignores thesetting might be in wrong direction. Maybe we should
make recovery_min_apply_delay work properly even in archive recovery.
Thought?
Patch attached. This patch allows archive recovery with
recovery_min_apply_delay set, but not crash recovery.
Regards,
--
Fujii Masao
Attachments:
allow_recovery_min_apply_delay_during_archive_recovery_v1.patchapplication/octet-stream; name=allow_recovery_min_apply_delay_during_archive_recovery_v1.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 619ac8c50c..9c7868a888 100644
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 4206,4212 **** ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<para>
This parameter is intended for use with streaming replication deployments;
! however, if the parameter is specified it will be honored in all cases.
<varname>hot_standby_feedback</varname> will be delayed by use of this feature
which could lead to bloat on the master; use both together with care.
--- 4206,4213 ----
</para>
<para>
This parameter is intended for use with streaming replication deployments;
! however, if the parameter is specified it will be honored in all cases
! except crash recovery.
<varname>hot_standby_feedback</varname> will be delayed by use of this feature
which could lead to bloat on the master; use both together with care.
diff --git a/src/backend/accessindex 790e2c8714..8526407ddf 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5977,5982 **** recoveryApplyDelay(XLogReaderState *record)
--- 5977,5986 ----
if (!reachedConsistency)
return false;
+ /* nothing to do if crash recovery is requested */
+ if (!ArchiveRecoveryRequested && !StandbyModeRequested)
+ return false;
+
/*
* Is it a COMMIT record?
*
***************
*** 6350,6356 **** StartupXLOG(void)
* Take ownership of the wakeup latch if we're going to sleep during
* recovery.
*/
! if (StandbyModeRequested)
OwnLatch(&XLogCtl->recoveryWakeupLatch);
/* Set up XLOG reader facility */
--- 6354,6360 ----
* Take ownership of the wakeup latch if we're going to sleep during
* recovery.
*/
! if (ArchiveRecoveryRequested || StandbyModeRequested)
OwnLatch(&XLogCtl->recoveryWakeupLatch);
/* Set up XLOG reader facility */
***************
*** 7334,7340 **** StartupXLOG(void)
* We don't need the latch anymore. It's not strictly necessary to disown
* it, but let's do it for the sake of tidiness.
*/
! if (StandbyModeRequested)
DisownLatch(&XLogCtl->recoveryWakeupLatch);
/*
--- 7338,7344 ----
* We don't need the latch anymore. It's not strictly necessary to disown
* it, but let's do it for the sake of tidiness.
*/
! if (ArchiveRecoveryRequested || StandbyModeRequested)
DisownLatch(&XLogCtl->recoveryWakeupLatch);
/*
***************
*** 11869,11874 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
--- 11873,11884 ----
* hope...
*/
+ /*
+ * We should be able to move to XLOG_FROM_STREAM
+ * only in standby mode.
+ */
+ Assert(StandbyMode);
+
/*
* Before we leave XLOG_FROM_STREAM state, make sure that
* walreceiver is not active, so that it won't overwrite
***************
*** 11982,11987 **** WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
--- 11992,12003 ----
{
bool havedata;
+ /*
+ * We should be able to move to XLOG_FROM_STREAM
+ * only in standby mode.
+ */
+ Assert(StandbyMode);
+
/*
* Check if WAL receiver is still active.
*/
On Tue, Oct 08, 2019 at 02:18:00AM +0900, Fujii Masao wrote:
On Fri, Oct 4, 2019 at 9:03 PM Fujii Masao <masao.fujii@gmail.com> wrote:
So archive recovery with recovery_min_apply_delay enabled would be
intended configuration. My patch that changes archive recovery so that
it always ignores thesetting might be in wrong direction. Maybe we should
make recovery_min_apply_delay work properly even in archive recovery.
Thought?Patch attached. This patch allows archive recovery with
recovery_min_apply_delay set, but not crash recovery.
Right. In short it makes no sense to wait the delay when in crash
recovery. After more testing I have been able to reproduce the
failure myself.
+ /* nothing to do if crash recovery is requested */
+ if (!ArchiveRecoveryRequested && !StandbyModeRequested)
+ return false;
ArchiveRecoveryRequested will be set to true if recovery.signal or
standby.signal are found, so it seems to me that you can make all
those checks more simple by removing from the equation
StandbyModeRequested, no? StandbyModeRequested is never set to true
if ArchiveRecoveryRequested is not itself true.
It would be nice to test some scenario within 002_archiving.pl.
--
Michael
On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote:
ArchiveRecoveryRequested will be set to true if recovery.signal or
standby.signal are found, so it seems to me that you can make all
those checks more simple by removing from the equation
StandbyModeRequested, no? StandbyModeRequested is never set to true
if ArchiveRecoveryRequested is not itself true.
For the sake of the archives, this has been applied by Fujii-san as of
ec1259e8.
--
Michael
On 2019-Oct-19, Michael Paquier wrote:
On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote:
ArchiveRecoveryRequested will be set to true if recovery.signal or
standby.signal are found, so it seems to me that you can make all
those checks more simple by removing from the equation
StandbyModeRequested, no? StandbyModeRequested is never set to true
if ArchiveRecoveryRequested is not itself true.For the sake of the archives, this has been applied by Fujii-san as of
ec1259e8.
So, the commit message says
Fix failure of archive recovery with recovery_min_apply_delay enabled.
recovery_min_apply_delay parameter is intended for use with streaming
replication deployments. However, the document clearly explains that
the parameter will be honored in all cases if it's specified. So it should
take effect even if in archive recovery. But, previously, archive recovery
with recovery_min_apply_delay enabled always failed, and caused assertion
failure if --enable-caasert is enabled.
but I'm not clear how would this problem manifest in the case of a build
with assertions disabled. Will it keep sleeping beyond the specified
time?
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Dec 14, 2019 at 12:35 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
On 2019-Oct-19, Michael Paquier wrote:
On Thu, Oct 17, 2019 at 02:35:13PM +0900, Michael Paquier wrote:
ArchiveRecoveryRequested will be set to true if recovery.signal or
standby.signal are found, so it seems to me that you can make all
those checks more simple by removing from the equation
StandbyModeRequested, no? StandbyModeRequested is never set to true
if ArchiveRecoveryRequested is not itself true.For the sake of the archives, this has been applied by Fujii-san as of
ec1259e8.So, the commit message says
Fix failure of archive recovery with recovery_min_apply_delay enabled.
recovery_min_apply_delay parameter is intended for use with streaming
replication deployments. However, the document clearly explains that
the parameter will be honored in all cases if it's specified. So it should
take effect even if in archive recovery. But, previously, archive recovery
with recovery_min_apply_delay enabled always failed, and caused assertion
failure if --enable-caasert is enabled.but I'm not clear how would this problem manifest in the case of a build
with assertions disabled. Will it keep sleeping beyond the specified
time?
When assertion is disabled, the recovery exists with the following messages.
FATAL: cannot wait on a latch owned by another process
LOG: startup process (PID 81007) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
Regards,
--
Fujii Masao