[BUG] non archived WAL removed during production crash recovery

Started by Jehan-Guillaume de Rorthais, almost 6 years ago, 67 messages
2 attachment(s)

Hello,

A colleague of mine reported an unexpected behavior.

When a production cluster is in crash recovery, e.g. after killing a backend, the
WALs ready to be archived are removed before being archived.

See in attachment the reproduction script "non-arch-wal-on-recovery.bash".

This behavior was introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
XLogArchiveCheckDone() wrongly considers a production cluster in crash
recovery as a standby without archive_mode=always. So the check
concludes the WAL can be removed safely.

bool inRecovery = RecoveryInProgress();

/*
* The file is always deletable if archive_mode is "off". On standbys
* archiving is disabled if archive_mode is "on", and enabled with
* "always". On a primary, archiving is enabled if archive_mode is "on"
* or "always".
*/
if (!((XLogArchivingActive() && !inRecovery) ||
(XLogArchivingAlways() && inRecovery)))
return true;

Please find in attachment a patch that fixes this issue, using the following test
instead:

if (!((XLogArchivingActive() && !StandbyModeRequested) ||
(XLogArchivingAlways() && inRecovery)))
return true;

I'm not sure if we should rely on StandbyModeRequested for the second part of
the test as well, though. What was the point of relying on RecoveryInProgress()
to get the recovery status from shared mem?

Regards,

Attachments:

non-arch-wal-on-recovery.bash (application/octet-stream)
0001-Fix-WAL-retention-during-production-crash-recovery.patch (text/x-patch)
From aecb5f2de5c9ef3d2e7a4a4f29a7de61f8665373 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Tue, 31 Mar 2020 16:56:40 +0200
Subject: [PATCH] Fix WAL retention during production crash recovery

During crash recovery of a production cluster with archive_mode=on,
XLogArchiveCheckDone() was considering the cluster as inRecovery
without archive_mode=always. Because of this, non-archived WAL and
related .ready files were recycled or removed.
---
 src/backend/access/transam/xlogarchive.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..088206ac38 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -580,7 +580,7 @@ XLogArchiveCheckDone(const char *xlog)
 	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
 	 * or "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
+	if (!((XLogArchivingActive() && !StandbyModeRequested) ||
 		  (XLogArchivingAlways() && inRecovery)))
 		return true;
 
-- 
2.20.1

#2 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/01 0:22, Jehan-Guillaume de Rorthais wrote:

Hello,

A colleague of mine reported an expected behavior.

On production cluster is in crash recovery, eg. after killing a backend, the
WALs ready to be archived are removed before being archived.

See in attachment the reproduction script "non-arch-wal-on-recovery.bash".

This behavior has been introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
Function XLogArchiveCheckDone() badly consider the in crashed recovery
production cluster as a standby without archive_mode=always. So the check
conclude the WAL can be removed safely.

Thanks for the report! Yeah, this seems to be a bug.

bool inRecovery = RecoveryInProgress();

/*
* The file is always deletable if archive_mode is "off". On standbys
* archiving is disabled if archive_mode is "on", and enabled with
* "always". On a primary, archiving is enabled if archive_mode is "on"
* or "always".
*/
if (!((XLogArchivingActive() && !inRecovery) ||
(XLogArchivingAlways() && inRecovery)))
return true;

Please find in attachment a patch that fix this issue using the following test
instead:

if (!((XLogArchivingActive() && !StandbyModeRequested) ||
(XLogArchivingAlways() && inRecovery)))
return true;

I'm not sure if we should rely on StandbyModeRequested for the second part of
the test as well thought. What was the point to rely on RecoveryInProgress() to
get the recovery status from shared mem?

Since StandbyModeRequested is a startup process-local variable,
it basically cannot be used in XLogArchiveCheckDone(), which other processes
may call. So another approach would be necessary... One straightforward idea is to
add a new shmem flag indicating whether we are in standby mode or not.
Another is to make the startup process remove .ready files if necessary.

If it's not easy to fix the issue, we might need to just revert the commit
that introduced it in the first place.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

In reply to: Fujii Masao (#2)
2 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 1 Apr 2020 17:27:22 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
[...]

bool inRecovery = RecoveryInProgress();

/*
* The file is always deletable if archive_mode is "off". On standbys
* archiving is disabled if archive_mode is "on", and enabled with
* "always". On a primary, archiving is enabled if archive_mode is "on"
* or "always".
*/
if (!((XLogArchivingActive() && !inRecovery) ||
(XLogArchivingAlways() && inRecovery)))
return true;

Please find in attachment a patch that fix this issue using the following
test instead:

if (!((XLogArchivingActive() && !StandbyModeRequested) ||
(XLogArchivingAlways() && inRecovery)))
return true;

I'm not sure if we should rely on StandbyModeRequested for the second part
of the test as well thought. What was the point to rely on
RecoveryInProgress() to get the recovery status from shared mem?

Since StandbyModeRequested is the startup process-local variable,
it basically cannot be used in XLogArchiveCheckDone() that other process
may call.

Ok, that answers what I was wondering about using the recovery status from shared
mem. This was obvious. Thanks for your help!

I was wondering if we could use "ControlFile->state != DB_IN_CRASH_RECOVERY".
It seems fine during crash recovery, as the control file is updated before the
checkpoint, but it doesn't feel right for other code paths where the control
file might not be up-to-date on the filesystem, right?
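
Just to illustrate that idea (a rough sketch only; ControlFile itself is private
to xlog.c, so this assumes some hypothetical accessor, here called
GetControlFileDBState(), exposing ControlFile->state to other backends):

	bool		inRecovery = RecoveryInProgress();

	/* hypothetical: never take the shortcut while in crash recovery */
	if (GetControlFileDBState() != DB_IN_CRASH_RECOVERY &&
		!((XLogArchivingActive() && !inRecovery) ||
		  (XLogArchivingAlways() && inRecovery)))
		return true;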

So another approach would be necessary... One straight idea is to
add new shmem flag indicating whether we are in standby mode or not.

I was thinking about setting XLogCtlData->SharedRecoveryInProgress as an enum
using:

enum RecoveryState
{
	NOT_IN_RECOVERY = 0,
	IN_CRASH_RECOVERY,
	IN_ARCHIVE_RECOVERY
};

Please, find in attachment a patch implementing this.

Plus, I added a second commit to add one test related to this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure I understand this one.

Regards,

Attachments:

0001-v2-Fix-WAL-retention-during-production-crash-recovery.patch (text/x-patch)
From e5e0f7a0bf9d5f0b4097c53016fd7cc381ef551c Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Wed, 1 Apr 2020 16:36:06 +0200
Subject: [PATCH 1/2] Fix WAL retention during production crash recovery

During crash recovery of a production cluster with archive_mode=on,
XLogArchiveCheckDone() was considering the cluster as inRecovery
without archive_mode=always. Because of this, non-archived WAL and
related .ready files were recycled or removed.
---
 src/backend/access/transam/xlog.c        | 25 +++++++++++++++++-------
 src/backend/access/transam/xlogarchive.c | 11 ++++++++---
 src/include/access/xlog.h                |  9 +++++++++
 3 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 977d448f50..55d06a8704 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -219,8 +219,8 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * This is false when SharedRecoveryInProgress is not IN_ARCHIVE_RECOVERY.
+ * True actually means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -653,10 +653,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryInProgress indicates if we're either in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryInProgress;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -5131,7 +5131,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryInProgress = IN_ARCHIVE_RECOVERY;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
@@ -6901,6 +6901,9 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryInProgress = IN_CRASH_RECOVERY;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
@@ -7928,7 +7931,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryInProgress = NOT_IN_RECOVERY;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8074,7 +8077,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryInProgress != NOT_IN_RECOVERY);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8103,6 +8106,14 @@ RecoveryInProgress(void)
 	}
 }
 
+RecoveryState
+GetRecoveryState(void)
+{
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	return xlogctl->SharedRecoveryInProgress;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..0f9e10bbef 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,16 +572,21 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+	RecoveryState	inRecoveryState = GetRecoveryState();
 
+		ereport(DEBUG2,
+			(errmsg("XLogArchivingActive(): %d, XLogArchivingAlways: %d, inRecoveryState: %d (%d, %d, %d)",
+					XLogArchivingActive(), XLogArchivingAlways(),
+					inRecoveryState,
+					NOT_IN_RECOVERY, IN_CRASH_RECOVERY, IN_ARCHIVE_RECOVERY)));
 	/*
 	 * The file is always deletable if archive_mode is "off".  On standbys
 	 * archiving is disabled if archive_mode is "on", and enabled with
 	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
 	 * or "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!((XLogArchivingActive() && inRecoveryState != IN_ARCHIVE_RECOVERY) ||
+		  (XLogArchivingAlways() && inRecoveryState == IN_ARCHIVE_RECOVERY)))
 		return true;
 
 	/* First check for .done --- this means archiver is done with it */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 9ec7b31cce..d8b08d6d17 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -165,6 +165,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -278,6 +286,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
-- 
2.20.1

0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch (text/x-patch)
From 4ad24cd903766fc21132bc5ff4a80be5969d492f Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Wed, 1 Apr 2020 18:12:17 +0200
Subject: [PATCH 2/2] Add test on non archived WAL during crash recovery

---
 src/test/recovery/t/011_crash_recovery.pl | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..ad512b881f 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,11 +15,17 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 4;
 }
 
 my $node = get_new_node('master');
-$node->init(allows_streaming => 1);
+$node->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+
+$node->append_conf('postgresql.conf',
+	"archive_command = ''");
+
 $node->start;
 
 my ($stdin, $stdout, $stderr) = ('', '', '');
@@ -43,6 +49,7 @@ $stdin .= q[
 BEGIN;
 CREATE TABLE mine(x integer);
 SELECT txid_current();
+SELECT pg_switch_wal();
 ];
 $tx->pump until $stdout =~ /[[:digit:]]+[\r\n]$/;
 
@@ -65,4 +72,9 @@ cmp_ok($node->safe_psql('postgres', 'SELECT txid_current()'),
 is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
+# make sure WAL files are still waiting to be archived
+my $node_data = $node->data_dir;
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL still waiting to be archived");
+
 $tx->kill_kill;
-- 
2.20.1

#4 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#3)
Re: [BUG] non archived WAL removed during production crash recovery

Hello.

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch has
fixed. Specifically, a standby running crash recovery wrongly marks a
WAL file as ".ready" if it is present in pg_wal without an accompanying
.ready file.

Perhaps checking for '.ready' before checking the archive mode would
be sufficient.
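
To make that a bit more concrete, a rough sketch of the ordering I have in
mind (only an illustration of the idea, reusing the existing
archiveStatusPath/stat_buf/inRecovery locals of XLogArchiveCheckDone(), not a
reviewed patch):

	/* the file is always deletable if archive_mode is "off" */
	if (!XLogArchivingActive())
		return true;

	/* first check for .done --- the archiver is done with it */
	StatusFilePath(archiveStatusPath, xlog, ".done");
	if (stat(archiveStatusPath, &stat_buf) == 0)
		return true;

	/* a pre-existing .ready: keep the file if archiving still wants it */
	StatusFilePath(archiveStatusPath, xlog, ".ready");
	if (stat(archiveStatusPath, &stat_buf) == 0)
		return !((XLogArchivingActive() && !inRecovery) ||
				 (XLogArchivingAlways() && inRecovery));

	/* no marker at all: do not create a new .ready while in recovery */
	if (inRecovery)
		return true;

	/* normal primary: retry creation of the .ready file */
	XLogArchiveNotify(xlog);
	return false;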

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: [BUG] non archived WAL removed during production crash recovery

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch has
fixed. Specifically, a standby running crash recovery wrongly marks a
WAL file as ".ready" if it is present in pg_wal without an accompanying
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa has
fixed. Specifically, a standby running crash recovery wrongly marks a
WAL file as ".ready" if it is present in pg_wal without an accompanying
.ready file.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#5)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 13:07, Kyotaro Horiguchi wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

On second thought, I think that we should discuss what the desirable
behavior is before the implementation. Especially, what's unclear to me
is whether to remove such WAL files in the archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

Therefore, IMO the patch should change the code so that
no unarchived WAL files are removed, not only in crash recovery
but also in archive recovery. Thought?

Of course, this change would lead to the issue that past unarchived
WAL files keep remaining in the case of a warm standby using archive
recovery. But this issue looks unavoidable. If users want to avoid that,
archive_mode should be set to always.

Also, I'm a bit wondering if it's really safe to remove such unarchived
WAL files even in the standby case with archive_mode=on. I would need
more time to think about that.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

I was thinking of making the startup process remove such unarchived WAL files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.
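
A rough sketch of what I mean, assuming it lives in xlog.c where
ArchiveRecoveryRequested and StandbyModeRequested are visible (the function
name and its exact placement are hypothetical, this is not a proposal yet):

static void
RemoveStaleArchiveStatusFiles(void)
{
	DIR		   *dir;
	struct dirent *de;

	/* only when this server itself will not archive these segments */
	if (!XLogArchivingActive() || XLogArchivingAlways())
		return;
	if (!ArchiveRecoveryRequested && !StandbyModeRequested)
		return;

	dir = AllocateDir(XLOGDIR "/archive_status");
	while ((de = ReadDir(dir, XLOGDIR "/archive_status")) != NULL)
	{
		char		path[MAXPGPATH];
		size_t		len = strlen(de->d_name);

		if (len <= 6 || strcmp(de->d_name + len - 6, ".ready") != 0)
			continue;

		snprintf(path, sizeof(path), XLOGDIR "/archive_status/%s", de->d_name);
		elog(DEBUG2, "removing stale archive status file \"%s\"", path);
		unlink(path);
	}
	FreeDir(dir);
}

Once the .ready markers are gone, the segments themselves would be recycled by
the usual checkpoint machinery.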

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#6)
Re: [BUG] non archived WAL removed during production crash recovery

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/04/02 13:07, Kyotaro Horiguchi wrote:

Sorry, it was quite ambiguous.
At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

On second thought, I think that we should discuss what the desirable
behavior is before the implentation. Especially what's unclear to me

Agreed.

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it is the
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate on what operational steps
you are assuming?

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed, if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail was written
based on the same thought.

In a very narrow window, if the server crashes or is killed after a segment
is finished but before the file is marked .ready, the file doesn't
have .ready but should be archived. If we need to get rid of such a
window, it would help to mark a WAL file as ".busy" at creation time.
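
Purely as a sketch of that ".busy" idea (these helpers do not exist today and
the names are made up; StatusFilePath, AllocateFile and durable_rename are the
existing facilities they would build on):

static void
XLogArchiveMarkBusy(const char *xlog)
{
	char		busyPath[MAXPGPATH];
	FILE	   *fd;

	/* create an empty ".busy" marker as soon as the segment is created */
	StatusFilePath(busyPath, xlog, ".busy");
	fd = AllocateFile(busyPath, "w");
	if (fd != NULL)
		FreeFile(fd);
}

static void
XLogArchiveBusyToReady(const char *xlog)
{
	char		busyPath[MAXPGPATH];
	char		readyPath[MAXPGPATH];

	/* promote ".busy" to ".ready" once the segment is complete */
	StatusFilePath(busyPath, xlog, ".busy");
	StatusFilePath(readyPath, xlog, ".ready");
	(void) durable_rename(busyPath, readyPath, WARNING);
}

That way a finished segment could never be left without any status marker.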

Of course, this change would lead to the issue that the past
unarchived
WAL files keep remaining in the case of warm-standby using archive
recovery. But this issue looks unavoidable. If users want to avoid
that,
archive_mode should be set to always.

Also I'm a bit wondering if it's really safe to remove such unarchived
WAL files even in the standby case with archive_mode=on. I would need
more time to think that.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#7)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 16:23, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/04/02 13:07, Kyotaro Horiguchi wrote:

Sorry, it was quite ambiguous.
At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

On second thought, I think that we should discuss what the desirable
behavior is before the implentation. Especially what's unclear to me

Agreed.

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate what operation steps are
you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail is written
based on the same thought.

Ok, so our *current* consensus seems to be the following. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.
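
Or, expressed as code just to make the consensus concrete (illustration only;
the in_standby_mode parameter stands for exactly the piece of information that
XLogArchiveCheckDone() cannot obtain today):

static bool
ReadySegmentIsRemovable(bool in_standby_mode)
{
	if (!XLogArchivingActive())
		return true;			/* archive_mode = off: always removable */
	if (XLogArchivingAlways())
		return false;			/* archive_mode = always: archive everywhere */

	/* archive_mode = on: keep the segment everywhere except in standby mode */
	return in_standby_mode;
}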

In a very narrow window, if server crashed or killed after a segment
is finished but before marking the file as .ready, the file doesn't
have .ready but should be archived. If we need to get rid of such a
window, it would help to mark a WAL file as ".busy" at creation time.

Of course, this change would lead to the issue that the past
unarchived
WAL files keep remaining in the case of warm-standby using archive
recovery. But this issue looks unavoidable. If users want to avoid
that,
archive_mode should be set to always.

Also I'm a bit wondering if it's really safe to remove such unarchived
WAL files even in the standby case with archive_mode=on. I would need
more time to think that.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

Maybe yes. But I'm not confident that there is no such case.
Anyway, I'm fine with fixing the bug based on the above consensus first.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In reply to: Kyotaro Horiguchi (#5)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY are
only set for production clusters, not standby ones. So the following test
should never match a standby cluster:

(XLogArchivingActive() && inRecoveryState != IN_ARCHIVE_RECOVERY)

Forgive me if I'm wrong, but am I missing something?

Regards,

In reply to: Fujii Masao (#8)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 16:23, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote in

[...]

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate what operation steps are
you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

If you rely on a snapshot without pg_start/stop_backup, I agree. These WALs
should be archived if:

* archive_mode >= on for primary
* archive_mode = always for standby

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail is written
based on the same thought.

Ok, so our *current* consensus seems the followings. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

yes

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

yes

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

yes

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

[...]

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

Ok, understood.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

Maybe yes. But I'm not confident about that there is no such case.

Well, it seems to me that this is what you suggested a few paragraphs above:

«.ready files are removed when either archive_mode!=always in standby mode»

#11 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Jehan-Guillaume de Rorthais (#10)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 16:23, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote in

[...]

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate what operation steps are
you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

If you rely on snapshot without pg_start/stop_backup, I agree. Theses WAL
should be archived if:

* archive_mode >= on for primary
* archive_mode = always for standby

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail is written
based on the same thought.

Ok, so our *current* consensus seems the followings. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

yes

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

yes

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

yes

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

[...]

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

Ok, understood.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

Maybe yes. But I'm not confident about that there is no such case.

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* a case where such WAL files
should not be removed is found, I'd just like to propose an additional patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#12 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Jehan-Guillaume de Rorthais (#9)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY are
only set for production clusters, not standby ones.

DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
if you start the standby from a cold backup of the primary,
since InArchiveRecovery is false at the beginning of the recovery,
DB_IN_CRASH_RECOVERY is set at that moment. But then, after all the valid
WAL in pg_wal has been replayed, InArchiveRecovery is set to true and
DB_IN_ARCHIVE_RECOVERY is set.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In reply to: Fujii Masao (#12)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 23:58:00 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY
are only set for production clusters, not standby ones.

DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
if you start the standby from the cold backup of the primary,

In a cold backup? Then ControlFile->state == DB_SHUTDOWNED, right?

Unless I'm wrong, this should be caught by:

if (ArchiveRecoveryRequested && ( [...] ||
ControlFile->state == DB_SHUTDOWNED))
{
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
}

With InArchiveRecovery=true, we later set DB_IN_ARCHIVE_RECOVERY instead of
DB_IN_CRASH_RECOVERY.

since InArchiveRecovery is false at the beginning of the recovery,
DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
DB_IN_ARCHIVE_RECOVERY is set.

However, I suppose this is true if you restore a backup from a snapshot
without backup_label, right?

In reply to: Fujii Masao (#11)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby
mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Do you mean you want to produce the next patch yourself?

#15 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#14)
Re: [BUG] non archived WAL removed during production crash recovery

At Thu, 2 Apr 2020 17:44:50 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby
mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Do you mean to want to produce the next patch yourself?

No. Fujii-san is saying that he will address it if the fix made in
this thread is found to be imperfect later.

He suspects that WAL files should be preserved at least in certain
cases, even if that means a file persists forever, but the consensus here is to
remove files under certain conditions so that no WAL file
persists forever in the pg_wal directory.

Feel free to propose the next version!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Jehan-Guillaume de Rorthais (#13)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/03 0:37, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 23:58:00 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY
are only set for production clusters, not standby ones.

DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
if you start the standby from the cold backup of the primary,

In cold backup? Then ControlFile->state == DB_SHUTDOWNED, right?

Unless I'm wrong, this should be catched by:

if (ArchiveRecoveryRequested && ( [...] ||
ControlFile->state == DB_SHUTDOWNED))
{
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
}

With InArchiveRecovery=true, we later set DB_IN_ARCHIVE_RECOVERY instead of
DB_IN_CRASH_RECOVERY.

Yes, you're right. So I should have mentioned one more condition in my
previous email. The condition is that the cold backup was taken from
a server that was shut down with immediate mode. In this case,
the code block that you pointed to is skipped and InArchiveRecovery is
not set to true there.

since InArchiveRecovery is false at the beginning of the recovery,
DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
DB_IN_ARCHIVE_RECOVERY is set.

However, I suppose this is true if you restore a backup from a snapshot
without backup_label, right?

Maybe yes.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#17 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#15)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/03 10:14, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 17:44:50 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby
mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Do you mean to want to produce the next patch yourself?

No. Fujii-san is saying that he will address it, if the fix made in
this thread is found to be imperfect later.

He suspects that WAL files should be preserved at least in certain
cases even if the file persists forever, but the consensus here is to
remove files under certain conditions so as not to no WAL file
persists in pg_wal directory.

Yes.

Feel free to propose the next version!

Yes!

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.

Thought?
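
A rough sketch of the pieces (illustration only, all names are provisional):

/* in XLogCtlData */
	bool		SharedStandbyModeRequested;		/* initialized to false */

/* in the startup process, once standby.signal has been found */
	StandbyModeRequested = true;
	SpinLockAcquire(&XLogCtl->info_lck);
	XLogCtl->SharedStandbyModeRequested = true;
	SpinLockRelease(&XLogCtl->info_lck);

/* accessor callable from any process */
bool
StandbyModeIsRequested(void)
{
	volatile XLogCtlData *xlogctl = XLogCtl;

	return xlogctl->SharedStandbyModeRequested;
}

/* and in XLogArchiveCheckDone() */
	if (!((XLogArchivingActive() && !StandbyModeIsRequested()) ||
		  (XLogArchivingAlways() && StandbyModeIsRequested())))
		return true;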

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In reply to: Fujii Masao (#11)
2 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

To some extent, it appears to me this sentence was relatively close to my
previous patch, as long as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
[...]
if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
(inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
(inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
return true;

Which means that only .done cleanup would occur during CRASH_RECOVERY,
and .ready files might be created if no .done exists, no matter the future status
of the cluster: primary or standby. The normal shortcut will apply during the first
checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from a
snapshot or after a shutdown with immediate mode) is restored to build a standby.
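
As a reminder, after this shortcut the rest of XLogArchiveCheckDone() is
roughly the following (paraphrased from the current sources), which is why
going past the shortcut during crash recovery only checks or creates status
files and never removes anything by itself:

	/* first check for .done --- this means the archiver is done with it */
	StatusFilePath(archiveStatusPath, xlog, ".done");
	if (stat(archiveStatusPath, &stat_buf) == 0)
		return true;

	/* check for .ready --- this means it is not archived yet */
	StatusFilePath(archiveStatusPath, xlog, ".ready");
	if (stat(archiveStatusPath, &stat_buf) == 0)
		return false;

	/* race condition --- maybe the archiver just finished, so recheck .done */
	StatusFilePath(archiveStatusPath, xlog, ".done");
	if (stat(archiveStatusPath, &stat_buf) == 0)
		return true;

	/* otherwise, retry creation of the .ready file */
	XLogArchiveNotify(xlog);
	return false;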

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.

Thought?

I tried to avoid a new flag in memory with the proposal attached to this
email. It seems to me that various combinations of booleans with subtle differences
around the same subject make it a bit trappy and complicated to understand.

If my proposal is rejected, I'll be happy to volunteer to add
XLogCtl->SharedStandbyModeRequested though. It seems like a simple enough fix
as well.

Regards,

Attachments:

0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch (text/x-patch)
From 834235add60df8203c243fa30e78cfa62619076b Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Wed, 1 Apr 2020 16:36:06 +0200
Subject: [PATCH 1/2] Fix WAL retention during production crash recovery

During crash recovery of a production cluster with archive_mode=on,
XLogArchiveCheckDone() was considering the cluster as inRecovery
without archive_mode=always. Because of this, non-archived WAL and
related .ready files were recycled or removed.
---
 src/backend/access/transam/xlog.c        | 25 +++++++++++++++++-------
 src/backend/access/transam/xlogarchive.c |  7 ++++---
 src/include/access/xlog.h                |  9 +++++++++
 3 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 977d448f50..55d06a8704 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -219,8 +219,8 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * This is false when SharedRecoveryInProgress is not IN_ARCHIVE_RECOVERY.
+ * True actually means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -653,10 +653,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryInProgress indicates if we're either in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryInProgress;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -5131,7 +5131,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryInProgress = IN_ARCHIVE_RECOVERY;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
@@ -6901,6 +6901,9 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryInProgress = IN_CRASH_RECOVERY;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
@@ -7928,7 +7931,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryInProgress = NOT_IN_RECOVERY;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8074,7 +8077,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryInProgress != NOT_IN_RECOVERY);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8103,6 +8106,14 @@ RecoveryInProgress(void)
 	}
 }
 
+RecoveryState
+GetRecoveryState(void)
+{
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	return xlogctl->SharedRecoveryInProgress;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..25666cdbcf 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,7 +572,7 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+	RecoveryState	inRecoveryState = GetRecoveryState();
 
 	/*
 	 * The file is always deletable if archive_mode is "off".  On standbys
@@ -580,8 +580,9 @@ XLogArchiveCheckDone(const char *xlog)
 	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
 	 * or "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
+		  (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) &&
+		  (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
 		return true;
 
 	/* First check for .done --- this means archiver is done with it */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 9ec7b31cce..d8b08d6d17 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -165,6 +165,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -278,6 +286,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
-- 
2.20.1

0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch (text/x-patch)
From 4ad24cd903766fc21132bc5ff4a80be5969d492f Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Wed, 1 Apr 2020 18:12:17 +0200
Subject: [PATCH 2/2] Add test on non archived WAL during crash recovery

---
 src/test/recovery/t/011_crash_recovery.pl | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 526a3481fb..ad512b881f 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,11 +15,17 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 4;
 }
 
 my $node = get_new_node('master');
-$node->init(allows_streaming => 1);
+$node->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+
+$node->append_conf('postgresql.conf',
+	"archive_command = ''");
+
 $node->start;
 
 my ($stdin, $stdout, $stderr) = ('', '', '');
@@ -43,6 +49,7 @@ $stdin .= q[
 BEGIN;
 CREATE TABLE mine(x integer);
 SELECT txid_current();
+SELECT pg_switch_wal();
 ];
 $tx->pump until $stdout =~ /[[:digit:]]+[\r\n]$/;
 
@@ -65,4 +72,9 @@ cmp_ok($node->safe_psql('postgres', 'SELECT txid_current()'),
 is($node->safe_psql('postgres', qq[SELECT txid_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
+# make sure WAL files are still waiting to be archived
+my $node_data = $node->data_dir;
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL still waiting to be archived");
+
 $tx->kill_kill;
-- 
2.20.1

#19 Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Jehan-Guillaume de Rorthais (#18)
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

To some extends, it appears to me this sentence was relatively close to my
previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
[...]
if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
(inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
(inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
return true;

Which means that only .done cleanup would occurs during CRASH_RECOVERY
and .ready files might be created if no .done exists. No matter the futur status
of the cluster: primary or standby. Normal shortcut will apply during first
checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from a
snapshot or after a shutdown with immediate) is restored to build a standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

Thanks for updating the patch! Here are my review comments:

-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryInProgress;

Since the patch changes the meaning of this variable, shouldn't the name of
the variable be changed? Otherwise, the current name seems
confusing.

+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryInProgress = IN_CRASH_RECOVERY;
+			SpinLockRelease(&XLogCtl->info_lck);

As I explained upthread, this code can be reached and IN_CRASH_RECOVERY
can be set even in standby or archive recovery. Is this the behavior that
you're expecting?

Even in the crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY
until this code is reached. Also, when WAL replay is not necessary
(e.g., restart of a server that was shut down cleanly before), GetRecoveryState()
returns IN_ARCHIVE_RECOVERY because this code is not reached. Aren't
these fragile? If XLogArchiveCheckDone() is the only user of GetRecoveryState(),
they would be ok. But if another user appears in the future, it seems
very easy to make a mistake. At least those behaviors should be commented in
GetRecoveryState().
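
For example, something along these lines above GetRecoveryState() (the wording
is only a suggestion):

/*
 * Return the current recovery state from shared memory.
 *
 * Note: the shared value starts as IN_ARCHIVE_RECOVERY and is only changed
 * to IN_CRASH_RECOVERY or NOT_IN_RECOVERY when the startup process reaches
 * the corresponding points in StartupXLOG(), so callers may transiently see
 * IN_ARCHIVE_RECOVERY even when no archive recovery will actually happen.
 */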

-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
+		  (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) &&
+		  (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
		return true;

The last condition seems to cause XLogArchiveCheckDone() to return
true in archive recovery mode with archive_mode=on, then cause
unarchived WAL files with .ready to be removed. Is my understanding right?
If yes, that behavior doesn't seem to match with our consensus, i.e.,
WAL files with .ready should not be removed in that case.

+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;

Isn't it better to add more comments here? For example, what does
"Recovery state" mean? Which state is used in standby mode? Why? ,etc.

Is it really ok not to have the value indicating standby mode?

These enum value names are confusing because variables with
similar names already exist. For example, IN_CRASH_RECOVERY vs.
DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them,
e.g., by adding a prefix.

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.
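
Roughly, the idea would look like this (rough, untested sketch only; the
accessor name StandbyModeIsRequested() is made up):

    /* xlog.c: new field in XLogCtlData, set to false in XLOGShmemInit() */
    bool        SharedStandbyModeRequested;

    /* xlog.c: when the startup process finds standby.signal */
    StandbyModeRequested = true;
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->SharedStandbyModeRequested = true;
    SpinLockRelease(&XLogCtl->info_lck);

    /* xlogarchive.c: XLogArchiveCheckDone() would then use an accessor
     * reading that shared flag instead of RecoveryInProgress() */
    if (!((XLogArchivingActive() && !StandbyModeIsRequested()) ||
          (XLogArchivingAlways() && StandbyModeIsRequested())))
        return true;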

Thought?

I try to avoid a new flag in memory with the proposal attached to this
email. It seems to me that various combinations of booleans with subtle differences
around the same subject make it a bit trappy and complicated to understand.

Ok, so firstly I try to review your patch!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#20 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Fujii Masao (#19)
2 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Sat, 4 Apr 2020 02:49:50 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

To some extent, it appears to me this sentence was relatively close to my
previous patch, as long as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
[...]
if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
(inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
(inRecoveryState == IN_ARCHIVE_RECOVERY
&& !XLogArchivingAlways()))) return true;

Which means that only .done cleanup would occur during CRASH_RECOVERY,
and .ready files might be created if no .done exists, no matter the future
status of the cluster: primary or standby. The normal shortcut will apply
during the first checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from
a snapshot or after a shutdown with immediate) is restored to build a
standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

Thanks for updating the patch! Here are my review comments:

-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryInProgress;

Since the patch changes the meaning of this variable, the name of
the variable should be changed? Otherwise, the current name seems
confusing.

Indeed, fixed using SharedRecoveryState

+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryInProgress =
IN_CRASH_RECOVERY;
+			SpinLockRelease(&XLogCtl->info_lck);

As I explained upthread, this code can be reached and IN_CRASH_RECOVERY
can be set even in standby or archive recovery. Is this right behavior that
you're expecting?

Yes. This patch avoids archive cleanup during crash recovery altogether,
whatever the requested status for the cluster.

Even in the crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY
until this code is reached.

I tried to stick as close as possible to the "ControlFile->state" and old
XLogCtl->SharedRecoveryInProgress variables. That's why it's initialized as
IN_ARCHIVE_RECOVERY, as XLogCtl->SharedRecoveryInProgress was initialized to true as
well.

The status itself is set during StartupXLOG when the historical code actually
tries to define and record the real state between DB_IN_ARCHIVE_RECOVERY and
DB_IN_CRASH_RECOVERY.

Also, when WAL replay is not necessary (e.g., restart of a server that was shut
down cleanly before), GetRecoveryState() returns IN_ARCHIVE_RECOVERY because this
code is not reached.

It is set to NOT_IN_RECOVERY at the end of StartupXLOG, in the same place we
set ControlFile->state = DB_IN_PRODUCTION. So GetRecoveryState() returns
NOT_IN_RECOVERY as soon as StartupXLOG is done when no WAL replay is necessary.

Aren't these fragile? If XLogArchiveCheckDone() is the only user of
GetRecoveryState(), they would be ok. But if another user appears in the
future, it seems very easy to make a mistake. At least those behaviors should be
commented in GetRecoveryState().

We certainly can set SharedRecoveryState earlier, in XLOGShmemInit, based on
the ControlFile->state value. In my understanding, anything different than
DB_SHUTDOWNED or DB_SHUTDOWNED_IN_RECOVERY can be considered as crash
recovery. Based on this, XLOGShmemInit can init SharedRecoveryState to
IN_ARCHIVE_RECOVERY or IN_CRASH_RECOVERY.
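
i.e. something like this (it is what the attached v4 patch does in
XLOGShmemInit(), with the enum values renamed as in that patch):

    switch (ControlFile->state)
    {
        case DB_SHUTDOWNED:
        case DB_SHUTDOWNED_IN_RECOVERY:
            XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
            break;
        default:
            XLogCtl->SharedRecoveryState = CRASH_RECOVERING;
    }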

With either value, RecoveryInProgress() keeps returning the same result, so any
current code relying on RecoveryInProgress() is compatible.

I'm not sure who would need this information before the WAL machinery is up,
but is it safe enough in your opinion for future usage of GetRecoveryState()? Do
you think this information might be useful before the WAL machinery is set up?
The current "user" (e.g. restoreTwoPhaseData()) only needs to know if we are in
recovery, whatever the reason.

The patch in attachment sets SharedRecoveryState to either IN_ARCHIVE_RECOVERY
or IN_CRASH_RECOVERY from XLOGShmemInit based on the ControlFile state. It
feels strange though to set this so far away from ControlFile->state
= DB_IN_CRASH_RECOVERY...

-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
+		  (inRecoveryState == NOT_IN_RECOVERY
&& !XLogArchivingActive()) &&
+		  (inRecoveryState == IN_ARCHIVE_RECOVERY
&& !XLogArchivingAlways()))) return true;

The last condition seems to cause XLogArchiveCheckDone() to return
true in archive recovery mode with archive_mode=on, then cause
unarchived WAL files with .ready to be removed. Is my understanding right?
If yes, that behavior doesn't seem to match with our consensus, i.e.,
WAL files with .ready should not be removed in that case.

We wrote:

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

So if in standby mode and archive_mode is not "always", the .ready files
should be removed.

In the current patch, I split the conditions for the sake of clarity.

+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;

Isn't it better to add more comments here? For example, what does
"Recovery state" mean? Which state is used in standby mode? Why? ,etc.

Explanations added.

Is it really ok not to have the value indicating standby mode?

Unless I'm wrong, this shared state only indicates why we are recovering WAL,
it does not need to reflect the status of the instance in shared memory.
StandbyMode is already available for the startup process. Would it be useful
to share this mode outside of the startup process?

These enum value names are confusing because variables with
similar names already exist. For example, IN_CRASH_RECOVERY vs.
DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them,
e.g., by adding a prefix.

Well, I lack imagination. So I picked CRASH_RECOVERING and
ARCHIVE_RECOVERING.

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.

Thought?

I try to avoid a new flag in memory with the proposal attached to this
email. It seems to me that various combinations of booleans with subtle
differences around the same subject make it a bit trappy and complicated
to understand.

Ok, so firstly I try to review your patch!

Thank you for your review and help!

In attachment:
* fix proposal: 0001-v4-Fix-WAL-retention-during-production-crash-recovery.patch
* add test: 0002-v2-Add-test-on-non-archived-WAL-during-crash-recovery.patch
0002-v2 fixes a conflict with current master.

Regards,

Attachments:

0001-v4-Fix-WAL-retention-during-production-crash-recovery.patch (text/x-patch)
From 147f616565bc652950bcd928a870cd3aa3dfb805 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Wed, 1 Apr 2020 16:36:06 +0200
Subject: [PATCH 1/2] Fix WAL retention during production crash recovery

During crash recovery of a production cluster with archive_mode=on,
XLogArchiveCheckDone() was considering the cluster as inRecovery
without archive_mode=always. Because of this non-arcived WAL and
related .ready files were recycled or removed.
---
 src/backend/access/transam/xlog.c        | 45 +++++++++++++++++++-----
 src/backend/access/transam/xlogarchive.c | 21 +++++++----
 src/include/access/xlog.h                | 23 ++++++++++++
 3 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abf954ba39..a7a5ce6ef4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -220,8 +220,8 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * This is false when SharedRecoveryState is not ARCHIVE_RECOVERING.
+ * True actually means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -652,10 +652,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're either in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -4407,6 +4407,11 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5139,11 +5144,20 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
 
+	switch (ControlFile->state)
+	{
+		case DB_SHUTDOWNED:
+		case DB_SHUTDOWNED_IN_RECOVERY:
+			XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+			break;
+		default:
+			XLogCtl->SharedRecoveryState = CRASH_RECOVERING;
+	}
+
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	SpinLockInit(&XLogCtl->ulsn_lck);
@@ -7884,7 +7898,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = NOT_IN_RECOVERY;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8030,7 +8044,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != NOT_IN_RECOVERY);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8042,8 +8056,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8059,6 +8073,19 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * See details about RecoveryState in xlog.h
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	volatile XLogCtlData *xlogctl = XLogCtl;
+
+	return xlogctl->SharedRecoveryState;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..b207827651 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,18 +572,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+	RecoveryState	recoveryState = GetRecoveryState();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if ( ! XLogArchivingActive() )
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * On standbys, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
 		return true;
 
+	/*
+	 * We are either:
+	 * * a primary with archive_mode set to "on" or "always"
+	 * * a standby with archive_mode set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 7412caa5f2..601ae34dd8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -165,6 +165,28 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/*
+ * Recovery state.
+ *
+ * There's two different reasons to recover WAL: when standby mode is requested
+ * or after a crash to catchup with consistency.
+ *
+ * CRASH_RECOVERING is set when recovering after a crash.
+ * ARCHIVE_RECOVERING is set when archive recovery or standby is requested.
+ * NOT_IN_RECOVERY means the cluster is in production.
+ *
+ * Note that even if archive recovery or standby is requested, recovery state
+ * starts first with CRASH_RECOVERING before switching to ARCHIVE_RECOVERING if:
+ * - the control file shows evidence the cluster crashed based on its previous state
+ * - there's no backup label file
+ */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0, /* currently in production */
+	CRASH_RECOVERING,    /* recovering from a crash */
+	ARCHIVE_RECOVERING   /* recovering archives as requested */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -279,6 +301,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
-- 
2.20.1

0002-v2-Add-test-on-non-archived-WAL-during-crash-recovery.patch (text/x-patch)
From 63feb240a6c7f9bdc305eb7b3aa0a25f3355ccbb Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Wed, 1 Apr 2020 18:12:17 +0200
Subject: [PATCH 2/2] Add test on non archived WAL during crash recovery

---
 src/test/recovery/t/011_crash_recovery.pl | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index ca6e92b50d..ce2e899891 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,11 +15,17 @@ if ($Config{osname} eq 'MSWin32')
 }
 else
 {
-	plan tests => 3;
+	plan tests => 4;
 }
 
 my $node = get_new_node('master');
-$node->init(allows_streaming => 1);
+$node->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+
+$node->append_conf('postgresql.conf',
+	"archive_command = ''");
+
 $node->start;
 
 my ($stdin, $stdout, $stderr) = ('', '', '');
@@ -43,6 +49,7 @@ $stdin .= q[
 BEGIN;
 CREATE TABLE mine(x integer);
 SELECT pg_current_xact_id();
+SELECT pg_switch_wal();
 ];
 $tx->pump until $stdout =~ /[[:digit:]]+[\r\n]$/;
 
@@ -65,4 +72,9 @@ cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
 is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 	'aborted', 'xid is aborted after crash');
 
+# make WAL are still waiting to be archived
+my $node_data = $node->data_dir;
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL still waiting to be archived");
+
 $tx->kill_kill;
-- 
2.20.1

#21 Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#20)
Re: [BUG] non archived WAL removed during production crash recovery

On Tue, Apr 07, 2020 at 05:17:36PM +0200, Jehan-Guillaume de Rorthais wrote:

I'm not sure who would need this information before the WAL machinery is up,
but is it safe enough in your opinion for future usage of GetRecoveryState()? Do
you think this information might be useful before the WAL machinery is set up?
The current "user" (e.g. restoreTwoPhaseData()) only needs to know if we are in
recovery, whatever the reason.

(I had this thread marked as something to look at, but could not.
Sorry for the delay).

src/test/recovery/t/011_crash_recovery.pl | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index ca6e92b50d..ce2e899891 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,11 +15,17 @@ if ($Config{osname} eq 'MSWin32')

May I ask why this new test is added to 011_crash_recovery.pl which is
aimed at testing crash and redo, while we have 002_archiving.pl that
is dedicated to archiving in a more general manner?
--
Michael

#22 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#18)
Re: [BUG] non archived WAL removed during production crash recovery

At Tue, 7 Apr 2020 17:17:36 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;

I'm not sure the complexity is required here. Are we assuming that
archive_mode can be changed before restarting?

At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Ok, so our *current* consensus seems to be the following. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

yes

If archive_mode = off no WAL files are marked as ".ready".

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

yes

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

yes

So if we assume archive_mode won't be changed after a crash before
restarting, if archive_mode = on on a standby, WAL files are not marked
as ".ready". If it is "always", WAL files that are to be archived are
marked as ".ready". Finally, the condition reduces to:

If archiver is running, archive ".ready" files. Otherwise ignore
".ready" and just remove WAL files after use.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

That situation implies that archive_mode has been changed. Can we
guarantee the completeness of the archive in that case?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#23 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#21)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 8 Apr 2020 11:23:45 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Apr 07, 2020 at 05:17:36PM +0200, Jehan-Guillaume de Rorthais wrote:

I'm not sure who would need this information before the WAL machinery is up,
but is it safe enough in your opinion for future usage of
GetRecoveryState()? Do you think this information might be useful before
the WAL machinery is set up? The current "user" (e.g. restoreTwoPhaseData()) only
needs to know if we are in recovery, whatever the reason.

(I had this thread marked as something to look at, but could not.
Sorry for the delay).

(no worries :))

src/test/recovery/t/011_crash_recovery.pl | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/011_crash_recovery.pl
b/src/test/recovery/t/011_crash_recovery.pl index ca6e92b50d..ce2e899891
100644 --- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -15,11 +15,17 @@ if ($Config{osname} eq 'MSWin32')  

May I ask why this new test is added to 011_crash_recovery.pl which is
aimed at testing crash and redo, while we have 002_archiving.pl that
is dedicated to archiving in a more general manner?

I thought it was a better place because the test happens during crash recovery.

In the meantime, while working on other tests related to $SUBJECT and the
current consensus, I was wondering if a new file would be a better place anyway.

#24 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Kyotaro Horiguchi (#22)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 08 Apr 2020 17:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 7 Apr 2020 17:17:36 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;  

I'm not sure the complexity is required here. Are we assuming that
archive_mode can be changed before restarting?

I assume it can, yes. E.g., one can restore a PITR backup as a standby and change
the value of archive_mode to either off, on or always.

At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Ok, so our *current* consensus seems to be the following. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

yes

If archive_mode = off no WAL files are marked as ".ready".

Sure, on the primary side.

What if you build a standby from a backup with archive_mode=on with
some .ready files in there?

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

yes

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

yes

So if we assume archive_mode won't be changed after a crash before
restarting, if archive_mode = on on a standby, WAL files are not marked
as ".ready".

.ready files can be inherited from the old primary when building the standby,
depending on the method. See previous explanations from Fujii-san:
/messages/by-id/ca964b3a-61a0-902e-c7b3-3abbc01a921f@oss.nttdata.com

If it is "always", WAL files that are to be archived are
marked as ".ready". Finally, the condition reduces to:

If archiver is running, archive ".ready" files. Otherwise ignore
".ready" and just remove WAL files after use.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

That situation implies that archive_mode has been changed.

Why? archive_mode may have been "always" on the primary when eg. a snapshot has
been created.

Regards,

#25 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#24)
Re: [BUG] non archived WAL removed during production crash recovery

Hello, Jehan.

At Wed, 8 Apr 2020 15:26:03 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

On Wed, 08 Apr 2020 17:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 7 Apr 2020 17:17:36 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;  

I'm not sure the complexity is required here. Are we assuming that
archive_mode can be changed before restarting?

I assume it can, yes. E.g., one can restore a PITR backup as a standby and change
the value of archive_mode to either off, on or always.

Thanks. I was confused. The original issue was that a restarted master can
miss files in the archive. To fix that, it's sufficient not to ignore
.ready files. This goes further than that.

At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Ok, so our *current* consensus seems to be the following. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

yes

If archive_mode = off no WAL files are marked as ".ready".

Sure, on the primary side.

What if you build a standby from a backup with archive_mode=on with
some .ready files in there?

Well, a backup has nothing in the archive_status directory if it is
taken by pg_basebackup. If the backup is created another way, it can
have some (as Fujii-san mentioned). A master with archive_mode != off
and a standby with archive_mode=always should archive WAL files that are
not marked .done, but a standby with archive_mode == on should not. The
commit intended that, but the mistake here is that it assumes inRecovery
represents whether the server is running as a standby or not, while actually it
is also true on a primary during crash recovery.

On the other hand, with the patch, standby with archive_mode=on
wrongly archives WAL files during crash recovery.

What we should check there is, as the commit was intended, not whether
it is under crash or archive recovery, but whether it is running as
primary or standby.

If it is "always", WAL files that are to be archived are
marked as ".ready". Finally, the condition reduces to:

If archiver is running, archive ".ready" files. Otherwise ignore
".ready" and just remove WAL files after use.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

That situation implies that archive_mode has been changed.

Why? archive_mode may have been "always" on the primary when eg. a snapshot has
been created.

.ready files are created only when archive_mode != off.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#26 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Kyotaro Horiguchi (#25)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 09 Apr 2020 11:26:57 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
[...]

At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Ok, so our *current* consensus seems to be the following. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

yes

If archive_mode = off no WAL files are marked as ".ready".

Sure, on the primary side.

What if you build a standby from a backup with archive_mode=on with
some .ready files in there?

Well, a backup has nothing in the archive_status directory if it is
taken by pg_basebackup. If the backup is created another way, it can
have some (as Fujii-san mentioned). A master with archive_mode != off
and a standby with archive_mode=always should archive WAL files that are
not marked .done, but a standby with archive_mode == on should not. The
commit intended that

Unless I'm wrong, the commit avoids creating .ready files on a standby when a WAL
file has neither a .done nor a .ready status file.

but the mistake here is it thinks that inRecovery represents whether it is
running as a standby or not, but actually it is true on primary during crash
recovery.

Indeed.

On the other hand, with the patch, standby with archive_mode=on
wrongly archives WAL files during crash recovery.

"without the patch" you mean? You are talking about 78ea8b5daab, right?

What we should check there is, as the commit was intended, not whether
it is under crash or archive recovery, but whether it is running as
primary or standby.

Yes.

If it is "always", WAL files that are to be archived are
marked as ".ready". Finally, the condition reduces to:

If archiver is running, archive ".ready" files. Otherwise ignore
".ready" and just remove WAL files after use.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

That situation implies that archive_mode has been changed.

Why? archive_mode may have been "always" on the primary when eg. a snapshot
has been created.

.ready files are created only when archive_mode != off.

Yes, on a primary, they are created when archive_mode > off. On standby, when
archive_mode=always. If a primary had archive_mode=always, a standby created
from its backup will still have archive_mode=always, with no changes.
Maybe I missed your point, sorry.

Regards,

#27 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Jehan-Guillaume de Rorthais (#23)
1 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 8 Apr 2020 13:58:30 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]

May I ask why this new test is added to 011_crash_recovery.pl which is
aimed at testing crash and redo, while we have 002_archiving.pl that
is dedicated to archiving in a more general manner?

I thought it was a better place because the test happens during crash recovery.

In the meantime, while working on other tests related to $SUBJECT and the
current consensus, I was wondering if a new file would be a better place
anyway.

So, 002_archiving.pl deals more with testing recovery on the hot standby side
than with archiving. Maybe it could be renamed?

While discussing this, I created a new file to add some more tests about WAL
archiving and how recovery deals with them. Please find the patch in
attachment. I'll be able to move them elsewhere later, depending on the
conclusions of this discussion.

Regards,

Attachments:

0002-v3-Add-tests-relative-to-WAL-archiving.patch (text/x-patch)
From 8a5422c451d91fb9e2bf3d5d826b9625118d3f24 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Thu, 9 Apr 2020 18:38:50 +0200
Subject: [PATCH 2/2] Add tests relative to WAL archiving

---
 src/test/recovery/t/020_archive_status.pl | 130 ++++++++++++++++++++++
 1 file changed, 130 insertions(+)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..5423e86978
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,130 @@
+#
+# Tests relating to WAL archiving and cleanup
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+
+my ($node, $standby1, $standby2);
+my ($node_data, $standby1_data, $standby2_data);
+
+if ($Config{osname} eq 'MSWin32')
+{
+
+	# some Windows Perls at least don't like IPC::Run's start/kill_kill regime.
+	plan skip_all => "Test fails on Windows perl";
+}
+else
+{
+	plan tests => 8;
+}
+
+$node = get_new_node('master');
+$node->init(
+	has_archiving => 1,
+	allows_streaming => 1
+);
+$node->start;
+$node_data = $node->data_dir;
+
+# temporary fail archive_command for futur tests
+$node->safe_psql('postgres', q{
+	ALTER SYSTEM SET archive_command TO 'false';
+	SELECT pg_reload_conf();
+});
+
+$node->safe_psql('postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# wait for archive failure
+$node->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver},
+	't') or die "Timed out while waiting for archiver to fail";
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "ready file exists for WAL waiting to be archived");
+
+is($node->safe_psql('postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	'0|000000010000000000000001', 'pg_stat_archiver reports archive failure');
+
+
+# We need to crash the cluster because next test checks the crash
+# recovery step do not removes non-archived WAL. 
+$node->stop('immediate');
+
+# Standby recovery tests checks the recovery behavior when restoring a
+# backup taken using eg. a snapshot with no pg_start/stop_backup.
+# In this situation, the recovered standby should enter first crash
+# recovery then switch to regular archive recovery.
+$node->backup_fs_cold('backup');
+
+$node->start;
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL still ready to archive in archive_status");
+
+# Allow WAL archiving again
+$node->safe_psql('postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+# wait for archive success
+$node->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver},
+	'1')  or die "Timed out while waiting for archiver to success";
+
+ok( ! -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "ready file for archived WAL removed");
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.done",
+  "done file for archived WAL exists");
+
+is($node->safe_psql('postgres',
+		q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	'000000010000000000000001',
+	'Archive success reported in pg_stat_archiver');
+
+# create some wal activity and a new checkpoint so futur standby can create
+# a restartpoint.
+# As standby start in crash recovery because of the backup method, they need
+# a clean restartpoint to deal with existing status files.
+$node->safe_psql('postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$node->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	'000000010000000000000002')  or die "Timed out while waiting for archiver to success";
+
+# test recovery without archive_mode=always does not keep .ready WALs
+$standby1 = get_new_node('standby');
+$standby1->init_from_backup($node, 'backup', has_restoring => 1);
+$standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+ok( ! -f "$standby1_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL waiting to be archived in backup removed with archive_mode=on on standby" );
+
+# test recovery with archive_mode=always keeps .ready WALs
+$standby2 = get_new_node('standby2');
+$standby2->init_from_backup($node, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', "archive_mode = always");
+$standby2_data = $standby2->data_dir;
+$standby2->start;
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL waiting to be archived in backup kept with archive_mode=always on standby" );
-- 
2.20.1

#28 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#26)
Re: [BUG] non archived WAL removed during production crash recovery

By the way, I hadn't noticed that Cc: didn't contain -hackers. Added
it.

At Thu, 9 Apr 2020 11:35:12 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

On Thu, 09 Apr 2020 11:26:57 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
[...]

At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Ok, so our *current* consensus seems to be the following. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recovery and standby mode.

yes

If archive_mode = off no WAL files are marked as ".ready".

Sure, on the primary side.

What if you build a standby from a backup with archive_mode=on with
some .ready files in there?

Well, a backup has nothing in the archive_status directory if it is
taken by pg_basebackup. If the backup is created another way, it can
have some (as Fujii-san mentioned). A master with archive_mode != off
and a standby with archive_mode=always should archive WAL files that are
not marked .done, but a standby with archive_mode == on should not. The
commit intended that

Unless I'm wrong, the commit avoids creating .ready files on a standby when a WAL
file has neither a .done nor a .ready status file.

Right.

but the mistake here is it thinks that inRecovery represents whether it is
running as a standby or not, but actually it is true on primary during crash
recovery.

Indeed.

On the other hand, with the patch, standby with archive_mode=on
wrongly archives WAL files during crash recovery.

"without the patch" you mean? You are talking about 78ea8b5daab, right?

No. I meant the v4 patch in [1].

[1]: /messages/by-id/20200407171736.61906608@firost

Prior to applying the patch (that is, commit 78ea..),
XLogArchiveCheckDone() correctly returns true (= allows removing it)
for the same condition.

The proposed patch does the following:

if (!XLogArchivingActive() ||
recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways())
return true;

It doesn't return for the condition "recoveryState = CRASH_RECOVERING
and archive_mode = on". Then the WAL file is mistakenly marked as
".ready" if not marked yet.

By the way, the code doesn't quite follow the coding convention
here. Please keep the inserted code in the same style as the existing code
around it.

+ if ( ! XLogArchivingActive() )

I think we don't put the spaces within the parentheses above.
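
That is, simply:

    if (!XLogArchivingActive())
        return true;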

| ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY

The first two and the last one are in different styles. *I* would prefer them
(if we use this) to be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY".

What we should check there is, as the commit was intended, not whether
it is under crash or archive recovery, but whether it is running as
primary or standby.

Yes.

If it is "always", WAL files that are to be archived are
marked as ".ready". Finally, the condition reduces to:

If archiver is running, archive ".ready" files. Otherwise ignore
".ready" and just remove WAL files after use.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

That situation implies that archive_mode has been changed.

Why? archive_mode may have been "always" on the primary when eg. a snapshot
has been created.

.ready files are created only when archive_mode != off.

Yes, on a primary, they are created when archive_mode > off. On standby, when
archive_mode=always. If a primary had archive_mode=always, a standby created
from its backup will still have archive_mode=always, with no changes.
Maybe I miss your point, sorry.

Sorry, it was ambiguous.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

If we have .ready files when archive_mode = off, the cluster (or the
original of the copied cluster) must have been running with archive_mode = on
or always. That is, archive_mode has been changed. But anyway, that
discussion is not really relevant here.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#29 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#27)
Re: [BUG] non archived WAL removed during production crash recovery

At Thu, 9 Apr 2020 18:46:22 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

On Wed, 8 Apr 2020 13:58:30 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]

May I ask why this new test is added to 011_crash_recovery.pl which is
aimed at testing crash and redo, while we have 002_archiving.pl that
is dedicated to archiving in a more general manner?

I thought it was a better place because the test happens during crash recovery.

In the meantime, while working on other tests related to $SUBJECT and the
current consensus, I was wondering if a new file would be a better place
anyway.

So, 002_archiving.pl deals more with testing recovery on the hot standby side
than with archiving. Maybe it could be renamed?

I have the same feeling as Michael. The test that archives are
created correctly seems to fit that file. It is probably unintentional
that the file does not exercise archiving much.

While discussing this, I created a new file to add some more tests about WAL
archiving and how recovery deals with them. Please find the patch in
attachment. I'll be able to move them elsewhere later, depending on the
conclusions of this discussion.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#30 Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#29)
Re: [BUG] non archived WAL removed during production crash recovery

On Fri, Apr 10, 2020 at 11:14:54AM +0900, Kyotaro Horiguchi wrote:

I have the same feeling with Michael. The test that archives are
created correctly seems to fit the file. It would be unintentionally
that the file is not exercising archiving so much.

I have been finally looking at this thread and the latest patch set,
sorry for the late reply.

    XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-   XLogCtl->SharedRecoveryInProgress = true;
    XLogCtl->SharedHotStandbyActive = false;
    XLogCtl->SharedPromoteIsTriggered = false;
    XLogCtl->WalWriterSleeping = false;
[...]
+   switch (ControlFile->state)
+   {
+       case DB_SHUTDOWNED:
+       case DB_SHUTDOWNED_IN_RECOVERY:
+           XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+           break;
+       default:
+           XLogCtl->SharedRecoveryState = CRASH_RECOVERING;
+   }
It seems to me that the initial value of SharedRecoveryState should be
CRASH_RECOVERING all the time no?  StartupXLOG() is a code path taken
even if starting cleanly, and the flag would be reset correctly at the
end to NOT_IN_RECOVERY.
+typedef enum RecoveryState
+{
+   NOT_IN_RECOVERY = 0, /* currently in production */
+   CRASH_RECOVERING,    /* recovering from a crash */
+   ARCHIVE_RECOVERING   /* recovering archives as requested */
+} RecoveryState;
I also have some issues with the name of those variables, here is an
idea for the three states:
- RECOVERY_STATE_CRASH
- RECOVERY_STATE_ARCHIVE
- RECOVERY_STATE_NONE
I would recommend to use the same suffix for those variables to ease
grepping.
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * This is false when SharedRecoveryState is not ARCHIVE_RECOVERING.
+ * True actually means "not known, need to check the shared state".
  */
A double negation sounds wrong to me.  And actually this variable is
false when the shared state is set to NOT_IN_RECOVERY, and true when
the state is either CRASH_RECOVERING or ARCHIVE_RECOVERING because it
means that recovery is running, be it archive recovery or crash
recovery, so the comment is wrong.
+   /* The file is always deletable if archive_mode is "off". */
+   if ( ! XLogArchivingActive() )
+       return true;
[...]
+   if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
        return true;
Incorrect indentation.
                UpdateControlFile();
+
+               SpinLockAcquire(&XLogCtl->info_lck);
+               XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+               SpinLockRelease(&XLogCtl->info_lck);
+
                LWLockRelease(ControlFileLock);
Here, the shared flag is updated while holding ControlFileLock to
ensure a consistent state of things within shared memory, so it would
be nice to add a comment about that.
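Something like this would do (exact wording up to you):
    /*
     * Update SharedRecoveryState while holding ControlFileLock, so that
     * the control file state and the shared flag stay consistent.
     */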
+RecoveryState
+GetRecoveryState(void)
+{
+   volatile XLogCtlData *xlogctl = XLogCtl;
+
+   return xlogctl->SharedRecoveryState;
+}
Er, you need to acquire info_lck to look at the state here, no?
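i.e. something like (only a sketch):
    RecoveryState
    GetRecoveryState(void)
    {
        RecoveryState retval;

        SpinLockAcquire(&XLogCtl->info_lck);
        retval = XLogCtl->SharedRecoveryState;
        SpinLockRelease(&XLogCtl->info_lck);

        return retval;
    }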
    /*
-    * The file is always deletable if archive_mode is "off".  On standbys
-    * archiving is disabled if archive_mode is "on", and enabled with
-    * "always".  On a primary, archiving is enabled if archive_mode is "on"
-    * or "always".
+    * On standbys, the file is deletable if archive_mode is not
+    * "always".
     */
It would be good to mention that while in crash recovery, files are
not considered as deletable.

I agree that having a separate .pl file for the tests of this thread
is just cleaner. Here are more comments about these.

+# temporary fail archive_command for futur tests
+$node->safe_psql('postgres', q{
+   ALTER SYSTEM SET archive_command TO 'false';
+   SELECT pg_reload_conf();
+});
That's likely not portable on Windows even if you skip the tests there,
and I am not sure that it is a good idea to rely on it being in PATH.
Depending on the system, the path of the command is also likely going
to be different.  As here the goal is to prevent the archiver from doing
its work, why not rely on the configuration where archive_mode is
set but archive_command is not?  This would cause the archiver to be a
no-op process, and .ready files will remain around.  You could then
replace the lookup of pg_stat_archiver with poll_query_until() and a
query that makes use of pg_stat_file() to make sure that the .ready
exists when needed.
+ ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+   "WAL still ready to archive in archive_status");
It would be good to mention in the description that the check applies to a
primary.
+# test recovery without archive_mode=always does not keep .ready WALs
+$standby1 = get_new_node('standby');
+$standby1->init_from_backup($node, 'backup', has_restoring => 1);
+$standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
For readability, archive_mode = on should be added to the configuration
file?  Okay, this is inherited from the primary, still that would
avoid any issues if this code is refactored in some way.

"WAL waiting to be archived in backup removed with archive_mode=on
on standby" );
That should be "WAL segment" or "WAL file", but not WAL.

Regarding the tests on a standby, it seems to me that the following
is necessary:
1) Test that with archive_mode = on, segments are not marked with
.ready.
2) Test that with archive_mode = always, segments are marked with
.ready during archive recovery.
3) Test that with archive_mode = always, segments are not removed
during crash recovery.
I can see tests for 1) and 2), but not 3). Could you add a
stop('immediate')+start() for $standby2 at the end of
020_archive_status.pl and check that the .ready file is still there
after crash recovery? The end of the tests actually relies on the
fact that archive_command is set to "false" when the cold backup is
taken, before resetting it. I think that it would be cleaner to
enforce the configuration you want to test before starting each
standby. It becomes easier to understand the flow of the test for the
reader.
--
Michael

#31 Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#30)
2 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Mon, 13 Apr 2020 16:14:14 +0900
Michael Paquier <michael@paquier.xyz> wrote:
[...]

XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-   XLogCtl->SharedRecoveryInProgress = true;
XLogCtl->SharedHotStandbyActive = false;
XLogCtl->SharedPromoteIsTriggered = false;
XLogCtl->WalWriterSleeping = false;
[...]
+   switch (ControlFile->state)
+   {
+       case DB_SHUTDOWNED:
+       case DB_SHUTDOWNED_IN_RECOVERY:
+           XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+           break;
+       default:
+           XLogCtl->SharedRecoveryState = CRASH_RECOVERING;
+   }
It seems to me that the initial value of SharedRecoveryState should be
CRASH_RECOVERING all the time no?  StartupXLOG() is a code path taken
even if starting cleanly, and the flag would be reset correctly at the
end to NOT_IN_RECOVERY.

The previous version of the patch was setting CRASH_RECOVERING. Fujii-san reported
(on 4 Apr 2020 02:49:50 +0900) that it could be useful to expose a better value
until the relevant code is reached in StartupXLOG(), so GetRecoveryState() returns
a safer value for future use.

As I answered upthread, I'm not sure who would need this information before the
WAL machinery is up though. Note that ARCHIVE_RECOVERING and CRASH_RECOVERING
are compatible with the previous behavior.

Maybe the solution would be to init with CRASH_RECOVERING and add some comment
in GetRecoveryState() to warn that the state is only "enforced" after the XLOG
machinery is started, and is init'ed to CRASH_RECOVERING in the meantime?

I initialized it to CRASH_RECOVERING in the new v5 patch and added a comment
to GetRecoveryState().

+typedef enum RecoveryState
+{
+   NOT_IN_RECOVERY = 0, /* currently in production */
+   CRASH_RECOVERING,    /* recovering from a crash */
+   ARCHIVE_RECOVERING   /* recovering archives as requested */
+} RecoveryState;
I also have some issues with the name of those variables, here is an
idea for the three states:
- RECOVERY_STATE_CRASH
- RECOVERY_STATE_ARCHIVE
- RECOVERY_STATE_NONE
I would recommend to use the same suffix for those variables to ease
grepping.

Sounds really good to me. Thanks!

/*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * This is false when SharedRecoveryState is not ARCHIVE_RECOVERING.
+ * True actually means "not known, need to check the shared state".
*/
A double negation sounds wrong to me.  And actually this variable is
false when the shared state is set to NOT_IN_RECOVERY, and true when
the state is either CRASH_RECOVERING or ARCHIVE_RECOVERING because it
means that recovery is running, be it archive recovery or crash
recovery, so the comment is wrong.

Indeed, sorry. Fixed.

+   /* The file is always deletable if archive_mode is "off". */
+   if ( ! XLogArchivingActive() )
+       return true;
[...]
+   if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
return true;
Incorrect indentation.

Are these the spaces reported by Horiguchi-san? I removed them in the latest patch.

UpdateControlFile();
+
+               SpinLockAcquire(&XLogCtl->info_lck);
+               XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+               SpinLockRelease(&XLogCtl->info_lck);
+
LWLockRelease(ControlFileLock);
Here, the shared flag is updated while holding ControlFileLock to
ensure a consistent state of things within shared memory, so it would
be nice to add a comment about that.

Indeed. The original code had no such comment and I asked myself the same
question. Added.

+RecoveryState
+GetRecoveryState(void)
+{
+   volatile XLogCtlData *xlogctl = XLogCtl;
+
+   return xlogctl->SharedRecoveryState;
+}
Er, you need to acquire info_lck to look at the state here, no?

Yes, fixed.

/*
-    * The file is always deletable if archive_mode is "off".  On standbys
-    * archiving is disabled if archive_mode is "on", and enabled with
-    * "always".  On a primary, archiving is enabled if archive_mode is "on"
-    * or "always".
+    * On standbys, the file is deletable if archive_mode is not
+    * "always".
*/
It would be good to mention that while in crash recovery, files are
not considered as deletable.

Well, in fact, I am still wondering about this. I was hesitant to add a
shortcut like:

/* no WAL segment cleanup during crash recovery */
if (recoveryState == RECOVERY_STATE_CRASH)
return false;

But what if, for example, we crashed for lack of disk space during intensive
writes? During crash recovery, any WAL marked as .done could be removed,
allowing the system to start again and maybe make even further WAL cleanup by
archiving some more WAL without competing with the previous high write load.

When we recover a primary, this behavior seems consistent with any value of
archive_mode, even if it has been changed after the crash and before restarting.
On a standby, we might create .ready files, but they will be removed during the
first restartpoint if needed.

I agree that having a separate .pl file for the tests of this thread
is just cleaner. Here are more comments about these.

+# temporary fail archive_command for futur tests
+$node->safe_psql('postgres', q{
+   ALTER SYSTEM SET archive_command TO 'false';
+   SELECT pg_reload_conf();
+});
That's likely not portable on Windows even if you skip the tests there,
and I am not sure that it is a good idea to rely on it being in PATH.
Depending on the system, the path of the command is also likely going
to be different.  As here the goal is to prevent the archiver from doing
its work, why not rely on the configuration where archive_mode is
set but archive_command is not?  This would cause the archiver to be a
no-op process, and .ready files will remain around.  You could then
replace the lookup of pg_stat_archiver with poll_query_until() and a
query that makes use of pg_stat_file() to make sure that the .ready
exists when needed.

I needed a failure so I can also test that pg_stat_archiver reports it. In my
mind, using "false" would either trigger a failure because false returns 1 or...
a failure because the command is not found. Either way, the result is the
same.

Using poll_query_until+pg_stat_file is a good idea, but not enough, as the
archiver reports a failure some time after the .ready signal file appears.

+ ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+   "WAL still ready to archive in archive_status");
It would be good to mention in the description that the check applies to a
primary.

done

+# test recovery without archive_mode=always does not keep .ready WALs
+$standby1 = get_new_node('standby');
+$standby1->init_from_backup($node, 'backup', has_restoring => 1);
+$standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
For readability, archive_mode = on should be added to the configuration
file?  Okay, this is inherited from the primary, still that would
avoid any issues if this code is refactored in some way.

added

"WAL waiting to be archived in backup removed with archive_mode=on
on
standby" ); That should be "WAL segment" or "WAL file", but not WAL.

updated everywhere.

Regarding the tests on a standby, it seems to me that the following
is necessary:
1) Test that with archive_mode = on, segments are not marked with
.ready.
2) Test that with archive_mode = always, segments are marked with
.ready during archive recovery.
3) Test that with archive_mode = always, segments are not removed
during crash recovery.
I can see tests for 1) and 2),

Not really. The current tests do not check whether segments created *after* the
backup are marked with a .ready file on the standby. I added these tests plus
various other ones.

but not 3). Could you add a
stop('immediate')+start() for $standby2 at the end of
020_archive_status.pl and check that the .ready file is still there
after crash recovery?

done.

The end of the tests actually relies on the
fact that archive_command is set to "false" when the cold backup is
taken, before resetting it. I think that it would be cleaner to
enforce the configuration you want to test before starting each
standby. It becomes easier to understand the flow of the test for the
reader.

done as well.

Thank you for your review!

Attachments:

0001-v5-Fix-WAL-retention-during-production-crash-recovery.patchtext/x-patchDownload
From 6654b0cd3073c4bed6c0cbf7a3b0e9fc7a4b9e2d Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Wed, 1 Apr 2020 16:36:06 +0200
Subject: [PATCH 1/2] Fix WAL retention during production crash recovery

During crash recovery of a production cluster with archive_mode=on,
XLogArchiveCheckDone() was considering the cluster as inRecovery
without archive_mode=always. Because of this non-arcived WAL and
related .ready files were recycled or removed.
---
 src/backend/access/transam/xlog.c        | 48 +++++++++++++++++++-----
 src/backend/access/transam/xlogarchive.c | 21 +++++++----
 src/include/access/xlog.h                | 21 +++++++++++
 3 files changed, 74 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..26920c7f3e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -221,8 +221,8 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * This is false when SharedRecoveryState is RECOVERT_STATE_NONE.
+ * True actually means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -653,10 +653,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're either in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -4434,6 +4434,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERT_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5166,10 +5176,10 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -7911,7 +7921,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_NONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8057,7 +8067,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_NONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8069,8 +8079,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8086,6 +8096,26 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is accurate after StartupXLOG()
+ * finished.
+ *
+ * See details about RecoveryState in xlog.h
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState	retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..d2ffc7e1a2 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,18 +572,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+	RecoveryState	recoveryState = GetRecoveryState();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * On standbys, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (recoveryState == RECOVERT_STATE_ARCHIVE && !XLogArchivingAlways())
 		return true;
 
+	/*
+	 * We are either:
+	 * * a primary with archive_mode set to "on" or "always"
+	 * * a standby with archive_mode set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d36c..157a3fb104 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -166,6 +166,26 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/*
+ * Recovery state.
+ *
+ * There's two different reasons to recover WAL: when standby mode is requested
+ * or after a crash to catchup with consistency.
+ *
+ * RECOVERY_STATE_CRASH is set when recovering after a crash.
+ * RECOVERT_STATE_ARCHIVE is set when archive recovery or standby is requested.
+ * RECOVERY_STATE_NONE means the cluster is in production.
+ *
+ * Note that even if archive recovery or standby is requested, recovery state
+ * starts first with RECOVERY_STATE_CRASH before switching to ARCHIVE_RECOVERING
+ */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_NONE = 0, /* currently in production */
+	RECOVERY_STATE_CRASH,    /* recovering from a crash */
+	RECOVERT_STATE_ARCHIVE   /* recovering archives as requested */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -291,6 +311,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
-- 
2.20.1

0002-v4-Add-tests-relative-to-WAL-archiving.patchtext/x-patchDownload
From 14fab693010b1e034d0bd9e7c81a8ac48aa8f8d8 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Date: Thu, 9 Apr 2020 18:38:50 +0200
Subject: [PATCH 2/2] Add tests relative to WAL archiving

---
 src/test/recovery/t/020_archive_status.pl | 193 ++++++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..a5f4fd821a
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,193 @@
+#
+# Tests relating to WAL archiving and cleanup
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+
+my ($node, $standby1, $standby2);
+my ($node_data, $standby1_data, $standby2_data);
+
+if ($Config{osname} eq 'MSWin32')
+{
+
+	# some Windows Perls at least don't like IPC::Run's start/kill_kill regime.
+	plan skip_all => "Test fails on Windows perl";
+}
+else
+{
+	plan tests => 16;
+}
+
+$node = get_new_node('master');
+$node->init(
+	has_archiving => 1,
+	allows_streaming => 1
+);
+$node->start;
+$node_data = $node->data_dir;
+
+# temporary fail archive_command for futur tests
+$node->safe_psql('postgres', q{
+	ALTER SYSTEM SET archive_command TO 'false';
+	SELECT pg_reload_conf();
+});
+
+$node->safe_psql('postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# wait for archive failure
+$node->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver},
+	't') or die "Timed out while waiting for archiver to fail";
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "ready file exists for WAL waiting to be archived");
+
+is($node->safe_psql('postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	'0|000000010000000000000001', 'pg_stat_archiver reports archive failure');
+
+# We need to crash the cluster because next test checks the crash
+# recovery step do not removes non-archived WAL.
+$node->stop('immediate');
+
+# Standby recovery tests checks the recovery behavior when restoring a
+# backup taken using eg. a snapshot with no pg_start/stop_backup.
+# In this situation, the recovered standby should enter first crash
+# recovery then switch to regular archive recovery.
+$node->backup_fs_cold('backup');
+
+$node->start;
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL segment still ready to archive after crash recovery on primary");
+
+# Allow WAL archiving again
+$node->safe_psql('postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+# wait for archive success
+$node->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver},
+	'1')  or die "Timed out while waiting for archiver to success";
+
+ok( ! -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "ready file for archived WAL removed");
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.done",
+  "done file for archived WAL exists");
+
+is($node->safe_psql('postgres',
+		q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	'000000010000000000000001',
+	'Archive success reported in pg_stat_archiver');
+
+# create some wal activity and a new checkpoint so futur standby can create
+# a restartpoint.
+# As standby start in crash recovery because of the backup method, they need
+# a clean restartpoint to deal with existing status files.
+$node->safe_psql('postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$node->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	'000000010000000000000002')  or die "Timed out while waiting for archiver to succeed";
+
+# tests standby with archive_mode=on
+$standby1 = get_new_node('standby');
+$standby1->init_from_backup($node, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+$standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# recovery with archive_mode=on does not keep .ready signal files inherited from backup.
+# 000000010000000000000001.ready existed in the backup.
+ok( ! -f "$standby1_data/pg_wal/archive_status/000000010000000000000001.ready",
+  ".ready signal file existing in backup removed with archive_mode=on on standby" );
+
+# recovery with archive_mode=on should not create .ready signal files
+# 000000010000000000000002.ready did not exist in the backup
+ok( ! -f "$standby1_data/pg_wal/archive_status/000000010000000000000002.ready",
+  "standby doesn't create .ready signal file when archive_mode=on" );
+
+# recovery with archive_mode=on creates .done signal files
+ok( -f "$standby1_data/pg_wal/archive_status/000000010000000000000002.done",
+  "standby creates .done signal file when archive_mode=on" );
+
+# test recovery with archive_mode=always keeps .ready WALs
+$standby2 = get_new_node('standby2');
+$standby2->init_from_backup($node, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+$standby1->append_conf('postgresql.auto.conf', "archive_command = false");
+$standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.ready",
+  ".ready signal file existing in backup are kept with archive_mode=always on standby" );
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000002.ready",
+  ".ready signal file are created with archive_mode=always on standby" );
+
+# Allow WAL archiving again
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+is($standby2->safe_psql('postgres',
+		q{ SELECT failed_count, archived_count FROM pg_stat_archiver }),
+	'0|0',
+	'statistics reset from pg_stat_archiver succeed');
+
+# We need to crash the cluster because next test checks the crash
+# recovery step do not removes non-archived WAL on a standby.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL segment still ready to archive after crash recovery on standby with archive_mode=always");
+
+# Allow WAL archiving again
+$standby2->safe_psql('postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+# wait for archive success
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	'000000010000000000000002')  or die "Timed out while waiting for archiver to success";
+
+is($standby2->safe_psql('postgres',
+		q{SELECT archived_count FROM pg_stat_archiver}),
+	'2', 'WAL segment archived from standby');
+
+ok( ! -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.ready"
+	&& ! -f "$standby2_data/pg_wal/archive_status/000000010000000000000002.ready",
+  ".ready signal file removed after archiving with archive_mode=always on standby" );
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.done"
+	and -f "$standby2_data/pg_wal/archive_status/000000010000000000000002.done",
+  ".done signal file created after archiving with archive_mode=always on standby" );
+
+#ok(0);
+#
+## wait for archive success
+#$standby2->poll_query_until('postgres',
+#	q{SELECT 1},
+#	'0')  or die "Timed out while waiting for archiver to success";
+
-- 
2.20.1

In reply to: Kyotaro Horiguchi (#28)
Re: [BUG] non archived WAL removed during production crash recovery

On Fri, 10 Apr 2020 11:00:31 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
[...]

but the mistake here is that it assumes inRecovery represents whether the
server is running as a standby or not, while actually it is also true on a
primary during crash recovery.

Indeed.

On the other hand, with the patch, standby with archive_mode=on
wrongly archives WAL files during crash recovery.

"without the patch" you mean? You are talking about 78ea8b5daab, right?

No. I meant the v4 patch in [1].

[1] /messages/by-id/20200407171736.61906608@firost

Prior to applying the patch (that is, the commit 78ea..),
XLogArchiveCheckDone() correctly returns true (= allowed to remove it)
for the same condition.

The proposed patch does the following:

if (!XLogArchivingActive() ||
recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways())
return true;

It doesn't return for the condition "recoveryState=CRASH_RECOVERING
and archive_mode = on". Then the WAL file is mistakenly marked as
".ready" if not marked yet.

Indeed.

But .ready files are then deleted during the first restartpoint. I'm not sure
how to fix this behavior without making the code too complex.

This is discussed in my last answer to Michael a few minutes ago as well.
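
To illustrate the difference for that exact case, here is a small stand-alone
program (my own sketch, not code from any of the patches; the enum and helper
names are made up) comparing how the 78ea8b5 check and the proposed check
classify a non-archived segment when archive_mode = on:

#include <stdio.h>
#include <stdbool.h>

typedef enum { STATE_CRASH, STATE_ARCHIVE, STATE_PRODUCTION } State;

/* 78ea8b5: deletable unless (archiving && !inRecovery) || (always && inRecovery) */
static bool
deletable_78ea8b5(State s, bool always)
{
    bool inRecovery = (s != STATE_PRODUCTION);

    /* archive_mode = on, so XLogArchivingActive() would be true */
    return !((true && !inRecovery) || (always && inRecovery));
}

/* proposed: only a standby in archive recovery without "always" may drop it */
static bool
deletable_proposed(State s, bool always)
{
    return (s == STATE_ARCHIVE && !always);
}

int
main(void)
{
    const char *names[] = {"crash recovery", "archive recovery", "production"};

    for (int s = 0; s < 3; s++)
        printf("%-16s  78ea8b5: %s  proposed: %s\n", names[s],
               deletable_78ea8b5((State) s, false) ? "drop" : "keep",
               deletable_proposed((State) s, false) ? "drop" : "keep");
    return 0;
}

The only cell that changes is crash recovery with archive_mode = on: 78ea8b5
says "drop" (the WAL loss reported at the start of the thread), the proposed
check says "keep", which on a standby also means a .ready file may show up
until the first restartpoint cleans it up, as described above.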

By the way, the code doesn't quite follow the convention here. Keep the
inserted code in the same style as the existing code around it.

+ if ( ! XLogArchivingActive() )

I think we don't put the spaces within the parentheses above.

Indeed. This is fixed in the v5 patch sent a few minutes ago.

| ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY

The first two and the last one are in different styles. *I* would prefer them
(if we use this) to be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY".

I like Michael's proposal. See v5 of the patch.

Thank you for your review!

Regards,

#33Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#31)
1 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Tue, Apr 14, 2020 at 06:03:19PM +0200, Jehan-Guillaume de Rorthais wrote:

Thanks for the new version.

On Mon, 13 Apr 2020 16:14:14 +0900 Michael Paquier <michael@paquier.xyz> wrote:

It seems to me that the initial value of SharedRecoveryState should be
CRASH_RECOVERING all the time no? StartupXLOG() is a code path taken
even if starting cleanly, and the flag would be reset correctly at the
end to NOT_IN_RECOVERY.

Previous version of the patch was setting CRASH_RECOVERING. Fujii-san reported
(on 4 Apr 2020 02:49:50 +0900) that it could be useful to expose a better value
until relevant code is reached in StartupXLOG() so GetRecoveryState() returns
a safer value for future use.

As I answered upthread, I'm not sure who would need this information before the
WAL machinery is up though. Note that ARCHIVE_RECOVERING and CRASH_RECOVERING
are compatible with the previous behavior.

I am not sure either if we need this information until the startup
process is up, but even if we need it I'd rather keep the code
consistent with the past practice, which was that
SharedRecoveryInProgress got set to true, the equivalent of crash
recovery as that's more generic than the archive recovery switch.

Maybe the solution would be to init with CRASH_RECOVERING and add some comment
in GetRecoveryState() to warn the state is "enforced" after the XLOG machinery
is started and is init'ed to RECOVERING in the meantime?

I initialized it to CRASH_RECOVERING in the new v5 patch and added a comment
to GetRecoveryState().

Not sure that the comment is worth it. Your description of the state
looks enough, and the code is clear that we have just an initial
shared memory state in this case.

+   /* The file is always deletable if archive_mode is "off". */
+   if ( ! XLogArchivingActive() )
+       return true;
[...]
+   if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
return true;
Incorrect indentation.

Is it the spaces as reported by Horiguchi-san? I removed them in the latest patch.

Yes.

It would be good to mention that while in crash recovery, files are
not considered as deletable.

Well, in fact, I am still wondering about this. I was hesitant to add a
shortcut like:

/* no WAL segment cleanup during crash recovery */
if (recoveryState == RECOVERT_STATE_CRASH)
return false;

But, what if for example we crashed for lack of disk space during intensive
writes? During crash recovery, any WAL marked as .done could be removed and
allow the system to start again and maybe make even further WAL cleanup by
archiving some more WAL without competing with previous high write ratio.

When we recover a primary, this behavior seems consistent with any value of
archive_mode, even if it has been changed after crash and before starting it.
On a standby, we might create .ready files, but they will be removed during the
first restartpoint if needed.

I guess that you mean .ready and not .done here? Removing .done files
does not matter as the segments related to them are already gone.
Even with that, why should we need to make the backend smarter about
the removal of .ready files during crash recovery. It seems to me
that we should keep them, and an operator could always come by himself
and do some manual cleanup to free some space in the pg_wal partition.

I needed a failure so I can test pg_stat_archiver reports it as well. In my
mind, using "false" would either trigger a failure because false returns 1 or...
a failure because the command is not found. Either way, the result is the
same.

Using poll_query_until+pg_stat_file, is a good idea, but not enough as
archiver reports a failure some moment after the .ready signal file appears.

Using an empty string makes the test more portable, but while I looked
at it I have found out that it leads to delays in the archiver except
if you force the generation of more segments in the test, causing the
logic to get more complicated with the manipulation of the .ready and
.done files. And I then found myself adding an
archive_timeout.. Anyway, this reduced the readability of the test, so
I am pretty much giving up on this idea.

The end of the tests actually relies on the
fact that archive_command is set to "false" when the cold backup is
taken, before resetting it. I think that it would be cleaner to
enforce the configuration you want to test before starting each
standby. It becomes easier to understand the flow of the test for the
reader.

done as well.

I have put my hands on the code, and attached is a cleaned up
version for the backend part. Below are some notes.

+ * RECOVERT_STATE_ARCHIVE is set for archive recovery or for a standby.
Typo here that actually comes from my previous email, and that you
blindly copy-pasted, repeated five times in the tree actually.

+   RecoveryState   recoveryState = GetRecoveryState();
+
+   /* The file is always deletable if archive_mode is "off". */
+   if (!XLogArchivingActive())
+       return true;
There is no point in calling GetRecoveryState() if archive_mode = off.
+ * There's two different reasons to recover WAL: when standby mode is requested
+ * or after a crash to catchup with consistency.
Nit: s/There's/There are/.  Anyway, I don't see much point in keeping
this comment as the description of each state value should be enough,
so I have removed it.

I am currently in the middle of reviewing the test and there are a
couple of things that can be improved, but I lack time today, so
I'll continue tomorrow on it. There is no need to send two separate
patches by the way as the code paths touched are different.
--
Michael

Attachments:

recovery-archives-v6.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d36c..2cce871dc9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -166,6 +166,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_NONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -291,6 +299,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..a1372b08f6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -221,8 +221,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_NONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -653,10 +654,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're either in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -4434,6 +4435,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5166,10 +5177,10 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -7911,7 +7922,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_NONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8057,7 +8068,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_NONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8069,8 +8080,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8086,6 +8097,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is accurate after StartupXLOG() finished.  See
+ * details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState	retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..ce0127c3d0 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,17 +572,21 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
+
+	/* On standbys, the file is deletable if archive_mode is not "always". */
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * At this point of the logic, note that we are either a primary with
+	 * with archive_mode set to "on" or "always", or a standby with
+	 * archive_mode set to "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
-		return true;
 
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..a5f4fd821a
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,193 @@
+#
+# Tests relating to WAL archiving and cleanup
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+
+my ($node, $standby1, $standby2);
+my ($node_data, $standby1_data, $standby2_data);
+
+if ($Config{osname} eq 'MSWin32')
+{
+
+	# some Windows Perls at least don't like IPC::Run's start/kill_kill regime.
+	plan skip_all => "Test fails on Windows perl";
+}
+else
+{
+	plan tests => 16;
+}
+
+$node = get_new_node('master');
+$node->init(
+	has_archiving => 1,
+	allows_streaming => 1
+);
+$node->start;
+$node_data = $node->data_dir;
+
+# temporary fail archive_command for futur tests
+$node->safe_psql('postgres', q{
+	ALTER SYSTEM SET archive_command TO 'false';
+	SELECT pg_reload_conf();
+});
+
+$node->safe_psql('postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# wait for archive failure
+$node->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver},
+	't') or die "Timed out while waiting for archiver to fail";
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "ready file exists for WAL waiting to be archived");
+
+is($node->safe_psql('postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	'0|000000010000000000000001', 'pg_stat_archiver reports archive failure');
+
+# We need to crash the cluster because next test checks the crash
+# recovery step do not removes non-archived WAL.
+$node->stop('immediate');
+
+# Standby recovery tests checks the recovery behavior when restoring a
+# backup taken using eg. a snapshot with no pg_start/stop_backup.
+# In this situation, the recovered standby should enter first crash
+# recovery then switch to regular archive recovery.
+$node->backup_fs_cold('backup');
+
+$node->start;
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL segment still ready to archive after crash recovery on primary");
+
+# Allow WAL archiving again
+$node->safe_psql('postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+# wait for archive success
+$node->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver},
+	'1')  or die "Timed out while waiting for archiver to success";
+
+ok( ! -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "ready file for archived WAL removed");
+
+ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.done",
+  "done file for archived WAL exists");
+
+is($node->safe_psql('postgres',
+		q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	'000000010000000000000001',
+	'Archive success reported in pg_stat_archiver');
+
+# create some wal activity and a new checkpoint so futur standby can create
+# a restartpoint.
+# As standby start in crash recovery because of the backup method, they need
+# a clean restartpoint to deal with existing status files.
+$node->safe_psql('postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$node->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	'000000010000000000000002')  or die "Timed out while waiting for archiver to succeed";
+
+# tests standby with archive_mode=on
+$standby1 = get_new_node('standby');
+$standby1->init_from_backup($node, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+$standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# recovery with archive_mode=on does not keep .ready signal files inherited from backup.
+# 000000010000000000000001.ready existed in the backup.
+ok( ! -f "$standby1_data/pg_wal/archive_status/000000010000000000000001.ready",
+  ".ready signal file existing in backup removed with archive_mode=on on standby" );
+
+# recovery with archive_mode=on should not create .ready signal files
+# 000000010000000000000002.ready did not exist in the backup
+ok( ! -f "$standby1_data/pg_wal/archive_status/000000010000000000000002.ready",
+  "standby doesn't create .ready signal file when archive_mode=on" );
+
+# recovery with archive_mode=on creates .done signal files
+ok( -f "$standby1_data/pg_wal/archive_status/000000010000000000000002.done",
+  "standby creates .done signal file when archive_mode=on" );
+
+# test recovery with archive_mode=always keeps .ready WALs
+$standby2 = get_new_node('standby2');
+$standby2->init_from_backup($node, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+$standby1->append_conf('postgresql.auto.conf', "archive_command = false");
+$standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.ready",
+  ".ready signal file existing in backup are kept with archive_mode=always on standby" );
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000002.ready",
+  ".ready signal file are created with archive_mode=always on standby" );
+
+# Allow WAL archiving again
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+is($standby2->safe_psql('postgres',
+		q{ SELECT failed_count, archived_count FROM pg_stat_archiver }),
+	'0|0',
+	'statistics reset from pg_stat_archiver succeed');
+
+# We need to crash the cluster because next test checks the crash
+# recovery step do not removes non-archived WAL on a standby.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.ready",
+  "WAL segment still ready to archive after crash recovery on standby with archive_mode=always");
+
+# Allow WAL archiving again
+$standby2->safe_psql('postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+# wait for archive success
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	'000000010000000000000002')  or die "Timed out while waiting for archiver to success";
+
+is($standby2->safe_psql('postgres',
+		q{SELECT archived_count FROM pg_stat_archiver}),
+	'2', 'WAL segment archived from standby');
+
+ok( ! -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.ready"
+	&& ! -f "$standby2_data/pg_wal/archive_status/000000010000000000000002.ready",
+  ".ready signal file removed after archiving with archive_mode=always on standby" );
+
+ok( -f "$standby2_data/pg_wal/archive_status/000000010000000000000001.done"
+	and -f "$standby2_data/pg_wal/archive_status/000000010000000000000002.done",
+  ".done signal file created after archiving with archive_mode=always on standby" );
+
+#ok(0);
+#
+## wait for archive success
+#$standby2->poll_query_until('postgres',
+#	q{SELECT 1},
+#	'0')  or die "Timed out while waiting for archiver to success";
+
In reply to: Michael Paquier (#33)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 16 Apr 2020 17:11:00 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Apr 14, 2020 at 06:03:19PM +0200, Jehan-Guillaume de Rorthais wrote:

Thanks for the new version.

Thank you for v6.

On Mon, 13 Apr 2020 16:14:14 +0900 Michael Paquier <michael@paquier.xyz>
wrote:

It seems to me that the initial value of SharedRecoveryState should be
CRASH_RECOVERING all the time no? StartupXLOG() is a code path taken
even if starting cleanly, and the flag would be reset correctly at the
end to NOT_IN_RECOVERY.

Previous version of the patch was setting CRASH_RECOVERING. Fujii-san
reported (on 4 Apr 2020 02:49:50 +0900) that it could be useful to expose
a better value until relevant code is reached in StartupXLOG() so
GetRecoveryState() returns a safer value for future use.

As I answered upthread, I'm not sure who would need this information before
the WAL machinery is up though. Note that ARCHIVE_RECOVERING and
CRASH_RECOVERING are compatible with the previous behavior.

I am not sure either if we need this information until the startup
process is up, but even if we need it I'd rather keep the code
consistent with the past practice, which was that
SharedRecoveryInProgress got set to true, the equivalent of crash
recovery as that's more generic than the archive recovery switch.

OK.

Maybe the solution would be to init with CRASH_RECOVERING and add some
comment in GetRecoveryState() to warn the state is "enforced" after the
XLOG machinery is started and is init'ed to RECOVERING in the meantime?

I initialized it to CRASH_RECOVERING in the new v5 patch and added a comment
to GetRecoveryState().

Not sure that the comment is worth it. Your description of the state
looks enough, and the code is clear that we have just an initial
shared memory state in this case.

OK. Will remove later after your review of the tests.

It would be good to mention that while in crash recovery, files are
not considered as deletable.

Well, in fact, I am still wondering about this. I was hesitant to add a
shortcut like:

/* no WAL segment cleanup during crash recovery */
if (recoveryState == RECOVERT_STATE_CRASH)
return false;

But, what if for example we crashed for lack of disk space during intensive
writes? During crash recovery, any WAL marked as .done could be removed and
allow the system to start again and maybe make even further WAL cleanup by
archiving some more WAL without competing with previous high write ratio.

When we recover a primary, this behavior seems consistent with any value of
archive_mode, even if it has been changed after crash and before starting
it. On a standby, we might create .ready files, but they will be removed
during the first restartpoint if needed.

I guess that you mean .ready and not .done here?

No, I meant .done.

Removing .done files does not matter as the segments related to them are
already gone.

Not necessarily. There's a time window between the moment the archiver sets
the .done file and when the checkpointer removes the associated WAL file.
So, after a PANIC because of lack of space, WAL associated with .done files but
not removed yet will be removed during the crash recovery.

Even with that, why should we need to make the backend smarter about
the removal of .ready files during crash recovery. It seems to me
that we should keep them, and an operator could always come by himself
and do some manual cleanup to free some space in the pg_wal partition.

We agree on this.

I needed a failure so I can test pg_stat_archiver reports it as well. In my
mind, using "false" would either trigger a failure because false returns 1
or... a failure because the command is not found. Either way, the result
is the same.

Using poll_query_until+pg_stat_file, is a good idea, but not enough as
archiver reports a failure some moment after the .ready signal file
appears.

Using an empty string makes the test more portable, but while I looked
at it I have found out that it leads to delays in the archiver except
if you force the generation of more segments in the test, causing the
logic to get more complicated with the manipulation of the .ready and
.done files. And I was then finding myself to add an
archive_timeout.. Anyway, this reduced the readability of the test so
I am pretty much giving up on this idea.

Unless I'm wrong, the empty string does not raise an error in pg_stat_archiver,
and I wanted to add a test on this as well.

The end of the tests actually relies on the
fact that archive_command is set to "false" when the cold backup is
taken, before resetting it. I think that it would be cleaner to
enforce the configuration you want to test before starting each
standby. It becomes easier to understand the flow of the test for the
reader.

done as well.

I have put my hands on the code, and attached is a cleaned up
version for the backend part. Below are some notes.

+ * RECOVERT_STATE_ARCHIVE is set for archive recovery or for a standby.
Typo here that actually comes from my previous email, and that you
blindly copy-pasted, repeated five times in the tree actually.

Oops...yes, even in a comment with RECOVERT_STATE_NONE :/
Sorry.

+   RecoveryState   recoveryState = GetRecoveryState();
+
+   /* The file is always deletable if archive_mode is "off". */
+   if (!XLogArchivingActive())
+       return true;
There is no point in calling GetRecoveryState() if archive_mode = off.

good catch!

+ * There's two different reasons to recover WAL: when standby mode is requested
+ * or after a crash to catchup with consistency.
Nit: s/There's/There are/.  Anyway, I don't see much point in keeping
this comment as the description of each state value should be enough,
so I have removed it.

OK

I am currently in the middle of reviewing the test and there are a
couple of things that can be improved, but I lack time today, so
I'll continue tomorrow on it. There is no need to send two separate
patches by the way as the code paths touched are different.

OK.

Thanks for your review! Let me know if you want me to add/change/fix some tests.

Regards,

#35Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#34)
1 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote:

On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael@paquier.xyz> wrote:

Removing .done files does not matter as the segments related to them are
already gone.

Not necessarily. There's a time window between the moment the archiver sets
the .done file and when the checkpointer removes the associated WAL file.
So, after a PANIC because of lack of space, WAL associated with .done files but
not removed yet will be removed during the crash recovery.

Not sure that it is something that matters for this thread though, so
if necessary I think that it could be discussed separately.

Even with that, why should we need to make the backend smarter about
the removal of .ready files during crash recovery. It seems to me
that we should keep them, and an operator could always come by himself
and do some manual cleanup to free some space in the pg_wal partition.

We agree on this.

Okay.

Unless I'm wrong, the empty string does not raise an error in pg_stat_archiver,
and I wanted to add a test on this as well.

Exactly, it won't raise an error. Instead I switched to a poll
query with pg_stat_file() and .ready files, but this proved to
delay the test considerably unless we created more segments. And
your approach has the merit of being simpler, with only two segments
manipulated for the whole test. So I first tried my idea,
noticed the mess it introduced, and just kept your approach.

Thanks for your review! Let me know if you want me to add/change/fix some tests.

Thanks, I have worked more on the test, refactoring pieces related to
the segment names, adjusting some comments and fixing some of the
logic. Note that you introduced something incorrect at the creation
of $standby2, as you were updating postgresql.auto.conf for
$standby1.

I have noticed an extra issue while looking at the backend pieces
today: at the beginning of the REDO loop there is one place where
SharedRecoveryState *has* to be updated to a correct state (around
the comment "Update pg_control to show that we are..." in xlog.c), as
the startup process may decide to switch the control file state to
DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
update the new shared flag at this early stage. It did not matter
before because SharedRecoveryInProgress would only be "true" for both,
but that's not the case anymore as we need to distinguish between
crash recovery and archive recovery in the new flag. There is no
actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
as the initial shared memory state is already RECOVERY_STATE_CRASH,
but updating the flag makes the code more consistent IMO, so I
updated it anyway in the attached.

I have the feeling that I need to work a bit more on this patch, but
my impression is that we are getting to something committable here.

Thoughts?
--
Michael

Attachments:

recovery-archives-v7.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d36c..2cce871dc9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -166,6 +166,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_NONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -291,6 +299,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..b8f696ffdc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -221,8 +221,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_NONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -653,10 +654,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're either in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -4434,6 +4435,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5166,7 +5177,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
@@ -6871,7 +6882,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6884,6 +6901,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
@@ -7911,7 +7932,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_NONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8057,7 +8078,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_NONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8069,8 +8090,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8086,6 +8107,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is accurate after StartupXLOG() finished.  See
+ * details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..d71453f2e8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,17 +572,21 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
+
+	/* On standbys, the file is deletable if archive_mode is not "always". */
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
-		return true;
 
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..1ba237ee1d
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,218 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+
+if ($Config{osname} eq 'MSWin32')
+{
+
+	# some Windows Perls at least don't like IPC::Run's start/kill_kill regime.
+	plan skip_all => "Test fails on Windows perl";
+}
+else
+{
+	plan tests => 16;
+}
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  This allows keeping around
+# .ready files for the follow-up tests.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM SET archive_command TO 'false';
+	SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_1       = "pg_wal/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# We need to crash the cluster because next test checks the crash
+# recovery step do not removes non-archived WAL.
+$primary->stop('immediate');
+
+# Recovery tests for the achiving with a standby partially check
+# the recovery behavior when restoring a backup taken using eg.
+# a snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command is not set.  This is
+# necessary for the assumptions of the tests done with the
+# standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver}, '1')
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_2       = "pg_wal/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on does not keep .ready signal files inherited
+# from backup.  Note that this WAL segment existed in the backup.
+ok( !-f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup got removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup, that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 525d58e7f0..b319266bb3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1983,6 +1983,7 @@ RecordCacheEntry
 RecordCompareData
 RecordIOData
 RecoveryLockListsEntry
+RecoveryState
 RecoveryTargetTimeLineGoal
 RecoveryTargetType
 RectBox
#36Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#35)
Re: [BUG] non archived WAL removed during production crash recovery

On Fri, 17 Apr 2020 15:50:43 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote:

On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael@paquier.xyz>
wrote:

Removing .done files does not matter as the segments related to them are
already gone.

Not necessarily. There's a time window between the moment the archiver sets
the .done file and when the checkpointer removes the associated WAL file.
So, after a PANIC because of lack of space, WAL files associated with .done
files but not yet removed will be removed during crash recovery.

Not sure that it is something that matters for this thread though, so
if necessary I think that it could be discussed separately.

OK. However, unless I'm wrong, what I am describing as a desired behavior
is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
v8 should return false during crash recovery no matter the archive_mode setup,
or if we keep the current behavior. I vote for keeping it this way.
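
To make that window concrete, here is a purely illustrative sequence in plain
C; the segment name and paths are hypothetical and this is not backend code,
only the ordering of operations described above:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	/* hypothetical segment, for illustration only */
	const char *seg   = "pg_wal/000000010000000000000042";
	const char *ready = "pg_wal/archive_status/000000010000000000000042.ready";
	const char *done  = "pg_wal/archive_status/000000010000000000000042.done";

	/* archiver: archive_command succeeded, flip the status file to .done */
	(void) rename(ready, done);

	/*
	 * A PANIC here leaves an already-archived segment and its .done file
	 * on disk; crash recovery can then remove both without losing data.
	 */

	/* checkpointer, at the next checkpoint: the segment is now deletable */
	(void) unlink(seg);
	(void) unlink(done);

	return 0;
}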

[...]

Unless I'm wrong, the empty string does not raise an error in
pg_stat_archiver, and I wanted to add a test on this as well.

Exactly, it won't raise an error. Instead I switched to using a poll
query with pg_stat_file() and .ready files, but this proved to
delay the test considerably if we did not create more segments. And
your approach has the merit of being simpler, with only two segments
manipulated for the whole test. So I first tried my idea,
noticed the mess it introduced, and just kept your approach.

Maybe we could use something more common for all platforms? E.g.:

archive_command='this command does not exist'

At least, we would have the same error everywhere, as far as it could matter...

Thanks for your review! Let me know if you want me to add/change/fix some
tests.

Thanks, I have worked more on the test, refactoring pieces related to
the segment names, adjusting some comments and fixing some of the
logic. Note that you introduced something incorrect at the creation
of $standby2 as you have been updating postgresql.auto.conf for
$standby1.

erf, last minute quick edit with lack of review on my side :(

I have noticed an extra issue while looking at the backend pieces
today: at the beginning of the REDO loop we forgot one place where
SharedRecoveryState *has* to be updated to a correct state (around
the comment "Update pg_control to show that we are..." in xlog.c) as
the startup process may decide to switch the control file state to
DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
update the new shared flag at this early stage. It did not matter
before because SharedRecoveryInProgress would be only "true" for both,
but that's not the case anymore as we need to make the difference
between crash recovery and archive recovery in the new flag. There is
no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
as the initial shared memory state is RECOVERY_STATE_CRASH, but
updating the flag makes the code more consistent IMO so I updated it
anyway in the attached.

Grmbl...I had this logic the other way around: init with
RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
based on ControlFile->state.

Sorry, I forgot it after discussing the init value in v5 :(

Regards,

#37Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#36)
Re: [BUG] non archived WAL removed during production crash recovery

On Fri, Apr 17, 2020 at 03:33:04PM +0200, Jehan-Guillaume de Rorthais wrote:

On Fri, 17 Apr 2020 15:50:43 +0900 Michael Paquier <michael@paquier.xyz> wrote:

Not sure that it is something that matters for this thread though, so
if necessary I think that it could be discussed separately.

OK. However, unless I'm wrong, what I am describing as a desired behavior
is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
v8 should return false during crash recovery no matter the archive_mode setup,
or if we keep the current behavior. I vote for keeping it this way.

I would rather avoid that now, as we don't check explicitly for crash
recovery in this code path. And for the purpose of this patch it is
fine to stick with the extra check on a standby with
(RECOVERY_STATE_ARCHIVE && archive_mode = always).
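
For readers following along, here is a minimal standalone sketch of the
decision this boils down to. This is not the patch itself: the function and
parameter names are made up, and the .done/.ready status-file handling of the
real XLogArchiveCheckDone() is collapsed into a single done_file_exists flag.

#include <stdbool.h>

/*
 * Deletability of a WAL segment, as discussed in this thread:
 *   - archive_mode = off: always deletable;
 *   - archive recovery with archive_mode <> always: deletable;
 *   - everything else (primary in production, crash recovery, or a
 *     standby with archive_mode = always): keep the segment until the
 *     archiver has marked it .done.
 */
static bool
segment_is_deletable(bool archive_mode_on_or_always,
					 bool archive_mode_always,
					 bool in_archive_recovery,
					 bool done_file_exists)
{
	if (!archive_mode_on_or_always)
		return true;			/* archive_mode = off */

	if (!archive_mode_always && in_archive_recovery)
		return true;			/* standby with archive_mode = on */

	return done_file_exists;	/* otherwise wait for the archiver */
}

In particular, crash recovery falls into the last branch, which is what keeps
the .ready files of a primary from being recycled before they are archived.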

Maybe we could use something more common for all platforms? E.g.:

archive_command='this command does not exist'

At least, we would have the same error everywhere, as far as it could matter...

Yeah. We could try to do with "false" as command anyway, and see what
the buildfarm thinks. As the test is skipped on Windows, I would
assume that it does not matter much anyway. Let's see what others
think about this piece. I don't have plans to touch again this patch
until likely the middle of next week.

Thanks, I have worked more on the test, refactoring pieces related to
the segment names, adjusting some comments and fixing some of the
logic. Note that you introduced something incorrect at the creation
of $standby2 as you have been updating postgresql.auto.conf for
$standby1.

erf, last minute quick edit with lack of review on my side :(

No problem. It happens.

I have noticed an extra issue while looking at the backend pieces
today: at the beginning of the REDO loop we forgot one place where
SharedRecoveryState *has* to be updated to a correct state (around
the comment "Update pg_control to show that we are..." in xlog.c) as
the startup process may decide to switch the control file state to
DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
update the new shared flag at this early stage. It did not matter
before because SharedRecoveryInProgress would be only "true" for both,
but that's not the case anymore as we need to make the difference
between crash recovery and archive recovery in the new flag. There is
no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
as the initial shared memory state is RECOVERY_STATE_CRASH, but
updating the flag makes the code more consistent IMO so I updated it
anyway in the attached.

Grmbl...I had this logic the other way around: init with
RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
based on ControlFile->state.

Sorry, I forgot it after discussing the init value in v5 :(

Indeed. The extra initialization was part of v4, and got removed as
of v5. Still, it seems to me that this part was not complete without
updating the shared memory field correctly at the beginning of the
REDO processing as the last version of the patch does.
--
Michael

#38Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#37)
Re: [BUG] non archived WAL removed during production crash recovery

At Sat, 18 Apr 2020 18:26:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Fri, Apr 17, 2020 at 03:33:04PM +0200, Jehan-Guillaume de Rorthais wrote:

On Fri, 17 Apr 2020 15:50:43 +0900 Michael Paquier <michael@paquier.xyz> wrote:

Not sure that it is something that matters for this thread though, so
if necessary I think that it could be discussed separately.

OK. However, unless I'm wrong, what I am describing as a desired behavior
is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
v8 should return false during crash recovery no matter the archive_mode setup,
or if we keep the current behavior. I vote for keeping it this way.

I would rather avoid that now, as we don't check explicitly for crash
recovery in this code path. And for the purpose of this patch it is
fine to stick with the extra check on a standby with
(RECOVERY_STATE_ARCHIVE && archive_mode = always).

Commit 78ea8b5daa intends that WAL segments are properly removed
on a standby with archive_mode=on by not marking them .ready. The v7
patch actually lets such segments be marked .ready, but they are finally
removed after entering archive recovery, which preserves that commit's
intention in that perspective. (I'd prefer to distinguish
"ArchiveRecoveryRequested" somehow, but it would be more complex and it
is not the agreement on this thread.)

As a result, +1 to what v7 is doing, and to discussing earlier
removal of such WAL segments separately if needed.

Maybe we could use something more common for all platforms? E.g.:

archive_command='this command does not exist'

At least, we would have the same error everywhere, as far as it could matter...

Yeah. We could try to do with "false" as command anyway, and see what
the buildfarm thinks. As the test is skipped on Windows, I would
assume that it does not matter much anyway. Let's see what others
think about this piece. I don't have plans to touch again this patch
until likely the middle of next week.

Couldn't we use "/" as a globally-results-in-failure command? But
that doesn't increment failed_count. The reason is that pgarch_archiveXlog
exits with FATAL on the "is a directory" error. The comment asserts that
we exit with FATAL for SIGINT or SIGQUIT, and if so it is enough to
check only the exit-by-signal case. The following fix worked.

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..def6a68063 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -595,7 +595,7 @@ pgarch_archiveXlog(char *xlog)
 		 * "command not found" type of error.  If we overreact it's no big
 		 * deal, the postmaster will just start the archiver again.
 		 */
-		int			lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;
+		int			lev = wait_result_is_any_signal(rc, false) ? FATAL : LOG;

if (WIFEXITED(rc))
{

I didn't test it on Windows (I somehow broke my repo and it's too
slow to clone), but system("/") returned 1 and I think that result
increments the counter.

Thanks, I have worked more on the test, refactoring pieces related to
the segment names, adjusting some comments and fixing some of the
logic. Note that you introduced something incorrect at the creation
of $standby2 as you have been updating postgresql.auto.conf for
$standby1.

erf, last minute quick edit with lack of review on my side :(

No problem. It happens.

I have noticed an extra issue while looking at the backend pieces
today: at the beginning of the REDO loop we forgot one place where
SharedRecoveryState *has* to be updated to a correct state (around
the comment "Update pg_control to show that we are..." in xlog.c) as
the startup process may decide to switch the control file state to
DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
update the new shared flag at this early stage. It did not matter
before because SharedRecoveryInProgress would be only "true" for both,
but that's not the case anymore as we need to make the difference
between crash recovery and archive recovery in the new flag. There is
no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
as the initial shared memory state is RECOVERY_STATE_CRASH, but
updating the flag makes the code more consistent IMO so I updated it
anyway in the attached.

Grmbl...I had this logic the other way around: init with
RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
based on ControlFile->state.

Sorry, I forgot it after discussing the init value in v5 :(

Indeed. The extra initialization was part of v4, and got removed as
of v5. Still, it seems to me that this part was not complete without
updating the shared memory field correctly at the beginning of the
REDO processing as the last version of the patch does.

I may not be following the discussion, but I think it is reasonable
that SharedRecoveryState is initialized as CRASH, then moves to ARCHIVE
as needed, and finishes as NONE. That transition also keeps
RecoveryInProgress() stable.
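
A rough sketch of that lifecycle, using the v7 value names quoted just below
(RECOVERY_STATE_NONE is renamed RECOVERY_STATE_DONE later in the thread); the
helper name is made up and only illustrates the expected transitions:

#include <stdbool.h>

typedef enum RecoveryState
{
	RECOVERY_STATE_CRASH = 0,	/* initial value from shared memory init */
	RECOVERY_STATE_ARCHIVE,		/* archive (standby) recovery */
	RECOVERY_STATE_NONE			/* recovery finished, in production */
} RecoveryState;

/*
 * The startup process only moves the state forward: CRASH may become
 * ARCHIVE once archive recovery is requested, and the state always ends
 * up as NONE when recovery finishes.
 */
static bool
recovery_state_transition_is_expected(RecoveryState from, RecoveryState to)
{
	return to >= from;
}

RecoveryInProgress() then only needs to test for the final value, which is
why the transition keeps it stable.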

Other minor comments:

+ RECOVERY_STATE_NONE /* currently in production */

I think it would be better as RECOVERY_STATE_DONE.

By the way, I noticed that RecoveryState is exactly a subset of
DBState. And changes of SharedRecoveryState happen side-by-side with
ControlFileData->state in most places. Couldn't we just use
ControlFile->state instead of SharedRecoveryState?

By the way I found a typo.

+# Recovery tests for the achiving with a standby partially check
s/achiving/archiving/

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#39Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#38)
1 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Mon, Apr 20, 2020 at 04:02:31PM +0900, Kyotaro Horiguchi wrote:

At Sat, 18 Apr 2020 18:26:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in
As a result, +1 to what v7 is doing, and to discussing earlier
removal of such WAL segments separately if needed.

Thanks for the extra review.

Yeah. We could try to do with "false" as command anyway, and see what
the buildfarm thinks. As the test is skipped on Windows, I would
assume that it does not matter much anyway. Let's see what others
think about this piece. I don't have plans to touch again this patch
until likely the middle of next week.

Couldn't we use "/" as a globally-results-in-failure command? But
that doesn't increment failed_count. The reason is that pgarch_archiveXlog
exits with FATAL on the "is a directory" error. The comment asserts that
we exit with FATAL for SIGINT or SIGQUIT, and if so it is enough to
check only the exit-by-signal case. The following fix worked.

Yeah, I was working on this stuff today and I noticed this problem. I
was just going to send an email on the matter with a more portable
patch and you also just beat me to it with this one :)

So yes, using "false" may be a bad idea because we cannot rely on the
case where the command does not exist in an environment in this test.
After more testing, I have been hit hard by the fact that the
archiver exits immediately if an archive command cannot be found
(errcode = 127), and it does not report this failure back to
pg_stat_archiver, which would cause the test to wait until the timeout
of poll_query_until() kills the test. There is however an extra
method not mentioned yet on this thread: we know that cp/copy is
portable enough per the buildfarm, so let's use a copy command that we
know *will* fail. A simple way of doing this is a command where the
origin file does not exist.

--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -595,7 +595,7 @@ pgarch_archiveXlog(char *xlog)
* "command not found" type of error.  If we overreact it's no big
* deal, the postmaster will just start the archiver again.
*/
-		int			lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;
+		int			lev = wait_result_is_any_signal(rc, false) ? FATAL : LOG;

if (WIFEXITED(rc))
{

I didn't test it on Windows (I somehow broke my repo and it's too
slow to clone), but system("/") returned 1 and I think that result
increments the counter.

No, this would be a behavior change, which is not acceptable in my
view. (By the way, just nuke your full repo if it does not work
anymore on Windows, this method works).

Indeed. The extra initialization was part of v4, and got removed as
of v5. Still, it seems to me that this part was not complete without
updating the shared memory field correctly at the beginning of the
REDO processing as the last version of the patch does.

I may not be following the discussion, but I think it is reasonable
that SharedRecoveryState is initialized as CRASH, then moves to ARCHIVE
as needed, and finishes as NONE. That transition also keeps
RecoveryInProgress() stable.

I thought about that as well over the weekend, and that's still the best
option to me.

I think it would be better as RECOVERY_STATE_DONE.

I like this suggestion better than the original in v7.

By the way, I noticed that RecoveryState is exactly a subset of
DBState. And changes of SharedRecoveryState happen side-by-side with
ControlFileData->state in most places. Couldn't we just use
ControlFile->state instead of SharedRecoveryState?

I actually found it confusing to use the same thing, because then the
reader would think that SharedRecoveryState could be set to more
values, but we don't want that.

By the way I found a typo.

+# Recovery tests for the achiving with a standby partially check
s/achiving/archiving/

Thanks, fixed.

Attached is an updated patch, where I tweaked more comments.

Jehan-Guillaume, who is your colleague who originally found this
problem? We should credit him in the commit message.
--
Michael

Attachments:

v8-0001-Fix-handling-of-.ready-files-during-crash-recover.patchtext/x-diff; charset=us-asciiDownload
From cbe9310cc2f2e065708e7e9e05cf6a94043d484d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 20 Apr 2020 16:30:32 +0900
Subject: [PATCH v8] Fix handling of .ready files during crash recovery

78ea8b5 introduced a fix to correctly recycle WAL segments on a standby
when archive_mode = always, however it has introduced a regression with
the handling of .ready files during crash recovery of a primary, causing
those files to be removed.

A set of TAP tests is added to close the gap here, making sure that
archive status files are correctly handled when a cluster is in archive
or crash recovery with archive_mode set to "on" or "always, for both
standbys and primarys.

Reported-by: ???
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Backpatch-through: 9.5
---
 src/include/access/xlog.h                 |   9 +
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 +-
 src/test/recovery/t/020_archive_status.pl | 225 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   1 +
 5 files changed, 297 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d36c..0a12afb59e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -166,6 +166,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -291,6 +299,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..009a071276 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -221,8 +221,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -653,10 +654,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -4434,6 +4435,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5166,7 +5177,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
@@ -6871,7 +6882,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6884,6 +6901,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
@@ -7911,7 +7932,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8057,7 +8078,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8069,8 +8090,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8086,6 +8107,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..55becd65d4 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,18 +572,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..688014862a
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,225 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More;
+use Config;
+
+if ($Config{osname} eq 'MSWin32')
+{
+
+	# some Windows Perls at least don't like IPC::Run's start/kill_kill regime.
+	plan skip_all => "Test fails on Windows perl";
+}
+else
+{
+	plan tests => 16;
+}
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command that is known to work but
+# will fail: a copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_1       = "pg_wal/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# We need to crash the cluster because next test checks the crash
+# recovery step do not removes non-archived WAL.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver}, '1')
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_2       = "pg_wal/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on does not keep .ready signal files inherited
+# from backup.  Note that this WAL segment existed in the backup.
+ok( !-f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup got removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup, that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 525d58e7f0..b319266bb3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1983,6 +1983,7 @@ RecordCacheEntry
 RecordCompareData
 RecordIOData
 RecoveryLockListsEntry
+RecoveryState
 RecoveryTargetTimeLineGoal
 RecoveryTargetType
 RectBox
-- 
2.26.0

#40Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#39)
Re: [BUG] non archived WAL removed during production crash recovery

On Mon, 20 Apr 2020 16:34:44 +0900
Michael Paquier <michael@paquier.xyz> wrote:
[...]

By the way, I noticed that RecoveryState is exactly a subset of
DBState. And changes of SharedRecoveryState happen side-by-side with
ControlFileData->state in most places. Couldn't we just use
ControlFile->state instead of SharedRecoveryState?

I actually found it confusing to use the same thing, because then the
reader would think that SharedRecoveryState could be set to more
values, but we don't want that.

I thought about this while studying various possible fixes.

/messages/by-id/20200401181735.11100908@firost

The problem is that we would have to read the controldata file each time we
wonder if a segment should be archived/removed. Moreover, the controldata
file might not be in sync quickly enough with the real state for some other
code path or future needs.

[...]

Attached is an updated patch, where I tweaked more comments.

Jehan-Guillaume, who is your colleague who originally found this
problem? We should credit him in the commit message.

Indeed, Benoît Lobréau reported this behavior to me.

Regards,

#41Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#40)
Re: [BUG] non archived WAL removed during production crash recovery

On Mon, Apr 20, 2020 at 02:22:35PM +0200, Jehan-Guillaume de Rorthais wrote:

The problem is that we would have to read the controldata file each time we
wonder if a segment should be archived/removed. Moreover, the controldata
file might not be in sync quickly enough with the real state for some other
code path or future needs.

I don't think that this is what Horiguchi-san meant here. What I got
from his previous message would be to copy the shared value from
the control file when necessary, and have the shared state use only a
subset of the existing values of DBState, aka:
- DB_IN_CRASH_RECOVERY
- DB_IN_ARCHIVE_RECOVERY
- DB_IN_PRODUCTION
Still, that sounds wrong to me because then somebody would be tempted
to change the shared value thinking that things like DB_SHUTDOWNING,
DB_SHUTDOWNED_* or DB_STARTUP are valid but we don't want that here.
Note that there may be a case for DB_STARTUP to be used in
XLOGShmemInit(), but I'd rather let the code use the safest default,
DB_IN_CRASH_RECOVERY to control that we won't remove .ready files by
default until the startup process sees fit to do the actual switch
depending on the checkpoint record lookup, if archive recovery was
actually requested, etc.
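
To make the trade-off concrete, here is a hypothetical sketch (not something
proposed for commit) of what consulting a shared DBState value would look
like; the helper name is made up, and DBState and its values come from
catalog/pg_control.h:

#include "postgres.h"
#include "catalog/pg_control.h"		/* DBState and the DB_* values */

static bool
recovery_is_archive_recovery(DBState shared_db_state)
{
	switch (shared_db_state)
	{
		case DB_IN_ARCHIVE_RECOVERY:
			return true;
		case DB_IN_CRASH_RECOVERY:
		case DB_IN_PRODUCTION:
			return false;
		default:
			/* DB_STARTUP, DB_SHUTDOWNING, etc. are not expected here */
			return false;
	}
}

Every other DBState value fits in the same shared variable just as well,
which is exactly the temptation the dedicated RecoveryState enum avoids.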

Indeed, Benoît Lobréau reported this behavior to me.

Noted. Thanks for the information. I don't think that I have ever
met Benoît in person, do I? Tell him that I owe him one beer or a
beverage of his choice when we meet IRL, and that he had better use
this message-id to make me keep my promise :)
--
Michael

#42Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#41)
Re: [BUG] non archived WAL removed during production crash recovery

At Tue, 21 Apr 2020 11:15:01 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Mon, Apr 20, 2020 at 02:22:35PM +0200, Jehan-Guillaume de Rorthais wrote:

The problem is that we would have to read the controldata file each time we
wonder if a segment should be archived/removed. Moreover, the controldata
file might not be in sync quickly enough with the real state for some other
code path or futur needs.

I don't think that this is what Horiguchi-san meant here. What I got
from his previous message would be to copy the shared value from
the control file when necessary, and have the shared state use only a
subset of the existing values of DBState, aka:
- DB_IN_CRASH_RECOVERY
- DB_IN_ARCHIVE_RECOVERY
- DB_IN_PRODUCTION

First I thought as above, but then I thought that we could use
ControlFile->state itself in this case, by regarding the symbols less
than DB_IN_CRASH_RECOVERY as RECOVERY_STATE_CRASH. I don't think
there's any problem if the update to DB_IN_ARCHIVE_RECOVERY reaches
the checkpointer with some delay.

Still, that sounds wrong to me because then somebody would be tempted
to change the shared value thinking that things like DB_SHUTDOWNING,
DB_SHUTDOWNED_* or DB_STARTUP are valid but we don't want that here.

That is not an issue if we just use DBState to know whether we have
started archive recovery.

Note that there may be a case for DB_STARTUP to be used in
XLOGShmemInit(), but I'd rather let the code use the safest default,
DB_IN_CRASH_RECOVERY to control that we won't remove .ready files by
default until the startup process sees fit to do the actual switch
depending on the checkpoint record lookup, if archive recovery was
actually requested, etc.

I'm not sure I read this correctly, but I think I agree with this.

+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)

Is rewritten as

+ if (!XLogArchivingAlways() &&
+ GetDBState() > DB_IN_CRASH_RECOVERY)

FWIW, what annoyed me is that there are three variables that are quite
similar but have different domains: ControlFile->state,
XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
mind there being two, but three seems a bit too many to me.

But it may be a different issue.

Indeed, Benoît Lobréau reported this behavior to me.

Noted. Thanks for the information. I don't think that I have ever
met Benoît in person, do I? Tell him that I owe him one beer or a
beverage of his choice when we meet IRL, and that he had better use
this message-id to make me keep my promise :)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#43Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#42)
Re: [BUG] non archived WAL removed during production crash recovery

On Tue, Apr 21, 2020 at 12:09:25PM +0900, Kyotaro Horiguchi wrote:

+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)

Is rewritten as

+ if (!XLogArchivingAlways() &&
+ GetDBState() > DB_IN_CRASH_RECOVERY)

FWIW, what annoyed me is that there are three variables that are quite
similar but have different domains: ControlFile->state,
XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
mind there being two, but three seems a bit too many to me.

That's actually the pattern I would avoid for clarity. There is no
need to add more dependencies to the entries of DBState for the sake
of this patch, and this smells like a trap if more values are added to
it in an order that does not match what we have been assuming in the
context of this thread.
--
Michael

#44Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#43)
Re: [BUG] non archived WAL removed during production crash recovery

At Tue, 21 Apr 2020 13:57:39 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Tue, Apr 21, 2020 at 12:09:25PM +0900, Kyotaro Horiguchi wrote:

+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)

Is rewritten as

+ if (!XLogArchivingAlways() &&
+ GetDBState() > DB_IN_CRASH_RECOVERY)

FWIW, what annoyed me is that there are three variables that are quite
similar but have different domains: ControlFile->state,
XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
mind there being two, but three seems a bit too many to me.

That's actually the pattern I would avoid for clarity. There is no
need to add more dependencies to the entries of DBState for the sake
of this patch, and this smells like a trap if more values are added to
it in an order that does not match what we have been assuming in the
context of this thread.

Yes. Anyway, that would be another issue, if it is an issue.

I'm fine with the current state.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#45Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Kyotaro Horiguchi (#44)
Re: [BUG] non archived WAL removed during production crash recovery

On Tue, 21 Apr 2020 15:08:17 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Tue, 21 Apr 2020 13:57:39 +0900, Michael Paquier <michael@paquier.xyz>
wrote in

On Tue, Apr 21, 2020 at 12:09:25PM +0900, Kyotaro Horiguchi wrote:

+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)

Is rewritten as

+ if (!XLogArchivingAlways() &&
+ GetDBState() > DB_IN_CRASH_RECOVERY)

FWIW, what annoyed me is that there are three variables that are quite
similar but have different domains: ControlFile->state,
XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
mind there being two, but three seems a bit too many to me.

In fact, my original goal [1] was to avoid adding another shared boolean related
to the same topic. So I understand your feeling.

I played with this idea again, based on your argument that there's no problem
if the update to DB_IN_ARCHIVE_RECOVERY reaches the checkpointer with some delay.

The other point I feel bad about is having to open and check the controlfile again
and again for each segment to archive, even on a running production instance.
It's not that it would be heavy, but it feels like overkill to fetch this
information when it should be available more easily.

That leads me to an idea where we would keep the ControlFile data up-to-date in
shared memory. There are a few duplicates between ControlFile and XLogCtl, so
maybe it could make the code a little simpler at some other places than just
fixing $SUBJECT using DBState? I'm not sure of the implications and impacts
though. This seems way bigger than the current fix, with many traps on the
way. Maybe we could discuss this in another thread if you think it deserves it?

Regards,

[1]: /messages/by-id/20200403182625.0fccc6fd@firost

#46Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#41)
Re: [BUG] non archived WAL removed during production crash recovery

Hello,

I did another round of review of v8.

- LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+ LocalRecoveryInProgress = (xlogctl->SharedRecoveryState !=
  RECOVERY_STATE_DONE);

Do we need to acquire info_lck to look at the state here, as we do in
GetRecoveryState()? Why was it missing from the previous code, where
SharedRecoveryInProgress was protected by info_lck as well?

Plus, the new line length overflows the 80-column limit, but I'm not sure where to
break this line.

+if ($Config{osname} eq 'MSWin32')
+{
+
+	# some Windows Perls at least don't like IPC::Run's start/kill_kill
regime.
+	plan skip_all => "Test fails on Windows perl";
+}

In fact, this was inherited from 011_crash_recovery.pl where I originally
added some tests. As 020_archive_status.pl doesn't use IPC::Run, the comment is
wrong. But I wonder if this whole block is really needed. Unfortunately I can't
test on MSWin32 :/

On Tue, 21 Apr 2020 11:15:01 +0900
Michael Paquier <michael@paquier.xyz> wrote:

Indeed, Benoît Lobréau reported this behavior to me.

Noted. Thanks for the information. I don't think that I have ever
met Benoît in person, do I?

I don't think so.

Tell him that I owe him one beer or a beverage of his choice when we meet
IRL, and that he had better use this message-id to make me keep my promise :)

I told him (but I'm sure he was reading anyway :)).

Regards,

#47Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#45)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, Apr 22, 2020 at 12:00:22AM +0200, Jehan-Guillaume de Rorthais wrote:

That leads me to an idea where we would keep the ControlFile data up-to-date in
shared memory. There are a few duplicates between ControlFile and XLogCtl, so
maybe it could make the code a little simpler at some other places than just
fixing $SUBJECT using DBState? I'm not sure of the implications and impacts
though. This seems way bigger than the current fix, with many traps on the
way. Maybe we could discuss this in another thread if you think it deserves it?

It seems to me that this could have wider applications than just the
recovery state, no? I would recommend keeping this discussion on a
separate thread to give more visibility to the topic.
--
Michael

#48Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#46)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, Apr 22, 2020 at 12:41:21AM +0200, Jehan-Guillaume de Rorthais wrote:

Do we need to acquire info_lck to look at the state here, as we do in
GetRecoveryState()? Why was it missing from the previous code, where
SharedRecoveryInProgress was protected by info_lck as well?

Please see 1a3d104.

Plus, the new line length overflows the 80-column limit, but I'm not sure where to
break this line.

pgindent has been run on v8, and it did not complain.

In fact, this was inherited from 011_crash_recovery.pl where I originally
added some tests. As 020_archive_status.pl doesn't use IPC::Run, the comment is
wrong. But I wonder if this whole block is really needed. Unfortunately I can't
test on MSWin32 :/

You are right here. The restriction can be removed, and I have
checked that the test from v8 is able to pass on my Windows dev VM.
--
Michael

#49Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#48)
6 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, Apr 22, 2020 at 10:19:35AM +0900, Michael Paquier wrote:

You are right here. The restriction can be removed, and I have
checked that the test from v8 is able to pass on my Windows dev VM.

Attached are versions for each branch down to 9.5. While working on
the backpatch, I have not found major conflicts except one thing:
up to 10, Postgres does WAL segment recycling only after two completed
checkpoints, and the 8th test of the script relies on the behavior of
11 and newer, where one completed checkpoint is enough (first .ready file
present in the cold backup but removed from $standby1). I have taken the
simplest approach to fix the test by checking that the .ready file actually
exists, while the rest of the test remains the same.

It is worth noting that for 9.5 and 9.6 the test had compatibility
issues with the renaming of pg_xlog to pg_wal, including paths and
functions. The calls to poll_query_until() also needed tweaks, but
I got the tests to work.
--
Michael

Attachments:

0001-Fix-handling-of-WAL-segments-ready-to-be-arch-master.patchtext/x-diff; charset=iso-8859-1Download
From 8734b386e1e077be4d6e5353c08ab5a40b17e60f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 22 Apr 2020 14:48:25 +0900
Subject: [PATCH] Fix handling of WAL segments ready to be archived during
 crash recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/include/access/xlog.h                 |   9 +
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 ++-
 src/test/recovery/t/020_archive_status.pl | 214 ++++++++++++++++++++++
 src/tools/pgindent/typedefs.list          |   1 +
 5 files changed, 286 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f60ed2d36c..0a12afb59e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -166,6 +166,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -291,6 +299,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 11e32733c4..009a071276 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -221,8 +221,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -653,10 +654,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we allow hot standby queries to be
@@ -4434,6 +4435,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5166,7 +5177,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->SharedPromoteIsTriggered = false;
 	XLogCtl->WalWriterSleeping = false;
@@ -6871,7 +6882,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6884,6 +6901,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
@@ -7911,7 +7932,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8057,7 +8078,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8069,8 +8090,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8086,6 +8107,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..55becd65d4 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -572,18 +572,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..2369b34712
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,214 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_1       = "pg_wal/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver}, '1')
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_2       = "pg_wal/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on does not keep .ready signal files inherited
+# from backup.  Note that this WAL segment existed in the backup.
+ok( !-f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup got removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 525d58e7f0..b319266bb3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1983,6 +1983,7 @@ RecordCacheEntry
 RecordCompareData
 RecordIOData
 RecoveryLockListsEntry
+RecoveryState
 RecoveryTargetTimeLineGoal
 RecoveryTargetType
 RectBox
-- 
2.26.0
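
(Side note for readers following the thread, not part of any of the attached patches: on a live cluster you can check from SQL which WAL segments are still waiting for the archiver, assuming PostgreSQL 12 or newer where pg_ls_archive_statusdir() is available.)

-- Segments with a .ready status file, i.e. not yet archived
-- (requires superuser or pg_monitor membership).
SELECT name
  FROM pg_ls_archive_statusdir()
 WHERE name LIKE '%.ready'
 ORDER BY name;

-- Archiver activity as reported by the statistics collector.
SELECT archived_count, last_archived_wal, failed_count, last_failed_wal
  FROM pg_stat_archiver;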

0001-Fix-handling-of-WAL-segments-ready-to-be-archived-12.patch (text/x-diff; charset=iso-8859-1)
From 94fd9d2a5791a6ffe8bb3ace75a7e263fd91ff1c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 22 Apr 2020 14:48:25 +0900
Subject: [PATCH] Fix handling of WAL segments ready to be archived during
 crash recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/include/access/xlog.h                 |   9 +
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 ++-
 src/test/recovery/t/020_archive_status.pl | 214 ++++++++++++++++++++++
 4 files changed, 285 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..d6ab348cb6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -164,6 +164,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -277,6 +285,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 05cb6aa022..f181c4d144 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -217,8 +217,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -643,10 +644,10 @@ typedef struct XLogCtlData
 	TimeLineID	PrevTimeLineID;
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4357,6 +4358,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5069,7 +5080,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6758,7 +6769,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6771,6 +6788,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
@@ -7785,7 +7806,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -7931,7 +7952,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7943,8 +7964,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -7960,6 +7981,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index e14bcf8ea6..b3e5338766 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -617,18 +617,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..2369b34712
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,214 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_1       = "pg_wal/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver}, '1')
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_2       = "pg_wal/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on does not keep .ready signal files inherited
+# from backup.  Note that this WAL segment existed in the backup.
+ok( !-f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup got removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.26.0

0001-Fix-handling-of-WAL-segments-ready-to-be-archived-11.patch (text/x-diff; charset=iso-8859-1)
From 9fa0d187375b7ce2ed15b527d42fd78ca1c70d42 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 22 Apr 2020 14:48:25 +0900
Subject: [PATCH] Fix handling of WAL segments ready to be archived during
 crash recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/include/access/xlog.h                 |   9 +
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 ++-
 src/test/recovery/t/020_archive_status.pl | 214 ++++++++++++++++++++++
 4 files changed, 285 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ef6aabde9f..1d9c87cb82 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -129,6 +129,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -243,6 +251,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 62f87ed174..0017bcf85b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -211,8 +211,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -637,10 +638,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4306,6 +4307,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -5047,7 +5058,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6859,7 +6870,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6872,6 +6889,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
@@ -7896,7 +7917,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -8042,7 +8063,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -8054,8 +8075,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8071,6 +8092,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 2d2812b59c..f8b8504e41 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -610,18 +610,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..2369b34712
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,214 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_1       = "pg_wal/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver}, '1')
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_2       = "pg_wal/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on does not keep .ready signal files inherited
+# from backup.  Note that this WAL segment existed in the backup.
+ok( !-f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup got removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.26.0

0001-Fix-handling-of-WAL-segments-ready-to-be-archived-10.patch (text/x-diff; charset=iso-8859-1)
From 10809521faf22fa63f70d016f3f317796be5849d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 22 Apr 2020 14:48:25 +0900
Subject: [PATCH] Fix handling of WAL segments ready to be archived during
 crash recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/include/access/xlog.h                 |   9 +
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 ++-
 src/test/recovery/t/020_archive_status.pl | 215 ++++++++++++++++++++++
 4 files changed, 286 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8751d86cbd..d41fefc185 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -128,6 +128,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -242,6 +250,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ec66b99b32..c10dc889c6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -209,8 +209,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -635,10 +636,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4295,6 +4296,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4980,7 +4991,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6803,7 +6814,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6816,6 +6833,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7841,7 +7862,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -7987,7 +8008,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7999,8 +8020,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8016,6 +8037,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 7825cfe532..13f18ae85d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -608,18 +608,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..390b85702b
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,215 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_1       = "pg_wal/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver}, '1')
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_2       = "pg_wal/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	q{ SELECT last_archived_wal FROM pg_stat_archiver },
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on keeps .ready signal files inherited
+# from backup after one checkpoint.  Note that this WAL segment
+# existed in the backup.
+ok( -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup not removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.26.0

0001-Fix-handling-of-WAL-segments-ready-to-be-archived-96.patch (text/x-diff; charset=iso-8859-1)
From d9baaf1b8f9ce9afec0b2c20610aaf64d05f4c5c Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 22 Apr 2020 14:48:25 +0900
Subject: [PATCH] Fix handling of WAL segments ready to be archived during
 crash recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/include/access/xlog.h                 |   9 +
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 ++-
 src/test/recovery/t/020_archive_status.pl | 213 ++++++++++++++++++++++
 4 files changed, 284 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ab97a61bbb..28f716d4cf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -125,6 +125,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -231,6 +239,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 823efdbed0..acca98ea28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -204,8 +204,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory.
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -616,10 +617,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4136,6 +4137,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4819,7 +4830,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6541,7 +6552,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6554,6 +6571,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7555,7 +7576,7 @@ StartupXLOG(void)
 	 * updates to shared memory.)
 	 */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -7698,7 +7719,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7710,8 +7731,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -7727,6 +7748,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 99704c1469..3cc8021cb8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -619,18 +619,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..7cf367d8bf
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,213 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_1       = "pg_xlog/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count = 1 FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_2       = "pg_xlog/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	qq{ SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver })
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on keeps .ready signal files inherited
+# from backup after one checkpoint.  Note that this WAL segment
+# existed in the backup.
+ok( -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup not removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	qq{SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.26.0

0001-Fix-handling-of-WAL-segments-ready-to-be-archived-95.patch (text/x-diff; charset=iso-8859-1)
From 3dd404655cdc1667f5b2274cfc30c4d24ff988de Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 22 Apr 2020 14:48:25 +0900
Subject: [PATCH] Fix handling of WAL segments ready to be archived during
 crash recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/include/access/xlog.h                 |   9 +
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 ++-
 src/test/recovery/t/020_archive_status.pl | 213 ++++++++++++++++++++++
 4 files changed, 284 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 59ebccb19e..d051ee02bc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -126,6 +126,14 @@ typedef enum WalLevel
 } WalLevel;
 extern int	wal_level;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
 	(XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level >= WAL_LEVEL_ARCHIVE)
@@ -230,6 +238,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a378df9d76..a6cdb0024f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -203,8 +203,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory.
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -608,10 +609,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4083,6 +4084,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4771,7 +4782,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6493,7 +6504,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6506,6 +6523,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7496,7 +7517,7 @@ StartupXLOG(void)
 	 * updates to shared memory.)
 	 */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -7639,7 +7660,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7651,8 +7672,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -7668,6 +7689,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index c899594fb6..e36a85ccab 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -619,18 +619,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..7cf367d8bf
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,213 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_1       = "pg_xlog/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count = 1 FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_2       = "pg_xlog/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+$primary->poll_query_until('postgres',
+	qq{ SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver })
+  or die "Timed out while waiting for archiving to finish";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', "archive_mode = on");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on keeps .ready signal files inherited
+# from backup after one checkpoint.  Note that this WAL segment
+# existed in the backup.
+ok( -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup not removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	qq{SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'2',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.26.0

In reply to: Michael Paquier (#48)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 22 Apr 2020 10:19:35 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Apr 22, 2020 at 12:41:21AM +0200, Jehan-Guillaume de Rorthais wrote:

Do we need to acquire info_lck to look at the state here, as we do in
GetRecoveryState()? Why is it missing from the previous code, where
SharedRecoveryInProgress was protected by info_lck as well?

Please see 1a3d104.

Understood. Interesting. Thank you.
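For the archives, here is my reading of the difference, condensed from the
attached patch (a simplified excerpt for illustration only, not meant to
compile on its own; all symbols come from the patch itself):
GetRecoveryState() must return the exact enum value, kept consistent with
the control file, so it takes info_lck; RecoveryInProgress() only latches a
boolean that flips once at the end of recovery, so a lock-free read plus a
memory barrier before refreshing TimeLineID and RedoRecPtr is enough.

/* Exact state: take info_lck so the value stays consistent with pg_control. */
RecoveryState
GetRecoveryState(void)
{
    RecoveryState retval;

    SpinLockAcquire(&XLogCtl->info_lck);
    retval = XLogCtl->SharedRecoveryState;
    SpinLockRelease(&XLogCtl->info_lck);

    return retval;
}

/*
 * Boolean check: once the local copy says "done" we never look at shared
 * memory again, and a stale read only means we check once more later, so
 * no spinlock is needed.  The barrier ensures TimeLineID and RedoRecPtr
 * are read after the shared state on machines with weak memory ordering.
 */
bool
RecoveryInProgress(void)
{
    if (!LocalRecoveryInProgress)
        return false;
    else
    {
        volatile XLogCtlData *xlogctl = XLogCtl;

        LocalRecoveryInProgress =
            (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);

        if (!LocalRecoveryInProgress)
        {
            pg_memory_barrier();
            InitXLOGAccess();
        }

        return LocalRecoveryInProgress;
    }
}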

Plus, the new line length overflows the 80-column limit, but I'm not sure
where to break this line.

pgindent has been run on v8, and it did not complain.

OK.

In fact, this was inherited from 011_crash_recovery.pl where I originally
added some tests. As 020_archive_status.pl doesn't use IPC::Run, the
comment is wrong. But I wonder if this whole block is really needed.
Unfortunately I can't test on MSWin32 :/

You are right here. The restriction can be removed, and I have
checked that the test from v8 is able to pass on my Windows dev VM.

Thanks!

In reply to: Michael Paquier (#49)
1 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 22 Apr 2020 16:32:23 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Apr 22, 2020 at 10:19:35AM +0900, Michael Paquier wrote:

You are right here. The restriction can be removed, and I have
checked that the test from v8 is able to pass on my Windows dev VM.

Attached are versions for each branch down to 9.5. While working on
the backpatch, I have not found major conflicts except one thing:
up to 10, Postgres does WAL segment recycling after two completed
checkpoints, and the 8th test of the script relies on the behavior of
11 and newer, where one completed checkpoint is enough (the first .ready
file is present in the cold backup but removed from $standby1).  I have taken the simplest
approach to fix the test by checking that the .ready file actually
exists, while the rest of the test remains the same.

This test seems useless to me. It should either be removed or patched to
test that the .ready signal file has been removed after a second restartpoint.

Please find attached a patch for 9.6 implementing this. If it seems
reasonable to you, I can create the backpatch to 9.5.

It is worth noting that for 9.5 and 9.6 the test had compatibility
issues with the renaming of pg_xlog to pg_wal, including paths and
functions. The calls to poll_query_until() also needed tweaks, but
I got the tests to work.

Thanks for the backpatching work!

Regards,

Attachments:

0001-v9-Fix-WAL-retention-during-production-crash-recovery.patch (text/x-patch)
From 4cd3bf655ea80f75d0c776bc11f5ca73ab427e28 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 22 Apr 2020 14:48:25 +0900
Subject: [PATCH] Fix handling of WAL segments ready to be archived during
 crash recovery
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 +-
 src/include/access/xlog.h                 |   9 +
 src/test/recovery/t/020_archive_status.pl | 232 ++++++++++++++++++++++
 4 files changed, 303 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 823efdbed0..acca98ea28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -204,8 +204,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory.
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -616,10 +617,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4136,6 +4137,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4819,7 +4830,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6541,7 +6552,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6554,6 +6571,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7555,7 +7576,7 @@ StartupXLOG(void)
 	 * updates to shared memory.)
 	 */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -7698,7 +7719,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7710,8 +7731,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -7727,6 +7748,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 99704c1469..3cc8021cb8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -619,18 +619,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ab97a61bbb..28f716d4cf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -125,6 +125,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -231,6 +239,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..eee1173398
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,232 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_1       = "pg_xlog/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count = 1 FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_2       = "pg_xlog/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', q{
+archive_mode = on
+wal_keep_segments = 0
+});
+$standby1->append_conf('postgresql.conf', "");
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+# first restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# segments are cleaned after the second restartpoint, so create a second one.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(21,30) AS x;
+	CHECKPOINT;
+	SELECT pg_switch_xlog();
+});
+
+my $primary_lsn = $primary->safe_psql('postgres',
+	q{SELECT pg_current_xlog_location()});
+
+$standby1->poll_query_until('postgres',
+	qq{ SELECT pg_xlog_location_diff('$primary_lsn', pg_last_xlog_replay_location()) = 0 })
+  or die "Timed out while waiting for xlog replay";
+
+# second restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on removed .ready signal files inherited
+# from backup after two checkpoints.  Note that this WAL segment
+# existed in the backup.
+ok( ! -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	qq{SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'3',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.20.1

In reply to: Jehan-Guillaume de Rorthais (#51)
1 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 22 Apr 2020 17:58:24 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Wed, 22 Apr 2020 16:32:23 +0900
Michael Paquier <michael@paquier.xyz> wrote:

[...]
[...]
[...]

This test seems useless to me. It should either be removed or patched to test
that the signal file has been removed after a second restartpoint.

Please find attached a patch for 9.6 implementing this. If it seems
reasonable to you, I can create the backpatch to 9.5.

I found an extra useless line of code in v9 patch. Please, find in
attachment v10. Sorry for this.

Regards,

Attachments:

0001-v10-Fix-WAL-retention-during-production-crash-recovery.patchtext/x-patchDownload
Fix handling of WAL segments ready to be archived during crash recovery

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 +-
 src/include/access/xlog.h                 |   9 +
 src/test/recovery/t/020_archive_status.pl | 231 ++++++++++++++++++++++
 4 files changed, 302 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 823efdbed0..acca98ea28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -204,8 +204,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory.
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -616,10 +617,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4136,6 +4137,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4819,7 +4830,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6541,7 +6552,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6554,6 +6571,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7555,7 +7576,7 @@ StartupXLOG(void)
 	 * updates to shared memory.)
 	 */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -7698,7 +7719,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7710,8 +7731,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -7727,6 +7748,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 99704c1469..3cc8021cb8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -619,18 +619,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ab97a61bbb..28f716d4cf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -125,6 +125,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -231,6 +239,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..40862ffdc8
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,231 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_1       = "pg_xlog/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	'pg_stat_archiver failed to archive $segment_name_1');
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count = 1 FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_2       = "pg_xlog/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', q{
+archive_mode = on
+wal_keep_segments = 0
+});
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+# first restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# segments are cleaned after the second restartpoint, so create a second one.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(21,30) AS x;
+	CHECKPOINT;
+	SELECT pg_switch_xlog();
+});
+
+my $primary_lsn = $primary->safe_psql('postgres',
+	q{SELECT pg_current_xlog_location()});
+
+$standby1->poll_query_until('postgres',
+	qq{ SELECT pg_xlog_location_diff('$primary_lsn', pg_last_xlog_replay_location()) = 0 })
+  or die "Timed out while waiting for xlog replay";
+
+# second restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on removed .ready signal files inherited
+# from backup after two checkpoints.  Note that this WAL segment
+# existed in the backup.
+ok( ! -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	qq{SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'3',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.20.1

#53Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#52)
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, Apr 22, 2020 at 06:17:17PM +0200, Jehan-Guillaume de Rorthais wrote:

I found an extra useless line of code in v9 patch. Please, find in
attachment v10. Sorry for this.

Thanks for helping here; your changes make sense.  This looks mostly
fine to me except for this part:
+$standby1->poll_query_until('postgres',
+   qq{ SELECT pg_xlog_location_diff('$primary_lsn', pg_last_xlog_replay_location()) = 0 })
+  or die "Timed out while waiting for xlog replay";
Here we should check that pg_last_xlog_replay_location() is at least
$primary_lsn.  Checking for equality may leave the test stuck if more
WAL gets replayed.  For example, you could have a concurrent autovacuum
generating WAL.
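
For illustration, a minimal sketch of the suggested polling condition, reusing
the node and $primary_lsn variables from the test above (the v11 patches later
in this thread adopt this form):

# Wait until the standby has replayed at least up to $primary_lsn.
# Using ">= 0" instead of "= 0" avoids getting stuck when extra WAL
# records are replayed after $primary_lsn was captured on the primary.
$standby1->poll_query_until('postgres',
	qq{ SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '$primary_lsn') >= 0 })
  or die "Timed out while waiting for xlog replay";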
--
Michael
#54Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#53)
Re: [BUG] non archived WAL removed during production crash recovery

FWIW, the test for 10- looks fine, but I have one question.

+ 'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'

This seems intended to show the actual segment name in the message, but it is
literally shown as "... WAL segment $segment_name_1". Is that intended?
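
(For reference, a minimal illustration of the Perl quoting difference at play
here, with a made-up segment name: single-quoted strings do not interpolate
variables, while double-quoted strings do, so the double-quoted form is needed
to show the real segment name in the test description.)

my $segment_name_1 = '000000010000000000000002';    # hypothetical value
print 'WAL segment $segment_name_1', "\n";    # prints the literal text: WAL segment $segment_name_1
print "WAL segment $segment_name_1", "\n";    # prints: WAL segment 000000010000000000000002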

At Thu, 23 Apr 2020 08:46:18 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Wed, Apr 22, 2020 at 06:17:17PM +0200, Jehan-Guillaume de Rorthais wrote:

I found an extra useless line of code in v9 patch. Please, find in
attachment v10. Sorry for this.

Thanks for helping here, your changes make sense.  This looks mostly
fine to me except that part:
+$standby1->poll_query_until('postgres',
+   qq{ SELECT pg_xlog_location_diff('$primary_lsn', pg_last_xlog_replay_location()) = 0 })
+  or die "Timed out while waiting for xlog replay";
Here we should check that pg_last_xlog_replay_location() is at least
$primary_lsn.  Checking for equality may leave the test stuck if more
WAL gets replayed.  For example, you could have a concurrent autovacuum
generating WAL.

Autovacuum is turned off in this case, but other kinds of WAL records can
still be generated.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In reply to: Kyotaro Horiguchi (#54)
3 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 23 Apr 2020 14:05:46 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

FWIW, the test for 10- looks fine, but I have one question.

+ 'archive success reported in pg_stat_archiver for WAL segment $segment_name_1'

This seems intended to show the actual segment name in the message, but it is
literally shown as "... WAL segment $segment_name_1". Is that intended?

Good catch. Fixed in v11. Thank you.

At Thu, 23 Apr 2020 08:46:18 +0900, Michael Paquier <michael@paquier.xyz>
wrote in

On Wed, Apr 22, 2020 at 06:17:17PM +0200, Jehan-Guillaume de Rorthais
wrote:

I found an extra useless line of code in v9 patch. Please, find in
attachment v10. Sorry for this.

Thanks for helping here, your changes make sense.  This looks mostly
fine to me except that part:
+$standby1->poll_query_until('postgres',
+   qq{ SELECT pg_xlog_location_diff('$primary_lsn',
pg_last_xlog_replay_location()) = 0 })
+  or die "Timed out while waiting for xlog replay";
Here we should check that pg_last_xlog_replay_location() is at least
$primary_lsn.  Checking for equality may leave the test stuck if more
WAL gets replayed.  For example, you could have a concurrent autovacuum
generating WAL.

Autovacuum is turned off in this case, but other kinds of WAL records can
still be generated.

Makes sense. Fixed in v11.

Please find attached v11 for versions 9.5, 9.6 and 10.

Regards,

Attachments:

0001-v11-Fix-WAL-retention-during-production-crash-recovery-10.patchtext/x-patchDownload
Fix handling of WAL segments ready to be archived during crash recovery

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 +-
 src/include/access/xlog.h                 |   9 +
 src/test/recovery/t/020_archive_status.pl | 233 ++++++++++++++++++++++
 4 files changed, 304 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ec66b99b32..c10dc889c6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -209,8 +209,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory.
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -635,10 +636,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4295,6 +4296,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4980,7 +4991,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6803,7 +6814,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6816,6 +6833,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7841,7 +7862,7 @@ StartupXLOG(void)
 	ControlFile->time = (pg_time_t) time(NULL);
 
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	UpdateControlFile();
@@ -7987,7 +8008,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7999,8 +8020,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -8016,6 +8037,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 7825cfe532..13f18ae85d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -608,18 +608,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8751d86cbd..d41fefc185 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -128,6 +128,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -242,6 +250,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..37daa882c9
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,233 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_1       = "pg_wal/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	"pg_stat_archiver failed to archive $segment_name_1");
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count FROM pg_stat_archiver}, '1')
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	"archive success reported in pg_stat_archiver for WAL segment $segment_name_1"
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_wal();
+	CHECKPOINT;
+});
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_walfile_name(pg_current_wal_lsn())});
+my $segment_path_2       = "pg_wal/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', qq{
+archive_mode = on
+wal_keep_segments = 0
+});
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+# first restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# segments are cleaned after the second restartpoint, so create a second one.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(21,30) AS x;
+	CHECKPOINT;
+	SELECT pg_switch_wal();
+});
+
+my $primary_lsn = $primary->safe_psql('postgres',
+	q{SELECT pg_current_wal_lsn()});
+
+$standby1->poll_query_until('postgres',
+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 },
+	't')
+	or die "Timed out while waiting for xlog replay";
+
+# second restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on removed .ready signal files inherited
+# from backup after two checkpoints.  Note that this WAL segment
+# existed in the backup.
+ok( ! -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	q{SELECT last_archived_wal FROM pg_stat_archiver},
+	$segment_name_2)
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'3',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.20.1

0001-v11-Fix-WAL-retention-during-production-crash-recovery-95.patchtext/x-patchDownload
Fix handling of WAL segments ready to be archived during crash recovery

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 +-
 src/include/access/xlog.h                 |   9 +
 src/test/recovery/t/020_archive_status.pl | 231 ++++++++++++++++++++++
 4 files changed, 302 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a378df9d76..a6cdb0024f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -203,8 +203,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory.
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -608,10 +609,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4083,6 +4084,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4771,7 +4782,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6493,7 +6504,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6506,6 +6523,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7496,7 +7517,7 @@ StartupXLOG(void)
 	 * updates to shared memory.)
 	 */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -7639,7 +7660,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7651,8 +7672,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -7668,6 +7689,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index c899594fb6..e36a85ccab 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -619,18 +619,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 59ebccb19e..d051ee02bc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -126,6 +126,14 @@ typedef enum WalLevel
 } WalLevel;
 extern int	wal_level;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 /* Is WAL archiving enabled (always or only while server is running normally)? */
 #define XLogArchivingActive() \
 	(XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level >= WAL_LEVEL_ARCHIVE)
@@ -230,6 +238,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..3d6fb0bff2
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,231 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_1       = "pg_xlog/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	"pg_stat_archiver failed to archive $segment_name_1");
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count = 1 FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	"archive success reported in pg_stat_archiver for WAL segment $segment_name_1"
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_2       = "pg_xlog/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', q{
+archive_mode = on
+wal_keep_segments = 0
+});
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+# first restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# segments are cleaned after the second restartpoint, so create a second one.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(21,30) AS x;
+	CHECKPOINT;
+	SELECT pg_switch_xlog();
+});
+
+my $primary_lsn = $primary->safe_psql('postgres',
+	q{SELECT pg_current_xlog_location()});
+
+$standby1->poll_query_until('postgres',
+	qq{ SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '$primary_lsn') >= 0 })
+  or die "Timed out while waiting for xlog replay";
+
+# second restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on removed .ready signal files inherited
+# from backup after two checkpoints.  Note that this WAL segment
+# existed in the backup.
+ok( ! -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	qq{SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'3',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.20.1

0001-v11-Fix-WAL-retention-during-production-crash-recovery-96.patchtext/x-patchDownload
Fix handling of WAL segments ready to be archived during crash recovery

78ea8b5 has fixed an issue related to the recycling of WAL segments on
standbys depending on archive_mode.  However, it has introduced a
regression with the handling of WAL segments ready to be archived during
crash recovery, causing those files to be recycled without getting
archived.

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.  Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

A set of TAP tests is added to close the gap here, making sure that WAL
segments ready to be archived are correctly handled when a cluster is in
archive or crash recovery with archive_mode set to "on" or "always", for
both standby and primary.

Reported-by: Benoît Lobréau
Author: Jehan-Guillaume de Rorthais
Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost
Backpatch-through: 9.5
---
 src/backend/access/transam/xlog.c         |  57 +++++-
 src/backend/access/transam/xlogarchive.c  |  21 +-
 src/include/access/xlog.h                 |   9 +
 src/test/recovery/t/020_archive_status.pl | 231 ++++++++++++++++++++++
 4 files changed, 302 insertions(+), 16 deletions(-)
 create mode 100644 src/test/recovery/t/020_archive_status.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 823efdbed0..acca98ea28 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -204,8 +204,9 @@ static TimeLineID receiveTLI = 0;
 static bool lastFullPageWrites;
 
 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * Local copy of the state tracked by SharedRecoveryState in shared memory,
+ * It is false if SharedRecoveryState is RECOVERY_STATE_DONE.  True actually
+ * means "not known, need to check the shared state".
  */
 static bool LocalRecoveryInProgress = true;
 
@@ -616,10 +617,10 @@ typedef struct XLogCtlData
 	char		archiveCleanupCommand[MAXPGPATH];
 
 	/*
-	 * SharedRecoveryInProgress indicates if we're still in crash or archive
+	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
 	 */
-	bool		SharedRecoveryInProgress;
+	RecoveryState SharedRecoveryState;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -4136,6 +4137,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				updateMinRecoveryPoint = true;
 
 				UpdateControlFile();
+
+				/*
+				 * We update SharedRecoveryState while holding the lock on
+				 * ControlFileLock so both states are consistent in shared
+				 * memory.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+				SpinLockRelease(&XLogCtl->info_lck);
+
 				LWLockRelease(ControlFileLock);
 
 				CheckRecoveryConsistency();
@@ -4819,7 +4830,7 @@ XLOGShmemInit(void)
 	 * in additional info.)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -6541,7 +6552,13 @@ StartupXLOG(void)
 		 */
 		dbstate_at_startup = ControlFile->state;
 		if (InArchiveRecovery)
+		{
 			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+			SpinLockRelease(&XLogCtl->info_lck);
+		}
 		else
 		{
 			ereport(LOG,
@@ -6554,6 +6571,10 @@ StartupXLOG(void)
 								ControlFile->checkPointCopy.ThisTimeLineID,
 								recoveryTargetTLI)));
 			ControlFile->state = DB_IN_CRASH_RECOVERY;
+
+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+			SpinLockRelease(&XLogCtl->info_lck);
 		}
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
@@ -7555,7 +7576,7 @@ StartupXLOG(void)
 	 * updates to shared memory.)
 	 */
 	SpinLockAcquire(&XLogCtl->info_lck);
-	XLogCtl->SharedRecoveryInProgress = false;
+	XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE;
 	SpinLockRelease(&XLogCtl->info_lck);
 
 	/*
@@ -7698,7 +7719,7 @@ RecoveryInProgress(void)
 		 */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7710,8 +7731,8 @@ RecoveryInProgress(void)
 		{
 			/*
 			 * If we just exited recovery, make sure we read TimeLineID and
-			 * RedoRecPtr after SharedRecoveryInProgress (for machines with
-			 * weak memory ordering).
+			 * RedoRecPtr after SharedRecoveryState (for machines with weak
+			 * memory ordering).
 			 */
 			pg_memory_barrier();
 			InitXLOGAccess();
@@ -7727,6 +7748,24 @@ RecoveryInProgress(void)
 	}
 }
 
+/*
+ * Returns current recovery state from shared memory.
+ *
+ * This returned state is kept consistent with the contents of the control
+ * file.  See details about the possible values of RecoveryState in xlog.h.
+ */
+RecoveryState
+GetRecoveryState(void)
+{
+	RecoveryState retval;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	retval = XLogCtl->SharedRecoveryState;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return retval;
+}
+
 /*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 99704c1469..3cc8021cb8 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -619,18 +619,25 @@ XLogArchiveCheckDone(const char *xlog)
 {
 	char		archiveStatusPath[MAXPGPATH];
 	struct stat stat_buf;
-	bool		inRecovery = RecoveryInProgress();
+
+	/* The file is always deletable if archive_mode is "off". */
+	if (!XLogArchivingActive())
+		return true;
 
 	/*
-	 * The file is always deletable if archive_mode is "off".  On standbys
-	 * archiving is disabled if archive_mode is "on", and enabled with
-	 * "always".  On a primary, archiving is enabled if archive_mode is "on"
-	 * or "always".
+	 * During archive recovery, the file is deletable if archive_mode is not
+	 * "always".
 	 */
-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if (!XLogArchivingAlways() &&
+		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
 		return true;
 
+	/*
+	 * At this point of the logic, note that we are either a primary with
+	 * archive_mode set to "on" or "always", or a standby with archive_mode
+	 * set to "always".
+	 */
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ab97a61bbb..28f716d4cf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -125,6 +125,14 @@ typedef enum WalLevel
 	WAL_LEVEL_LOGICAL
 } WalLevel;
 
+/* Recovery states */
+typedef enum RecoveryState
+{
+	RECOVERY_STATE_CRASH = 0,	/* crash recovery */
+	RECOVERY_STATE_ARCHIVE,		/* archive recovery */
+	RECOVERY_STATE_DONE			/* currently in production */
+} RecoveryState;
+
 extern PGDLLIMPORT int wal_level;
 
 /* Is WAL archiving enabled (always or only while server is running normally)? */
@@ -231,6 +239,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern RecoveryState GetRecoveryState(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
new file mode 100644
index 0000000000..3d6fb0bff2
--- /dev/null
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -0,0 +1,231 @@
+#
+# Tests related to WAL archiving and recovery.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 16;
+use Config;
+
+my $primary = get_new_node('master');
+$primary->init(
+	has_archiving    => 1,
+	allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+my $primary_data = $primary->data_dir;
+
+# Temporarily use an archive_command value to make the archiver fail,
+# knowing that archiving is enabled.  Note that we cannot use a command
+# that does not exist as in this case the archiver process would just exit
+# without reporting the failure to pg_stat_archiver.  This also cannot
+# use a plain "false" as that's unportable on Windows.  So, instead, as
+# a portable solution, use an archive command based on a command known to
+# work but will fail: copy with an incorrect original path.
+my $incorrect_command =
+  $TestLib::windows_os
+  ? qq{copy "%p_does_not_exist" "%f_does_not_exist"}
+  : qq{cp "%p_does_not_exist" "%f_does_not_exist"};
+$primary->safe_psql(
+	'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$incorrect_command';
+    SELECT pg_reload_conf();
+});
+
+# Save the WAL segment currently in use and switch to a new segment.
+# This will be used to track the activity of the archiver.
+my $segment_name_1 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_1       = "pg_xlog/archive_status/$segment_name_1";
+my $segment_path_1_ready = "$segment_path_1.ready";
+my $segment_path_1_done  = "$segment_path_1.done";
+$primary->safe_psql(
+	'postgres', q{
+	CREATE TABLE mine AS SELECT generate_series(1,10) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+
+# Wait for an archive failure.
+$primary->poll_query_until('postgres',
+	q{SELECT failed_count > 0 FROM pg_stat_archiver}, 't')
+  or die "Timed out while waiting for archiving to fail";
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file exists for WAL segment $segment_name_1 waiting to be archived"
+);
+ok( !-f "$primary_data/$segment_path_1_done",
+	".done file does not exist for WAL segment $segment_name_1 waiting to be archived"
+);
+
+is( $primary->safe_psql(
+		'postgres', q{
+		SELECT archived_count, last_failed_wal
+		FROM pg_stat_archiver
+	}),
+	"0|$segment_name_1",
+	"pg_stat_archiver failed to archive $segment_name_1");
+
+# Crash the cluster for the next test in charge of checking that non-archived
+# WAL segments are not removed.
+$primary->stop('immediate');
+
+# Recovery tests for the archiving with a standby partially check
+# the recovery behavior when restoring a backup taken using a
+# snapshot with no pg_start/stop_backup.  In this situation,
+# the recovered standby should enter first crash recovery then
+# switch to regular archive recovery.  Note that the base backup
+# is taken here so as archive_command will fail.  This is necessary
+# for the assumptions of the tests done with the standbys below.
+$primary->backup_fs_cold('backup');
+
+$primary->start;
+ok( -f "$primary_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 still exists after crash recovery on primary"
+);
+
+# Allow WAL archiving again and wait for a success.
+$primary->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+
+$primary->poll_query_until('postgres',
+	q{SELECT archived_count = 1 FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+ok(!-f "$primary_data/$segment_path_1_ready",
+	".ready file for archived WAL segment $segment_name_1 removed");
+
+ok(-f "$primary_data/$segment_path_1_done",
+	".done file for archived WAL segment $segment_name_1 exists");
+
+is( $primary->safe_psql(
+		'postgres', q{ SELECT last_archived_wal FROM pg_stat_archiver }),
+	$segment_name_1,
+	"archive success reported in pg_stat_archiver for WAL segment $segment_name_1"
+);
+
+# Create some WAL activity and a new checkpoint so as the next standby can
+# create a restartpoint.  As this standby starts in crash recovery because
+# of the cold backup taken previously, it needs a clean restartpoint to deal
+# with existing status files.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(10,20) AS x;
+	SELECT pg_switch_xlog();
+	CHECKPOINT;
+});
+my $segment_name_2 = $primary->safe_psql('postgres',
+	q{SELECT pg_xlogfile_name(pg_current_xlog_location())});
+my $segment_path_2       = "pg_xlog/archive_status/$segment_name_2";
+my $segment_path_2_ready = "$segment_path_2.ready";
+my $segment_path_2_done  = "$segment_path_2.done";
+
+# Test standby with archive_mode = on.
+my $standby1 = get_new_node('standby');
+$standby1->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby1->append_conf('postgresql.conf', q{
+archive_mode = on
+wal_keep_segments = 0
+});
+my $standby1_data = $standby1->data_dir;
+$standby1->start;
+# first restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# segments are cleaned after the second restartpoint, so create a second one.
+$primary->safe_psql(
+	'postgres', q{
+	INSERT INTO mine SELECT generate_series(21,30) AS x;
+	CHECKPOINT;
+	SELECT pg_switch_xlog();
+});
+
+my $primary_lsn = $primary->safe_psql('postgres',
+	q{SELECT pg_current_xlog_location()});
+
+$standby1->poll_query_until('postgres',
+	qq{ SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '$primary_lsn') >= 0 })
+  or die "Timed out while waiting for xlog replay";
+
+# second restartpoint
+$standby1->safe_psql('postgres', q{CHECKPOINT});
+
+# Recovery with archive_mode=on removed .ready signal files inherited
+# from backup after two checkpoints.  Note that this WAL segment
+# existed in the backup.
+ok( ! -f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup removed with archive_mode=on on standby"
+);
+
+# Recovery with archive_mode=on should not create .ready files.
+# Note that this segment did not exist in the backup.
+ok( !-f "$standby1_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
+);
+
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
+# Test recovery with archive_mode = always, which should always keep
+# .ready files if archiving is enabled, though here we want the archive
+# command to fail to persist the .ready files.  Note that this node
+# has inherited the archive command of the previous cold backup that
+# will cause archiving failures.
+my $standby2 = get_new_node('standby2');
+$standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->append_conf('postgresql.conf', 'archive_mode = always');
+my $standby2_data = $standby2->data_dir;
+$standby2->start;
+
+$standby2->safe_psql('postgres', q{CHECKPOINT});
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
+# Reset statistics of the archiver for the next checks.
+$standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
+
+# Now crash the cluster to check that recovery step does not
+# remove non-archived WAL segments on a standby where archiving
+# is enabled.
+$standby2->stop('immediate');
+$standby2->start;
+
+ok( -f "$standby2_data/$segment_path_1_ready",
+	"WAL segment still ready to archive after crash recovery on standby with archive_mode=always"
+);
+
+# Allow WAL archiving again, and wait for the segments to be archived.
+$standby2->safe_psql(
+	'postgres', q{
+	ALTER SYSTEM RESET archive_command;
+	SELECT pg_reload_conf();
+});
+$standby2->poll_query_until('postgres',
+	qq{SELECT last_archived_wal = '$segment_name_2' FROM pg_stat_archiver})
+  or die "Timed out while waiting for archiving to finish";
+
+is( $standby2->safe_psql(
+		'postgres', q{SELECT archived_count FROM pg_stat_archiver}),
+	'3',
+	'correct number of WAL segments archived from standby');
+
+ok( !-f "$standby2_data/$segment_path_1_ready"
+	  && !-f "$standby2_data/$segment_path_2_ready",
+	".ready files removed after archive success with archive_mode=always on standby"
+);
+
+ok( -f "$standby2_data/$segment_path_1_done"
+	  && -f "$standby2_data/$segment_path_2_done",
+	".done files created after archive success with archive_mode=always on standby"
+);
-- 
2.20.1
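
To condense what the patch above does in XLogArchiveCheckDone(): the decision no
longer hinges on the single RecoveryInProgress() boolean but on the new shared
RecoveryState. A rough sketch of the resulting check follows; it is illustrative
only, would only compile inside the backend tree, and leaves out the .done/.ready
status-file checks that the real function performs when archiving still applies.

/*
 * Illustrative condensation of the new check -- not the actual function,
 * which continues with the .done/.ready status-file lookups whenever this
 * sketch returns true.
 */
static bool
ArchivingAppliesToSegment(void)
{
	if (!XLogArchivingActive())
		return false;			/* archive_mode = off: nothing to wait for */

	if (!XLogArchivingAlways() &&
		GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
		return false;			/* standby in archive recovery, archive_mode = on */

	/*
	 * Primary with archive_mode = on/always, crash recovery, or a standby
	 * with archive_mode = always: the segment must not be removed before
	 * its .done file shows up.
	 */
	return true;
}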

#56Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#55)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, Apr 23, 2020 at 06:59:53PM +0200, Jehan-Guillaume de Rorthais wrote:

Please, find attached v11 for versions 9.5, 9.6 and 10.

I have worked more on that using your v11, tweaked a few comments in the
new test parts for 9.5~10, and applied the whole. Thanks all for your
work. I am now keeping an eye on the buildfarm.
--
Michael

#57Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#56)
Re: [BUG] non archived WAL removed during production crash recovery

Hi,

On 2020-04-24 08:54:02 +0900, Michael Paquier wrote:

On Thu, Apr 23, 2020 at 06:59:53PM +0200, Jehan-Guillaume de Rorthais wrote:

Please, find attached v11 for versions 9.5, 9.6 and 10.

I have worked more on that using your v11, tweaked a few comments in the
new test parts for 9.5~10, and applied the whole. Thanks all for your
work. I am now keeping an eye on the buildfarm.

I am confused by this commit. You added shared memory to differentiate
between crash recovery and standby mode/archive recovery, correct? You
write:

This commit fixes the regression by tracking in shared memory if a live
cluster is either in crash recovery or archive recovery as the handling
of WAL segments ready to be archived is different in both cases (those
WAL segments should not be removed during crash recovery), and by using
this new shared memory state to decide if a segment can be recycled or
not.

But don't we pretty much know this already from the state of the system?
During crash recovery there's nothing running RemoveOldXLogFiles() but
the startup process. Right? And in DB_IN_ARCHIVE_RECOVERY it should only
happen as part of restartpoints (i.e. the checkpointer).

Did you add the new shared state to avoid deducing things from the
"environment"? If so, it should really be mentioned in the commit
message & code. Because:

Previously, it was not possible to know if a cluster was in crash
recovery or archive recovery as the shared state was able to track only
if recovery was happening or not, leading to the problem.

really doesn't make that obvious.

Greetings,

Andres Freund
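
To make the alternative Andres sketches above concrete: the recovery phase could,
in principle, be deduced from the identity of the calling process, since during
crash recovery only the startup process removes old WAL while during archive
recovery that work happens in restartpoints run by the checkpointer. The untested
sketch below assumes the AmStartupProcess() helper from miscadmin.h, only makes
sense for the RemoveOldXLogFiles() path, and would only compile inside the backend
tree; as the following messages explain, it is not what got committed, because
XLogArchiveCheckDone() also has callers outside those two processes.

/*
 * Untested sketch of the "deduce it from the environment" idea, valid only
 * for the RemoveOldXLogFiles() path.  Not the committed fix.
 */
static bool
SegmentStillNeedsArchiving(void)
{
	if (!XLogArchivingActive())
		return false;			/* archive_mode = off */

	if (RecoveryInProgress())
	{
		if (AmStartupProcess())
			return true;		/* crash recovery: keep .ready segments */

		/* restartpoint run by the checkpointer on a standby */
		return XLogArchivingAlways();
	}

	return true;				/* primary with archiving enabled */
}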

#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#56)
Re: [BUG] non archived WAL removed during production crash recovery

Michael Paquier <michael@paquier.xyz> writes:

I have worked more on that using your v11, tweaked a few comments in the
new test parts for 9.5~10, and applied the whole. Thanks all for your
work. I am now keeping an eye on the buildfarm.

Looks like the news is not good :-(

I see that my own florican is one of the failing critters, though
it failed only on HEAD which seems odd. Any suggestions what to
look for?

regards, tom lane

#59Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#58)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, Apr 23, 2020 at 10:21:15PM -0400, Tom Lane wrote:

Looks like the news is not good :-(

Yes, I was looking at that for the last couple of hours, and just
pushed something to put back the buildfarm to a green state for now
(based on the first results things seem stable now) by removing the
defective subset of tests.

I see that my own florican is one of the failing critters, though
it failed only on HEAD which seems odd. Any suggestions what to
look for?

The issue comes from the parts of the test where we expect some .ready
files to exist (or not) after triggering a restartpoint to force some
segments to be recycled. And looking more at it, I suspect that the
issue is actually that we don't make sure in the test that the
standbys started have replayed up to the segment switch record
triggered on the primary (the one within generate_series(10,20)), and
then the follow-up restart point does not actually recycle the
segments we expect to recycle. That's more likely going to be a
problem on slower machines as the window gets wider between the moment
the standbys reach their consistency point and the moment the switch
record is replayed.
--
Michael

#60Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#57)
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, Apr 23, 2020 at 06:48:56PM -0700, Andres Freund wrote:

But don't we pretty much know this already from the state of the system?
During crash recovery there's nothing running RemoveOldXLogFiles() but
the startup process. Right? And in DB_IN_ARCHIVE_RECOVERY it should only
happen as part of restartpoints (i.e. the checkpointer).

Did you add the new shared state to avoid deducing things from the
"environment"? If so, it should really be mentioned in the commit
message & code. Because:

Hmm. Sorry, I see your point. The key point of the logic here is that
XLogArchiveCheckDone() can be called from processes other than the
startup process. There is one code path at the end of a base
backup for backup history files where not using a shared state would
be a problem.
--
Michael
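
For context, a rough sketch of the kind of call path Michael refers to here: at
the end of a base backup a regular backend, neither startup process nor
checkpointer, asks XLogArchiveCheckDone() whether old backup history files may be
removed. Such a caller cannot infer the recovery phase from its own process type,
which is why a shared state is needed. Apart from XLogArchiveCheckDone() itself,
the names below are made up for illustration and the snippet only makes sense
inside the backend tree.

/*
 * Illustrative only: a regular backend deciding whether a backup history
 * file may be removed.  The helper name and path handling are invented;
 * the point is that this caller has no process identity from which to
 * deduce the recovery phase.
 */
static void
MaybeRemoveBackupHistoryFile(const char *history_filename)
{
	char		path[MAXPGPATH];

	if (!XLogArchiveCheckDone(history_filename))
		return;					/* still needs to be archived, keep it */

	snprintf(path, MAXPGPATH, "%s/%s", XLOGDIR, history_filename);
	unlink(path);				/* archiving done (or not applicable) */
}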

#61Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#59)
1 attachment(s)
Re: [BUG] non archived WAL removed during production crash recovery

On Fri, 24 Apr 2020 12:43:51 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Apr 23, 2020 at 10:21:15PM -0400, Tom Lane wrote:

Looks like the news is not good :-(

Yes, I was looking at that for the last couple of hours, and just
pushed something to put back the buildfarm to a green state for now
(based on the first results things seem stable now) by removing the
defective subset of tests.

I see that my own florican is one of the failing critters, though
it failed only on HEAD which seems odd. Any suggestions what to
look for?

The issue comes from the parts of the test where we expect some .ready
files to exist (or not) after triggering a restartpoint to force some
segments to be recycled. And looking more at it, I suspect that the
issue is actually that we don't make sure in the test that the
standbys started have replayed up to the segment switch record
triggered on the primary (the one within generate_series(10,20)), and
then the follow-up restart point does not actually recycle the
segments we expect to recycle. That's more likely going to be a
problem on slower machines as the window gets wider between the moment
the standbys reach their consistency point and the moment the switch
record is replayed.

Indeed.

Regarding your fix: as we don't know whether the standby has caught up with
the latest available record, there's really no point in keeping this test either:

# Recovery with archive_mode=on should not create .ready files.
# Note that this segment did not exist in the backup.
ok( !-f "$standby1_data/$segment_path_2_ready",
".ready file for WAL segment $segment_name_2 not created on standby
when archive_mode=on on standby" );

I agree the three tests could be removed as they were not covering the bug we
were chasing. However, they might still be useful to detect future unexpected
behavior changes. If you agree with this, please find in attachment a patch
proposal against HEAD that recreates these three tests **after** a waiting loop
on both standby1 and standby2. This waiting loop is inspired by the tests in
9.5 -> 10.

Regards,

Attachments:

wait-for-wal-replay.patchtext/x-patchDownload
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index 401412e687..d197066567 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 13;
+use Test::More tests => 16;
 use Config;
 
 my $primary = get_new_node('master');
@@ -93,7 +93,7 @@ $primary->safe_psql(
 
 $primary->poll_query_until('postgres',
 	q{SELECT archived_count FROM pg_stat_archiver}, '1')
-  or die "Timed out while waiting for archiving to finish";
+	or die "Timed out while waiting for archiving to finish";
 
 ok(!-f "$primary_data/$segment_path_1_ready",
 	".ready file for archived WAL segment $segment_name_1 removed");
@@ -119,14 +119,23 @@ my $segment_path_2_done  = "$segment_path_2.done";
 $primary->safe_psql(
 	'postgres', q{
 	INSERT INTO mine SELECT generate_series(10,20) AS x;
-	SELECT pg_switch_wal();
 	CHECKPOINT;
 });
 
+# Make sure that the standby has caught up to here.
+my $primary_lsn = $primary->safe_psql('postgres',
+	q{SELECT pg_current_wal_lsn()});
+
+# force archiving current wal
+$primary->safe_psql(
+	'postgres', q{
+	SELECT pg_switch_wal();
+});
+
 $primary->poll_query_until('postgres',
 	q{ SELECT last_archived_wal FROM pg_stat_archiver },
 	$segment_name_2)
-  or die "Timed out while waiting for archiving to finish";
+	or die "Timed out while waiting for archiving to finish";
 
 # Test standby with archive_mode = on.
 my $standby1 = get_new_node('standby');
@@ -134,14 +143,30 @@ $standby1->init_from_backup($primary, 'backup', has_restoring => 1);
 $standby1->append_conf('postgresql.conf', "archive_mode = on");
 my $standby1_data = $standby1->data_dir;
 $standby1->start;
+
+$standby1->poll_query_until('postgres',
+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 })
+	or die "Timed out while waiting for xlog replay on standby1";
+
 $standby1->safe_psql('postgres', q{CHECKPOINT});
 
+# Recovery with archive_mode=on does not keep .ready signal files inherited
+# from backup.  Note that this WAL segment existed in the backup.
+ok( !-f "$standby1_data/$segment_path_1_ready",
+	".ready file for WAL segment $segment_name_1 present in backup got removed with archive_mode=on on standby"
+);
+
 # Recovery with archive_mode=on should not create .ready files.
 # Note that this segment did not exist in the backup.
 ok( !-f "$standby1_data/$segment_path_2_ready",
 	".ready file for WAL segment $segment_name_2 not created on standby when archive_mode=on on standby"
 );
 
+# Recovery with archive_mode = on creates .done files.
+ok( -f "$standby1_data/$segment_path_2_done",
+	".done file for WAL segment $segment_name_2 created when archive_mode=on on standby"
+);
+
 # Test recovery with archive_mode = always, which should always keep
 # .ready files if archiving is enabled, though here we want the archive
 # command to fail to persist the .ready files.  Note that this node
@@ -153,12 +178,20 @@ $standby2->append_conf('postgresql.conf', 'archive_mode = always');
 my $standby2_data = $standby2->data_dir;
 $standby2->start;
 
+$standby2->poll_query_until('postgres',
+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 })
+	or die "Timed out while waiting for xlog replay on standby2";
+
 $standby2->safe_psql('postgres', q{CHECKPOINT});
 
 ok( -f "$standby2_data/$segment_path_1_ready",
 	".ready file for WAL segment $segment_name_1 existing in backup is kept with archive_mode=always on standby"
 );
 
+ok( -f "$standby2_data/$segment_path_2_ready",
+	".ready file for WAL segment $segment_name_2 created with archive_mode=always on standby"
+);
+
 # Reset statistics of the archiver for the next checks.
 $standby2->safe_psql('postgres', q{SELECT pg_stat_reset_shared('archiver')});
 
#62Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#61)
Re: [BUG] non archived WAL removed during production crash recovery

On Fri, Apr 24, 2020 at 03:03:00PM +0200, Jehan-Guillaume de Rorthais wrote:

I agree the three tests could be removed as they were not covering the bug we
were chasing. However, they might still be useful to detect future unexpected
behavior changes. If you agree with this, please find in attachment a patch
proposal against HEAD that recreates these three tests **after** a waiting loop
on both standby1 and standby2. This waiting loop is inspired by the tests in
9.5 -> 10.

FWIW, I would prefer keeping all three tests as well.

So... I have spent more time on this problem and mereswine here is a
very good sample because it failed all three tests:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2020-04-24%2006%3A03%3A53

For standby2, we get this failure:
ok 11 - .ready file for WAL segment 000000010000000000000001 existing
in backup is kept with archive_mode=always on standby
not ok 12 - .ready file for WAL segment 000000010000000000000002
created with archive_mode=always on standby

Then, looking at 020_archive_status_standby2.log, we have the
following logs:
2020-04-24 02:08:32.032 PDT [9841:3] 020_archive_status.pl LOG:
statement: CHECKPOINT
[...]
2020-04-24 02:08:32.303 PDT [9821:7] LOG: restored log file
"000000010000000000000002" from archive

In this case, the test forced a checkpoint to test the segment
recycling *before* the extra restored segment we'd like to work on was
actually restored.  So it looks like my initial feeling about the
timing issue was right, and I am also able to reproduce the original
set of failures by adding a manual sleep to delay restores of
segments, like that for example:
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -74,6 +74,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
    if (recoveryRestoreCommand == NULL ||
        strcmp(recoveryRestoreCommand, "") == 0)
            goto not_available;
+   pg_usleep(10 * 1000000); /* 10s */
+
    /*

With your patch the problem does not show up anymore even with the
delay added, so I would like to apply what you have sent and add back
those tests. For now, I would just patch HEAD though as that's not
worth the risk of destabilizing stable branches in the buildfarm.

$primary->poll_query_until('postgres',
q{SELECT archived_count FROM pg_stat_archiver}, '1')
-  or die "Timed out while waiting for archiving to finish";
+	or die "Timed out while waiting for archiving to finish";

Some noise in the patch. This may have come from some unfinished
business with pgindent.
--
Michael

#63Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#62)
Re: [BUG] non archived WAL removed during production crash recovery

At Mon, 27 Apr 2020 16:49:45 +0900, Michael Paquier <michael@paquier.xyz> wrote in

On Fri, Apr 24, 2020 at 03:03:00PM +0200, Jehan-Guillaume de Rorthais wrote:

I agree the three tests could be removed as they were not covering the bug we
were chasing. However, they might still be useful to detect future unexpected
behavior changes. If you agree with this, please find in attachment a patch
proposal against HEAD that recreates these three tests **after** a waiting loop
on both standby1 and standby2. This waiting loop is inspired by the tests in
9.5 -> 10.

FWIW, I would prefer keeping all three tests as well.

So... I have spent more time on this problem and mereswine here is a
very good sample because it failed all three tests:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2020-04-24%2006%3A03%3A53

For standby2, we get this failure:
ok 11 - .ready file for WAL segment 000000010000000000000001 existing
in backup is kept with archive_mode=always on standby
not ok 12 - .ready file for WAL segment 000000010000000000000002
created with archive_mode=always on standby

Then, looking at 020_archive_status_standby2.log, we have the
following logs:
2020-04-24 02:08:32.032 PDT [9841:3] 020_archive_status.pl LOG:
statement: CHECKPOINT
[...]
2020-04-24 02:08:32.303 PDT [9821:7] LOG: restored log file
"000000010000000000000002" from archive

In this case, the test forced a checkpoint to test the segment
recycling *before* the extra restored segment we'd like to work on was
actually restored.  So it looks like my initial feeling about the
timing issue was right, and I am also able to reproduce the original
set of failures by adding a manual sleep to delay restores of
segments, like that for example:
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -74,6 +74,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
    if (recoveryRestoreCommand == NULL ||
        strcmp(recoveryRestoreCommand, "") == 0)
            goto not_available;
+   pg_usleep(10 * 1000000); /* 10s */
+
    /*

With your patch the problem does not show up anymore even with the
delay added, so I would like to apply what you have sent and add back
those tests. For now, I would just patch HEAD though as that's not
worth the risk of destabilizing stable branches in the buildfarm.

Agreed with the diagnosis and the fix. The fix reliably causes a restart
point, and the restart point then manipulates the status files the right way
before the CHECKPOINT command returns, in both cases.

If I were to add something to the fix, the following line may need a
comment.

+# Wait until the checkpoint record is replayed so that the following
+# CHECKPOINT causes a restart point reliably.
|+$standby1->poll_query_until('postgres',
|+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 }

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#64Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Kyotaro Horiguchi (#63)
Re: [BUG] non archived WAL removed during production crash recovery

On Mon, 27 Apr 2020 18:21:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

At Mon, 27 Apr 2020 16:49:45 +0900, Michael Paquier <michael@paquier.xyz>
wrote in

[...]

With your patch the problem does not show up anymore even with the
delay added, so I would like to apply what you have sent and add back
those tests. For now, I would just patch HEAD though as that's not
worth the risk of destabilizing stable branches in the buildfarm.

Good for me. Thanks!

Agreed with the diagnosis and the fix. The fix reliably causes a restart
point, and the restart point then manipulates the status files the right way
before the CHECKPOINT command returns, in both cases.

If I were to add something to the fix, the following line may need a
comment.

+# Wait until the checkpoint record is replayed so that the following
+# CHECKPOINT causes a restart point reliably.
|+$standby1->poll_query_until('postgres',
|+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(),
'$primary_lsn') >= 0 }

Agree.

Regards,

#65Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#63)
Re: [BUG] non archived WAL removed during production crash recovery

On Mon, Apr 27, 2020 at 06:21:07PM +0900, Kyotaro Horiguchi wrote:

Agreed with the diagnosis and the fix. The fix reliably causes a restart
point, and the restart point then manipulates the status files the right way
before the CHECKPOINT command returns, in both cases.

Thanks for checking!

If I were to add something to the fix, the following line may need a
comment.

+# Wait until the checkpoint record is replayed so that the following
+# CHECKPOINT causes a restart point reliably.
|+$standby1->poll_query_until('postgres',
|+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 }

Makes sense, added a comment and applied to HEAD. I have also
improved the comment around the split with pg_switch_wal(), and
actually simplified the test by using the function's return value as
the wait point.
--
Michael

#66Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#65)
Re: [BUG] non archived WAL removed during production crash recovery

On Tue, 28 Apr 2020 08:01:38 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 27, 2020 at 06:21:07PM +0900, Kyotaro Horiguchi wrote:

Agreed to the diagnosis and the fix. The fix reliably cause a restart
point then the restart point manipulats the status files the right way
before the CHECKPOINT command resturns, in the both cases.

Thanks for checking!

If I would add something to the fix, the following line may need a
comment.

+# Wait for the checkpoint record is replayed so that the following
+# CHECKPOINT causes a restart point reliably.
|+$standby1->poll_query_until('postgres',
|+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(),
'$primary_lsn') >= 0 }  

Makes sense, added a comment and applied to HEAD. I have also
improved the comment around the split with pg_switch_wal(), and
actually simplified the test to use as wait point the return value
from the function.

Thank you guys for your help, reviews and commits!

#67Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#66)
Re: [BUG] non archived WAL removed during production crash recovery

On Tue, Apr 28, 2020 at 01:58:47AM +0200, Jehan-Guillaume de Rorthais wrote:

Thank you guys for your help, reviews and commits!

The buildfarm does not complain after the latest commit, meaning that
we are good here. Thanks for your efforts.
--
Michael