[BUG] non archived WAL removed during production crash recovery

Started by Jehan-Guillaume de Rorthaisabout 6 years ago67 messageshackersbugs
Jump to latest
hackersbugs

Hello,

A colleague of mine reported an expected behavior.

On production cluster is in crash recovery, eg. after killing a backend, the
WALs ready to be archived are removed before being archived.

See in attachment the reproduction script "non-arch-wal-on-recovery.bash".

This behavior has been introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
Function XLogArchiveCheckDone() badly consider the in crashed recovery
production cluster as a standby without archive_mode=always. So the check
conclude the WAL can be removed safely.

bool inRecovery = RecoveryInProgress();

/*
* The file is always deletable if archive_mode is "off". On standbys
* archiving is disabled if archive_mode is "on", and enabled with
* "always". On a primary, archiving is enabled if archive_mode is "on"
* or "always".
*/
if (!((XLogArchivingActive() && !inRecovery) ||
(XLogArchivingAlways() && inRecovery)))
return true;

Please find in attachment a patch that fix this issue using the following test
instead:

if (!((XLogArchivingActive() && !StandbyModeRequested) ||
(XLogArchivingAlways() && inRecovery)))
return true;

I'm not sure if we should rely on StandbyModeRequested for the second part of
the test as well thought. What was the point to rely on RecoveryInProgress() to
get the recovery status from shared mem?

Regards,

Attachments:

non-arch-wal-on-recovery.bashapplication/octet-stream; name=non-arch-wal-on-recovery.bashDownload
0001-Fix-WAL-retention-during-production-crash-recovery.patchtext/x-patchDownload+1-2
#2Fujii Masao
masao.fujii@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#1)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/01 0:22, Jehan-Guillaume de Rorthais wrote:

Hello,

A colleague of mine reported an expected behavior.

On production cluster is in crash recovery, eg. after killing a backend, the
WALs ready to be archived are removed before being archived.

See in attachment the reproduction script "non-arch-wal-on-recovery.bash".

This behavior has been introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
Function XLogArchiveCheckDone() badly consider the in crashed recovery
production cluster as a standby without archive_mode=always. So the check
conclude the WAL can be removed safely.

Thanks for the report! Yeah, this seems a bug.

bool inRecovery = RecoveryInProgress();

/*
* The file is always deletable if archive_mode is "off". On standbys
* archiving is disabled if archive_mode is "on", and enabled with
* "always". On a primary, archiving is enabled if archive_mode is "on"
* or "always".
*/
if (!((XLogArchivingActive() && !inRecovery) ||
(XLogArchivingAlways() && inRecovery)))
return true;

Please find in attachment a patch that fix this issue using the following test
instead:

if (!((XLogArchivingActive() && !StandbyModeRequested) ||
(XLogArchivingAlways() && inRecovery)))
return true;

I'm not sure if we should rely on StandbyModeRequested for the second part of
the test as well thought. What was the point to rely on RecoveryInProgress() to
get the recovery status from shared mem?

Since StandbyModeRequested is the startup process-local variable,
it basically cannot be used in XLogArchiveCheckDone() that other process
may call. So another approach would be necessary... One straight idea is to
add new shmem flag indicating whether we are in standby mode or not.
Another is to make the startup process remove .ready file if necessary.

If it's not easy to fix the issue, we might need to just revert the commit
that introduced the issue at first.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

In reply to: Fujii Masao (#2)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On Wed, 1 Apr 2020 17:27:22 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
[...]

bool inRecovery = RecoveryInProgress();

/*
* The file is always deletable if archive_mode is "off". On standbys
* archiving is disabled if archive_mode is "on", and enabled with
* "always". On a primary, archiving is enabled if archive_mode is "on"
* or "always".
*/
if (!((XLogArchivingActive() && !inRecovery) ||
(XLogArchivingAlways() && inRecovery)))
return true;

Please find in attachment a patch that fix this issue using the following
test instead:

if (!((XLogArchivingActive() && !StandbyModeRequested) ||
(XLogArchivingAlways() && inRecovery)))
return true;

I'm not sure if we should rely on StandbyModeRequested for the second part
of the test as well thought. What was the point to rely on
RecoveryInProgress() to get the recovery status from shared mem?

Since StandbyModeRequested is the startup process-local variable,
it basically cannot be used in XLogArchiveCheckDone() that other process
may call.

Ok, you answered my wondering about using recovery status from shared mem. This
was obvious. Thanks for your help!

I was wondering if we could use "ControlFile->state != DB_IN_CRASH_RECOVERY".
It seems fine during crash recovery as the control file is updated before the
checkpoint, but it doesn't feel right for other code paths where the control
file might not be up-to-date on filesystem, right ?

So another approach would be necessary... One straight idea is to
add new shmem flag indicating whether we are in standby mode or not.

I was thinking about setting XLogCtlData->SharedRecoveryInProgress as an enum
using:

enum RecoveryState
{
NOT_IN_RECOVERY = 0
IN_CRASH_RECOVERY,
IN_ARCHIVE_RECOVERY
}

Please, find in attachment a patch implementing this.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

Regards,

Attachments:

0001-v2-Fix-WAL-retention-during-production-crash-recovery.patchtext/x-patchDownload+35-11
0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patchtext/x-patchDownload+14-3
#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#3)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

Hello.

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#4)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#5)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 13:07, Kyotaro Horiguchi wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

On second thought, I think that we should discuss what the desirable
behavior is before the implentation. Especially what's unclear to me
is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Of course, this change would lead to the issue that the past unarchived
WAL files keep remaining in the case of warm-standby using archive
recovery. But this issue looks unavoidable. If users want to avoid that,
archive_mode should be set to always.

Also I'm a bit wondering if it's really safe to remove such unarchived
WAL files even in the standby case with archive_mode=on. I would need
more time to think that.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#6)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/04/02 13:07, Kyotaro Horiguchi wrote:

Sorry, it was quite ambiguous.
At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

On second thought, I think that we should discuss what the desirable
behavior is before the implentation. Especially what's unclear to me

Agreed.

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate what operation steps are
you assuming of?

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail is written
based on the same thought.

In a very narrow window, if server crashed or killed after a segment
is finished but before marking the file as .ready, the file doesn't
have .ready but should be archived. If we need to get rid of such a
window, it would help to mark a WAL file as ".busy" at creation time.

Of course, this change would lead to the issue that the past
unarchived
WAL files keep remaining in the case of warm-standby using archive
recovery. But this issue looks unavoidable. If users want to avoid
that,
archive_mode should be set to always.

Also I'm a bit wondering if it's really safe to remove such unarchived
WAL files even in the standby case with archive_mode=on. I would need
more time to think that.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#7)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 16:23, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2020/04/02 13:07, Kyotaro Horiguchi wrote:

Sorry, it was quite ambiguous.
At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

On second thought, I think that we should discuss what the desirable
behavior is before the implentation. Especially what's unclear to me

Agreed.

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate what operation steps are
you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail is written
based on the same thought.

Ok, so our *current* consensus seems the followings. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recoery and standby mode.

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

In a very narrow window, if server crashed or killed after a segment
is finished but before marking the file as .ready, the file doesn't
have .ready but should be archived. If we need to get rid of such a
window, it would help to mark a WAL file as ".busy" at creation time.

Of course, this change would lead to the issue that the past
unarchived
WAL files keep remaining in the case of warm-standby using archive
recovery. But this issue looks unavoidable. If users want to avoid
that,
archive_mode should be set to always.

Also I'm a bit wondering if it's really safe to remove such unarchived
WAL files even in the standby case with archive_mode=on. I would need
more time to think that.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

Plus, I added a second commit to add one test in regard with this bug.

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

Maybe yes. But I'm not confident about that there is no such case.
Anyway I'm fine to fix the bug based on the above consensus at first.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In reply to: Kyotaro Horiguchi (#5)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY are
only set for production clusters, not standby ones. So the following test
should never catch standby cluster as :

(XLogArchivingActive() && inRecoveryState != IN_ARCHIVE_RECOVERY)

Forgive me if I'm wrong, but do I miss something?

Regards,

In reply to: Fujii Masao (#8)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 16:23, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote in

[...]

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate what operation steps are
you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

If you rely on snapshot without pg_start/stop_backup, I agree. Theses WAL
should be archived if:

* archive_mode >= on for primary
* archive_mode = always for standby

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail is written
based on the same thought.

Ok, so our *current* consensus seems the followings. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recoery and standby mode.

yes

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

yes

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

yes

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

[...]

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

Ok, understood.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

Maybe yes. But I'm not confident about that there is no such case.

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby mode»

#11Fujii Masao
masao.fujii@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#10)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 16:23, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote in

[...]

is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup. Could you elaborate what operation steps are
you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

If you rely on snapshot without pg_start/stop_backup, I agree. Theses WAL
should be archived if:

* archive_mode >= on for primary
* archive_mode = always for standby

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately. My previous mail is written
based on the same thought.

Ok, so our *current* consensus seems the followings. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
crash recovery, archive recoery and standby mode.

yes

- If archive_mode=on, WAL files with .ready files are removed only in
standby mode. In crash recovery and archive recovery cases, they keep
remaining and would be archived after recovery finishes (i.e., during
normal processing).

yes

- If archive_mode=always, in crash recovery, archive recovery and
standby mode, WAL files with .ready files are archived if WAL archiver
is running.

yes

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

[...]

Another is to make the startup process remove .ready file if
necessary.

I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL
files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

Ok, understood.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

Maybe yes. But I'm not confident about that there is no such case.

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#12Fujii Masao
masao.fujii@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#9)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY are
only set for production clusters, not standby ones.

DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
if you start the standby from the cold backup of the primary,
since InArchiveRecovery is false at the beginning of the recovery,
DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
DB_IN_ARCHIVE_RECOVERY is set.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In reply to: Fujii Masao (#12)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 23:58:00 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY
are only set for production clusters, not standby ones.

DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
if you start the standby from the cold backup of the primary,

In cold backup? Then ControlFile->state == DB_SHUTDOWNED, right?

Unless I'm wrong, this should be catched by:

if (ArchiveRecoveryRequested && ( [...] ||
ControlFile->state == DB_SHUTDOWNED))
{
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
}

With InArchiveRecovery=true, we later set DB_IN_ARCHIVE_RECOVERY instead of
DB_IN_CRASH_RECOVERY.

since InArchiveRecovery is false at the beginning of the recovery,
DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
DB_IN_ARCHIVE_RECOVERY is set.

However, I suppose this is true if you restore a backup from a snapshot
without backup_label, right?

In reply to: Fujii Masao (#11)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby
mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Do you mean to want to produce the next patch yourself?

#15Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#14)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

At Thu, 2 Apr 2020 17:44:50 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby
mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Do you mean to want to produce the next patch yourself?

No. Fujii-san is saying that he will address it, if the fix made in
this thread is found to be imperfect later.

He suspects that WAL files should be preserved at least in certain
cases even if the file persists forever, but the consensus here is to
remove files under certain conditions so as not to no WAL file
persists in pg_wal directory.

Feel free to propose the next version!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Fujii Masao
masao.fujii@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#13)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/03 0:37, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 23:58:00 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in

Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY
are only set for production clusters, not standby ones.

DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
if you start the standby from the cold backup of the primary,

In cold backup? Then ControlFile->state == DB_SHUTDOWNED, right?

Unless I'm wrong, this should be catched by:

if (ArchiveRecoveryRequested && ( [...] ||
ControlFile->state == DB_SHUTDOWNED))
{
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
}

With InArchiveRecovery=true, we later set DB_IN_ARCHIVE_RECOVERY instead of
DB_IN_CRASH_RECOVERY.

Yes, you're right. So I had to mention one more condition in my
previous email. The condition is that the cold backup was taken from
the server that was shutdowned with immdiate mode. In this case,
the code block that you pointed is skipped and InArchiveRecovery is
not set to true there.

since InArchiveRecovery is false at the beginning of the recovery,
DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
DB_IN_ARCHIVE_RECOVERY is set.

However, I suppose this is true if you restore a backup from a snapshot
without backup_label, right?

Maybe yes.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#15)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/03 10:14, Kyotaro Horiguchi wrote:

At Thu, 2 Apr 2020 17:44:50 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

Well, it seems to me that this is what you suggested few paragraph away:

«.ready files are removed when either archive_mode!=always in standby
mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Do you mean to want to produce the next patch yourself?

No. Fujii-san is saying that he will address it, if the fix made in
this thread is found to be imperfect later.

He suspects that WAL files should be preserved at least in certain
cases even if the file persists forever, but the consensus here is to
remove files under certain conditions so as not to no WAL file
persists in pg_wal directory.

Yes.

Feel free to propose the next version!

Yes!

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.

Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In reply to: Fujii Masao (#11)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

To some extends, it appears to me this sentence was relatively close to my
previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
[...]
if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
(inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
(inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
return true;

Which means that only .done cleanup would occurs during CRASH_RECOVERY
and .ready files might be created if no .done exists. No matter the futur status
of the cluster: primary or standby. Normal shortcut will apply during first
checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from a
snapshot or after a shutdown with immediate) is restored to build a standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.

Thought?

I try to avoid a new flag in memory with the proposal in attachment of this
email. It seems to me various combinaisons of booleans with subtle differences
around the same subject makes it a bit trappy and complicated to understand.

If my proposal is rejected, I'll be happy to volunteer to add
XLogCtl->SharedStandbyModeRequested though. It seems like a simple enough fix
as well.

Regards,

Attachments:

0001-v3-Fix-WAL-retention-during-production-crash-recovery.patchtext/x-patchDownload+31-11
0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patchtext/x-patchDownload+14-3
#19Fujii Masao
masao.fujii@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#18)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

To some extends, it appears to me this sentence was relatively close to my
previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
[...]
if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
(inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
(inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
return true;

Which means that only .done cleanup would occurs during CRASH_RECOVERY
and .ready files might be created if no .done exists. No matter the futur status
of the cluster: primary or standby. Normal shortcut will apply during first
checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from a
snapshot or after a shutdown with immediate) is restored to build a standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

Thanks for updating the patch! Here are my review comments:

-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryInProgress;

Since the patch changes the meaning of this variable, the name of
the variable should be changed? Otherwise, the current name seems
confusing.

+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryInProgress = IN_CRASH_RECOVERY;
+			SpinLockRelease(&XLogCtl->info_lck);

As I explained upthread, this code can be reached and IN_CRASH_RECOVERY
can be set even in standby or archive recovery. Is this right behavior that
you're expecting?

Even in crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY
until this code is reached. Also when WAL replay is not necessary
(e.g., restart of the server shutdowed cleanly before), GetRecoveryState()
returns IN_ARCHIVE_RECOVERY because this code is not reached. Aren't
these fragile? If XLogArchiveCheckDone() is only user of GetRecoveryState(),
they would be ok. But if another user will appear in the future, it seems
very easy to mistake. At least those behaviors should be commented in
GetRecoveryState().

-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
+		  (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) &&
+		  (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
		return true;

The last condition seems to cause XLogArchiveCheckDone() to return
true in archive recovery mode with archive_mode=on, then cause
unarchived WAL files with .ready to be removed. Is my understanding right?
If yes, that behavior doesn't seem to match with our consensus, i.e.,
WAL files with .ready should not be removed in that case.

+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;

Isn't it better to add more comments here? For example, what does
"Recovery state" mean? Which state is used in standby mode? Why? ,etc.

Is it really ok not to have the value indicating standby mode?

These enum values names are confusing because the variables with
similar names already exist. For example, IN_CRASH_RECOVERY vs.
DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them,
.e.g., by adding the prefix.

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.

Thought?

I try to avoid a new flag in memory with the proposal in attachment of this
email. It seems to me various combinaisons of booleans with subtle differences
around the same subject makes it a bit trappy and complicated to understand.

Ok, so firstly I try to review your patch!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In reply to: Fujii Masao (#19)
hackersbugs
Re: [BUG] non archived WAL removed during production crash recovery

On Sat, 4 Apr 2020 02:49:50 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

To some extends, it appears to me this sentence was relatively close to my
previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
[...]
if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
(inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
(inRecoveryState == IN_ARCHIVE_RECOVERY
&& !XLogArchivingAlways()))) return true;

Which means that only .done cleanup would occurs during CRASH_RECOVERY
and .ready files might be created if no .done exists. No matter the futur
status of the cluster: primary or standby. Normal shortcut will apply
during first checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from
a snapshot or after a shutdown with immediate) is restored to build a
standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

Thanks for updating the patch! Here are my review comments:

-	bool		SharedRecoveryInProgress;
+	RecoveryState	SharedRecoveryInProgress;

Since the patch changes the meaning of this variable, the name of
the variable should be changed? Otherwise, the current name seems
confusing.

Indeed, fixed using SharedRecoveryState

+			SpinLockAcquire(&XLogCtl->info_lck);
+			XLogCtl->SharedRecoveryInProgress =
IN_CRASH_RECOVERY;
+			SpinLockRelease(&XLogCtl->info_lck);

As I explained upthread, this code can be reached and IN_CRASH_RECOVERY
can be set even in standby or archive recovery. Is this right behavior that
you're expecting?

Yes. This patch avoids archive cleanup during crash recovery altogether,
whatever the requested status for the cluster.

Even in crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY
until this code is reached.

I tried to stick as close as possible with "ControlFile->state" and old
XLogCtl->SharedRecoveryInProgress variables. That's why it's initialized as
IN_ARCHIVE_RECOVERY as XLogCtl->SharedRecoveryInProgress was init as true as
well.

The status itself is set during StartupXLOG when the historical code actually
tries to define and record the real state between DB_IN_ARCHIVE_RECOVERY and
DB_IN_CRASH_RECOVERY.

Also when WAL replay is not necessary (e.g., restart of the server shutdowed
cleanly before), GetRecoveryState() returns IN_ARCHIVE_RECOVERY because this
code is not reached.

It is set to NOT_IN_RECOVERY at the end of StartupXLOG, in the same place we
set ControlFile->state = DB_IN_PRODUCTION. So GetRecoveryState() returns
NOT_IN_RECOVERY as soon as StartupXLOG is done when no WAL replay is necessary.

Aren't these fragile? If XLogArchiveCheckDone() is only user of
GetRecoveryState(), they would be ok. But if another user will appear in the
future, it seems very easy to mistake. At least those behaviors should be
commented in GetRecoveryState().

We certainly can set SharedRecoveryState earlier, in XLOGShmemInit, based on
the ControlFile->state value. In my understanding, anything different than
DB_SHUTDOWNED or DB_SHUTDOWNED_IN_RECOVERY can be considered as a crash
recovery. Based on this XLOGShmemInit can init SharedRecoveryState to
IN_ARCHIVE_RECOVERY or IN_CRASH_RECOVERY.

With either values, RecoveryInProgress() keep returning the same result so any
current code relying on RecoveryInProgress() is compatible.

I'm not sure who would need this information before the WAL machinery is up,
but is it safe enough in your opinion for futur usage of GetRecoveryState()? Do
you think this information might be useful before the WAL machinery is set?
Current "user" (eg. restoreTwoPhaseData()) only need to know if we are in
recovery, whatever the reason.

The patch in attachment set SharedRecoveryState to either IN_ARCHIVE_RECOVERY
or IN_CRASH_RECOVERY from XLOGShmemInit based on the ControlFile state. It
feels strange though to set this so far away from ControlFile->state
= DB_IN_CRASH_RECOVERY...

-	if (!((XLogArchivingActive() && !inRecovery) ||
-		  (XLogArchivingAlways() && inRecovery)))
+	if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
+		  (inRecoveryState == NOT_IN_RECOVERY
&& !XLogArchivingActive()) &&
+		  (inRecoveryState == IN_ARCHIVE_RECOVERY
&& !XLogArchivingAlways()))) return true;

The last condition seems to cause XLogArchiveCheckDone() to return
true in archive recovery mode with archive_mode=on, then cause
unarchived WAL files with .ready to be removed. Is my understanding right?
If yes, that behavior doesn't seem to match with our consensus, i.e.,
WAL files with .ready should not be removed in that case.

We wrote:

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

So if in standby mode and archive_mode is not "always", the .ready files
should be removed.

In current patch, I split the conditions in the sake of clarity.

+/* Recovery state */
+typedef enum RecoveryState
+{
+	NOT_IN_RECOVERY = 0,
+	IN_CRASH_RECOVERY,
+	IN_ARCHIVE_RECOVERY
+} RecoveryState;

Isn't it better to add more comments here? For example, what does
"Recovery state" mean? Which state is used in standby mode? Why? ,etc.

Explanations added.

Is it really ok not to have the value indicating standby mode?

Unless I'm wrong, this shared state only indicates why we are recovering WAL,
it does not need reflect the status of the instance in shared memory.
StandbyMode is already available for the startup process. Would it be useful
to share this mode outside of the startup process?

These enum values names are confusing because the variables with
similar names already exist. For example, IN_CRASH_RECOVERY vs.
DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them,
.e.g., by adding the prefix.

Well, I lack of imagination. So I picked CRASH_RECOVERING and
ARCHIVE_RECOVERING.

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
process detects the standby.signal file and sets the local variable
StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
to know whether the server is in standby mode or not.

Thought?

I try to avoid a new flag in memory with the proposal in attachment of this
email. It seems to me various combinaisons of booleans with subtle
differences around the same subject makes it a bit trappy and complicated
to understand.

Ok, so firstly I try to review your patch!

Thank you for your review and help!

In attachment:
* fix proposal: 0001-v4-Fix-WAL-retention-during-production-crash-recovery.patch
* add test: 0002-v2-Add-test-on-non-archived-WAL-during-crash-recovery.patch
0002-v2 fix conflict with current master.

Regards,

Attachments:

0001-v4-Fix-WAL-retention-during-production-crash-recovery.patchtext/x-patchDownload+73-17
0002-v2-Add-test-on-non-archived-WAL-during-crash-recovery.patchtext/x-patchDownload+14-3
#21Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#20)
hackersbugs
#22Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#18)
hackersbugs
In reply to: Michael Paquier (#21)
hackersbugs
In reply to: Kyotaro Horiguchi (#22)
hackersbugs
#25Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#24)
hackersbugs
In reply to: Kyotaro Horiguchi (#25)
hackersbugs
In reply to: Jehan-Guillaume de Rorthais (#23)
hackersbugs
#28Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#26)
hackersbugs
#29Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#27)
hackersbugs
#30Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#29)
hackersbugs
In reply to: Michael Paquier (#30)
hackersbugs
In reply to: Kyotaro Horiguchi (#28)
hackersbugs
#33Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#31)
hackersbugs
In reply to: Michael Paquier (#33)
hackersbugs
#35Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#34)
hackersbugs
In reply to: Michael Paquier (#35)
hackersbugs
#37Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#36)
hackersbugs
#38Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#35)
hackersbugs
#39Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#38)
hackersbugs
In reply to: Michael Paquier (#39)
hackersbugs
#41Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#40)
hackersbugs
#42Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#41)
hackersbugs
#43Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#42)
hackersbugs
#44Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#43)
hackersbugs
In reply to: Kyotaro Horiguchi (#44)
hackersbugs
In reply to: Michael Paquier (#41)
hackersbugs
#47Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#45)
hackersbugs
#48Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#46)
hackersbugs
#49Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#48)
hackersbugs
In reply to: Michael Paquier (#48)
hackersbugs
In reply to: Michael Paquier (#49)
hackersbugs
In reply to: Jehan-Guillaume de Rorthais (#51)
hackersbugs
#53Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#52)
hackersbugs
#54Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#53)
hackersbugs
In reply to: Kyotaro Horiguchi (#54)
hackersbugs
#56Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#55)
hackersbugs
#57Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#56)
hackersbugs
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#56)
hackersbugs
#59Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#58)
hackersbugs
#60Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#57)
hackersbugs
In reply to: Michael Paquier (#59)
hackersbugs
#62Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#61)
hackersbugs
#63Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#62)
hackersbugs
In reply to: Kyotaro Horiguchi (#63)
hackersbugs
#65Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#63)
hackersbugs
In reply to: Michael Paquier (#65)
hackersbugs
#67Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#66)
hackersbugs