Enhance traceability of wal_level changes for backup management

Started by osumi.takamichi@fujitsu.comabout 5 years ago18 messages
#1osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com

Hi,

This thread came from another thread about wal_level [1]/messages/by-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320@TYAPR01MB2990.jpnprd01.prod.outlook.com.

Mainly from backup management tools perspective
such as pgBackRest, EDB's BART and pg_probackup,
it seems worth talking about a way comprehensively
to trace and recognize wal_level changes for various purposes and values
like necessity of invalidating old backups for example.

In the thread [1]/messages/by-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320@TYAPR01MB2990.jpnprd01.prod.outlook.com, I talk about wal_level='none' but
these kind of topic applies changing wal_level to 'minimal'
from higher level too. Accordingly, I made this topic as a new independent thread.

Currently, these backup management tools described above
work when wal_level is higher than minimal
because these use physical online backup or wal archiving
but giving any useful ideas for backup management
related to wal_level changes is welcomed.

[1]: /messages/by-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320@TYAPR01MB2990.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#2osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#1)
RE: Enhance traceability of wal_level changes for backup management

Hello, Sawada-san

I'll continue the discussion of [2]/messages/by-id/CAD21AoCotoAxxCmMVz6niwg4j6c3Er_-GboTLmHBft8pALpOGA@mail.gmail.com.
We talked about how to recognize the time or LSN
when/where wal_level is changed to 'none' there.

You said

The use case I imagined is that the user temporarily
changes wal_level to 'none' from 'replica' or 'logical' to speed up loading and
changes back to the normal. In this case, the backups taken before the
wal_level change cannot be used to restore the database to the point after the
wal_level change. So I think backup management tools would want to
recognize the time or LSN when/where wal_level is changed to ‘none’ in order
to do some actions such as invalidating older backups, re-calculating backup
redundancy etc.
Actually the same is true when the user changes to ‘minimal’. The tools would
need to recognize the time or LSN in this case too. Since temporarily changing
wal_level has been an uncommon use case some tools perhaps are not aware
of that yet. But since introducing wal_level = ’none’ could make the change
common, I think we need to provide a way for the tools.

I wondered, couldn't backup management tools utilize the information
in the backup that is needed to be made when wal_level is changed to "none" for example ?

As I said before, existing backup management tools support
only wal_level=replica or logical at present. And, if they would wish to alter the
status quo and want to cover the changes over wal_levels, I felt it's natural that
they support feature like taking a full backup, trigged by the wal_level changes (or server stop).

This is because taking a backup is a must for wal_level=none,
as I described in the patch of wal_level=none.
For example, they could prepare an easy way to
take an offline physical backup when the server stops for changing the wal_level.
(Here, they can support the change to minimal, too.)

What did you think ?

[2]: /messages/by-id/CAD21AoCotoAxxCmMVz6niwg4j6c3Er_-GboTLmHBft8pALpOGA@mail.gmail.com

Best Regards,
Takamichi Osumi

#3Stephen Frost
sfrost@snowman.net
In reply to: osumi.takamichi@fujitsu.com (#2)
Re: Enhance traceability of wal_level changes for backup management

Greetings,

* osumi.takamichi@fujitsu.com (osumi.takamichi@fujitsu.com) wrote:

You said

The use case I imagined is that the user temporarily
changes wal_level to 'none' from 'replica' or 'logical' to speed up loading and
changes back to the normal. In this case, the backups taken before the
wal_level change cannot be used to restore the database to the point after the
wal_level change. So I think backup management tools would want to
recognize the time or LSN when/where wal_level is changed to ‘none’ in order
to do some actions such as invalidating older backups, re-calculating backup
redundancy etc.
Actually the same is true when the user changes to ‘minimal’. The tools would
need to recognize the time or LSN in this case too. Since temporarily changing
wal_level has been an uncommon use case some tools perhaps are not aware
of that yet. But since introducing wal_level = ’none’ could make the change
common, I think we need to provide a way for the tools.

I continue to be against the idea of introducing another wal_level. If
there's additional things we can do to reduce WAL traffic while we
continue to use it to keep the system in a consistent state then we
should implement those for the 'minimal' case and be done with it.
Adding another wal_level is just going to be confusing and increase
complexity, and the chances that someone will end up in a bad state.

I wondered, couldn't backup management tools utilize the information
in the backup that is needed to be made when wal_level is changed to "none" for example ?

What information is that, exactly? If there's a way to detect that the
wal_level has been flipped to 'minimal' and then back to 'replica',
other than scanning the WAL, we'd certainly like to hear of it, so we
can implement logic in pgbackrest to detect that happening. I'll admit
that I've not gone hunting for such of late, but I don't recall seeing
anything that would change that either.

The idea proposed above about having the LSN and time recorded when a
wal_level change to minimal happens, presumably in pg_control, seems
like it could work as a way to allow external tools to more easily
figure out if the wal_level's been changed to minimal. Perhaps there's
a reason to track changes between replica and logical but I can't think
of any offhand and backup tools would still need to know if the wal
level was set to *minimal* independently of changes between replica and
logical.

Then again, once we add support for scanning the WAL to pgbackrest,
we'll almost certainly track it that way since that'll work for older
and released versions of PG, so I'm not really sure it's worth it to add
this to pg_control unless there's other reasons to.

As I said before, existing backup management tools support
only wal_level=replica or logical at present. And, if they would wish to alter the
status quo and want to cover the changes over wal_levels, I felt it's natural that
they support feature like taking a full backup, trigged by the wal_level changes (or server stop).

Sure, but there needs to be a way to actually do that..

This is because taking a backup is a must for wal_level=none,
as I described in the patch of wal_level=none.
For example, they could prepare an easy way to
take an offline physical backup when the server stops for changing the wal_level.
(Here, they can support the change to minimal, too.)

pgbackrest does support offline physical backups and it's pretty easy
(just pass in --no-online). That doesn't really help with the issue of
detecting a wal_level change though.

Thanks,

Stephen

#4tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#2)
RE: Enhance traceability of wal_level changes for backup management

From: osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com>

I wondered, couldn't backup management tools utilize the information
in the backup that is needed to be made when wal_level is changed to "none"
for example ?

IIRC, someone proposed in the original thread that the change count can be recorded in pg_control. The change count is incremented when wal_level is changed from replica or higher to minimal or lower. Maybe you can do it easily in XLogReportParameters().

Then, the backup management tool compares the change counts of pg_control in a backup and that of the current pg_control. If the change count is different, the tool assumes that the backup cannot be used to recover the database up to date.

Ideally, it'd be desirable for PostgreSQL core to have a backup catalog management capability like Oracle RMAN. Then, when the wal_level is changed, Postgres may be able to invalidate all backups in the backup catalog.

As I said before, existing backup management tools support
only wal_level=replica or logical at present. And, if they would wish to alter the
status quo and want to cover the changes over wal_levels, I felt it's natural that
they support feature like taking a full backup, trigged by the wal_level changes
(or server stop).

In that regard, a feature like Oracle Server Alert would be useful. When important events occur, the database server records them in the alert queue. Administration tools read from the alert queue and act accordingly. wal_level change can be recorded in the alert queue, and the backup management tool polls the queue and detect the change.

Regards
Takayuki Tsunakawa

#5osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Stephen Frost (#3)
RE: Enhance traceability of wal_level changes for backup management

Hi Stephen

Thank you so much for replying !
On Thursday, January 7, 2021 2:40 AM Stephen Frost <sfrost@snowman.net> wrote:

* osumi.takamichi@fujitsu.com (osumi.takamichi@fujitsu.com) wrote:

You said

The use case I imagined is that the user temporarily changes
wal_level to 'none' from 'replica' or 'logical' to speed up loading
and changes back to the normal. In this case, the backups taken
before the wal_level change cannot be used to restore the database
to the point after the wal_level change. So I think backup
management tools would want to recognize the time or LSN when/where
wal_level is changed to ‘none’ in order to do some actions such as
invalidating older backups, re-calculating backup redundancy etc.
Actually the same is true when the user changes to ‘minimal’. The
tools would need to recognize the time or LSN in this case too.
Since temporarily changing wal_level has been an uncommon use case
some tools perhaps are not aware of that yet. But since introducing
wal_level = ’none’ could make the change common, I think we need to

provide a way for the tools.

I continue to be against the idea of introducing another wal_level. If there's
additional things we can do to reduce WAL traffic while we continue to use it to
keep the system in a consistent state then we should implement those for the
'minimal' case and be done with it.

The new wal_level=none patch achieves something that cannot be done
or cannot be implemented together with wal_level='minimal' clearly.
Did you have a look at the peformance evaluation result that I conducted in [1]/messages/by-id/OSBPR01MB48887ECC140A97494C542264EDF40@OSBPR01MB4888.jpnprd01.prod.outlook.com ?
It proved that data loading of 'none' is much faster than that of 'minimal'.

Adding another wal_level is just going to be confusing and increase complexity,
and the chances that someone will end up in a bad state.

Even if when we committed another idea,
that is "ALTER TABLE tbl SET UNLOGGED/LOGGED without copying relation data",
the complexity like taking a full backup before bulk data loading didn't change
and when any accidents happened during no wal logging for specific table with the improvement,
user would need to start from the backup again. This looks same to me.
Additionally, the patch itself in that thread is big and more complicated.
The complexity you meant is the wal_level's impact to backup management tools or anything else ?

I wondered, couldn't backup management tools utilize the information
in the backup that is needed to be made when wal_level is changed to "none"

for example ?

What information is that, exactly? If there's a way to detect that the wal_level
has been flipped to 'minimal' and then back to 'replica', other than scanning the
WAL, we'd certainly like to hear of it, so we can implement logic in pgbackrest
to detect that happening. I'll admit that I've not gone hunting for such of late,
but I don't recall seeing anything that would change that either.

Excuse me for making you confused.
I was thinking about control file in the backup as information.

I'm not familiar with the internals of those backup management tools
but do they monitor the control file and its values of the runnning server at short intervals ?
And, if they don't do so and if we want accurate time or LSN that indicates wal_level changes,
I thought we could pick up exact information from control file of cluster directory or its backup
during server stop (because changing wal_level requires server stop).
That's all. Sorry for the noise.

The idea proposed above about having the LSN and time recorded when a
wal_level change to minimal happens, presumably in pg_control, seems like it
could work as a way to allow external tools to more easily figure out if the
wal_level's been changed to minimal. Perhaps there's a reason to track
changes between replica and logical but I can't think of any offhand and backup
tools would still need to know if the wal level was set to *minimal*
independently of changes between replica and logical.

Here, probably we use different assumptions.
What you say makes sense if we commit neither the patch of
ALTER TABLE SET UNLOGGED/LOGGED without copy (during wal_level=minimal)
nor the patch of new wal_level 'none' ?
We were talking about a case that fast bulk data loading is needed
and the user changes wal_level to 'none' or 'minimal' temporarily
from higher level in order to speed up data loading and make it back to the higher level again.
After this operation, we need to invalidate the old backups taken before the data loading.
In this case, we have to track the change to 'minimal'. Didn't it make sense ?

Then again, once we add support for scanning the WAL to pgbackrest, we'll
almost certainly track it that way since that'll work for older and released
versions of PG, so I'm not really sure it's worth it to add this to pg_control
unless there's other reasons to.

As I said before, existing backup management tools support only
wal_level=replica or logical at present. And, if they would wish to
alter the status quo and want to cover the changes over wal_levels, I
felt it's natural that they support feature like taking a full backup, trigged by

the wal_level changes (or server stop).

Sure, but there needs to be a way to actually do that..

This is because taking a backup is a must for wal_level=none, as I
described in the patch of wal_level=none.
For example, they could prepare an easy way to take an offline
physical backup when the server stops for changing the wal_level.
(Here, they can support the change to minimal, too.)

pgbackrest does support offline physical backups and it's pretty easy (just pass
in --no-online). That doesn't really help with the issue of detecting a wal_level
change though.

Thanks for the tip,

[1]: /messages/by-id/OSBPR01MB48887ECC140A97494C542264EDF40@OSBPR01MB4888.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

#6Stephen Frost
sfrost@snowman.net
In reply to: osumi.takamichi@fujitsu.com (#5)
Re: Enhance traceability of wal_level changes for backup management

Greetings,

* osumi.takamichi@fujitsu.com (osumi.takamichi@fujitsu.com) wrote:

On Thursday, January 7, 2021 2:40 AM Stephen Frost <sfrost@snowman.net> wrote:

* osumi.takamichi@fujitsu.com (osumi.takamichi@fujitsu.com) wrote:

You said

The use case I imagined is that the user temporarily changes
wal_level to 'none' from 'replica' or 'logical' to speed up loading
and changes back to the normal. In this case, the backups taken
before the wal_level change cannot be used to restore the database
to the point after the wal_level change. So I think backup
management tools would want to recognize the time or LSN when/where
wal_level is changed to ‘none’ in order to do some actions such as
invalidating older backups, re-calculating backup redundancy etc.
Actually the same is true when the user changes to ‘minimal’. The
tools would need to recognize the time or LSN in this case too.
Since temporarily changing wal_level has been an uncommon use case
some tools perhaps are not aware of that yet. But since introducing
wal_level = ’none’ could make the change common, I think we need to

provide a way for the tools.

I continue to be against the idea of introducing another wal_level. If there's
additional things we can do to reduce WAL traffic while we continue to use it to
keep the system in a consistent state then we should implement those for the
'minimal' case and be done with it.

The new wal_level=none patch achieves something that cannot be done
or cannot be implemented together with wal_level='minimal' clearly.

I disagree.

Did you have a look at the peformance evaluation result that I conducted in [1] ?
It proved that data loading of 'none' is much faster than that of 'minimal'.

That test claims to have generated 10G worth of WAL, which makes it seem
pretty likely that you didn't use UNLOGGED tables, and didn't create the
table in the same transaction as you performed the data loading for that
table, so, naturally, you got the full amount of WAL.

Adding another wal_level is just going to be confusing and increase complexity,
and the chances that someone will end up in a bad state.

Even if when we committed another idea,
that is "ALTER TABLE tbl SET UNLOGGED/LOGGED without copying relation data",
the complexity like taking a full backup before bulk data loading didn't change
and when any accidents happened during no wal logging for specific table with the improvement,
user would need to start from the backup again. This looks same to me.

Yes, there is still the issue that running with wal_level = minimal for
a period of time means that you can't perform PITR from before the
change to minimal through to the time after it's been changed back to
something higher than minimal. That's no different, I agree. That
isn't an argument to introduce another WAL level though.

Additionally, the patch itself in that thread is big and more complicated.

This isn't a reason to avoid introducing another wal_level.

The complexity you meant is the wal_level's impact to backup management tools or anything else ?

The complexity of having another wal_level and having to document it and
explain how it behaves to users, particularly when, effectively, it
shouldn't actually be any different from wal_level = minimal, assuming
we've found and implemented the interesting optimizations regarding
wal_level = minimal.

I wondered, couldn't backup management tools utilize the information
in the backup that is needed to be made when wal_level is changed to "none"

for example ?

What information is that, exactly? If there's a way to detect that the wal_level
has been flipped to 'minimal' and then back to 'replica', other than scanning the
WAL, we'd certainly like to hear of it, so we can implement logic in pgbackrest
to detect that happening. I'll admit that I've not gone hunting for such of late,
but I don't recall seeing anything that would change that either.

Excuse me for making you confused.
I was thinking about control file in the backup as information.

I'm not familiar with the internals of those backup management tools
but do they monitor the control file and its values of the runnning server at short intervals ?
And, if they don't do so and if we want accurate time or LSN that indicates wal_level changes,
I thought we could pick up exact information from control file of cluster directory or its backup
during server stop (because changing wal_level requires server stop).
That's all. Sorry for the noise.

If it's in the control file then pg_controldata should be able to pull
it ount and all of the backup tools should be able to manage that, even
if they can't read it directly like tools which are also written in C,
so having it in the control file seems alright to me.

The idea proposed above about having the LSN and time recorded when a
wal_level change to minimal happens, presumably in pg_control, seems like it
could work as a way to allow external tools to more easily figure out if the
wal_level's been changed to minimal. Perhaps there's a reason to track
changes between replica and logical but I can't think of any offhand and backup
tools would still need to know if the wal level was set to *minimal*
independently of changes between replica and logical.

Here, probably we use different assumptions.
What you say makes sense if we commit neither the patch of
ALTER TABLE SET UNLOGGED/LOGGED without copy (during wal_level=minimal)
nor the patch of new wal_level 'none' ?

Yes, it does.

We were talking about a case that fast bulk data loading is needed
and the user changes wal_level to 'none' or 'minimal' temporarily
from higher level in order to speed up data loading and make it back to the higher level again.
After this operation, we need to invalidate the old backups taken before the data loading.
In this case, we have to track the change to 'minimal'. Didn't it make sense ?

The old backups don't actually need to be invaliated, but we want to
strongly encourage a new backup be taken as soon as possible since WAL
replay from an older backup through this period would end up having the
wrong result. We really shouldn't need to track anything more than the
"LSN when wal_level was last seen to be less than replica" or something
along those lines. I was questioning if there was any need to track the
changes to wal_level at a more granular level- none come to mind for me,
but perhaps someone else sees a reason to do so.

Thanks,

Stephen

#7osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#2)
RE: Enhance traceability of wal_level changes for backup management

Hi

Apologies for my delay.
On Wednesday, January 6, 2021 7:03 PM I wrote:

I'll continue the discussion of [2].
We talked about how to recognize the time or LSN when/where wal_level is
changed to 'none' there.

You said

The use case I imagined is that the user temporarily changes wal_level
to 'none' from 'replica' or 'logical' to speed up loading and changes
back to the normal. In this case, the backups taken before the
wal_level change cannot be used to restore the database to the point
after the wal_level change. So I think backup management tools would
want to recognize the time or LSN when/where wal_level is changed to
‘none’ in order to do some actions such as invalidating older backups,
re-calculating backup redundancy etc.
Actually the same is true when the user changes to ‘minimal’. The
tools would need to recognize the time or LSN in this case too. Since
temporarily changing wal_level has been an uncommon use case some
tools perhaps are not aware of that yet. But since introducing
wal_level = ’none’ could make the change common, I think we need to

provide a way for the tools.

Before my implementation, I'd like to confirm something.

As of now, I think there are two major ideas already.
I think to implement the 1st idea suffices.
If no one disagree with it, I'll proceed with (1) below.

(1) writing the time or LSN in the control file
to indicate when/where wal_level is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.
I think we reset this when user executes pg_basebackup successfully.

(2) implementing incremental counters that indicates
drop of wal_level from replica to minimal(or between other levels).
Its purpose was to compare the wal_level changes between snapshots.
When any monitoring tools detect any difference of the counter,
we can predict something happened immediately without checking WAL in between.

The former could give accureate information for backup management
while the latter gives easier way to compare snapshots, I think.

By the way, thankfully I got advice to refer to
Oracle's feature such as Oracle Server Alert or
backup catalog management capability from Tsunakawa-San.
However, because those development would be huge, then
I'd like to choose either the first one or the second one
and for the purpose to give better information, I prefer the first one.

Any comments ?

Best Regards,
Takamichi Osumi

#8osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: osumi.takamichi@fujitsu.com (#7)
1 attachment(s)
RE: Enhance traceability of wal_level changes for backup management

Hello

On Thursday, January 21, 2021 11:19 PM I wrote:

If no one disagree with it, I'll proceed with (1) below.

(1) writing the time or LSN in the control file to indicate when/where wal_level
is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.

I attached the first patch which implementes this idea.
It was aligned by pgindent and shows no regression.

Best Regards,
Takamichi Osumi

Attachments:

trace_wal_level_drop_v01.patchapplication/octet-stream; name=trace_wal_level_drop_v01.patchDownload
From b4b249842df0f0550dee3722890eded7568c0c2e Mon Sep 17 00:00:00 2001
From: Osumi Takamichi <osumi.takamichi@fujitsu.com>
Date: Thu, 28 Jan 2021 00:16:01 +0000
Subject: [PATCH v01] trace wal_level change for backup management

This patch adds a new parameter in the control file
in order to trace wal_level drop to minimal from higher level.
The purpose is to expose the LSN when the drop of wal_level
happens and to make it possible to notify the change to the users.
The value is reset when pg_basebackup creates a new base backup successfully.

Author: Takamichi Osumi <osumi.takamichi@fujitsu.com>
Discussion : https://www.postgresql.org/message-id/OSBPR01MB4888B94DF1C9325033C26D00EDD00%40OSBPR01MB4888.jpnprd01.prod.outlook.com
---
 src/backend/access/transam/xlog.c            | 35 +++++++++++++++
 src/backend/replication/basebackup.c         |  3 ++
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 64 +++++++++++++++++++++++++++-
 src/bin/pg_controldata/pg_controldata.c      |  3 ++
 src/bin/pg_resetwal/pg_resetwal.c            |  6 ++-
 src/include/access/xlog.h                    |  1 +
 src/include/catalog/pg_control.h             |  7 +++
 7 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cc007b8..990c93f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -772,6 +772,16 @@ static ControlFileData *ControlFile = NULL;
  */
 #define ConvertToXSegs(x, segsize)	XLogMBVarToSegs((x), (segsize))
 
+/*
+ * Return true if the first wal_level drop which
+ * could cause discontinuity of snapshots happens,
+ * since the latest base backup was taken.
+ */
+#define TestWalLevelGap() \
+	(ControlFile->wal_level_drop == InvalidXLogRecPtr && \
+	 ControlFile->wal_level >= WAL_LEVEL_REPLICA && \
+	 wal_level < WAL_LEVEL_REPLICA)
+
 /* The number of bytes in a WAL segment usable for WAL data. */
 static int	UsableBytesInSegment;
 
@@ -4629,6 +4639,7 @@ InitControlFile(uint64 sysidentifier)
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
+	ControlFile->wal_level_drop = InvalidXLogRecPtr;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
 }
@@ -4931,6 +4942,19 @@ UpdateControlFile(void)
 }
 
 /*
+ * Successful execution of pg_basebackup
+ * needs to reset wal_level_drop in the control file.
+ */
+void
+ResetWalLevelDrop(void)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->wal_level_drop = InvalidXLogRecPtr;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+}
+
+/*
  * Returns the unique system identifier from control file.
  */
 uint64
@@ -9847,6 +9871,11 @@ XLogRestorePoint(const char *rpName)
 static void
 XLogReportParameters(void)
 {
+	/*
+	 * flag to know whether or not wal_level dropped to minimal from higher level
+	 */
+	bool cause_discontinuity = false;
+
 	if (wal_level != ControlFile->wal_level ||
 		wal_log_hints != ControlFile->wal_log_hints ||
 		MaxConnections != ControlFile->MaxConnections ||
@@ -9876,6 +9905,8 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			if (TestWalLevelGap())
+				cause_discontinuity = true;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9894,6 +9925,8 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		if (cause_discontinuity)
+			ControlFile->wal_level_drop = GetXLogInsertRecPtr();
 		UpdateControlFile();
 
 		LWLockRelease(ControlFileLock);
@@ -10322,6 +10355,8 @@ xlog_redo(XLogReaderState *record)
 			ControlFile->minRecoveryPoint = lsn;
 			ControlFile->minRecoveryPointTLI = ThisTimeLineID;
 		}
+		if (TestWalLevelGap())
+			ControlFile->wal_level_drop = GetXLogInsertRecPtr();
 
 		CommitTsParameterChange(xlrec.track_commit_timestamp,
 								ControlFile->track_commit_timestamp);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0f54635..3fd0041 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -739,6 +739,9 @@ perform_base_backup(basebackup_options *opt)
 	/* clean up the resource owner we created */
 	WalSndResourceCleanup(true);
 
+	/* reset wal_level_drop, because the gap of wal_level has been restored */
+	ResetWalLevelDrop();
+
 	pgstat_progress_end_command();
 }
 
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c..35540e6 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use Test::More tests => 118;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -575,3 +575,65 @@ rmtree("$tempdir/backup_corrupt4");
 
 $node->safe_psql('postgres', "DROP TABLE corrupt1;");
 $node->safe_psql('postgres', "DROP TABLE corrupt2;");
+
+# confirm pg_basebackup works together with the control
+# file's parameter to trace wal_level drop.
+my $another_node = get_new_node('wal_level');
+my $backup_name = 'my_backup';
+
+$another_node->init(has_archiving => 1);
+$another_node->append_conf('postgresql.conf', 'wal_level = replica');
+$another_node->start;
+
+my $data = $another_node->data_dir;
+command_like(
+	[ 'pg_controldata', $data ],
+	qr/Oldest location of wal_level change to minimal:.*0\/0/,
+	'check the initial value is equal to 0/0');
+$another_node->stop;
+
+$another_node->append_conf('postgresql.conf',
+					   'wal_level = logical');
+$another_node->start;
+command_like(
+	[ 'pg_controldata', $data ],
+	qr/Oldest location of wal_level change to minimal:.*0\/0/,
+	'raising the wal_level from raplica to logical does not change initial value');
+$another_node->stop;
+
+$another_node->append_conf('postgresql.conf',q[
+wal_level = minimal
+max_wal_senders = 0
+archive_mode = off
+]);
+$another_node->start;
+$another_node->stop;
+
+my ($stdout, $stderr) = run_command([ 'pg_controldata', $data ]);
+my @control_data = split("\n", $stdout);
+my $wal_level_drop_lsn = undef;
+foreach (@control_data)
+{
+	if ($_  =~ /^Oldest location of wal_level change to minimal:.*\/(.*)/)
+	{
+		$wal_level_drop_lsn = $1;
+		chomp($wal_level_drop_lsn);
+		last;
+	}
+}
+die unless defined $wal_level_drop_lsn;
+die if $wal_level_drop_lsn eq '0';
+
+$another_node->append_conf('postgresql.conf', q[
+wal_level = replica
+max_wal_senders = 10
+]);
+
+$another_node->start;
+$another_node->backup($backup_name);
+$another_node->stop;
+
+command_like(
+	[ 'pg_controldata', $data ],
+	qr/Oldest location of wal_level change to minimal:.*0\/0/,
+	'the parameter has been changed to the initial value');
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0..65ba290 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -334,5 +334,8 @@ main(int argc, char *argv[])
 		   ControlFile->data_checksum_version);
 	printf(_("Mock authentication nonce:            %s\n"),
 		   mock_auth_nonce_str);
+	printf(_("Oldest location of wal_level change to minimal: %X/%X\n"),
+		   (uint32) (ControlFile->wal_level_drop >> 32),
+		   (uint32) ControlFile->wal_level_drop);
 	return 0;
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 805dafe..a8555f4 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -701,7 +701,10 @@ GuessControlValues(void)
 	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
 	ControlFile.unloggedLSN = FirstNormalUnloggedLSN;
 
-	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
+	/*
+	 * minRecoveryPoint, backupStartPoint, backupEndPoint and wal_level_drop
+	 * can be left zero
+	 */
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
@@ -901,6 +904,7 @@ RewriteControlFile(void)
 	ControlFile.backupStartPoint = 0;
 	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
+	ControlFile.wal_level_drop = 0;
 
 	/*
 	 * Force the defaults for max_* settings. The values don't really matter
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec107..c5e18af 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -316,6 +316,7 @@ extern TimestampTz GetLatestXTime(void);
 extern TimestampTz GetCurrentChunkReplayStartTime(void);
 
 extern void UpdateControlFile(void);
+extern void ResetWalLevelDrop(void);
 extern uint64 GetSystemIdentifier(void);
 extern char *GetMockAuthenticationNonce(void);
 extern bool DataChecksumsEnabled(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f4815..d4c5ccb 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -226,6 +226,13 @@ typedef struct ControlFileData
 	 */
 	char		mock_authentication_nonce[MOCK_AUTH_NONCE_LEN];
 
+	/*
+	 * This indicates LSN just after the oldest wal_level change to lower than
+	 * wal_level=replica because it causes discontinuity of snapshots. Reset
+	 * by creating a new base backup, using pg_basebackup.
+	 */
+	XLogRecPtr	wal_level_drop;
+
 	/* CRC of all above ... MUST BE LAST! */
 	pg_crc32c	crc;
 } ControlFileData;
-- 
2.2.0

#9Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: osumi.takamichi@fujitsu.com (#8)
Re: Enhance traceability of wal_level changes for backup management

On 28.01.21 01:44, osumi.takamichi@fujitsu.com wrote:

(1) writing the time or LSN in the control file to indicate when/where wal_level
is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.

I attached the first patch which implementes this idea.
It was aligned by pgindent and shows no regression.

It's not clear to me what this is supposed to accomplish. I read the
thread, but it's still not clear. What is one supposed to do with this
information?

#10osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Eisentraut (#9)
RE: Enhance traceability of wal_level changes for backup management

On Sun, Mar 7, 2021 3:48 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

On 28.01.21 01:44, osumi.takamichi@fujitsu.com wrote:

(1) writing the time or LSN in the control file to indicate
when/where wal_level is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.

I attached the first patch which implementes this idea.
It was aligned by pgindent and shows no regression.

It's not clear to me what this is supposed to accomplish. I read the thread,
but it's still not clear.
What is one supposed to do with this information?

OK. The basic idea is to enable backup management
tools to recognize wal_level drop between *snapshots*.
When you have a snapshot of the cluster at one time and another one
at different time, with this new parameter, you can see if
anything that causes discontinuity from the drop happens
in the middle of the two snapshots without efforts to have a look at the WALs in between.

The new parameter is open to be monitored regularly
and if there's any change of that LSN, the tools can
take measures such as making an alert to users, taking a new backup
or invalidating past snapshots, which depends on the tools.

In the thread, we think the scenario of wal_level drop happens
when user wants to execute optimized bulk data loading.

Best Regards
Takamichi Osumi

#11David Steele
david@pgmasters.net
In reply to: osumi.takamichi@fujitsu.com (#10)
Re: Enhance traceability of wal_level changes for backup management

On 3/7/21 9:45 PM, osumi.takamichi@fujitsu.com wrote:

On Sun, Mar 7, 2021 3:48 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

On 28.01.21 01:44, osumi.takamichi@fujitsu.com wrote:

(1) writing the time or LSN in the control file to indicate
when/where wal_level is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.

I attached the first patch which implementes this idea.
It was aligned by pgindent and shows no regression.

It's not clear to me what this is supposed to accomplish. I read the thread,
but it's still not clear.
What is one supposed to do with this information?

OK. The basic idea is to enable backup management
tools to recognize wal_level drop between *snapshots*.
When you have a snapshot of the cluster at one time and another one
at different time, with this new parameter, you can see if
anything that causes discontinuity from the drop happens
in the middle of the two snapshots without efforts to have a look at the WALs in between.

As a backup software author, I don't see this feature as very useful.

The problem is that there are lots of ways for WAL to go missing so
monitoring the WAL archive for gaps is essential and this feature would
not replace that requirement. The only extra information you'd get is
the ability to classify the most recent gap as "intentional", maybe.

So, -1 from me.

Regards,
--
-David
david@pgmasters.net

#12osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: David Steele (#11)
RE: Enhance traceability of wal_level changes for backup management

On Tuesday, March 9, 2021 11:13 PM David Steele <david@pgmasters.net>

On 3/7/21 9:45 PM, osumi.takamichi@fujitsu.com wrote:

On Sun, Mar 7, 2021 3:48 AM Peter Eisentraut

<peter.eisentraut@enterprisedb.com> wrote:

On 28.01.21 01:44, osumi.takamichi@fujitsu.com wrote:

(1) writing the time or LSN in the control file to indicate
when/where wal_level is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.

I attached the first patch which implementes this idea.
It was aligned by pgindent and shows no regression.

It's not clear to me what this is supposed to accomplish. I read the
thread, but it's still not clear.
What is one supposed to do with this information?

OK. The basic idea is to enable backup management tools to recognize
wal_level drop between *snapshots*.
When you have a snapshot of the cluster at one time and another one at
different time, with this new parameter, you can see if anything that
causes discontinuity from the drop happens in the middle of the two
snapshots without efforts to have a look at the WALs in between.

As a backup software author, I don't see this feature as very useful.

The problem is that there are lots of ways for WAL to go missing so
monitoring the WAL archive for gaps is essential and this feature would not
replace that requirement. The only extra information you'd get is the ability to
classify the most recent gap as "intentional", maybe.

So, -1 from me.

Thanks for giving me a meaningful viewpoint.
Let me sleep on it, about whether the new param doesn't help in all cases or not.

Best Regards,
Takamichi Osumi

#13tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: David Steele (#11)
RE: Enhance traceability of wal_level changes for backup management

From: David Steele <david@pgmasters.net>

As a backup software author, I don't see this feature as very useful.

The problem is that there are lots of ways for WAL to go missing so
monitoring the WAL archive for gaps is essential and this feature would
not replace that requirement. The only extra information you'd get is
the ability to classify the most recent gap as "intentional", maybe.

But how do you know there's any missing WAL? I think there are the following cases of missing WAL:

1. A WAL segment file is missing. e.g., 00000001 and 00000003 exist, but 00000002 doesn't.

2. All consecutive WAL segment files appear to exist, but some WAL records are missing.
This occurs ?only? when some WAL-optimized statements are run while wal_level = minimal.

Currently, backup management tools can detect 1 by scanning through the WAL archive directory. But the can't notice 2. The patch addresses this.

This is what some other people suggested in the thread for wal_level = none that Osumi-san referred to at the beginning of this thread. I don't think this detection feature is not a blocker for wal_level = none. So, I think this can be withdrawn if wal_level = none can be accepted.

Regards
Takayuki Tsunakawa

#14Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: osumi.takamichi@fujitsu.com (#10)
Re: Enhance traceability of wal_level changes for backup management

On 08.03.21 03:45, osumi.takamichi@fujitsu.com wrote:

OK. The basic idea is to enable backup management
tools to recognize wal_level drop between*snapshots*.
When you have a snapshot of the cluster at one time and another one
at different time, with this new parameter, you can see if
anything that causes discontinuity from the drop happens
in the middle of the two snapshots without efforts to have a look at the WALs in between.

Is this an actual problem? Changing wal_level requires a restart. Are
users frequently restarting their servers to change wal_level and then
wonder why their backups are misbehaving or incomplete? Why? Just like
fsync is "breaks your database", wal_level might as well be "breaks your
backups". Is it not documented well enough?

#15osumi.takamichi@fujitsu.com
osumi.takamichi@fujitsu.com
In reply to: Peter Eisentraut (#14)
RE: Enhance traceability of wal_level changes for backup management

On Friday, March 12, 2021 5:04 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

On 08.03.21 03:45, osumi.takamichi@fujitsu.com wrote:

OK. The basic idea is to enable backup management tools to recognize
wal_level drop between*snapshots*.
When you have a snapshot of the cluster at one time and another one at
different time, with this new parameter, you can see if anything that
causes discontinuity from the drop happens in the middle of the two
snapshots without efforts to have a look at the WALs in between.

Is this an actual problem? Changing wal_level requires a restart. Are users
frequently restarting their servers to change wal_level and then wonder why
their backups are misbehaving or incomplete? Why? Just like fsync is
"breaks your database", wal_level might as well be "breaks your backups". Is
it not documented well enough?

I understand what you mean.
However, this thread partly came from a concern of
another thread to introduce a new wal_level
which would make the wal_level change like above more common.
Therefore, I think it's preferable to discuss better safeguard or
better notification way to users instead of just leaving it without doing anything.

Best Regards,
Takamichi Osumi

#16Stephen Frost
sfrost@snowman.net
In reply to: osumi.takamichi@fujitsu.com (#15)
Re: Enhance traceability of wal_level changes for backup management

Greetings,

* tsunakawa.takay@fujitsu.com (tsunakawa.takay@fujitsu.com) wrote:

From: David Steele <david@pgmasters.net>

As a backup software author, I don't see this feature as very useful.

The problem is that there are lots of ways for WAL to go missing so
monitoring the WAL archive for gaps is essential and this feature would
not replace that requirement. The only extra information you'd get is
the ability to classify the most recent gap as "intentional", maybe.

But how do you know there's any missing WAL? I think there are the following cases of missing WAL:

1. A WAL segment file is missing. e.g., 00000001 and 00000003 exist, but 00000002 doesn't.

2. All consecutive WAL segment files appear to exist, but some WAL records are missing.
This occurs ?only? when some WAL-optimized statements are run while wal_level = minimal.

Currently, backup management tools can detect 1 by scanning through the WAL archive directory. But the can't notice 2. The patch addresses this.

They could notice #2 also by scanning the WAL, but that's certainly a
lot more work than just looking in pg_control.

* Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote:

On 08.03.21 03:45, osumi.takamichi@fujitsu.com wrote:

OK. The basic idea is to enable backup management
tools to recognize wal_level drop between*snapshots*.
When you have a snapshot of the cluster at one time and another one
at different time, with this new parameter, you can see if
anything that causes discontinuity from the drop happens
in the middle of the two snapshots without efforts to have a look at the WALs in between.

Is this an actual problem? Changing wal_level requires a restart. Are
users frequently restarting their servers to change wal_level and then
wonder why their backups are misbehaving or incomplete? Why? Just like
fsync is "breaks your database", wal_level might as well be "breaks your
backups". Is it not documented well enough?

We explicitly document that people can switch the WAL level and restart
to do bulk data loads faster, and there's certainly no shortage of
discussion (including what prompted this thread..) about doing exactly
that. Adding more documentation around that would certainly be good,
as would changing this:

ereport(WARNING,
(errmsg("WAL was generated with wal_level=minimal, data may be missing"),
errhint("This happens if you temporarily set wal_level=minimal without taking a new base backup.")));

into a PANIC instead of a WARNING. It's simply far too easy to end up
with corruption in the system when doing PITR through a period of time
when the WAL level was set to minimal. Unfortunately, if the user
didn't happen to know that they needed to take a new full backup after
flipping to minimal and back then they could end up with corruption at
restore/replay time which is certainly not when you want anything to go
wrong. If it was available in the control file then we could more
proactively make noise at the user to take a new full backup.

Thanks,

Stephen

#17Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Stephen Frost (#16)
Re: Enhance traceability of wal_level changes for backup management

On Mon, 2021-03-15 at 17:32 -0400, Stephen Frost wrote:

We explicitly document that people can switch the WAL level and restart
to do bulk data loads faster, and there's certainly no shortage of
discussion (including what prompted this thread..) about doing exactly
that. Adding more documentation around that would certainly be good,
as would changing this:

ereport(WARNING,
(errmsg("WAL was generated with wal_level=minimal, data may be missing"),
errhint("This happens if you temporarily set wal_level=minimal without taking a new base backup.")));

into a PANIC instead of a WARNING.

There is a patch in the commitfest that does exactly that (except it
uses ERROR rather than PANIC):
/messages/by-id/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com

Yours,
Laurenz Albe

#18vignesh C
vignesh21@gmail.com
In reply to: osumi.takamichi@fujitsu.com (#8)
Re: Enhance traceability of wal_level changes for backup management

On Thu, Jan 28, 2021 at 6:14 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:

Hello

On Thursday, January 21, 2021 11:19 PM I wrote:

If no one disagree with it, I'll proceed with (1) below.

(1) writing the time or LSN in the control file to indicate when/where wal_level
is changed to 'minimal'
from upper level to invalidate the old backups or make alerts to users.

I attached the first patch which implementes this idea.
It was aligned by pgindent and shows no regression.

The patch does not apply on Head anymore, could you rebase and post a
patch. I'm changing the status to "Waiting for Author".

Regards,
Vignesh