Incremental backup from a streaming replication standby

Started by Laurenz Albe · over 1 year ago · 20 messages
#1 Laurenz Albe
laurenz.albe@cybertec.at

I played around with incremental backup yesterday and tried $subject

The WAL summarizer is running on the standby server, but when I try
to take an incremental backup, I get an error that I understand to mean
that WAL summarizing hasn't caught up yet.
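
In case it helps, this is roughly how I checked that the summarizer is
active (the port is just from my test setup; summarize_wal must be "on"
for incremental backup to be possible at all):

  psql -p 5433 -c "SHOW summarize_wal;"    # must report 'on' on the standby
  psql -p 5433 -c "SELECT * FROM pg_available_wal_summaries();"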

I am not sure if that is working as designed, but if it is, I think it
should be documented.

Yours,
Laurenz Albe

#2 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Laurenz Albe (#1)
Re: Incremental backup from a streaming replication standby fails

On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:

> I played around with incremental backup yesterday and tried $subject
>
> The WAL summarizer is running on the standby server, but when I try
> to take an incremental backup, I get an error that I understand to mean
> that WAL summarizing hasn't caught up yet.
>
> I am not sure if that is working as designed, but if it is, I think it
> should be documented.

I played with this some more. Here is the exact error message:

ERROR: manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190

By trial and error I found that when I run a CHECKPOINT on the primary,
taking an incremental backup on the standby works.
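
In case somebody wants to reproduce it, this is roughly the sequence I
used (ports and paths are just from my test setup):

  # full backup from the standby (port 5433), then an immediate incremental
  pg_basebackup -h localhost -p 5433 -D /tmp/full
  pg_basebackup -h localhost -p 5433 -D /tmp/incr \
      --incremental=/tmp/full/backup_manifest    # fails with the above error
  # run a checkpoint on the primary (port 5432); once the standby has
  # replayed it, the incremental backup goes through
  psql -h localhost -p 5432 -c "CHECKPOINT;"
  pg_basebackup -h localhost -p 5433 -D /tmp/incr2 \
      --incremental=/tmp/full/backup_manifest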

I couldn't fathom the cause of that, but I think that that should either
be addressed or documented before v17 comes out.

Yours,
Laurenz Albe

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Laurenz Albe (#2)
Re: Incremental backup from a streaming replication standby fails

On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
>
>> I played around with incremental backup yesterday and tried $subject
>>
>> The WAL summarizer is running on the standby server, but when I try
>> to take an incremental backup, I get an error that I understand to mean
>> that WAL summarizing hasn't caught up yet.
>>
>> I am not sure if that is working as designed, but if it is, I think it
>> should be documented.
>
> I played with this some more. Here is the exact error message:
>
> ERROR: manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190
>
> By trial and error I found that when I run a CHECKPOINT on the primary,
> taking an incremental backup on the standby works.
>
> I couldn't fathom the cause of that, but I think that that should either
> be addressed or documented before v17 comes out.

I had a feeling this was going to be confusing. I'm not sure what to
do about it, but I'm open to suggestions.

Suppose you take a full backup F; replay of that backup will begin
with a checkpoint CF. Then you try to take an incremental backup I;
replay will begin from a checkpoint CI. For the incremental backup to
be valid, it must include all blocks modified after CF and before CI.
But when the backup is taken on a standby, no new checkpoint is
possible. Hence, CI will be the most recent restartpoint on the
standby that has occurred before the backup starts. So, if F is taken
on the primary and then I is immediately taken on the standby without
the standby having done a new restartpoint, or if both F and I are
taken on the standby and no restartpoint intervenes, then CF=CI. In
that scenario, an incremental backup is pretty much pointless: every
single incremental file would contain 0 blocks. You might as well just
use the backup you already have, unless one of the non-relation files
has changed. So, except in that unusual corner case, the fact that the
backup fails isn't really costing you anything. In fact, there's a
decent chance that it's saving you from taking a completely useless
backup.

On the primary, this doesn't occur, because there, each new backup
triggers a new checkpoint, so you always have CI>CF.
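
As a sketch, with invented paths: on the primary, back-to-back backups
work, because starting each backup forces a fresh checkpoint:

  pg_basebackup -h primary -D /tmp/F
  pg_basebackup -h primary -D /tmp/I --incremental=/tmp/F/backup_manifest  # CI>CF, fine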

The error message is definitely confusing. The reason I'm not sure how
to do better is that there is a large class of errors that a user
could make that would trigger an error of this general type. I'm
guessing that attempting a standby backup with CF=CI will turn out to
be the most common one, but I don't think it'll be the only one that
ever comes up. The code in PrepareForIncrementalBackup() focuses on
what has gone wrong on a technical level rather than on what you
probably did to create that situation. Indeed, the server doesn't
really know what you did to create that situation. You could trigger
the same error by taking a full backup on the primary and then trying to
take an incremental based on that full backup on a time-delayed
standby (or a lagging standby) whose replay position was behind the
primary, i.e. CI<CF.
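
For example, with invented host names, this sequence would hit the CI<CF
case:

  pg_basebackup -h primary -D /tmp/F -c fast
  # the delayed standby's latest restartpoint (CI) still precedes F's
  # starting checkpoint (CF), so the same error is raised
  pg_basebackup -h delayed-standby -D /tmp/I --incremental=/tmp/F/backup_manifest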

More perversely, you could trigger the error by spinning up a standby,
promoting it, taking a full backup, destroying the standby, removing
the timeline history file from the archive, spinning up a new standby,
promoting it onto the same timeline ID as the previous one, and then
trying to take an incremental backup relative to the full backup. This
might actually succeed, if you take the incremental backup at a later
LSN than the previous full backup, but, as you may guess, terrible
things will happen to you if you try to use such a backup. (I hope you
will agree that this would be a self-inflicted injury; I can't see any
way of detecting such cases.) If the incremental backup LSN is earlier
than the previous full backup LSN, this error will trigger.

So, given all the above, what can we do here?

One option might be to add an errhint() to the message. I had trouble
thinking of something that was compact enough to be reasonable to
include and yet reasonably accurate and useful, but maybe we can
brainstorm and figure something out. Another option might be to add
more to the documentation, but it's all so complicated that I'm not
sure what to write. It feels hard to make something that is brief
enough to be worth including, accurate enough to help more than it
hurts, and understandable enough that people who run into this will be
able to make use of it.

I think I'm a little too close to this to really know what the best
thing to do is, so I'm happy to hear suggestions from you and others.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 David Steele
david@pgmasters.net
In reply to: Robert Haas (#3)
Re: Incremental backup from a streaming replication standby fails

On 7/19/24 21:52, Robert Haas wrote:

> On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
>> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
>>
>>> I played around with incremental backup yesterday and tried $subject
>>>
>>> The WAL summarizer is running on the standby server, but when I try
>>> to take an incremental backup, I get an error that I understand to mean
>>> that WAL summarizing hasn't caught up yet.
>>>
>>> I am not sure if that is working as designed, but if it is, I think it
>>> should be documented.
>>
>> I played with this some more. Here is the exact error message:
>>
>> ERROR: manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190
>>
>> By trial and error I found that when I run a CHECKPOINT on the primary,
>> taking an incremental backup on the standby works.
>>
>> I couldn't fathom the cause of that, but I think that that should either
>> be addressed or documented before v17 comes out.
>
> I had a feeling this was going to be confusing. I'm not sure what to
> do about it, but I'm open to suggestions.
>
> Suppose you take a full backup F; replay of that backup will begin
> with a checkpoint CF. Then you try to take an incremental backup I;
> replay will begin from a checkpoint CI. For the incremental backup to
> be valid, it must include all blocks modified after CF and before CI.
> But when the backup is taken on a standby, no new checkpoint is
> possible. Hence, CI will be the most recent restartpoint on the
> standby that has occurred before the backup starts. So, if F is taken
> on the primary and then I is immediately taken on the standby without
> the standby having done a new restartpoint, or if both F and I are
> taken on the standby and no restartpoint intervenes, then CF=CI. In
> that scenario, an incremental backup is pretty much pointless: every
> single incremental file would contain 0 blocks. You might as well just
> use the backup you already have, unless one of the non-relation files
> has changed. So, except in that unusual corner case, the fact that the
> backup fails isn't really costing you anything. In fact, there's a
> decent chance that it's saving you from taking a completely useless
> backup.

<snip>

> I think I'm a little too close to this to really know what the best
> thing to do is, so I'm happy to hear suggestions from you and others.

I think it would be enough just to add a hint such as:

HINT: this is possible when making a standby backup with little or no
activity.

My guess is in production environments this will be uncommon.

For example, over the years we (pgBackRest) have gotten numerous bug
reports that time-targeted PITR does not work. In every case we found
that the user was just testing procedures and the database had no
activity between backups -- therefore recovery had no commit timestamps
to use to end recovery. Test environments sometimes produce weird results.

Having said that, I think it would be better if it worked even if it
does produce an empty backup. An empty backup wastes some disk space, but
if it produces less friction and saves an admin from having to intervene,
then it is probably worth it. I don't immediately see how to do that in a
reliable way, though, and in any case it seems like something to
consider for PG18.

Regards,
-David

#5 Robert Haas
robertmhaas@gmail.com
In reply to: David Steele (#4)
Re: Incremental backup from a streaming replication standby fails

On Fri, Jul 19, 2024 at 11:32 AM David Steele <david@pgmasters.net> wrote:

> I think it would be enough just to add a hint such as:
>
> HINT: this is possible when making a standby backup with little or no
> activity.

That could work (with "this" capitalized).

> My guess is in production environments this will be uncommon.

I think so too, but when it does happen, confusion may be common.

> Having said that, I think it would be better if it worked even if it
> does produce an empty backup. An empty backup wastes some disk space, but
> if it produces less friction and saves an admin from having to intervene,
> then it is probably worth it. I don't immediately see how to do that in a
> reliable way, though, and in any case it seems like something to
> consider for PG18.

Yeah, I'm pretty reluctant to weaken the sanity checks here, at least
in the short term. Note that what the check is actually complaining
about is that the previous backup thinks that the WAL it needs to
replay to reach consistency ends after the start of the current
backup. Even in this scenario, I'm not positive that everything would
be OK if we let the backup proceed, and it's easy to think of
scenarios where it definitely isn't. Plus, it's not quite clear how to
distinguish the cases where it's OK from the cases where it isn't.

--
Robert Haas
EDB: http://www.enterprisedb.com

#6 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#5)
Re: Incremental backup from a streaming replication standby fails

On Fri, 2024-07-19 at 12:59 -0400, Robert Haas wrote:
Thanks for looking at this.

> On Fri, Jul 19, 2024 at 11:32 AM David Steele <david@pgmasters.net> wrote:
>
>> I think it would be enough just to add a hint such as:
>>
>> HINT: this is possible when making a standby backup with little or no
>> activity.
>
> That could work (with "this" capitalized).
>
>> My guess is in production environments this will be uncommon.
>
> I think so too, but when it does happen, confusion may be common.

I guess this will most likely happen during tests like the one I made.

I'd be alright with the hint, but I'd say "when making an *incremental*
standby backup", because that's the only case where it can happen.

I think it would also be sufficient if we document that possibility.
When I got the error, I looked at the documentation of incremental
backup for any limitations with standby servers, but didn't find any.
A remark in the documentation would have satisfied me.

Yours,
Laurenz

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Laurenz Albe (#6)
Re: Incremental backup from a streaming replication standby fails

On Fri, Jul 19, 2024 at 2:41 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> I'd be alright with the hint, but I'd say "when making an *incremental*
> standby backup", because that's the only case where it can happen.
>
> I think it would also be sufficient if we document that possibility.
> When I got the error, I looked at the documentation of incremental
> backup for any limitations with standby servers, but didn't find any.
> A remark in the documentation would have satisfied me.

Would you like to propose a patch adding a hint and/or adjusting the
documentation? Or are you wanting me to do that?

--
Robert Haas
EDB: http://www.enterprisedb.com

#8 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#7)
1 attachment(s)
Re: Incremental backup from a streaming replication standby fails

On Fri, 2024-07-19 at 16:03 -0400, Robert Haas wrote:

> On Fri, Jul 19, 2024 at 2:41 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
>> I'd be alright with the hint, but I'd say "when making an *incremental*
>> standby backup", because that's the only case where it can happen.
>>
>> I think it would also be sufficient if we document that possibility.
>> When I got the error, I looked at the documentation of incremental
>> backup for any limitations with standby servers, but didn't find any.
>> A remark in the documentation would have satisfied me.
>
> Would you like to propose a patch adding a hint and/or adjusting the
> documentation? Or are you wanting me to do that?

Here is a patch.
I went for both the errhint and some documentation.

Yours,
Laurenz Albe

Attachments:

v1-0001-Add-documentation-and-hint-for-incremental-backup.patch (text/x-patch)
From a46b23afc919bfa0d0aa9bdad962b95a3ae407ec Mon Sep 17 00:00:00 2001
From: Laurenz Albe <laurenz.albe@cybertec.at>
Date: Sat, 20 Jul 2024 00:05:26 +0200
Subject: [PATCH v1] Add documentation and hint for incremental backup on
 standbys

Taking an incremental backup on a streaming replication standby immediately
after a base backup can result in the error

  manifest requires WAL from final timeline n ending at XXX, but this backup starts at YYY

This message looks scary, even though the only problem is that the backup
would be empty and thus it makes no sense to take it anyway.

Add a clarifying errhint and some documentation to mitigate the problem.

Author: Laurenz Albe
Reviewed-by: Robert Haas, David Steele
Discussion: https://postgr.es/m/04f4277e5ed4046773e46837110bed1381a2583f.camel@cybertec.at

Backpatch to v17.
---
 doc/src/sgml/backup.sgml                    | 9 +++++++++
 src/backend/backup/basebackup_incremental.c | 3 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 91da3c26ba..0f84ebc36e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -925,6 +925,15 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 &amp;&amp; cp pg_wal/0
     to manage. For a large database all of which is heavily modified,
     incremental backups won't be much smaller than full backups.
    </para>
+
+   <para>
+    Like a base backup, you can take an incremental backup from a streaming
+    replication standby server.  But since a backup of a standby server cannot
+    initiate a checkpoint, it is possible that an incremental backup taken
+    right after a base backup will fail with an error, since it would have
+    to start with the same checkpoint as the base backup and would therefore
+    be empty.
+   </para>
   </sect2>
 
   <sect2 id="backup-lowlevel-base-backup">
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 2108702397..85c16182a6 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -441,7 +441,8 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 						 errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
 								range->tli,
 								LSN_FORMAT_ARGS(range->end_lsn),
-								LSN_FORMAT_ARGS(backup_state->startpoint))));
+								LSN_FORMAT_ARGS(backup_state->startpoint)),
+						 errhint("Perhaps there was too little activity since the previous backup, and this incremental backup would be empty.")));
 		}
 		else
 		{
-- 
2.45.2

#9 Michael Paquier
michael@paquier.xyz
In reply to: Laurenz Albe (#1)
Re: Incremental backup from a streaming replication standby

On Sat, Jun 29, 2024 at 07:01:04AM +0200, Laurenz Albe wrote:

> The WAL summarizer is running on the standby server, but when I try
> to take an incremental backup, I get an error that I understand to mean
> that WAL summarizing hasn't caught up yet.

Added an open item for this one.
--
Michael

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Laurenz Albe (#8)
Re: Incremental backup from a streaming replication standby fails

On Fri, Jul 19, 2024 at 6:07 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> Here is a patch.
> I went for both the errhint and some documentation.

Hmm, the hint doesn't end up using the word "standby" anywhere. That
seems like it might not be optimal?

> +    Like a base backup, you can take an incremental backup from a streaming
> +    replication standby server.  But since a backup of a standby server cannot
> +    initiate a checkpoint, it is possible that an incremental backup taken
> +    right after a base backup will fail with an error, since it would have
> +    to start with the same checkpoint as the base backup and would therefore
> +    be empty.

Hmm. I feel like I'm about to be super-nitpicky, but this seems
imprecise to me in multiple ways. First, an incremental backup is a
kind of base backup, or at least, it's something you take with
pg_basebackup. Note that later in the paragraph, you use the term
"base backup" to refer to what I have been calling the "prior" or
"previous" backup or "the backup upon which it depends," but that
earlier backup could be either a full or an incremental backup.
Second, the standby need not be using streaming replication, even
though it probably will be in practice. Third, the failing incremental
backup doesn't necessarily have to be attempted immediately after the
previous one - the intervening time could be quite long on an idle
system. Fourth, it makes it sound like the backup being empty is a
reason for it to fail, which is debatable; I think we should try to
cast this more as an implementation restriction.

How about something like this:

An incremental backup is only possible if replay would begin from a
later checkpoint than for the previous backup upon which it depends.
On the primary, this condition is always satisfied, because each
backup triggers a new checkpoint. On a standby, replay begins from the
most recent restartpoint. As a result, an incremental backup may fail
on a standby if there has been very little activity since the previous
backup. Attempting to take an incremental backup on a standby that is
lagging behind the primary (or some other standby), using a prior backup
taken at a later WAL position, may fail for the same reason.

I'm not saying that's perfect, but let me know your thoughts.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#10)
Re: Incremental backup from a streaming replication standby fails

On Mon, 2024-07-22 at 09:37 -0400, Robert Haas wrote:

> How about something like this:
>
> An incremental backup is only possible if replay would begin from a
> later checkpoint than for the previous backup upon which it depends.
> On the primary, this condition is always satisfied, because each
> backup triggers a new checkpoint. On a standby, replay begins from the
> most recent restartpoint. As a result, an incremental backup may fail
> on a standby if there has been very little activity since the previous
> backup. Attempting to take an incremental backup on a standby that is
> lagging behind the primary (or some other standby), using a prior backup
> taken at a later WAL position, may fail for the same reason.

Before I write a v2, a small question for clarification:
I believe I remember that during my experiments, I ran CHECKPOINT
on the standby server between the first backup and the incremental
backup, and that was not enough to make it work. I had to run
a CHECKPOINT on the primary server.

Does CHECKPOINT on the standby not trigger a restartpoint, or do
I simply misremember?

Yours,
Laurenz Albe

#12 Robert Haas
robertmhaas@gmail.com
In reply to: Laurenz Albe (#11)
Re: Incremental backup from a streaming replication standby fails

On Mon, Jul 22, 2024 at 1:05 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> Before I write a v2, a small question for clarification:
> I believe I remember that during my experiments, I ran CHECKPOINT
> on the standby server between the first backup and the incremental
> backup, and that was not enough to make it work. I had to run
> a CHECKPOINT on the primary server.
>
> Does CHECKPOINT on the standby not trigger a restartpoint, or do
> I simply misremember?

It's only possible for the standby to create a restartpoint at a
write-ahead log position where the master created a checkpoint. With
typical configuration, every or nearly every checkpoint on the primary
will trigger a restartpoint on the standby, but for example if you set
max_wal_size bigger and checkpoint_timeout longer on the standby than
on the primary, then you might end up with only some of those
checkpoints ending up becoming restartpoints and others not.

Looking at the code in CreateRestartPoint(), it looks like what
happens if you run CHECKPOINT is that it tries to turn the
most-recently replayed checkpoint into a restartpoint if that wasn't
done already; otherwise it just returns without doing anything. See
the comment that begins with "If the last checkpoint record we've
replayed is already our last".
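
If you want to watch that from SQL, something along these lines should do
(the port is invented; pg_control_checkpoint() reports the checkpoint
recorded in the control file):

  psql -p 5433 -c "SELECT checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();"
  # at most turns the last replayed checkpoint into a restartpoint
  psql -p 5433 -c "CHECKPOINT;"
  psql -p 5433 -c "SELECT checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();"
  # unless the standby has replayed a newer checkpoint record from the
  # primary, the reported LSNs do not advance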

--
Robert Haas
EDB: http://www.enterprisedb.com

#13 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#10)
1 attachment(s)
Re: Incremental backup from a streaming replication standby fails

On Mon, 2024-07-22 at 09:37 -0400, Robert Haas wrote:

> On Fri, Jul 19, 2024 at 6:07 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
>> Here is a patch.
>> I went for both the errhint and some documentation.
>
> Hmm, the hint doesn't end up using the word "standby" anywhere. That
> seems like it might not be optimal?

I guessed that the user was aware that she is taking the backup on
a standby server...

Anyway, I reworded the hint to

This can happen for incremental backups on a standby if there was
little activity since the previous backup.

> Hmm. I feel like I'm about to be super-nitpicky, but this seems
> imprecise to me in multiple ways.

On the contrary, your comments and explanations are valuable.

> How about something like this:
>
> An incremental backup is only possible if replay would begin from a
> later checkpoint than for the previous backup upon which it depends.
> On the primary, this condition is always satisfied, because each
> backup triggers a new checkpoint. On a standby, replay begins from the
> most recent restartpoint. As a result, an incremental backup may fail
> on a standby if there has been very little activity since the previous
> backup. Attempting to take an incremental backup on a standby that is
> lagging behind the primary (or some other standby), using a prior backup
> taken at a later WAL position, may fail for the same reason.
>
> I'm not saying that's perfect, but let me know your thoughts.

I tinkered with this some more, and the attached patch has:

An incremental backup is only possible if replay would begin from a later
checkpoint than the checkpoint that started the previous backup upon which
it depends. If you take the incremental backup on the primary, this
condition is always satisfied, because each backup triggers a new
checkpoint. On a standby, replay begins from the most recent restartpoint.
Therefore, an incremental backup of a standby server can fail if there has
been very little activity since the previous backup, since no new
restartpoint might have been created.

Yours,
Laurenz Albe

Attachments:

v2-0001-Add-documentation-and-hint-for-incremental-backup.patch (text/x-patch)
From 407c6b4ab695956bb9f207efddf477d130c820ba Mon Sep 17 00:00:00 2001
From: Laurenz Albe <laurenz.albe@cybertec.at>
Date: Wed, 24 Jul 2024 12:42:24 +0200
Subject: [PATCH v2] Add documentation and hint for incremental backup on
 standbys

Taking an incremental backup on a streaming replication standby immediately
after the previous backup can result in the error

  manifest requires WAL from final timeline n ending at XXX, but this backup starts at YYY

This message looks scary, even though the only problem is that the backup
would be empty and thus it makes no sense to take it anyway.

Add a clarifying errhint and some documentation to mitigate the problem.

Author: Laurenz Albe
Reviewed-by: Robert Haas, David Steele
Discussion: https://postgr.es/m/04f4277e5ed4046773e46837110bed1381a2583f.camel@cybertec.at

Backpatch to v17.
---
 doc/src/sgml/backup.sgml                    | 11 +++++++++++
 src/backend/backup/basebackup_incremental.c |  3 ++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 91da3c26ba..13b60c39bb 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -925,6 +925,17 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 &amp;&amp; cp pg_wal/0
     to manage. For a large database all of which is heavily modified,
     incremental backups won't be much smaller than full backups.
    </para>
+
+   <para>
+    An incremental backup is only possible if replay would begin from a later
+    checkpoint than the checkpoint that started the previous backup upon which
+    it depends.  If you take the incremental backup on the primary, this
+    condition is always satisfied, because each backup triggers a new
+    checkpoint.  On a standby, replay begins from the most recent restartpoint.
+    Therefore, an incremental backup of a standby server can fail if there has
+    been very little activity since the previous backup, since no new
+    restartpoint might have been created.
+   </para>
   </sect2>
 
   <sect2 id="backup-lowlevel-base-backup">
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 2108702397..a023e62440 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -441,7 +441,8 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 						 errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
 								range->tli,
 								LSN_FORMAT_ARGS(range->end_lsn),
-								LSN_FORMAT_ARGS(backup_state->startpoint))));
+								LSN_FORMAT_ARGS(backup_state->startpoint)),
+						 errhint("This can happen for incremental backups on a standby if there was little activity since the previous backup.")));
 		}
 		else
 		{
-- 
2.45.2

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Laurenz Albe (#13)
Re: Incremental backup from a streaming replication standby fails

On Wed, Jul 24, 2024 at 6:46 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> An incremental backup is only possible if replay would begin from a later
> checkpoint than the checkpoint that started the previous backup upon which
> it depends.

My concern here is that the previous backup might have been taken on a
standby, and therefore it did not start with a checkpoint. For a
standby backup, replay will begin from a checkpoint record, but that
record may be quite a bit earlier in the WAL. For instance, imagine
checkpoint_timeout is set to 30 minutes on the standby. When the
backup is taken, the most recent restartpoint could be up to 30
minutes ago -- and it is the checkpoint record for that restartpoint
from which replay will begin. I think that in my phrasing, it's always
about the checkpoint from which replay would begin (which is always
well-defined) not the checkpoint that started the backup (which is
only logical on the primary).

> If you take the incremental backup on the primary, this
> condition is always satisfied, because each backup triggers a new
> checkpoint. On a standby, replay begins from the most recent restartpoint.
> Therefore, an incremental backup of a standby server can fail if there has
> been very little activity since the previous backup, since no new
> restartpoint might have been created.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#14)
1 attachment(s)
Re: Incremental backup from a streaming replication standby fails

On Wed, 2024-07-24 at 15:27 -0400, Robert Haas wrote:

> On Wed, Jul 24, 2024 at 6:46 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
>>    An incremental backup is only possible if replay would begin from a later
>>    checkpoint than the checkpoint that started the previous backup upon which
>>    it depends.
>
> My concern here is that the previous backup might have been taken on a
> standby, and therefore it did not start with a checkpoint. For a
> standby backup, replay will begin from a checkpoint record, but that
> record may be quite a bit earlier in the WAL. For instance, imagine
> checkpoint_timeout is set to 30 minutes on the standby. When the
> backup is taken, the most recent restartpoint could be up to 30
> minutes ago -- and it is the checkpoint record for that restartpoint
> from which replay will begin. I think that in my phrasing, it's always
> about the checkpoint from which replay would begin (which is always
> well-defined), not the checkpoint that started the backup (which is
> only logical on the primary).

I see.

The attached patch uses your wording for the first sentence.

I left out the last sentence from your suggestion, because it sounded
like it is likely to confuse the reader. I think you just wanted to
say that there are other possible causes for an incremental backup to
fail. I want to keep the text as simple as possible and focus on the case
that I hit, because I expect that a lot of people who experiment with
incremental backup or run tests could run into the same problem.

I don't think it will be a frequent occurrence during normal operation.

Yours,
Laurenz Albe

Attachments:

v3-0001-Document-an-error-with-incremental-backup-on-stan.patch (text/x-patch)
From 1a570da7e93b81c7be488fcc4e0ff3a283320923 Mon Sep 17 00:00:00 2001
From: Laurenz Albe <laurenz.albe@cybertec.at>
Date: Thu, 25 Jul 2024 14:44:27 +0200
Subject: [PATCH v3] Document an error with incremental backup on standbys

Taking an incremental backup on a streaming replication standby immediately
after the previous backup can result in the error

  manifest requires WAL from final timeline n ending at XXX, but this backup starts at YYY

This message looks scary, even though the only problem is that the backup
would be empty and thus it makes no sense to take it anyway.

Add a clarifying errhint and some documentation to mitigate the problem.

Author: Laurenz Albe
Reviewed-by: Robert Haas, David Steele
Discussion: https://postgr.es/m/04f4277e5ed4046773e46837110bed1381a2583f.camel@cybertec.at

Backpatch to v17.
---
 doc/src/sgml/backup.sgml                    | 11 +++++++++++
 src/backend/backup/basebackup_incremental.c |  3 ++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 91da3c26ba..e4e4c56cf1 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -925,6 +925,17 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 &amp;&amp; cp pg_wal/0
     to manage. For a large database all of which is heavily modified,
     incremental backups won't be much smaller than full backups.
    </para>
+
+   <para>
+    An incremental backup is only possible if replay would begin from a later
+    checkpoint than for the previous backup upon which it depends.  If you
+    take the incremental backup on the primary, this condition is always
+    satisfied, because each backup triggers a new checkpoint.  On a standby,
+    replay begins from the most recent restartpoint.  Therefore, an
+    incremental backup of a standby server can fail if there has been very
+    little activity since the previous backup, since no new restartpoint might
+    have been created.
+   </para>
   </sect2>
 
   <sect2 id="backup-lowlevel-base-backup">
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 2108702397..a023e62440 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -441,7 +441,8 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
 						 errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
 								range->tli,
 								LSN_FORMAT_ARGS(range->end_lsn),
-								LSN_FORMAT_ARGS(backup_state->startpoint))));
+								LSN_FORMAT_ARGS(backup_state->startpoint)),
+						 errhint("This can happen for incremental backups on a standby if there was little activity since the previous backup.")));
 		}
 		else
 		{
-- 
2.45.2

#16 Robert Haas
robertmhaas@gmail.com
In reply to: Laurenz Albe (#15)
Re: Incremental backup from a streaming replication standby fails

On Thu, Jul 25, 2024 at 8:51 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

> The attached patch uses your wording for the first sentence.
>
> I left out the last sentence from your suggestion, because it sounded
> like it is likely to confuse the reader. I think you just wanted to
> say that there are other possible causes for an incremental backup to
> fail. I want to keep the text as simple as possible and focus on the case
> that I hit, because I expect that a lot of people who experiment with
> incremental backup or run tests could run into the same problem.
>
> I don't think it will be a frequent occurrence during normal operation.

Committed this version to master and v17.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#16)
Re: Incremental backup from a streaming replication standby fails

On Thu, 2024-07-25 at 16:12 -0400, Robert Haas wrote:

> On Thu, Jul 25, 2024 at 8:51 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
>> The attached patch uses your wording for the first sentence.
>>
>> I left out the last sentence from your suggestion, because it sounded
>> like it is likely to confuse the reader. I think you just wanted to
>> say that there are other possible causes for an incremental backup to
>> fail. I want to keep the text as simple as possible and focus on the case
>> that I hit, because I expect that a lot of people who experiment with
>> incremental backup or run tests could run into the same problem.
>>
>> I don't think it will be a frequent occurrence during normal operation.
>
> Committed this version to master and v17.

Thanks for taking care of this.

Yours,
Laurenz Albe

#18 Robert Haas
robertmhaas@gmail.com
In reply to: Laurenz Albe (#17)
Re: Incremental backup from a streaming replication standby fails

On Fri, Jul 26, 2024 at 1:09 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

>> Committed this version to master and v17.
>
> Thanks for taking care of this.

Sure thing!

I knew it was going to confuse someone ... I just wasn't sure what to
do about it. Now we've at least done something, which is hopefully
superior to nothing.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Robert Haas (#18)
Re: Incremental backup from a streaming replication standby fails

On Fri, Jul 26, 2024 at 4:11 PM Robert Haas <robertmhaas@gmail.com> wrote:

> On Fri, Jul 26, 2024 at 1:09 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
>>> Committed this version to master and v17.
>>
>> Thanks for taking care of this.
>
> Sure thing!
>
> I knew it was going to confuse someone ... I just wasn't sure what to
> do about it. Now we've at least done something, which is hopefully
> superior to nothing.

Great! Should we mark the corresponding v17 open item as closed?

------
Regards,
Alexander Korotkov
Supabase

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Alexander Korotkov (#19)
Re: Incremental backup from a streaming replication standby fails

On Fri, Jul 26, 2024 at 4:13 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

> Great! Should we mark the corresponding v17 open item as closed?

Done.

--
Robert Haas
EDB: http://www.enterprisedb.com