trying again to get incremental backup
A few years ago, I sketched out a design for incremental backup, but
no patch for incremental backup ever got committed. Instead, the whole
thing evolved into a project to add backup manifests, which are nice,
but not as nice as incremental backup would be. So I've decided to
have another go at incremental backup itself. Attached are some WIP
patches. Let me summarize the design and some open questions and
problems with it that I've discovered. I welcome problem reports and
test results from others, as well.
The basic design of this patch set is pretty simple, and there are
three main parts. First, there's a new background process called the
walsummarizer which runs all the time. It reads the WAL and generates
WAL summary files. WAL summary files are extremely small compared to
the original WAL and contain only the minimal amount of information
that we need in order to determine which parts of the database need to
be backed up. They tell us about files getting created, destroyed, or
truncated, and they tell us about modified blocks. Naturally, we don't
find out about blocks that were modified without any write-ahead log
record, e.g. hint bit updates, but those are of necessity not critical
for correctness, so it's OK. Second, pg_basebackup has a mode where it
can take an incremental backup. You must supply a backup manifest from
a previous backup - either a full backup or, as discussed below, an
earlier incremental. We read the WAL summary files that have been
generated between the start of the previous backup and the start of
this one, and use that to figure out which relation files have changed
and how much. Non-relation files are sent normally, just as they would
be in a full backup. Relation files can either be sent in full or be
replaced by an incremental file, which contains a subset of the blocks
in the file plus a bit of information to handle truncations properly.
Third, there's now a pg_combinebackup utility which takes a full
backup and one or more incremental backups, performs a bunch of sanity
checks, and if everything works out, writes out a new, synthetic full
backup, aka a data directory.
Simple usage example:
pg_basebackup -cfast -Dx
pg_basebackup -cfast -Dy --incremental x/backup_manifest
pg_combinebackup x y -o z
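To make the "incremental file" format concrete, here is a rough sketch
of the layout that patch 7's sendFile() emits. The struct is purely
illustrative - the patch writes the fields out individually via
push_to_sink() - and the exact layout is a prototype detail, not a
stable format:

typedef struct
{
    unsigned    magic;                  /* INCREMENTAL_MAGIC */
    unsigned    num_incremental_blocks; /* how many block images follow */
    unsigned    truncation_block_length;    /* for handling truncations */
    BlockNumber blocks[FLEXIBLE_ARRAY_MEMBER];  /* block numbers, relative
                                                 * to the segment start */
} IncrementalFileHeader;    /* hypothetical name, for illustration */
/* ...followed by num_incremental_blocks block images of BLCKSZ bytes each. */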
The part of all this with which I'm least happy is the WAL
summarization engine. Actually, the core process of summarizing the
WAL seems totally fine, and the file format is very compact thanks to
some nice ideas from my colleague Dilip Kumar. Someone may of course
wish to argue that the information should be represented in some other
file format instead, and that can be done if it's really needed, but I
don't see a lot of value in tinkering with it, either. Where I do
think there's a problem is deciding how much WAL ought to be
summarized in one WAL summary file. Summary files cover a certain
range of WAL records - they have names like
$TLI${START_LSN}${END_LSN}.summary (so, assuming the TLI is printed as
eight hex digits and each LSN as sixteen, something like
0000000100000000010000280000000001FFD8B8.summary). It's not too hard
to figure out
where a file should start - generally, it's wherever the previous file
ended, possibly on a new timeline, but figuring out where the summary
should end is trickier. You always have the option to either read
another WAL record and fold it into the current summary, or end the
current summary where you are, write out the file, and begin a new
one. So how do you decide what to do?
I originally had the idea of summarizing a certain number of MB of WAL
per WAL summary file, and so I added a GUC wal_summarize_mb for that
purpose. But then I realized that actually, you really want WAL
summary file boundaries to line up with possible redo points, because
when you do an incremental backup, you need a summary that stretches
from the redo point of the checkpoint written at the start of the
prior backup to the redo point of the checkpoint written at the start
of the current backup. The block modifications that happen in that
range of WAL records are the ones that need to be included in the
incremental. Unfortunately, there's no indication in the WAL itself
that you've reached a redo point, but I wrote code that tries to
notice when we've reached the redo point stored in shared memory and
stops the summary there. But I eventually realized that's not good
enough either, because if summarization zooms past the redo point
before noticing the updated redo point in shared memory, the backup
would sit around waiting for the next summary file to be generated so
that it had enough summaries to proceed, while the summarizer, in no
hurry to finish up the current file, just sat there waiting for more
WAL to be generated. Eventually the incremental backup would simply
time out. I tried to fix that by making it so that
if somebody's waiting for a summary file to be generated, they can let
the summarizer know about that and it can write a summary file ending
at the LSN up to which it has read and then begin a new file from
there. That seems to fix the hangs, but now I've got three
overlapping, interconnected systems for deciding where to end the
current summary file, and maybe that's OK, but I have a feeling there
might be a better way.
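For clarity, here is roughly what those three interlocking cutoff
rules amount to; this is pseudocode, with function and variable names
invented for illustration rather than taken from the patch:

/* After reading each WAL record, decide whether to end the summary. */
if (bytes_summarized >= wal_summarize_mb * 1024 * 1024)
    end_current_summary();  /* size-based cutoff from the GUC */
else if (read_lsn >= redo_point_from_shared_memory())
    end_current_summary();  /* try to line up with a redo point */
else if (backup_is_waiting_for_a_summary())
    end_current_summary();  /* cut off at read_lsn so the backup can proceed */
else
    keep_reading_wal();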
Dilip had an interesting potential solution to this problem, which was
to always emit a special WAL record at the redo pointer. That is, when
we fix the redo pointer for the checkpoint record we're about to
write, also insert a WAL record there. That way, when the summarizer
reaches that sentinel record, it knows it should stop the summary just
before. I'm not sure whether this approach is viable, especially from
a performance and concurrency perspective, and I'm not sure whether
people here would like it, but it does seem like it would make things
a whole lot simpler for this patch set.
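To spell that idea out, a hypothetical sketch - this is not code from
the patch set, and the record type name is invented for illustration:

/* In CreateCheckPoint(), at the point where the redo location is
 * fixed, also insert a sentinel record and aim the redo pointer at
 * it, so that replay starts with this record: */
int         dummy = 0;      /* placeholder payload */

XLogBeginInsert();
XLogRegisterData((char *) &dummy, sizeof(dummy));
(void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
checkPoint.redo = ProcLastRecPtr;   /* start of the record just inserted */

When the summarizer then reads an XLOG_CHECKPOINT_REDO record, it can
end the current summary file immediately before that record's LSN,
without consulting shared memory at all.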
Another thing that I'm not too sure about is: what happens if we find
a relation file on disk that doesn't appear in the backup_manifest for
the previous backup and isn't mentioned in the WAL summaries either?
The fact that said file isn't mentioned in the WAL summaries seems
like it ought to mean that the file is unchanged; but an unchanged
file should also have been present in the previous backup, so perhaps
this ought to be an error condition. But I'm not too sure
about that treatment. I have a feeling that there might be some subtle
problems here, especially if databases or tablespaces get dropped and
then new ones get created that happen to have the same OIDs. And what
about wal_level=minimal? I'm not at a point where I can say I've gone
through and plugged up these kinds of corner-case holes tightly yet,
and I'm worried that there may be still other scenarios of which I
haven't even thought. Happy to hear your ideas about what the problem
cases are or how any of the problems should be solved.
A related design question is whether we should really be sending the
whole backup manifest to the server at all. If it turns out that we
don't really need anything except for the LSN of the previous backup,
we could send that one piece of information instead of everything. On
the other hand, if we need the list of files from the previous backup,
then sending the whole manifest makes sense.
Another big and rather obvious problem with the patch set is that it
doesn't currently have any automated test cases, or any real
documentation. Those are obviously things that need a lot of work
before there could be any thought of committing this. And probably a
lot of bugs will be found along the way, too.
A few less-serious problems with the patch:
- We don't have an incremental JSON parser, so if you have a
backup_manifest > 1GB, pg_basebackup --incremental is going to fail.
That's also true of the existing code in pg_verifybackup, and for the
same reason. I talked to Andrew Dunstan at one point about adapting
our JSON parser to support incremental parsing, and he had a patch for
that, but I think he found some problems with it and I'm not sure what
the current status is.
- The patch does support differential backup, aka an incremental atop
another incremental. There's no particular limit to how long a chain
of backups can be. However, pg_combinebackup currently requires that
the first backup is a full backup and all the later ones are
incremental backups. So if you have a full backup a and an incremental
backup b and a differential backup c, you can combine a, b, and c to get
a full backup equivalent to one you would have gotten if you had taken
a full backup at the time you took c. However, you can't combine b and
c with each other without also combining them with a, and that might
be desirable in some situations (see the example just after this
list). You might want to collapse a bunch of
older differential backups into a single one that covers the whole
time range of all of them. I think that the file format can support
that, but the tool is currently too dumb.
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too. I don't think there would be
a whole lot of point trying to operate on compressed tar files because
you need random access and that seems hard on a compressed file, but
on uncompressed files it seems at least theoretically doable. I'm not
sure whether anyone would care that much about this, though, even
though it does sound pretty cool.
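Regarding the differential-backup limitation above, to connect it to
the earlier usage example: given a chain like the one described there,
the first command below works, while the second is currently rejected
because b is not a full backup:

pg_combinebackup a b c -o restored
pg_combinebackup b c -o collapsed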
In the attached patch series, patches 1 through 6 are various
refactoring patches, patch 7 is the main event, and patch 8 adds a
useful inspection tool.
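(For anyone who wants to poke at the WAL summaries while testing: per
its --help output, the patch 8 tool takes summary files as arguments,
e.g. pg_walsummary -i pg_wal/summaries/<summary file>, where -i lists
block numbers individually rather than as ranges.)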
Thanks,
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v1-0006-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From f66a97fcca07bb56e6a8e644b6bb4d6bee24ab94 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:28 -0400
Subject: [PATCH v1 6/8] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..e4cd26762b 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -65,6 +65,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 9efc80ac02..cc6671edca 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -17,6 +17,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v1-0008-Add-new-pg_walsummary-tool.patch
From 69f8a513cc6b8fcddbc4be038fd09e9422867e9d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v1 8/8] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {0};	/* initialize option flags to false */
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
v1-0007-Prototype-patch-for-incremental-and-differential-.patch
From 8485fc23d54cc1e359a71801845ea255584905d5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v1 7/8] Prototype patch for incremental and differential
backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
It writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This needs a lot of work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions.
---
doc/src/sgml/monitoring.sgml | 17 +
src/backend/access/transam/xlog.c | 97 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 10 +-
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 340 +++-
src/backend/backup/basebackup_incremental.c | 867 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
src/backend/utils/activity/wait_event.c | 15 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 108 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 46 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 29 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1268 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/include/utils/wait_event.h | 7 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 24 +
66 files changed, 8454 insertions(+), 70 deletions(-)
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5cfdc70c03..97809a73f6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1161,6 +1161,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry><literal>WalSenderMain</literal></entry>
<entry>Waiting in main loop of WAL sender process.</entry>
</row>
+ <row>
+ <entry><literal>WalSummarizeWAL</literal></entry>
+ <entry>Waiting in WAL summarizer process for new WAL to be written.</entry>
+ </row>
<row>
<entry><literal>WalWriterMain</literal></entry>
<entry>Waiting in main loop of WAL writer process.</entry>
@@ -1591,6 +1595,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for a read from a timeline history file during a walsender
timeline command.</entry>
</row>
+ <row>
+ <entry><literal>WalSummaryRead</literal></entry>
+ <entry>Waiting to read from a WAL summary file.</entry>
+ </row>
+ <row>
+ <entry><literal>WalSummaryWrite</literal></entry>
+ <entry>Waiting to write to a WAL summary file.</entry>
+ </row>
<row>
<entry><literal>WALSync</literal></entry>
<entry>Waiting for a WAL file to reach durable storage.</entry>
@@ -2357,6 +2369,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to acquire an exclusive lock to truncate off any
empty pages at the end of a table vacuumed.</entry>
</row>
+ <row>
+ <entry><literal>WalSummarizerError</literal></entry>
+ <entry>Waiting to retry after recovering from an error in the
+ WAL summarizer process.</entry>
+ </row>
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 664d4ba598..6c66d5118b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3477,6 +3478,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3756,8 +3794,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3800,6 +3838,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5123,9 +5181,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6802,6 +6860,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7476,6 +7545,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
@@ -8462,8 +8545,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
/*
* Try to parse the directory name as an unsigned integer.
*
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
* garbage. If we come across a name that doesn't meet those
* criteria, skip it.
*/
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 23461c9d2c..3ad6b679d5 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 4ff4430006..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
@@ -1340,7 +1346,7 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
- char *endp;
+ char *endp;
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1363,7 +1369,7 @@ read_tablespace_map(List **tablespaces)
ti = palloc0(sizeof(tablespaceinfo));
errno = 0;
ti->oid = strtoul(str, &endp, 10);
- if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
ereport(FATAL,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 64ab54fe06..8aea2a4a76 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -75,22 +78,37 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink *sink;
+ size_t bytes_sent;
+} FileChunkContext;
+
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +120,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +239,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +290,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +313,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +354,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +372,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +634,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +710,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +789,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +990,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1014,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1059,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1134,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1168,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1188,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1148,7 +1196,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isRelationDir = false; /* Does directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
/*
@@ -1182,14 +1231,17 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relfilenumber = InvalidRelFileNumber;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
ForkNumber relForkNum = InvalidForkNumber;
unsigned segno = 0;
bool isRelationFile = false;
@@ -1256,9 +1308,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
char initForkFile[MAXPGPATH];
/*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
*/
snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relfilenumber);
@@ -1332,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1405,27 +1458,79 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
- if (sent || sizeonly)
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
@@ -1444,6 +1549,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1451,7 +1562,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1460,6 +1572,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1492,22 +1605,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1690,6 +1892,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
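
To make push_to_sink's contract concrete, here is a minimal standalone
sketch of the same copy-and-flush loop. It is just an illustration, not
part of the patch: the tiny buffer, the printf standing in for
bbsink_archive_contents, and the demo strings are invented, and the
checksum update is omitted.

#include <stdio.h>
#include <string.h>

#define BUF_LEN 8				/* tiny, so flushes are easy to see */

static char buffer[BUF_LEN];
static size_t used;

/* Stand-in for bbsink_archive_contents: report what would be archived. */
static void
flush_buffer(size_t n)
{
	printf("flush %zu bytes: %.*s\n", n, (int) n, buffer);
}

static void
push(const void *data, size_t length)
{
	while (length > 0)
	{
		size_t		bytes_to_copy;

		/* <, not <=, so an exact fill triggers a flush now */
		if (length < BUF_LEN - used)
		{
			memcpy(buffer + used, data, length);
			used += length;
			return;
		}
		bytes_to_copy = BUF_LEN - used;
		memcpy(buffer + used, data, bytes_to_copy);
		data = (const char *) data + bytes_to_copy;
		length -= bytes_to_copy;
		flush_buffer(BUF_LEN);
		used = 0;
	}
}

int
main(void)
{
	push("hello, ", 7);
	push("wal summaries", 13);
	if (used > 0)
		flush_buffer(used);		/* the caller-performed final flush */
	return 0;
}

Note the final flush in the caller: sendFile has to do the same thing
after pushing the incremental file header.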
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..b70eeb0282
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,867 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+	 * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+	 * this writing, so it seems to make sense for our estimate to be
+	 * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+	 * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
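
To illustrate with made-up OIDs: for database 16384 and relfilenumber
16385, main fork, this turns base/16384/16385 into
base/16384/INCREMENTAL.16385 for segment 0, while segment 3 becomes
base/16384/INCREMENTAL.16385.3.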
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+	 * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
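
To put the 90% threshold in concrete terms: with the default 8kB block
size, a full 1GB segment contains 131072 blocks, so we return
BACK_UP_FILE_INCREMENTALLY only when 117964 or fewer of those blocks
(90% of 131072, rounded down) need to be sent; past that point the
whole segment goes out as-is.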
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+	 * Three four-byte quantities (magic number, block count, truncation block
+	 * length) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
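
As a worked example, again with 8kB blocks: an incremental file
carrying 10 modified blocks occupies 3 * 4 + 10 * 4 + 10 * 8192 =
81972 bytes. In other words, the magic number, block count, truncation
block length, and block-number array add only 52 bytes of overhead on
top of the 81920 bytes of block contents.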
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+	const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..1cace3b2fe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
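
As a standalone illustration of the name format - the TLI followed by
the start and end LSNs, each split into two 8-character hex fields -
here is the same decoding applied to an invented file name:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* Hypothetical summary covering TLI 1, 0/1000028 .. 0/2000100. */
	const char *name = "0000000100000000010000280000000002000100.summary";
	unsigned int tmp[5];
	uint64_t	start_lsn;
	uint64_t	end_lsn;

	if (sscanf(name, "%08X%08X%08X%08X%08X",
			   &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]) != 5)
		return 1;
	start_lsn = ((uint64_t) tmp[1]) << 32 | tmp[2];
	end_lsn = ((uint64_t) tmp[3]) << 32 | tmp[4];
	printf("tli %u, start %X/%X, end %X/%X\n", tmp[0],
		   (unsigned) (start_lsn >> 32), (unsigned) start_lsn,
		   (unsigned) (end_lsn >> 32), (unsigned) end_lsn);
	return 0;
}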
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
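
Here is the same sweep as a minimal standalone program, stripped of
the List machinery. The LSNs are invented so that the second summary
leaves a gap; this is just an illustration, not part of the patch:

#include <stdio.h>
#include <stdint.h>

typedef struct
{
	uint64_t	start_lsn;
	uint64_t	end_lsn;
} range;

/* Same algorithm, over an array already sorted by start_lsn. */
static int
ranges_are_complete(range *r, int n, uint64_t start, uint64_t end,
					uint64_t *missing)
{
	uint64_t	current = start;

	for (int i = 0; i < n; ++i)
	{
		if (r[i].start_lsn > current)
			break;				/* found a gap */
		if (r[i].end_lsn > current)
		{
			current = r[i].end_lsn;
			if (current >= end)
				return 1;		/* coverage proved */
		}
	}
	*missing = current;
	return 0;
}

int
main(void)
{
	range		r[] = {{0x1000, 0x2000}, {0x3000, 0x4000}};
	uint64_t	missing;

	if (!ranges_are_complete(r, 2, 0x1000, 0x4000, &missing))
		printf("first unsummarized LSN: %llx\n",
			   (unsigned long long) missing);
	return 0;
}

This prints 2000, i.e. the end of the first summary, which is exactly
what the errdetail above would report.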
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
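
For interactive use, the intent is that you can do something like this
(the LSN arguments here are invented, and the output column names
depend on the pg_proc entries defined elsewhere in the patch set):

SELECT * FROM pg_available_wal_summaries();
SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/2000100');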
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c49393fc5..c85ac19f4a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -114,6 +114,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -251,6 +252,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -442,6 +444,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1845,6 +1849,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2736,6 +2743,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3089,6 +3098,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3207,6 +3217,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3602,6 +3626,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3752,6 +3782,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3778,6 +3810,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3875,6 +3908,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4096,6 +4130,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5402,6 +5438,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5538,6 +5578,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..926b6c6ae4
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+	 * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+	 * the next summary file will start. Normally, these are the TLI and LSN
+	 * at which the last file ended; in that case, lsn_is_exact is
+ * true. If, however, the LSN is just an approximation, then lsn_is_exact
+ * is false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of
+ * a record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a
+ * new WAL summary file should be written as soon as possible, without
+ * trying to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
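+
+/*
+ * So, for example, consecutive sleeps after empty polls last 200ms, 400ms,
+ * 800ms, and so on; after nine doublings the one-minute cap is reached.
+ */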
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered
+ * by a single summary file. If we read a WAL record that ends after
+ * the cutoff LSN computed here, we'll stop the summary. In most cases,
+ * it will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is
+ * returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
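+ /* (E.g., with 16MB segments, oldest segment 5 maps to LSN 0/5000000.) */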
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the information to the caller, as requested. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
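
As an aside, here is roughly how I expect a caller (e.g. the incremental
backup code) to use WaitForWalSummarization(); this is just an illustrative
sketch, not code from the patch, and backup_start_lsn is a made-up name:

/* Hypothetical caller, for illustration only: wait up to 60 seconds. */
XLogRecPtr	summarized;

summarized = WaitForWalSummarization(backup_start_lsn, 60000);
if (summarized < backup_start_lsn)
	ereport(ERROR,
			(errmsg("timed out waiting for WAL summarization"),
			 errdetail("Summarization is complete only through %X/%X, but %X/%X is needed.",
					   LSN_FORMAT_ARGS(summarized),
					   LSN_FORMAT_ARGS(backup_start_lsn))));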
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a
+ * WAL summary file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and
+ * if so, stop. This has the same purpose as the earlier check for
+ * the same condition above, but there we've just read a record and
+ * might decide against including it in the current summary file,
+ * whereas here we've already included it and might decide against
+ * reading the next one. Note that we may have just refreshed our
+ * notion of the redo pointer, so it's smart to check here before we
+ * do any more work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file; the xlogreader was already freed above. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
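
Incidentally, since the summary file name just packs the TLI and both LSNs
into five 8-hex-digit fields, taking one apart again is a single sscanf.
A sketch, assuming the usual backend headers (the function name here is
invented; the patch's walsummary.c presumably has its own equivalent):

/* Illustrative only: decode TTTTTTTTSSSSSSSSSSSSSSSSEEEEEEEEEEEEEEEE.summary */
static bool
parse_wal_summary_filename(const char *name, TimeLineID *tli,
						   XLogRecPtr *start_lsn, XLogRecPtr *end_lsn)
{
	uint32		tmp_tli,
				start_hi,
				start_lo,
				end_hi,
				end_lo;

	/* Note: this does not verify the ".summary" suffix; a sketch only. */
	if (sscanf(name, "%08X%08X%08X%08X%08X",
			   &tmp_tli, &start_hi, &start_lo, &end_hi, &end_lo) != 5)
		return false;

	*tli = (TimeLineID) tmp_tli;
	*start_lsn = ((uint64) start_hi << 32) | start_lo;
	*end_lsn = ((uint64) end_hi << 32) | end_lo;
	return true;
}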
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
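+ *
+ * For example, truncating a relation to 100 blocks sets the limit block
+ * for the affected forks to 100, after which previously-recorded
+ * modifications to blocks 100 and above cease to matter.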
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * at least one full page is available; read just that page, and have
+ * the caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by one quantum for each page read, which is a fairly arbitrary
+ * way of trying to be reactive without overreacting; reading just a
+ * single page leaves the sleep time unchanged.
+ */
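+
+ /*
+ * For example, if sleep_quanta is currently 10 and we have read 4 pages
+ * since the last sleep, the next sleep lasts (10 - 4) * 200ms = 1.2s.
+ */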
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove old WAL summary files, if appropriate. To avoid doing this work
+ * too often, we only attempt it once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index cb467ca46f..fa2bf4ee0a 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d3a136b6f5..39eb293e5f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere during COPY. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
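
For reference, the client-side conversation is plain COPY-in; in outline
(a sketch only, error handling mostly omitted; conn is assumed to be an
open replication connection and fd an open manifest file):

char		buf[8192];
ssize_t		nbytes;
PGresult   *res;

if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
	pg_fatal("could not send UPLOAD_MANIFEST: %s", PQerrorMessage(conn));
res = PQgetResult(conn);				/* expect PGRES_COPY_IN */
while ((nbytes = read(fd, buf, sizeof buf)) > 0)
	PQputCopyData(conn, buf, nbytes);	/* CopyData ('d') messages */
PQputCopyEnd(conn, NULL);				/* CopyDone ('c') */
res = PQgetResult(conn);				/* expect PGRES_COMMAND_OK */

The real client code is in the pg_basebackup changes later in this patch.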
/*
* Handle START_REPLICATION command.
*
@@ -1802,7 +1954,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1864,6 +2016,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 8f1ded7338..17608b3b8e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -135,6 +136,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, SnapMgrShmemSize());
@@ -283,6 +285,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 6c7cf6c295..49f76e82fb 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock 44
# 45 was XactTruncationLock until removal of BackendRandomLock
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
+WALSummarizerLock 48
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 7940d64639..36b88f55b1 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -245,6 +245,9 @@ pgstat_get_wait_activity(WaitEventActivity w)
case WAIT_EVENT_WAL_SENDER_MAIN:
event_name = "WalSenderMain";
break;
+ case WAIT_EVENT_WAL_SUMMARIZER_WAL:
+ event_name = "WalSummarizerWal";
+ break;
case WAIT_EVENT_WAL_WRITER_MAIN:
event_name = "WalWriterMain";
break;
@@ -466,6 +469,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
event_name = "WalReceiverWaitStart";
break;
+ case WAIT_EVENT_WAL_SUMMARY_READY:
+ event_name = "WalSummaryReady";
+ break;
case WAIT_EVENT_XACT_GROUP_UPDATE:
event_name = "XactGroupUpdate";
break;
@@ -515,6 +521,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
case WAIT_EVENT_VACUUM_TRUNCATE:
event_name = "VacuumTruncate";
break;
+ case WAIT_EVENT_WAL_SUMMARIZER_ERROR:
+ event_name = "WalSummarizerError";
+ break;
/* no default case, so that compiler will warn */
}
@@ -747,6 +756,12 @@ pgstat_get_wait_io(WaitEventIO w)
case WAIT_EVENT_WAL_READ:
event_name = "WALRead";
break;
+ case WAIT_EVENT_WAL_SUMMARY_READ:
+ event_name = "WalSummaryRead";
+ break;
+ case WAIT_EVENT_WAL_SUMMARY_WRITE:
+ event_name = "WalSummaryWrite";
+ break;
case WAIT_EVENT_WAL_SYNC:
event_name = "WALSync";
break;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a604432126..eb5736ad85 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 71e27f8eb0..c4918db4f9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -61,6 +61,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -694,6 +695,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3167,6 +3170,32 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4c0269fa3..d028d02861 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
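
Since both of these GUCs are PGC_SIGHUP, summarization can be reconfigured
without a restart. For example (illustrative only; adjust connection options
and -D to taste), this shuts the summarizer down, per
HandleWalSummarizerInterrupts above:

psql -c "ALTER SYSTEM SET wal_summarize_mb = 0"
pg_ctl reload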
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 09a5c98cc0..220f51a32d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -230,6 +230,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1dc8efe0cb..3ffe15ac74 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with v16.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000
+
/*
* Different ways to include WAL
*/
@@ -216,7 +221,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -684,6 +690,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * On newer server versions, likewise create pg_wal/summaries.
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1724,7 +1747,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1790,7 +1815,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* We've sent the whole file, so close it. */
+ if (close(fd) != 0)
+ pg_fatal("could not close file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1897,6 +1989,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2252,6 +2345,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2288,6 +2382,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2312,7 +2407,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2347,6 +2442,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2756,7 +2854,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 793d64863c..22a10477ec 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cb20480aae
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
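+ *
+ * For illustration only (all values below are made up), an incremental
+ * backup's backup_label might contain lines like:
+ *
+ * START WAL LOCATION: 0/6000028 (file 000000010000000000000006)
+ * START TIMELINE: 1
+ * INCREMENTAL FROM LSN: 0/4000028
+ * INCREMENTAL FROM TLI: 1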
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", src);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
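
To make the intended calling convention concrete, here is a minimal sketch
(paths hypothetical, and assuming the usual frontend includes: postgres_fe.h,
common/checksum_helper.h, copy_file.h) of copying one file while accumulating
a CRC-32C checksum; pg_checksum_final() then extracts the result:

pg_checksum_context ctx;
uint8 payload[PG_CHECKSUM_MAX_LENGTH];
int payload_len;

pg_checksum_init(&ctx, CHECKSUM_TYPE_CRC32C);
copy_file("backup/base/1/1259", "restored/base/1/1259", &ctx, false);
payload_len = pg_checksum_final(&ctx, payload);
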
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
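
As a usage sketch (directory name hypothetical, and assuming load_manifest.h
plus the usual frontend setup), a caller loads one manifest and then probes
the hash table via the lookup function that simplehash generates from the
definitions above:

manifest_data *m = load_backup_manifest("/backups/full");

if (m != NULL)
{
	manifest_file *f = manifest_files_lookup(m->files, "backup_label");

	if (f != NULL)
		pg_log_debug("backup_label is %zu bytes", f->size);
}
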
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..bea0db405e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..6c7fd3290e
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1268 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ bool progress;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"output", required_argument, NULL, 'o'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"progress", no_argument, NULL, 'P'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'P':
+ opt.progress = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ fsync_pgdata(opt.output, version * 10000);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
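+ *
+ * For example (paths hypothetical), -T /srv/ts_old=/srv/ts_new relocates the
+ * tablespace at /srv/ts_old to /srv/ts_new; a literal "=" in either path can
+ * be escaped as "\=".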
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -P, --progress show progress information\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ Oid oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
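+ *
+ * For example, if three backups are given on the command line, the first two
+ * (the full backup and the first incremental) are the prior backups; then
+ * n_prior_backups is 2 and prior_backup_dirs holds those two paths.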
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ /* Copy the payload so it can be freed uniformly below. */
+ checksum_payload = pg_malloc(checksum_length);
+ memcpy(checksum_payload, mfile->checksum_payload,
+ checksum_length);
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those at or beyond that threshold are left
+ * as zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must still be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
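+ /*
+ * Worked example (illustrative): if truncation_block_length is 3 and
+ * the newest incremental file contains blocks 1 and 5, then
+ * block_length is 6; block 1 reduces num_missing_blocks from 3 to 2;
+ * block 5 is sourced from this file without affecting the count;
+ * blocks 0 and 2 must still be found in an older backup; and blocks 3
+ * and 4, which the truncation tells us no longer have valid contents,
+ * will be zero-filled.
+ */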
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
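
Aside for reviewers: the on-disk header that make_incremental_rfile()
parses above is just the magic number, the block count, the truncation
block length, and then the block-number array, all in native byte order,
with the block data following. A throwaway dumper like this sketch - not
part of the patch, and assuming 4-byte unsigned ints and BlockNumbers -
is handy for poking at the INCREMENTAL.* files a backup produces:

#include <stdint.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	uint32_t	hdr[3];		/* magic, num_blocks, truncation_block_length */
	uint32_t	blkno;
	FILE	   *f;

	if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
		return 1;
	if (fread(hdr, sizeof(uint32_t), 3, f) != 3)
		return 1;
	printf("magic 0x%x, %u blocks, truncation_block_length %u\n",
		   (unsigned) hdr[0], (unsigned) hdr[1], (unsigned) hdr[2]);
	while (hdr[1]-- > 0 && fread(&blkno, sizeof(blkno), 1, f) == 1)
		printf("block %u\n", (unsigned) blkno);
	return 0;
}
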
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
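
Aside: each file entry emitted by the code above ends up looking roughly
like this, with invented values for illustration:

{ "Path": "base/1/1259", "Size": 8192, "Last-Modified": "2023-06-14 12:00:00 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "0aba1234" }
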
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
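
Aside: the writer API in write_manifest.c/.h is driven roughly like this -
an illustrative sketch, not part of the patch, and I'm assuming here that
the parsed manifest from load_manifest.h exposes its WAL ranges as
first_wal_range:

	manifest_writer *mwriter = create_manifest_writer(output_directory);

	/* Once per file written into the output directory. */
	add_file_to_manifest(mwriter, "base/1/1259", 8192, time(NULL),
						 CHECKSUM_TYPE_NONE, 0, NULL);

	/* Emits WAL-Ranges and Manifest-Checksum, then closes the file. */
	finalize_manifest(mwriter, latest_manifest->first_wal_range);
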
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e7ef2b8bd0..f35302e994 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -488,6 +489,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1029,6 +1031,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
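+/* 40 = 8 hex digits of timeline ID plus 16 each for the start and end LSNs. */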
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index e4cd26762b..ef38cc2f03 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS_COMMON = \
archive.o \
base64.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
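
Aside: here is roughly how I expect callers to drive the in-memory API
that follows - an illustrative sketch only; the RelFileLocator values and
the write callback are made up, and the real consumers are the
walsummarizer and the backup code:

	BlockRefTable *brtab = CreateEmptyBlockRefTable();
	RelFileLocator rlocator = {1663, 5, 16384};	/* invented OIDs */

	/*
	 * A truncation to 100 blocks: sets the limit block and forgets any
	 * modified blocks at or beyond it.
	 */
	BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);

	/* Called for each modified block as the WAL is read. */
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);

	/* Serialize through a caller-supplied I/O callback. */
	WriteBlockRefTable(brtab, my_write_callback, my_write_callback_arg);
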
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
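+
+/*
+ * Worked example (illustrative): block 70000 belongs to chunk
+ * 70000 / BLOCKS_PER_CHUNK = 1 and is represented there by the 2-byte
+ * offset 70000 % BLOCKS_PER_CHUNK = 4464. An array chunk holding three
+ * offsets takes 6 bytes of chunk data; a bitmap chunk always takes
+ * MAX_ENTRIES_PER_CHUNK * sizeof(uint16) = 8192 bytes, one bit per block.
+ */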
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* used internally by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
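
Continuing the sketch from above, retrieving what was recorded might look
like this (again illustrative only):

    BlockNumber limit_block;
    BlockNumber blocks[16];
    BlockRefTableEntry *entry;

    entry = BlockRefTableGetEntry(brtab, &rlocator, MAIN_FORKNUM,
                                  &limit_block);
    if (entry != NULL)
    {
        /* Fetch tracked modified blocks below the limit block. */
        int nresults = BlockRefTableEntryGetBlocks(entry, 0, limit_block,
                                                   blocks, 16);

        /* blocks[0 .. nresults - 1] now hold modified block numbers. */
    }
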
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
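
The io_callback_fn abstraction is what lets this code run in both backend
and frontend. As a sketch of a frontend write callback over stdio
(write_to_stream and the use of pg_fatal are my illustration, not part of
the patch), remembering that the API contract says a write callback must
not return after a failure:

    static int
    write_to_stream(void *callback_arg, void *data, int length)
    {
        FILE       *stream = (FILE *) callback_arg;

        /* Treat short writes as fatal so that we never return one. */
        if (fwrite(data, 1, length, stream) != (size_t) length)
            pg_fatal("could not write block reference table: %m");
        return length;
    }

The whole table can then be serialized with something like
WriteBlockRefTable(brtab, write_to_stream, stream).
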
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
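
Putting the reader pieces together, a consumer follows the protocol above:
one call to NextRelation, then GetBlocks until it returns 0. In this
sketch, read_from_stream, report_fatal_error, stream, and filename stand in
for caller-supplied callbacks and state:

    RelFileLocator rlocator;
    ForkNumber  forknum;
    BlockNumber limit_block;
    BlockNumber blocks[256];
    BlockRefTableReader *reader;

    reader = CreateBlockRefTableReader(read_from_stream, stream, filename,
                                       report_fatal_error, NULL);
    while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
                                           &limit_block))
    {
        unsigned    nblocks;

        /* Drain this relation fork before moving on to the next one. */
        while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
                                                       256)) > 0)
        {
            /* ... process blocks[0 .. nblocks - 1] ... */
        }
    }
    DestroyBlockRefTableReader(reader);
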
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
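
Here is how a standalone entry and the incremental writer fit together, as
a sketch that writes a single entry (write_to_stream is the hypothetical
callback from earlier):

    BlockRefTableWriter *writer;
    BlockRefTableEntry *entry;

    writer = CreateBlockRefTableWriter(write_to_stream, stream);

    /* Entries must be supplied in sorted order; here there is only one. */
    entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
    BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 7);
    BlockRefTableEntrySetLimitBlock(entry, 128);    /* truncated to 128 */
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);

    DestroyBlockRefTableWriter(writer);
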
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer to be written to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index cc6671edca..4ee0ea1f9d 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
common_sources = files(
'archive.c',
'base64.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6996073989..c21573efb6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12043,4 +12043,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
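
To make the limit-block rules in the header comment concrete, this is how I
would expect a summarizer-like caller to map events onto the API (a sketch,
not code from the patch):

    /* Relation fork created or dropped: everything counts as modified. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 0);

    /* Relation fork truncated to 128 blocks. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 128);

    /* A WAL record modified block 42. */
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);
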
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 518d3b0a1f..3f99e2eddb 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -47,6 +47,7 @@ typedef enum
WAIT_EVENT_SYSLOGGER_MAIN,
WAIT_EVENT_WAL_RECEIVER_MAIN,
WAIT_EVENT_WAL_SENDER_MAIN,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL,
WAIT_EVENT_WAL_WRITER_MAIN
} WaitEventActivity;
@@ -131,6 +132,7 @@ typedef enum
WAIT_EVENT_SYNC_REP,
WAIT_EVENT_WAL_RECEIVER_EXIT,
WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+ WAIT_EVENT_WAL_SUMMARY_READY,
WAIT_EVENT_XACT_GROUP_UPDATE
} WaitEventIPC;
@@ -150,7 +152,8 @@ typedef enum
WAIT_EVENT_REGISTER_SYNC_REQUEST,
WAIT_EVENT_SPIN_DELAY,
WAIT_EVENT_VACUUM_DELAY,
- WAIT_EVENT_VACUUM_TRUNCATE
+ WAIT_EVENT_VACUUM_TRUNCATE,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR
} WaitEventTimeout;
/* ----------
@@ -232,6 +235,8 @@ typedef enum
WAIT_EVENT_WAL_INIT_SYNC,
WAIT_EVENT_WAL_INIT_WRITE,
WAIT_EVENT_WAL_READ,
+ WAIT_EVENT_WAL_SUMMARY_READ,
+ WAIT_EVENT_WAL_SUMMARY_WRITE,
WAIT_EVENT_WAL_SYNC,
WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
WAIT_EVENT_WAL_WRITE
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 33e50ad933..6ba5eca700 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 260854747b..48a10a5d39 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3979,3 +3979,27 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+FileChunkContext
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0001-In-basebackup.c-refactor-to-create-verify_page_ch.patch (application/octet-stream)
From bf013c066e9f9af9f231e906aaf5a1581a9e4d04 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:07 -0400
Subject: [PATCH v1 1/8] In basebackup.c, refactor to create
verify_page_checksum.
If checksum verification fails for a particular page, we reread the
page and try one more time. The code that does this is somewhat complex
and difficult to follow. Move some of the logic into a new function
and rearrange the code a bit to try to make it clearer. This way,
we don't need the block_retry Boolean, a couple of other variables
move from sendFile() into the new function, and some code is now less
deeply indented.
---
src/backend/backup/basebackup.c | 188 ++++++++++++++++++--------------
1 file changed, 104 insertions(+), 84 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..0daf8257bc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,9 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
+ BlockNumber blkno,
+ uint16 *expected_checksum);
static void sendFileWithContent(bbsink *sink, const char *filename,
const char *content,
backup_manifest_info *manifest);
@@ -1485,14 +1488,11 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
int fd;
BlockNumber blkno = 0;
- bool block_retry = false;
- uint16 checksum;
int checksum_failures = 0;
off_t cnt;
int i;
pgoff_t len = 0;
char *page;
- PageHeader phdr;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1582,94 +1582,78 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
for (i = 0; i < cnt / BLCKSZ; i++)
{
+ int reread_cnt;
+ uint16 expected_checksum;
+
page = sink->bbs_buffer + BLCKSZ * i;
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
/*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * Retry the block on the first failure. It's possible that
+ * we read the first 4K page of the block just before postgres
+ * updated the entire block so it ends up looking torn to us.
+ * If, before we retry the read, the concurrent write of the
+ * block finishes, the page LSN will be updated and we'll
+ * realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen,
+ * though: the torn write could take an arbitrarily long time
+ * to complete. Retrying multiple times wouldn't fix this
+ * problem, either, though it would reduce the chances of it
+ * happening in practice. The only real fix here seems to be
+ * to have some kind of interlock that allows us to wait until
+ * we can be certain that no write to the block is in
+ * progress. Since we don't have any such thing right now, we
+ * just do this and hope for the best.
*/
- if (!PageIsNew(page) && PageGetLSN(page) < sink->bbs_state->startptr)
+ reread_cnt =
+ basebackup_read_file(fd,
+ sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, len + BLCKSZ * i,
+ readfilename,
+ false);
+ if (reread_cnt == 0)
{
- checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
- phdr = (PageHeader) page;
- if (phdr->pd_checksum != checksum)
- {
- /*
- * Retry the block on the first failure. It's
- * possible that we read the first 4K page of the
- * block just before postgres updated the entire block
- * so it ends up looking torn to us. If, before we
- * retry the read, the concurrent write of the block
- * finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually
- * happen, though: the torn write could take an
- * arbitrarily long time to complete. Retrying
- * multiple times wouldn't fix this problem, either,
- * though it would reduce the chances of it happening
- * in practice. The only real fix here seems to be to
- * have some kind of interlock that allows us to wait
- * until we can be certain that no write to the block
- * is in progress. Since we don't have any such thing
- * right now, we just do this and hope for the best.
- */
- if (block_retry == false)
- {
- int reread_cnt;
-
- /* Reread the failed block */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent
- * truncation must have occurred, so break out
- * of this loop just as if the initial fread()
- * returned 0. We'll drop through to the same
- * code that handles that case. (We must fix
- * up cnt first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* Set flag so we know a retry was attempted */
- block_retry = true;
-
- /* Reset loop to validate the block again */
- i--;
- continue;
- }
-
- checksum_failures++;
-
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno, checksum,
- phdr->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
+ /*
+ * If we hit end-of-file, a concurrent truncation must
+ * have occurred, so break out of this loop just as if the
+ * initial fread() returned 0. We'll drop through to the
+ * same code that handles that case. (We must fix up cnt
+ * first, though.)
+ */
+ cnt = BLCKSZ * i;
+ break;
}
- block_retry = false;
- blkno++;
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ checksum_failures++;
+ if (checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
}
+
+ /* Update block number for next pass through the outer loop. */
+ blkno += i;
}
/*
@@ -1734,6 +1718,42 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Try to verify the checksum for the provided page, if it seems appropriate
+ * to do so.
+ *
+ * Returns true if verification succeeds or if we decide not to check it,
+ * and false if verification fails. When returning false, it also sets
+ * *expected_checksum to the computed value.
+ */
+static bool
+verify_page_checksum(Page page, XLogRecPtr start_lsn, BlockNumber blkno,
+ uint16 *expected_checksum)
+{
+ PageHeader phdr;
+ uint16 checksum;
+
+ /*
+ * Only check pages which have not been modified since the start of the
+ * base backup. Otherwise, they might have been written only halfway and
+ * the checksum would not be valid. However, replaying WAL would
+ * reinstate the correct page in this case. We also skip completely new
+ * pages, since they don't have a checksum yet.
+ */
+ if (PageIsNew(page) || PageGetLSN(page) >= start_lsn)
+ return true;
+
+ /* Perform the actual checksum calculation. */
+ checksum = pg_checksum_page(page, blkno);
+
+ /* See whether it matches the value from the page. */
+ phdr = (PageHeader) page;
+ if (phdr->pd_checksum == checksum)
+ return true;
+ *expected_checksum = checksum;
+ return false;
+}
+
static int64
_tarWriteHeader(bbsink *sink, const char *filename, const char *linktarget,
struct stat *statbuf, bool sizeonly)
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0005-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From f163b7ce56cbf6fa2e716f4d68d101a192297bf3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:19 -0400
Subject: [PATCH v1 5/8] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 173 +++++++++++---------------------
1 file changed, 56 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 24c038dfba..64ab54fe06 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,41 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork
+ * with the same RelFileNumber. If so, the file can be
+ * excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1413,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1436,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1450,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1458,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1483,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1505,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0003-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch (application/octet-stream)
From cfac4b6329e5854010160896ba8e7a1bc34f3b31 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:00 -0400
Subject: [PATCH v1 3/8] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b2430f617c..664d4ba598 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8456,9 +8456,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can
+ * be represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8533,7 +8546,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..4ff4430006 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index f46f930329..cc3d2e0c41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1419,8 +1420,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1489,8 +1490,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index 1db80cde1b..2b42fd257e 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0002-In-basebackup.c-refactor-to-create-read_file_data.patch (application/octet-stream)
From 6889b92e6d235c1554ed903a8b6edb5fcd646a03 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:14 -0400
Subject: [PATCH v1 2/8] In basebackup.c, refactor to create
read_file_data_into_buffer.
This further reduces the length and complexity of sendFile(),
hopefully making it easier to understand and modify. In addition
to moving some logic into a new function, I took this opportunity
to make a few slight adjustments to sendFile() itself, including
renaming the 'len' variable to 'bytes_done', since we use it to represent
the number of bytes we've already handled so far, not the total
length of the file.
---
src/backend/backup/basebackup.c | 231 ++++++++++++++++++--------------
1 file changed, 133 insertions(+), 98 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 0daf8257bc..f46f930329 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,12 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static off_t read_file_data_into_buffer(bbsink *sink,
+ const char *readfilename, int fd,
+ off_t offset, size_t length,
+ BlockNumber blkno,
+ bool verify_checksum,
+ int *checksum_failures);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -1490,9 +1496,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
BlockNumber blkno = 0;
int checksum_failures = 0;
off_t cnt;
- int i;
- pgoff_t len = 0;
- char *page;
+ pgoff_t bytes_done = 0;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1514,6 +1518,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
_tarWriteHeader(sink, tarfilename, NULL, statbuf, false);
+ /*
+ * Checksums are verified in multiples of BLCKSZ, so the buffer length
+ * should be a multiple of the block size as well.
+ */
+ Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
+
if (!noverify_checksums && DataChecksumsEnabled())
{
char *filename;
@@ -1551,23 +1561,21 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
/* Try to read some more data. */
- cnt = basebackup_read_file(fd, sink->bbs_buffer,
- Min(sink->bbs_buffer_length, remaining),
- len, readfilename, true);
+ cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
+ remaining,
+ blkno + segmentno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
/*
- * The checksums are verified at block level, so we iterate over the
- * buffer in chunks of BLCKSZ, after making sure that
- * TAR_SEND_SIZE/buf is divisible by BLCKSZ and we read a multiple of
- * BLCKSZ bytes.
+ * If the amount of data we were able to read was not a multiple of
+ * BLCKSZ, we cannot verify checksums, which are block-level.
*/
- Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
-
if (verify_checksum && (cnt % BLCKSZ != 0))
{
ereport(WARNING,
@@ -1578,84 +1586,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
verify_checksum = false;
}
- if (verify_checksum)
- {
- for (i = 0; i < cnt / BLCKSZ; i++)
- {
- int reread_cnt;
- uint16 expected_checksum;
-
- page = sink->bbs_buffer + BLCKSZ * i;
-
- /* If the page is OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /*
- * Retry the block on the first failure. It's possible that
- * we read the first 4K page of the block just before postgres
- * updated the entire block so it ends up looking torn to us.
- * If, before we retry the read, the concurrent write of the
- * block finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually happen,
- * though: the torn write could take an arbitrarily long time
- * to complete. Retrying multiple times wouldn't fix this
- * problem, either, though it would reduce the chances of it
- * happening in practice. The only real fix here seems to be
- * to have some kind of interlock that allows us to wait until
- * we can be certain that no write to the block is in
- * progress. Since we don't have any such thing right now, we
- * just do this and hope for the best.
- */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent truncation must
- * have occurred, so break out of this loop just as if the
- * initial fread() returned 0. We'll drop through to the
- * same code that handles that case. (We must fix up cnt
- * first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* If the page now looks OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /* Handle checksum failure. */
- checksum_failures++;
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno + i, expected_checksum,
- ((PageHeader) page)->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
-
- /* Update block number for next pass through the outer loop. */
- blkno += i;
- }
-
/*
* If we hit end-of-file, a concurrent truncation must have occurred.
* That's not an error condition, because WAL replay will fix things
@@ -1664,6 +1594,10 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (cnt == 0)
break;
+ /* Update block number and # of bytes done for next loop iteration. */
+ blkno += cnt / BLCKSZ;
+ bytes_done += cnt;
+
/* Archive the data we just read. */
bbsink_archive_contents(sink, cnt);
@@ -1671,14 +1605,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (pg_checksum_update(&checksum_ctx,
(uint8 *) sink->bbs_buffer, cnt) < 0)
elog(ERROR, "could not update checksum of base backup");
-
- len += cnt;
}
/* If the file was truncated while we were sending it, pad it with zeros */
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
size_t nbytes = Min(sink->bbs_buffer_length, remaining);
MemSet(sink->bbs_buffer, 0, nbytes);
@@ -1687,7 +1619,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
nbytes) < 0)
elog(ERROR, "could not update checksum of base backup");
bbsink_archive_contents(sink, nbytes);
- len += nbytes;
+ bytes_done += nbytes;
}
/*
@@ -1695,7 +1627,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* of data is probably not worth throttling, and is not checksummed
* because it's not actually part of the file.)
*/
- _tarWritePadding(sink, len);
+ _tarWritePadding(sink, bytes_done);
CloseTransientFile(fd);
@@ -1718,6 +1650,109 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Read some more data from the file into the bbsink's buffer, verifying
+ * checksums as required.
+ *
+ * 'offset' is the file offset from which we should begin to read, and
+ * 'length' is the amount of data that should be read. The actual amount
+ * of data read will be less than the requested amount if the bbsink's
+ * buffer isn't big enough to hold it all, or if the underlying file has
+ * been truncated. The return value is the number of bytes actually read.
+ *
+ * 'blkno' is the block number of the first page in the bbsink's buffer
+ * relative to the start of the relation.
+ *
+ * 'verify_checksum' indicates whether we should try to verify checksums
+ * for the blocks we read. If we do this, we'll update *checksum_failures
+ * and issue warnings as appropriate.
+ */
+static off_t
+read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
+ off_t offset, size_t length, BlockNumber blkno,
+ bool verify_checksum, int *checksum_failures)
+{
+ off_t cnt;
+ int i;
+ char *page;
+
+ /* Try to read some more data. */
+ cnt = basebackup_read_file(fd, sink->bbs_buffer,
+ Min(sink->bbs_buffer_length, length),
+ offset, readfilename, true);
+
+ /* Can't verify checksums if read length is not a multiple of BLCKSZ. */
+ if (!verify_checksum || (cnt % BLCKSZ) != 0)
+ return cnt;
+
+ /* Verify checksum for each block. */
+ for (i = 0; i < cnt / BLCKSZ; i++)
+ {
+ int reread_cnt;
+ uint16 expected_checksum;
+
+ page = sink->bbs_buffer + BLCKSZ * i;
+
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /*
+ * Retry the block on the first failure. It's possible that we read
+ * the first 4K page of the block just before postgres updated the
+ * entire block so it ends up looking torn to us. If, before we retry
+ * the read, the concurrent write of the block finishes, the page LSN
+ * will be updated and we'll realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen, though: the
+ * torn write could take an arbitrarily long time to complete.
+ * Retrying multiple times wouldn't fix this problem, either, though
+ * it would reduce the chances of it happening in practice. The only
+ * real fix here seems to be to have some kind of interlock that
+ * allows us to wait until we can be certain that no write to the
+ * block is in progress. Since we don't have any such thing right now,
+ * we just do this and hope for the best.
+ */
+ reread_cnt =
+ basebackup_read_file(fd, sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, offset + BLCKSZ * i,
+ readfilename, false);
+ if (reread_cnt == 0)
+ {
+ /*
+ * If we hit end-of-file, a concurrent truncation must have
+ * occurred, so reduce cnt to reflect only the blocks already
+ * processed and break out of this loop.
+ */
+ cnt = BLCKSZ * i;
+ break;
+ }
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ (*checksum_failures)++;
+ if (*checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (*checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
+ }
+
+ return cnt;
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
--
2.37.1 (Apple Git-137.1)
Attachment: v1-0004-Refactor-parse_filename_for_nontemp_relation-to-p.patch (application/octet-stream)
From 9b20eaf01f76b1bc325ce0968b294828cde09ea5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:15 -0400
Subject: [PATCH v1 4/8] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It could be done separately from the rest of this commit, but that
would be more involved, whereas this way it's a 1-line change.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index cc3d2e0c41..24c038dfba 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1198,9 +1198,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1250,23 +1250,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..31d6e01106 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist
+ * at all -- but if 0017.3 does exist, it's a non-relation file, not
+ * part of the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
Hi,
On 2023-06-14 14:46:48 -0400, Robert Haas wrote:
A few years ago, I sketched out a design for incremental backup, but
no patch for incremental backup ever got committed. Instead, the whole
thing evolved into a project to add backup manifests, which are nice,
but not as nice as incremental backup would be. So I've decided to
have another go at incremental backup itself. Attached are some WIP
patches. Let me summarize the design and some open questions and
problems with it that I've discovered. I welcome problem reports and
test results from others, as well.
Cool!
I originally had the idea of summarizing a certain number of MB of WAL
per WAL summary file, and so I added a GUC wal_summarize_mb for that
purpose. But then I realized that actually, you really want WAL
summary file boundaries to line up with possible redo points, because
when you do an incremental backup, you need a summary that stretches
from the redo point of the checkpoint written at the start of the
prior backup to the redo point of the checkpoint written at the start
of the current backup. The block modifications that happen in that
range of WAL records are the ones that need to be included in the
incremental.
I assume this is "solely" required for keeping the incremental backups as
small as possible, rather than being required for correctness?
Unfortunately, there's no indication in the WAL itself
that you've reached a redo point, but I wrote code that tries to
notice when we've reached the redo point stored in shared memory and
stops the summary there. But I eventually realized that's not good
enough either, because if summarization zooms past the redo point
before noticing the updated redo point in shared memory, then the
backup sat around waiting for the next summary file to be generated so
it had enough summaries to proceed with the backup, while the
summarizer was in no hurry to finish up the current file and just sat
there waiting for more WAL to be generated. Eventually the incremental
backup would just time out. I tried to fix that by making it so that
if somebody's waiting for a summary file to be generated, they can let
the summarizer know about that and it can write a summary file ending
at the LSN up to which it has read and then begin a new file from
there. That seems to fix the hangs, but now I've got three
overlapping, interconnected systems for deciding where to end the
current summary file, and maybe that's OK, but I have a feeling there
might be a better way.
Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?
Dilip had an interesting potential solution to this problem, which was
to always emit a special WAL record at the redo pointer. That is, when
we fix the redo pointer for the checkpoint record we're about to
write, also insert a WAL record there. That way, when the summarizer
reaches that sentinel record, it knows it should stop the summary just
before. I'm not sure whether this approach is viable, especially from
a performance and concurrency perspective, and I'm not sure whether
people here would like it, but it does seem like it would make things
a whole lot simpler for this patch set.
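To make that concrete, here is a rough sketch of what emitting such a
sentinel might look like at the point where the redo location is fixed;
XLOG_CHECKPOINT_REDO is an invented info code for rmgr XLOG, used purely
for illustration:

    XLogRecPtr  endptr;

    /*
     * Insert an empty sentinel record. Its start LSN becomes the redo
     * pointer of the checkpoint record we are about to write.
     */
    XLogBeginInsert();
    endptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);

    /* XLogInsert returns the record's end; its start is in ProcLastRecPtr. */
    checkPoint.redo = ProcLastRecPtr;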
FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.
I don't really see a performance and concurrency angle right now - what are
you wondering about?
Another thing that I'm not too sure about is: what happens if we find
a relation file on disk that doesn't appear in the backup_manifest for
the previous backup and isn't mentioned in the WAL summaries either?
Wouldn't that commonly happen for unlogged relations at least?
I suspect there's also other ways to end up with such additional files,
e.g. by crashing during the creation of a new relation.
A few less-serious problems with the patch:
- We don't have an incremental JSON parser, so if you have a
backup_manifest>1GB, pg_basebackup --incremental is going to fail.
That's also true of the existing code in pg_verifybackup, and for the
same reason. I talked to Andrew Dunstan at one point about adapting
our JSON parser to support incremental parsing, and he had a patch for
that, but I think he found some problems with it and I'm not sure what
the current status is.
As a stopgap measure, can't we just use the relevant flag to allow larger
allocations?
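(Assuming the flag in question is MCXT_ALLOC_HUGE - just a guess at what
was meant - the stopgap would amount to something like:

    /*
     * Sketch only: read the whole manifest into one buffer that may
     * exceed the ordinary 1GB palloc limit. 'manifest_size' is a
     * hypothetical variable holding the file size.
     */
    char       *buf = palloc_extended(manifest_size + 1, MCXT_ALLOC_HUGE);

so the parser keeps working on a single giant string.)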
- The patch does support differential backup, aka an incremental atop
another incremental. There's no particular limit to how long a chain
of backups can be. However, pg_combinebackup currently requires that
the first backup is a full backup and all the later ones are
incremental backups. So if you have a full backup a and an incremental
backup b and a differential backup c, you can combine a b and c to get
a full backup equivalent to one you would have gotten if you had taken
a full backup at the time you took c. However, you can't combine b and
c with each other without combining them with a, and that might be
desirable in some situations. You might want to collapse a bunch of
older differential backups into a single one that covers the whole
time range of all of them. I think that the file format can support
that, but the tool is currently too dumb.
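To spell that out using the same syntax as the usage example earlier in
the thread (directory names are illustrative):

pg_basebackup -cfast -Da
pg_basebackup -cfast -Db --incremental a/backup_manifest
pg_basebackup -cfast -Dc --incremental b/backup_manifest
pg_combinebackup a b c -o restored     # accepted
pg_combinebackup b c -o collapsed      # currently rejected: no full backup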
That seems like a feature for the future...
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too. I don't think there would be
a whole lot of point trying to operate on compressed tar files because
you need random access and that seems hard on a compressed file, but
on uncompressed files it seems at least theoretically doable. I'm not
sure whether anyone would care that much about this, though, even
though it does sound pretty cool.
I don't know the tar format well, but my understanding is that it doesn't have
a "central metadata" portion. I.e. doing something like this would entail
scanning the tar file sequentially, skipping file contents? And wouldn't you
have to create an entirely new tar file for the modified output? That kind of
makes it not so incremental ;)
IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
seem worth bothering about now. But I might just be missing something.
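(For what it's worth, the sequential scan itself is simple; a minimal
sketch of walking uncompressed ustar headers while skipping the contents,
ignoring pax/GNU extension headers:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        FILE       *f;
        char        hdr[512];

        if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
            return 1;
        while (fread(hdr, 1, 512, f) == 512 && hdr[0] != '\0')
        {
            /* the size field is 12 bytes of octal text at offset 124 */
            unsigned long size = strtoul(hdr + 124, NULL, 8);

            /* the member name occupies the first 100 bytes */
            printf("%.100s: %lu bytes\n", hdr, size);

            /* contents are padded to a multiple of 512 bytes */
            if (fseek(f, (long) ((size + 511) & ~511UL), SEEK_CUR) != 0)
                break;
        }
        fclose(f);
        return 0;
    }

Random access by name, though, really does require a full pass.)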
Greetings,
Andres Freund
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
I assume this is "solely" required for keeping the incremental backups as
small as possible, rather than being required for correctness?
I believe so. I want to spend some more time thinking about this to
make sure I'm not missing anything.
Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?
I'm not understanding how that would help. If we were going to compute
a WAL summary on the fly rather than waiting for one to show up on
disk, what we'd want is [end of last WAL summary that does exist on
disk, redo]. But I'm not sure that's a great approach, because that
LSN gap might be large and then we're duplicating a lot of work that
the summarizer has probably already done most of.
FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.
I don't really see a performance and concurrency angle right now - what are
you wondering about?
I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.
Another thing that I'm not too sure about is: what happens if we find
a relation file on disk that doesn't appear in the backup_manifest for
the previous backup and isn't mentioned in the WAL summaries either?
Wouldn't that commonly happen for unlogged relations at least?
I suspect there's also other ways to end up with such additional files,
e.g. by crashing during the creation of a new relation.
Yeah, this needs some more careful thought.
A few less-serious problems with the patch:
- We don't have an incremental JSON parser, so if you have a
backup_manifest>1GB, pg_basebackup --incremental is going to fail.
That's also true of the existing code in pg_verifybackup, and for the
same reason. I talked to Andrew Dunstan at one point about adapting
our JSON parser to support incremental parsing, and he had a patch for
that, but I think he found some problems with it and I'm not sure what
the current status is.
As a stopgap measure, can't we just use the relevant flag to allow larger
allocations?
I'm not sure that's a good idea, but theoretically, yes. We can also
just choose to accept the limitation that your data directory can't be
too darn big if you want to use this feature. But getting incremental
JSON parsing would be better.
Not having the manifest in JSON would be an even better solution, but
regrettably I did not win that argument.
That seems like a feature for the future...
Sure.
I don't know the tar format well, but my understanding is that it doesn't have
a "central metadata" portion. I.e. doing something like this would entail
scanning the tar file sequentially, skipping file contents? And wouldn't you
have to create an entirely new tar file for the modified output? That kind of
makes it not so incremental ;)
IOW, I'm not sure it's worth bothering about this ever, and certainly doesn't
seem worth bothering about now. But I might just be missing something.
Oh, yeah, it's just an idle thought. I'll get to it when I get to it,
or else I won't.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 14 Jun 2023 at 20:47, Robert Haas <robertmhaas@gmail.com> wrote:
A few years ago, I sketched out a design for incremental backup, but
no patch for incremental backup ever got committed. Instead, the whole
thing evolved into a project to add backup manifests, which are nice,
but not as nice as incremental backup would be. So I've decided to
have another go at incremental backup itself. Attached are some WIP
patches.
Nice, I like this idea.
Let me summarize the design and some open questions and
problems with it that I've discovered. I welcome problem reports and
test results from others, as well.
Skimming through the 7th patch, I see claims that FSM is not fully
WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
track those changes.
I disagree with that decision: we now have support for custom resource
managers, which may use the various forks for other purposes than
those used in PostgreSQL right now. It would be a shame if data is
lost because of the backup tool ignoring forks because the PostgreSQL
project itself doesn't have post-recovery consistency guarantees in
that fork. So, unless we document that WAL-logged changes in the FSM
fork are actually not recoverable from backup, regardless of the type
of contents, we should still keep track of the changes in the FSM fork
and include the fork in our backups or only exclude those FSM updates
that we know are safe to ignore.
Kind regards,
Matthias van de Meent
Neon, Inc.
Hi,
On 2023-06-14 16:10:38 -0400, Robert Haas wrote:
On Wed, Jun 14, 2023 at 3:47 PM Andres Freund <andres@anarazel.de> wrote:
Could we just recompute the WAL summary for the [redo, end of chunk] for the
relevant summary file?
I'm not understanding how that would help. If we were going to compute
a WAL summary on the fly rather than waiting for one to show up on
disk, what we'd want is [end of last WAL summary that does exist on
disk, redo].
Oh, right.
But I'm not sure that's a great approach, because that LSN gap might be
large and then we're duplicating a lot of work that the summarizer has
probably already done most of.
I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.
FWIW, I like the idea of a special WAL record at that point, independent of
this feature. It wouldn't be a meaningful overhead compared to the cost of a
checkpoint, and it seems like it'd be quite useful for debugging. But I can
see uses going beyond that - we occasionally have been discussing associating
additional data with redo points, and that'd be a lot easier to deal with
during recovery with such a record.
I don't really see a performance and concurrency angle right now - what are
you wondering about?
I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.
Will do. Adding me to CC: might help, I have a backlog unfortunately :(.
Greetings,
Andres Freund
On Thu, Jun 15, 2023 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
I'm not really sure. I expect Dilip would be happy to post his patch,
and if you'd be willing to have a look at it and express your concerns
or lack thereof, that would be super valuable.
Will do. Adding me to CC: might help, I have a backlog unfortunately :(.
Thanks, I have posted it here[1]
[1]: /messages/by-id/CAFiTN-s-K=mVA=HPr_VoU-5bvyLQpNeuzjq1ebPJMEfCJZKFsg@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
But I'm not sure that's a great approach, because that LSN gap might be
large and then we're duplicating a lot of work that the summarizer has
probably already done most of.I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.
Yeah, but I don't think that's the right approach, for two reasons.
First, one of the things I'm rather worried about is what happens when
the WAL distance between the prior backup and the incremental backup
is large. It could be a terabyte. If we have a WAL summary for every
32MB of WAL, that's 32k files we have to read, and I'm concerned
that's too many. Maybe it isn't, but it's something that has really
been weighing on my mind as I've been thinking through the design
questions here. The files are really very small, and having to open a
bazillion tiny little files to get the job done sounds lame. Second, I
don't see what problem it actually solves. Why not just signal the
summarizer to write out the accumulated data to a file instead of
re-doing the work ourselves? Or else adopt the
WAL-record-at-the-redo-pointer approach, and then the whole thing is
moot?
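(To put a number on it: at one summary per 32MB, a 1TB gap between
backups is 1TB / 32MB = 32,768 summary files, every one of which has to
be opened, read, and merged at backup time.)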
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2023-06-19 09:46:12 -0400, Robert Haas wrote:
On Wed, Jun 14, 2023 at 4:40 PM Andres Freund <andres@anarazel.de> wrote:
But I'm not sure that's a great approach, because that LSN gap might be
large and then we're duplicating a lot of work that the summarizer has
probably already done most of.
I guess that really depends on what the summary granularity is. If you create
a separate summary every 32MB or so, recomputing just the required range
shouldn't be too bad.
Yeah, but I don't think that's the right approach, for two reasons.
First, one of the things I'm rather worried about is what happens when
the WAL distance between the prior backup and the incremental backup
is large. It could be a terabyte. If we have a WAL summary for every
32MB of WAL, that's 32k files we have to read, and I'm concerned
that's too many. Maybe it isn't, but it's something that has really
been weighing on my mind as I've been thinking through the design
questions here.
It doesn't have to be a separate file - you could easily summarize ranges
at a higher granularity, storing multiple ranges into a single file with a
coarser naming pattern.
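To illustrate -- this is just a sketch, not part of any posted patch --
such a file could carry a small index of fine-grained sub-summaries, so
that a reader can pull out only the LSN ranges it needs:

/* Hypothetical index entry for one sub-summary within a coarse file. */
typedef struct SummaryIndexEntry
{
    XLogRecPtr  start_lsn;      /* first WAL position covered */
    XLogRecPtr  end_lsn;        /* end of the covered range */
    uint32      offset;         /* byte offset of the sub-summary payload */
    uint32      length;         /* payload length in bytes */
} SummaryIndexEntry;

The file name then only needs to encode the coarse overall range, while
the individual entries can still line up with possible redo points.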
The files are really very small, and having to open a bazillion tiny little
files to get the job done sounds lame. Second, I don't see what problem it
actually solves. Why not just signal the summarizer to write out the
accumulated data to a file instead of re-doing the work ourselves? Or else
adopt the WAL-record-at-the-redo-pointer approach, and then the whole thing
is moot?
The one point I see in favor of a relatively fine-grained summarization
scheme is that it would pave the way for using the WAL summary data for
other purposes in the future. That could be done orthogonally to the
other solutions to the redo pointer issues.
Other potential use cases:
- only restore parts of a base backup that aren't going to be overwritten by
WAL replay
- reconstructing database contents from WAL after data loss
- more efficient pg_rewind
- more efficient prefetching during WAL replay
Greetings,
Andres Freund
Hi,
In the limited time that I've had to work on this project lately, I've
been trying to come up with a test case for this feature -- and since
I've gotten completely stuck, I thought it might be time to post and
see if anyone else has a better idea. I thought a reasonable test case
would be: Do a full backup. Change some stuff. Do an incremental
backup. Restore both backups and perform replay to the same LSN. Then
compare the files on disk. But I cannot make this work. The first
problem I ran into was that replay of the full backup does a
restartpoint, while the replay of the incremental backup does not.
That results in, for example, pg_subtrans having different contents.
I'm not sure whether it can also result in data files having different
contents: are changes that we replayed following the last restartpoint
guaranteed to end up on disk when the server is shut down? It wasn't
clear to me that this is the case. I thought maybe I could get both
servers to perform a restartpoint at the same location by shutting
down the primary and then replaying through the shutdown checkpoint,
but that doesn't work because the primary doesn't finish archiving
before shutting down. After some more fiddling I settled (at least for
research purposes) on having the restored backups PITR and promote,
instead of PITR and pause, so that we're guaranteed a checkpoint. But
that just caused me to run into a far worse problem: replay on the
standby doesn't actually create a state that is byte-for-byte
identical to the one that exists on the primary. I quickly discovered
that in my test case, I was ending up with different contents in the
"hole" of a block wherein a tuple got updated. Replay doesn't think
it's important to make the hole end up with the same contents on all
machines that replay the WAL, so I end up with one server that has
more junk in there than the other one and the tests fail.
Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents? But I
feel like it would be easy to have bugs that escape detection in such
a test but would be detected by a physical comparison of the clusters.
However, such a comparison can only be conducted if either (a) there's
some way to set up the test so that byte-for-byte identical clusters
can be expected or (b) there's some way to perform the comparison that
can distinguish between expected, harmless differences and unexpected,
problematic differences. And at the moment my conclusion is that
neither (a) nor (b) exists. Does anyone think otherwise?
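One idea in the direction of (b), sketched here only -- this assumes
the standard PageHeaderData layout from bufpage.h and is not code from
any posted patch -- would be to mask out the hole between pd_lower and
pd_upper before comparing relation pages:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the fixed part of PageHeaderData (24 bytes). */
typedef struct
{
    uint64_t    pd_lsn;
    uint16_t    pd_checksum;
    uint16_t    pd_flags;
    uint16_t    pd_lower;       /* offset to start of free space */
    uint16_t    pd_upper;       /* offset to end of free space */
    uint16_t    pd_special;
    uint16_t    pd_pagesize_version;
    uint32_t    pd_prune_xid;
} PageHeaderMirror;

/*
 * Compare two images of the same block, ignoring the hole between
 * pd_lower and pd_upper, which WAL replay does not promise to
 * reproduce byte-for-byte.
 */
static bool
pages_equal_ignoring_hole(const unsigned char *a, const unsigned char *b,
                          size_t blcksz)
{
    const PageHeaderMirror *ha = (const PageHeaderMirror *) a;

    /* If the headers differ at all, the pages genuinely differ. */
    if (memcmp(a, b, sizeof(PageHeaderMirror)) != 0)
        return false;

    /* On a corrupt-looking header, fall back to a full comparison. */
    if (ha->pd_lower > ha->pd_upper || ha->pd_upper > blcksz)
        return memcmp(a, b, blcksz) == 0;

    /* Compare everything before the hole and everything after it. */
    return memcmp(a, b, ha->pd_lower) == 0 &&
        memcmp(a + ha->pd_upper, b + ha->pd_upper,
               blcksz - ha->pd_upper) == 0;
}

Even with that, the non-relation files (pg_subtrans and friends) and
any other differences that replay is allowed to produce would still
need their own handling, so this is at best a partial answer.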
Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.
...Robert
Attachments:
0004-Refactor-parse_filename_for_nontemp_relation-to-pars.patch
From 816266cb17cbcea889a13de1e146fcb5ebfe2066 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:15 -0400
Subject: [PATCH 4/9] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It could be done separately from the rest of this commit, but that
would be more involved, whereas this way it's a 1-line change.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index cc3d2e0c41..24c038dfba 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1198,9 +1198,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1250,23 +1250,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..31d6e01106 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist
+ * at all -- but if 0017.3 does exist, it's a non-relation file, not
+ * part of the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
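For reference, a hypothetical caller of the revised function might look
like this (the file name is made up; only the signature comes from the
patch above):

/* Somewhere with postgres.h and storage/reinit.h included. */
RelFileNumber relnumber;
ForkNumber  fork;
unsigned    segno;

/* "16384_fsm.2" parses as relfilenumber 16384, FSM fork, segment 2. */
if (parse_filename_for_nontemp_relation("16384_fsm.2", &relnumber,
                                        &fork, &segno))
    elog(DEBUG2, "relfilenumber %u, fork %d, segment %u",
         relnumber, fork, segno);
else
    elog(DEBUG2, "not a relation file");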
0002-In-basebackup.c-refactor-to-create-read_file_data_in.patch
From f3aebf944f13080b108cbc1b0247e22dc0b8a187 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:14 -0400
Subject: [PATCH 2/9] In basebackup.c, refactor to create
read_file_data_into_buffer.
This further reduces the length and complexity of sendFile(),
hopefully making it easier to understand and modify. In addition
to moving some logic into a new function, I took this opportunity
to make a few slight adjustments to sendFile() itself, including
renaming the 'len' variable to 'bytes_done', since we use it to represent
the number of bytes we've already handled so far, not the total
length of the file.
---
src/backend/backup/basebackup.c | 231 ++++++++++++++++++--------------
1 file changed, 133 insertions(+), 98 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 0daf8257bc..f46f930329 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,12 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static off_t read_file_data_into_buffer(bbsink *sink,
+ const char *readfilename, int fd,
+ off_t offset, size_t length,
+ BlockNumber blkno,
+ bool verify_checksum,
+ int *checksum_failures);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -1490,9 +1496,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
BlockNumber blkno = 0;
int checksum_failures = 0;
off_t cnt;
- int i;
- pgoff_t len = 0;
- char *page;
+ pgoff_t bytes_done = 0;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1514,6 +1518,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
_tarWriteHeader(sink, tarfilename, NULL, statbuf, false);
+ /*
+ * Checksums are verified in multiples of BLCKSZ, so the buffer length
+ * should be a multiple of the block size as well.
+ */
+ Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
+
if (!noverify_checksums && DataChecksumsEnabled())
{
char *filename;
@@ -1551,23 +1561,21 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
/* Try to read some more data. */
- cnt = basebackup_read_file(fd, sink->bbs_buffer,
- Min(sink->bbs_buffer_length, remaining),
- len, readfilename, true);
+ cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
+ remaining,
+ blkno + segmentno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
/*
- * The checksums are verified at block level, so we iterate over the
- * buffer in chunks of BLCKSZ, after making sure that
- * TAR_SEND_SIZE/buf is divisible by BLCKSZ and we read a multiple of
- * BLCKSZ bytes.
+ * If the amount of data we were able to read was not a multiple of
+ * BLCKSZ, we cannot verify checksums, which are block-level.
*/
- Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
-
if (verify_checksum && (cnt % BLCKSZ != 0))
{
ereport(WARNING,
@@ -1578,84 +1586,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
verify_checksum = false;
}
- if (verify_checksum)
- {
- for (i = 0; i < cnt / BLCKSZ; i++)
- {
- int reread_cnt;
- uint16 expected_checksum;
-
- page = sink->bbs_buffer + BLCKSZ * i;
-
- /* If the page is OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /*
- * Retry the block on the first failure. It's possible that
- * we read the first 4K page of the block just before postgres
- * updated the entire block so it ends up looking torn to us.
- * If, before we retry the read, the concurrent write of the
- * block finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually happen,
- * though: the torn write could take an arbitrarily long time
- * to complete. Retrying multiple times wouldn't fix this
- * problem, either, though it would reduce the chances of it
- * happening in practice. The only real fix here seems to be
- * to have some kind of interlock that allows us to wait until
- * we can be certain that no write to the block is in
- * progress. Since we don't have any such thing right now, we
- * just do this and hope for the best.
- */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent truncation must
- * have occurred, so break out of this loop just as if the
- * initial fread() returned 0. We'll drop through to the
- * same code that handles that case. (We must fix up cnt
- * first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* If the page now looks OK, go on to the next one. */
- if (verify_page_checksum(page, sink->bbs_state->startptr,
- blkno + i + segmentno * RELSEG_SIZE,
- &expected_checksum))
- continue;
-
- /* Handle checksum failure. */
- checksum_failures++;
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno + i, expected_checksum,
- ((PageHeader) page)->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
-
- /* Update block number for next pass through the outer loop. */
- blkno += i;
- }
-
/*
* If we hit end-of-file, a concurrent truncation must have occurred.
* That's not an error condition, because WAL replay will fix things
@@ -1664,6 +1594,10 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (cnt == 0)
break;
+ /* Update block number and # of bytes done for next loop iteration. */
+ blkno += cnt / BLCKSZ;
+ bytes_done += cnt;
+
/* Archive the data we just read. */
bbsink_archive_contents(sink, cnt);
@@ -1671,14 +1605,12 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
if (pg_checksum_update(&checksum_ctx,
(uint8 *) sink->bbs_buffer, cnt) < 0)
elog(ERROR, "could not update checksum of base backup");
-
- len += cnt;
}
/* If the file was truncated while we were sending it, pad it with zeros */
- while (len < statbuf->st_size)
+ while (bytes_done < statbuf->st_size)
{
- size_t remaining = statbuf->st_size - len;
+ size_t remaining = statbuf->st_size - bytes_done;
size_t nbytes = Min(sink->bbs_buffer_length, remaining);
MemSet(sink->bbs_buffer, 0, nbytes);
@@ -1687,7 +1619,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
nbytes) < 0)
elog(ERROR, "could not update checksum of base backup");
bbsink_archive_contents(sink, nbytes);
- len += nbytes;
+ bytes_done += nbytes;
}
/*
@@ -1695,7 +1627,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
* of data is probably not worth throttling, and is not checksummed
* because it's not actually part of the file.)
*/
- _tarWritePadding(sink, len);
+ _tarWritePadding(sink, bytes_done);
CloseTransientFile(fd);
@@ -1718,6 +1650,109 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Read some more data from the file into the bbsink's buffer, verifying
+ * checksums as required.
+ *
+ * 'offset' is the file offset from which we should begin to read, and
+ * 'length' is the amount of data that should be read. The actual amount
+ * of data read will be less than the requested amount if the bbsink's
+ * buffer isn't big enough to hold it all, or if the underlying file has
+ * been truncated. The return value is the number of bytes actually read.
+ *
+ * 'blkno' is the block number of the first page in the bbsink's buffer
+ * relative to the start of the relation.
+ *
+ * 'verify_checksum' indicates whether we should try to verify checksums
+ * for the blocks we read. If we do this, we'll update *checksum_failures
+ * and issue warnings as appropriate.
+ */
+static off_t
+read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
+ off_t offset, size_t length, BlockNumber blkno,
+ bool verify_checksum, int *checksum_failures)
+{
+ off_t cnt;
+ int i;
+ char *page;
+
+ /* Try to read some more data. */
+ cnt = basebackup_read_file(fd, sink->bbs_buffer,
+ Min(sink->bbs_buffer_length, length),
+ offset, readfilename, true);
+
+ /* Can't verify checksums if read length is not a multiple of BLCKSZ. */
+ if (!verify_checksum || (cnt % BLCKSZ) != 0)
+ return cnt;
+
+ /* Verify checksum for each block. */
+ for (i = 0; i < cnt / BLCKSZ; i++)
+ {
+ int reread_cnt;
+ uint16 expected_checksum;
+
+ page = sink->bbs_buffer + BLCKSZ * i;
+
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /*
+ * Retry the block on the first failure. It's possible that we read
+ * the first 4K page of the block just before postgres updated the
+ * entire block so it ends up looking torn to us. If, before we retry
+ * the read, the concurrent write of the block finishes, the page LSN
+ * will be updated and we'll realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen, though: the
+ * torn write could take an arbitrarily long time to complete.
+ * Retrying multiple times wouldn't fix this problem, either, though
+ * it would reduce the chances of it happening in practice. The only
+ * real fix here seems to be to have some kind of interlock that
+ * allows us to wait until we can be certain that no write to the
+ * block is in progress. Since we don't have any such thing right now,
+ * we just do this and hope for the best.
+ */
+ reread_cnt =
+ basebackup_read_file(fd, sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, offset + BLCKSZ * i,
+ readfilename, false);
+ if (reread_cnt == 0)
+ {
+ /*
+ * If we hit end-of-file, a concurrent truncation must have
+ * occurred, so reduce cnt to reflect only the blocks already
+ * processed and break out of this loop.
+ */
+ cnt = BLCKSZ * i;
+ break;
+ }
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr, blkno + i,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ (*checksum_failures)++;
+ if (*checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (*checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
+ }
+
+ return cnt;
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
--
2.37.1 (Apple Git-137.1)
0001-In-basebackup.c-refactor-to-create-verify_page_check.patch
From b65fb32ca1474a1158d5490970b9eb147fbe4f47 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:40:07 -0400
Subject: [PATCH 1/9] In basebackup.c, refactor to create verify_page_checksum.
If checksum verification fails for a particular page, we reread the
page and try one more time. The code that does this is somewhat complex
and difficult to follow. Move some of the logic into a new function
and rearrange the code a bit to try to make it clearer. This way,
we don't need the block_retry Boolean, a couple of other variables
move from sendFile() into the new function, and some code is now less
deeply indented.
---
src/backend/backup/basebackup.c | 188 ++++++++++++++++++--------------
1 file changed, 104 insertions(+), 84 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..0daf8257bc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -83,6 +83,9 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid,
backup_manifest_info *manifest, const char *spcoid);
+static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
+ BlockNumber blkno,
+ uint16 *expected_checksum);
static void sendFileWithContent(bbsink *sink, const char *filename,
const char *content,
backup_manifest_info *manifest);
@@ -1485,14 +1488,11 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
int fd;
BlockNumber blkno = 0;
- bool block_retry = false;
- uint16 checksum;
int checksum_failures = 0;
off_t cnt;
int i;
pgoff_t len = 0;
char *page;
- PageHeader phdr;
int segmentno = 0;
char *segmentpath;
bool verify_checksum = false;
@@ -1582,94 +1582,78 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
{
for (i = 0; i < cnt / BLCKSZ; i++)
{
+ int reread_cnt;
+ uint16 expected_checksum;
+
page = sink->bbs_buffer + BLCKSZ * i;
+ /* If the page is OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
/*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * Retry the block on the first failure. It's possible that
+ * we read the first 4K page of the block just before postgres
+ * updated the entire block so it ends up looking torn to us.
+ * If, before we retry the read, the concurrent write of the
+ * block finishes, the page LSN will be updated and we'll
+ * realize that we should ignore this block.
+ *
+ * There's no guarantee that this will actually happen,
+ * though: the torn write could take an arbitrarily long time
+ * to complete. Retrying multiple times wouldn't fix this
+ * problem, either, though it would reduce the chances of it
+ * happening in practice. The only real fix here seems to be
+ * to have some kind of interlock that allows us to wait until
+ * we can be certain that no write to the block is in
+ * progress. Since we don't have any such thing right now, we
+ * just do this and hope for the best.
*/
- if (!PageIsNew(page) && PageGetLSN(page) < sink->bbs_state->startptr)
+ reread_cnt =
+ basebackup_read_file(fd,
+ sink->bbs_buffer + BLCKSZ * i,
+ BLCKSZ, len + BLCKSZ * i,
+ readfilename,
+ false);
+ if (reread_cnt == 0)
{
- checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
- phdr = (PageHeader) page;
- if (phdr->pd_checksum != checksum)
- {
- /*
- * Retry the block on the first failure. It's
- * possible that we read the first 4K page of the
- * block just before postgres updated the entire block
- * so it ends up looking torn to us. If, before we
- * retry the read, the concurrent write of the block
- * finishes, the page LSN will be updated and we'll
- * realize that we should ignore this block.
- *
- * There's no guarantee that this will actually
- * happen, though: the torn write could take an
- * arbitrarily long time to complete. Retrying
- * multiple times wouldn't fix this problem, either,
- * though it would reduce the chances of it happening
- * in practice. The only real fix here seems to be to
- * have some kind of interlock that allows us to wait
- * until we can be certain that no write to the block
- * is in progress. Since we don't have any such thing
- * right now, we just do this and hope for the best.
- */
- if (block_retry == false)
- {
- int reread_cnt;
-
- /* Reread the failed block */
- reread_cnt =
- basebackup_read_file(fd,
- sink->bbs_buffer + BLCKSZ * i,
- BLCKSZ, len + BLCKSZ * i,
- readfilename,
- false);
- if (reread_cnt == 0)
- {
- /*
- * If we hit end-of-file, a concurrent
- * truncation must have occurred, so break out
- * of this loop just as if the initial fread()
- * returned 0. We'll drop through to the same
- * code that handles that case. (We must fix
- * up cnt first, though.)
- */
- cnt = BLCKSZ * i;
- break;
- }
-
- /* Set flag so we know a retry was attempted */
- block_retry = true;
-
- /* Reset loop to validate the block again */
- i--;
- continue;
- }
-
- checksum_failures++;
-
- if (checksum_failures <= 5)
- ereport(WARNING,
- (errmsg("checksum verification failed in "
- "file \"%s\", block %u: calculated "
- "%X but expected %X",
- readfilename, blkno, checksum,
- phdr->pd_checksum)));
- if (checksum_failures == 5)
- ereport(WARNING,
- (errmsg("further checksum verification "
- "failures in file \"%s\" will not "
- "be reported", readfilename)));
- }
+ /*
+ * If we hit end-of-file, a concurrent truncation must
+ * have occurred, so break out of this loop just as if the
+ * initial fread() returned 0. We'll drop through to the
+ * same code that handles that case. (We must fix up cnt
+ * first, though.)
+ */
+ cnt = BLCKSZ * i;
+ break;
}
- block_retry = false;
- blkno++;
+
+ /* If the page now looks OK, go on to the next one. */
+ if (verify_page_checksum(page, sink->bbs_state->startptr,
+ blkno + i + segmentno * RELSEG_SIZE,
+ &expected_checksum))
+ continue;
+
+ /* Handle checksum failure. */
+ checksum_failures++;
+ if (checksum_failures <= 5)
+ ereport(WARNING,
+ (errmsg("checksum verification failed in "
+ "file \"%s\", block %u: calculated "
+ "%X but expected %X",
+ readfilename, blkno + i, expected_checksum,
+ ((PageHeader) page)->pd_checksum)));
+ if (checksum_failures == 5)
+ ereport(WARNING,
+ (errmsg("further checksum verification "
+ "failures in file \"%s\" will not "
+ "be reported", readfilename)));
}
+
+ /* Update block number for next pass through the outer loop. */
+ blkno += i;
}
/*
@@ -1734,6 +1718,42 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
return true;
}
+/*
+ * Try to verify the checksum for the provided page, if it seems appropriate
+ * to do so.
+ *
+ * Returns true if verification succeeds or if we decide not to check it,
+ * and false if verification fails. When returning false, it also sets
+ * *expected_checksum to the computed value.
+ */
+static bool
+verify_page_checksum(Page page, XLogRecPtr start_lsn, BlockNumber blkno,
+ uint16 *expected_checksum)
+{
+ PageHeader phdr;
+ uint16 checksum;
+
+ /*
+ * Only check pages which have not been modified since the start of the
+ * base backup. Otherwise, they might have been written only halfway and
+ * the checksum would not be valid. However, replaying WAL would
+ * reinstate the correct page in this case. We also skip completely new
+ * pages, since they don't have a checksum yet.
+ */
+ if (PageIsNew(page) || PageGetLSN(page) >= start_lsn)
+ return true;
+
+ /* Perform the actual checksum calculation. */
+ checksum = pg_checksum_page(page, blkno);
+
+ /* See whether it matches the value from the page. */
+ phdr = (PageHeader) page;
+ if (phdr->pd_checksum == checksum)
+ return true;
+ *expected_checksum = checksum;
+ return false;
+}
+
static int64
_tarWriteHeader(bbsink *sink, const char *filename, const char *linktarget,
struct stat *statbuf, bool sizeonly)
--
2.37.1 (Apple Git-137.1)
0005-Change-how-a-base-backup-decides-which-files-have-ch.patch
From 23397800976a006b4280617003eec4413898f955 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:19 -0400
Subject: [PATCH 5/9] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 173 +++++++++++---------------------
1 file changed, 56 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 24c038dfba..64ab54fe06 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,41 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork
+ * with the same RelFileNumber. If so, the file can be
+ * excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1413,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1436,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1450,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1458,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1483,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1505,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
0003-Change-struct-tablespaceinfo-s-oid-member-from-char-.patch
From de555bebd21507637a738c20d2933e05662404c4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:00 -0400
Subject: [PATCH 3/9] Change struct tablespaceinfo's oid member from 'char *'
to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..6f38d0eb9a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8484,9 +8484,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can
+ * be represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8561,7 +8574,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..4ff4430006 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index f46f930329..cc3d2e0c41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1419,8 +1420,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1489,8 +1490,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
Attachment: 0006-Move-src-bin-pg_verifybackup-parse_manifest.c-into-s.patch (application/octet-stream)
From d592bff5496ba8e11c2846532290a7b72e2e3b80 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 5 Jun 2023 15:42:28 -0400
Subject: [PATCH 6/9] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..e4cd26762b 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -65,6 +65,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 53942a9a61..a9ff7f9db8 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -17,6 +17,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
Attachment: 0008-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From d25598e3129566556eda0a161ef0763d115b6f25 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH 8/9] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
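(Not part of the commit message: for illustration, here is a hypothetical
invocation, with invented OIDs, block numbers, and summary file name; the
output lines follow the printf formats in pg_walsummary.c below.

pg_walsummary $PGDATA/pg_wal/summaries/0000000100000000028A24E800000000028B5C10.summary
TS 1663, DB 5, REL 16385, FORK main: limit 0
TS 1663, DB 5, REL 16386, FORK main: blocks 0..16
TS 1663, DB 5, REL 16386, FORK vm: block 0

With -i, a range like "blocks 0..16" is instead listed one block per line,
and -q parses the files without printing anything.)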
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ /* Zero the option flags so they default to false if no switches are given. */
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
Attachment: 0009-Add-TAP-tests-this-is-broken-doesn-t-work.patch (application/octet-stream)
From c0cce1fc702d273d667a4356e2059837fc152b44 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Aug 2023 12:56:01 -0400
Subject: [PATCH 9/9] Add TAP tests (this is broken, doesn't work).
---
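(Once the series is applied, the tests should be runnable in the usual way,
for example:

make -C src/bin/pg_combinebackup check

or via the corresponding meson test target. As the subject line says, they
do not pass yet.)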
src/bin/pg_combinebackup/Makefile | 6 +
src/bin/pg_combinebackup/meson.build | 8 +-
src/bin/pg_combinebackup/t/001_basic.pl | 23 ++
.../pg_combinebackup/t/002_compare_backups.pl | 276 ++++++++++++++++++
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
5 files changed, 332 insertions(+), 2 deletions(-)
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
index cb20480aae..78ba05e624 100644
--- a/src/bin/pg_combinebackup/Makefile
+++ b/src/bin/pg_combinebackup/Makefile
@@ -44,3 +44,9 @@ uninstall:
clean distclean maintainer-clean:
rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
index bea0db405e..a6036dea74 100644
--- a/src/bin/pg_combinebackup/meson.build
+++ b/src/bin/pg_combinebackup/meson.build
@@ -25,5 +25,11 @@ bin_targets += pg_combinebackup
tests += {
'name': 'pg_combinebackup',
'sd': meson.current_source_dir(),
- 'bd': meson.current_build_dir()
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
}
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..c27f999a32
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,276 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr1->stop;
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr2->stop;
+
+my $cmp = compare_data_directories($pitr1->basedir . '/pgdata',
+ $pitr2->basedir . '/pgdata', '');
+is($cmp, 0, "directories are identical");
+
+done_testing();
+
+sub compare_data_directories
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $result = 0;
+
+ if ($relpath eq '/pg_wal')
+ {
+ # Since recovery started at different LSNs, pg_wal contents may not
+ # be identical. Ignore that.
+ return 0;
+ }
+
+ my $dir1 = $basedir1 . $relpath;
+ my $dir2 = $basedir2 . $relpath;
+
+ opendir(DIR1, $dir1) || die "$dir1: $!";
+ my @files1 = grep { $_ ne '.' && $_ ne '..' } readdir(DIR1);
+ closedir(DIR1);
+
+ opendir(DIR2, $dir2) || die "$dir2: $!";
+ my %files2 = map { $_ => 'unmatched' }
+ grep { $_ ne '.' && $_ ne '..' } readdir(DIR2);
+ closedir(DIR2);
+
+ for my $fname (@files1)
+ {
+ if (!exists $files2{$fname})
+ {
+ warn "$dir1/$fname exists but $dir2/$fname does not";
+ ++$result;
+ next;
+ }
+
+ $files2{$fname} = 'matched';
+
+ if (-d "$dir1/$fname")
+ {
+ if (! -d "$dir2/$fname")
+ {
+ warn "$dir1/$fname is a directory but $dir2/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ $result +=
+ compare_data_directories($basedir1, $basedir2,
+ "$relpath/$fname");
+ }
+ }
+ elsif (-d "$dir2/$fname")
+ {
+ warn "$dir2/$fname is a directory but $dir1/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ # Both are plain files.
+ $result += compare_files($basedir1, $basedir2, "$relpath/$fname");
+ }
+ }
+
+ for my $fname (keys %files2)
+ {
+ if ($files2{$fname} eq 'unmatched')
+ {
+ warn "$dir2/$fname exists but $dir1/$fname does not";
+ ++$result;
+ }
+ }
+
+ return $result;
+}
+
+sub compare_files
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $file1 = $basedir1 . $relpath;
+ my $file2 = $basedir2 . $relpath;
+
+ if ($relpath eq '/backup_manifest')
+ {
+ # We don't expect the backup manifest to be identical between two
+ # backups taken at different times, so just disregard it.
+ return 0;
+ }
+
+ if ($relpath eq '/backup_label.old')
+ {
+ # We don't expect the backup label to be identical; the start WAL
+ # location and probably also the start time are expected to be
+ # different.
+ return 0;
+ }
+
+ if ($relpath eq '/postgresql.conf')
+ {
+ # At least the port numbers are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/postmaster.opts')
+ {
+ # At least the cluster names are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/global/pg_control')
+ {
+ # At least the mock authentication nonce is expected to be different,
+ # so disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/pg_stat/pgstat.stat')
+ {
+ # Stats aren't stable enough to be compared here.
+ return 0;
+ }
+
+ if ($relpath =~ m@/pg_internal\.init$@)
+ {
+ # relcache init files are rebuilt at startup, so they don't need to
+ # match. And because we write out the contents of data structures like
+ # RelationData that include pointers, they almost certainly won't.
+ return 0;
+ }
+
+ # Check whether the lengths match.
+ my $length1 = -s $file1;
+ my $length2 = -s $file2;
+ if ($length1 != $length2)
+ {
+ warn "$file1 has length $length1, but $file2 has length $length2";
+ return 1;
+ }
+
+ # Compare contents.
+ my $contents1 = slurp_file($file1);
+ my $contents2 = slurp_file($file2);
+ if ($contents1 ne $contents2)
+ {
+ my $nchars = 1;
+ while (substr($contents1, 0, $nchars) eq substr($contents2, 0, $nchars))
+ {
+ ++$nchars;
+ }
+ warn sprintf("%s and %s are both of length %s, but differ beginning at byte %d",
+ $file1, $file2, $length1, $nchars - 1);
+ return 1;
+ }
+
+ # Files are identical.
+ return 0;
+}
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 2a478ba6ed..3b57379e13 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
--
2.37.1 (Apple Git-137.1)
Attachment: 0007-Prototype-patch-for-incremental-and-differential-bac.patch (application/octet-stream)
From d76b71f72567cfb32340f870cf768e31ea3461c6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH 7/9] Prototype patch for incremental and differential backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by the new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This needs a lot of work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions.
---
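(To make the incremental file format concrete: judging from the sendFile()
changes below, an INCREMENTAL.${ORIGINAL_NAME} file begins with a header

uint32 magic number (INCREMENTAL_MAGIC)
uint32 number of blocks included in the file
uint32 truncation block length
uint32 block number, repeated once per included block

followed by the BLCKSZ-byte contents of each included block, in the same
order as the block numbers. The backup_label of an incremental backup also
gains two lines, "INCREMENTAL FROM LSN" and "INCREMENTAL FROM TLI", which
is how pg_combinebackup and the server recognize that the directory is not
usable as-is without reconstruction.)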
src/backend/access/transam/xlog.c | 97 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 10 +-
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 340 +++-
src/backend/backup/basebackup_incremental.c | 867 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 108 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 46 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 29 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1268 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 24 +
64 files changed, 8421 insertions(+), 69 deletions(-)
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f38d0eb9a..3e19ec9ad1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3500,6 +3501,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3779,8 +3817,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3823,6 +3861,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5147,9 +5205,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6830,6 +6888,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7504,6 +7573,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
@@ -8490,8 +8573,8 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
/*
* Try to parse the directory name as an unsigned integer.
*
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
* garbage. If we come across a name that doesn't meet those
* criteria, skip it.
*/
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 23461c9d2c..3ad6b679d5 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 4ff4430006..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
@@ -1340,7 +1346,7 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
- char *endp;
+ char *endp;
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1363,7 +1369,7 @@ read_tablespace_map(List **tablespaces)
ti = palloc0(sizeof(tablespaceinfo));
errno = 0;
ti->oid = strtoul(str, &endp, 10);
- if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
ereport(FATAL,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 64ab54fe06..8aea2a4a76 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -75,22 +78,37 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink *sink;
+ size_t bytes_sent;
+} FileChunkContext;
+
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +120,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +239,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +290,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +313,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +354,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +372,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +634,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +710,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +789,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +990,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1014,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1059,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1134,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1168,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1188,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1148,7 +1196,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isRelationDir = false; /* Does directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
/*
@@ -1182,14 +1231,17 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relfilenumber = InvalidRelFileNumber;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
ForkNumber relForkNum = InvalidForkNumber;
unsigned segno = 0;
bool isRelationFile = false;
@@ -1256,9 +1308,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
char initForkFile[MAXPGPATH];
/*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
*/
snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relfilenumber);
@@ -1332,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1405,27 +1458,79 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
- if (sent || sizeonly)
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
@@ -1444,6 +1549,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1451,7 +1562,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1460,6 +1572,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1492,22 +1605,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
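+ * The header layout is: magic number, block count, truncation block
+ * length, and then the array of relative block numbers; the block
+ * contents follow. GetIncrementalFileSize assumes this same layout.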
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1690,6 +1892,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
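+ *
+ * For example, sendFile above emits the incremental file header with a
+ * series of calls to this function and then flushes whatever remains
+ * in the buffer afterward.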
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..b70eeb0282
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,867 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
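+ *
+ * For example, if this server's history is timeline 1, then 2, then 3,
+ * and the manifest contains WAL ranges for timelines 1 and 2, we will
+ * conclude that timeline 1 holds the earliest range and timeline 2 the
+ * latest.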
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
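+ *
+ * For example, with hypothetical OIDs, segment 1 of relation file
+ * "base/16384/16385" would be sent as "base/16384/INCREMENTAL.16385.1".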
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
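+ *
+ * For example, with the default BLCKSZ of 8192, a full 1GB segment
+ * contains 131072 blocks, so we fall back to sending the whole file
+ * once more than 117964 of its blocks require sending.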
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
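+ *
+ * For example, with the default BLCKSZ of 8192, an incremental file
+ * containing two blocks occupies 3 * 4 + 2 * (4 + 8192) = 16404 bytes.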
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
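+ *
+ * Filenames encode the TLI and the start and end LSNs as five 8-digit
+ * hex fields. For example, "0000000100000000010000000000000001020304.summary"
+ * covers LSNs 0/1000000 through 0/1020304 on timeline 1.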
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
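+ *
+ * For example, summaries covering [0/100, 0/300) and [0/200, 0/500)
+ * together prove completeness out to 0/500; but if the second summary
+ * instead began at 0/400, we would report 0/300 as the first missing
+ * LSN.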
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != EEXIST || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
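+ *
+ * Expected usage is something like:
+ *
+ *     SELECT * FROM pg_available_wal_summaries();
+ *
+ * which returns one row per summary file, giving its TLI and its start
+ * and end LSNs.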
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
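+ *
+ * For example, one might run:
+ *
+ *     SELECT * FROM pg_wal_summary_contents(1, '0/1000000', '0/1020304');
+ *
+ * Each output row names a relation fork and a block number; rows with
+ * limit_block = true convey a truncation point rather than a modified
+ * block.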
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d7bfb28ff3..8ae8291a3b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -114,6 +114,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -560,6 +563,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1846,6 +1850,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2713,6 +2720,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3066,6 +3075,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3184,6 +3194,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the WAL summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3579,6 +3603,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3729,6 +3759,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3755,6 +3787,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3852,6 +3885,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4073,6 +4107,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5379,6 +5415,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5515,6 +5555,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
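+ *
+ * The summarizer is started only when wal_summarize_mb is nonzero, the
+ * postmaster is in normal running or hot standby state, and nothing
+ * stronger than a smart shutdown has been requested.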
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..926b6c6ae4
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and
+ * LSN at which the last file ended; in that case, lsn_is_exact is
+ * true. If, however, the LSN is just an approximation, then lsn_is_exact
+ * is false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of
+ * a record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a
+ * new WAL summary file should be written as soon as possible, without
+ * trying to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered
+ * by a single summary file. If we read a WAL record that ends after
+ * the cutoff LSN computed here, we'll stop the summary. In most cases,
+ * it will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as requested. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
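+ *
+ * A caller might, hypothetically, use this as follows:
+ *
+ * if (WaitForWalSummarization(end_lsn, 60000) < end_lsn)
+ * ereport(ERROR,
+ * (errmsg("timed out waiting for WAL summarization")));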
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a
+ * WAL summary file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and
+ * if so, stop. This has the same purpose as the earlier check for
+ * the same condition above, but there we've just read a record and
+ * might decide against including it in the current summary file,
+ * whereas here we've already included it and might decide against
+ * reading the next one. Note that we may have just refreshed our
+ * notion of the redo pointer, so it's smart to check here before we
+ * do any more work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
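+ /*
+ * As an illustration (hypothetical values), a summary on TLI 1
+ * covering 0/01000028 through 0/01FFFFD8 would be named
+ * 0000000100000000010000280000000001FFFFD8.summary.
+ */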
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file; the xlogreader was already freed above. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
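+
+ /*
+ * Illustrative arithmetic: starting from one quantum (200ms), about
+ * nine consecutive idle wakeups are needed to reach the 60-second cap,
+ * since repeated doubling from 1 first exceeds MAX_SLEEP_QUANTA (300)
+ * at 512 quanta; a sufficiently large burst of page reads drops the
+ * sleep time straight back to the minimum.
+ */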
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove old WAL summary files, if removal is enabled. A summary file is
+ * removed only once the corresponding WAL is gone and the file is older
+ * than the configured retention time; we check at most once per
+ * checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * (time_t) wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 80374c55be..a73d15fdd5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
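+ *
+ * Protocol sketch, as implemented below: we reply with a CopyInResponse;
+ * the client then streams the manifest as CopyData ('d') messages and
+ * finishes with CopyDone ('c'), after which the accumulated manifest is
+ * parsed and retained for use by a later BASE_BACKUP command.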
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in copy-in mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5551afffc0..ff0660656c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, SnapMgrShmemSize());
@@ -292,6 +294,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 811ad94742..4a315bfe93 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 13774254d2..2ea39fd824 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ WAIT_EVENT_RECOVERY_WAL_STREAM RecoveryWalStream "Waiting in main loop of startu
WAIT_EVENT_SYSLOGGER_MAIN SysLoggerMain "Waiting in main loop of syslogger process."
WAIT_EVENT_WAL_RECEIVER_MAIN WalReceiverMain "Waiting in main loop of WAL receiver process."
WAIT_EVENT_WAL_SENDER_MAIN WalSenderMain "Waiting in main loop of WAL sender process."
+WAIT_EVENT_WAL_SUMMARIZER_WAL WalSummarizerWal "Waiting in WAL summarizer for more WAL to be generated."
WAIT_EVENT_WAL_WRITER_MAIN WalWriterMain "Waiting in main loop of WAL writer process."
@@ -140,6 +141,7 @@ WAIT_EVENT_SAFE_SNAPSHOT SafeSnapshot "Waiting to obtain a valid snapshot for a
WAIT_EVENT_SYNC_REP SyncRep "Waiting for confirmation from a remote server during synchronous replication."
WAIT_EVENT_WAL_RECEIVER_EXIT WalReceiverExit "Waiting for the WAL receiver to exit."
WAIT_EVENT_WAL_RECEIVER_WAIT_START WalReceiverWaitStart "Waiting for startup process to send initial data for streaming replication."
+WAIT_EVENT_WAL_SUMMARY_READY WalSummaryReady "Waiting for a new WAL summary to be generated."
WAIT_EVENT_XACT_GROUP_UPDATE XactGroupUpdate "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -160,6 +162,7 @@ WAIT_EVENT_REGISTER_SYNC_REQUEST RegisterSyncRequest "Waiting while sending sync
WAIT_EVENT_SPIN_DELAY SpinDelay "Waiting while acquiring a contended spinlock."
WAIT_EVENT_VACUUM_DELAY VacuumDelay "Waiting in a cost-based vacuum delay point."
WAIT_EVENT_VACUUM_TRUNCATE VacuumTruncate "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAIT_EVENT_WAL_SUMMARIZER_ERROR WalSummarizerError "Waiting after a WAL summarizer error."
#
@@ -241,6 +244,8 @@ WAIT_EVENT_WAL_COPY_WRITE WALCopyWrite "Waiting for a write when creating a new
WAIT_EVENT_WAL_INIT_SYNC WALInitSync "Waiting for a newly initialized WAL file to reach durable storage."
WAIT_EVENT_WAL_INIT_WRITE WALInitWrite "Waiting for a write while initializing a new WAL file."
WAIT_EVENT_WAL_READ WALRead "Waiting for a read from a WAL file."
+WAIT_EVENT_WAL_SUMMARY_READ WALSummaryRead "Waiting for a read from a WAL summary file."
+WAIT_EVENT_WAL_SUMMARY_WRITE WALSummaryWrite "Waiting for a write to a WAL summary file."
WAIT_EVENT_WAL_SYNC WALSync "Waiting for a WAL file to reach durable storage."
WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN WALSyncMethodAssign "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAIT_EVENT_WAL_WRITE WALWrite "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1e671c560c..037111b89f 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e565a3092f..91dc345d8b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -61,6 +61,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -702,6 +703,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3169,6 +3172,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c768af9a73..1211de5ea3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -301,6 +301,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 905b979947..09d153ed88 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -226,6 +226,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1dc8efe0cb..3ffe15ac74 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with v16.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000
+
/*
* Different ways to include WAL
*/
@@ -216,7 +221,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -684,6 +690,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1724,7 +1747,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1790,7 +1815,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* We are done reading the manifest; close the file. */
+ close(fd);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1897,6 +1989,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2252,6 +2345,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2288,6 +2382,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2312,7 +2407,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2347,6 +2442,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2756,7 +2854,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cb20480aae
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
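+ /*
+ * Copy the backup_label contents line by line, omitting the INCREMENTAL
+ * FROM lines and feeding everything we do write into the checksum.
+ */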
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
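+ /* Scan forward for a newline; if none is found, eo stops at buf->len. */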
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
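+ /*
+ * Temporarily NUL-terminate the line at e so that sscanf() cannot read
+ * past it; the saved byte is restored immediately afterward.
+ */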
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
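+ /* Same temporary NUL-termination trick as in parse_lsn(). */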
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
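+ /*
+ * Read the source a chunk at a time, writing each chunk to the
+ * destination and folding it into the checksum as we go.
+ */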
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hexadecimal characters, whereas a CRC-32C value is only
+ * 8, and there might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..bea0db405e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..6c7fd3290e
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1268 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ bool progress;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"output", required_argument, NULL, 'o'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"progress", no_argument, NULL, 'P'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'P':
+ opt.progress = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ fsync_pgdata(opt.output, version * 10000);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
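+ /*
+ * check_tli and check_lsn track the "INCREMENTAL FROM" position found in
+ * the backup examined on the previous (later) iteration; each older
+ * backup must start at exactly that position.
+ */
+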
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -P, --progress show progress information\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid; /* wider than Oid so that overflow is detectable */
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || *s == '0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If processing is a user-defined tablespace, the tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
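+ *
+ * For example (hypothetical values): when three backups are being
+ * combined, the toplevel call for the main data directory passes
+ * tsoid = InvalidOid, relative_path = NULL, and n_prior_backups = 2,
+ * with prior_backup_dirs naming the two older backup directories.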
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used a multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
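+
+/*
+ * For orientation, a sketch of the incremental file layout implied by
+ * make_incremental_rfile() and the offsetmap arithmetic below (not an
+ * authoritative format definition):
+ *
+ * magic number (INCREMENTAL_MAGIC)
+ * number of blocks stored in this file
+ * truncation block length
+ * array of relative block numbers, one per stored block
+ * the stored blocks themselves, BLCKSZ bytes each, in array order
+ *
+ * Hence the i'th stored block's image begins at header_length + i * BLCKSZ.
+ */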
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
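+
+ /*
+ * A worked example with hypothetical numbers: if truncation_block_length
+ * is 4 and this incremental file stores blocks 2 and 6, then block 2
+ * reduces num_missing_blocks from 4 to 3 (blocks 0, 1, and 3 must still
+ * be found in older backups), while block 6 lies beyond the truncation
+ * point, so it does not affect the count; blocks 4 and 5 will simply be
+ * zero-filled unless some other source supplies them.
+ */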
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ pfree(s);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
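+ *
+ * For example (hypothetical numbers): with truncation_block_length = 10
+ * and relative block numbers {3, 12}, the result is 13, since block 12
+ * extends the file beyond the truncation point.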
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif /* RECONSTRUCT_H */
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
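+
+/*
+ * For illustration, the overall shape of the manifest this writer emits
+ * (values here are hypothetical; add_file_to_manifest() and
+ * finalize_manifest() below define the actual output):
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ * "Files": [
+ * { "Path": "global/pg_control", "Size": 8192, "Last-Modified": "...",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "..." } ],
+ * "WAL-Ranges": [
+ * { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/2000100" } ],
+ * "Manifest-Checksum": "..."}
+ */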
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 25ecdaaa15..15a40cd17e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -86,6 +86,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -492,6 +493,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1033,6 +1035,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
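+
+ /*
+ * Summary file names consist of a timeline ID (8 hex characters) plus
+ * the start and end LSNs of the summarized WAL range (16 hex characters
+ * each), 40 hex characters in all, followed by a ".summary" suffix.
+ */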
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index e4cd26762b..ef38cc2f03 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
OBJS_COMMON = \
archive.o \
base64.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
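+ *
+ * For example (hypothetical numbers): if the summarized WAL truncates a
+ * fork to 100 blocks and a later record modifies block 105, the entry
+ * ends up with limit block 100 and block 105 marked modified; blocks
+ * 100-104 need not be sourced from any older backup, since they did not
+ * survive the truncation.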
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
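+
+/*
+ * To make the arithmetic concrete (a worked example, not additional
+ * machinery): block number B belongs to chunk B / BLOCKS_PER_CHUNK, at
+ * offset B % BLOCKS_PER_CHUNK within that chunk. A chunk converts to a
+ * bitmap when it reaches MAX_ENTRIES_PER_CHUNK = 65536 / 16 = 4096
+ * entries, at which point the two representations are the same size:
+ * 4096 uint16s, i.e. 8kB, i.e. one bit per block in the chunk.
+ */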
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare the buffer, initialize the CRC, and save the callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for writing to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index a9ff7f9db8..dda018f6d1 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -3,6 +3,7 @@
common_sources = files(
'archive.c',
'base64.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9805bc6118..148f35d9b1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12062,4 +12062,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 33e50ad933..6ba5eca700 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 49a33c0387..56b8270dda 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3989,3 +3989,27 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+FileChunkContext
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
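To illustrate how the pieces in blkreftable.h fit together, here is a
rough sketch of a consumer of the reader API. The callbacks and the
walk_summary_file() wrapper are hypothetical placeholders, not part of
the patch; a real caller would supply whatever io_callback_fn and
report_error_fn make sense for it.

#include "postgres_fe.h"

#include <unistd.h>

#include "common/blkreftable.h"
#include "common/logging.h"

/* Hypothetical read callback: pull bytes from a file descriptor. */
static int
read_from_fd(void *cb_arg, void *data, int length)
{
	int			rc = read(*(int *) cb_arg, data, length);

	if (rc < 0)
		pg_fatal("could not read file: %m");	/* must not return */
	return rc;
}

/* Hypothetical error callback; per the API contract, it must not return. */
static void
fatal_error_cb(void *cb_arg, char *fmt,...)
{
	/* format fmt and the varargs, print them, then bail out */
	exit(1);
}

static void
walk_summary_file(int fd, char *filename)
{
	BlockRefTableReader *reader;
	RelFileLocator rlocator;
	ForkNumber	forknum;
	BlockNumber limit_block;
	BlockNumber blocks[256];
	unsigned	nblocks;
	unsigned	i;

	reader = CreateBlockRefTableReader(read_from_fd, &fd, filename,
									   fatal_error_cb, NULL);
	while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
										   &limit_block))
	{
		/* Blocks >= limit_block must also be treated as modified. */
		while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
													   lengthof(blocks))) > 0)
		{
			for (i = 0; i < nblocks; ++i)
			{
				/* blocks[i] of (rlocator, forknum) was modified */
			}
		}
	}
	DestroyBlockRefTableReader(reader);
}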
Hi Robert,
On 8/30/23 10:49, Robert Haas wrote:
In the limited time that I've had to work on this project lately, I've
been trying to come up with a test case for this feature -- and since
I've gotten completely stuck, I thought it might be time to post and
see if anyone else has a better idea. I thought a reasonable test case
would be: Do a full backup. Change some stuff. Do an incremental
backup. Restore both backups and perform replay to the same LSN. Then
compare the files on disk. But I cannot make this work. The first
problem I ran into was that replay of the full backup does a
restartpoint, while the replay of the incremental backup does not.
That results in, for example, pg_subtrans having different contents.
pg_subtrans, at least, can be ignored since it is excluded from the
backup and not required for recovery.
I'm not sure whether it can also result in data files having different
contents: are changes that we replayed following the last restartpoint
guaranteed to end up on disk when the server is shut down? It wasn't
clear to me that this is the case. I thought maybe I could get both
servers to perform a restartpoint at the same location by shutting
down the primary and then replaying through the shutdown checkpoint,
but that doesn't work because the primary doesn't finish archiving
before shutting down. After some more fiddling I settled (at least for
research purposes) on having the restored backups PITR and promote,
instead of PITR and pause, so that we're guaranteed a checkpoint. But
that just caused me to run into a far worse problem: replay on the
standby doesn't actually create a state that is byte-for-byte
identical to the one that exists on the primary. I quickly discovered
that in my test case, I was ending up with different contents in the
"hole" of a block wherein a tuple got updated. Replay doesn't think
it's important to make the hole end up with the same contents on all
machines that replay the WAL, so I end up with one server that has
more junk in there than the other one and the tests fail.
This is pretty much what I discovered when investigating backup from
standby back in 2016. My (ultimately unsuccessful) efforts to find a
clean delta resulted in [1] as I systematically excluded directories
that are not required for recovery and will not be synced between a
primary and standby.
FWIW Heikki also made similar attempts at this before me (back then I
found the thread but I doubt I could find it again) and arrived at
similar results. We discussed this in person and figured out that we had
come to more or less the same conclusion. Welcome to the club!
Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents? But I
feel like it would be easy to have bugs that escape detection in such
a test but would be detected by a physical comparison of the clusters.
Agreed, though a matching logical result is still very compelling.
However, such a comparison can only be conducted if either (a) there's
some way to set up the test so that byte-for-byte identical clusters
can be expected or (b) there's some way to perform the comparison that
can distinguish between expected, harmless differences and unexpected,
problematic differences. And at the moment my conclusion is that
neither (a) nor (b) exists. Does anyone think otherwise?
I do not. My conclusion back then was that validating a physical
comparison would be nearly impossible without changes to Postgres to
make the primary and standby match via replication. Which, by the way, I
still think would be a great idea. In principle, at least. Replay is
already a major bottleneck and anything that makes it slower will likely
not be very popular.
This would also be great for WAL, since the last time I tested, the same
WAL segment could differ between the primary and standby because the
unused (recycled) portion at the end is not zeroed on the standby as it
is on the primary (though logically they do match). I would be very happy
if somebody told me that my info is out of date here and this has been
fixed, but when I looked at the code it was incredibly tricky to do
because of how WAL is replicated.
Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.
No objections to 0001/0002.
Regards,
-David
[1]: http://git.postgresql.org/pg/commitdiff/6ad8ac6026287e3ccbc4d606b6ab6116ccc0eec8
Hey, thanks for the reply.
On Thu, Aug 31, 2023 at 6:50 PM David Steele <david@pgmasters.net> wrote:
pg_subtrans, at least, can be ignored since it is excluded from the
backup and not required for recovery.
I agree...
Welcome to the club!
Thanks for the welcome, but being a member feels *terrible*. :-)
I do not. My conclusion back then was that validating a physical
comparison would be nearly impossible without changes to Postgres to
make the primary and standby match via replication. Which, by the way, I
still think would be a great idea. In principle, at least. Replay is
already a major bottleneck and anything that makes it slower will likely
not be very popular.
Fair point. But maybe the bigger issue is the work involved. I don't
think zeroing the hole in all cases would likely be that expensive,
but finding everything that can cause the standby to diverge from the
primary and fixing all of it sounds like an unpleasant amount of
effort. Still, it's good to know that I'm not missing something
obvious.
No objections to 0001/0002.
Cool.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
Unless someone has a brilliant idea that I lack, this suggests to me
that this whole line of testing is a dead end. I can, of course, write
tests that compare clusters *logically* -- do the correct relations
exist, are they accessible, do they have the right contents?
Can't we think of comparing at the block level, like we can compare
each block but ignore the content of the hole?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 4, 2023 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Can't we think of comparing at the block level, like we can compare
each block but ignore the content of the hole?
We could do that, but I don't think that's a full solution. I think
I'd end up having to reimplement the equivalent of heap_mask,
btree_mask, et al. in Perl, which doesn't seem very reasonable. It's
fairly complicated logic even written in C, and doing the right thing
in Perl would be more complex, I think, because it wouldn't have
access to all the same #defines, which depend on things like word size
and endianness and stuff. If we want to allow this sort of comparison,
I feel we should think of changing the C code in some way to make it
work reliably rather than try to paper over the problems in Perl.
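To make that concrete, the simplest piece of such normalization, masking
the hole, looks roughly like this in C. This is a sketch loosely modeled
on the buffer-masking helpers that wal_consistency_checking uses
(src/backend/access/common/bufmask.c); the function names here are
illustrative, and the real masking logic is per-AM and also covers hint
bits, unused line pointers, and more.

#include "postgres.h"

#include "storage/bufpage.h"

/* Zero the hole between pd_lower and pd_upper. */
static void
mask_hole(Page page)
{
	PageHeader	phdr = (PageHeader) page;

	if (phdr->pd_lower < phdr->pd_upper)
		memset((char *) page + phdr->pd_lower, 0,
			   phdr->pd_upper - phdr->pd_lower);
}

/* Compare two pages while ignoring only the hole contents. */
static bool
pages_match_ignoring_hole(Page a, Page b)
{
	mask_hole(a);
	mask_hole(b);
	return memcmp(a, b, BLCKSZ) == 0;
}

Reimplementing that faithfully in Perl, for every access method, is the
part that doesn't seem reasonable.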
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 9:20 PM Robert Haas <robertmhaas@gmail.com> wrote:
Meanwhile, here's a rebased set of patches. The somewhat-primitive
attempts at writing tests are in 0009, but they don't work, for the
reasons explained above. I think I'd probably like to go ahead and
commit 0001 and 0002 soon if there are no objections, since I think
those are good refactorings independently of the rest of this.
I have started reading the patch today; I haven't yet completed one
pass, but here are my comments on 0007:
1.
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
With RELSEG_SIZE at its default of 131072, this is 512kB of memory, so I
think it would be better to palloc it instead of keeping it on the stack.
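That is, something along these lines (sketch only, with the surrounding
code elided):

	BlockNumber *relative_block_numbers;

	relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
	/* ... fill and use the array as before ... */
	pfree(relative_block_numbers);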
2.
/*
* Try to parse the directory name as an unsigned integer.
*
- * Tablespace directories should be positive integers that can
- * be represented in 32 bits, with no leading zeroes or trailing
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
* garbage. If we come across a name that doesn't meet those
* criteria, skip it.
Unrelated code refactoring hunk
3.
+typedef struct
+{
+ const char *filename;
+ pg_checksum_context *checksum_ctx;
+ bbsink *sink;
+ size_t bytes_sent;
+} FileChunkContext;
This structure is not used anywhere.
4.
+ * If the file is to be set incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
/If the file is to be set incrementally/If the file is to be sent incrementally
5.
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
I do not really like this change, because after removing this you have
put 2 independent checks for sending the full file[1] and sending it
incrementally[2]. Actually for sending incrementally
'statbuf->st_size' is computed from the 'num_incremental_blocks'
itself so why don't we keep this breaking condition in the while loop
itself? So that we can avoid these two separate conditions.
[1]
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
[2]
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
6.
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
Better we add some comments for these structures.
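A minimal sketch of the change suggested in point 1, under the
assumption that the array's lifetime is a single call (names as in
the patch):

BlockNumber *relative_block_numbers;

/* Move the RELSEG_SIZE-element array from the stack to the heap. */
relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/* ... fill and consume the array while sending the file ... */
pfree(relative_block_numbers);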
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 30, 2023 at 4:50 PM Robert Haas <robertmhaas@gmail.com> wrote:
[..]
I've played a little bit more with this second batch of patches on
e8d74ad625f7344f6b715254d3869663c1569a51 @ 31Aug (days before the
wait events refactor):
test_across_wallevelminimal.sh
test_many_incrementals_dbcreate.sh
test_many_incrementals.sh
test_multixact.sh
test_pending_2pc.sh
test_reindex_and_vacuum_full.sh
test_truncaterollback.sh
test_unlogged_table.sh
all those basic tests had GOOD results. Please find attached. I'll
try to schedule some more realistic (in terms of workload and sizes)
tests in a couple of days + maybe have some fun with
cross-backup-and-restore across standbys. As for the earlier doubt:
the wal_level = minimal situation shouldn't be a concern, because it
requires max_wal_senders = 0, while pg_basebackup requires it to be
above 0 (due to "FATAL: number of requested standby connections
exceeds max_wal_senders (currently 0)").
I also wanted to introduce corruption into the WAL summary files, but
later saw in the code that they are already covered by CRC32, cool.
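For anyone else wanting to poke at that: a sketch of the usual
backend CRC pattern (the macros are from src/include/port/pg_crc32c.h;
the notion of a separate "payload" and stored CRC here is illustrative,
not the patch's actual summary-file layout):

#include "port/pg_crc32c.h"

static bool
summary_payload_crc_ok(const char *data, size_t len, pg_crc32c stored_crc)
{
    pg_crc32c   crc;

    /* Compute CRC-32C over the payload and compare to the stored value. */
    INIT_CRC32C(crc);
    COMP_CRC32C(crc, data, len);
    FIN_CRC32C(crc);

    return EQ_CRC32C(crc, stored_crc);
}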
In v07:
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 160000
170000 ?
A related design question is whether we should really be sending the
whole backup manifest to the server at all. If it turns out that we
don't really need anything except for the LSN of the previous backup,
we could send that one piece of information instead of everything. On
the other hand, if we need the list of files from the previous backup,
then sending the whole manifest makes sense.
If that is still an area open for discussion: wouldn't it be better
to just specify the LSN, as that would allow resyncing a standby
across a major lag where the WAL to replay would be enormous? Given a
primary->standby pair where the standby is stuck at some LSN, right
now it would be:
1) calculate the backup manifest of the desynced 10TB standby (how?
using which tool?) - even if possible, that means reading 10TB of
data instead of just supplying a number, doesn't it?
2) back up the primary with such an incremental backup >= that LSN
3) copy the incremental backup to the standby
4) apply it to the impaired standby
5) restart the WAL replay
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too. I don't think there would be
a whole lot of point trying to operate on compressed tar files because
you need random access and that seems hard on a compressed file, but
on uncompressed files it seems at least theoretically doable. I'm not
sure whether anyone would care that much about this, though, even
though it does sound pretty cool.
Also, maybe it's too early to ask, but wouldn't it be nice if we
could have a future option in pg_combinebackup to avoid double writes
when used from restore hosts? Right now we need to first reconstruct
the original datadir from the full and incremental backups on the
host hosting the backups, and then TRANSFER it again to the target
host. So something like this could work well from the restore host:
pg_combinebackup /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3 -O tar
-o - | ssh dbserver 'tar -C /path/to/restored/cluster -xvf -'. The
bad thing is that such a pipe prevents parallelism from day 1, and
I'm afraid I do not have a better easy idea on how to have both at
the same time in the long term.
-J.
On Fri, Sep 1, 2023 at 10:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
No objections to 0001/0002.
Cool.
Nobody else objected either, so I went ahead and committed those. I'll
rebase the rest of the patches on top of the latest master and repost,
hopefully after addressing some of the other review comments from
Dilip and Jakub.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 12, 2023 at 5:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
+ BlockNumber relative_block_numbers[RELSEG_SIZE];
This is close to 400kB of memory, so I think it is better we palloc it
instead of keeping it in the stack.
Fixed.
Unrelated code refactoring hunk
Fixed.
This structure is not used anywhere.
Removed.
/If the file is to be set incrementally/If the file is to be sent incrementally
Fixed.
I do not really like this change, because after removing this you have
put 2 independent checks for sending the full file[1] and sending it
incrementally[2]. Actually for sending incrementally
'statbuf->st_size' is computed from the 'num_incremental_blocks'
itself so why don't we keep this breaking condition in the while loop
itself? So that we can avoid these two separate conditions.
I don't think that would be correct. The number of bytes that need to
be read from the original file is not equal to the number of bytes
that will be written to the incremental file. Admittedly, they're
currently different by less than a block, but that could change if we
change the format of the incremental file (e.g. suppose we compressed
the blocks in the incremental file with gzip, or smushed out the holes
in the pages). I wrote the loop as I did precisely so that the two
cases could have different loop exit conditions.
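Schematically (this is the shape of the loop, not the exact patch
code; the variable names are the ones quoted above):

while (1)
{
    if (incremental_blocks == NULL)
    {
        /* Full-file case: stop when the whole source file is read. */
        if (bytes_done >= statbuf->st_size)
            break;
    }
    else
    {
        /* Incremental case: stop when all requested blocks are sent. */
        if (ibindex >= num_incremental_blocks)
            break;
    }

    /* ... read the next chunk or block and push it to the sink ... */
}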
Better we add some comments for these structures.
Done.
Here's a new patch set, also addressing Jakub's observation that
MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v3-0002-Refactor-parse_filename_for_nontemp_relation-to-p.patch
From b7f1acaead7fdf87a01ea88a7a381580c1705b38 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:44 -0400
Subject: [PATCH v3 2/7] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It could be done separately from the rest of this commit, but that
would be more involved, whereas this way it's a 1-line change.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index dc6892011d..b537f46219 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1198,9 +1198,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1250,23 +1250,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..5df2517b46 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist at
+ * all -- but if 0017.3 does exist, it's a non-relation file, not part of
+ * the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
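As an aside, a hypothetical call to the refactored function might
look like this (the filename is made up; per the commit message,
leading zeroes or trailing garbage make it return false):

RelFileNumber relnumber;
ForkNumber  fork;
unsigned    segno;

if (parse_filename_for_nontemp_relation("16384_fsm.2",
                                        &relnumber, &fork, &segno))
{
    /* relnumber == 16384, fork == FSM_FORKNUM, segno == 2 */
}
else
{
    /* e.g. "016384", "16384.0", or "t3_16384" would all land here */
}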
v3-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From 333933d0d14588a43826e0089d97579c65cef94e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v3 4/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index cc5c54dcee..ff60666f5c 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 3b97497d1a..fcc0c4fe8d 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v3-0001-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch
From fcc50ccfc814763e7b12dfa3b265ee737802b74f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:16 -0400
Subject: [PATCH v3 1/7] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..677a5bf51b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8483,9 +8483,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8560,7 +8573,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..5549e1afc5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 7d025bcf38..dc6892011d 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1419,8 +1420,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1489,8 +1490,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
v3-0003-Change-how-a-base-backup-decides-which-files-have.patch
From 65f7f683ecbcebc6575f73716ce86e2a2f70368b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v3 3/7] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
v3-0005-Prototype-patch-for-incremental-and-differential-.patch
From 1a93d452eee57de92d3fc281b1a0bd65f7bb9acb Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v3 5/7] Prototype patch for incremental and differential
backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This needs a lot of work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
src/backend/access/transam/xlog.c | 93 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 108 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 46 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 29 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1276 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
64 files changed, 8429 insertions(+), 60 deletions(-)
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 677a5bf51b..6cfeee63e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3499,6 +3500,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3778,8 +3816,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3822,6 +3860,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5146,9 +5204,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6829,6 +6887,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7503,6 +7572,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5549e1afc5..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
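For reference, the check above keys off the extra lines that
build_backup_content() now appends, so the backup_label inside an
incremental backup ends with something like this (values are just
illustrative):

INCREMENTAL FROM LSN: 0/2000028
INCREMENTAL FROM TLI: 1

A plain data directory's backup_label never contains these lines, which
is what lets read_backup_label() refuse to start a server directly on an
incremental backup.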
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
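To sketch the resulting replication-protocol flow (the UPLOAD_MANIFEST
command itself is added by another patch in this series, so take the
exact spelling below as illustrative rather than definitive):

UPLOAD_MANIFEST
-- client streams the prior backup's backup_manifest --
BASE_BACKUP ( INCREMENTAL )

If INCREMENTAL is requested without a preceding UPLOAD_MANIFEST, ib is
NULL and we error out above; if a manifest was uploaded but a full backup
is requested, the manifest is simply ignored.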
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
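For anyone who wants to poke at the incremental files directly, here is a
minimal sketch - not part of the patch - of a reader for the header that
sendFile() emits above: magic, block count, and truncation block length,
four bytes each, followed by the relative block numbers and then the
block images. The struct and function names are hypothetical, the caller
still needs to compare hdr->magic against INCREMENTAL_MAGIC, and error
handling is pared to the bone:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    uint32_t    magic;          /* expected to equal INCREMENTAL_MAGIC */
    uint32_t    num_blocks;     /* number of block images that follow */
    uint32_t    truncation_block_length;
    uint32_t   *blocknos;       /* num_blocks relative block numbers */
} IncrementalFileHeader;

static bool
read_incremental_header(FILE *fp, IncrementalFileHeader *hdr)
{
    if (fread(&hdr->magic, sizeof(uint32_t), 1, fp) != 1)
        return false;
    if (fread(&hdr->num_blocks, sizeof(uint32_t), 1, fp) != 1)
        return false;
    if (fread(&hdr->truncation_block_length, sizeof(uint32_t), 1, fp) != 1)
        return false;
    if (hdr->num_blocks > 0)
    {
        hdr->blocknos = malloc(sizeof(uint32_t) * hdr->num_blocks);
        if (hdr->blocknos == NULL ||
            fread(hdr->blocknos, sizeof(uint32_t),
                  hdr->num_blocks, fp) != hdr->num_blocks)
            return false;
    }
    else
        hdr->blocknos = NULL;

    /* The num_blocks block images, BLCKSZ bytes apiece, follow here. */
    return true;
}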
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
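+
+/*
+ * A worked example of the 90% threshold above, purely illustrative: with
+ * the default 8192-byte BLCKSZ, a 100 MB segment holds 12,800 blocks. At
+ * 2,000 modified blocks the incremental file carries about 16% of the
+ * segment's data, so we send it incrementally; at 12,000 modified blocks
+ * (roughly 94%), nblocks * BLCKSZ exceeds 0.9 * size and we send the
+ * whole file.
+ */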
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
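+
+/*
+ * Example of GetIncrementalFileSize(): with 4-byte block numbers and the
+ * default BLCKSZ of 8192, an incremental file covering 10 blocks occupies
+ * 3 * 4 + (8192 + 4) * 10 = 81,972 bytes, and one covering no blocks at
+ * all is just the 12-byte header.
+ */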
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
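+
+/*
+ * Filename format example: a summary on timeline 1 covering 0/1000028
+ * through 0/2000000 is named
+ * 0000000100000000010000280000000002000000.summary - five 8-character
+ * hex fields (the TLI, then the high and low halves of each LSN), 40 hex
+ * digits in all, which is exactly what IsWalSummaryFilename() accepts.
+ */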
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != EEXIST || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
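+ * Valid names are 40 hex characters - the TLI, then the start and end LSNs,
+ * each LSN as two zero-padded %08X halves - followed by ".summary". For
+ * instance (a made-up example), TLI 1 with LSNs 0/1000028 through 0/10000D8
+ * gives "00000001000000000100002800000000010000D8.summary".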
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
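+
+ /*
+ * appendStringInfoVA returns 0 on success, or an estimate of the space
+ * needed if the buffer is too small; enlarge and retry until it fits.
+ */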
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
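+ *
+ * Assuming this is exposed at the SQL level under the same name, usage
+ * would look something like:
+ *
+ *   SELECT * FROM pg_available_wal_summaries();
+ *
+ * with one (tli, start_lsn, end_lsn) row per summary file on disk.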
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
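+ *
+ * Assuming a matching SQL-level function, usage would look something like:
+ *
+ *   SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/10000D8');
+ *
+ * which emits one row per modified block, plus one row per relation fork
+ * with a valid limit block, flagged as such by the final boolean column.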
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_UINT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54e9bfb8c4..0538b84ef8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -251,6 +252,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -442,6 +444,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -561,6 +564,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1847,6 +1851,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2714,6 +2721,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3067,6 +3076,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3185,6 +3195,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3580,6 +3604,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3730,6 +3760,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3756,6 +3788,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3853,6 +3886,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4074,6 +4108,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5380,6 +5416,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5516,6 +5556,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..34bd254183
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a new
+ * WAL summary file should be written as soon as possible, without trying
+ * to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
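+ *
+ * wal_summarize_mb is a soft cap, in megabytes, on the amount of WAL
+ * covered by a single summary file; a value of 0 disables summarization.
+ * wal_summarize_keep_time is the retention period for summary files, in
+ * minutes; a value of 0 disables automatic removal.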
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in shared mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
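+ *
+ * A hypothetical caller that must not proceed until WAL through some
+ * target LSN has been summarized might do:
+ *
+ *   if (WaitForWalSummarization(target_lsn, 60000) < target_lsn)
+ *       ereport(ERROR,
+ *               (errmsg("timed out waiting for WAL summarization")));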
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a WAL summary
+ * file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and if
+ * so, stop. This has the same purpose as the earlier check for the
+ * same condition above, but there we've just read a record and might
+ * decide against including it in the current summary file, whereas
+ * here we've already included it and might decide against reading the
+ * next one. Note that we may have just refreshed our notion of the
+ * redo pointer, so it's smart to check here before we do any more
+ * work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
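+ *
+ * For example, with sleep_quanta = 8 the sleep lasts 1600ms; if nothing at
+ * all was read since the last sleep, it doubles to 16 quanta (3200ms), and
+ * repeated idle cycles eventually hit the MAX_SLEEP_QUANTA cap of one
+ * minute.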
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose last modification time is older than the
+ * retention cutoff, at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
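+ * (wal_summarize_keep_time is in minutes, so with the default of
+ * 7 * 24 * 60, anything untouched for more than a week is eligible.)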
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the summarized range of WAL no longer exists on disk, we can
+ * remove the summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
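+ /* overall format is text (0), with no per-column format codes to follow */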
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
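+ /* reparent to a long-lived context so the data survives this command */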
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
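+ /*
+ * Hold off cancel/die interrupts while reading the message, since an
+ * abort partway through would leave us out of sync with the client.
+ */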
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in CopyIn mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9c5fdeb3ca..17ad986c98 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -140,6 +141,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -160,6 +162,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -241,6 +244,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1e671c560c..037111b89f 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 16ec6c5ef0..a532f57af1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..2e27fb58c6 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -688,6 +694,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1751,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1819,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1993,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2349,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2387,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2412,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2447,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2863,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cb20480aae
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,46 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
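+
+ /*
+ * Bits of 'found': 1 = START WAL LOCATION, 2 = START TIMELINE,
+ * 4 = INCREMENTAL FROM LSN, 8 = INCREMENTAL FROM TLI.
+ */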
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ if (pg_checksum_init(&checksum_ctx, checksum_type) < 0)
+ pg_fatal("could not initialize checksum of file \"backup_label\"");
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
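+ /*
+ * Temporarily NUL-terminate the line at 'e' so that sscanf cannot run
+ * past the end of the data; the saved byte is restored afterward.
+ */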
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
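+ /* 'offset' is tracked only to improve write-error reporting */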
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..bea0db405e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..369447204a
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1276 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
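+/* INCREMENTAL_PREFIX_LENGTH must equal strlen(INCREMENTAL_PREFIX). */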
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ bool progress;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"output", required_argument, NULL, 'o'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"progress", no_argument, NULL, 'P'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'P':
+ opt.progress = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
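+/*
+ * Example invocation (all paths hypothetical) exercising the options parsed
+ * above: combine a full backup with two incrementals, relocating one
+ * user-defined tablespace and selecting SHA-256 manifest checksums:
+ *
+ *   pg_combinebackup -T /srv/ts_old=/srv/ts_new \
+ *       --manifest-checksums=SHA256 -o /srv/restored \
+ *       full incr1 incr2
+ */
+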
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
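+
+/*
+ * A worked example of the parsing rule above (hypothetical paths): the
+ * argument "/mnt/a\=b=/mnt/new" maps the old directory "/mnt/a=b" to the
+ * new directory "/mnt/new", because the backslash-escaped equals sign is
+ * copied literally instead of being treated as the field separator.
+ */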
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup label itself is limited (at
+ * least by some parts of the code) to MAXPGPATH, so include that value
+ * in the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
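+
+/*
+ * To restate the checks above: for a chain of backups b_0 (full) through
+ * b_n (latest), we require previous_tli == 0 for b_0 and previous_tli != 0
+ * for every later backup, and for each adjacent pair the start_tli and
+ * start_lsn of b_i must equal the previous_tli and previous_lsn recorded
+ * by b_{i+1}.
+ */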
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -P, --progress show progress information\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ /* Reject leading zeroes, which strtoul() would otherwise accept. */
+ if (s[0] == '0' && s[1] != '\0')
+ return false;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
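+
+/*
+ * For example, "1" and "16384" parse successfully, while "", "0", "007",
+ * "16384x", and out-of-range values such as "4294967296" are all rejected.
+ */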
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
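+
+/*
+ * To illustrate the path handling above with a hypothetical tablespace OID
+ * of 16384: at the top level (relative_path == NULL), the manifest prefix
+ * is "" for the main data directory and "pg_tblspc/16384/" for the
+ * tablespace; for a subdirectory "foo" beneath either one, the prefixes
+ * become "foo/" and "pg_tblspc/16384/foo/" respectively.
+ */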
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (For example, if PG_VERSION contains "14\n", this
+ * function will return 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see what
+ * looks like the beginning of such a version number, just bail out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
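+
+/*
+ * For example (hypothetical OIDs and paths), if pg_tblspc contains a
+ * symbolic link "16384" pointing to /srv/ts_old and an in-place tablespace
+ * directory "16385", then given -T /srv/ts_old=/srv/ts_new the resulting
+ * list has one entry with old_dir /srv/ts_old, new_dir /srv/ts_new and
+ * in_place = false, and one entry whose old_dir and new_dir are the
+ * pg_tblspc/16385 paths within the input and output directories, with
+ * in_place = true.
+ */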
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must still be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
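+
+/*
+ * A small worked example of the algorithm above: suppose the latest
+ * incremental file has truncation_block_length 3 and contains only block 1.
+ * Block 1 is sourced from that file, leaving blocks 0 and 2 missing. If the
+ * previous backup holds an incremental file containing blocks 0 and 5,
+ * block 0 is taken from it (block 5 is beyond the truncation length and is
+ * ignored), leaving block 2. A full file found one backup further back then
+ * supplies block 2 at offset 2 * BLCKSZ, and since blocks were needed from
+ * later incremental files, no whole-file copy is attempted.
+ */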
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
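+
+/*
+ * For example, with truncation_block_length 4 and incremental blocks 2 and
+ * 9, the reconstructed file is 10 blocks long; blocks 4 through 8 will be
+ * zero-filled, since they belong to neither the truncated prefix nor the
+ * incremental file.
+ */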
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
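+
+/*
+ * Consequently, assuming 4-byte "unsigned" and BlockNumber, an incremental
+ * file with num_blocks = 2 and relative block numbers 3 and 7 has a
+ * 20-byte header (magic, block count, truncation block length, then the
+ * two block numbers), with block 3's data stored at offset 20 and block
+ * 7's data at offset 20 + BLCKSZ.
+ */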
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
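+
+/*
+ * A typical entry produced by this function looks something like this
+ * (values hypothetical):
+ *
+ * { "Path": "base/1/1259", "Size": 8192,
+ *   "Last-Modified": "2023-06-14 10:00:00 GMT",
+ *   "Checksum-Algorithm": "CRC32C", "Checksum": "0bf0e57a" }
+ */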
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 04567f349d..c3b9e07841 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
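+
+ /*
+ * WAL summary file names consist of 40 hexadecimal characters, which we
+ * take to be an 8-character timeline ID followed by 16 characters each
+ * for the start and end LSNs, plus a ".summary" suffix. For example
+ * (made-up values): 0000000100000000010000280000000001FFFFD8.summary
+ */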
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index ff60666f5c..ebff20b1d3 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
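+/*
+ * A worked example of the mapping (numbers are illustrative): block 70000
+ * falls in chunk 70000 / BLOCKS_PER_CHUNK = 1, at offset
+ * 70000 % BLOCKS_PER_CHUNK = 4464 within that chunk. An array chunk stores
+ * that offset as a single uint16; once the array form would need
+ * MAX_ENTRIES_PER_CHUNK (4096) entries, the chunk is converted to a bitmap,
+ * which occupies the same 8192 bytes but can represent all 65536 blocks.
+ */
+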
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* hash entry status, used by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
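+/*
+ * For orientation, the serialized file layout produced by the code below
+ * (WriteBlockRefTable, or CreateBlockRefTableWriter and friends) is:
+ *
+ * uint32 magic number (BLOCKREFTABLE_MAGIC)
+ * for each relation fork, in sorted order:
+ * BlockRefTableSerializedEntry
+ * uint16 chunk_usage[nchunks] (trailing zero entries trimmed)
+ * per-chunk payload (offset array or bitmap, chunk_usage[i] uint16s each)
+ * all-zeroes BlockRefTableSerializedEntry as a sentinel
+ * pg_crc32c covering everything that precedes it
+ */
+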
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
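+ *
+ * For example (illustrative numbers): with BLOCKS_PER_CHUNK = 65536,
+ * setting limit_block = 70000 leaves chunk 0 untouched, discards offsets
+ * >= 4464 from chunk 1, and zeroes the usage count of every chunk after
+ * chunk 1.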
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
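+ *
+ * For example, with nchunks = 16 and chunkno = 37, max_chunks doubles
+ * from 16 to 32 and then to 64 before the arrays are grown.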
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer to be written to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index fcc0c4fe8d..6e51257b1c 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif /* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f0b7b9cbd8..f68e6d4987 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12062,4 +12062,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
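+/*
+ * A minimal write callback might look like this (a sketch only; the
+ * ReadWalSummary/WriteWalSummary functions declared in backup/walsummary.h
+ * are real instances of this interface):
+ *
+ * static int
+ * write_to_fd(void *callback_arg, void *data, int length)
+ * {
+ * int fd = *(int *) callback_arg;
+ *
+ * if (write(fd, data, length) != length)
+ * pg_fatal("could not write block reference table: %m");
+ * return length;
+ * }
+ */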
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
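+/*
+ * A typical read loop, sketched with the callbacks declared in
+ * backup/walsummary.h (variable names here are illustrative):
+ *
+ * reader = CreateBlockRefTableReader(ReadWalSummary, &io, filename,
+ * ReportWalSummaryError, NULL);
+ * while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ * &limit_block))
+ * {
+ * while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ * lengthof(blocks))) > 0)
+ * (process blocks[0 .. n-1] for this relation fork)
+ * }
+ * DestroyBlockRefTableReader(reader);
+ */
+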
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
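+
+/*
+ * Typical write sequence (a sketch; variable names are illustrative):
+ *
+ * writer = CreateBlockRefTableWriter(WriteWalSummary, &io);
+ * for each relation fork, in the sort order described above:
+ * entry = CreateBlockRefTableEntry(rlocator, forknum);
+ * (call BlockRefTableEntrySetLimitBlock and/or
+ * BlockRefTableEntryMarkBlockModified as needed)
+ * BlockRefTableWriteEntry(writer, entry);
+ * BlockRefTableFreeEntry(entry);
+ * DestroyBlockRefTableWriter(writer);
+ */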
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif							/* WALSUMMARIZER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de90c4958..ff3cff8c28 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3991,3 +3991,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
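As a side note on the blkreftable.h API added above, here is a minimal
sketch of the in-memory write-side flow. This is not taken from the
patches: the callback signature is inferred from walsummary_read_callback()
in pg_walsummary.c below, and the surrounding code is invented purely for
illustration.

/*
 * Hedged sketch of using the in-memory BlockRefTable, assuming
 * io_callback_fn has the shape int (*)(void *, void *, int).
 */
#include <unistd.h>

#include "common/blkreftable.h"

static int
sketch_write_callback(void *callback_arg, void *data, int length)
{
	int			fd = *(int *) callback_arg;

	/* Real code must handle errors and short writes. */
	return write(fd, data, length);
}

static void
sketch_summarize(RelFileLocator rlocator, int fd)
{
	BlockRefTable *brtab = CreateEmptyBlockRefTable();

	/* Blocks 7 and 8 of the main fork were modified. */
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 8);

	/* The relation was truncated to 100 blocks. */
	BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);

	/* Entries come out sorted, satisfying the writer's ordering rule. */
	WriteBlockRefTable(brtab, sketch_write_callback, &fd);
}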
Attachment: v3-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From d34112316b67d535c3dd6f3ee4f2205a1c1988d6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v3 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {false, false};
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
Attachment: v3-0007-Add-TAP-tests-this-is-broken-doesn-t-work.patch (application/octet-stream)
From 5dd4ce5b1d214fed4d81d824d6edbbeb26c42efe Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 17 Aug 2023 12:56:01 -0400
Subject: [PATCH v3 7/7] Add TAP tests (this is broken, doesn't work).
---
src/bin/pg_combinebackup/Makefile | 6 +
src/bin/pg_combinebackup/meson.build | 8 +-
src/bin/pg_combinebackup/t/001_basic.pl | 23 ++
.../pg_combinebackup/t/002_compare_backups.pl | 277 ++++++++++++++++++
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
5 files changed, 333 insertions(+), 2 deletions(-)
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
index cb20480aae..78ba05e624 100644
--- a/src/bin/pg_combinebackup/Makefile
+++ b/src/bin/pg_combinebackup/Makefile
@@ -44,3 +44,9 @@ uninstall:
clean distclean maintainer-clean:
rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
index bea0db405e..a6036dea74 100644
--- a/src/bin/pg_combinebackup/meson.build
+++ b/src/bin/pg_combinebackup/meson.build
@@ -25,5 +25,11 @@ bin_targets += pg_combinebackup
tests += {
'name': 'pg_combinebackup',
'sd': meson.current_source_dir(),
- 'bd': meson.current_build_dir()
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
}
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..25bc5ef958
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,277 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr1->stop;
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until we exit recovery, then stop the server.
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->stop;
+
+#my $cmp = compare_data_directories($pitr1->basedir . '/pgdata',
+# $pitr2->basedir . '/pgdata', '');
+my $cmp = 0; # XXX: DISABLE BROKEN TEST
+is($cmp, 0, "directories are identical");
+
+done_testing();
+
+sub compare_data_directories
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $result = 0;
+
+ if ($relpath eq '/pg_wal')
+ {
+ # Since recovery started at different LSNs, pg_wal contents may not
+ # be identical. Ignore that.
+ return 0;
+ }
+
+ my $dir1 = $basedir1 . $relpath;
+ my $dir2 = $basedir2 . $relpath;
+
+ opendir(DIR1, $dir1) || die "$dir1: $!";
+ my @files1 = grep { $_ ne '.' && $_ ne '..' } readdir(DIR1);
+ closedir(DIR1);
+
+ opendir(DIR2, $dir2) || die "$dir2: $!";
+ my %files2 = map { $_ => 'unmatched' }
+ grep { $_ ne '.' && $_ ne '..' } readdir(DIR2);
+ closedir(DIR2);
+
+ for my $fname (@files1)
+ {
+ if (!exists $files2{$fname})
+ {
+ warn "$dir1/$fname exists but $dir2/$fname does not";
+ ++$result;
+ next;
+ }
+
+ $files2{$fname} = 'matched';
+
+ if (-d "$dir1/$fname")
+ {
+ if (! -d "$dir2/$fname")
+ {
+ warn "$dir1/$fname is a directory but $dir2/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ $result +=
+ compare_data_directories($basedir1, $basedir2,
+ "$relpath/$fname");
+ }
+ }
+ elsif (-d "$dir2/$fname")
+ {
+ warn "$dir2/$fname is a directory but $dir1/$fname is not";
+ ++$result;
+ }
+ else
+ {
+ # Both are plain files.
+ $result += compare_files($basedir1, $basedir2, "$relpath/$fname");
+ }
+ }
+
+ for my $fname (keys %files2)
+ {
+ if ($files2{$fname} eq 'unmatched')
+ {
+ warn "$dir2/$fname exists but $dir1/$fname does not";
+ ++$result;
+ }
+ }
+
+ return $result;
+}
+
+sub compare_files
+{
+ my ($basedir1, $basedir2, $relpath) = @_;
+ my $file1 = $basedir1 . $relpath;
+ my $file2 = $basedir2 . $relpath;
+
+ if ($relpath eq '/backup_manifest')
+ {
+ # We don't expect the backup manifest to be identical between two
+ # backups taken at different times, so just disregard it.
+ return 0;
+ }
+
+ if ($relpath eq '/backup_label.old')
+ {
+ # We don't expect the backup label to be identical; the start WAL
+ # location and probably also the start time are expected to be
+ # different.
+ return 0;
+ }
+
+ if ($relpath eq '/postgresql.conf')
+ {
+ # At least the port numbers are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/postmaster.opts')
+ {
+ # At least the cluster names are expected to be different, so
+ # disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/global/pg_control')
+ {
+ # At least the mock authentication nonce is expected to be different,
+ # so disregard this file.
+ return 0;
+ }
+
+ if ($relpath eq '/pg_stat/pgstat.stat')
+ {
+ # Stats aren't stable enough to be compared here.
+ return 0;
+ }
+
+ if ($relpath =~ m@/pg_internal\.init$@)
+ {
+ # relcache init files are rebuilt at startup, so they don't need to
+ # match. And because we write out the contents of data structures like
+ # RelationData that include pointers, they almost certainly won't.
+ return 0;
+ }
+
+ # Check whether the lengths match.
+ my $length1 = -s $file1;
+ my $length2 = -s $file2;
+ if ($length1 != $length2)
+ {
+ warn "$file1 has length $length1, but $file2 has length $length2";
+ return 1;
+ }
+
+ # Compare contents.
+ my $contents1 = slurp_file($file1);
+ my $contents2 = slurp_file($file2);
+ if ($contents1 ne $contents2)
+ {
+ my $nchars = 1;
+ while (substr($contents1, 0, $nchars) eq substr($contents2, 0, $nchars))
+ {
+ ++$nchars;
+ }
+ warn sprintf("%s and %s are both of length %s, but differ beginning at byte %d",
+ $file1, $file2, $length1, $nchars - 1);
+ return 1;
+ }
+
+ # Files are identical.
+ return 0;
+}
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
--
2.37.1 (Apple Git-137.1)
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> all those basic tests had GOOD results. Please find attached. I'll try
> to schedule some more realistic (in terms of workload and sizes) tests
> in a couple of days + maybe have some fun with cross-backup-and-restores
> across standbys.
That's awesome! Thanks for testing! This can definitely benefit from
any amount of beating on it that people wish to do. It's a complex,
delicate area where any bugs risk data loss.
> If that is still an area open for discussion: wouldn't it be better to
> just specify an LSN, as that would allow resyncing a standby across a
> major lag where the WAL to replay would be enormous? Given a
> primary->standby setup where the standby is stuck at some LSN, right
> now the procedure would be:
> 1) calculate a backup manifest of the desynced 10TB standby (how?
> using which tool?) - even if possible, that means reading 10TB of
> data instead of just supplying a number, doesn't it?
> 2) take an incremental backup of the primary relative to that LSN
> 3) copy the incremental backup to the standby
> 4) apply it to the impaired standby
> 5) restart WAL replay
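>
> Concretely, an LSN-based option could collapse that to something like
> the following (the --incremental-lsn option is purely hypothetical,
> sketched here only to illustrate the idea; the posted patches accept
> only a manifest):
>
> # on the primary, relative to the LSN where the standby got stuck
> pg_basebackup -cfast -D /tmp/delta --incremental-lsn '0/169AD68'
> # ship /tmp/delta to the standby, combine it with the stuck datadir,
> # then restart the standby and let WAL replay continue
> pg_combinebackup $STUCK_PGDATA /tmp/delta -o $NEW_PGDATA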
Hmm. I wonder if this would even be a safe procedure. I admit that I
can't quite see a problem with it, but sometimes I'm kind of dumb.
> Also, maybe it's too early to ask, but wouldn't it be nice if we could
> have a future option in pg_combinebackup to avoid double writes when
> it is used from a restore host? (Right now we first need to
> reconstruct the original datadir from the full and incremental backups
> on the host storing the backups, and then transfer it again to the
> target host.) Something like this could work well from the restore
> host: pg_combinebackup /tmp/backup1 /tmp/incbackup2 /tmp/incbackup3
> -O tar -o - | ssh dbserver 'tar xf - -C /path/to/restored/cluster'.
> The bad thing is that such a pipe prevents parallelism from day one,
> and I'm afraid I do not have a better easy idea on how to have both at
> the same time in the long term.
I don't think it's too early to ask for this, but I do think it's too
early for you to get it. ;-)
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a new patch set, also addressing Jakub's observation that
> MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.
Here's yet another new version. In this version, I reversed the order
of the first two patches, with the idea that what's now 0001 seems
fairly reasonable as an independent commit, and could thus perhaps be
committed sometime soon-ish. In the main patch, I added SGML
documentation for pg_combinebackup. I also fixed the broken TAP tests
so that they work, by basing them on pg_dump equivalence rather than
file-level equivalence. I'm sad to give up on testing the latter, but
it seems to be unrealistic. I cleaned up a few other odds and ends,
too. But, what exactly is the bigger picture for this patch in terms
of moving forward? Here's a list of things that are on my mind:
- I'd like to get the patch to mark the redo point in the WAL
committed[1] and then rework this patch set to make use of that
infrastructure. Right now, we make a best effort to end WAL summaries
at redo point boundaries, but it's racy, and sometimes we fail to do
so. In theory that just has the effect of potentially making an
incremental backup contain some extra blocks that it shouldn't really
need to contain, but I think it can actually lead to weird stalls,
because when an incremental backup is taken, we have to wait until a
WAL summary shows up that extends at least up to the start LSN of the
backup we're about to take. I believe all the logic in this area can
be made a good deal simpler and more reliable if that patch gets
committed and this one reworked accordingly.
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful. One point to
consider here is that, if WAL summarization is enabled, checkpoints
can't remove WAL that isn't summarized yet. Mostly that's not a
problem, I think, because the WAL summarizer is pretty fast. But it
could increase disk consumption for some people. I don't think that we
need to worry about the summaries themselves being a problem in terms
of space consumption; at least in all the cases I've tested, they're
just not very big.
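In postgresql.conf terms, the behavior described above corresponds
roughly to this (the GUC names are the ones in the current patch set;
the values shown are illustrative only, not a statement of the exact
defaults or units):

# 0 disables WAL summarization; nonzero summarizes WAL, roughly this
# many MB per summary file (assumption based on the GUC name)
wal_summarize_mb = 256
# drop summary files whose mtime is older than this
wal_summarize_keep_time = '7d'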
- On a related note, I haven't yet tested this on a standby, which is
a thing that I definitely need to do. I don't know of a reason why it
shouldn't be possible for all of this machinery to work on a standby
just as it does on a primary, but then we need the WAL summarizer to
run there too, which could end up being a waste if nobody ever tries
to take an incremental backup. I wonder how that should be reflected
in the configuration. We could do something like what we've done for
archive_mode, where on means "only on if this is a primary" and you
have to say always if you want it to run on standbys as well ... but
I'm not sure if that's a design pattern that we really want to
replicate into more places. I'd be somewhat inclined to just make
whatever configuration parameters we need to configure this thing on
the primary also work on standbys, and you can set each server up as
you please. But I'm open to other suggestions.
- We need to settle the question of whether to send the whole backup
manifest to the server or just the LSN. In a previous attempt at
incremental backup, we decided the whole manifest was necessary,
because flat-copying files could make new data show up with old LSNs.
But that version of the patch set was trying to find modified blocks
by checking their LSNs individually, not by summarizing WAL. And since
the operations that flat-copy files are WAL-logged, the WAL summary
approach seems to eliminate that problem - maybe an LSN (and the
associated TLI) is good enough now. This also relates to Jakub's
question about whether this machinery could be used to fast-forward a
standby, which is not exactly a base backup but ... perhaps close
enough? I'm somewhat inclined to believe that we can simplify to an
LSN and TLI; however, if we do that, then we'll have big problems if
later we realize that we want the manifest for something after all. So
if anybody thinks that there's a reason to keep doing what the patch
does today -- namely, upload the whole manifest to the server --
please speak up.
- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker, although Andrew Dunstan tells me he
does have a prototype for incremental JSON parsing so maybe that will
land and we can use it here.
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche. Nevertheless, merging it into any other
utility seems relatively awkward, so I'm inclined to think both that
this should be included in whatever finally gets committed and that it
should be a separate binary. I considered whether it should go in
contrib, but we seem to have moved to a policy that heavily favors
limiting contrib to extensions and loadable modules, rather than
binaries.
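For anyone who hasn't looked at the patch, the tool just takes one or
more summary files as arguments and prints one line per limit block,
block, or block range; for example (the file name is a placeholder, and
the output lines simply follow the printf formats in pg_walsummary.c):

pg_walsummary $PGDATA/pg_wal/summaries/<some .summary file>
TS 1663, DB 5, REL 16396, FORK main: limit 0
TS 1663, DB 5, REL 16396, FORK main: blocks 0..12
TS 1663, DB 5, REL 16400, FORK fsm: block 2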
Clearly there's a good amount of stuff to sort out here, but we've
still got quite a bit of time left before feature freeze so I'd like
to have a go at it. Please let me know your thoughts, if you have any.
[1]: /messages/by-id/CA+TgmoZAM24Ub=uxP0aWuWstNYTUJQ64j976FYJeVaMJ+qD0uw@mail.gmail.com
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v4-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream; same patch as v3-0006 above, reposted unchanged with the v4 series)
v4-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch (application/octet-stream)
From 7432b6e1db39557bb6c7cbe3641fdfd23fb53e75 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:44 -0400
Subject: [PATCH v4 1/6] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It's trivial to make the update as part of this commit, perhaps more
trivial than it would have been without it, so do that.
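To illustrate the new rules (examples constructed for this description,
not taken from the regression tests):

16384     -> relnumber 16384, fork MAIN_FORKNUM, segment 0
16384_vm  -> relnumber 16384, fork VISIBILITYMAP_FORKNUM, segment 0
16384.2   -> relnumber 16384, fork MAIN_FORKNUM, segment 2
016384.2  -> rejected (leading zero in the RelFileNumber)
16384.02  -> rejected (leading zero in the segment number)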
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 7d025bcf38..b126d9c890 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1197,9 +1197,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1249,23 +1249,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..5df2517b46 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist at
+ * all -- but if 0017.3 does exist, it's a non-relation file, not part of
+ * the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
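Not part of the patch set, but to make the new filename rules concrete,
here's a quick standalone sketch of the same parsing logic in plain C.
It hard-codes a small fork-name list in place of forkname_chars() and
uses stock integer types in place of RelFileNumber/ForkNumber, so treat
it as an illustration of the rules, not as server code:

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Accepts NNNN, NNNN_fork, NNNN.SS, and NNNN_fork.SS, rejecting leading
 * zeroes in both numeric parts, as in the patched
 * parse_filename_for_nontemp_relation().
 */
static bool
parse_relation_filename(const char *name, uint32_t *relnumber,
                        const char **fork, uint32_t *segno)
{
    static const char *forks[] = {"fsm", "vm", "init", NULL};
    unsigned long n, s = 0;
    char *endp;
    int i;

    *fork = "main";

    /* Reject empty names and leading zeroes. */
    if (name[0] < '1' || name[0] > '9')
        return false;

    /* First char is a digit, so strtoul consumes at least one char. */
    errno = 0;
    n = strtoul(name, &endp, 10);
    if (errno || n > UINT32_MAX)
        return false;
    name = endp;

    /* Optional fork name. */
    if (*name == '_')
    {
        for (i = 0; forks[i] != NULL; i++)
        {
            size_t len = strlen(forks[i]);

            if (strncmp(name + 1, forks[i], len) == 0 &&
                (name[1 + len] == '\0' || name[1 + len] == '.'))
            {
                *fork = forks[i];
                name += 1 + len;
                break;
            }
        }
        if (forks[i] == NULL)
            return false;
    }

    /* Optional segment number, again with no leading zero. */
    if (*name == '.')
    {
        if (name[1] < '1' || name[1] > '9')
            return false;
        errno = 0;
        s = strtoul(name + 1, &endp, 10);
        if (errno || s > UINT32_MAX)
            return false;
        name = endp;
    }

    /* Anything left over means it isn't a relation file. */
    if (*name != '\0')
        return false;

    *relnumber = (uint32_t) n;
    *segno = (uint32_t) s;
    return true;
}

int
main(void)
{
    const char *tests[] = {"16384", "16384_fsm", "16384.1", "16384_vm.2",
                           "0017.3", "16384.01", "16384_junk", NULL};

    for (int i = 0; tests[i] != NULL; i++)
    {
        uint32_t rel, seg;
        const char *fork;

        if (parse_relation_filename(tests[i], &rel, &fork, &seg))
            printf("%-12s -> rel %u, fork %s, segno %u\n",
                   tests[i], (unsigned) rel, fork, (unsigned) seg);
        else
            printf("%-12s -> not a relation file\n", tests[i]);
    }
    return 0;
}

As expected, 0017.3 and 16384.01 are rejected outright rather than being
taken as alternate spellings of 17.3 and 16384.1.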
Attachment: v4-0003-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From 3a77a61d05febc358b6b4a3b545bc9b56a55d0c2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v4 3/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
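A note below the cut line, ignored by git am: mechanically, the segment
number used during verification now comes from
parse_filename_for_nontemp_relation() rather than from atoi() on the
filename, and pages are still checked against their absolute block
number, i.e. offset by segno * RELSEG_SIZE. A toy illustration in plain
C, with made-up inputs and RELSEG_SIZE hard-coded to its default of
131072 blocks:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RELSEG_SIZE 131072      /* default: 1GB segments of 8kB blocks */

int
main(void)
{
    /* Hypothetical inputs for block 10 of segment file "16384.2". */
    bool     noverify_checksums = false;    /* NOVERIFY_CHECKSUMS unset */
    bool     data_checksums_enabled = true; /* cluster has checksums */
    bool     is_relation_file = true;       /* filename parsed OK */
    unsigned segno = 2;                     /* parsed from the ".2" */
    unsigned blkno = 10;                    /* position within segment */

    /* The new decision: verify iff the name parsed as a relation file. */
    if (!noverify_checksums && data_checksums_enabled && is_relation_file)
        printf("verify page checksum against block %u\n",
               blkno + segno * RELSEG_SIZE);
    return 0;
}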
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
Attachment: v4-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From 89ec93d59cb6a2a9b9239803dc1ad818fd9cdf93 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v4 4/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index cc5c54dcee..ff60666f5c 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 3b97497d1a..fcc0c4fe8d 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 2379f7be7b..672e8bcf25 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
Attachment: v4-0005-Prototype-patch-for-incremental-and-differential-.patch (application/octet-stream)
From 9dbf7217076ddfefdf8cc80a2c42ac6c8b22044f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v4 5/6] Prototype patch for incremental and differential
backup.
We don't differentiate between incremental and differential backups;
the term "incremental" as used herein means "either incremental or
differential".
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice if we could do something about incremental
JSON parsing.
XXX. This might need more work on documentation and tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
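(Notes below the cut line, ignored by git am: to make the limit-block
rules above concrete, here's a toy model in plain C. It is not the
patch's actual data structure -- that's the block reference table in
src/common/blkreftable.c -- it just shows how the three cases described
in the commit message combine for a single relation fork over one
summarized range of WAL records.)

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* "Nothing interesting happened in this range" is InvalidBlockNumber. */
static BlockNumber limit_block = InvalidBlockNumber;

/* Relation created or destroyed in the range: no old block survives. */
static void
on_create_or_destroy(void)
{
    limit_block = 0;
}

/* Truncation: remember the shortest length seen during the range. */
static void
on_truncate(BlockNumber new_length)
{
    if (limit_block == InvalidBlockNumber || new_length < limit_block)
        limit_block = new_length;
}

int
main(void)
{
    /* Scenario: the relation is truncated three times in the range. */
    on_truncate(1000);
    on_truncate(250);
    on_truncate(400);

    /*
     * Prints 250: blocks below the limit block are unchanged unless they
     * also appear in the summary's modified-block set, while everything
     * at or above it must come from the backup being taken.
     */
    printf("limit block after truncations: %u\n", (unsigned) limit_block);

    /* Scenario: the relation is then dropped and recreated. */
    on_create_or_destroy();
    printf("limit block after create/destroy: %u\n", (unsigned) limit_block);
    return 0;
}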
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 42 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 227 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlog.c | 93 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 35 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
71 files changed, 8899 insertions(+), 67 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 344de921e4..3a569069ec 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,22 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full, incremental,
+ or differential backup of the database. When used to take a full backup, it
+ makes an exact copy of the database cluster's files. When used to take an
+ incremental or differential backup, some files that would have been part of
+ a full backup may be replaced with incremental versions of the same files,
+ containing only those blocks that have been modified since the reference
+ backup. An incremental or differential backup cannot be used directly;
+ instead, <xref linkend="app-pgcombinebackup"/> must first be used to combine
+ it with the previous backups upon which it depends.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +208,25 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an incremental or differential backup. The backup manifest
+ for the reference backup must be provided, and will be uploaded to the
+ server, which will respond by sending the requested incremental or
+ differential backup. There is no real difference between the two:
+ an incremental backup is simply a backup where the reference backup is
+ a full backup, and a differential backup is one where the reference
+ backup is an incremental or differential backup. Either way,
+ the backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must be used to reconstruct a
+ full backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
@@ -595,7 +625,7 @@ PostgreSQL documentation
</varlistentry>
<varlistentry>
- <term><option>--sync-method</option></term>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
<listitem>
<para>
When set to <literal>fsync</literal>, which is the default,
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..626b1b13dd
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,227 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental or differential backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an incremental or differential backup and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental or differential backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental or differential
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 677a5bf51b..6cfeee63e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3499,6 +3500,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3778,8 +3816,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3822,6 +3860,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5146,9 +5204,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6829,6 +6887,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7503,6 +7572,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5549e1afc5..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
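
To make the incremental file layout concrete: what sendFile() emits above is
three native-endian 4-byte fields followed by the block-number array, with the
block contents immediately after. Here is a stand-alone sketch of a parser;
the struct and function are hypothetical illustrations, not part of the patch
(the real consumer will be pg_combinebackup):

    typedef struct
    {
        uint32      magic;          /* expected to be INCREMENTAL_MAGIC */
        uint32      num_blocks;     /* count of block numbers that follow */
        uint32      truncation_block_length;
    } incremental_file_header;

    /*
     * Return a pointer to the block-number array within buf, or NULL if the
     * buffer is too short. Block contents begin right after the array.
     */
    static BlockNumber *
    parse_incremental_header(char *buf, size_t len,
                             incremental_file_header *hdr)
    {
        if (len < sizeof(*hdr))
            return NULL;
        memcpy(hdr, buf, sizeof(*hdr));
        if (len < sizeof(*hdr) + hdr->num_blocks * sizeof(BlockNumber))
            return NULL;
        return (BlockNumber *) (buf + sizeof(*hdr));
    }
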
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
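
The intended calling convention for this module, sketched out (variable names
here are invented for illustration; error handling omitted):

    IncrementalBackupInfo *ib;

    ib = CreateIncrementalBackupInfo(CurrentMemoryContext);
    /* once per chunk of manifest data received from the client */
    AppendIncrementalManifestData(ib, chunk_data, chunk_len);
    FinalizeIncrementalManifest(ib);
    /* after the backup start LSN and TLI have been established */
    PrepareForIncrementalBackup(ib, backup_state);
    /* ... sendDir()/sendFile() then call GetFileBackupMethod() per file ... */
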
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
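
For instance, with invented OIDs, segment 2 of a relation stored at
base/16384/16385 comes out as follows:

    char   *ipath;

    ipath = GetIncrementalFilePath(16384, DEFAULTTABLESPACE_OID,
                                   16385, MAIN_FORKNUM, 2);
    /* ipath is "base/16384/INCREMENTAL.16385.2"; with segno == 0 the
     * numeric suffix is omitted: "base/16384/INCREMENTAL.16385" */
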
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array with room for at least
+ * RELSEG_SIZE block numbers.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by the block numbers and then the block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
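
As a worked example, assuming the default 8kB BLCKSZ and 4-byte block numbers,
an incremental file with 10 modified blocks occupies 12 bytes of header, 40
bytes of block numbers, and 81920 bytes of block data; a purely illustrative
check:

    Assert(GetIncrementalFileSize(10) == 12 + 40 + 81920);   /* 81972 bytes */
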
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
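
As an illustration of the contract: summaries covering 0/1000000-0/2000000 and
0/2000000-0/3000000 prove the range 0/1000000-0/3000000 complete, but if the
second summary instead began at 0/2800000, the function would return false with
*missing_lsn set to 0/2000000, the start of the gap. A hypothetical caller:

    XLogRecPtr  missing_lsn;

    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        elog(DEBUG1, "WAL summaries incomplete; gap begins at %X/%X",
             LSN_FORMAT_ARGS(missing_lsn));
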
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
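
For example, a summary on timeline 1 covering 0/1000000 through 0/2000000
carries a 40-hex-digit name, mirroring the snprintf in OpenWalSummaryFile():

    snprintf(path, MAXPGPATH,
             XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
             1, 0, 0x1000000, 0, 0x2000000);
    /* "pg_wal/summaries/0000000100000000010000000000000002000000.summary" */
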
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
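
Assuming the accompanying catalog changes expose these functions under the
same names, they can be exercised from SQL for debugging:

SELECT * FROM pg_available_wal_summaries();
SELECT * FROM pg_wal_summary_contents(1, '0/1000000', '0/2000000');
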
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54e9bfb8c4..0538b84ef8 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -251,6 +252,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -442,6 +444,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -561,6 +564,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1847,6 +1851,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2714,6 +2721,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3067,6 +3076,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3185,6 +3195,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3580,6 +3604,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3730,6 +3760,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3756,6 +3788,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3853,6 +3886,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4074,6 +4108,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5380,6 +5416,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5516,6 +5556,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..34bd254183
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
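+ *
+ * For example (see SummarizeWAL below), a summary on TLI 1 covering LSNs
+ * 0/1000028 through 0/2000000 is written to
+ * pg_wal/summaries/0000000100000000010000280000000002000000.summary.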
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of a WAL segment file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a new
+ * WAL summary file should be written as soon as possible, without trying
+ * to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
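+ *
+ * For example, with the default wal_summarize_mb of 256, a summary
+ * starting at 0/1000000 is cut off, at the latest, by the first record
+ * that ends at or after 0/11000000.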
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if
+ * it's just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
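+ *
+ * For example, if start_lsn is 0/2000000, a segment boundary, the first
+ * record on that segment actually begins at 0/2000028, just past the
+ * 40-byte long page header.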
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a WAL summary
+ * file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and if
+ * so, stop. This has the same purpose as the earlier check for the
+ * same condition above, but there we've just read a record and might
+ * decide against including it in the current summary file, whereas
+ * here we've already included it and might decide against reading the
+ * next one. Note that we may have just refreshed our notion of the
+ * redo pointer, so it's smart to check here before we do any more
+ * work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
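+ *
+ * For example, a sleep at sleep_quanta = 4 followed by a burst of ten
+ * page reads drops sleep_quanta back to 1 before the next sleep.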
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose retention time has expired, but only
+ * if the WAL they summarize has already been removed.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
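+ *
+ * For example, with the default wal_summarize_keep_time of one week
+ * (10080 minutes), the cutoff is 604800 seconds before now.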
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL doesn't exist any more, we can remove it if the file
+ * modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
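+ *
+ * We respond with a CopyInResponse and then consume the manifest as a
+ * stream of CopyData messages terminated by CopyDone. A hypothetical
+ * libpq client might drive this roughly as follows (error handling
+ * omitted):
+ *
+ * res = PQexec(conn, "UPLOAD_MANIFEST");
+ * ... check PQresultStatus(res) == PGRES_COPY_IN ...
+ * PQputCopyData(conn, manifest_data, manifest_len);
+ * PQputCopyEnd(conn, NULL);
+ * res = PQgetResult(conn);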
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore Flush/Sync for the convenience of client libraries. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index eb7d35d422..bd0a921a3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -292,7 +292,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -314,6 +315,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9c5fdeb3ca..17ad986c98 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -140,6 +141,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -160,6 +162,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -241,6 +244,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 1e671c560c..037111b89f 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 16ec6c5ef0..a532f57af1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
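+ /*
+ * Bits in "found" track which lines have been seen: 1 = START WAL
+ * LOCATION, 2 = START TIMELINE, 4 = INCREMENTAL FROM LSN, and
+ * 8 = INCREMENTAL FROM TLI.
+ */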
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, a pointer to the byte
+ * following the match is stored into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
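+ /*
+ * Temporarily NUL-terminate the line at e so that sscanf() cannot read
+ * past the end of it; the saved byte is restored just below.
+ */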
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
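
To make the label parsing above easier to check: LSNs here use the usual
%X/%X convention. A minimal standalone sketch of what parse_lsn computes,
not part of the patch and ignoring the temporary NUL-termination of the
line, would be:

    #include <stdint.h>
    #include <stdio.h>

    /* Returns the number of characters consumed, or 0 on failure. */
    static int
    sketch_parse_lsn(const char *s, uint64_t *lsn)
    {
        unsigned    hi;
        unsigned    lo;
        int         nchars;

        if (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) != 2)
            return 0;
        *lsn = ((uint64_t) hi) << 32 | lo;
        return nchars;
    }

    int
    main(void)
    {
        uint64_t    lsn;

        if (sketch_parse_lsn("0/2000028", &lsn) > 0)
            printf("%llX\n", (unsigned long long) lsn);
        return 0;
    }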
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
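
For context, callers are expected to drive copy_file() roughly as follows;
the paths are made up, and this just mirrors the intended pattern of copying
a file while collecting a checksum for the new manifest:

    pg_checksum_context ctx;
    uint8       payload[PG_CHECKSUM_MAX_LENGTH];
    int         len;

    if (pg_checksum_init(&ctx, CHECKSUM_TYPE_CRC32C) < 0)
        pg_fatal("could not initialize checksum context");

    /* copies the file, feeding every block through ctx */
    copy_file("x/base/1/1259", "z/base/1/1259", &ctx, false);

    /* checksum of the copied file, e.g. for add_file_to_manifest() */
    len = pg_checksum_final(&ctx, payload);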
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
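
To illustrate how the parsed data gets consumed, a lookup and a scan look
about like this; manifest_files_lookup and friends are the functions that
simplehash generates from SH_PREFIX above, the path is made up, and remember
that load_backup_manifest() returns NULL if the manifest is missing:

    manifest_data *mdata = load_backup_manifest("/path/to/backup");
    manifest_file *mfile;
    manifest_files_iterator it;
    manifest_wal_range *range;

    /* direct lookup of one file by pathname */
    mfile = manifest_files_lookup(mdata->files, "global/pg_control");

    /* scan every file mentioned in the manifest */
    manifest_files_start_iterate(mdata->files, &it);
    while ((mfile = manifest_files_iterate(mdata->files, &it)) != NULL)
    {
        /* examine mfile->size, mfile->checksum_type, and so on */
    }

    /* walk the WAL ranges in manifest order */
    for (range = mdata->first_wal_range; range != NULL; range = range->next)
    {
        /* use range->tli, range->start_lsn, range->end_lsn */
    }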
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..a6036dea74
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,35 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..32d2846433
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
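+ *
+ * For example, -T /old\=dir=/new maps the tablespace at "/old=dir" to
+ * "/new".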
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
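+ /*
+ * As we walk the chain backward, check_tli and check_lsn hold the start
+ * timeline and start LSN that the next (older) backup is expected to
+ * report, taken from the INCREMENTAL FROM fields of the backup examined
+ * just before it.
+ */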
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
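+ /*
+ * pg_check_dir() reports 0 if the directory does not exist, 1 if it
+ * exists and is empty, 2 through 4 if it exists and is non-empty in
+ * various ways, and -1 if the attempt to examine it failed.
+ */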
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (For example, if PG_VERSION contains "14\n", this
+ * function will return 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
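+ * They are updated as blocks are read during reconstruction, and are used
+ * afterwards by debug_reconstruction() for logging and for the dry-run
+ * sanity check on source file lengths.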
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
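+
+ /*
+ * For example, if block 7 of the output file is to be taken from the
+ * third block stored in some incremental file, then sourcemap[7] will
+ * point to that file's rfile and offsetmap[7] will be that file's
+ * header_length plus 2 * BLCKSZ.
+ */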
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
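+ *
+ * For example, if truncation_block_length is 10 and the incremental file
+ * contains block 12, the reconstructed file will be 13 blocks long, and any
+ * blocks not supplied by some source will be zero-filled when the file is
+ * written out.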
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
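+
+ /*
+ * To illustrate the layout implied by the reads above (assuming 4-byte
+ * unsigned and BlockNumber values, as on typical platforms): an
+ * incremental file storing blocks 3 and 17 with a truncation_block_length
+ * of 42 begins with the magic number, then 2, then 42, then the block
+ * numbers 3 and 17, giving a 20-byte header; the stored block images
+ * follow immediately afterward.
+ */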
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..3d9238f366
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+ # we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
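+
+/*
+ * In outline, the manifest produced by this writer has the form (values
+ * elided; see the functions below for exactly what is emitted):
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ *   "Files": [ { "Path": ..., "Size": ..., "Last-Modified": ...,
+ *     "Checksum-Algorithm": ..., "Checksum": ... }, ... ],
+ *   "WAL-Ranges": [ { "Timeline": ..., "Start-LSN": ..., "End-LSN": ... }, ... ],
+ *   "Manifest-Checksum": ... }
+ */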
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 04567f349d..c3b9e07841 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
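+
+/*
+ * WAL summary file names consist of 40 hexadecimal characters (a timeline
+ * ID and a pair of LSNs) followed by a ".summary" suffix, which is what
+ * the checks below look for.
+ */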
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index ff60666f5c..ebff20b1d3 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We may need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
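+
+/*
+ * Worked example: block number 200000 falls in chunk 200000 / 65536 = 3,
+ * at chunk offset 200000 % 65536 = 3392. In array form, the chunk simply
+ * stores the uint16 value 3392; in bitmap form, bit 3392 % 16 = 0 of
+ * uint16 word 3392 / 16 = 212 is set.
+ */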
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
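+
+/*
+ * For reference, the resulting on-disk layout is: a 4-byte magic number;
+ * for each entry, in sorted order, a BlockRefTableSerializedEntry followed
+ * by the truncated chunk_usage array (nchunks uint16s) and then the data
+ * for each non-empty chunk (chunk_usage[j] uint16s each); an all-zeroes
+ * BlockRefTableSerializedEntry as a sentinel; and a CRC-32C of everything
+ * preceding it.
+ */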
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
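
/*
 * [Editor's sketch, not part of the patch] Minimal example of driving the
 * reader API above. It assumes a caller-supplied io_callback_fn named
 * read_file_cb and a report_error_fn named report_fatal_error, both
 * hypothetical. Per the contract of BlockRefTableReaderNextRelation, all
 * block numbers are drained before advancing to the next relation fork.
 */
static void
dump_blkreftable(io_callback_fn read_file_cb, void *cb_arg,
                 report_error_fn report_fatal_error)
{
    BlockRefTableReader *reader;
    RelFileLocator rlocator;
    ForkNumber forknum;
    BlockNumber limit_block;
    BlockNumber blocks[256];

    reader = CreateBlockRefTableReader(read_file_cb, cb_arg,
                                       "example.summary",
                                       report_fatal_error, NULL);
    while (BlockRefTableReaderNextRelation(reader, &rlocator,
                                           &forknum, &limit_block))
    {
        unsigned    n;

        /* Drain all modified-block numbers for this relation fork. */
        while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 256)) > 0)
        {
            /* ... process blocks[0 .. n-1] here ... */
        }
    }
    DestroyBlockRefTableReader(reader);
}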
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer to be written to the underlying file,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
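
/*
 * [Editor's sketch, not part of the patch] End-to-end use of the in-memory
 * API: build a table, record some modified blocks and a truncation, then
 * serialize it through a caller-supplied write callback (write_file_cb is
 * hypothetical).
 */
static void
summarize_example(const RelFileLocator *rlocator,
                  io_callback_fn write_file_cb, void *cb_arg)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();

    /* Blocks 10 and 11 of the main fork were modified... */
    BlockRefTableMarkBlockModified(brtab, rlocator, MAIN_FORKNUM, 10);
    BlockRefTableMarkBlockModified(brtab, rlocator, MAIN_FORKNUM, 11);

    /* ...but a later truncation to 8 blocks makes us forget both. */
    BlockRefTableSetLimitBlock(brtab, rlocator, MAIN_FORKNUM, 8);

    WriteBlockRefTable(brtab, write_file_cb, cb_arg);
}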
diff --git a/src/common/meson.build b/src/common/meson.build
index fcc0c4fe8d..6e51257b1c 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 48ca852381..fed5d790cc 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -206,6 +206,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f0b7b9cbd8..f68e6d4987 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12062,4 +12062,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
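
/*
 * [Editor's sketch, not part of the patch] The incremental writer path:
 * standalone entries are created, populated, written in sorted order, and
 * freed, so the whole table never needs to be held in memory at once
 * (write_file_cb is hypothetical).
 */
static void
stream_one_entry(RelFileLocator rlocator, io_callback_fn write_file_cb,
                 void *cb_arg)
{
    BlockRefTableWriter *writer;
    BlockRefTableEntry *entry;

    writer = CreateBlockRefTableWriter(write_file_cb, cb_arg);

    entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
    BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 42);
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);

    DestroyBlockRefTableWriter(writer);
}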
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14bd574fc2..898adccb25 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -338,6 +338,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -443,6 +444,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -455,6 +457,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c72ba0944..353db33a9f 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de90c4958..ff3cff8c28 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3991,3 +3991,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v4-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch (application/octet-stream)
From 94280254510de2f50cee7a597158df7bf1fca16c Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:16 -0400
Subject: [PATCH v4 2/6] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, because the value is actually zero, or
because the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..677a5bf51b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8483,9 +8483,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8560,7 +8573,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..5549e1afc5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b126d9c890..b537f46219 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1416,8 +1417,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1486,8 +1487,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
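Restating the validation rule that 0002 introduces, outside the diff
context: a directory name is accepted only if it is a decimal integer
in [1, 2^32-1] with no leading zeroes and no trailing garbage. Here is
a minimal stand-alone sketch (illustrative only; the patch performs
the equivalent checks inline in do_pg_backup_start() and
read_tablespace_map()):

#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

typedef unsigned int Oid;

static bool
parse_tablespace_oid(const char *name, Oid *oid)
{
	unsigned long val;
	char	   *endp;

	/* First character must be 1-9: rejects "", "0", and leading zeroes. */
	if (name[0] < '1' || name[0] > '9')
		return false;
	errno = 0;
	val = strtoul(name, &endp, 10);
	/* Reject trailing garbage and values that don't fit in 32 bits. */
	if (*endp != '\0' || errno != 0 || val > 0xFFFFFFFFUL)
		return false;
	*oid = (Oid) val;
	return true;
}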
On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Clearly there's a good amount of stuff to sort out here, but we've
> still got quite a bit of time left before feature freeze so I'd like
> to have a go at it. Please let me know your thoughts, if you have any.
Apparently, nobody has any thoughts, but here's an updated patch set
anyway. The main change, other than rebasing, is that I did a bunch
more documentation work on the main patch (0005). I'm much happier
with it now, although I expect it may need more adjustments here and
there as outstanding design questions get settled.
After some thought, I think that it should be fine to commit 0001 and
0002 as independent refactoring patches, and I plan to go ahead and do
that pretty soon unless somebody objects.
Thanks,
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v5-0004-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From b9355ab0f9c9f10736a13e7730294b82fc374963 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v5 4/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 70884be00c..3c8effc533 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index ae05ac63cf..aa646f96a3 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index f0acd9f1e7..9895f2f17d 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From 5a3e4b4d41faa184f03cddf45f546de764eac6de Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v5 6/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {0}; /* ensure option flags start out false */
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
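As a usage sketch for the new tool (the summary file name below is
illustrative; real files live under $PGDATA/pg_wal/summaries and are
named $TLI${START_LSN}${END_LSN}.summary, all in hex):

pg_walsummary pg_wal/summaries/000000010000000028000D800000000028A0E1F8.summary

Given the printf formats above, output lines look like:

TS 1663, DB 5, REL 16384, FORK main: blocks 0..16
TS 1663, DB 5, REL 16384, FORK fsm: block 2

With --individual, each block is listed on its own line rather than as
a range; with --quiet, the files are parsed but nothing is printed.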
Attachment: v5-0003-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From ea3ed36f4767cc9d5bb3edf992f6cf291de59bbc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v5 3/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
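To make the new rule concrete: a file is checksum-verified only if its
name parses as a non-temporary relation filename, i.e. a relfilenumber
optionally followed by a fork and/or segment suffix, such as 16384,
16384.1, 16384_fsm, 16384_vm, or 16384_init. Anything else in those
directories - stray files, or entries like PG_VERSION and
pg_internal.init - is simply never checksum-verified, so the old
exclusion list becomes unnecessary.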
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0002-Change-struct-tablespaceinfo-s-oid-member-from-ch.patch (application/octet-stream)
From 8ecb8c9b5fd628b1e1df876518cc35973ce53518 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:16 -0400
Subject: [PATCH v5 2/6] Change struct tablespaceinfo's oid member from 'char
*' to 'Oid'
This shouldn't change behavior except in the unusual case where
there are files in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, because the value is actually zero, or
because the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
---
src/backend/access/transam/xlog.c | 19 ++++++++++--
src/backend/access/transam/xlogrecovery.c | 12 ++++++--
src/backend/backup/backup_manifest.c | 6 ++--
src/backend/backup/basebackup.c | 35 ++++++++++++-----------
src/backend/backup/basebackup_copy.c | 2 +-
src/include/backup/backup_manifest.h | 2 +-
src/include/backup/basebackup.h | 2 +-
7 files changed, 49 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e4ca5089..6c724745b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8502,9 +8502,22 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
char *relpath = NULL;
char *s;
PGFileType de_type;
+ char *badp;
+ Oid tsoid;
- /* Skip anything that doesn't look like a tablespace */
- if (strspn(de->d_name, "0123456789") != strlen(de->d_name))
+ /*
+ * Try to parse the directory name as an unsigned integer.
+ *
+ * Tablespace directories should be positive integers that can be
+ * represented in 32 bits, with no leading zeroes or trailing
+ * garbage. If we come across a name that doesn't meet those
+ * criteria, skip it.
+ */
+ if (de->d_name[0] < '1' || de->d_name[0] > '9')
+ continue;
+ errno = 0;
+ tsoid = strtoul(de->d_name, &badp, 10);
+ if (*badp != '\0' || errno == EINVAL || errno == ERANGE)
continue;
snprintf(fullpath, sizeof(fullpath), "pg_tblspc/%s", de->d_name);
@@ -8579,7 +8592,7 @@ do_pg_backup_start(const char *backupidstr, bool fast, List **tablespaces,
}
ti = palloc(sizeof(tablespaceinfo));
- ti->oid = pstrdup(de->d_name);
+ ti->oid = tsoid;
ti->path = pstrdup(linkpath);
ti->rpath = relpath;
ti->size = -1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..5549e1afc5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -678,7 +678,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
tablespaceinfo *ti = lfirst(lc);
char *linkloc;
- linkloc = psprintf("pg_tblspc/%s", ti->oid);
+ linkloc = psprintf("pg_tblspc/%u", ti->oid);
/*
* Remove the existing symlink if any and Create the symlink
@@ -692,7 +692,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
errmsg("could not create symbolic link \"%s\": %m",
linkloc)));
- pfree(ti->oid);
pfree(ti->path);
pfree(ti);
}
@@ -1341,6 +1340,8 @@ read_tablespace_map(List **tablespaces)
{
if (!was_backslash && (ch == '\n' || ch == '\r'))
{
+ char *endp;
+
if (i == 0)
continue; /* \r immediately followed by \n */
@@ -1360,7 +1361,12 @@ read_tablespace_map(List **tablespaces)
str[n++] = '\0';
ti = palloc0(sizeof(tablespaceinfo));
- ti->oid = pstrdup(str);
+ errno = 0;
+ ti->oid = strtoul(str, &endp, 10);
+ if (*endp != '\0' || errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("invalid data in file \"%s\"", TABLESPACE_MAP)));
ti->path = pstrdup(str + n);
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index cee6216524..aeed362a9a 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -97,7 +97,7 @@ FreeBackupManifest(backup_manifest_info *manifest)
* Add an entry to the backup manifest for a file.
*/
void
-AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
+AddFileToBackupManifest(backup_manifest_info *manifest, Oid spcoid,
const char *pathname, size_t size, pg_time_t mtime,
pg_checksum_context *checksum_ctx)
{
@@ -114,9 +114,9 @@ AddFileToBackupManifest(backup_manifest_info *manifest, const char *spcoid,
* pathname relative to the data directory (ignoring the intermediate
* symlink traversal).
*/
- if (spcoid != NULL)
+ if (OidIsValid(spcoid))
{
- snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%s/%s", spcoid,
+ snprintf(pathbuf, sizeof(pathbuf), "pg_tblspc/%u/%s", spcoid,
pathname);
pathname = pathbuf;
}
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b126d9c890..b537f46219 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -75,14 +75,15 @@ typedef struct
pg_checksum_type manifest_checksum_type;
} basebackup_options;
-static int64 sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
struct backup_manifest_info *manifest);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, const char *spcoid);
+ backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid);
+ struct stat *statbuf, bool missing_ok,
+ Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
@@ -305,7 +306,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, NULL);
+ true, NULL, InvalidOid);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
NULL);
@@ -346,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, NULL);
+ sendtblspclinks, &manifest, InvalidOid);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -355,11 +356,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, &manifest, NULL);
+ false, InvalidOid, InvalidOid, &manifest);
}
else
{
- char *archive_name = psprintf("%s.tar", ti->oid);
+ char *archive_name = psprintf("%u.tar", ti->oid);
bbsink_begin_archive(sink, archive_name);
@@ -623,8 +624,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
(errcode_for_file_access(),
errmsg("could not stat file \"%s\": %m", pathbuf)));
- sendFile(sink, pathbuf, pathbuf, &statbuf, false, InvalidOid,
- &manifest, NULL);
+ sendFile(sink, pathbuf, pathbuf, &statbuf, false,
+ InvalidOid, InvalidOid, &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1087,7 +1088,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
_tarWritePadding(sink, len);
- AddFileToBackupManifest(manifest, NULL, filename, len,
+ AddFileToBackupManifest(manifest, InvalidOid, filename, len,
(pg_time_t) statbuf.st_mtime, &checksum_ctx);
}
@@ -1099,7 +1100,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
* Only used to send auxiliary tablespaces, not PGDATA.
*/
static int64
-sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
+sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
backup_manifest_info *manifest)
{
int64 size;
@@ -1154,7 +1155,7 @@ sendTablespace(bbsink *sink, char *path, char *spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- const char *spcoid)
+ Oid spcoid)
{
DIR *dir;
struct dirent *de;
@@ -1416,8 +1417,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid,
- manifest, spcoid);
+ true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
+ manifest);
if (sent || sizeonly)
{
@@ -1486,8 +1487,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
*/
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
- struct stat *statbuf, bool missing_ok, Oid dboid,
- backup_manifest_info *manifest, const char *spcoid)
+ struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ backup_manifest_info *manifest)
{
int fd;
BlockNumber blkno = 0;
diff --git a/src/backend/backup/basebackup_copy.c b/src/backend/backup/basebackup_copy.c
index fee30c21e1..3bdbe1f989 100644
--- a/src/backend/backup/basebackup_copy.c
+++ b/src/backend/backup/basebackup_copy.c
@@ -407,7 +407,7 @@ SendTablespaceList(List *tablespaces)
}
else
{
- values[0] = ObjectIdGetDatum(strtoul(ti->oid, NULL, 10));
+ values[0] = ObjectIdGetDatum(ti->oid);
values[1] = CStringGetTextDatum(ti->path);
}
if (ti->size >= 0)
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index d41b439980..5a481dbcf5 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -39,7 +39,7 @@ extern void InitializeBackupManifest(backup_manifest_info *manifest,
backup_manifest_option want_manifest,
pg_checksum_type manifest_checksum_type);
extern void AddFileToBackupManifest(backup_manifest_info *manifest,
- const char *spcoid,
+ Oid spcoid,
const char *pathname, size_t size,
pg_time_t mtime,
pg_checksum_context *checksum_ctx);
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 3e68abc2bb..1432d9c206 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -27,7 +27,7 @@
*/
typedef struct
{
- char *oid; /* tablespace's OID, as a decimal string */
+ Oid oid; /* tablespace's OID */
char *path; /* full path to tablespace's directory */
char *rpath; /* relative path if it's within PGDATA, else
* NULL */
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0005-Prototype-patch-for-incremental-backup.patch (application/octet-stream)
From 7b150a8eb4ec49792535768ac1b5bafe42840c35 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v5 5/6] Prototype patch for incremental backup.
We don't differentiate between incremental and differential backups;
an incremental backup can be based either on a full backup or on a
previous incremental backup.
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
It writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block", which is
0 if the relation was created or destroyed within that range of WAL
records; otherwise, the shortest length to which the relation was
truncated during that range of WAL records; otherwise,
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied for an incremental backup covering that
range of WAL records.
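To make the limit-block rule concrete, here is a minimal sketch of the
folding logic (names and signature are illustrative, not the patch's
actual API):

#include <stdbool.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Fold one WAL-derived event for a relation fork into its limit block
 * for the summarized range: create/drop forces 0, truncation keeps the
 * shortest length seen, and InvalidBlockNumber means neither happened.
 */
static BlockNumber
fold_limit_block(BlockNumber limit, bool created_or_destroyed,
				 bool truncated, BlockNumber truncated_to)
{
	if (created_or_destroyed)
		return 0;
	if (truncated && (limit == InvalidBlockNumber || truncated_to < limit))
		return truncated_to;
	return limit;
}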
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
Open issues:
- Needs some rework once XLOG_CHECKPOINT_REDO patch is committed.
- How should we control generation and retention of summary files?
What should be the defaults?
- Needs to be tested on a standby.
- Should we send the whole backup manifest to the server or, say,
just an LSN?
- Should the timeout when waiting for WAL summaries be configurable?
- It would be nice (but not essential) to do something about incremental
JSON parsing.
- Might need more tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlog.c | 93 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 ++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1414 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 35 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 +++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
72 files changed, 8981 insertions(+), 70 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files, which contain
+ only the blocks that have changed since the earlier backup, plus enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
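+
+ <!-- Illustrative only (name made up): a WAL summary file on disk looks
+ like pg_wal/summaries/0000000100000000010000280000000001012088.summary,
+ i.e. an 8-hex-digit TLI followed by 16-hex-digit start and end LSNs. -->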
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
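As a concrete sketch of the incremental restore steps just documented (all
paths hypothetical), suppose the full backup has been unpacked into
/restore/full and a later incremental into /restore/incr, neither of which
is the target data directory:

pg_combinebackup /restore/full /restore/incr -o /var/lib/postgresql/data
touch /var/lib/postgresql/data/recovery.signal

The remaining steps of the usual procedure (clearing pg_wal, arranging WAL
retrieval, checking permissions and tablespace links) are unchanged.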
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 712568a62d..50536d0521 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
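+
+ <!-- For example (paths hypothetical), the short form
+ pg_basebackup -D incr -c fast -i /backups/full/backup_manifest
+ is equivalent to spelling the same option out in its long form. -->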
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during reconstruction.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. The search for files will follow
+ symbolic links for the WAL directory and each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. In that case,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
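To make the last point of the Description concrete (directory names
hypothetical), a synthetic full backup, once written, can stand in for the
chain that produced it:

pg_combinebackup full incr1 -o synthetic
pg_combinebackup synthetic incr2 -o datadir

Here incr2 is a later incremental backup whose chain of prior backups runs
through full and incr1.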
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c724745b5..d6f3cddfa0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3514,6 +3515,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3793,8 +3831,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3837,6 +3875,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5161,9 +5219,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6848,6 +6906,17 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * If there hasn't been much system activity in a while, the WAL
+ * summarizer may be sleeping for relatively long periods, which could
+ * delay an incremental backup that has started concurrently. In the hopes
+ * of avoiding that, poke the WAL summarizer here.
+ *
+ * Possibly this should instead be done at some earlier point in this
+ * function, but it's not clear that it matters much.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7522,6 +7591,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
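For illustration, the backup_label of an incremental backup now gains two
additional lines in the format produced above (values made up):

INCREMENTAL FROM LSN: 0/2000028
INCREMENTAL FROM TLI: 1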
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5549e1afc5..89ddec5bf9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
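The incremental-file layout emitted by sendFile() and push_to_sink() above is
simple enough to parse outside the server. Below is a minimal standalone
sketch of a header reader; it is not part of the patch (the real consumer is
pg_combinebackup's reconstruct.c), and checking the magic number against
INCREMENTAL_MAGIC is left as a comment since its value isn't shown here:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read the header of an incremental relation file: a 4-byte magic number,
 * the count of included blocks, the truncation block length, and one
 * 4-byte block number per included block. BLCKSZ bytes of data for each
 * listed block follow the header.
 */
int
read_incremental_header(FILE *f, uint32_t *num_blocks,
                        uint32_t *truncation_block_length,
                        uint32_t **block_numbers)
{
    uint32_t    magic;

    if (fread(&magic, sizeof(magic), 1, f) != 1 ||
        fread(num_blocks, sizeof(*num_blocks), 1, f) != 1 ||
        fread(truncation_block_length,
              sizeof(*truncation_block_length), 1, f) != 1)
        return -1;
    /* A real reader would verify magic against INCREMENTAL_MAGIC here. */

    if (*num_blocks == 0)
    {
        *block_numbers = NULL;
        return 0;
    }
    *block_numbers = malloc(sizeof(uint32_t) * *num_blocks);
    if (*block_numbers == NULL)
        return -1;
    if (fread(*block_numbers, sizeof(uint32_t), *num_blocks, f) != *num_blocks)
    {
        free(*block_numbers);
        return -1;
    }
    return 0;
}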
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
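+
+/*
+ * Illustrative arithmetic for the comment above: at roughly 1 bit per
+ * block, a table covering every block of a 1TB cluster (2^27 blocks of
+ * 8kB each) converges to only about 16MB, so the per-block cost is
+ * modest; the pressure comes mainly from per-relation overhead when a
+ * cluster contains very many relations.
+ */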
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
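+
+/*
+ * For reference, the manifest fragments consumed via json_parse_manifest()
+ * above look roughly like this (abbreviated; values made up):
+ *
+ *   "Files": [
+ *     { "Path": "global/pg_control", "Size": 8192 }
+ *   ],
+ *   "WAL-Ranges": [
+ *     { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/9000060" }
+ *   ]
+ */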
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
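+ *
+ * For example (with hypothetical OIDs), segment 3 of a relation stored at
+ * "base/16384/16385" is sent as "base/16384/INCREMENTAL.16385.3", and
+ * segment 0 as "base/16384/INCREMENTAL.16385".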
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
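+ *
+ * For example, if this segment currently contains 100 blocks but the
+ * limit_block implies that it was truncated to 40 blocks past the start
+ * of the segment at some point after the prior backup, the truncation
+ * block length is 40: blocks 40..99 appear in the reconstructed file
+ * only if this backup includes them.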
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
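+ *
+ * As a worked example, with the default 8192-byte BLCKSZ and 4-byte
+ * BlockNumber, a two-block incremental file occupies
+ * 3 * 4 + 2 * (4 + 8192) = 16404 bytes.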
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
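+ /*
+ * appendStringInfoVA() returns 0 on success; otherwise it reports how
+ * much space is needed, so enlarge the buffer and retry until the
+ * message fits.
+ */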
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
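+ *
+ * For example (with hypothetical LSNs), summaries covering
+ * 0/1000000-0/2000000 and 0/2000000-0/3000000 prove completeness for the
+ * range 0/1000000 to 0/3000000; but if the second summary instead began
+ * at 0/2800000, *missing_lsn would be set to 0/2000000.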
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
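+ *
+ * Such names consist of 40 hexadecimal digits -- an 8-digit TLI followed
+ * by 16-digit start and end LSNs -- plus a ".summary" suffix, e.g. the
+ * hypothetical "0000000100000000010000280000000001FFE350.summary".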
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
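+
+ /* Enlarge the buffer and retry until the formatted message fits. */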
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
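+ *
+ * A sketch of the intended usage (result columns are illustrative):
+ *
+ * SELECT * FROM pg_available_wal_summaries();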
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
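+ *
+ * Intended usage is along these lines (hypothetical argument values):
+ *
+ * SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1FFE350');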
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cb624eab8..86f6cf2feb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1833,6 +1837,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2664,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3019,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3138,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3547,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3703,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3731,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3829,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4051,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5400,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5540,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..34bd254183
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1414 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ *
+ * switch_requested can be set to true to notify the summarizer that a new
+ * WAL summary file should be written as soon as possible, without trying
+ * to read more WAL first.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+ bool switch_requested;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+ XLogRecPtr redo_pointer;
+ bool redo_pointer_reached;
+ XLogRecPtr redo_pointer_refresh_lsn;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * one minute (300 * 200 = 60 * 1000).
+ */
+#define MAX_SLEEP_QUANTA 300
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
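+/* Soft limit on the WAL covered by one summary file, in MB; 0 disables. */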
+int wal_summarize_mb = 256;
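+/* Retention time for summary files; 7 * 24 * 60 suggests minutes (one week). */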
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->switch_requested = false;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + (uint64) wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->switch_requested = false;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the newly initialized values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ * If it hasn't, but the in-memory value has reached the target value,
+ * request that a file be written as soon as possible.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (summarized_lsn < lsn &&
+ WalSummarizerCtl->pending_lsn >= lsn)
+ WalSummarizerCtl->switch_requested = true;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /*
+ * Limit the sleep to 1 second, because we may need to request a
+ * switch.
+ */
+ if (remaining_timeout > 1000)
+ remaining_timeout = 1000;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
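+ /*
+ * Exit if a shutdown was requested, or if WAL summarization has been
+ * disabled by setting wal_summarize_mb to 0.
+ */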
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = start_lsn;
+ private_data->redo_pointer_reached =
+ (start_lsn >= private_data->redo_pointer);
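+
+ /*
+ * redo_pointer_reached drives the heuristic, used in the main loop below,
+ * of ending summary files at checkpoint redo pointers: once the reader
+ * has passed the most recently observed redo pointer, we stop checking
+ * for it until our notion of the redo pointer is next refreshed.
+ */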
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool switch_requested;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /*
+ * We attempt, on a best effort basis only, to make WAL summary file
+ * boundaries line up with checkpoint cycles. So, if the last redo
+ * pointer we've seen was in the future, and this record starts at
+ * that redo pointer, stop before processing and let it be included in
+ * the next summary file.
+ *
+ * Note that in the case of a checkpoint triggered by a backup, the
+ * redo pointer is likely to be pointing to the first record on a
+ * page. Before reading the record, xlogreader->EndRecPtr will have
+ * pointed to the start of the page, which precedes the redo LSN. But
+ * after reading the next record, we'll advance over the page header
+ * and realize that the next record starts at the redo LSN exactly,
+ * making this the first point at which we can realize that it's time
+ * to stop.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->ReadRecPtr >= private_data->redo_pointer)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ default:
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /*
+ * Also update shared memory, and handle any request for a WAL summary
+ * file switch.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ switch_requested = WalSummarizerCtl->switch_requested;
+ LWLockRelease(WALSummarizerLock);
+ if (switch_requested)
+ break;
+
+ /*
+ * Periodically update our notion of the redo pointer, because it
+ * might be changing concurrently. There's no interlocking here: we
+ * might race past the new redo pointer before we learn about it.
+ * That's OK; we only use the redo pointer as a heuristic for where to
+ * stop summarizing.
+ *
+ * It would be nice if we could just fetch the updated redo pointer on
+ * every pass through this loop, but that seems a bit too expensive:
+ * GetRedoRecPtr acquires a heavily-contended spinlock. So, instead,
+ * just fetch the updated value if we've just had to sleep, or if
+ * we've read more than a segment's worth of WAL without sleeping.
+ */
+ if (private_data->waited || xlogreader->EndRecPtr >
+ private_data->redo_pointer_refresh_lsn + wal_segment_size)
+ {
+ private_data->redo_pointer = GetRedoRecPtr();
+ private_data->redo_pointer_refresh_lsn = xlogreader->EndRecPtr;
+ private_data->redo_pointer_reached =
+ (xlogreader->EndRecPtr >= private_data->redo_pointer);
+ }
+
+ /*
+ * Recheck whether we've just caught up with the redo pointer, and if
+ * so, stop. This has the same purpose as the earlier check for the
+ * same condition above, but there we've just read a record and might
+ * decide against including it in the current summary file, whereas
+ * here we've already included it and might decide against reading the
+ * next one. Note that we may have just refreshed our notion of the
+ * redo pointer, so it's smart to check here before we do any more
+ * work.
+ */
+ if (!private_data->redo_pointer_reached &&
+ xlogreader->EndRecPtr >= private_data->redo_pointer)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
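+
+ /*
+ * For example, a summary on timeline 1 covering 0/1000028 through
+ * 0/2000100 would be named
+ * 0000000100000000010000280000000002000100.summary.
+ */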
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
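+ *
+ * Commit and abort records can enumerate relation forks that are to be
+ * unlinked; once a relation's files are gone, any block modifications
+ * previously recorded for it are no longer interesting, so we reset the
+ * limit block for every fork (except the FSM) to 0.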
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * at least one full block available; read only that block, and have
+ * the caller come back if more is needed.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
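+
+ /*
+ * For example, if sleep_quanta is 8 and we read 4 pages since the last
+ * sleep, sleep_quanta drops to 4; reading 8 or more pages would instead
+ * reset it to the minimum of 1.
+ */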
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files that are older than wal_summarize_keep_time, but
+ * only bother to check once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
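+ * Note that wal_summarize_keep_time is expressed in minutes, hence the
+ * conversion to seconds here.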
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the summarized range of WAL no longer exists on disk, we can
+ * remove the summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether or not we removed the file, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
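+ *
+ * The manifest is transferred using the COPY protocol: we send the client
+ * a CopyInResponse, then absorb CopyData messages until CopyDone.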
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere while receiving COPY data. */
+ return true;
+
+ case 'f': /* CopyFail */
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 182d666852..94e7944748 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4c58574166..faf42bdbfb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
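+ /* bits: 1 = start LSN, 2 = start TLI, 4 = previous LSN, 8 = previous TLI */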
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
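+ *
+ * For example, line_starts_with(s, e, "START TIMELINE: ", &s) tests whether
+ * the line begins with that prefix and, on success, advances s past it.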
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
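+ * For example, "0/1C000028" yields the LSN 0x000000001C000028.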
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
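+ * For example, "1\n" yields TLI 1, while "1 \n" fails because the first
+ * character after the number is not a newline.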
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
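+ /*
+ * In this version of the patch, only Windows defines a special
+ * strategy; on other platforms we fall through to the block copy
+ * below.
+ */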
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
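+ * For example, a 25MB manifest produces an estimate of roughly 250,000
+ * hash table entries; the estimate is clamped below to 256 when the table
+ * is created.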
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..a6036dea74
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,35 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..32d2846433
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
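+/* e.g., an input file named "INCREMENTAL.16384" is reconstructed as "16384" */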
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+static cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
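+ * For example, if three directories are given, the first two are the
+ * prior backups, so n_prior_backups is 2.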
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
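+ *
+ * For example, -T /srv/old=/srv/new maps /srv/old to /srv/new, while a
+ * "\=" sequence in either path is copied as a literal equals sign.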
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
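+ *
+ * Concretely, each incremental backup's INCREMENTAL FROM TLI/LSN must match
+ * the START TIMELINE/WAL LOCATION of the backup immediately before it in
+ * the chain, and only the first backup may be a full backup.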
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ Oid oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
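+ *
+ * For example, with tsoid 16385 and relative_path "dir", a file "xyz" in
+ * this directory is looked up in the manifest as "pg_tblspc/16385/dir/xyz".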
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
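+ *
+ * For example, if truncation_block_length is 4 and the incremental file
+ * stores blocks 2 and 7, the reconstructed file must be 8 blocks long.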
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+	ssize_t		rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+		ssize_t		wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+			ssize_t		rb;
+
+			/* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
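
To make the reader logic above easier to review: an incremental file, as
consumed by make_incremental_rfile() and the offsetmap computation, is just a
fixed header (magic number, block count, truncation block length, and the
array of relative block numbers) followed by the stored blocks in the same
order. Here is a sketch of the offset arithmetic, not part of the patch,
assuming the patch's definitions of BlockNumber and BLCKSZ:

    /* Offset of the i-th stored block within an incremental file. */
    static off_t
    incremental_block_offset(unsigned num_blocks, unsigned i)
    {
        size_t      header_length = 3 * sizeof(uint32) +
            num_blocks * sizeof(BlockNumber);

        return (off_t) header_length + (off_t) i * BLCKSZ;
    }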
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif							/* RECONSTRUCT_H */
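
As a usage sketch only -- the variable names below are placeholders, and the
real call site is elsewhere in the patch series:

    int         checksum_length = 0;
    uint8      *checksum_payload = NULL;

    reconstruct_from_incremental_file(input_path, output_path,
                                      relative_path, bare_file_name,
                                      n_prior_backups, prior_backup_dirs,
                                      manifests, manifest_path,
                                      CHECKSUM_TYPE_SHA256,
                                      &checksum_length, &checksum_payload,
                                      dry_run);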
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..3d9238f366
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+	'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+	'-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
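
pg_combinebackup drives this API roughly as follows; this is just a sketch,
with the loop over output files and the variables feeding each call omitted:

    manifest_writer *mwriter = create_manifest_writer(output_directory);

    /* ... for each file written into the synthetic full backup ... */
    add_file_to_manifest(mwriter, manifest_path, size, mtime,
                         checksum_type, checksum_length, checksum_payload);

    /* ... once the last file has been added ... */
    finalize_manifest(mwriter, first_wal_range);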
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
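+	/* 40 = 8 hex digits of timeline ID plus 16 each for start and end LSN */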
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index 3c8effc533..2b41dd1839 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
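+ *
+ * For example, block 70000 of a given fork falls in chunk 1 (70000 / 65536)
+ * at offset 4464 (70000 % 65536). It is recorded either as the 2-byte value
+ * 4464 in that chunk's offset array or, once the chunk has been converted,
+ * as bit 4464 of its bitmap.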
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+	 * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
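+ *
+ * For example, passing start_blkno = 0 and stop_blkno = RELSEG_SIZE
+ * retrieves the modified blocks that fall within a relation's first
+ * segment file.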
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+		if (chunkno == stop_chunkno - 1)
+			stop_offset = Min(stop_blkno - (chunkno * BLOCKS_PER_CHUNK),
+							  BLOCKS_PER_CHUNK);
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+			for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
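+ *
+ * The resulting file layout is: the magic number; for each relation fork,
+ * a BlockRefTableSerializedEntry followed by that entry's chunk usage
+ * array and the contents of each nonempty chunk; an all-zeroes sentinel
+ * entry; and finally a CRC of everything that precedes it.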
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
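+ *
+ * A sketch of that workflow (my_write_cb and the rlocator, forknum,
+ * limit_block, and blknum values are placeholders):
+ *
+ *	writer = CreateBlockRefTableWriter(my_write_cb, arg);
+ *	entry = CreateBlockRefTableEntry(rlocator, forknum);
+ *	BlockRefTableEntrySetLimitBlock(entry, limit_block);
+ *	BlockRefTableEntryMarkBlockModified(entry, forknum, blknum);
+ *	BlockRefTableWriteEntry(writer, entry);
+ *	BlockRefTableFreeEntry(entry);
+ *	DestroyBlockRefTableWriter(writer);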
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
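+ *
+ * For example, if the WAL being summarized shows this fork truncated to
+ * 100 blocks, passing limit_block = 100 discards any recorded
+ * modifications to blocks 100 and above.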
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
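+ * (For example, with nchunks = 16 and chunkno = 40, max_chunks ends up
+ * as 64.)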
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
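+ * (A usage count equal to MAX_ENTRIES_PER_CHUNK is what marks a chunk
+ * as a bitmap, so an array must never be allowed to reach that count.)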
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index aa646f96a3..6348d60ec4 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..9d1e4ab57b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c92d0631a0..9717c4630e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12071,4 +12071,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
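+ *
+ * For example, after a truncation to 100 blocks, the limit block number
+ * is 100: modifications to blocks 0..99 are tracked explicitly, while
+ * blocks 100 and above are implicitly treated as modified.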
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
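+
+/*
+ * Sketch of a conforming write callback; my_write_all is a placeholder
+ * that either writes all 'length' bytes (retrying short writes) or
+ * reports an error and does not return:
+ *
+ *	static int
+ *	my_write_cb(void *callback_arg, void *data, int length)
+ *	{
+ *		my_write_all(callback_arg, data, length);
+ *		return length;
+ *	}
+ */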
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7232b03e37..042fdc6ca1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..ad11be4664 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 480e6d6caa..a91437dfa7 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e69bb671bf..2ae238bf81 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3992,3 +3992,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v5-0001-Refactor-parse_filename_for_nontemp_relation-to-p.patch (application/octet-stream)
From 22ffb20913efc6d9ffbe77b78444aeb8ba54217b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:30:44 -0400
Subject: [PATCH v5 1/6] Refactor parse_filename_for_nontemp_relation to parse
more.
Instead of returning the number of characters in the RelFileNumber,
return the RelFileNumber itself. Continue to return the fork number,
as before, and additionally return the segment number.
parse_filename_for_nontemp_relation now rejects a RelFileNumber or
segment number that begins with a leading zero. Before, we accepted
such cases as relation filenames, but if we continued to do so after
this change, the function might return the same values for two
different files (e.g. 1234.5 and 001234.5 or 1234.005) which could be
annoying for callers. Since we don't actually ever generate filenames
with leading zeroes in the names, any such files that we find must
have been created by something other than PostgreSQL, and it is
therefore reasonable to treat them as non-relation files.
Along the way, change unlogged_relation_entry to store a RelFileNumber
rather than an OID. This update should have been made in
851f4cc75cdd8c831f1baa9a7abf8c8248b65890, but it was overlooked.
It's trivial to make the update as part of this commit, perhaps more
trivial than it would have been without it, so do that.
---
src/backend/backup/basebackup.c | 15 ++--
src/backend/storage/file/reinit.c | 137 ++++++++++++++++++------------
src/include/storage/reinit.h | 5 +-
3 files changed, 93 insertions(+), 64 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 7d025bcf38..b126d9c890 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1197,9 +1197,9 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
{
int excludeIdx;
bool excludeFound;
- ForkNumber relForkNum; /* Type of fork if file is a relation */
- int relnumchars; /* Chars in filename that are the
- * relnumber */
+ RelFileNumber relNumber;
+ ForkNumber relForkNum;
+ unsigned segno;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1249,23 +1249,20 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &relForkNum, &segno))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
{
char initForkFile[MAXPGPATH];
- char relNumber[OIDCHARS + 1];
/*
* If any other type of fork, check if there is an init fork
* with the same RelFileNumber. If so, the file can be
* excluded.
*/
- memcpy(relNumber, de->d_name, relnumchars);
- relNumber[relnumchars] = '\0';
- snprintf(initForkFile, sizeof(initForkFile), "%s/%s_init",
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
path, relNumber);
if (lstat(initForkFile, &statbuf) == 0)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..5df2517b46 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
- Oid reloid; /* hash key */
+ RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
/*
@@ -195,12 +195,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -208,10 +209,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the RelFileNumber into the hash table, if it isn't already.
*/
- ent.reloid = atooid(de->d_name);
(void) hash_search(hash, &ent, HASH_ENTER, NULL);
}
@@ -235,12 +234,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
+ unsigned segno;
unlogged_relation_entry ent;
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name,
+ &ent.relnumber,
+ &forkNum, &segno))
continue;
/* We never remove the init fork. */
@@ -251,7 +251,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
if (hash_search(hash, &ent, HASH_FIND, NULL))
{
snprintf(rm_path, sizeof(rm_path), "%s/%s",
@@ -285,14 +284,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ RelFileNumber relNumber;
+ unsigned segno;
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -304,11 +303,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspacedirname, de->d_name);
/* Construct destination pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(dstpath, sizeof(dstpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
/* OK, we're ready to perform the actual copy. */
elog(DEBUG2, "copying %s to %s", srcpath, dstpath);
@@ -327,14 +327,14 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
+ RelFileNumber relNumber;
ForkNumber forkNum;
- int relnumchars;
- char relnumbuf[OIDCHARS + 1];
+ unsigned segno;
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relNumber,
+ &forkNum, &segno))
continue;
/* Also skip it unless this is the init fork. */
@@ -342,11 +342,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
continue;
/* Construct main fork pathname. */
- memcpy(relnumbuf, de->d_name, relnumchars);
- relnumbuf[relnumchars] = '\0';
- snprintf(mainpath, sizeof(mainpath), "%s/%s%s",
- dbspacedirname, relnumbuf, de->d_name + relnumchars + 1 +
- strlen(forkNames[INIT_FORKNUM]));
+ if (segno == 0)
+ snprintf(mainpath, sizeof(mainpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(mainpath, sizeof(mainpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
fsync_fname(mainpath, false);
}
@@ -371,52 +372,82 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* This function returns true if the file appears to be in the correct format
* for a non-temporary relation and false otherwise.
*
- * NB: If this function returns true, the caller is entitled to assume that
- * *relnumchars has been set to a value no more than OIDCHARS, and thus
- * that a buffer of OIDCHARS+1 characters is sufficient to hold the
- * RelFileNumber portion of the filename. This is critical to protect against
- * a possible buffer overrun.
+ * If it returns true, it sets *relnumber, *fork, and *segno to the values
+ * extracted from the filename. If it returns false, these values are set to
+ * InvalidRelFileNumber, InvalidForkNumber, and 0, respectively.
*/
bool
-parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
+ ForkNumber *fork, unsigned *segno)
{
- int pos;
+ unsigned long n,
+ s;
+ ForkNumber f;
+ char *endp;
- /* Look for a non-empty string of digits (that isn't too long). */
- for (pos = 0; isdigit((unsigned char) name[pos]); ++pos)
- ;
- if (pos == 0 || pos > OIDCHARS)
+ *relnumber = InvalidRelFileNumber;
+ *fork = InvalidForkNumber;
+ *segno = 0;
+
+ /*
+ * Relation filenames should begin with a digit that is not a zero. By
+ * rejecting cases involving leading zeroes, the caller can assume that
+ * there's only one possible string of characters that could have produced
+ * any given value for *relnumber.
+ *
+ * (To be clear, we don't expect files with names like 0017.3 to exist at
+ * all -- but if 0017.3 does exist, it's a non-relation file, not part of
+ * the main fork for relfilenode 17.)
+ */
+ if (name[0] < '1' || name[0] > '9')
+ return false;
+
+ /*
+ * Parse the leading digit string. If the value is out of range, we
+ * conclude that this isn't a relation file at all.
+ */
+ errno = 0;
+ n = strtoul(name, &endp, 10);
+ if (errno || name == endp || n <= 0 || n > PG_UINT32_MAX)
return false;
- *relnumchars = pos;
+ name = endp;
/* Check for a fork name. */
- if (name[pos] != '_')
- *fork = MAIN_FORKNUM;
+ if (*name != '_')
+ f = MAIN_FORKNUM;
else
{
int forkchar;
- forkchar = forkname_chars(&name[pos + 1], fork);
+ forkchar = forkname_chars(name + 1, &f);
if (forkchar <= 0)
return false;
- pos += forkchar + 1;
+ name += forkchar + 1;
}
/* Check for a segment number. */
- if (name[pos] == '.')
+ if (*name != '.')
+ s = 0;
+ else
{
- int segchar;
+ /* Reject leading zeroes, just like we do for RelFileNumber. */
+ if (name[1] < '1' || name[1] > '9')
+ return false;
- for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
- ;
- if (segchar <= 1)
+ errno = 0;
+ s = strtoul(name + 1, &endp, 10);
+ if (errno || name + 1 == endp || s <= 0 || s > PG_UINT32_MAX)
return false;
- pos += segchar;
+ name = endp;
}
/* Now we should be at the end. */
- if (name[pos] != '\0')
+ if (*name != '\0')
return false;
+
+ /* Set out parameters and return. */
+ *relnumber = (RelFileNumber) n;
+ *fork = f;
+ *segno = (unsigned) s;
return true;
}
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..f8eb7ce234 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,8 +20,9 @@
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
- int *relnumchars,
- ForkNumber *fork);
+ RelFileNumber *relnumber,
+ ForkNumber *fork,
+ unsigned *segno);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
--
2.37.1 (Apple Git-137.1)
On 10/19/23 12:05, Robert Haas wrote:
On Wed, Oct 4, 2023 at 4:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
Clearly there's a good amount of stuff to sort out here, but we've
still got quite a bit of time left before feature freeze so I'd like
to have a go at it. Please let me know your thoughts, if you have any.

Apparently, nobody has any thoughts, but here's an updated patch set
anyway. The main change, other than rebasing, is that I did a bunch
more documentation work on the main patch (0005). I'm much happier
with it now, although I expect it may need more adjustments here and
there as outstanding design questions get settled.

After some thought, I think that it should be fine to commit 0001 and
0002 as independent refactoring patches, and I plan to go ahead and do
that pretty soon unless somebody objects.
0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), e.g.:
+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);
If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.
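
For concreteness, such a helper might look something like this (just a
sketch; the name is made up):

static void
unlogged_relation_path(char *buf, size_t bufsz, const char *dbspacedirname,
                       RelFileNumber relNumber, unsigned segno)
{
    if (segno == 0)
        snprintf(buf, bufsz, "%s/%u", dbspacedirname, relNumber);
    else
        snprintf(buf, bufsz, "%s/%u.%u", dbspacedirname, relNumber, segno);
}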
0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.
Regards,
-David
On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote:
0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);

If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.
Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.
0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.
Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."
I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
On Wed, Oct 4, 2023 at 10:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Oct 3, 2023 at 2:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
Here's a new patch set, also addressing Jakub's observation that
MINIMUM_VERSION_FOR_WAL_SUMMARIES needed updating.

Here's yet another new version. [..]
Okay, so more good news - related to patch version #4.
A not-so-tiny stress test consisting of a pgbench run for 24h straight
(with incremental backups every 2h, based on the initial full backup),
followed by two PITRs (one not using incremental backup and one using
it, to illustrate the performance point - and potentially spot any
errors in between). In both cases it worked fine. Pgbench has this
behaviour that it doesn't cause space growth over time - it produces
lots of WAL instead. Some stats:
START DBSIZE: ~3.3GB (pgbench -i -s 200 --partitions=8)
END DBSIZE: ~4.3GB
RUN DURATION: 24h (pgbench -P 1 -R 100 -T 86400)
WALARCHIVES-24h: 77GB
FULL-DB-BACKUP-SIZE: 3.4GB
INCREMENTAL-BACKUP-11-SIZE: 3.5GB
Env: Azure VM D4s (4VCPU), Debian 11, gcc 10.2, normal build (asserts
and debug disabled)
The incremental backups were taken every 2h just to see if they would fail for
any reason - they did not.
PITR RTO RESULTS (copy/pg_combinebackup time + recovery time):
1. time to restore from fullbackup (+ recovery of 24h WAL[77GB]): 53s
+ 4640s =~ 78min
2. time to restore from fullbackup+incremental backup from 2h ago (+
recovery of 2h WAL [5.4GB]): 68s + 190s =~ 4min18s
I could probably pre-populate the DB with 1TB of cold data (not touched
by pgbench at all), just for the sake of argument, and that
would demonstrate how space-efficient the incremental backup can be,
but most of the time would be wasted on
copying the 1TB here...
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week.
I've just noticed one thing while recovery was in progress: is
summarization working during recovery - in the background - an
expected behaviour? I'm wondering about that because, after the DB has
been freshly restored and recovered, one would need to create a *new*
full backup, and only from that point would new summaries be of any use?
Sample log:
2023-10-20 11:10:02.288 UTC [64434] LOG: restored log file
"000000010000000200000022" from archive
2023-10-20 11:10:02.599 UTC [64434] LOG: restored log file
"000000010000000200000023" from archive
2023-10-20 11:10:02.769 UTC [64446] LOG: summarized WAL on TLI 1 from
2/139B1130 to 2/239B1518
2023-10-20 11:10:02.923 UTC [64434] LOG: restored log file
"000000010000000200000024" from archive
2023-10-20 11:10:03.193 UTC [64434] LOG: restored log file
"000000010000000200000025" from archive
2023-10-20 11:10:03.345 UTC [64432] LOG: restartpoint starting: wal
2023-10-20 11:10:03.407 UTC [64446] LOG: summarized WAL on TLI 1 from
2/239B1518 to 2/25B609D0
2023-10-20 11:10:03.521 UTC [64434] LOG: restored log file
"000000010000000200000026" from archive
2023-10-20 11:10:04.429 UTC [64434] LOG: restored log file
"000000010000000200000027" from archive
- On a related note, I haven't yet tested this on a standby, which is
a thing that I definitely need to do. I don't know of a reason why it
shouldn't be possible for all of this machinery to work on a standby
just as it does on a primary, but then we need the WAL summarizer to
run there too, which could end up being a waste if nobody ever tries
to take an incremental backup. I wonder how that should be reflected
in the configuration. We could do something like what we've done for
archive_mode, where on means "only on if this is a primary" and you
have to say always if you want it to run on standbys as well ... but
I'm not sure if that's a design pattern that we really want to
replicate into more places. I'd be somewhat inclined to just make
whatever configuration parameters we need to configure this thing on
the primary also work on standbys, and you can set each server up as
you please. But I'm open to other suggestions.
I'll try to play with some standby restores in the future, stay tuned.
Regards,
-J.
On 10/19/23 16:00, Robert Haas wrote:
On Thu, Oct 19, 2023 at 3:18 PM David Steele <david@pgmasters.net> wrote:
0001 looks pretty good to me. The only thing I find a little troublesome
is the repeated construction of file names with/without segment numbers
in ResetUnloggedRelationsInDbspaceDir(), e.g.:

+ if (segno == 0)
+ snprintf(dstpath, sizeof(dstpath), "%s/%u",
+ dbspacedirname, relNumber);
+ else
+ snprintf(dstpath, sizeof(dstpath), "%s/%u.%u",
+ dbspacedirname, relNumber, segno);

If this happened three times I'd definitely want a helper function, but
even with two I think it would be a bit nicer.

Personally I think that would make the code harder to read rather than
easier. I agree that repeating code isn't great, but this is a
relatively brief idiom and pretty self-explanatory. If other people
agree with you I can change it, but to me it's not an improvement.
Then I'm fine with it as is.
0002 is definitely a good idea. FWIW pgBackRest does this conversion but
also errors if it does not succeed. We have never seen a report of this
error happening in the wild, so I think it must be pretty rare if it
does happen.

Cool, but ... how about the main patch set? It's nice to get some of
these refactoring bits and pieces out of the way, but if I spend the
effort to work out what I think are the right answers to the remaining
design questions for the main patch set and then find out after I've
done all that that you have massive objections, I'm going to be
annoyed. I've been trying to get this feature into PostgreSQL for
years, and if I don't succeed this time, I want the reason to be
something better than "well, I didn't find out that David disliked X
until five minutes before I was planning to type 'git push'."
I simply have not had time to look at the main patch set in any detail.
I'm not really concerned about detailed bug-hunting in the main
patches just yet. The time for that will come. But if you have views
on how to resolve the design questions that I mentioned in a couple of
emails back, or intend to advocate vigorously against the whole
concept for some reason, let's try to sort that out sooner rather than
later.
In my view this feature puts the cart way before the horse. I'd think
higher priority features might be parallelism, a backup repository,
expiration management, archiving, or maybe even a restore command.
It seems the only goal here is to make pg_basebackup a tool for external
backup software to use, which might be OK, but I don't believe this
feature really advances pg_basebackup as a usable piece of stand-alone
software. If people really think that start/stop backup is too
complicated an interface, how are they supposed to track page
incrementals and get them to a place where pg_combinebackup can put them
back together? If automation is required to use this feature,
shouldn't pg_basebackup implement that automation?
I have plenty of thoughts about the implementation as well, but I have a
lot on my plate right now and I don't have time to get into it.
I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.
Regards,
-David
On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote:
Then I'm fine with it as is.
OK, thanks.
In my view this feature puts the cart way before the horse. I'd think
higher priority features might be parallelism, a backup repository,
expiration management, archiving, or maybe even a restore command.

It seems the only goal here is to make pg_basebackup a tool for external
backup software to use, which might be OK, but I don't believe this
feature really advances pg_basebackup as a usable piece of stand-alone
software. If people really think that start/stop backup is too
complicated an interface, how are they supposed to track page
incrementals and get them to a place where pg_combinebackup can put them
back together? If automation is required to use this feature,
shouldn't pg_basebackup implement that automation?

I have plenty of thoughts about the implementation as well, but I have a
lot on my plate right now and I don't have time to get into it.

I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.
Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize. But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me. I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.
At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions. I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances. Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases. Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.
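For example, I'd expect an investigation to start with something along
these lines (the file name and output values here are invented; the
output format is what the pg_walsummary patch below prints):

pg_walsummary pg_wal/summaries/<summary file>
TS 1663, DB 16384, REL 16385, FORK main: limit 0
TS 1663, DB 16384, REL 16385, FORK main: blocks 0..127

and, on the reassembly side, re-running with the debug output enabled:

pg_combinebackup -d /backups/full /backups/incr -o /tmp/restored 2>debug.log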
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Oct 20, 2023 at 9:20 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Okay, so another good news - related to the patch version #4.
Not-so-tiny stress test consisting of a pgbench run for 24h straight
(with incremental backups every 2h, based on the initial full backup),
followed by two PITRs (one not using incremental backup and one using
it, to illustrate the performance point - and potentially spot any
errors in between). In both cases it worked fine.
This is great testing, thanks. What might be even better is to test
whether the resulting backups are correct, somehow.
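One approach (a sketch only; it assumes the reconstructed cluster can be
started on a scratch port, and the paths and ports are invented) would
be:

# check the reconstructed directory against its manifest, skipping WAL:
pg_verifybackup -n /backups/combined
# then start it, let recovery finish, and compare logical dumps:
pg_dump -p 5432 pgbench > source.sql
pg_dump -p 5433 pgbench > restored.sql
diff source.sql restored.sql

The t/002_compare_backups.pl test in 0003 does something along these
lines already.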
I've just noticed one thing while recovery is in progress: is
summarization working during recovery - in the background - expected
behaviour? I'm wondering about that because, after the DB is freshly
restored and recovered, one would need to create a *new* full backup,
and only from that point would new summaries be of any use?
Actually, I think you could take an incremental backup relative to a
full backup from a previous timeline.
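That is, something along these lines should work (paths invented for
illustration):

pg_basebackup -D after_pitr_incr --incremental=/backups/old_full/backup_manifest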
But the question of what summarization ought to do (or not do) during
recovery, and whether it ought to be enabled by default, and what the
retention policy ought to be are very much live ones. Right now, it's
enabled by default and keeps summaries for a week, assuming you don't
reset your local clock and that it advances at the same speed as the
universe's own clock. But that's all debatable. Any views?
Meanwhile, here's a new patch set. I went ahead and committed the
first two preparatory patches, as I said earlier that I intended to
do. And here I've adjusted the main patch, which is now 0003, for the
addition of XLOG_CHECKPOINT_REDO, which permitted me to simplify a few
things. wal_summarize_mb now feels like a bit of a silly GUC --
presumably you'd never care, unless you had an absolutely gigantic
inter-checkpoint WAL distance. And if you have that, maybe you should
also have enough memory to summarize all that WAL. Or maybe not:
perhaps it's better to write WAL summaries more than once per
checkpoint when checkpoints are really big. But I'm worried that the
GUC will become a source of needless confusion for users. For most
people, it seems like emitting one summary per checkpoint should be
totally fine, and they might prefer a simple Boolean GUC,
summarize_wal = true | false, over this. I'm just not quite sure about
the corner cases.
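To make that concrete, the simpler interface could look like this in
postgresql.conf (a sketch; summarize_wal is only a proposal at this
point, and wal_summarize_keep_time is the WIP patch's retention GUC, so
both spellings are subject to change):

# simple on/off switch instead of wal_summarize_mb:
summarize_wal = true
# how long to keep files under pg_wal/summaries:
wal_summarize_keep_time = '7d'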
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v6-0004-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From c66a2ab3cbee191f1ba0d97994b8a7a8e0086c68 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:39 -0400
Subject: [PATCH v6 4/4] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs documentation and tests.
---
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
6 files changed, 347 insertions(+)
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ /* Ensure the option flags start out false. */
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
--
2.37.1 (Apple Git-137.1)
v6-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From fac0a392b62254066300c051b077ef78a9d4cbcb Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v6 2/4] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 70884be00c..3c8effc533 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index ae05ac63cf..aa646f96a3 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index f0acd9f1e7..9895f2f17d 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v6-0001-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From 621ea9af483466cbf08cbcca10a4650c2518f235 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v6 1/4] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
v6-0003-Prototype-patch-for-incremental-backup.patch (application/octet-stream)
From a381fdbb31ba2752f89b64dd46506fb530cc0355 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v6 3/4] Prototype patch for incremental backup.
We don't differentiate between incremental and differential backups;
an incremental backup can be based either on a full backup or on a
previous incremental backup.
This adds a new background process, the WAL summarizer, whose behavior
is governed by new GUCs wal_summarize_mb and wal_summarize_keep_time.
This writes out WAL summary files to $PGDATA/pg_wal/summaries. Each
summary file contains information for a certain range of LSNs on a
certain TLI. For each relation, it stores a "limit block" which is
0 if a relation is created or destroyed within a certain range of WAL
records, or otherwise the shortest length to which the relation was
truncated during that range of WAL records, or otherwise
InvalidBlockNumber. In addition, it stores any blocks which have
been modified during that range of WAL records, but excluding blocks
which were removed by truncation after they were modified and which
were never modified thereafter. In other words, it tells us which
blocks need to be copied in case of an incremental backup covering that
range of WAL records.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
Open issues:
- Should we remove wal_summarize_mb, or replace it with a Boolean
on/off switch?
- How should we control generation and retention of summary files?
What should be the defaults?
- Needs to be tested on a standby.
- Should we send the whole backup manifest to the server or, say,
just an LSN?
- Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
- It would be nice (but not essential) to do something about incremental
JSON parsing.
- Might need more tests.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlog.c | 100 +-
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 5 +-
src/backend/backup/basebackup.c | 334 +++-
src/backend/backup/basebackup_incremental.c | 873 +++++++++++
src/backend/backup/meson.build | 3 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1363 +++++++++++++++++
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 +-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 ++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 +++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 35 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 ++++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/nodes/replnodes.h | 9 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 23 +
72 files changed, 8937 insertions(+), 70 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's simpler to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 712568a62d..50536d0521 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. The search for files will follow
+ symbolic links for the WAL directory and each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory.
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea..9ddad7864f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3555,6 +3556,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3834,8 +3872,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3878,6 +3916,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5202,9 +5260,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6921,6 +6979,24 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably promptly:
+ * we've only just inserted and flushed the XLOG_CHECKPOINT_SHUTDOWN
+ * record. If this is not a shutdown checkpoint, then this might not be
+ * very prompt at all: the XLOG_CHECKPOINT_REDO record was written before
+ * we began flushing data to disk, and that could be many minutes ago at
+ * this point. However, we don't XLogFlush() after inserting that record,
+ * so we're not guaranteed that it's on disk until after the above call
+ * that flushes the XLOG_CHECKPOINT_ONLINE record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7595,6 +7671,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 315e4b27cb..6cde31ee23 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,12 +19,15 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
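+ * (With default build settings that is sizeof(BlockNumber) * RELSEG_SIZE
+ * = 4 * 131072 bytes, i.e. 512kB.)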
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
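+/*
+ * Illustrative sketch only (not used anywhere): how a caller is expected
+ * to drive push_to_sink(), including the final flush that this function
+ * deliberately leaves to the caller. This mirrors what sendFile() does
+ * above when emitting an incremental file header; the function and
+ * variable names here are hypothetical.
+ */
+#ifdef NOT_USED
+static void
+push_to_sink_usage_example(bbsink *sink, pg_checksum_context *checksum_ctx)
+{
+ uint32 a = 1;
+ uint32 b = 2;
+ size_t bytes_done = 0;
+
+ /* Copy two small quantities into the sink's buffer. */
+ push_to_sink(sink, checksum_ctx, &bytes_done, &a, sizeof(a));
+ push_to_sink(sink, checksum_ctx, &bytes_done, &b, sizeof(b));
+
+ /* The caller is responsible for the final flush. */
+ if (bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, bytes_done);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ bytes_done) < 0)
+ elog(ERROR, "could not update checksum");
+ }
+}
+#endif
+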
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
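+ *
+ * As a purely illustrative example: if this server's history is
+ * TLI 1 -> 2 -> 3, readTimeLineHistory() returns entries in the order
+ * {3, 2, 1}, so a manifest containing WAL ranges for TLIs 2 and 3 would
+ * produce earliest_wal_range_tli = 2 and latest_wal_range_tli = 3.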
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
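+ *
+ * For example (illustrative values only): segment 2 of a relation stored
+ * at "base/5/16384" is sent as "base/5/INCREMENTAL.16384.2", while
+ * segment 0 is sent as "base/5/INCREMENTAL.16384".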
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
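+ *
+ * For instance, with the default 8kB block size and 1GB segments, a
+ * segment has up to 131072 blocks, so anything above 117964 modified
+ * blocks (90%) causes the whole file to be sent. (Illustrative numbers;
+ * the actual limits follow BLCKSZ and RELSEG_SIZE.)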
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
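+ *
+ * For example (illustrative arithmetic, default 8kB blocks): a file
+ * holding 10 incremental blocks occupies 12 + 10 * (4 + 8192) = 81972
+ * bytes.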
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
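+
+/*
+ * Purely illustrative sketch (not part of the backend code path): the
+ * on-disk layout implied by GetIncrementalFileSize() and produced by
+ * sendFile() in basebackup.c, expressed as a minimal reader. The function
+ * name is hypothetical, error handling and short reads are deliberately
+ * glossed over, and read() is assumed to be available via <unistd.h>.
+ */
+#ifdef NOT_USED
+static void
+read_incremental_file_header_example(int fd)
+{
+ uint32 magic;
+ uint32 num_blocks;
+ uint32 truncation_block_length;
+ BlockNumber *blocknos;
+
+ /* Three four-byte quantities come first... */
+ read(fd, &magic, sizeof(uint32));
+ read(fd, &num_blocks, sizeof(uint32));
+ read(fd, &truncation_block_length, sizeof(uint32));
+
+ /* ...then the block numbers; the actual block contents follow them. */
+ blocknos = palloc(sizeof(BlockNumber) * num_blocks);
+ read(fd, blocknos, sizeof(BlockNumber) * num_blocks);
+}
+#endif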
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
@@ -12,4 +13,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
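+
+ /*
+ * The name packs the TLI and the start and end LSNs into five 8-digit
+ * hex fields; e.g., the (illustrative) name
+ * 0000000100000001280000000000000128A00000.summary denotes a summary
+ * on TLI 1 covering LSNs 1/28000000 up to 1/28A00000.
+ */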
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end at or after
+ * the indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start at or before
+ * the indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
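+ *
+ * For example (LSNs illustrative), summaries covering 0/1000-0/2000 and
+ * 0/1800-0/3000 together prove the range 0/1000-0/3000 complete, but if
+ * the second summary instead began at 0/2800, we would stop and report
+ * *missing_lsn = 0/2000.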
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
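+ *
+ * For example: SELECT * FROM pg_available_wal_summaries();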
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
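+ *
+ * For example (the LSN arguments here are illustrative):
+ * SELECT * FROM pg_wal_summary_contents(1, '0/28000000', '0/28A00000');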
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cb624eab8..86f6cf2feb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1833,6 +1837,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2664,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3019,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3138,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3547,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3703,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3731,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3829,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4051,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5400,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5540,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..4ded951119
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1363 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
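+ *
+ * For example, with wal_summarize_mb = 256, a summary that begins at
+ * 0/28000000 is cut off no later than the first record ending at or
+ * beyond 0/38000000.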
+ */
+ cutoff_lsn = current_lsn + wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %d @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the values to the caller as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file,
+ * do so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
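+ /*
+ * A commit record may carry a list of relations that are unlinked at
+ * commit. Zeroing the limit block for every fork means that a
+ * relation which later reuses one of these locators is treated as
+ * new in its entirety.
+ */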
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like summarization
+ * to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to reading from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose modification times are older than
+ * wal_summarize_keep_time, at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
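+
+ /*
+ * If no WAL files survive on this timeline at all, oldest_lsn stays
+ * invalid, and every summary file for the timeline becomes a removal
+ * candidate below.
+ */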
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether or not we removed the file, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
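+ *
+ * The manifest is streamed from the client as COPY-in data; a libpq
+ * client is expected to do, roughly:
+ *
+ *     PQsendQuery(conn, "UPLOAD_MANIFEST");
+ *     ... one PQputCopyData() per chunk of manifest data ...
+ *     PQputCopyEnd(conn, NULL);
+ *
+ * which is what the pg_basebackup changes later in this patch do.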
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
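+ /*
+ * Hold cancel interrupts while reading the message, so that a query
+ * cancel arriving mid-message cannot leave us out of step with the
+ * protocol stream.
+ */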
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 182d666852..94e7944748 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4c58574166..faf42bdbfb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* We are done with the manifest file. */
+ close(fd);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+/pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
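+ /*
+ * Bits of 'found': 1 = start LSN, 2 = start TLI, 4 = previous LSN,
+ * 8 = previous TLI.
+ */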
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
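+ /* Copy the label line by line, omitting the INCREMENTAL FROM lines. */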
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
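+ /*
+ * Temporarily NUL-terminate the line at e so that sscanf cannot scan
+ * past the end of the data; the saved byte is restored below.
+ */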
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
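+ /* 'offset' is tracked only for use in error messages. */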
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
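+/*
+ * Instantiate a simplehash hash table keyed by pathname. With these
+ * definitions, lib/simplehash.h generates manifest_files_hash along with
+ * functions such as manifest_files_insert and manifest_files_lookup that
+ * are used elsewhere in this patch.
+ */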
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..a6036dea74
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,35 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..32d2846433
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
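+
+/*
+ * For example, an incremental version of relation file "16384" is stored in
+ * the same directory as "INCREMENTAL.16384".
+ */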
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
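+ *
+ * (An in-place tablespace is one whose pg_tblspc entry is a plain directory
+ * rather than a symbolic link to a directory elsewhere.)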
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+static cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
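+ * For example, if the command line names backups "full incr1 incr2", the
+ * prior backups are "full" and "incr1".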
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
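+ *
+ * For example, given -T /srv/old\=ts=/srv/new, the old directory is
+ * "/srv/old=ts" and the new directory is "/srv/new".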
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
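+ *
+ * As enforced here, "coherent" means that each incremental backup's recorded
+ * previous TLI/LSN must match the start TLI/LSN of the backup just before it
+ * on the command line, the first backup must be a full backup, and all of
+ * the others must be incremental backups.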
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s combines incremental backups.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
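+ *
+ * For example, files in a tablespace with OID 16385 are looked up in the
+ * manifest under "pg_tblspc/16385/...".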
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
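+ *
+ * If checksum_type is not CHECKSUM_TYPE_NONE, a checksum of the output file
+ * is computed (or, where possible, reused from a backup_manifest) and
+ * returned via *checksum_length and *checksum_payload.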
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
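+ *
+ * For example, if output block 3 is stored as the second block of some
+ * incremental file, then sourcemap[3] points to that file's rfile and
+ * offsetmap[3] is that file's header_length + 1 * BLCKSZ.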
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
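+ *
+ * As read below, the header consists of a magic number, a block count, a
+ * truncation block length, and an array of relative block numbers; the
+ * block data itself follows the header.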
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
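+
+/*
+ * To restate the contract of write_reconstructed_file for reviewers:
+ * for each block i of the output, sourcemap[i] identifies the rfile
+ * supplying that block and offsetmap[i] the byte offset within it, with
+ * a NULL source meaning "zero-fill". A hypothetical three-block example,
+ * where blocks 0 and 2 come from an incremental file and block 1 from
+ * the prior full backup:
+ *
+ *     sourcemap[0] = incr;  offsetmap[0] = incr->header_length;
+ *     sourcemap[1] = full;  offsetmap[1] = 1 * BLCKSZ;
+ *     sourcemap[2] = incr;  offsetmap[2] = incr->header_length + BLCKSZ;
+ *
+ * An incremental file stores only the blocks it mentions, in the order
+ * its header lists them, which is why those offsets are relative to
+ * header_length.
+ */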
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..3d9238f366
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the final portion of the manifest, including the checksum. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
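+
+/*
+ * Usage sketch (for review; not called in this file): a caller that
+ * already has each file's size, mtime, and checksum payload in hand --
+ * the variable names below are made up -- drives the API like this:
+ *
+ *     manifest_writer *mwriter = create_manifest_writer(output_directory);
+ *
+ *     for (i = 0; i < nfiles; ++i)
+ *         add_file_to_manifest(mwriter, files[i].manifest_path,
+ *                              files[i].size, files[i].mtime,
+ *                              files[i].checksum_type,
+ *                              files[i].checksum_length,
+ *                              files[i].checksum_payload);
+ *
+ *     finalize_manifest(mwriter, first_wal_range);
+ *
+ * The output file is created lazily on the first flush, and the SHA-256
+ * covers everything up to but not including the Manifest-Checksum key,
+ * matching what the server produces for an ordinary backup manifest.
+ */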
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
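+
+/*
+ * Note for reviewers: WALSUMMARY_NHEXCHARS is 40 because a summary file
+ * name is an 8-character TLI followed by two 16-character LSNs, so the
+ * names matched above are shaped like this made-up example:
+ *
+ *     0000000100000000015125B8000000000153C5B0.summary
+ */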
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/common/Makefile b/src/common/Makefile
index 3c8effc533..2b41dd1839 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
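+
+/*
+ * To make the arithmetic concrete (this just restates what the code
+ * below does): for a given BlockNumber blknum,
+ *
+ *     chunkno = blknum / BLOCKS_PER_CHUNK;      -- which chunk
+ *     chunkoffset = blknum % BLOCKS_PER_CHUNK;  -- 0 .. 65535
+ *
+ * A chunk whose usage count reaches MAX_ENTRIES_PER_CHUNK (4096 uint16s
+ * = 8kB = 65536 bits) is interpreted as a bitmap; any smaller usage
+ * count means an array of 2-byte chunkoffset values, stored in
+ * insertion order.
+ */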
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1 && (stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
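+
+/*
+ * Usage sketch (not called anywhere in this file): to pull out every
+ * modified block falling within one 1GB segment -- the granularity the
+ * incremental backup code cares about -- given an 'entry' obtained from
+ * BlockRefTableGetEntry and a hypothetical qsort comparator:
+ *
+ *     BlockNumber *blocks = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
+ *     int nblocks;
+ *
+ *     nblocks = BlockRefTableEntryGetBlocks(entry,
+ *                                           segno * RELSEG_SIZE,
+ *                                           (segno + 1) * RELSEG_SIZE,
+ *                                           blocks, RELSEG_SIZE);
+ *     qsort(blocks, nblocks, sizeof(BlockNumber), compare_block_numbers);
+ *
+ * Note that chunks still in array format return their offsets in
+ * insertion order, so a caller that wants ascending block numbers must
+ * sort the result.
+ */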
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
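+
+/*
+ * The I/O callbacks are deliberately minimal. Going by the call sites in
+ * this file, io_callback_fn is essentially
+ * int (*)(void *arg, void *data, int length), so a file-backed write
+ * callback can be a thin wrapper around write(). A sketch, with the
+ * error handling simplified:
+ *
+ *     static int
+ *     write_callback(void *arg, void *data, int length)
+ *     {
+ *         int fd = *(int *) arg;
+ *         ssize_t nwritten = write(fd, data, length);
+ *
+ *         if (nwritten != length)
+ *             pg_fatal("could not write block reference table file: %m");
+ *         return (int) nwritten;
+ *     }
+ *
+ * and then: WriteBlockRefTable(brtab, write_callback, &fd);
+ */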
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
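+
+/*
+ * Putting the reader pieces together, a consumer loop looks something
+ * like this (a sketch; read_callback mirrors the write-side callback and
+ * report_error is whatever the caller uses to complain about a malformed
+ * file):
+ *
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block, blocks[256];
+ *     unsigned n;
+ *
+ *     reader = CreateBlockRefTableReader(read_callback, &fd, filename,
+ *                                        report_error, NULL);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator,
+ *                                            &forknum, &limit_block))
+ *     {
+ *         -- must drain all blocks before the next relation --
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 256)) > 0)
+ *             ... consume blocks[0 .. n-1] ...
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */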
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
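+
+/*
+ * And the corresponding incremental-write path, for callers that can
+ * produce entries already sorted the way BlockRefTableWriteEntry demands
+ * (a sketch; sorted_entries is hypothetical):
+ *
+ *     writer = CreateBlockRefTableWriter(write_callback, &fd);
+ *     for (i = 0; i < nentries; ++i)
+ *         BlockRefTableWriteEntry(writer, sorted_entries[i]);
+ *     DestroyBlockRefTableWriter(writer);   -- writes sentinel + CRC
+ */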
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index aa646f96a3..6348d60ec4 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..9d1e4ab57b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c92d0631a0..9717c4630e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12071,4 +12071,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7232b03e37..042fdc6ca1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..ad11be4664 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..5fe4faf1be 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 06b25617bc..8bccec66c3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3993,3 +3993,26 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+FileBackupMethod
+IncrementalBackupInfo
+SummarizerReadLocalXLogPrivate
+UploadManifestCmd
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
On 10/23/23 11:44, Robert Haas wrote:
On Fri, Oct 20, 2023 at 11:30 AM David Steele <david@pgmasters.net> wrote:
I don't plan to stand in your way on this feature. I'm reviewing what
patches I can out of courtesy and to be sure that nothing adjacent to
your work is being affected. My apologies if my reviews are not meeting
your expectations, but I am contributing as my time constraints allow.

Sorry, I realize reading this response that I probably didn't do a
very good job writing that email and came across sounding like a jerk.
Possibly, I actually am a jerk. Whether it just sounded like it or I
actually am, I apologize.
That was the way it came across, though I prefer to think it was
unintentional. I certainly understand how frustrating dealing with a
large and uncertain patch can be. Either way, I appreciate the apology.
Now onward...
But your last paragraph here gets at my real
question, which is whether you were going to try to block the feature.
I recognize that we have different priorities when it comes to what
would make most sense to implement in PostgreSQL, and that's OK, or at
least, it's OK with me.
This seems perfectly natural to me.
I also don't have any particular expectation
about how much you should review the patch or in what level of detail,
and I have sincerely appreciated your feedback thus far. If you are
able to continue to provide more, that's great, and if that's not,
well, you're not obligated. What I was concerned about was whether
that review was a precursor to a vigorous attempt to keep the main
patch from getting committed, because if that was going to be the
case, then I'd like to surface that conflict sooner rather than later.
It sounds like that's not an issue, which is great.
Overall I would say I'm not strongly for or against the patch. I think
it will be very difficult to use in a manual fashion, but automation is
the way to go in general, so that's not necessarily an argument against.
However, this is an area of great interest to me so I do want to at
least make sure nothing is being impacted adjacent to the main goal of
this patch. So far I have seen no sign of that, but that has been a
primary goal of my reviews.
At the risk of drifting into the fraught question of what I *should*
be implementing rather than the hopefully-less-fraught question of
whether what I am actually implementing is any good, I see incremental
backup as a way of removing some of the use cases for the low-level
backup API. If you said "but people still will have lots of reasons to
use it," I would agree; and if you said "people can still screw things
up with pg_basebackup," I would also agree. Nonetheless, most of the
disasters I've personally seen have stemmed from the use of the
low-level API rather than from the use of pg_basebackup, though there
are exceptions.
This all makes sense to me.
I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances.
I would argue that restore performance is *more* important than backup
performance and this patch is a step backward in that regard. Backups
will be faster and less space will be used in the repository, but
restore performance is going to suffer. If the deltas are very small the
difference will probably be negligible, but as the deltas get large (and
especially if there are a lot of them) the penalty will be more noticeable.
Also, I
have worked fairly hard to try to make sure that if you misuse
pg_combinebackup, or fail to use it altogether, you'll get an error
rather than silent data corruption. I would be interested to hear
about scenarios where the checks that I've implemented can be defeated
by something that is plausibly described as stupidity rather than
malice. I'm not sure we can fix all such cases, but I'm very alert to
the horror that will befall me if user error looks exactly like a bug
in the code. For my own sanity, we have to be able to distinguish
those cases.
I was concerned with the difficulty of trying to stage the correct
backups for pg_combinebackup, not whether it would recognize that the
needed data was not available and then error appropriately. The latter
is surmountable within pg_combinebackup but the former is left up to the
user.
Moreover, we also need to be able to distinguish
backup-time bugs from reassembly-time bugs, which is why I've got the
pg_walsummary tool, and why pg_combinebackup has the ability to emit
fairly detailed debugging output. I anticipate those things being
useful in investigating bug reports when they show up. I won't be too
surprised if it turns out that more work on sanity-checking and/or
debugging tools is needed, but I think your concern about people
misusing stuff is bang on target and I really want to do whatever we
can to avoid that when possible and detect it when it happens.
The ability of users to misuse tools is, of course, legendary, so that
all sounds good to me.
One note regarding the patches. I feel like
v5-0005-Prototype-patch-for-incremental-backup should be split to have
the WAL summarizer as one patch and the changes to base backup as a
separate patch.
It might not be useful to commit one without the other, but it would
make for an easier read. Just my 2c.
Regards,
-David
On Mon, Oct 23, 2023 at 7:56 PM David Steele <david@pgmasters.net> wrote:
I also think a lot of the use of the low-level API is
driven by it being just too darn slow to copy the whole database, and
incremental backup can help with that in some circumstances.

I would argue that restore performance is *more* important than backup
performance and this patch is a step backward in that regard. Backups
will be faster and less space will be used in the repository, but
restore performance is going to suffer. If the deltas are very small the
difference will probably be negligible, but as the deltas get large (and
especially if there are a lot of them) the penalty will be more noticeable.
I think an awful lot depends here on whether the repository is local
or remote. If you have filesystem access to wherever the backups are
stored anyway, I don't think that using pg_combinebackup to write out
a new data directory is going to be much slower than copying one data
directory from the repository to wherever you'd actually use the
backup. It may be somewhat slower because we do need to access some
data in every involved backup, but I don't think it should be vastly
slower because we don't have to read every backup in its entirety. For
each file, we read the (small) header of the newest incremental file
and every incremental file that precedes it until we find a full file.
Then, we construct a map of which blocks need to be read from which
sources and read only the required blocks from each source. If all the
blocks are coming from a single file (because there are no incrementals
for a certain file, or they contain no blocks) then we just copy the
entire source file in one shot, which can be optimized using the same
tricks we use elsewhere. Inevitably, this is going to read more data
and do more random I/O than just a flat copy of a directory, but it's
not terrible. The overall amount of I/O should be a lot closer to the
size of the output directory than to the sum of the sizes of the input
directories.
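To make the per-file logic concrete, here is a toy sketch of just the
source-selection step. This is not the code from the patch; every name
in it is invented, and it ignores file headers, truncation handling,
and actual I/O:

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS  8   /* blocks in the reconstructed file */
#define NSOURCES 3   /* sources[0] is newest; the last one is a full backup */

typedef struct
{
    const char *name;
    bool        is_full;            /* a full backup has every block */
    bool        has_block[NBLOCKS]; /* blocks an incremental carries */
} source;

int
main(void)
{
    source sources[NSOURCES] = {
        {"incr2", false, {false, true, false, false, true, false, false, false}},
        {"incr1", false, {true, false, false, true, false, false, false, false}},
        {"full",  true,  {false}},
    };
    int  source_map[NBLOCKS];

    /*
     * Each block comes from the newest source that contains it; a full
     * backup ends the search because it contains everything.
     */
    for (int blk = 0; blk < NBLOCKS; blk++)
    {
        for (int s = 0; s < NSOURCES; s++)
        {
            if (sources[s].is_full || sources[s].has_block[blk])
            {
                source_map[blk] = s;
                break;
            }
        }
    }

    for (int blk = 0; blk < NBLOCKS; blk++)
        printf("block %d <- %s\n", blk, sources[source_map[blk]].name);
    return 0;
}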
Now, if the repository is remote, and you have to download all of
those backups first, and then run pg_combinebackup on them afterward,
that is going to be unpleasant, unless the incremental backups are all
quite small. Possibly this could be addressed by teaching
pg_combinebackup to do things like accessing data over HTTP and SSH,
and relatedly, looking inside tarfiles without needing them unpacked.
For now, I've left those as ideas for future improvement, but I think
potentially they could address some of your concerns here. A
difficulty is that there are a lot of protocols that people might want
to use to push bytes around, and it might be hard to keep up with the
march of progress.
I do agree, though, that there's no such thing as a free lunch. I
wouldn't recommend to anyone that they plan to restore from a chain of
100 incremental backups. Not only might it be slow, but the
opportunities for something to go wrong are magnified. Even if you've
automated everything well enough that there's no human error involved,
what if you've got a corrupted file somewhere? Maybe that's not likely
in absolute terms, but the more files you've got, the more likely it
becomes. What I'd suggest someone do instead is periodically do
pg_combinebackup full_reference_backup oldest_incremental -o
new_full_reference_backup; rm -rf full_reference_backup; mv
new_full_reference_backup full_reference_backup. The new full
reference backup is intended to still be usable for restoring
incrementals based on the incremental it replaced. I hope that, if
people use the feature well, this should limit the need for really
long backup chains. I am sure, though, that some people will use it
poorly. Maybe there's room for more documentation on this topic.
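Spelled out as individual commands (the backup names here are just
placeholders), the rotation described above is:

pg_combinebackup full_reference_backup oldest_incremental -o new_full_reference_backup
rm -rf full_reference_backup
mv new_full_reference_backup full_reference_backup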
I was concerned with the difficulty of trying to stage the correct
backups for pg_combinebackup, not whether it would recognize that the
needed data was not available and then error appropriately. The latter
is surmountable within pg_combinebackup but the former is left up to the
user.
Indeed.
One note regarding the patches. I feel like
v5-0005-Prototype-patch-for-incremental-backup should be split to have
the WAL summarizer as one patch and the changes to base backup as a
separate patch.

It might not be useful to commit one without the other, but it would
make for an easier read. Just my 2c.
Yeah, maybe so. I'm not quite ready to commit to doing that split as
of this writing but I will think about it and possibly do it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 04.10.23 22:08, Robert Haas wrote:
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful.
The easiest answer is to have it off by default. Let people figure out
what works for them. There are various factors like storage, network,
server performance, RTO that will determine what combination of full
backup, incremental backup, and WAL replay will satisfy someone's
requirements. I suppose tests could be set up to determine this to some
degree. But we could also start slow and let people figure it out
themselves. When pg_basebackup was added, it was also disabled by default.
If we think that 7d is a good setting, then I would suggest considering something
like 10d. Otherwise, if you do a weekly incremental backup and you have
a time change or a hiccup of some kind one day, you lose your backup
sequence.
Another possible answer is, like, 400 days? Because why not? What is a
reasonable upper limit for this?
- It's regrettable that we don't have incremental JSON parsing; I
think that means anyone who has a backup manifest that is bigger than
1GB can't use this feature. However, that's also a problem for the
existing backup manifest feature, and as far as I can see, we have no
complaints about it. So maybe people just don't have databases with
enough relations for that to be much of a live issue yet. I'm inclined
to treat this as a non-blocker,
It looks like each file entry in the manifest takes about 150 bytes, so
1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now?
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
The current user experience of pg_basebackup is that it waits possibly a
long time for a checkpoint, and there is an option to make it go faster,
but there is no timeout AFAICT. Is this substantially different? Could
we just let it wait forever?
Also, does waiting for checkpoint and WAL summarization happen in
parallel? If so, what if it starts a checkpoint that might take 15 min
to complete, and then after 60 seconds it kicks you off because the WAL
summarization isn't ready. That might be wasteful.
- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche.
This seems fine.
Is the WAL summary file format documented anywhere in your patch set
yet? My only thought was, maybe the file format could be human-readable
(more like backup_label) to avoid this. But maybe not.
On Tue, Oct 24, 2023 at 10:53 AM Peter Eisentraut <peter@eisentraut.org> wrote:
The easiest answer is to have it off by default. Let people figure out
what works for them. There are various factors like storage, network,
server performance, RTO that will determine what combination of full
backup, incremental backup, and WAL replay will satisfy someone's
requirements. I suppose tests could be set up to determine this to some
degree. But we could also start slow and let people figure it out
themselves. When pg_basebackup was added, it was also disabled by default.

If we think that 7d is a good setting, then I would suggest considering something
like 10d. Otherwise, if you do a weekly incremental backup and you have
a time change or a hiccup of some kind one day, you lose your backup
sequence.

Another possible answer is, like, 400 days? Because why not? What is a
reasonable upper limit for this?
In concept, I don't think this should even be time-based. What you
should do is remove WAL summaries once you know that you've taken as
many incremental backups that might use them as you're ever going to
do. But PostgreSQL itself doesn't have any way of knowing what your
intended backup patterns are. If your incremental backup fails on
Monday night and you run it manually on Tuesday morning, you might
still rerun it as an incremental backup. If it fails every night for a
month and you finally realize and decide to intervene manually, maybe
you want a new full backup at that point. It's been a month. But on
the other hand maybe you don't. There's no time-based answer to this
question that is really correct, and I think it's quite possible that
your backup software might want to shut off time-based deletion
altogether and make its own decisions about when to nuke summaries.
However, I also don't think that's a great default setting. It could
easily lead to people wasting a bunch of disk space for no reason.
As far as the 7d value, I figured that nightly incremental backups
would be fairly common. If we think weekly incremental backups would
be common, then pushing this out to 10d would make sense. While
there's no reason you couldn't take an annual incremental backup, and
thus want a 400d value, it seems like a pretty niche use case.
Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.
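For anyone trying the patches out, the knobs under discussion look like
this in postgresql.conf. The names may well change, and the values below
are only examples (the syntax and units for the retention setting are my
guess, not something the patch documents yet):

wal_summarize_mb = 0            # turn summarization off, as some TAP tests do
wal_summarize_keep_time = '7d'  # hypothetical value; retention knob from 0003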
It looks like each file entry in the manifest takes about 150 bytes, so
1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now?
I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).
The current user experience of pg_basebackup is that it waits possibly a
long time for a checkpoint, and there is an option to make it go faster,
but there is no timeout AFAICT. Is this substantially different? Could
we just let it wait forever?
We could. I installed the timeout because the first versions of the
feature were buggy, and I didn't like having my tests hang forever
with no indication of what had gone wrong. At least in my experience
so far, the time spent waiting for WAL summarization is typically
quite short, because only the WAL that needs to be summarized is
whatever was emitted since the last time it woke up, up through the
start LSN of the backup. That's probably not much, and the next time
the summarizer wakes up, the file should appear within moments. So,
it's a little different from the checkpoint case, where long waits are
expected.
Also, does waiting for checkpoint and WAL summarization happen in
parallel? If so, what if it starts a checkpoint that might take 15 min
to complete, and then after 60 seconds it kicks you off because the WAL
summarization isn't ready. That might be wasteful.
It is not parallel. The trouble is, we don't really have any way to
know whether WAL summarization is going to fail for whatever reason.
We don't expect that to happen, but if somebody changes the
permissions on the WAL summary directory or attaches gdb to the WAL
summarizer process or something of that sort, it might.
We could check at the outset whether we seem to be really far behind
and emit a warning. For instance, if we're 1TB behind on WAL
summarization when the checkpoint is requested, chances are something
is busted and we're probably not going to catch up any time soon. We
could warn the user about that and let them make their own decision
about whether to cancel. But, that idea won't help in unattended
operation, and the threshold for "really far behind" is not very
clear. It might be better to wait until we get more experience with
how things actually fail before doing too much engineering here, but
I'm also open to suggestions.
Is the WAL summary file format documented anywhere in your patch set
yet? My only thought was, maybe the file format could be human-readable
(more like backup_label) to avoid this. But maybe not.
The comment in blkreftable.c just above "#define BLOCKS_PER_CHUNK"
gives an overview of the format. I think that we probably don't want
to convert to a text format, because this format is extremely
space-efficient and very convenient to transfer between disk and
memory. We don't want to run out of memory when summarizing large
ranges of WAL, or when taking an incremental backup that requires
combining many individual summaries into a combined summary that tells
us what needs to be included in the backup.
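To give a flavor of the space-saving trick - and this is only the
concept, not the patch's actual layout or constants - each fixed-size
range ("chunk") of block numbers can be tracked as a short array of
2-byte offsets while few blocks are modified, and flipped to a bitmap
once the array would be no smaller than the bitmap:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_BLOCKS 65536              /* block numbers covered per chunk */
#define BITMAP_BYTES (CHUNK_BLOCKS / 8) /* size of the dense form */
#define MAX_OFFSETS  (BITMAP_BYTES / 2) /* sparse form kept while smaller */

typedef struct
{
    bool     is_bitmap;
    uint32_t noffsets;
    uint16_t offsets[MAX_OFFSETS];      /* sparse: offsets within the chunk */
    uint8_t  bitmap[BITMAP_BYTES];      /* dense: one bit per block */
} chunk;

static void
chunk_mark(chunk *c, uint16_t off)
{
    if (c->is_bitmap)
    {
        c->bitmap[off / 8] |= 1 << (off % 8);
        return;
    }

    /* Convert once the offset array would be no smaller than a bitmap. */
    if (c->noffsets == MAX_OFFSETS)
    {
        memset(c->bitmap, 0, sizeof(c->bitmap));
        for (uint32_t i = 0; i < c->noffsets; i++)
            c->bitmap[c->offsets[i] / 8] |= 1 << (c->offsets[i] % 8);
        c->is_bitmap = true;
        chunk_mark(c, off);
        return;
    }

    /* Duplicate offsets aren't deduplicated here, for brevity. */
    c->offsets[c->noffsets++] = off;
}

int
main(void)
{
    static chunk c;     /* static: zero-initialized, and it's ~16kB */

    chunk_mark(&c, 7);
    chunk_mark(&c, 4242);
    printf("form: %s, entries: %u\n",
           c.is_bitmap ? "bitmap" : "offset array", c.noffsets);
    return 0;
}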
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-10-24 Tu 12:08, Robert Haas wrote:
It looks like each file entry in the manifest takes about 150 bytes, so
1 GB would allow for 1024**3/150 = 7158278 files. That seems fine for now?

I suspect a few people have more files than that. They'll just have to
wait to use this feature until we get incremental JSON parsing (or
undo the decision to use JSON for the manifest).
Robert asked me to work on this quite some time ago, and most of this
work was done last year.
Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.
If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.
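For anyone who hasn't read the patch, this toy fragment (emphatically
not the patch itself) shows why the explicit stack is what makes
incremental operation possible: the entire parse state lives in a
struct, so the parser can hand control back to the caller between input
chunks, which an RD parser keeping its state on the C call stack cannot
do. It only checks {}/[] nesting, to stay small:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
    char stack[64];     /* explicit stack replaces the RD call stack */
    int  depth;
    bool failed;
} inc_parser;

static void
parse_chunk(inc_parser *p, const char *chunk, size_t len)
{
    for (size_t i = 0; i < len && !p->failed; i++)
    {
        char c = chunk[i];

        if (c == '{' || c == '[')
        {
            if (p->depth == (int) sizeof(p->stack))
                p->failed = true;   /* too deeply nested for this toy */
            else
                p->stack[p->depth++] = c;
        }
        else if (c == '}' || c == ']')
        {
            char expect = (c == '}') ? '{' : '[';

            if (p->depth == 0 || p->stack[--p->depth] != expect)
                p->failed = true;
        }
    }
}

int
main(void)
{
    inc_parser  p = {0};
    const char *chunks[] = {"{\"a\": [1, ", "2, {\"b\": 3}", "]}"};

    /* Feed the document in pieces, as a streaming reader would. */
    for (int i = 0; i < 3; i++)
        parse_chunk(&p, chunks[i], strlen(chunks[i]));

    puts(p.failed || p.depth != 0 ? "malformed" : "well-formed");
    return 0;
}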
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Robert asked me to work on this quite some time ago, and most of this
work was done last year.

Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.

If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.
Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.
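For illustration, a line-oriented manifest of that general sort might
have looked something like this - the exact format here is invented for
this example, not what was actually proposed:

File "base/5/16384" 8192 2023-10-25 12:00:00 GMT CRC32C a1b2c3d4
File "base/5/16385" 16384 2023-10-25 12:00:01 GMT CRC32C e5f6a7b8

Each line stands alone, so a reader never needs to hold more than one
line's worth of state, which is exactly what makes incremental parsing
trivial.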
Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.
If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case and with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-10-25 We 09:05, Robert Haas wrote:
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Robert asked me to work on this quite some time ago, and most of this
work was done last year.

Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.

If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.

Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.

Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.

If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case and with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.
I'm not too worried about the maintenance burden. The RD routines were
added in March 2013 (commit a570c98d7fa) and have hardly changed since
then. The new code is not ground-breaking - it's just a different (and
fairly well known) way of doing the same thing. I'd be happier if we
could make it faster, but maybe it's just a fact that keeping an
explicit stack, which is how this works, is slower.
I wouldn't at all be surprised if there were other good uses for
incremental JSON parsing, including some you've identified.
That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote:
I'm not too worried about the maintenance burden.
That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.
I think it's a decision we could walk back if we had a good enough
reason, but it would be nicer if we didn't have to, because what we
have right now is working. If we change it for no real reason, we
might introduce new bugs, and at least in theory, incompatibility with
third-party tools that parse the existing format. If you think we can
live with the additional complexity in the JSON parsing stuff, I'd
rather go that way.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 8:29 AM Robert Haas <robertmhaas@gmail.com> wrote:
Yeah, maybe so. I'm not quite ready to commit to doing that split as
of this writing but I will think about it and possibly do it.
I have done this. Here's v7.
This version also includes several new TAP tests for the main patch,
some of which were inspired by our discussion. It also includes SGML
documentation for pg_walsummary.
New tests:
003_timeline.pl tests the case where the prior backup for an
incremental backup was taken on an earlier timeline.
004_manifest.pl tests the manifest-related options for pg_combinebackup.
005_integrity.pl tests the sanity checks that prevent combining a
backup with the wrong prior backup.
Overview of the new organization of the patch set:
0001 - preparatory refactoring of basebackup.c, changing the algorithm
that we use to decide which files have checksums
0002 - code movement only. makes it possible to reuse parse_manifest.c
0003 - add the WAL summarizer process, but useless on its own
0004 - add incremental backup, making use of 0003
0005 - add pg_walsummary debugging tool
Notes:
- I suspect that 0003 is the most likely to have serious bugs, followed by 0004.
- See XXX comments in the commit messages for some known open issues.
- Still looking for more comments on
/messages/by-id/CA+TgmoYdPS7a4eiqAFCZ8dr4r3-O0zq1LvTO5drwWr+7wHQaSQ@mail.gmail.com
and other recent emails where design questions came up
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v7-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From ece784be59172049432385a5831005c2c9f8fed2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v7 2/5] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index 596df15118..8f04fa662c 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index 70884be00c..3c8effc533 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index ae05ac63cf..aa646f96a3 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index f0acd9f1e7..9895f2f17d 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v7-0001-Change-how-a-base-backup-decides-which-files-have.patch
From af44c310593481eb1d3227324fd585cbf27db50d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v7 1/5] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
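One note to make the 0001 behavior change easier to review: after this
patch, the decision to verify a file's checksums reduces to "did
parse_filename_for_nontemp_relation() recognize it as a relation file in
sendDir()?", and the segment number used to compute absolute block numbers
comes from that same parse instead of being re-derived from the file name
in sendFile(). For anyone who doesn't have the naming convention paged in,
here is a simplified stand-alone sketch of what that parse accepts --
<relfilenumber>, then optionally _<fork>, then optionally .<segno>. It is
only an illustration, not the server code; the real function is stricter,
e.g. it only accepts the known fork names:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static bool
parse_relation_filename(const char *name, unsigned *relfilenumber,
                        char fork[8], unsigned *segno)
{
    char       *endp;

    *segno = 0;
    strcpy(fork, "main");

    *relfilenumber = strtoul(name, &endp, 10);
    if (endp == name)
        return false;           /* must start with the relfilenumber */

    if (*endp == '_')
    {
        const char *forkname = endp + 1;
        size_t      len = strcspn(forkname, ".");

        /* the real code checks against "fsm", "vm", and "init" only */
        if (len == 0 || len >= 8)
            return false;
        memcpy(fork, forkname, len);
        fork[len] = '\0';
        endp = (char *) forkname + len;
    }

    if (*endp == '.')
    {
        *segno = strtoul(endp + 1, &endp, 10);
        if (*segno == 0)
            return false;       /* segment 0 never has a suffix */
    }

    return *endp == '\0';
}

int
main(void)
{
    unsigned    relfilenumber, segno;
    char        fork[8];

    if (parse_relation_filename("16384_fsm.2", &relfilenumber, fork, &segno))
        printf("relfilenumber=%u, fork=%s, segno=%u\n",
               relfilenumber, fork, segno);
    return 0;
}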
Attachment: v7-0005-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From a0ccde062e1af56dc7655388f2017fe644b6aeca Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v7 5/5] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 278 ++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 473 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..3a2122b067
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found within the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <application>pg_walsummary</application> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0304a42026
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt = {0};	/* all options default to off */
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 064c0ecdc1..9045352b10 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4016,3 +4016,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
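To give a feel for pg_walsummary's output, here's a hypothetical session.
The file name, OIDs, and block numbers below are all invented, but each
line follows the printf formats in dump_one_relation(), so real output has
this shape; consecutive modified blocks are collapsed into ranges unless
-i/--individual is given:

pg_walsummary pg_wal/summaries/0000000100000000028000D8000000000287F438.summary
TS 1663, DB 5, REL 16384, FORK main: limit 0
TS 1663, DB 5, REL 16384, FORK main: blocks 0..16
TS 1663, DB 5, REL 16385, FORK fsm: block 2

With -q, nothing is printed at all and the exit status alone says whether
the file could be parsed, which is handy in scripts.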
Attachment: v7-0004-Add-support-for-incremental-backup.patch (application/octet-stream)
From e9ce1bd16037871339faa5f94e73ff37d4d7ee4f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v7 4/5] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Needs testing on a standby.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 334 ++++-
src/backend/backup/basebackup_incremental.c | 873 +++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/backend/utils/activity/pgstat_io.c | 4 +-
src/bin/Makefile | 1 +
src/bin/initdb/initdb.c | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 618 ++++++++
src/bin/pg_combinebackup/reconstruct.h | 32 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 153 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 89 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 123 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/test/recovery/t/001_stream_rep.pl | 2 +
src/test/recovery/t/019_replslot_limit.pl | 3 +
.../t/035_standby_logical_decoding.pl | 1 +
src/tools/pgindent/typedefs.list | 12 +
51 files changed, 5581 insertions(+), 60 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files that contain
+ only the blocks that have changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to forgo incremental
+ backups altogether and take only full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 712568a62d..50536d0521 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 315e4b27cb..6cde31ee23 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1284,6 +1284,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
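
To make the naming convention above concrete: a relation segment that is
sent incrementally keeps its directory, but the base name gains an
INCREMENTAL. prefix. Here's a trivial standalone sketch of that
transformation (the paths are hypothetical, not taken from a real cluster):

/* Sketch only: how a segment path maps to its incremental counterpart,
 * per the snprintf() in sendDir() above. */
#include <stdio.h>

int
main(void)
{
	const char *dir = "base/16384";	/* hypothetical database directory */
	const char *seg = "16385.3";	/* hypothetical segment file name */
	char		ipath[1024];

	snprintf(ipath, sizeof(ipath), "%s/INCREMENTAL.%s", dir, seg);
	puts(ipath);	/* prints base/16384/INCREMENTAL.16385.3 */
	return 0;
}
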
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
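
Putting the header-writing code above together, the on-disk layout of an
incremental file is: a 4-byte magic number, a 4-byte count of included
blocks, a 4-byte truncation block length, an array of 4-byte block numbers,
and then the included blocks themselves, in the same order as the block
numbers. For anyone who wants to poke at the resulting files, here's a rough
sketch of a standalone header dumper; it assumes the same endianness as the
server that wrote the file and doesn't validate the magic number:

/* Sketch only: dump the header of an INCREMENTAL.* file. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	FILE	   *f;
	uint32_t	magic;
	uint32_t	nblocks;
	uint32_t	truncation_block_length;

	if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
	{
		fprintf(stderr, "usage: dump_incremental FILE\n");
		exit(1);
	}
	if (fread(&magic, sizeof(magic), 1, f) != 1 ||
		fread(&nblocks, sizeof(nblocks), 1, f) != 1 ||
		fread(&truncation_block_length, sizeof(truncation_block_length), 1, f) != 1)
	{
		fprintf(stderr, "could not read header\n");
		exit(1);
	}
	printf("magic 0x%08x, %u blocks, truncation block length %u\n",
		   magic, nblocks, truncation_block_length);
	/* Next come nblocks 4-byte block numbers, then nblocks * BLCKSZ of data. */
	fclose(f);
	return 0;
}
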
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..20cc00bded
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,873 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file.
+ *
+ * Return the relevant details to the caller, transposing absolute block
+ * numbers to relative block numbers.
+ *
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+ *truncation_block_length =
+ Min(size / BLCKSZ, limit_block - segno * RELSEG_SIZE);
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for the backup_file hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
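
As a sanity check on GetIncrementalFileSize(), it's worth spelling out the
arithmetic: with the default BLCKSZ of 8192 and 4-byte block numbers, a
segment with only a few modified blocks produces a tiny incremental file.
A sketch, assuming those default build settings:

/* Sketch: the GetIncrementalFileSize() arithmetic for n = 3 modified
 * blocks: 3 * 4 + (8192 + 4) * 3 = 24600 bytes, versus up to 1GB for
 * the full segment. */
#include <stdint.h>
#include <stdio.h>

#define ASSUMED_BLCKSZ 8192	/* default build */

int
main(void)
{
	unsigned	n = 3;
	size_t		sz;

	sz = 3 * sizeof(uint32_t) + (ASSUMED_BLCKSZ + sizeof(uint32_t)) * n;
	printf("%zu\n", sz);	/* prints 24600 */
	return 0;
}
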
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from the client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
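
From the client's perspective, UPLOAD_MANIFEST is just a COPY-in exchange on
the replication connection. Here's a condensed sketch of the flow using
libpq - it duplicates what the pg_basebackup changes below do, with
pg_fatal() from the frontend logging support standing in for real error
handling:

/* Condensed sketch of the client side of UPLOAD_MANIFEST; 'conn' is an
 * established replication connection and the manifest is already in
 * memory. PQclear() calls and retry logic are elided. */
#include "libpq-fe.h"
#include "common/logging.h"

static void
upload_manifest(PGconn *conn, const char *data, int len)
{
	PGresult   *res;

	if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
		pg_fatal("could not send UPLOAD_MANIFEST: %s",
				 PQerrorMessage(conn));
	res = PQgetResult(conn);
	if (PQresultStatus(res) != PGRES_COPY_IN)
		pg_fatal("server did not enter COPY mode: %s",
				 PQerrorMessage(conn));
	if (PQputCopyData(conn, data, len) < 0 ||
		PQputCopyEnd(conn, NULL) < 0)
		pg_fatal("could not send manifest data: %s", PQerrorMessage(conn));
	res = PQgetResult(conn);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
		pg_fatal("UPLOAD_MANIFEST failed: %s", PQerrorMessage(conn));
	if (PQgetResult(conn) != NULL)
		pg_fatal("unexpected extra result after UPLOAD_MANIFEST");
}
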
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1a8cef345d..33416b11cf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
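+ /*
+ * Bits in "found" track which lines we have seen: 1 = START WAL LOCATION,
+ * 2 = START TIMELINE, 4 = INCREMENTAL FROM LSN, 8 = INCREMENTAL FROM TLI.
+ */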
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
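+
+ /*
+ * Temporarily NUL-terminate the data at e so that sscanf() cannot read
+ * beyond the end of the line.
+ */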
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
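+ /*
+ * Read and write the file in fairly large chunks to keep system-call
+ * overhead low.
+ */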
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", src);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
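+ /*
+ * Passing true for bFailIfExists means we fail rather than overwrite an
+ * existing destination file.
+ */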
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
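+/*
+ * Declare a hash table keyed by each file's pathname; the matching
+ * definitions are generated in load_manifest.c via SH_DEFINE.
+ */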
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..039dce75ce
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
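+/* e.g. an incremental copy of relation file "16384" is named "INCREMENTAL.16384" */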
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
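+ /* Option values 1-3 identify long options with no single-letter form. */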
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", required_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("no input directories specified");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("new directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
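+
+ /*
+ * check_tli and check_lsn will hold the "INCREMENTAL FROM" values of the
+ * backup examined in the previous loop iteration (i.e. the next newer
+ * backup); each older backup must start where its successor says it
+ * should.
+ */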
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier;
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output=DIRECTORY output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -V, --version output version information, then exit\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ Oid oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (For example, if PG_VERSION contains "14\n", this
+ * function returns 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..c774bf1842
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,618 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
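+ *
+ * If checksum_type is not CHECKSUM_TYPE_NONE, a checksum of the output file
+ * is returned via *checksum_length and *checksum_payload, either computed
+ * here or reused from a prior backup_manifest when possible.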
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned num_missing_blocks;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Blocks prior to the truncation_block_length threshold must be obtained
+ * from some prior backup, while those after that threshold are left as
+ * zeroes if not present in the newest incremental file.
+ * num_missing_blocks counts the number of blocks that must be found
+ * somewhere in the backup chain, and is thus initially equal to
+ * truncation_block_length.
+ */
+ num_missing_blocks = latest_source->truncation_block_length;
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+ if (b < latest_source->truncation_block_length)
+ num_missing_blocks--;
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (num_missing_blocks > 0)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * something has gone wrong and reconstruction has failed.
+ */
+ if (sidx == 0)
+ pg_fatal("reconstruction for file \"%s\" failed to find %u required blocks",
+ output_filename, num_missing_blocks);
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ BlockNumber b;
+
+ /*
+ * Since we found a full file, source all remaining required
+ * blocks from it.
+ */
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+
+ Assert(num_missing_blocks > 0);
+ --num_missing_blocks;
+ }
+ }
+ Assert(num_missing_blocks == 0);
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ struct stat sb;
+ uint64 expected_length;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
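+ /* Prevent computing a fresh checksum below; we just reused one. */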
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
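+ *
+ * The on-disk format read here is: a 4-byte magic number, a 4-byte block
+ * count, a 4-byte truncation block length, and then one BlockNumber per
+ * block, followed by the block data itself.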
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
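+ *
+ * Each block is either read from the source file given by sourcemap, at the
+ * offset given by offsetmap, or zero-filled if sourcemap has no entry for
+ * it.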
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * This is a new block that is not mentioned in the WAL summary. It
+ * should be an uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..c599a70d42
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..d7f9e98b9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..73626f060c
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,89 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
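+# Row 3 ('crab') should be absent: it was inserted only on node1, after
+# node2 had already been created from backup2.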
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'found expected rows on node3');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..744adb759e
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,123 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
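+ *
+ * Overall, the manifest we write here (see also finalize_manifest) has
+ * this shape:
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ * "Files": [ ... ],
+ * "WAL-Ranges": [ ... ],
+ * "Manifest-Checksum": "..."}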
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
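+ *
+ * A typical entry, as assembled below, looks roughly like this (the values
+ * shown are hypothetical):
+ *
+ * { "Path": "base/1/1259", "Size": 8192,
+ * "Last-Modified": "2023-01-01 00:00:00 GMT",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "aa59f4df" }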
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
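+ *
+ * The caller is responsible for ensuring that dst has space for at least
+ * 2 * len bytes; no terminating NUL byte is written.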
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", WALSUMMARYDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
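
For reviewers, a condensed sketch of how the basebackup code is meant to
consume this API; the variable names are invented, the surrounding
bookkeeping is elided, and the buffer sizing just assumes a standard
relation segment:

    BlockNumber relative_block_numbers[RELSEG_SIZE];
    unsigned    num_blocks_required;
    unsigned    truncation_block_length;

    switch (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                                forknum, segno, size,
                                &num_blocks_required,
                                relative_block_numbers,
                                &truncation_block_length))
    {
        case BACK_UP_FILE_FULLY:
            /* send the whole file, as in a full backup */
            break;
        case BACK_UP_FILE_INCREMENTALLY:
            /* send an incremental file: header, truncation_block_length,
             * and only the listed blocks; its size is
             * GetIncrementalFileSize(num_blocks_required) */
            break;
        case DO_NOT_BACK_UP_FILE:
            /* skip the file entirely */
            break;
    }
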
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 4321ba8f86..856491eecd 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..ad11be4664 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "wal_summarize_mb = 0");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..4f52ddbe79 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+wal_summarize_mb = 0
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ wal_summarize_mb = 0
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..5fe4faf1be 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+wal_summarize_mb = 0
});
$node_primary->dump_info;
$node_primary->start;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7c913cbb93..064c0ecdc1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4004,3 +4004,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
v7-0003-Add-a-new-WAL-summarizer-process.patch
From 62ffb1206b0cb28a00de3a98507df0a384b6783e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v7 3/5] Add a new WAL summarizer process.
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter wal_summarize_mb enables or disables this new
background process, and also caps the amount of WAL covered by a single
summary file. However, a summary never spans more than one checkpoint
cycle, so in most practical cases the actual value of wal_summarize_mb is
unimportant, and it just acts as a flag to enable or disable
summarization.
XXX. Possibly we should turn this GUC into a Boolean or change how it
works somehow.
XXX. What should happen on a standby? Do we want summarization to
happen there just as it does on a primary, or do we want something
else?
The background process also automatically deletes summary files that
are older than wal_summarize_keep_time, if that parameter has a non-zero
value and the summarizer is configured to run.
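To spell out the limit-block rule with an example: if a relation fork is
truncated to 100 blocks partway through the summarized range and later
extended again, the summary records a limit block of 100, and blocks at or
above 100 appear in the modified-block list only if they were modified
again after the truncation. A consumer of a summary can therefore reason
like this (a sketch only, not code from the patch):

    if (limit_block == 0)
        /* relation created or destroyed within this WAL range */ ;
    else if (BlockNumberIsValid(limit_block))
        /* truncated to limit_block blocks at some point in the range;
         * higher-numbered blocks matter only if re-modified later */ ;
    else
        /* InvalidBlockNumber: no create/destroy/truncate in the range */ ;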
---
src/backend/access/transam/xlog.c | 100 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 53 +
src/backend/postmaster/walsummarizer.c | 1363 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 29 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
27 files changed, 3645 insertions(+), 10 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea..9ddad7864f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3555,6 +3556,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3834,8 +3872,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3878,6 +3916,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5202,9 +5260,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6921,6 +6979,24 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably promptly:
+ * we've only just inserted and flushed the XLOG_CHECKPOINT_SHUTDOWN
+ * record. If this is not a shutdown checkpoint, then this might not be
+ * very prompt at all: the XLOG_CHECKPOINT_REDO record was written before
+ * we began flushing data to disk, and that could be many minutes ago at
+ * this point. However, we don't XLogFlush() after inserting that record,
+ * so we're not guaranteed that it's on disk until after the above call
+ * that flushes the XLOG_CHECKPOINT_ONLINE record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7595,6 +7671,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ebf4ea038d
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
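
As a usage sketch, the incremental backup code is expected to combine these
helpers roughly as follows, where tli, start_lsn, and end_lsn come from the
prior backup's manifest and the new backup's start location (the error
wording is illustrative, not final):

    List       *wslist;
    XLogRecPtr  missing_lsn;

    /* All summaries on this timeline overlapping the range of interest. */
    wslist = GetWalSummaries(tli, start_lsn, end_lsn);

    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        ereport(ERROR,
                errmsg("WAL summaries are incomplete starting at %X/%X",
                       LSN_FORMAT_ARGS(missing_lsn)));
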
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9cb624eab8..86f6cf2feb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -1833,6 +1837,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2664,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3019,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3138,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3547,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3703,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3731,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3829,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4051,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5400,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5540,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (wal_summarize_mb != 0 && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..4ded951119
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1363 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+int wal_summarize_mb = 256;
+int wal_summarize_keep_time = 7 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr cutoff_lsn;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ elog(DEBUG2,
+ "switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn));
+ }
+
+ /*
+ * wal_summarize_mb sets a soft limit on the amount of WAL covered by a
+ * single summary file. If we read a WAL record that ends after the
+ * cutoff LSN computed here, we'll stop the summary. In most cases, it
+ * will actually stop earlier than that, but this is here as a
+ * backstop.
+ */
+ cutoff_lsn = current_lsn + wal_summarize_mb * 1024 * 1024;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && cutoff_lsn > switch_lsn)
+ cutoff_lsn = switch_lsn;
+ elog(DEBUG2,
+ "WAL summarization cutoff is TLI %u @ %X/%X, flush position is %X/%X",
+ current_tli, LSN_FORMAT_ARGS(cutoff_lsn), LSN_FORMAT_ARGS(latest_lsn));
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_tli != latest_tli,
+ current_lsn, exact,
+ cutoff_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ * Timeline remains unchanged unless a switch LSN was computed and we
+ * have reached it.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+ if (!XLogRecPtrIsInvalid(switch_lsn) && end_of_summary_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ }
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned LSN is
+ * necessarily the start of a WAL record and false if it's just the beginning
+ * of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (wal_summarize_mb == 0)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
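+
+/*
+ * A hypothetical caller sketch (not part of this patch): wait up to ten
+ * seconds for summarization to reach some LSN of interest, and treat a
+ * shortfall in the return value as a timeout.
+ *
+ *     XLogRecPtr reached;
+ *
+ *     reached = WaitForWalSummarization(target_lsn, 10000);
+ *     if (reached < target_lsn)
+ *         ereport(ERROR,
+ *                 (errmsg("timed out waiting for WAL summarization")));
+ */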
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || wal_summarize_mb == 0)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized. 'historic' should be false if the
+ * timeline in question is the latest one and true otherwise.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'cutoff_lsn' is the point at which we should stop summarizing. The first
+ * record that ends at or after cutoff_lsn will be the last one included
+ * in the summary.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch, or reading a record
+ * that ends after the cutoff_lsn.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, bool historic,
+ XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr cutoff_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = cutoff_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = historic;
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ cutoff_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (xlogreader->EndRecPtr < cutoff_lsn)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X: %s",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL at %X/%X",
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (xlogreader->ReadRecPtr >= cutoff_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the cutoff LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the cutoff LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the cutoff LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the cutoff LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = cutoff_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file,
+ * do so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(LOG,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like summarization
+ * to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
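+
+/*
+ * To illustrate the feedback loop above: starting from one quantum, three
+ * consecutive idle wakeups stretch the sleep to 2, 4, and then 8 quanta;
+ * if a burst then reads 7 pages before the next sleep, the sleep time
+ * drops straight back to a single quantum.
+ */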
+
+/*
+ * Remove old WAL summary files, if summary removal is enabled and the
+ * files in question are older than wal_summarize_keep_time.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the covered WAL no longer exists, we can remove the summary file,
+ * provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 182d666852..94e7944748 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4c58574166..faf42bdbfb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -3181,6 +3184,32 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_mb", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Number of bytes of WAL per summary file."),
+ gettext_noop("Smaller values minimize extra work performed by incremental backup, but increase the number of files on disk."),
+ GUC_UNIT_MB,
+ },
+ &wal_summarize_mb,
+ 256,
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 7 * 24 * 60, /* 1 week */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d08d55c3fe..4736606ac1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#wal_summarize_mb = 256 # MB of WAL per summary file, 0 disables
+#wal_summarize_keep_time = '7d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/common/Makefile b/src/common/Makefile
index 3c8effc533..2b41dd1839 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..012a443584
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
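+
+/*
+ * To make the arithmetic above concrete: each chunk covers 2^16 = 65536
+ * blocks. An array-format chunk tops out at MAX_ENTRIES_PER_CHUNK =
+ * 65536 / 16 = 4096 two-byte offsets, i.e. 8192 bytes, while a
+ * bitmap-format chunk always needs 65536 bits, which is also 8192 bytes.
+ * So the switch to a bitmap happens exactly at the crossover point where
+ * the two representations are the same size; beyond it, the bitmap is
+ * strictly smaller.
+ */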
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* status byte, used internally by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
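+
+/*
+ * Sketch of the on-disk layout produced by the writer routines below and
+ * consumed by the reader routines (derived from those routines, not a
+ * separate specification):
+ *
+ *     magic number (uint32)
+ *     for each relation fork, in sorted order:
+ *         BlockRefTableSerializedEntry
+ *         chunk usage array (nchunks x uint16, trailing zero entries trimmed)
+ *         data for each nonempty chunk (offset array or bitmap)
+ *     sentinel: an all-zeroes BlockRefTableSerializedEntry
+ *     CRC-32C of everything preceding it
+ */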
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = Min(stop_blkno - (chunkno * BLOCKS_PER_CHUNK),
+ BLOCKS_PER_CHUNK);
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
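+
+/*
+ * A hypothetical caller sketch (not part of this patch): fetch all modified
+ * blocks of an entry in batches of up to 256.
+ *
+ *     BlockNumber blocks[256];
+ *     BlockNumber next = 0;
+ *     int n;
+ *
+ *     while ((n = BlockRefTableEntryGetBlocks(entry, next,
+ *                                             InvalidBlockNumber,
+ *                                             blocks, lengthof(blocks))) > 0)
+ *     {
+ *         process(blocks, n);
+ *         next = blocks[n - 1] + 1;
+ *     }
+ */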
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {0};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
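+
+/*
+ * A minimal usage sketch for the reader API (hypothetical caller, not part
+ * of this patch), assuming a read_callback that pulls bytes from an open
+ * file:
+ *
+ *     BlockRefTableReader *reader;
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block;
+ *     BlockNumber blocks[256];
+ *
+ *     reader = CreateBlockRefTableReader(my_read_cb, &state, filename,
+ *                                        my_error_cb, NULL);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         unsigned n;
+ *
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                  lengthof(blocks))) > 0)
+ *             process_blocks(&rlocator, forknum, blocks, n);
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */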
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
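+
+/*
+ * A minimal usage sketch for the incremental writer (hypothetical caller,
+ * not part of this patch), assuming entries arrive already sorted:
+ *
+ *     BlockRefTableWriter *writer;
+ *
+ *     writer = CreateBlockRefTableWriter(my_write_cb, &state);
+ *     while ((entry = next_entry_in_sorted_order()) != NULL)
+ *         BlockRefTableWriteEntry(writer, entry);
+ *     DestroyBlockRefTableWriter(writer);
+ */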
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
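+ *
+ * For example, if a fork is truncated to 1000 blocks (and no lower limit
+ * block was previously recorded), any remembered modifications of block
+ * 1000 or above are forgotten: chunk 0 keeps only offsets below 1000, and
+ * all higher-numbered chunks are emptied.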
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {0};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index aa646f96a3..6348d60ec4 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..9d1e4ab57b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..d086e64019
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index c92d0631a0..9717c4630e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12071,4 +12071,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..22d9883dc5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7232b03e37..042fdc6ca1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..7584cb69a7
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern int wal_summarize_mb;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d5a0880678..7d3bc0f671 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 06b25617bc..7c913cbb93 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3993,3 +3993,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
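For concreteness, here is a minimal sketch of how the incremental writer
API declared in blkreftable.h above is meant to be driven. This is not
part of the patch: the stdio callback, the output file name, and the
OIDs are invented for illustration, and it assumes a frontend build
linked against the patched src/common.

#include <stdio.h>

#include "postgres_fe.h"
#include "common/blkreftable.h"

/* io_callback_fn wrapper around stdio, invented for this example. */
static int
write_to_file(void *callback_arg, void *data, int length)
{
	return (int) fwrite(data, 1, length, (FILE *) callback_arg);
}

int
main(void)
{
	FILE	   *fp = fopen("demo.summary", "wb");
	BlockRefTableWriter *writer;
	BlockRefTableEntry *entry;
	RelFileLocator rlocator = {.spcOid = 1663, .dbOid = 5, .relNumber = 16384};

	if (fp == NULL)
		return 1;

	writer = CreateBlockRefTableWriter(write_to_file, fp);

	/*
	 * Entries must be supplied sorted by tablespace, then database, then
	 * relfilenumber, then fork number, per the header comment.
	 */
	entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
	BlockRefTableEntrySetLimitBlock(entry, 0);	/* fork created in range */
	BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 7);
	BlockRefTableWriteEntry(writer, entry);
	BlockRefTableFreeEntry(entry);

	/* Writes the zeroed sentinel entry and the CRC, then flushes. */
	DestroyBlockRefTableWriter(writer);
	fclose(fp);
	return 0;
}

The read side is symmetrical: CreateBlockRefTableReader with a read
callback, then BlockRefTableReaderNextRelation and
BlockRefTableReaderGetBlocks until both are exhausted, then
DestroyBlockRefTableReader.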
On 2023-10-25 We 11:24, Robert Haas wrote:
On Wed, Oct 25, 2023 at 10:33 AM Andrew Dunstan <andrew@dunslane.net> wrote:
I'm not too worried about the maintenance burden.
That said, I agree that JSON might not be the best format for backup
manifests, but maybe that ship has sailed.
I think it's a decision we could walk back if we had a good enough
reason, but it would be nicer if we didn't have to, because what we
have right now is working. If we change it for no real reason, we
might introduce new bugs, and at least in theory, incompatibility with
third-party tools that parse the existing format. If you think we can
live with the additional complexity in the JSON parsing stuff, I'd
rather go that way.
OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote:
OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.
Why's that?
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-10-25 We 15:19, Robert Haas wrote:
On Wed, Oct 25, 2023 at 3:17 PM Andrew Dunstan <andrew@dunslane.net> wrote:
OK, I'll go with that. It will actually be a bit less invasive than the
patch I posted.
Why's that?
Because we won't be removing the RD parser.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Oct 26, 2023 at 6:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Because we won't be removing the RD parser.
Ah, OK.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.
I did some more research on this. My conclusion is that I should
remove wal_summarize_mb and just have a GUC summarize_wal = on|off
that controls whether the summarizer runs at all. There will be one
summary file per checkpoint, no matter how far apart checkpoints are
or how large the summary gets. Below I'll explain the reasoning; let
me know if you disagree.
What I describe above would be a bad plan if it were realistically
possible for a summary file to get so large that it might run the
machine out of memory either when producing it or when trying to make
use of it for an incremental backup. This seems to be a somewhat
difficult scenario to create. So far, I haven't been able to generate
WAL summary files more than a few tens of megabytes in size, even when
summarizing 50+ GB of WAL per summary file. One reason why it's hard
to produce large summary files is because, for a single relation fork,
the WAL summary size converges to 1 bit per modified block when the
number of modified blocks is large. This means that, even if you have
a terabyte sized relation, you're looking at no more than perhaps 20MB
of summary data no matter how much of it gets modified. Now, somebody
could have a 30TB relation and then if they modify the whole thing
they could have the better part of a gigabyte of summary data for that
relation, but if you've got a 30TB table you probably have enough
memory that that's no big deal.
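As a quick back-of-the-envelope check of those estimates, here is a
standalone snippet assuming the default 8kB block size; it is not from
the patch:

#include <stdio.h>

int
main(void)
{
	const double blcksz = 8192;	/* default BLCKSZ */
	const double tb = 1024.0 * 1024 * 1024 * 1024;
	const double sizes_tb[] = {1, 30};

	for (int i = 0; i < 2; i++)
	{
		double		nblocks = sizes_tb[i] * tb / blcksz;

		/* At the bitmap limit, each tracked block costs one bit. */
		printf("%2.0fTB relation: %.0f blocks, bitmap ~%.0f MB\n",
			   sizes_tb[i], nblocks, nblocks / 8 / (1024 * 1024));
	}
	return 0;
}

That prints about 16MB for a 1TB relation and about 480MB for a 30TB
relation, consistent with the figures above.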
But, what if you have multiple relations? I initialized pgbench with a
scale factor of 30000 and also with 30000 partitions and did a 1-hour
run. I got 4 checkpoints during that time and each one produced an
approximately 16MB summary file. The efficiency here drops
considerably. For example, one of the files is 16495398 bytes and
records information on 7498403 modified blocks, which works out to
about 2.2 bytes per modified block. That's more than an order of
magnitude worse than what I got in the single-relation case, where the
summary file didn't even use two *bits* per modified block. But here
again, the file just isn't that big in absolute terms. To get a 1GB+
WAL summary file, you'd need to modify millions of relation forks,
maybe tens of millions, and most installations aren't even going to
have that many relation forks, let alone be modifying them all
frequently.
My conclusion here is that it's pretty hard to have a database where
WAL summarization is going to use too much memory. I wouldn't be
terribly surprised if there are some extreme cases where it happens,
but those databases probably aren't great candidates for incremental
backup anyway. They're probably databases with millions of relations
and frequent, widely-scattered modifications to those relations. And
if you have that kind of high turnover rate then incremental backup
isn't going to be as helpful anyway, so there's probably no reason to
enable WAL summarization in the first place. Maybe if you have that
plus in the same database cluster you have 100TB of completely
static data that is never modified, and if you also do all of this on
a pretty small machine, then you can find a case where incremental
backup would have worked well but for the memory consumed by WAL
summarization.
But I think that's sufficiently niche that the current patch shouldn't
concern itself with such cases. If we find that they're common enough
to worry about, we might eventually want to do something to mitigate
them, but whether that thing looks anything like wal_summarize_mb
seems pretty unclear. So I conclude that it's a mistake to include
that GUC as currently designed and propose to replace it with a
Boolean as described above.
Comments?
--
Robert Haas
EDB: http://www.enterprisedb.com
While reviewing this thread today, I realized that I never responded
to this email. That was inadvertent; my apologies.
On Wed, Jun 14, 2023 at 4:34 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Nice, I like this idea.
Cool.
Skimming through the 7th patch, I see claims that FSM is not fully
WAL-logged and thus shouldn't be tracked, and so it indeed doesn't
track those changes.
I disagree with that decision: we now have support for custom resource
managers, which may use the various forks for other purposes than
those used in PostgreSQL right now. It would be a shame if data is
lost because of the backup tool ignoring forks because the PostgreSQL
project itself doesn't have post-recovery consistency guarantees in
that fork. So, unless we document that WAL-logged changes in the FSM
fork are actually not recoverable from backup, regardless of the type
of contents, we should still keep track of the changes in the FSM fork
and include the fork in our backups or only exclude those FSM updates
that we know are safe to ignore.
I'm not sure what to do about this problem. I don't think any data
would be *lost* in the scenario that you mention; what I think would
happen is that the FSM forks would be backed up in their entirety even
if they were owned by some other table AM or index AM that was
WAL-logging all changes to whatever it was storing in that fork. So I
think that there is not a correctness issue here but rather an
efficiency issue.
It would still be nice to fix that somehow, but I don't see how to do
it. It would be easy to make the WAL summarizer stop treating the FSM
as a special case, but there's no way for basebackup_incremental.c to
know whether a particular relation fork is for the heap AM or some
other AM that handles WAL-logging differently. It can't for example
examine pg_class; it's not connected to any database, let alone every
database. So we have to either trust that the WAL for the FSM is
correct and complete in all cases, or assume that it isn't in any
case. And the former doesn't seem like a safe or wise assumption given
how the heap AM works.
I think the reality here is unfortunately that we're missing a lot of
important infrastructure to really enable a multi-table-AM world. The
heap AM, and every other table AM, should include a metapage so we can
tell what we're looking at just by examining the disk files. Relation
forks don't scale and should be replaced with some better system that
does. We should have at least two table AMs in core that are fully
supported and do truly useful things. Until some of that stuff (and
probably a bunch of other things) get sorted out, out-of-core AMs are
going to have to remain second-class citizens to some degree.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
If that is still an area open for discussion: wouldn't it be better to
just specify LSN as it would allow resyncing standby across major lag
where the WAL to replay would be enormous? Given that we had
primary->standby where standby would be stuck on some LSN, right now
it would be:
1) calculate backup manifest of desynced 10TB standby (how? using
which tool?) - even if possible, that means reading 10TB of data
instead of just putting a number, isn't it?
2) backup primary with such incremental backup >= LSN
3) copy the incremental backup to standby
4) apply it to the impaired standby
5) restart the WAL replay
As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.
TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.
I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This actually
feels fairly poor. If it makes any difference at all, then presumably
the underlying algorithm is buggy and needs to be fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue. I also think that there's a real
possibility that they might make it easier to track down bugs in my
code, even if those bugs aren't necessarily timeline-related. If (or
more realistically when) somebody ends up with a corrupted cluster
after running pg_combinebackup, we're going to need to figure out
whether that corruption is the result of bugs (and if so where they
are) or user error (and if so what it was). The most obvious ways of
ending up with a corrupted cluster are (1) taking an incremental
backup against a prior backup that is not in the history of the server
from which the backup is taken or (2) combining an incremental backup
with the wrong prior backup, so whatever sanity checks we can have that
will tend to prevent those kinds of mistakes seem like a really good
idea.
And those kinds of checks seem relevant here, too. Consider that it
wouldn't be valid to use pg_combinebackup to fast-forward a standby
server if the incremental backup's backup-end-LSN preceded the standby
server's minimum recovery point. Imagine that you have a standby whose
last checkpoint's redo location was at LSN 2/48. Being the
enterprising DBA that you are, you make a note of that LSN and go take
an incremental backup based on it. You then stop the standby server
and try to apply the incremental backup to fast-forward the standby.
Well, it's possible that in the meanwhile the standby actually caught
up, and now has a minimum recovery point that follows the
backup-end-LSN of your incremental backup. In that case, you can't
legally use that incremental backup to fast-forward that standby, but
no code I've yet written would be smart enough to figure that out. Or,
maybe you (or some other DBA on your team) got really excited and
actually promoted that standby meanwhile, and now it's not even on the
same timeline any more. In the "normal" case where you take an
incremental backup based on an earlier base backup, these kinds of
problems are detectable, and it seems to me that if we want to enable
this kind of use case, it would be pretty smart to have a plan to
detect similar mistakes here. I don't, currently, but maybe there is
one.
Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time. But here, you'd want to clobber the *first* input to
pg_combinebackup, not the last one, so if we want to add something
like this, the UI needs some thought.
One thing that I find quite scary about such a mode is that if you
crash mid-way through, you're in a lot of trouble. In the case that I
had previous contemplated -- overwrite the last incremental with the
reconstructed full backup -- you *might* be able to make it crash safe
by writing out the full files for each incremental file, fsyncing
everything, then removing all of the incremental files and fsyncing
again. The idea would be that if you crash midway through it's OK to
just repeat whatever you were trying to do before the crash and if it
succeeds the second time then all is well. If, for a given file, there
are both incremental and non-incremental versions, then the second
attempt should remove and recreate the non-incremental version from
the incremental version. If there's only a non-incremental version, it
could be that the previous attempt got far enough to remove the
incremental file, but in that case the full file that we now have
should be the same thing that we would produce if we did the operation
now. It all sounds a little scary, but maybe it's OK. And as long as
you don't remove the this-is-an-incremental-backup markers from the
backup_label file until you've done everything else, you can tell
whether you've ever successfully completed the reassembly or not. But
if you're using a hypothetical overwrite mode to overwrite the first
input rather than the last one, well, it looks like a valid data
directory already, and if you replace a bunch of files and then crash,
it still looks like one, but it really isn't valid any more. I'm not sure I've really
wrapped my head around all of the cases here, but it does feel like
there are some new ways to go wrong.
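To make that ordering concrete, here is a hypothetical standalone sketch
of the overwrite-the-incremental flow. The file names are invented, and
this is not pg_combinebackup code:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *full = "16384";	/* reconstructed full file */
	const char *incr = "INCREMENTAL.16384";	/* incremental input */
	const char	data[] = "reconstructed contents";
	int			fd;
	int			dirfd;

	/* Step 1: write out the full version and make it durable. */
	fd = open(full, O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd < 0 ||
		write(fd, data, strlen(data)) != (ssize_t) strlen(data) ||
		fsync(fd) != 0 || close(fd) != 0)
	{
		perror(full);
		return 1;
	}

	/*
	 * Step 2: only then remove the incremental file. If we crash before
	 * this point, both files still exist and the whole operation can
	 * simply be repeated.
	 */
	if (unlink(incr) != 0 && errno != ENOENT)
	{
		perror(incr);
		return 1;
	}

	/* Step 3: fsync the directory so the unlink is durable as well. */
	dirfd = open(".", O_RDONLY);
	if (dirfd < 0 || fsync(dirfd) != 0 || close(dirfd) != 0)
	{
		perror(".");
		return 1;
	}
	return 0;
}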
One thing I also realized when thinking about this is that you could
probably hose yourself with the patch set as it stands today by taking
a full backup, downgrading to wal_level=minimal for a while, doing
some WAL-skipping operations, upgrading to a higher WAL-level again,
and then taking an incremental backup. I think the solution to that is
probably for the WAL summarizer to refuse to run if wal_level=minimal.
Then there would be a gap in the summary files which an incremental
backup attempt would detect.
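A standalone sketch of that rule follows; the names here are invented
for illustration, and this is not the patch's code:

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the relevant ordering of wal_level values for this demo. */
typedef enum
{
	DEMO_WAL_LEVEL_MINIMAL,
	DEMO_WAL_LEVEL_REPLICA,
	DEMO_WAL_LEVEL_LOGICAL
} DemoWalLevel;

/*
 * Under wal_level=minimal some relation data is deliberately not
 * WAL-logged, so a summary built from such WAL would be silently
 * incomplete. Refusing to summarize leaves a detectable gap instead.
 */
static bool
may_summarize(DemoWalLevel level)
{
	return level >= DEMO_WAL_LEVEL_REPLICA;
}

int
main(void)
{
	printf("minimal: %s\n",
		   may_summarize(DEMO_WAL_LEVEL_MINIMAL) ? "summarize" : "refuse");
	printf("replica: %s\n",
		   may_summarize(DEMO_WAL_LEVEL_REPLICA) ? "summarize" : "refuse");
	return 0;
}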
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2023-10-30 10:45:03 -0400, Robert Haas wrote:
On Tue, Oct 24, 2023 at 12:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
Note that whether to remove summaries is a separate question from
whether to generate them in the first place. Right now, I have
wal_summarize_mb controlling whether they get generated in the first
place, but as I noted in another recent email, that isn't an entirely
satisfying solution.
I did some more research on this. My conclusion is that I should
remove wal_summarize_mb and just have a GUC summarize_wal = on|off
that controls whether the summarizer runs at all. There will be one
summary file per checkpoint, no matter how far apart checkpoints are
or how large the summary gets. Below I'll explain the reasoning; let
me know if you disagree.
What I describe above would be a bad plan if it were realistically
possible for a summary file to get so large that it might run the
machine out of memory either when producing it or when trying to make
use of it for an incremental backup. This seems to be a somewhat
difficult scenario to create. So far, I haven't been able to generate
WAL summary files more than a few tens of megabytes in size, even when
summarizing 50+ GB of WAL per summary file. One reason why it's hard
to produce large summary files is because, for a single relation fork,
the WAL summary size converges to 1 bit per modified block when the
number of modified blocks is large. This means that, even if you have
a terabyte sized relation, you're looking at no more than perhaps 20MB
of summary data no matter how much of it gets modified. Now, somebody
could have a 30TB relation and then if they modify the whole thing
they could have the better part of a gigabyte of summary data for that
relation, but if you've got a 30TB table you probably have enough
memory that that's no big deal.
I'm not particularly worried about the rewriting-30TB-table case - that'd also
generate >= 30TB of WAL most of the time. Which realistically is going to
trigger a few checkpoints, even on very big instances.
But, what if you have multiple relations? I initialized pgbench with a
scale factor of 30000 and also with 30000 partitions and did a 1-hour
run. I got 4 checkpoints during that time and each one produced an
approximately 16MB summary file.
Hm, I assume the pgbench run will be fairly massively bottlenecked on IO, due
to having to read data from disk, lots of full page writes and having to write
out lots of data? I.e. we won't do all that many transactions during the 1h?
To get a 1GB+ WAL summary file, you'd need to modify millions of relation
forks, maybe tens of millions, and most installations aren't even going to
have that many relation forks, let alone be modifying them all frequently.
I tried to find bad cases for a bit - and I am not worried. I wrote a pgbench
script to create 10k single-row relations in each script, ran that with 96
clients, checkpointed, and ran a pgbench script that updated the single row in
each table.
After creation of the relation WAL summarizer uses
LOG: level: 1; Wal Summarizer: 378433680 total in 43 blocks; 5628936 free (66 chunks); 372804744 used
and creates a 26MB summary file.
After checkpoint & updates WAL summarizer uses:
LOG: level: 1; Wal Summarizer: 369205392 total in 43 blocks; 5864536 free (26 chunks); 363340856 used
and creates a 26MB summary file.
Sure, 350MB ain't nothing, but simply just executing \dt in the database
created by this makes the backend use 260MB after. Which isn't going away,
whereas WAL summarizer drops its memory usage soon after.
But I think that's sufficiently niche that the current patch shouldn't
concern itself with such cases. If we find that they're common enough
to worry about, we might eventually want to do something to mitigate
them, but whether that thing looks anything like wal_summarize_mb
seems pretty unclear. So I conclude that it's a mistake to include
that GUC as currently designed and propose to replace it with a
Boolean as described above.
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.
I wonder if there are use cases that might like to consume WAL summaries
before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
good control, but they might want to request a summary file to be created at
some point?
Greetings,
Andres Freund
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.
Cool! Thanks for testing.
I wonder if there are use cases that might like to consume WAL summaries
before the next checkpoint? For those wal_summarize_mb likely wouldn't be a
good control, but they might want to request a summary file to be created at
some point?
It's possible. I actually think it's even more likely that there are
use cases that will also want the WAL summarized, but in some
different way. For example, you might want a summary that would give
you the LSN or approximate LSN where changes to a certain block
occurred. Such a summary would be way bigger than these summaries and
therefore, at least IMHO, a lot less useful for incremental backup,
but it could be really useful for something else. Or you might want
summaries that focus on something other than which blocks got changed,
like what relations were created or destroyed, or only changes to
certain kinds of relations or relation forks, or whatever. In a way,
you can even think of logical decoding as a kind of WAL summarization,
just with a very different set of goals from this one. I won't be too
surprised if the next hacker wants something that is different enough
from what this does that it doesn't make sense to share mechanism, but
if by chance they want the same thing but dumped a bit more
frequently, well, that can be done.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 6:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 28, 2023 at 6:22 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
If that is still an area open for discussion: wouldn't it be better to
just specify LSN as it would allow resyncing standby across major lag
where the WAL to replay would be enormous? Given that we had
primary->standby where standby would be stuck on some LSN, right now
it would be:
1) calculate backup manifest of desynced 10TB standby (how? using
which tool?) - even if possible, that means reading 10TB of data
instead of just putting a number, isn't it?
2) backup primary with such incremental backup >= LSN
3) copy the incremental backup to standby
4) apply it to the impaired standby
5) restart the WAL replay
As you may be able to tell from the flurry of posts and new patch
sets, I'm trying hard to sort out the remaining open items that
pertain to this patch set, and I'm now back to thinking about this
one.TL;DR: I think the idea has some potential, but there are some
pitfalls that I'm not sure how to address.I spent some time looking at how we currently use the data from the
backup manifest. Currently, we do two things with it. First, when
we're backing up each file, we check whether it's present in the
backup manifest and, if not, we back it up in full. This actually
feels fairly poor. If it makes any difference at all, then presumably
the underlying algorithm is buggy and needs to be fixed. Maybe that
should be ripped out altogether or turned into some kind of sanity
check that causes a big explosion if it fails. Second, we check
whether the WAL ranges reported by the client match up with the
timeline history of the server (see PrepareForIncrementalBackup). This
set of sanity checks seems fairly important to me, and I'd regret
discarding them. I think there's some possibility that they might
catch user error, like where somebody promotes multiple standbys and
maybe they even get the same timeline on more than one of them, and
then confusion might ensue.
[..]
Another practical problem here is that, right now, pg_combinebackup
doesn't have an in-place mode. It knows how to take a bunch of input
backups and write out an output backup, but that output backup needs
to go into a new, fresh directory (or directories plural, if there are
user-defined tablespaces). I had previously considered adding such a
mode, but the idea I had at the time wouldn't have worked for this
case. I imagined that someone might want to run "pg_combinebackup
--in-place full incr" and clobber the contents of the incr directory
with the output, basically discarding the incremental backup you took
in favor of a full backup that could have been taken at the same point
in time.
[..]
Thanks for answering! It all sounds like this
resync-standby-using-primary-incrbackup idea isn't fit for the current
pg_combinebackup, but rather for a new tool, hopefully in the future. It
could take the current LSN from the stuck standby, calculate a manifest on
the lagged and offline standby (do we need to calculate the manifest
Checksum in that case? I cannot find code for it), deliver it via
"UPLOAD_MANIFEST" to primary and start fetching and applying the
differences while doing some form of copy-on-write from old & incoming
incrbackup data to "$relfilenodeid.new" and then durable_unlink() old
one and durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would
it still be possible in theory? (it could use additional safeguards
like renaming the controlfile when starting and just before ending to
additionally block startup if it hasn't finished). Also it looks, as
per the comment near struct IncrementalBackupInfo.manifest_files, that
even checksums are just more for safeguarding rather than core
implementation (?)
What I meant in the initial idea is not to hinder current efforts,
but to ask whether the current design would stand in the way of such a
cool new addition in the future?
One thing I also realized when thinking about this is that you could
probably hose yourself with the patch set as it stands today by taking
a full backup, downgrading to wal_level=minimal for a while, doing
some WAL-skipping operations, upgrading to a higher WAL-level again,
and then taking an incremental backup. I think the solution to that is
probably for the WAL summarizer to refuse to run if wal_level=minimal.
Then there would be a gap in the summary files which an incremental
backup attempt would detect.
As per earlier test [1], I've already tried to simulate that in
incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that
worked (but that was with the CTAS wal_level=minimal optimization -> a new
relfilenodeOID is used for CTAS, which got included in the incremental
backup as it's a new file). Even retested that with your v7 patch with
asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY
nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still
recovers using incremental backup because the WAL contains:
rmgr: Storage, desc: CREATE base/5/36425
[..]
rmgr: XLOG, desc: FPI , blkref #0: rel 1663/5/36425 blk 0 FPW
[..]
e.g. TRUNCATE sets a new relfilenode each time, so those relations will
always be included in the backup, and the wal_level=minimal
optimizations kick in only for commands that issue a new relfilenode.
True/false?
postgres=# select oid, relfilenode, relname from pg_class where
relname like 'night%' order by 1;
oid | relfilenode | relname
-------+-------------+---------------------
16384 | 0 | nightmare
16390 | 36420 | nightmare_p0
16398 | 36425 | nightmare_p1
36411 | 0 | nightmare_pkey
36413 | 36422 | nightmare_p0_pkey
36415 | 36427 | nightmare_p1_pkey
36417 | 0 | nightmare_brin_idx
36418 | 36423 | nightmare_p0_ts_idx
36419 | 36428 | nightmare_p1_ts_idx
(9 rows)
postgres=# truncate nightmare;
TRUNCATE TABLE
postgres=# select oid, relfilenode, relname from pg_class where
relname like 'night%' order by 1;
oid | relfilenode | relname
-------+-------------+---------------------
16384 | 0 | nightmare
16390 | 36434 | nightmare_p0
16398 | 36439 | nightmare_p1
36411 | 0 | nightmare_pkey
36413 | 36436 | nightmare_p0_pkey
36415 | 36441 | nightmare_p1_pkey
36417 | 0 | nightmare_brin_idx
36418 | 36437 | nightmare_p0_ts_idx
36419 | 36442 | nightmare_p1_ts_idx
-J.
[1]: /messages/by-id/CAKZiRmzT+bX2ZYdORO32cADtfQ9DvyaOE8fsOEWZc2V5FkEWVg@mail.gmail.com
On Wed, Nov 1, 2023 at 8:57 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Thanks for answering! It all sounds like this
resync-standby-using-primary-incrbackup idea isn't fit for the current
pg_combinebackup, but rather for a new tool, hopefully in the future. It
could take the current LSN from the stuck standby, calculate a manifest on
the lagged and offline standby (do we need to calculate the manifest
Checksum in that case? I cannot find code for it), deliver it via
"UPLOAD_MANIFEST" to primary and start fetching and applying the
differences while doing some form of copy-on-write from old & incoming
incrbackup data to "$relfilenodeid.new" and then durable_unlink() old
one and durable_rename("$relfilenodeid.new", "$relfilenodeid"). Would
it still be possible in theory? (it could use additional safeguards
like renaming the controlfile when starting and just before ending to
additionally block startup if it hasn't finished). Also it looks, as
per the comment near struct IncrementalBackupInfo.manifest_files, that
even checksums are just more for safeguarding rather than core
implementation (?)
What I meant in the initial idea is not to hinder current efforts,
but to ask whether the current design would stand in the way of such a
cool new addition in the future?
Hmm, interesting idea. I think something like that could be made to
work. My first thought was that it would sort of suck to have to
compute a manifest as a precondition of doing this, but then I started
to think maybe it wouldn't, really. I mean, you'd have to scan the
local directory tree and collect all the filenames so that you could
remove any files that are no longer present in the current version of
the data directory which the incremental backup would send to you. If
you're already doing that, the additional cost of generating a
manifest isn't that high, at least if you don't include checksums,
which aren't required. On the other hand, if you didn't need to send
the server a manifest and just needed to send the required WAL ranges,
that would be even cheaper. I'll spend some more time thinking about
this next week.
As per earlier test [1], I've already tried to simulate that in
incrbackuptests-0.1.tgz/test_across_wallevelminimal.sh, but that
worked (but that was with the CTAS wal_level=minimal optimization -> a new
relfilenodeOID is used for CTAS, which got included in the incremental
backup as it's a new file). Even retested that with your v7 patch with
asserts, same. When simulating with "BEGIN; TRUNCATE nightmare; COPY
nightmare FROM '/tmp/copy.out'; COMMIT;" on wal_level=minimal it still
recovers using incremental backup because the WAL contains:
TRUNCATE itself is always WAL-logged, but data added to the relation
in the same relation as the TRUNCATE isn't always WAL-logged (but
sometimes it is, depending on the relation size). So the failure case
wouldn't be missing the TRUNCATE but missing some data-containing
blocks within the relation shortly after it was created or truncated.
I think what I need to do here is avoid summarizing WAL that was
generated under wal_level=minimal. The walsummarizer process should
just refuse to emit summaries for any such WAL.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.
Here's v8. Changes:
- Replace wal_summarize_mb GUC with summarize_wal = on | off.
- Document the summarize_wal and wal_summary_keep_time GUCs.
- Refuse to start with summarize_wal = on and wal_level = minimal.
- Increase default wal_summary_keep_time to 10d from 7d, per (what I
think was) a suggestion from Peter E.
- Fix fencepost errors when deciding which WAL summaries are needed
for a backup.
- Fix indentation damage.
- Standardize on ereport(DEBUG1, ...) in walsummarizer.c vs. various
more and less chatty things I had before.
- Include the timeline in some error messages because not having it
proved confusing.
- Be more consistent about ignoring the FSM fork.
- Fix a bug that could cause WAL summarization to error out when
switching timelines.
- Fix the division between the wal summarizer and incremental backup
patches so that the former passes tests without the latter.
- Fix some things that an older compiler didn't like, including adding
pg_attribute_printf in some places.
- Die with an error instead of crashing if someone feeds us a manifest
with no WAL ranges.
- Sort the block numbers that need to be read from a relation file
before reading them, so that we're certain to read them in ascending
order.
- Be more careful about computing the truncation_block_length of an
incremental file; don't do math on a block number that might be
InvalidBlockNumber.
- Fix pg_combinebackup so it doesn't fail when zero-filled blocks are
added to a relation between the prior backup and the incremental
backup.
- Improve the pg_combinebackup -d output so that it explains in detail
how it's carrying out reconstruction, to improve debuggability.
- Disable WAL summarization by default, but add a test patch to the
series to enable it, because running the whole test suite with it
turned on is good for bug-hunting.
- In pg_walsummary, zero a struct before using instead of starting
with arbitrary junk values.
To do list:
- Figure out whether to do something other than uploading the whole
summary, per discussion with Jakub Wartak.
- Decide what to do about the 60-second waiting-for-WAL-summarization timeout.
- Make incremental backup fail quickly if WAL summarization is not even enabled.
- Have pg_basebackup error out nicely if an incremental backup is
requested from an older server that can't do that.
- Add some kind of tests for pg_walsummary.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v8-0005-Add-new-pg_walsummary-tool.patch
From dfa9822958d39ef997c1e071239a2bdfac347bd1 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v8 5/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 ++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 475 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..3a2122b067
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found in the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Print this range of block numbers. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 95ae399cae..f4c141f6fb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4021,3 +4021,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
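As a purely hypothetical illustration of the tool above (the file name and
all values are invented, but the line formats match the printf calls in
pg_walsummary.c):

    $ pg_walsummary 0000000100000028000000000000002900000000.summary
    TS 1663, DB 5, REL 16384, FORK main: limit 0
    TS 1663, DB 5, REL 16384, FORK main: blocks 0..127
    TS 1663, DB 5, REL 16385, FORK main: block 14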
v8-0001-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From f27134fe549da57e0839cc2ab4e5168fa879ad80 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v8 1/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f46219..4ba63ad8a6 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1163,7 +1148,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1190,17 +1176,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1248,37 +1240,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1417,8 +1412,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1440,40 +1435,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1488,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1495,8 +1457,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1522,36 +1482,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1566,7 +1504,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
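One detail of the sendFile() change above that may be worth spelling out:
the block number fed into checksum verification is relative to the whole
relation, not to the current segment file. With default build settings
(8kB blocks and 1GB segments, so RELSEG_SIZE = 131072 blocks), block 10 of
segment file "16384.1" is verified as block 1 * 131072 + 10 = 131082, which
is what blkno + segno * RELSEG_SIZE computes now that the caller passes in
the parsed segment number instead of re-deriving it from the file name.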
v8-0006-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From aa779568f816259e9f225a8c6f0a2e7cdb0252fa Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 6 Nov 2023 13:53:19 -0500
Subject: [PATCH v8 6/6] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal = on and wal_level = minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 505351f663..1bd570b370 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -140,7 +140,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summarize_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cf39f8a651..e631994da2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
v8-0003-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From b1631879f17c92eca327a4ae6e9b93901f95e582 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v8 3/6] Add a new WAL summarizer process.
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1374 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1309 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 120 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3722 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd70ff2e4b..862a143f17 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..d69f4ac5a7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3555,6 +3556,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3834,8 +3872,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3878,6 +3916,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5202,9 +5260,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6921,6 +6979,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7595,6 +7672,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..ca9d750483
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..505351f663
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1374 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which the
+ * next summary file will start. Normally, these are the TLI and LSN at
+ * which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+ bool waited;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
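+ *
+ * For example, on a completely idle system the sleep time doubles through
+ * 200ms, 400ms, 800ms, and so on until it hits the 30-second cap of 150
+ * quanta.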
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summarize_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the information to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
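+ *
+ * For example (numbers illustrative only): if a record ends exactly at
+ * 0/2000000, a page boundary, xlogreader reports 0/2000000 as its end
+ * LSN, but the next record actually begins after the page header, e.g.
+ * at 0/2000028 if that page starts a segment and so carries a long header.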
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /*
+ * This flag tracks whether the read of a particular record had to
+ * wait for more WAL to arrive, so reset it before reading the next
+ * record.
+ */
+ private_data->waited = false;
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ xlogreader->private_data;
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
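+ /*
+ * For instance (values illustrative only), a summary of TLI 1 covering
+ * 0/1000028 through 0/1D3AE70 would be renamed into place as
+ * pg_wal/summaries/0000000100000000010000280000000001D3AE70.summary.
+ */
+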
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ CHECK_FOR_INTERRUPTS();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+ private_data->waited = true;
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files older than wal_summarize_keep_time, once the
+ * WAL they summarize has itself been removed.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
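+ *
+ * Since wal_summarize_keep_time is measured in minutes, we multiply by
+ * 60 to get seconds; with the default of 14400 minutes (10 days), the
+ * cutoff falls 864000 seconds before the current time.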
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL doesn't exist any more, we can remove it if the file
+ * modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7605eff9b9..cf39f8a651 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..01c0428990 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summarize_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..4d32da0507
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1309 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
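/*
 * For concreteness, a sketch (not part of the patch) of how one block
 * number maps onto this representation.  Block 200000 falls in chunk 3;
 * while that chunk is sparse, offset 3392 occupies one uint16 array slot,
 * and once the chunk has been converted to a bitmap it is instead bit 0
 * of bitmap word 212.
 */
static void
blkreftable_layout_example(void)
{
    BlockNumber blknum = 200000;
    uint32      chunkno = blknum / BLOCKS_PER_CHUNK;     /* 200000 / 65536 = 3 */
    uint16      chunkoffset = blknum % BLOCKS_PER_CHUNK; /* = 3392 */

    Assert(chunkno == 3 && chunkoffset == 3392);
    Assert(chunkoffset / BLOCKS_PER_ENTRY == 212);  /* word within bitmap */
    Assert(chunkoffset % BLOCKS_PER_ENTRY == 0);    /* bit within word */
}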
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
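/*
 * A minimal usage sketch for the in-memory API (not part of the patch).
 * The RelFileLocator argument is schematic; real callers obtain it from
 * decoded WAL records.
 */
static void
blkreftable_usage_example(RelFileLocator rlocator)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();

    /* A WAL record touched block 42 of the main fork. */
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);

    /*
     * The relation was later truncated to 10 blocks: the limit block drops
     * to 10 and the reference to block 42 is forgotten.
     */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 10);
}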
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ {
+ stop_offset = stop_blkno - (chunkno * BLOCKS_PER_CHUNK);
+ if (stop_offset > BLOCKS_PER_CHUNK)
+ stop_offset = BLOCKS_PER_CHUNK;
+ }
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
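/*
 * Continuing the sketch above (not part of the patch): pulling back the
 * modified blocks recorded for one relation fork.
 */
static void
blkreftable_lookup_example(BlockRefTable *brtab, RelFileLocator rlocator)
{
    BlockRefTableEntry *entry;
    BlockNumber limit_block;
    BlockNumber blocks[16];
    int         nresults;

    entry = BlockRefTableGetEntry(brtab, &rlocator, MAIN_FORKNUM,
                                  &limit_block);
    if (entry == NULL)
        return;                 /* no WAL references to this fork */

    /* Fetch up to 16 modified block numbers below block 1000. */
    nresults = BlockRefTableEntryGetBlocks(entry, 0, 1000, blocks, 16);
    /* blocks[0 .. nresults - 1] are now valid. */
}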
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
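/*
 * The resulting file is therefore: the magic number; then, for each fork
 * in sorted order, a BlockRefTableSerializedEntry followed by its
 * chunk_usage array and chunk data; then an all-zeroes sentinel entry and
 * a CRC.  For illustration only (not part of the patch), a frontend write
 * callback satisfying the io_callback_fn contract might look like this;
 * server-side callers would presumably go through the File machinery
 * rather than stdio:
 */
#ifdef FRONTEND
static int
example_write_cb(void *arg, void *data, int length)
{
    FILE       *fp = (FILE *) arg;

    /* Must write everything, or report an error and not return. */
    if (fwrite(data, 1, length, fp) != (size_t) length)
        pg_fatal("could not write block reference table: %m");
    return length;
}
#endif

/* Serializing a whole in-memory table is then just:
 *     WriteBlockRefTable(brtab, example_write_cb, fp);
 */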
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
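/*
 * Reading a file back follows the two-level protocol described above:
 * advance to a relation fork, then drain its blocks before advancing
 * again.  A sketch (not part of the patch; read_cb and report_error_cb
 * are hypothetical callbacks matching io_callback_fn/report_error_fn):
 */
static void
example_read_all(io_callback_fn read_cb, void *read_arg,
                 report_error_fn report_error_cb, char *filename)
{
    BlockRefTableReader *reader;
    RelFileLocator rlocator;
    ForkNumber  forknum;
    BlockNumber limit_block;

    reader = CreateBlockRefTableReader(read_cb, read_arg, filename,
                                       report_error_cb, NULL);
    while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
                                           &limit_block))
    {
        BlockNumber blocks[256];
        unsigned    nblocks;

        while ((nblocks = BlockRefTableReaderGetBlocks(reader,
                                                       blocks, 256)) > 0)
        {
            /* process blocks[0 .. nblocks - 1] for this relation fork */
        }
    }
    DestroyBlockRefTableReader(reader);
}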
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
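/*
 * A sketch of the incremental write path (not part of the patch), for a
 * caller that produces one entry at a time in the required sort order.
 * The writer itself would be created with CreateBlockRefTableWriter()
 * and finished with DestroyBlockRefTableWriter():
 */
static void
example_write_one_entry(BlockRefTableWriter *writer,
                        RelFileLocator rlocator, ForkNumber forknum)
{
    BlockRefTableEntry *entry = CreateBlockRefTableEntry(rlocator, forknum);

    BlockRefTableEntrySetLimitBlock(entry, 10); /* truncated to 10 blocks */
    BlockRefTableEntryMarkBlockModified(entry, forknum, 4);
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);
}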
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f14aed422a..8c550560a7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12086,4 +12086,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..70d6c072d7
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+struct BlockRefTable;
+struct BlockRefTableEntry;
+struct BlockRefTableReader;
+struct BlockRefTableWriter;
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..15db2377dd
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif /* WALSUMMARIZER_H */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 87c1aee379..bf81b91e20 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3998,3 +3998,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
v8-0004-Add-support-for-incremental-backup.patch
From 811a0f412169277aceb89b0dd5cc5acd02d3e658 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v8 4/6] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 334 ++++-
src/backend/backup/basebackup_incremental.c | 917 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 29 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1271 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 681 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 56 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
47 files changed, 5684 insertions(+), 61 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
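+
+ <para>
+  An incremental backup can only be taken if WAL summarization is enabled.
+  A minimal sketch of the required <filename>postgresql.conf</filename>
+  change, using the <xref linkend="guc-summarize-wal"/> parameter, is:
+<programlisting>
+summarize_wal = on
+</programlisting>
+ </para>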
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
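+
+ <para>
+  For example, a chain of incremental backups can be taken by passing each
+  new backup the manifest of the previous one. This is only a sketch, and
+  the directory names are illustrative, not significant:
+<programlisting>
+pg_basebackup -cfast -D sunday
+pg_basebackup -cfast -D monday --incremental sunday/backup_manifest
+pg_basebackup -cfast -D tuesday --incremental monday/backup_manifest
+</programlisting>
+ </para>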
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
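+ <para>
+  As a minimal sketch (the paths here are illustrative), if the chain
+  consists of one full backup and two incremental backups, the
+  reconstruction might look like this:
+<programlisting>
+pg_combinebackup /backups/full /backups/incr1 /backups/incr2 -o /usr/local/pgsql/data
+</programlisting>
+ </para>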
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 862a143f17..1e646f5978 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
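+
+ <para>
+  As an illustrative sketch (the directory names are assumptions, not
+  defaults), a synthetic full backup can stand in for the chain from which
+  it was built:
+<programlisting>
+pg_combinebackup full incr1 -o synthetic
+pg_combinebackup synthetic incr2 -o restored
+</programlisting>
+  Here <literal>incr2</literal> was taken after <literal>incr1</literal>,
+  and <literal>synthetic</literal> replaces <literal>full</literal> and
+  <literal>incr1</literal> on the command line.
+ </para>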
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
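+ <para>
+  For example (the paths are illustrative):
+<programlisting>
+pg_combinebackup full incr -o outputdir -T /srv/ts_old=/srv/ts_new
+</programlisting>
+ </para>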
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 4ba63ad8a6..8a70a9ae41 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1086,7 +1126,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1120,7 +1160,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1140,7 +1180,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1149,7 +1189,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1182,7 +1231,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1331,11 +1383,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1404,33 +1458,88 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
- if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
- if (sent || sizeonly)
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
+
+ if (method != DO_NOT_BACK_UP_FILE)
{
- /* Add size. */
- size += statbuf.st_size;
+ if (!sizeonly)
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
+
+ if (sent || sizeonly)
+ {
+ /* Add size. */
+ size += statbuf.st_size;
- /* Pad to a multiple of the tar block size. */
- size += tarPaddingBytesRequired(statbuf.st_size);
+ /* Pad to a multiple of the tar block size. */
+ size += tarPaddingBytesRequired(statbuf.st_size);
- /* Size of the header for the file. */
- size += TAR_BLOCK_SIZE;
+ /* Size of the header for the file. */
+ size += TAR_BLOCK_SIZE;
+ }
}
}
else
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1443,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1450,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1459,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1491,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
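+
+ /*
+ * Illustrative sketch of the resulting incremental file layout: the
+ * header emitted above is followed by the requested blocks, which the
+ * loop below sends in the order they appear in incremental_blocks.
+ *
+ * magic unsigned
+ * num_incremental_blocks unsigned
+ * truncation_block_length unsigned
+ * block numbers BlockNumber[num_incremental_blocks]
+ * block contents BLCKSZ bytes per included block
+ *
+ * All header fields are written in native byte order.
+ */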
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1689,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..12699b5984
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,917 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
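+ *
+ * As an illustrative example: if this server's history is TLI 1 -> 2 -> 3
+ * and the manifest contains WAL ranges for TLIs 1 and 2, then
+ * readTimeLineHistory returns entries ordered 3, 2, 1; TLI 2 is therefore
+ * the latest WAL range in the manifest and TLI 1 the earliest.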
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
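+ *
+ * For example, segment 3 of relfilenumber 16385 in database 16384, normally
+ * stored as "base/16384/16385.3", is sent incrementally under the name
+ * "base/16384/INCREMENTAL.16385.3". (The OIDs shown are illustrative.)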
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE
+ * entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ *
+ * If the return value is DO_NOT_BACK_UP_FILE, the caller should not include
+ * the file in the backup at all.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid can be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should always have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to back
+ * up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so transpose the absolute
+ * block numbers to relative block numbers.
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
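+ *
+ * For illustration, with the default BLCKSZ of 8192, an incremental file
+ * holding 3 blocks occupies 3 * 4 = 12 bytes of header fields, 3 * 4 = 12
+ * bytes of block numbers, and 3 * 8192 = 24576 bytes of block contents,
+ * or 24600 bytes in total.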
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point to
+ * an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from the client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere while in COPY mode. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
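+
+/*
+ * Taken together, UploadManifest() and HandleUploadManifestPacket() are
+ * intended to implement the following exchange, mirroring the client-side
+ * logic this patch adds to pg_basebackup:
+ *
+ *   client: UPLOAD_MANIFEST
+ *   server: CopyInResponse
+ *   client: CopyData messages carrying the raw backup_manifest contents
+ *   client: CopyDone
+ *   server: CommandComplete (once the manifest has been parsed)
+ */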
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
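+ *
+ * For reference, the lines of interest look like this (values are
+ * illustrative):
+ *
+ *   START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ *   START TIMELINE: 1
+ *   INCREMENTAL FROM LSN: 0/B000028
+ *   INCREMENTAL FROM TLI: 1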
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
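+ *
+ * For example, applied to the text "0/2000028 " this stores
+ * 0x0000000002000028 into *lsn and leaves *c pointing at the trailing space.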
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..08d6ed67a9
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..8ba6cc09e4
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
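+ *
+ * For instance, a 25 MB manifest yields an estimate of roughly 260,000
+ * entries; load_backup_manifest() below clamps the estimate to the range
+ * [256, PG_UINT32_MAX] before sizing the table.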
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..e607f35edb
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1271 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
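+
+/*
+ * Under this convention, a file named INCREMENTAL.16385 within an
+ * incremental backup is the incremental form of the file named 16385 at the
+ * same relative path, and reconstruction is expected to write its output
+ * under the unprefixed name. (The relfilenumber shown is illustrative.)
+ */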
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
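+
+ /*
+ * Note that prior_backup_dirs actually points at the full list of backup
+ * directories; we just never look past the first n_prior_backups entries,
+ * which keeps its indexing aligned with the manifests array, whose final
+ * element (manifests[n_prior_backups]) describes the final backup.
+ */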
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ /* read_pg_version_file() already returned PG_VERSION_NUM format */
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
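+ /*
+ * For example (with purely illustrative paths), "-T /srv/old=/srv/new"
+ * relocates the tablespace at /srv/old to /srv/new, while in
+ * "-T /srv/a\=b=/srv/new" the escaped "=" makes the old directory
+ * "/srv/a=b".
+ */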
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
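+
+ /*
+ * As we walk the chain from newest to oldest, check_tli and check_lsn
+ * hold the timeline and LSN at which the current backup claims its
+ * predecessor started; each older backup_label must match them.
+ */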
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used a multi-part version number (e.g. 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..7cd457aef3
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
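+
+ /*
+ * For example, sourcemap[7] == NULL means that block 7 of the output file
+ * will be zero-filled, while sourcemap[7] == s means that block 7 will be
+ * read from offsetmap[7] bytes into s's file.
+ */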
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length != 0)
+ {
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+ else
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file but
+ * taking no action on those blocks that generated any WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
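+/*
+ * As consumed below, an incremental file begins with a header consisting of
+ * a 32-bit magic number, a 32-bit block count, a 32-bit truncation block
+ * length, and one 32-bit relative block number per stored block; the stored
+ * blocks themselves, BLCKSZ bytes each, follow that header.
+ */
+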
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf->filename);
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != (ssize_t) length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
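+ /*
+ * Each entry in the plan is printed as "START-END:SOURCE@OFFSET" (or
+ * "BLOCK:SOURCE@OFFSET" for a single-block range), where SOURCE is the
+ * path of the file supplying those blocks; ranges with no source are
+ * shown as "zero" with no offset.
+ */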
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'table has expected contents');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..c300235a2f
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY,
+ DO_NOT_BACK_UP_FILE
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..b711d60fc4 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bf81b91e20..95ae399cae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4009,3 +4009,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v8-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch
From 571b192c1655d1e74d0a5d3b0b3915cc69e7463f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v8 2/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index bf0227c668..ee6f74e5d5 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
On Tue, Nov 7, 2023 at 2:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 30, 2023 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
After playing with this for a while, I don't see a reason for wal_summarize_mb
from a memory usage POV at least.

Here's v8. Changes:
Review comments, based on what I reviewed so far.
- I think 0001 looks like a good improvement irrespective of the patch series.
- review 0003
1.
+ be enabled either on a primary or on a standby. WAL summarization can
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>.
Grammatical error
"WAL summarization can cannot" -> WAL summarization cannot
2.
+ <varlistentry id="guc-wal-summarize-keep-time"
xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname>
configuration parameter</primary>
+ </indexterm>
I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time; I mean, we should refer either to the
summarizer process or to the WAL summary files.
3.
+XLogGetOldestSegno(TimeLineID tli)
+{
+
+ /* Ignore files that are not XLOG segments */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
Some of the single-line comments end with a full stop whereas others
do not, so better to be consistent.
4.
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end before the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
Instead of "If start_lsn != InvalidXLogRecPtr, only summaries that end
before the" it should be "If start_lsn != InvalidXLogRecPtr, only
summaries that end after the" because only if the summary files are
Ending after the start_lsn then it will have some overlapping and we
need to return them if ending before start lsn then those files are
not overlapping at all, right?
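
To make the semantics concrete, this is the overlap test I have in
mind -- just a sketch, assuming each summary covers a half-open LSN
range [s_start, s_end), and with an invented function name:

/*
 * Sketch: does a summary covering [s_start, s_end) overlap the requested
 * range?  An invalid start_lsn or end_lsn leaves that side unbounded.
 */
static bool
summary_overlaps_range(XLogRecPtr s_start, XLogRecPtr s_end,
                       XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
    /* A summary that ends at or before start_lsn does not overlap. */
    if (!XLogRecPtrIsInvalid(start_lsn) && s_end <= start_lsn)
        return false;
    /* A summary that starts at or after end_lsn does not overlap either. */
    if (!XLogRecPtrIsInvalid(end_lsn) && s_start >= end_lsn)
        return false;
    return true;
}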
5.
In FilterWalSummaries() header also the comment is wrong same as for
GetWalSummaries() function.
6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
I did not see a caller of this function yet, but I think if the whole
range is not covered, why not keep the behavior of '*missing_lsn'
uniform? I mean, suppose there is no file at all; then missing_lsn
could be start_lsn, because the very first LSN of the range is missing.
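
In other words, something of this shape -- a sketch of the variant I
am suggesting, not the patch's actual code:

/* Sketch: always point missing_lsn at the first uncovered LSN. */
if (wslist == NIL)
{
    *missing_lsn = start_lsn;   /* nothing covers the range at all */
    return false;
}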
7.
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
/could not write file/ could not read file
8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Great stuff you got here. I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.
I got the following warnings, both valid:
../../../../pgsql/source/master/src/common/blkreftable.c: In function 'WriteBlockRefTable':
../../../../pgsql/source/master/src/common/blkreftable.c:520:45: warning: declaration of 'brtentry' shadows a previous local [-Wshadow=compatible-local]
520 | BlockRefTableEntry *brtentry;
| ^~~~~~~~
../../../../pgsql/source/master/src/common/blkreftable.c:492:37: note: shadowed declaration is here
492 | BlockRefTableEntry *brtentry;
| ^~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c: In function 'SummarizeWAL':
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:833:57: warning: declaration of 'private_data' shadows a previous local [-Wshadow=compatible-local]
833 | SummarizerReadLocalXLogPrivate *private_data;
| ^~~~~~~~~~~~
../../../../../pgsql/source/master/src/backend/postmaster/walsummarizer.c:709:41: note: shadowed declaration is here
709 | SummarizerReadLocalXLogPrivate *private_data;
| ^~~~~~~~~~~~
In blkreftable.c, I think the definition of SH_EQUAL should have an
outer layer of parentheses. Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key. On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.
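
Something like this is what I have in mind -- only a sketch, assuming
the key holds a RelFileLocator plus a fork number; the helper name is
invented:

/*
 * Zero the whole key first so that any padding bytes have a
 * deterministic value, then fill in the fields.
 */
static inline void
InitBlockRefTableKey(BlockRefTableKey *key,
                     const RelFileLocator *rlocator,
                     ForkNumber forknum)
{
    memset(key, 0, sizeof(BlockRefTableKey));
    key->rlocator = *rlocator;
    key->forknum = forknum;
}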
I don't think SummarizerReadLocalXLogPrivate->waited is used for
anything. Could be removed AFAICS, unless you foresee adding
something that uses it.
These forward struct declarations are not buying you anything, I'd
remove them:
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
index 70d6c072d7..316e67122c 100644
--- a/src/include/common/blkreftable.h
+++ b/src/include/common/blkreftable.h
@@ -29,10 +29,7 @@
/* Magic number for serialization file format. */
#define BLOCKREFTABLE_MAGIC 0x652b137b
-struct BlockRefTable;
-struct BlockRefTableEntry;
-struct BlockRefTableReader;
-struct BlockRefTableWriter;
+/* Struct definitions appear in blkreftable.c */
typedef struct BlockRefTable BlockRefTable;
typedef struct BlockRefTableEntry BlockRefTableEntry;
typedef struct BlockRefTableReader BlockRefTableReader;
and backup_label.h doesn't know about TimeLineID, so it needs this:
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
index 08d6ed67a9..3af7ea274c 100644
--- a/src/bin/pg_combinebackup/backup_label.h
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -12,6 +12,7 @@
#ifndef BACKUP_LABEL_H
#define BACKUP_LABEL_H
+#include "access/xlogdefs.h"
#include "common/checksum_helper.h"
#include "lib/stringinfo.h"
I don't much like the way header files in src/bin/pg_combinebackup files
are structured. Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain. I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that.. But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h? It may be unwise to have the
simplehash in a header file. Maybe just declare it in each .c file that
needs it. The duplicity is not that large.
I'll see if I can understand the way all these headers are needed to
propose some other arrangement.
I see this idea of having "struct FooBar;" immediately followed by
"typedef struct FooBar FooBar;" which I mentioned from blkreftable.h
occurs in other places as well (JsonManifestParseContext in
parse_manifest.h, maybe others?). Was this pattern cargo-culted from
somewhere? Perhaps we have other places to clean up.
Why leave unnamed arguments in function declarations? For example, in
static void manifest_process_file(JsonManifestParseContext *,
char *pathname,
size_t size,
pg_checksum_type checksum_type,
int checksum_length,
uint8 *checksum_payload);
the first argument lacks a name. Is this just an oversight, I hope?
In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file
path as well as the individual constituent pieces for it.
DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use
this later? If not, maybe remove it.
There are two functions named record_manifest_details_for_file() in
different programs. I think this sort of arrangement is not great, as
it is confusing to follow. It would be better if those two
routines were called something like, say, verifybackup_perfile_cb and
combinebackup_perfile_cb instead; then in the function comment say
something like
/*
* JsonManifestParseContext->perfile_cb implementation for pg_combinebackup.
*
* Record details extracted from the backup manifest for one file,
* because we like to keep things tracked or whatever.
*/
so it's easy to track down what does what and why. Same with
perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why
not per_file_cb? and per_walrange_cb?
In walsummarizer.c, HandleWalSummarizerInterrupts is called in
summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
Maybe it should?
I think this path is not going to be very human-likeable.
snprintf(final_path, MAXPGPATH,
XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
tli,
LSN_FORMAT_ARGS(summary_start_lsn),
LSN_FORMAT_ARGS(summary_end_lsn));
Why not add a dash between the TLI and between both LSNs, or something
like that? (Also, are we really printing TLIs as 8-character hex values?)
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"I suspect most samba developers are already technically insane...
Of course, since many of them are Australians, you can't tell." (L. Torvalds)
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
- I think 0001 looks like a good improvement irrespective of the patch series.
OK, perhaps that can be independently committed, then, if nobody objects.
Thanks for the review; I've fixed a bunch of things that you
mentioned. I'll just comment on the ones I haven't yet done anything
about below.
2. + <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time"> + <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>) + <indexterm> + <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary> + </indexterm>I feel the name of the guy should be either wal_summarizer_keep_time
or wal_summaries_keep_time, I mean either we should refer to the
summarizer process or to the way summaries files.
How about wal_summary_keep_time?
6.
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,

I did not see a caller of this function yet, but I think if the whole
range is not covered, why not keep the behavior of '*missing_lsn'
uniform? I mean, suppose there is no file at all; then missing_lsn
could be start_lsn, because the very first LSN of the range is missing.
It's used later in the patch series. I think the way that I have it
makes for a more understandable error message.
8.
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
I'm not sure what needs fixing here.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Great stuff you got here. I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.
Thanks, thanks, and thanks.
I've fixed some things that you mentioned in the attached version.
Other comments below.
In blkreftable.c, I think the definition of SH_EQUAL should have an
outer layer of parentheses. Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key. On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.
I'm having trouble understanding the second part of this suggestion.
Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
in a backend context, we get the default, which is
MemoryContextAllocZero. Maybe there's some case this doesn't cover,
though?
These forward struct declarations are not buying you anything, I'd
remove them:
I've had problems from time to time when I don't do this. I'll remove
it here, but I'm not convinced that it's always useless.
I don't much like the way header files in src/bin/pg_combinebackup files
are structured. Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain. I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that.. But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h? It may be unwise to have the
simplehash in a header file. Maybe just declare it in each .c file that
needs it. The duplicity is not that large.
I think that I did this correctly. AIUI, if you're defining a
simplehash that only one source file needs, you make the scope
"static" and do both SH_DECLARE and SH_DEFILE it in that file. If you
need it to be shared by multiple files, you make it "extern" in the
header file, do SH_DECLARE there, and SH_DEFINE in one of those source
files. Or you could make the scope "static inline" in the header file
and then you'd do both SH_DECLARE and SH_DEFINE in the header file.
If I were to do as you suggest here, I think I'd end up with 2 copies
of the compiled code for this instead of one, and if they ever got out
of sync everything would break silently.
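
For the record, the single-file pattern I mean looks roughly like this
-- a sketch with made-up names, not the actual definitions from the
patch:

/* In exactly one .c file: declare and define a private simplehash. */
typedef struct file_hash_entry
{
    const char *path;           /* hash key */
    char        status;         /* per-entry state required by simplehash */
} file_hash_entry;

#define SH_PREFIX               file_hash
#define SH_ELEMENT_TYPE         file_hash_entry
#define SH_KEY_TYPE             const char *
#define SH_KEY                  path
#define SH_HASH_KEY(tb, key)    hash_path(key)  /* made-up hash helper */
#define SH_EQUAL(tb, a, b)      (strcmp(a, b) == 0)
#define SH_SCOPE                static
#define SH_RAW_ALLOCATOR        pg_malloc0
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"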
Why leave unnamed arguments in function declarations? For example, in
static void manifest_process_file(JsonManifestParseContext *,
char *pathname,
size_t size,
pg_checksum_type checksum_type,
int checksum_length,
uint8 *checksum_payload);
the first argument lacks a name. Is this just an oversight, I hope?
I mean, I've changed it now, but I don't think it's worth getting too
excited about. "int checksum_length" is much better documentation than
just "int," but "JsonManifestParseContext *context" is just noise,
IMHO. You can argue that it's better for consistency that way, but
whatever.
In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file
path as well as the individual constituent pieces for it.
The header comment does document which values are potentially set on
return. I guess I thought it was clear enough that the stuff not
documented to be output parameters was input parameters. Most of them
aren't even pointers, so they have to be input parameters. The only
exception is 'path', which I have some difficulty thinking that anyone
is going to imagine to be an output pointer.
Maybe you could propose a more specific rewording of this comment?
FWIW, I'm not altogether sure whether this function is going to get
more heavily adjusted in a rev or three of the patch set, so maybe we
want to wait to sort this out until this is closer to final, but OTOH
if I know what you have in mind for the current version, I might be
more likely to keep it in a good place if I end up changing it.
DO_NOT_BACKUP_FILE appears not to be set anywhere. Do you expect to use
this later? If not, maybe remove it.
Woops, that was a holdover from an earlier version.
There are two functions named record_manifest_details_for_file() in
different programs. I think this sort of arrangement is not great, as
it is confusing confusing to follow. It would be better if those two
routines were called something like, say, verifybackup_perfile_cb and
combinebackup_perfile_cb instead; then in the function comment say
something like
/*
* JsonManifestParseContext->perfile_cb implementation for pg_combinebackup.
*
* Record details extracted from the backup manifest for one file,
* because we like to keep things tracked or whatever.
*/
so it's easy to track down what does what and why. Same with
perwalrange_cb. "perfile" looks bothersome to me as a name entity. Why
not per_file_cb? and per_walrange_cb?
I had trouble figuring out how to name this stuff. I did notice the
awkwardness, but surely nobody can think that two functions with the
same name in different binaries can be actually the same function.
If we want to inject more underscores here, my vote is to go all the
way and make it per_wal_range_cb.
In walsummarizer.c, HandleWalSummarizerInterrupts is called in
summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
Maybe it should?
I replaced all the CHECK_FOR_INTERRUPTS() in that file with
HandleWalSummarizerInterrupts(). Does that seem right?
I think this path is not going to be very human-likeable.
snprintf(final_path, MAXPGPATH,
XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
tli,
LSN_FORMAT_ARGS(summary_start_lsn),
LSN_FORMAT_ARGS(summary_end_lsn));
Why not add a dash between the TLI and between both LSNs, or something
like that? (Also, are we really printing TLIs as 8-byte hexs?)
Dealing with the last part first, we already do that in every WAL file
name. I actually think these file names are easier to work with than
WAL file names, because 000000010000000000000020 is not the WAL
starting at 0/20, but rather the WAL starting at 0/20000000. To know
at what LSN a WAL file starts, you have to mentally delete characters
17 through 22, which will always be zero, and instead add six zeroes
at the end. I don't think whoever came up with that file naming
convention deserves an award, unless it's a raspberry award. With
these names, you get something like
0000000100000000015125B800000000015128F0.summary and you can sort of
see that 1512 repeats so the LSN went from something ending in 5B8 to
something ending in 8F0. I actually think it's way better.
But I have a hard time arguing that it wouldn't be more readable still
if we put some separator characters in there. I didn't do that because
then they'd look less like WAL file names, but maybe that's not really
a problem. A possible reason not to bother is that these files are
less important for humans to deal with than WAL files, since they
don't need to be archived or transported between nodes in any way.
Basically I think this is probably fine the way it is, but if you or
others think it's really important to change it, I can do that. Just
as long as we don't spend 50 emails arguing about which separator
character to use.
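For the archives, here's roughly what reading one of these files with
the pg_walsummary tool from 0005 might look like. The values below are
invented, but the output format follows the printf calls in
dump_one_relation:

pg_walsummary pg_wal/summaries/0000000100000000015125B800000000015128F0.summary
TS 1663, DB 5, REL 16384, FORK main: limit 0
TS 1663, DB 5, REL 16384, FORK main: blocks 0..127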
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v9-0005-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From f3905c11344d141ce84e900b19bf7bdd3c2d1572 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v9 5/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 ++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 475 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..3a2122b067
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found in the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, database OID, relation
+ OID, and relation fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ deleted within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("no input files specified");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ea71b215ee..42728036e2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4025,3 +4025,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
v9-0001-Change-how-a-base-backup-decides-which-files-have.patch (application/octet-stream)
From 70f747876d92ffc5c930f2212b7abaad44cde3ca Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:28 -0400
Subject: [PATCH v9 1/6] Change how a base backup decides which files have
checksums.
Previously, it thought that any plain file located under global, base,
or a tablespace directory had checksums unless it was in a short list
of excluded files. Now, it thinks that files in those directories have
checksums if parse_filename_for_nontemp_relation says that they are
relation files. (Temporary relation files don't matter because they're
excluded from the backup anyway.)
This changes the behavior if you have stray files not managed by
PostgreSQL in the relevant directories. Previously, you'd get some
kind of checksum-related complaint if such files existed, assuming
that the cluster had checksums enabled and that the base backup
wasn't run with NOVERIFY_CHECKSUMS. Now, you won't get those
complaints any more. That seems like an improvement to me, because
those files were presumably not created by PostgreSQL and so there
is no reason to think that they would be checksummed like a
PostgreSQL relation file. (If we want to complain about such files,
we should complain about them existing at all, not just about their
checksums.)
The point of this change is to make the code more consistent.
sendDir() was already calling parse_filename_for_nontemp_relation()
as part of an effort to determine which files to include in the
backup. So, it already had the information about whether a certain
file was a relation file. sendFile() then used a separate method,
embodied in is_checksummed_file(), to make what is essentially
the same determination. It's better not to make the same decision
using two different methods, especially in closely-related code.
---
src/backend/backup/basebackup.c | 172 ++++++++++----------------------
1 file changed, 55 insertions(+), 117 deletions(-)
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 480d67e02c..35dd79babc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -82,7 +82,8 @@ static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeo
backup_manifest_info *manifest, Oid spcoid);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
- Oid dboid, Oid spcoid,
+ Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ unsigned segno,
backup_manifest_info *manifest);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
@@ -104,7 +105,6 @@ static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf)
static void perform_base_backup(basebackup_options *opt, bbsink *sink);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
-static bool is_checksummed_file(const char *fullpath, const char *filename);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
const char *filename, bool partial_read_ok);
@@ -213,23 +213,6 @@ static const struct exclude_list_item excludeFiles[] =
{NULL, false}
};
-/*
- * List of files excluded from checksum validation.
- *
- * Note: this list should be kept in sync with what pg_checksums.c
- * includes.
- */
-static const struct exclude_list_item noChecksumFiles[] = {
- {"pg_control", false},
- {"pg_filenode.map", false},
- {"pg_internal.init", true},
- {"PG_VERSION", false},
-#ifdef EXEC_BACKEND
- {"config_exec_params", true},
-#endif
- {NULL, false}
-};
-
/*
* Actually do a base backup for the specified tablespaces.
*
@@ -356,7 +339,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m",
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
- false, InvalidOid, InvalidOid, &manifest);
+ false, InvalidOid, InvalidOid,
+ InvalidRelFileNumber, 0, &manifest);
}
else
{
@@ -625,7 +609,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
errmsg("could not stat file \"%s\": %m", pathbuf)));
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
- InvalidOid, InvalidOid, &manifest);
+ InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
+ &manifest);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -1166,7 +1151,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
struct stat statbuf;
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
- bool isDbDir = false; /* Does this directory contain relations? */
+ bool isRelationDir = false; /* Does directory contain relations? */
+ Oid dboid = InvalidOid;
/*
* Determine if the current path is a database directory that can contain
@@ -1193,17 +1179,23 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
strncmp(lastDir - (sizeof(TABLESPACE_VERSION_DIRECTORY) - 1),
TABLESPACE_VERSION_DIRECTORY,
sizeof(TABLESPACE_VERSION_DIRECTORY) - 1) == 0))
- isDbDir = true;
+ {
+ isRelationDir = true;
+ dboid = atooid(lastDir + 1);
+ }
}
+ else if (strcmp(path, "./global") == 0)
+ isRelationDir = true;
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
{
int excludeIdx;
bool excludeFound;
- RelFileNumber relNumber;
- ForkNumber relForkNum;
- unsigned segno;
+ RelFileNumber relfilenumber = InvalidRelFileNumber;
+ ForkNumber relForkNum = InvalidForkNumber;
+ unsigned segno = 0;
+ bool isRelationFile = false;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1251,37 +1243,40 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (excludeFound)
continue;
+ /*
+ * If there could be non-temporary relation files in this directory,
+ * try to parse the filename.
+ */
+ if (isRelationDir)
+ isRelationFile =
+ parse_filename_for_nontemp_relation(de->d_name,
+ &relfilenumber,
+ &relForkNum, &segno);
+
/* Exclude all forks for unlogged tables except the init fork */
- if (isDbDir &&
- parse_filename_for_nontemp_relation(de->d_name, &relNumber,
- &relForkNum, &segno))
+ if (isRelationFile && relForkNum != INIT_FORKNUM)
{
- /* Never exclude init forks */
- if (relForkNum != INIT_FORKNUM)
- {
- char initForkFile[MAXPGPATH];
+ char initForkFile[MAXPGPATH];
- /*
- * If any other type of fork, check if there is an init fork
- * with the same RelFileNumber. If so, the file can be
- * excluded.
- */
- snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
- path, relNumber);
+ /*
+ * If any other type of fork, check if there is an init fork with
+ * the same RelFileNumber. If so, the file can be excluded.
+ */
+ snprintf(initForkFile, sizeof(initForkFile), "%s/%u_init",
+ path, relfilenumber);
- if (lstat(initForkFile, &statbuf) == 0)
- {
- elog(DEBUG2,
- "unlogged relation file \"%s\" excluded from backup",
- de->d_name);
+ if (lstat(initForkFile, &statbuf) == 0)
+ {
+ elog(DEBUG2,
+ "unlogged relation file \"%s\" excluded from backup",
+ de->d_name);
- continue;
- }
+ continue;
}
}
/* Exclude temporary relations */
- if (isDbDir && looks_like_temp_rel_name(de->d_name))
+ if (OidIsValid(dboid) && looks_like_temp_rel_name(de->d_name))
{
elog(DEBUG2,
"temporary relation file \"%s\" excluded from backup",
@@ -1420,8 +1415,8 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!sizeonly)
sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
- true, isDbDir ? atooid(lastDir + 1) : InvalidOid, spcoid,
- manifest);
+ true, dboid, spcoid,
+ relfilenumber, segno, manifest);
if (sent || sizeonly)
{
@@ -1443,40 +1438,6 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
return size;
}
-/*
- * Check if a file should have its checksum validated.
- * We validate checksums on files in regular tablespaces
- * (including global and default) only, and in those there
- * are some files that are explicitly excluded.
- */
-static bool
-is_checksummed_file(const char *fullpath, const char *filename)
-{
- /* Check that the file is in a tablespace */
- if (strncmp(fullpath, "./global/", 9) == 0 ||
- strncmp(fullpath, "./base/", 7) == 0 ||
- strncmp(fullpath, "/", 1) == 0)
- {
- int excludeIdx;
-
- /* Compare file against noChecksumFiles skip list */
- for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
- {
- int cmplen = strlen(noChecksumFiles[excludeIdx].name);
-
- if (!noChecksumFiles[excludeIdx].match_prefix)
- cmplen++;
- if (strncmp(filename, noChecksumFiles[excludeIdx].name,
- cmplen) == 0)
- return false;
- }
-
- return true;
- }
- else
- return false;
-}
-
/*
* Given the member, write the TAR header & send the file.
*
@@ -1491,6 +1452,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, unsigned segno,
backup_manifest_info *manifest)
{
int fd;
@@ -1498,8 +1460,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
int checksum_failures = 0;
off_t cnt;
pgoff_t bytes_done = 0;
- int segmentno = 0;
- char *segmentpath;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
@@ -1525,36 +1485,14 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
*/
Assert((sink->bbs_buffer_length % BLCKSZ) == 0);
- if (!noverify_checksums && DataChecksumsEnabled())
- {
- char *filename;
-
- /*
- * Get the filename (excluding path). As last_dir_separator()
- * includes the last directory separator, we chop that off by
- * incrementing the pointer.
- */
- filename = last_dir_separator(readfilename) + 1;
-
- if (is_checksummed_file(readfilename, filename))
- {
- verify_checksum = true;
-
- /*
- * Cut off at the segment boundary (".") to get the segment number
- * in order to mix it into the checksum.
- */
- segmentpath = strstr(filename, ".");
- if (segmentpath != NULL)
- {
- segmentno = atoi(segmentpath + 1);
- if (segmentno == 0)
- ereport(ERROR,
- (errmsg("invalid segment number %d in file \"%s\"",
- segmentno, filename)));
- }
- }
- }
+ /*
+ * If we weren't told not to verify checksums, and if checksums are
+ * enabled for this cluster, and if this is a relation file, then verify
+ * the checksum.
+ */
+ if (!noverify_checksums && DataChecksumsEnabled() &&
+ RelFileNumberIsValid(relfilenumber))
+ verify_checksum = true;
/*
* Loop until we read the amount of data the caller told us to expect. The
@@ -1569,7 +1507,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
/* Try to read some more data. */
cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
remaining,
- blkno + segmentno * RELSEG_SIZE,
+ blkno + segno * RELSEG_SIZE,
verify_checksum,
&checksum_failures);
--
2.37.1 (Apple Git-137.1)
v9-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-int.patch (application/octet-stream)
From 9c839024a4b4e95c583874151a42ddf7cb50986e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v9 2/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..ce423a03d4 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index bf0227c668..ee6f74e5d5 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 7387a917a2..7b24c5d785 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v9-0004-Add-support-for-incremental-backup.patch (application/octet-stream)
From 32077e8302fb706369ae51f9a83f198278558b59 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v9 4/6] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 313 +++-
src/backend/backup/basebackup_incremental.c | 914 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1270 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 682 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
47 files changed, 5669 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 15471a6b38..cd46d8ff4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..acc1a6bada 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1129,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1163,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1183,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1192,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1234,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1386,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1461,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ pathbuf + basepathlen + 1);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(pathbuf + basepathlen + 1);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
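+
+ /*
+ * For example (illustrative OIDs), a relation segment read from
+ * base/16384/16385 is archived as base/16384/INCREMENTAL.16385 when
+ * sent incrementally.
+ */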
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1536,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
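+
+ /*
+ * Sketch of the resulting file layout, assuming blocks 3 and 7 are
+ * included: the magic number, a block count of 2, the truncation
+ * block length, and block numbers 3 and 7 (four bytes each), followed
+ * by the two BLCKSZ-byte block images sent by the loop below.
+ */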
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..db079dfc67
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,914 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.perfile_cb = manifest_process_file;
+ context.perwalrange_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
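+ *
+ * For example (illustrative OIDs), segment 3 of relation base/16384/16385
+ * yields base/16384/INCREMENTAL.16385.3, while segment 0 yields
+ * base/16384/INCREMENTAL.16385.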
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid for a shared relation, but spcoid and
+ * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
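+ /*
+ * For instance, with the default BLCKSZ of 8192, a 1000-block segment
+ * with 901 or more modified blocks is sent in full, since
+ * 901 * 8192 > (1000 * 8192) * 0.9.
+ */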
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so transpose absolute block
+ * numbers to relative block numbers.
+ */
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length), followed by the block numbers and then the block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
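+
+ /*
+ * For example, with the default BLCKSZ of 8192, an incremental file
+ * covering 10 blocks is 3 * 4 + 10 * (4 + 8192) = 81,972 bytes.
+ */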
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
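+ *
+ * Protocol sketch, as implemented below: the server replies with a
+ * CopyInResponse ('G'), the client streams the manifest as CopyData ('d')
+ * messages and terminates with CopyDone ('c'), and the parsed result is
+ * retained for use by a later BASE_BACKUP with the INCREMENTAL option.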
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere while receiving COPY data. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, the file will also be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ if (pg_checksum_init(&checksum_ctx, checksum_type) < 0)
+ pg_fatal("could not initialize checksum of file \"%s\"", "backup_label");
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..f2b45787e9
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
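+ /*
+ * At present the only special primitive we know how to use is Windows'
+ * CopyFile(); elsewhere, we always fall through to the block-by-block
+ * copy below.
+ */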
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ; /* arbitrary, but block-aligned */
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d0b8de7912
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
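+ /* Clamp to a minimum of 256 entries and a maximum of PG_UINT32_MAX. */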
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.perfile_cb = record_manifest_details_for_file;
+ context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+record_manifest_details_for_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+record_manifest_details_for_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..7bf56e57ae
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1270 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH 12
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /*
+ * Try to read each backup_label file in turn, last to first. Each label
+ * records where the backup it is based on started, so walking the chain
+ * backwards lets us verify that every backup descends from the one
+ * listed just before it.
+ */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
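+
+ /*
+ * Remember where this backup says its predecessor started; the next
+ * (older) backup will be checked against these values.
+ */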
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2: /* exists, contains only dot-prefixed files */
+ case 3: /* exists, contains a mount point */
+ case 4: /* exists, not empty */
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ StaticAssertStmt(strlen(INCREMENTAL_PREFIX) == INCREMENTAL_PREFIX_LENGTH,
+ "INCREMENTAL_PREFIX_LENGTH is incorrect");
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
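+
+ /*
+ * For example, when processing the subdirectory "base/1" of the main
+ * backup directory, manifest_prefix is "base/1/", so a file named "2608"
+ * in that directory is looked up in the manifest as "base/1/2608".
+ */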
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are dealt with elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format (e.g., if PG_VERSION contains "14\n", this function
+ * returns 140000).
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ closedir(dir);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..e7f0523fe9
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,682 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
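+
+/*
+ * As read back by make_incremental_rfile() below, an incremental file
+ * consists of a magic number, the count of blocks stored in the file, the
+ * truncation block length, an array of relative block numbers, and then
+ * the data for the stored blocks themselves.
+ */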
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
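+ *
+ * For example, given a chain consisting of a full backup and two
+ * incrementals, with the file being reconstructed coming from the newest
+ * incremental, n_prior_backups would be 2 and prior_backup_dirs would
+ * contain the full backup's path followed by the older incremental's path.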
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
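+
+ /*
+ * For instance, sourcemap[4] == s with offsetmap[4] == 3 * BLCKSZ means
+ * that block 4 of the output file comes from the file described by s,
+ * starting at byte offset 3 * BLCKSZ. A NULL entry in sourcemap denotes
+ * a block that will be zero-filled.
+ */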
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * taking any WAL-logged action on those blocks.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from
+ * it that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later
+ * incremental file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block, s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block, s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+	'found expected rows on node3');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
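To make the intended call sequence concrete, here is a minimal usage
sketch for the API above (not part of the patch; the file name, size,
mtime, and helper name are made up, and error handling is omitted):

#include "postgres_fe.h"

#include <time.h>

#include "common/checksum_helper.h"
#include "write_manifest.h"

/* Sketch: emit a manifest containing a single file and no WAL ranges. */
static void
write_trivial_manifest(char *outdir)
{
    manifest_writer *mwriter = create_manifest_writer(outdir);

    /* One entry with no checksum: checksum length 0 and a NULL payload. */
    add_file_to_manifest(mwriter, "PG_VERSION", 3, (pg_time_t) time(NULL),
                         CHECKSUM_TYPE_NONE, 0, NULL);

    /* A NULL WAL-range list produces an empty "WAL-Ranges" array. */
    finalize_manifest(mwriter, NULL);
}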
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..5eafe62fc6
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
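As a usage sketch for the API above (not part of the patch; the helper
name is mine, and sizing the block-number buffer at one entry per block
in the file is an assumption):

#include "postgres.h"

#include "backup/basebackup_incremental.h"

/* Sketch: how many bytes would we send for this relation file? */
static size_t
bytes_to_send(IncrementalBackupInfo *ib, char *path, Oid dboid, Oid spcoid,
              RelFileNumber relfilenumber, ForkNumber forknum,
              unsigned segno, size_t size)
{
    BlockNumber *relative_block_numbers;
    unsigned    num_blocks_required;
    unsigned    truncation_block_length;
    size_t      result = size;

    /* Assumed upper bound: one entry per block in the file. */
    relative_block_numbers = palloc(sizeof(BlockNumber) * (size / BLCKSZ));

    if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                            forknum, segno, size,
                            &num_blocks_required,
                            relative_block_numbers,
                            &truncation_block_length) ==
        BACK_UP_FILE_INCREMENTALLY)
        result = GetIncrementalFileSize(num_blocks_required);

    pfree(relative_block_numbers);
    return result;
}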
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..72b4ecaf12 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 659e58aeac..ea71b215ee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4013,3 +4013,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v9-0003-Add-a-new-WAL-summarizer-process.patch
From 189c2290b863776fae7251709ab27f893989d4bc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v9 3/6] Add a new WAL summarizer process.
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
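To restate the limit-block rules above in code form, here is a sketch
of how a consumer would classify one relation fork's entry (for
illustration only; this function does not appear in the patch):

#include "postgres.h"

#include "storage/block.h"

/* Sketch: interpret a WAL summary's limit block for one relation fork. */
static const char *
classify_limit_block(BlockNumber limit_block)
{
    if (limit_block == 0)
        return "created or destroyed within the summarized WAL range";
    if (limit_block != InvalidBlockNumber)
        return "truncated to this length within the summarized WAL range";
    return "neither created, destroyed, nor truncated";
}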
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1361 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3704 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc35a46e5e..15471a6b38 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a6..678495a64b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3574,6 +3575,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3853,8 +3891,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3897,6 +3935,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5221,9 +5279,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6940,6 +6998,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7614,6 +7691,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
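Putting the pieces of walsummary.c together, a consumer such as the
incremental backup code might verify coverage of a WAL range roughly as
in the sketch below (the function and error wording are mine, not the
patch's):

#include "postgres.h"

#include "access/xlogdefs.h"
#include "backup/walsummary.h"

/* Sketch: fail unless summaries fully cover start_lsn..end_lsn on tli. */
static void
check_summary_coverage(TimeLineID tli, XLogRecPtr start_lsn,
                       XLogRecPtr end_lsn)
{
    List       *wslist;
    XLogRecPtr  missing_lsn;

    /* Fetch only summaries on this timeline that overlap the range. */
    wslist = GetWalSummaries(tli, start_lsn, end_lsn);

    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        ereport(ERROR,
                errmsg("WAL summaries are incomplete starting at %X/%X",
                       LSN_FORMAT_ARGS(missing_lsn)));
}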
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..2e77d38b4a
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (true)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..fe09207ddc
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1361 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and
+ * LSN at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
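+
+/*
+ * Illustrative sketch only, not part of the patch (compare the real
+ * GetOldestUnsummarizedLSN()): reading the progress fields described
+ * above requires holding WALSummarizerLock.
+ */
+static XLogRecPtr
+peek_summarized_position(TimeLineID *tli, bool *exact)
+{
+ XLogRecPtr result = InvalidXLogRecPtr;
+
+ LWLockAcquire(WALSummarizerLock, LW_SHARED);
+ if (WalSummarizerCtl->initialized)
+ {
+ result = WalSummarizerCtl->summarized_lsn;
+ *tli = WalSummarizerCtl->summarized_tli;
+ *exact = WalSummarizerCtl->lsn_is_exact;
+ }
+ LWLockRelease(WALSummarizerLock);
+ return result;
+}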
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
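+
+/*
+ * Illustrative sketch only, not part of the patch: the wait derived from
+ * sleep_quanta is always sleep_quanta * MS_PER_SLEEP_QUANTUM milliseconds,
+ * i.e. between 200ms and the 30s bound discussed above. See
+ * summarizer_wait_for_wal() for the policy that adjusts sleep_quanta.
+ */
+static inline long
+summarizer_sleep_ms(void)
+{
+ Assert(sleep_quanta >= 1 && sleep_quanta <= MAX_SLEEP_QUANTA);
+ return sleep_quanta * MS_PER_SLEEP_QUANTUM;
+}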
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summarize_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (true)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
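+ *
+ * Hypothetical usage sketch (illustrative only; 'target_lsn' stands for
+ * whatever LSN the caller needs summarized):
+ *
+ *     XLogRecPtr reached = WaitForWalSummarization(target_lsn, 60000);
+ *
+ *     if (reached < target_lsn)
+ *         ereport(ERROR,
+ *                 errmsg("timed out waiting for WAL summarization"));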
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
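+
+ /*
+ * For illustration (hypothetical LSNs): if one record ends exactly at
+ * the page boundary 0/02000000 and another begins on that page, the
+ * xlogreader reports the first record's end LSN as 0/02000000, even
+ * though the next record really begins only after the page header --
+ * e.g. at 0/02000018 if that header happens to be 24 bytes. Both forms
+ * of start LSN must therefore be acceptable here.
+ */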
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
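+
+ /*
+ * For example (hypothetical values): TLI 1 with start LSN 0/01000028
+ * and end LSN 0/0100F528 yields the file name
+ * "000000010000000001000028000000000100F528.summary".
+ */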
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file; the xlogreader was already freed above. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (true)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended and allow reads up to
+ * exactly that point.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+ Assert(switchpoint >= private_data->read_upto);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
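+ *
+ * Illustrative dynamics, assuming the sleep starts at one quantum: with no
+ * new WAL arriving, successive sleeps last 1, 2, 4, 8, ... quanta until
+ * MAX_SLEEP_QUANTA is reached; when N > 1 pages are read between sleeps,
+ * the next sleep shrinks by N - 1 quanta, down to a minimum of one quantum.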
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files whose summarized WAL is gone, once they are
+ * older than the configured retention time.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summarize_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
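+ *
+ * For example, with the default wal_summarize_keep_time of 10 days
+ * (14400 minutes), the cutoff falls 864,000 seconds before now.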
+ */
+ cutoff_time = time(NULL) - 60 * wal_summarize_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL summarized by this file no longer exists on disk, we can
+ * remove the summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b764ef6998..c9dde2d5e5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summarize_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summarize_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..01c0428990 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summarize_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..ac6860a9ae
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
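+ *
+ * For example, if a fork is truncated to 100 blocks, its limit block
+ * becomes 100 and any recorded modifications to blocks 100 and above are
+ * forgotten. If block 120 is later marked modified, the fork must have
+ * been re-extended past that point in the meantime.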
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
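+
+/*
+ * Worked example: block number 200000 falls in chunk 3 (200000 / 65536),
+ * at offset 3392 (200000 % 65536) within that chunk. In the array
+ * representation we would store the uint16 value 3392; in the bitmap
+ * representation we would set bit 3392 of the chunk's bitmap.
+ */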
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
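+
+/*
+ * The resulting file layout, as produced by WriteBlockRefTable() below, is:
+ * a uint32 magic number; then, for each relation fork in sorted order, one
+ * BlockRefTableSerializedEntry followed by its chunk usage array and the
+ * used portion of each chunk; then an all-zeroes sentinel entry; and
+ * finally a CRC-32C of everything preceding it.
+ */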
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key;
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key;
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1 && (stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
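+
+/*
+ * Hypothetical usage sketch (illustrative only): to scan all the modified
+ * blocks of an entry in batches of 256, a caller might loop like this,
+ * relying on block numbers being returned in ascending order:
+ *
+ *     BlockNumber blocks[256];
+ *     BlockNumber start = 0;
+ *     int n;
+ *
+ *     while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ *                                             InvalidBlockNumber,
+ *                                             blocks, 256)) > 0)
+ *     {
+ *         ... process blocks[0] through blocks[n - 1] ...
+ *         start = blocks[n - 1] + 1;
+ *     }
+ */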
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key;
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index bd0b8873d3..ce74612703 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12087,4 +12087,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..15db2377dd
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summarize_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 92c0003ab1..659e58aeac 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4002,3 +4002,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
Attachment: v9-0006-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From 7270b8d13f432919cbee984b031b31db4d9ea48e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 6 Nov 2023 13:53:19 -0500
Subject: [PATCH v9 6/6] Test patch: Enable summarize_wal by default.
To avoid test failures, must remove the prohibition against running
with summarize_wal=on and wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index fe09207ddc..505ccbf1b8 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summarize_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c9dde2d5e5..20f415c6d2 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
On Tue, Nov 14, 2023 at 12:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Nov 10, 2023 at 6:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
- I think 0001 looks like a good improvement irrespective of the patch series.
OK, perhaps that can be independently committed, then, if nobody objects.
Thanks for the review; I've fixed a bunch of things that you
mentioned. I'll just comment on the ones I haven't yet done anything
about below.

2. + <varlistentry id="guc-wal-summarize-keep-time" xreflabel="wal_summarize_keep_time">
+ <term><varname>wal_summarize_keep_time</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>wal_summarize_keep_time</varname> configuration parameter</primary>
+ </indexterm>
I feel the name of the GUC should be either wal_summarizer_keep_time
or wal_summaries_keep_time, I mean either we should refer to the
summarizer process or to the WAL summaries files.
How about wal_summary_keep_time?
Yes, that looks perfect to me.
6. + * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
I did not see the usage of this function, but I think if the whole
range is not covered, why not keep the behavior uniform w.r.t. what we
set for '*missing_lsn'? I mean, suppose there is no file; then
missing_lsn is the start_lsn, because the very first LSN is missing.
It's used later in the patch series. I think the way that I have it
makes for a more understandable error message.
Okay
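For what it's worth, a rough sketch of how a caller can turn the two
cases into distinct reports (hypothetical wording, with tli, start_lsn,
and end_lsn assumed to be in scope; the real messages appear later in
the series):

XLogRecPtr missing_lsn;

if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
{
    if (XLogRecPtrIsInvalid(missing_lsn))
        ereport(ERROR,
                (errmsg("no WAL summaries exist on timeline %u between %X/%X and %X/%X",
                        tli,
                        LSN_FORMAT_ARGS(start_lsn),
                        LSN_FORMAT_ARGS(end_lsn))));
    else
        ereport(ERROR,
                (errmsg("WAL summaries on timeline %u between %X/%X and %X/%X are incomplete",
                        tli,
                        LSN_FORMAT_ARGS(start_lsn),
                        LSN_FORMAT_ARGS(end_lsn)),
                 errdetail("The first unsummarized LSN in this range is %X/%X.",
                           LSN_FORMAT_ARGS(missing_lsn))));
}

The InvalidXLogRecPtr convention lets the message distinguish "no
summaries at all" from "a gap starting at a specific LSN".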
8. +/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
I'm not sure what needs fixing here.
I think I copy-pasted it by mistake, just ignore it.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 14, 2023 at 2:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Great stuff you got here. I'm doing a first pass trying to grok the
whole thing for more substantive comments, but in the meantime here are
some cosmetic ones.Thanks, thanks, and thanks.
I've fixed some things that you mentioned in the attached version.
Other comments below.
Here are some more comments based on what I have read so far, mostly
cosmetic comments.
1.
+ * summary file yet, then stoppng doesn't make any sense, and we
+ * should wait until the next stop point instead.
Typo /stoppng/stopping
2.
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
We have already freed the xlogreader so the second part of the comment
is not valid.
3. + /*
+ * If a relation fork is truncated on disk, there is in point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
Typo. /there is in point/ there is no point
4.
+/*
+ * Special handling for WAL recods with RM_XACT_ID.
+ */
/recods/records
5.
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
For SmgrCreate and Truncate I understand setting the 'limit block' but
why for commit/abort? I think it would be better to add some comments
here.
6.
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
The comments say "private_data->read_upto to the lowest LSN that is
not known to be safe" but is it really the lowest LSN? I think it is
the highest LSN that is known to be safe for that TLI, no?
7.
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
I was just wondering whether removing old summaries should be the job
of the WAL summarizer or the job of the checkpointer; I mean, while
removing the old WAL files it could also check and remove the old
summaries? Anyway, it's just a question and I do not have a strong
opinion on this.
8.
+ /*
+ * Whether we we removed the file or not, we need not consider it
+ * again.
+ */
Typo /Whether we we removed/ Whether we removed
9.
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
If this function is already returning 'BlockRefTableEntry' then why
would it need to set an extra '*limit_block' out parameter which it is
actually reading from the entry itself?
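For context, a call site would look roughly like this (a sketch based
only on the declarations in blkreftable.h; the variable names are made
up):

BlockNumber limit_block;
BlockNumber blocks[256];
BlockRefTableEntry *entry;
int nresults = 0;

/* Look up the fork; limit_block is only set when an entry exists. */
entry = BlockRefTableGetEntry(brtab, &rlocator, MAIN_FORKNUM, &limit_block);
if (entry != NULL)
    nresults = BlockRefTableEntryGetBlocks(entry, 0, stop_blkno,
                                           blocks, lengthof(blocks));

One possible reason for the out parameter is that BlockRefTableEntry is
opaque outside blkreftable.c, so callers would otherwise need an
accessor function just to fetch the limit block.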
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
0001 looks OK to push, and since it stands on its own I would get it out
of the way soon rather than waiting for the rest of the series to be
further reviewed.
0002:
This moves bin/pg_verifybackup/parse_manifest.c to
common/parse_manifest.c, where it's not clear that it's for backup
manifests (wasn't a problem in the previous location). I wonder if
we're going to have anything else called "manifest", in which case I
propose to rename the file to make it clear that this is about backup
manifests -- maybe parse_bkp_manifest.c.
This patch looks pretty uncontroversial, but there's no point in going
further with this one until followup patches are closer to commit.
0003:
AmWalSummarizerProcess() is unused. Remove?
MaybeStartWalSummarizer() is called on each ServerLoop() pass in postmaster.c?
This causes a function call to be emitted every time through. That
looks odd. All other process starts have some triggering condition.
GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
and SummarizeWAL use while(1). Maybe settle on one style?
Still reading this one.
0004:
in PrepareForIncrementalBackup(), the logic that determines
earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I
think it works fine if there's a single timeline, but not otherwise.
Or maybe the trick is that it relies on timelines returned by
readTimeLineHistory being sorted backwards? If so, maybe add a comment
about that somewhere; I don't think other callers of readTimeLineHistory
make that assumption.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
Hi Robert,
[..spotted the v9 patchset..]
so I've spent some time playing with patchset v8 (without the 6/6
testing patch related to wal_level=minimal); the exceptions that were
checked against patchset v9 are marked as such.
1. On compile time there were 2 warnings to shadowing variable (at
least with gcc version 10.2.1), but on v9 that is fixed:
blkreftable.c: In function ‘WriteBlockRefTable’:
blkreftable.c:520:24: warning: declaration of ‘brtentry’ shadows a
previous local [-Wshadow=compatible-local]
walsummarizer.c: In function ‘SummarizeWAL’:
walsummarizer.c:833:36: warning: declaration of ‘private_data’ shadows
a previous local [-Wshadow=compatible-local]
2. Usability thing: I hit the timeout hard: "This backup requires WAL
to be summarized up to 0/90000D8, but summarizer has only reached
0/0." with summarize_wal=off (default) but apparently this in TODO.
Looks like an important usability thing.
3. I've verified that if the DB was in wal_level=minimal even
temporarily (and thus summarization was disabled) it is impossible to
take an incremental backup:
pg_basebackup: error: could not initiate base backup: ERROR: WAL
summaries are required on timeline 1 from 0/70000D8 to 0/10000028, but
the summaries for that timeline and LSN range are incomplete
DETAIL: The first unsummarized LSN is this range is 0/D04AE88.
4. As we have discussed off list, there is (was) this
pg_combinebackup bug in v8's reconstruct_from_incremental_file() where
it was unable to realize that - in case of combining multiple
incremental backups - it should stop looking for the previous instance
of the full file (while it was fine with v6 of the patchset). I've
checked it on v9 - it is good now.
5. On v8 I've finally played a little bit with standby(s) and this
patchset with a couple of basic scenarios while mixing the source of
the backups:
a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes i'm getting spurious error like those when doing
incrementals on standby with -c fast :
2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point
at 0/A000028
2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due
to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL
from final timeline 1 ending at 0/A0000F8, but this backup starts at
0/A000028
2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP (
INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS,
CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client')
# when you retry the same pg_basebackup it goes fine (looks like
CHECKPOINT on standby/restartpoint <-> summarizer disconnect, I'll dig
deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
against primary just before pg_basebackup --incr on standby
works around it)
b. full on primary, incr1 on standby, full db restore (incl. incr1) on
standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
* - needs pg_promote() due to the control file having the standby bit
set, plus potential fiddling with postgresql.auto.conf as it contains
the primary_conninfo GUC.
6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having
more recent pg_basebackup (e.g. from pg18 one day) running against
pg17 with respect to this incremental backup capability. Is it going
to be safe? (Currently there seem to be no safeguards against such
use.) Or should those things (core, pg_basebackup) be required to run
in version lockstep?
Regards,
-J.
On 2023-Nov-13, Robert Haas wrote:
On Mon, Nov 13, 2023 at 11:25 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Also, it would be good to provide and use a
function to initialize a BlockRefTableKey from the RelFileNode and
forknum components, and ensure that any padding bytes are zeroed.
Otherwise it's not going to be a great hash key. On my platform there
aren't any (padding bytes), but I think it's unwise to rely on that.I'm having trouble understanding the second part of this suggestion.
Note that in a frontend context, SH_RAW_ALLOCATOR is pg_malloc0, and
in a backend context, we get the default, which is
MemoryContextAllocZero. Maybe there's some case this doesn't cover,
though?
I meant code like this
memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);
where any padding bytes in "key" could have arbitrary values, because
they're not initialized. So I'd have a (maybe static inline) function
BlockRefTableKeyInit(&key, rlocator, forknum)
that fills it in for you.
Note:
#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
AFAICT the new simplehash uses in this patch series are the only ones
that use memcmp() as SH_EQUAL, so we don't necessarily have precedent on
lack of padding bytes initialization in existing uses of simplehash.
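For illustration, the suggested helper might look about like this (a
hypothetical sketch, not part of the posted patch; it zeroes the whole
struct first so the memcmp()-based SH_EQUAL only ever sees defined
padding bytes):

static inline void
BlockRefTableKeyInit(BlockRefTableKey *key,
                     const RelFileLocator *rlocator,
                     ForkNumber forknum)
{
    /* Zero everything, including padding, before filling in fields. */
    memset(key, 0, sizeof(BlockRefTableKey));
    memcpy(&key->rlocator, rlocator, sizeof(RelFileLocator));
    key->forknum = forknum;
}

Call sites would then shrink to BlockRefTableKeyInit(&key, rlocator,
forknum) followed by the lookup.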
These forward struct declarations are not buying you anything, I'd
remove them:I've had problems from time to time when I don't do this. I'll remove
it here, but I'm not convinced that it's always useless.
Well, certainly there are places where they are necessary.
I don't much like the way the header files in src/bin/pg_combinebackup
are structured. Particularly, causing a simplehash to be "instantiated"
just because load_manifest.h is included seems poised to cause pain. I
think there should be a file with the basic struct declarations (no
simplehash); and then maybe since both pg_basebackup and
pg_combinebackup seem to need the same simplehash, create a separate
header file containing just that. But, did you notice that anything
that includes reconstruct.h will instantiate the simplehash stuff,
because it includes load_manifest.h? It may be unwise to have the
simplehash in a header file. Maybe just declare it in each .c file that
needs it. The duplication is not that large.
Oh, I hadn't grokked that we had this SH_SCOPE thing and a separate
SH_DECLARE for it being extern. OK, please ignore that.
Why leave unnamed arguments in function declarations?
I mean, I've changed it now, but I don't think it's worth getting too
excited about.
Well, we did get into consistency arguments on this point previously. I
agree it's not *terribly* important, but on thread
/messages/by-id/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
people got really worked up about this stuff.
In GetFileBackupMethod(), which arguments are in and which are out?
The comment doesn't say, and it's not obvious why we pass both the file
path as well as the individual constituent pieces for it.The header comment does document which values are potentially set on
return. I guess I thought it was clear enough that the stuff not
documented to be output parameters was input parameters. Most of them
aren't even pointers, so they have to be input parameters. The only
exception is 'path', which I have some difficulty thinking that anyone
is going to imagine to be an input pointer.
An output pointer, you mean :-) (Should it be const?)
When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
to these output values; we modify some, but why? Maybe they should be
left alone? In that case, the "if size == 0" test should move a couple
of lines up, in the brtentry == NULL block.
BTW, you could do the qsort() after deciding to backup the file fully if
more than 90% needs to be replaced.
BTW, in sendDir() why do
lookup_path = pstrdup(pathbuf + basepathlen + 1);
when you could do
lookup_path = pstrdup(tarfilename);
?
There are two functions named record_manifest_details_for_file() in
different programs.I had trouble figuring out how to name this stuff. I did notice the
awkwardness, but surely nobody can think that two functions with the
same name in different binaries can be actually the same function.
Of course not, but when cscope-jumping around, it is weird.
If we want to inject more underscores here, my vote is to go all the
way and make it per_wal_range_cb.
+1
In walsummarizer.c, HandleWalSummarizerInterrupts is called in
summarizer_read_local_xlog_page but SummarizeWAL() doesn't do that.
Maybe it should?I replaced all the CHECK_FOR_INTERRUPTS() in that file with
HandleWalSummarizerInterrupts(). Does that seem right?
Looks to be what walwriter.c does at least, so I guess it's OK.
I think this path is not going to be very human-friendly.
snprintf(final_path, MAXPGPATH,
XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
tli,
LSN_FORMAT_ARGS(summary_start_lsn),
LSN_FORMAT_ARGS(summary_end_lsn));
Why not add a dash between the TLI and between both LSNs, or something
like that?
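For illustration, a dash-separated variant could look like this
(hypothetical formatting, just to make the idea concrete):

snprintf(final_path, MAXPGPATH,
         XLOGDIR "/summaries/%08X-%08X%08X-%08X%08X.summary",
         tli,
         LSN_FORMAT_ARGS(summary_start_lsn),
         LSN_FORMAT_ARGS(summary_end_lsn));

which would yield names like
00000001-0000000001000028-000000000A0000F8.summary.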
But I have a hard time arguing that it wouldn't be more readable still
if we put some separator characters in there. I didn't do that because
then they'd look less like WAL file names, but maybe that's not really
a problem. A possible reason not to bother is that these files are
less necessary for humans to care about than WAL files, since they
don't need to be archived or transported between nodes in any way.
Basically I think this is probably fine the way it is, but if you or
others think it's really important to change it, I can do that. Just
as long as we don't spend 50 emails arguing about which separator
character to use.
Yeah, I just think that an endless stream of hex chars is hard to read,
and I've found myself following digits on the screen with my finger in
order to parse file names. I guess you could say thousands separators
for regular numbers aren't needed either, but we do have them for
readability sake.
I think a new section in chapter 30 "Reliability and the Write-Ahead
Log" is warranted. It would explain the summarization process, what the
summary files are used for, and the deletion mechanism. I can draft
something if you want.
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto. No es cierto, y si fuera cierto,
no me acuerdo." (Augusto Pinochet a una corte de justicia)
On Tue, Nov 14, 2023 at 8:12 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
0001 looks OK to push, and since it stands on its own I would get it out
of the way soon rather than waiting for the rest of the series to be
further reviewed.
All right, done.
0003:
AmWalSummarizerProcess() is unused. Remove?
The intent seems to be to have one of these per enum value, whether it
gets used or not. Some of the others aren't used, either.
MaybeStartWalSummarizer() is called on each ServerLoop() pass in postmaster.c?
This causes a function call to be emitted every time through. That
looks odd. All other process starts have some triggering condition.
I'm not sure how much this matters, really. I would expect that the
function call overhead here wouldn't be very noticeable. Generally I
think that when ServerLoop returns from WaitEventSetWait it's going to
be because we need to fork a process. That's pretty expensive compared
to a function call. If we can iterate through this loop lots of times
without doing any real work then it might matter, but I feel like
that's probably not the case, and probably something we would want to
fix if it were the case.
Now, I could nevertheless hoist some of the triggering conditions out of
MaybeStartWalSummarizer() to the call sites, but moving, say, just the
summarize_wal condition wouldn't be enough to avoid having MaybeStartWalSummarizer()
called repeatedly when there was no work to do, because summarize_wal
could be true and the summarizer could already be running. Similarly, if I
move just the WalSummarizerPID == 0 condition, the function gets
called repeatedly without doing anything when summarize_wal = off. So
at a minimum you have to move both of those if you care about avoiding
the function call overhead, and then you have to wonder if you care
about the corner cases where the function would be called repeatedly
for no gain even then.
Another approach would be to make the function static inline rather
than just static. Or we could delete the whole function and just
duplicate the logic it contains at both call sites. Personally I'm
inclined to just leave it how it is in the absence of some evidence
that there's a real problem here. It's nice to have all the triggering
conditions in one place with nothing duplicated.
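For concreteness, the gating logic under discussion amounts to
something like this sketch (the helper name for actually forking the
process is an assumption on my part):

static inline void
MaybeStartWalSummarizer(void)
{
	/* Both conditions must hold before forking the summarizer. */
	if (summarize_wal && WalSummarizerPID == 0)
		WalSummarizerPID = StartWalSummarizer();
}

Hoisting either condition alone to the call sites would still leave the
other to be re-tested on every pass through the loop, which is the
corner case described above.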
GetOldestUnsummarizedLSN uses while(true), but WaitForWalSummarization
and SummarizeWAL use while(1). Maybe settle on one style?
OK.
0004:
in PrepareForIncrementalBackup(), the logic that determines
earliest_wal_range_tli and latest_wal_range_tli looks pretty weird. I
think it works fine if there's a single timeline, but not otherwise.
Or maybe the trick is that it relies on timelines returned by
readTimeLineHistory being sorted backwards? If so, maybe add a comment
about that somewhere; I don't think other callers of readTimeLineHistory
make that assumption.
It does indeed rely on that assumption, and the comment at the top of
the for (i = 0; i < num_wal_ranges; ++i) loop explains that. Note also
the comment just below that begins "If we found this TLI in the
server's history". I agree with you that this logic looks strange, and
it's possible that there's some better way to encode the idea than
what I've done here, but I think it might be just that the particular
calculation we're trying to do here is strange. It's almost easier to
understand the logic if you start by reading the sanity checks
("manifest requires WAL from initial timeline %u starting at %X/%X,
but that timeline begins at %X/%X" et al.), look at the triggering
conditions for those, and then work upward to see how
earliest/latest_wal_range_tli get set, and then look up from there to
see how saw_earliest/latest_wal_range_tli are used in computing those
values.
We do rely on the ordering assumption elsewhere. For example, in
XLogFileReadAnyTLI, see if (tli < curFileTLI) break. We also use it to
set expectedTLEs, which is documented to have this property. And
AddWALInfoToBackupManifest relies on it too; see the comment "Because
the timeline history file lists newer timelines before older ones" in
that function. We're not entirely consistent about this,
e.g., unlike XLogFileReadAnyTLI, tliInHistory() and
tliOfPointInHistory() don't have an early exit provision, but we do
use it in some places.
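As an illustration, the newest-first ordering is what makes an early
exit possible when scanning the history; a sketch, where target_tli is
a hypothetical variable:

TimeLineHistoryEntry *match = NULL;
ListCell   *lc;

/* readTimeLineHistory() returns entries sorted newest-first. */
foreach(lc, timelineHistory)
{
	TimeLineHistoryEntry *tle = lfirst(lc);

	if (tle->tli < target_tli)
		break;				/* sorted descending, no match can follow */
	if (tle->tli == target_tli)
	{
		match = tle;
		break;
	}
}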
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I meant code like this
memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);
Ah, I hadn't thought about that. Another way of handling that might be
to add = {0} to the declaration of key. But I can do the initializer
thing too if you think it's better. I'm not sure if there's an
argument that the initializer might optimize better.
An output pointer, you mean :-) (Should it be const?)
I'm bad at const, but that seems to work, so sure.
When the return value is BACK_UP_FILE_FULLY, it's not clear what happens
to these output values; we modify some, but why? Maybe they should be
left alone? In that case, the "if size == 0" test should move a couple
of lines up, in the brtentry == NULL block.
OK.
BTW, you could do the qsort() after deciding to backup the file fully if
more than 90% needs to be replaced.
OK.
BTW, in sendDir() why do
lookup_path = pstrdup(pathbuf + basepathlen + 1);
when you could do
lookup_path = pstrdup(tarfilename);
?
No reason, changed.
If we want to inject more underscores here, my vote is to go all the
way and make it per_wal_range_cb.
+1
Will look into this.
Yeah, I just think that an endless stream of hex chars is hard to read,
and I've found myself following digits on the screen with my fingers in
order to parse file names. I guess you could say thousands separators
for regular numbers aren't needed either, but we do have them for
readability's sake.
Sigh.
I think a new section in chapter 30 "Reliability and the Write-Ahead
Log" is warranted. It would explain the summarization process, what the
summary files are used for, and the deletion mechanism. I can draft
something if you want.
Sure, if you want to take a crack at it, that's great.
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-Oct-04, Robert Haas wrote:
- I would like some feedback on the generation of WAL summary files.
Right now, I have it enabled by default, and summaries are kept for a
week. That means that, with no additional setup, you can take an
incremental backup as long as the reference backup was taken in the
last week. File removal is governed by mtimes, so if you change the
mtimes of your summary files or whack your system clock around, weird
things might happen. But obviously this might be inconvenient. Some
people might not want WAL summary files to be generated at all because
they don't care about incremental backup, and other people might want
them retained for longer, and still other people might want them to be
not removed automatically or removed automatically based on some
criteria other than mtime. I don't really know what's best here. I
don't think the default policy that the patches implement is
especially terrible, but it's just something that I made up and I
don't have any real confidence that it's wonderful. One point to be
considered here is that, if WAL summarization is enabled, checkpoints
can't remove WAL that isn't summarized yet. Mostly that's not a
problem, I think, because the WAL summarizer is pretty fast. But it
could increase disk consumption for some people. I don't think that we
need to worry about the summaries themselves being a problem in terms
of space consumption; at least in all the cases I've tested, they're
just not very big.
So, summarize_wal is no longer turned on by default, I think following a
comment from Peter E. I think this is a good decision, as we're only
going to need them on servers from which incremental backups are going
to be taken, which is a strict subset of all servers; and furthermore,
people that need them are going to realize that very easily, whereas if we
went the other way around, most people would not realize that they need to
turn them off to save some resource consumption.
Granted, the amount of resources additionally used is probably not very
big. But since it can be changed with a reload, not a restart, it doesn't
seem problematic.
... oh, I just noticed that this patch now fails to compile because of
the MemoryContextResetAndDeleteChildren removal.
(Typo in the pg_walsummary manpage: "since WAL summary files primary
exist" -> "primarily")
- On a related note, I haven't yet tested this on a standby, which is
a thing that I definitely need to do. I don't know of a reason why it
shouldn't be possible for all of this machinery to work on a standby
just as it does on a primary, but then we need the WAL summarizer to
run there too, which could end up being a waste if nobody ever tries
to take an incremental backup. I wonder how that should be reflected
in the configuration. We could do something like what we've done for
archive_mode, where on means "only on if this is a primary" and you
have to say always if you want it to run on standbys as well ... but
I'm not sure if that's a design pattern that we really want to
replicate into more places. I'd be somewhat inclined to just make
whatever configuration parameters we need to configure this thing on
the primary also work on standbys, and you can set each server up as
you please. But I'm open to other suggestions.
I think it should default to off in primary and standby, and the user
has to enable it in whichever server they want to take backups from.
- We need to settle the question of whether to send the whole backup
manifest to the server or just the LSN. In a previous attempt at
incremental backup, we decided the whole manifest was necessary,
because flat-copying files could make new data show up with old LSNs.
But that version of the patch set was trying to find modified blocks
by checking their LSNs individually, not by summarizing WAL. And since
the operations that flat-copy files are WAL-logged, the WAL summary
approach seems to eliminate that problem - maybe an LSN (and the
associated TLI) is good enough now. This also relates to Jakub's
question about whether this machinery could be used to fast-forward a
standby, which is not exactly a base backup but ... perhaps close
enough? I'm somewhat inclined to believe that we can simplify to an
LSN and TLI; however, if we do that, then we'll have big problems if
later we realize that we want the manifest for something after all. So
if anybody thinks that there's a reason to keep doing what the patch
does today -- namely, upload the whole manifest to the server --
please speak up.
I don't understand this point. Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup. You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?
- It's regrettable that we don't have incremental JSON parsing;
We now do have it, at least in patch form. I guess the question is
whether we're going to accept it in core. I see chances of changing the
format of the manifest rather slim at this point, and the need for very
large manifests is likely to go up with time, so we probably need to
take that code and polish it up, and see if we can improve its
performance.
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
I'd rather have a way for the server to provide diagnostics on why the
summaries aren't being produced. Maybe a server running under valgrind
is going to fail and need a longer one, but otherwise a hardcoded
timeout seems sufficient.
You did say later that you thought summary files would just go from one
checkpoint to the next. So the only question is at what point the file
for the last checkpoint (i.e. from the previous one up to the one
requested by pg_basebackup) is written. If walsummarizer keeps almost
the complete state in memory and just waits for the checkpoint record to
write it, then it's probably okay.
- I'm curious what people think about the pg_walsummary tool that is
included in 0006. I think it's going to be fairly important for
debugging, but it does feel a little bit bad to add a new binary for
something pretty niche. Nevertheless, merging it into any other
utility seems relatively awkward, so I'm inclined to think both that
this should be included in whatever finally gets committed and that it
should be a separate binary. I considered whether it should go in
contrib, but we seem to have moved to a policy that heavily favors
limiting contrib to extensions and loadable modules, rather than
binaries.
I propose to keep the door open for that binary doing other things than
dumping the files as text. So add a command argument, which currently
can only be "dump", to allow the command to do other things later if
needed. (For example, remove files from a server on which summarize_wal
has been turned off; or perhaps remove files that are below some LSN.)
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Estoy de acuerdo contigo en que la verdad absoluta no existe...
El problema es que la mentira sí existe y tu estás mintiendo" (G. Lama)
On 2023-Nov-16, Robert Haas wrote:
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I meant code like this
memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);
Ah, I hadn't thought about that. Another way of handling that might be
to add = {0} to the declaration of key. But I can do the initializer
thing too if you think it's better. I'm not sure if there's an
argument that the initializer might optimize better.
I think the {0} initializer is good enough, given a comment to indicate
why.
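That is, the agreed-upon form would be something along these lines:

/*
 * Zero the whole key, including any alignment padding, so that hashing
 * the struct as raw bytes gives stable results.
 */
BlockRefTableKey key = {0};

memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
key.forknum = forknum;
entry = blockreftable_lookup(brtab->hash, key);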
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.
True.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 2023-Nov-16, Alvaro Herrera wrote:
On 2023-Oct-04, Robert Haas wrote:
- Right now, I have a hard-coded 60 second timeout for WAL
summarization. If you try to take an incremental backup and the WAL
summaries you need don't show up within 60 seconds, the backup times
out. I think that's a reasonable default, but should it be
configurable? If yes, should that be a GUC or, perhaps better, a
pg_basebackup option?
I'd rather have a way for the server to provide diagnostics on why the
summaries aren't being produced. Maybe a server running under valgrind
is going to fail and need a longer one, but otherwise a hardcoded
timeout seems sufficient.
You did say later that you thought summary files would just go from one
checkpoint to the next. So the only question is at what point the file
for the last checkpoint (i.e. from the previous one up to the one
requested by pg_basebackup) is written. If walsummarizer keeps almost
the complete state in memory and just waits for the checkpoint record to
write it, then it's probably okay.
On 2023-Nov-16, Alvaro Herrera wrote:
On 2023-Nov-16, Robert Haas wrote:
On Thu, Nov 16, 2023 at 5:21 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
It's not clear to me if WalSummarizerCtl->pending_lsn is fulfilling some
purpose or it's just a leftover from prior development. I see it's only
read in an assertion ... Maybe if we think this cross-check is
important, it should be turned into an elog? Otherwise, I'd remove it.
I've been thinking about that. One thing I'm not quite sure about
though is introspection. Maybe there should be a function that shows
summarized_tli and summarized_lsn from WalSummarizerData, and maybe it
should expose pending_lsn too.
True.
Putting those two thoughts together, I think pg_basebackup with
--progress could tell you "still waiting for the summary file up to LSN
%X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
or something like that. This would probably require two concurrent
connections, one to run BASE_BACKUP and another to inquire server state;
but this should be easy enough to integrate with parallel
basebackup later.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On Thu, Nov 16, 2023 at 12:34 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Putting those two thoughts together, I think pg_basebackup with
--progress could tell you "still waiting for the summary file up to LSN
%X/%X to appear, and the walsummarizer is currently handling lsn %X/%X"
or something like that. This would probably require two concurrent
connections, one to run BASE_BACKUP and another to inquire server state;
but this should be easy enough to integrate with parallel
basebackup later.
I had similar thoughts, except I was thinking it would be better to
have the warnings be generated on the server side. That would save the
need for a second libpq connection, which would be good, because I
think adding that would result in a pretty large increase in
complexity and some not-so-great user-visible consequences. In fact,
my latest thought is to just remove the timeout altogether, and emit
warnings like this:
WARNING: still waiting for WAL summarization to reach %X/%X after %d
seconds, currently at %X/%X
We could emit that every 30 seconds or so until either the situation
resolves itself or the user hits ^C. I think that would be good enough
here. If we want, the interval between messages can be a GUC, but I
don't know how much real need there will be to tailor that.
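A sketch of that server-side loop, where GetSummarizedLSN() is a
hypothetical accessor and the latch-wait details are elided:

int			elapsed = 0;

for (;;)
{
	XLogRecPtr	summarized = GetSummarizedLSN();

	if (summarized >= backup_start_lsn)
		break;				/* summaries have caught up */

	CHECK_FOR_INTERRUPTS();	/* let the user cancel with ^C */
	pg_usleep(30 * USECS_PER_SEC);	/* or a latch wait with a timeout */
	elapsed += 30;
	ereport(WARNING,
			(errmsg("still waiting for WAL summarization to reach %X/%X "
					"after %d seconds, currently at %X/%X",
					LSN_FORMAT_ARGS(backup_start_lsn), elapsed,
					LSN_FORMAT_ARGS(summarized))));
}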
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
So, summarize_wal is no longer turned on by default, I think following a
comment from Peter E. I think this is a good decision, as we're only
going to need them on servers from which incremental backups are going
to be taken, which is a strict subset of all servers; and furthermore,
people that need them are going to realize that very easily, whereas if we
went the other way around, most people would not realize that they need to
turn them off to save some resource consumption.
Granted, the amount of resources additionally used is probably not very
big. But since it can be changed with a reload, not a restart, it doesn't
seem problematic.
Yeah. I meant to say that I'd changed that for that reason, but in the
flurry of new versions I omitted to do so.
... oh, I just noticed that this patch now fails to compile because of
the MemoryContextResetAndDeleteChildren removal.
Fixed.
(Typo in the pg_walsummary manpage: "since WAL summary files primary
exist" -> "primarily")
This, too.
I think it should default to off in primary and standby, and the user
has to enable it in whichever server they want to take backups from.
Yeah, that's how it works currently.
I don't understand this point. Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup. You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?
I don't think I'd want to do exactly that, because then you could only
send one LSN, and I do think we want to send a set of LSN ranges with
the corresponding TLI for each. I was thinking about dumping
UPLOAD_MANIFEST and instead having a command like:
INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698
The client would execute this command one or more times before
starting an incremental backup.
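So a backup whose manifest happens to span a timeline switch might,
hypothetically, issue something like:

INCREMENTAL_WAL_RANGE 1 2/462AC48 2/4800000
INCREMENTAL_WAL_RANGE 2 2/4800000 2/4A01258
BASE_BACKUP ...

one command per (TLI, start LSN, end LSN) triple taken from the
manifest's WAL ranges.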
I propose to keep the door open for that binary doing other things than
dumping the files as text. So add a command argument, which currently
can only be "dump", to allow the command to do other things later if
needed. (For example, remove files from a server on which summarize_wal
has been turned off; or perhaps remove files that are below some LSN.)
I don't like that very much. That sounds like one of those
forward-compatibility things that somebody designs and then nothing
ever happens and ten years later you still have an ugly wart.
My theory is that these files are going to need very little
management. In general, they're small; if you never removed them, it
probably wouldn't hurt, or at least, not for a long time. As to
specific use cases, if you want to remove files from a server on which
summarize_wal has been turned off, you can just use rm. Removing files
from before a certain LSN would probably need a bit of scripting, but
only a bit. Conceivably we could provide something like that in core,
but it doesn't seem necessary, and it also seems to me that we might
do well to include that in pg_archivecleanup rather than in
pg_walsummary.
Here's a new version. Changes:
- Add preparatory renaming patches to the series.
- Rename wal_summarize_keep_time to wal_summary_keep_time.
- Change while (true) to while (1).
- Typo fixes.
- Fix incorrect assertion in summarizer_read_local_xlog_page; this
could cause occasional regression test failures in 004_pg_xlog_symlink
and 009_growing_files.
- Zero-initialize BlockRefTableKey variables.
- Replace a couple instances of pathbuf + basepathlen + 1 with tarfilename.
- Add const to path argument of GetFileBackupMethod.
- Avoid setting output parameters of GetFileBackupMethod unless the
return value is BACK_UP_FILE_INCREMENTALLY.
- In GetFileBackupMethod, postpone qsorting block numbers slightly.
- Define INCREMENTAL_PREFIX_LENGTH using sizeof(), because that should
hopefully work everywhere; the StaticAssertStmt that previously checked
the value doesn't work on Windows. (See the sketch below.)
- Change MemoryContextResetAndDeleteChildren to MemoryContextReset.
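For reference, the sizeof() trick mentioned above amounts to the
following; the macro names reflect my reading of the patch:

#define INCREMENTAL_PREFIX			"INCREMENTAL."
#define INCREMENTAL_PREFIX_LENGTH	(sizeof(INCREMENTAL_PREFIX) - 1)

sizeof() applied to a string literal is a compile-time constant on
every compiler, which sidesteps the Windows problem with the
StaticAssertStmt.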
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v10-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 5329672f6ccd92cb3b60e3c64f699327d099bc88 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v10 3/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
v10-0006-Add-new-pg_walsummary-tool.patch
From 24d51d4309dee0defeafd2cb57cee7fa4148dd19 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v10 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index bf6ed98703..ad8a1166d1 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1fa5f0ed26..0565e160f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4026,3 +4026,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
v10-0004-Add-a-new-WAL-summarizer-process.patch
From c8e46c57bf89d051f64e4ac3666810c83ce74705 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v10 4/7] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1383 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3726 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc35a46e5e..6073b93480 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a6..678495a64b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3574,6 +3575,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3853,8 +3891,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3897,6 +3935,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5221,9 +5279,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6940,6 +6998,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7614,6 +7691,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
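For anyone poking around pg_wal/summaries by hand, the decoding above is
easy to replicate outside the server. A standalone sketch (the file name
here is invented) showing how a summary name maps to a TLI and LSN range:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* hypothetical name: TLI 1, start 0/1000028, end 0/1FFFFD8 */
	const char *name = "0000000100000000010000280000000001FFFFD8";
	unsigned tmp[5];
	uint64_t start_lsn, end_lsn;

	sscanf(name, "%08X%08X%08X%08X%08X",
		   &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
	start_lsn = ((uint64_t) tmp[1]) << 32 | tmp[2];
	end_lsn = ((uint64_t) tmp[3]) << 32 | tmp[4];
	printf("TLI %u, start %X/%X, end %X/%X\n", tmp[0],
		   (unsigned) (start_lsn >> 32), (unsigned) start_lsn,
		   (unsigned) (end_lsn >> 32), (unsigned) end_lsn);
	return 0;
}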
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
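The coverage check above is just a sweep over intervals sorted by start
point. In case the idea is easier to see without the PostgreSQL list
machinery, here's a self-contained sketch of the same algorithm over plain
arrays, with made-up ranges that leave a gap:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uint64_t start, end; } Range;

static int cmp_start(const void *a, const void *b)
{
	const Range *r1 = a, *r2 = b;
	return (r1->start > r2->start) - (r1->start < r2->start);
}

static bool ranges_cover(Range *r, int n, uint64_t start, uint64_t end,
						 uint64_t *missing)
{
	uint64_t current = start;

	qsort(r, n, sizeof(Range), cmp_start);
	for (int i = 0; i < n; i++)
	{
		if (r[i].start > current)
			break;				/* found a gap */
		if (r[i].end > current)
		{
			current = r[i].end;	/* extend the known-covered range */
			if (current >= end)
				return true;
		}
	}
	*missing = current;
	return false;
}

int main(void)
{
	Range r[] = {{0, 100}, {150, 300}, {100, 140}};	/* gap at 140..150 */
	uint64_t missing;

	if (!ranges_cover(r, 3, 0, 300, &missing))
		printf("first uncovered point: %llu\n", (unsigned long long) missing);
	return 0;
}

This prints 140, which is what WalSummariesAreComplete would report via
*missing_lsn for the analogous input.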
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
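Note that a short write is treated as an error here rather than retried;
on a regular file that should essentially only happen on ENOSPC, hence the
hint. For contrast, the familiar retry idiom, which this code intentionally
does not need, looks roughly like this on a bare POSIX descriptor:

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int
write_all(int fd, const char *buf, size_t len)
{
	while (len > 0)
	{
		ssize_t n = write(fd, buf, len);

		if (n < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted, just retry */
			return -1;			/* real error */
		}
		buf += n;
		len -= (size_t) n;
	}
	return 0;
}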
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
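The loop above leans on appendStringInfoVA returning the required space
when the buffer was too small, so that the next pass is guaranteed to fit.
The same grow-and-retry pattern with bare vsnprintf, as a standalone sketch
(allocation failures ignored for brevity):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* printf into a freshly malloc'd buffer, growing until the output fits */
static char *
format_alloc(const char *fmt, ...)
{
	size_t size = 64;
	char *buf = malloc(size);

	for (;;)
	{
		va_list ap;
		int needed;

		va_start(ap, fmt);
		needed = vsnprintf(buf, size, fmt, ap);
		va_end(ap);

		if (needed >= 0 && (size_t) needed < size)
			return buf;			/* it fit */
		size = (needed >= 0) ? (size_t) needed + 1 : size * 2;
		buf = realloc(buf, size);	/* too small: grow and retry */
	}
}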
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..bf6ed98703
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1383 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the last LSN and TLI at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substntially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;	/* ten days, in minutes */
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
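The found flag here is the usual create-or-attach idiom: exactly one
process observes !found and fills in the initial values. As a loose analogy
only (this is not how PostgreSQL's shared memory allocator works, and it
ignores the locking that makes the real thing safe), the same idea over
POSIX shared memory:

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the named segment if it doesn't exist yet, else attach to it. */
static void *
create_or_attach(const char *name, size_t size, bool *found)
{
	int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);

	if (fd >= 0)
	{
		*found = false;			/* we created it, so we must initialize it */
		if (ftruncate(fd, (off_t) size) < 0)
			return NULL;
	}
	else if (errno == EEXIST)
	{
		*found = true;			/* someone else created it already */
		fd = shm_open(name, O_RDWR, 0600);
		if (fd < 0)
			return NULL;
	}
	else
		return NULL;
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}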
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the requested information to the caller. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
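The dance at the top of this function, where we take the lock in shared
mode, fall back to exclusive mode only if initialization is needed, and are
prepared to find that someone else initialized the structure in between, is
a generic pattern. A standalone sketch of the same idea using a pthreads
rwlock, with an invented compute_value() standing in for the real
initialization work:

#include <pthread.h>
#include <stdbool.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static bool initialized = false;
static int cached_value;

static int
compute_value(void)
{
	return 42;					/* stand-in for expensive one-time setup */
}

static int
get_value(void)
{
	bool exclusive = false;

	for (;;)
	{
		if (exclusive)
			pthread_rwlock_wrlock(&lock);
		else
			pthread_rwlock_rdlock(&lock);

		if (initialized)
		{
			int v = cached_value;	/* fast path: already initialized */

			pthread_rwlock_unlock(&lock);
			return v;
		}
		if (exclusive)
			break;				/* we hold the write lock; initialize below */
		pthread_rwlock_unlock(&lock);
		exclusive = true;		/* retry with the write lock */
	}

	cached_value = compute_value();
	initialized = true;
	pthread_rwlock_unlock(&lock);
	return cached_value;
}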
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
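A hypothetical caller (the variable names here are invented) would use this
roughly as follows, deriving success or failure from the returned LSN:

	XLogRecPtr	reached;

	/* wait up to 60 seconds for summarization to reach target_lsn */
	reached = WaitForWalSummarization(target_lsn, 60000);
	if (reached < target_lsn)
		ereport(ERROR,
				errmsg("WAL summarization did not catch up to %X/%X",
					   LSN_FORMAT_ARGS(target_lsn)));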
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
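The write-to-a-temp-file-then-durable_rename sequence above is what gives
us an all-or-nothing summary file even if we crash partway through. In bare
POSIX terms, the rename step amounts to roughly the following sketch (error
paths simplified; the real durable_rename in fd.c does more):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace 'final' with 'temp' in a crash-safe way. */
static int
durable_replace(const char *temp, const char *final, const char *dir)
{
	int fd = open(temp, O_RDWR);

	if (fd < 0 || fsync(fd) < 0)	/* the file's data must hit disk first */
		return -1;
	close(fd);

	if (rename(temp, final) < 0)	/* atomic swap of the two names */
		return -1;

	fd = open(dir, O_RDONLY);		/* persist the new directory entry */
	if (fd < 0 || fsync(fd) < 0)
		return -1;
	close(fd);
	return 0;
}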
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a partial
+ * record and flushed it to disk, and we'd view that data
+ * as safe to read. However, the XLOG_END_OF_RECOVERY
+ * record will be written at the end of the last complete
+ * WAL record, not at the end of the WAL that we've flushed
+ * to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that more WAL is likely to be available when we
+ * wake up.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
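+
+/*
+ * Illustration (numbers hypothetical; MS_PER_SLEEP_QUANTUM and
+ * MAX_SLEEP_QUANTA are defined earlier in this file): with a 200 ms
+ * quantum and a cap of 128 quanta, an idle summarizer sleeps 200 ms,
+ * 400 ms, 800 ms, and so on up to roughly 25 s, while a burst of page
+ * reads walks sleep_quanta back down toward 1.
+ */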
+
+/*
+ * Remove WAL summary files whose last modification time precedes the
+ * wal_summary_keep_time cutoff. Thanks to the redo-pointer check below,
+ * this does real work at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
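+
+ /*
+ * For example, with the default wal_summary_keep_time of 10 days (14400
+ * minutes), only files whose mtime is more than 864000 seconds in the
+ * past are candidates for removal.
+ */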
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL described by this summary file no longer exists, we
+ * can remove the summary file, provided that its modification time
+ * is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b764ef6998..a6de5aca0a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..e732453daa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..d952dee912
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks for which a block
+ * reference has appeared in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
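+
+/*
+ * Worked example (illustrative only): block 70000 falls in chunk
+ * 70000 / BLOCKS_PER_CHUNK = 1 and is recorded there as the offset
+ * 70000 % BLOCKS_PER_CHUNK = 4464. An array-format chunk holds at most
+ * MAX_ENTRIES_PER_CHUNK = 4096 two-byte offsets, i.e. 8kB; a bitmap
+ * chunk uses the same 8kB as one bit for each of the 65536 blocks.
+ */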
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* hash table entry status, managed by simplehash */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
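+
+/*
+ * Overall serialized file layout, as produced by the code below:
+ *
+ *     uint32 magic number (BLOCKREFTABLE_MAGIC)
+ *     for each relation fork, in sorted order:
+ *         BlockRefTableSerializedEntry
+ *         uint16 chunk_usage[nchunks] (trailing zero entries trimmed)
+ *         data for each chunk whose chunk_usage is nonzero
+ *     all-zeroes BlockRefTableSerializedEntry, serving as a sentinel
+ *     pg_crc32c covering everything except the CRC itself
+ */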
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
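+
+/*
+ * Usage sketch (illustrative, not part of the patch): a WAL scanner
+ * might maintain a table like this, where 'rlocator' identifies the
+ * relation touched by the record just decoded:
+ *
+ *     BlockRefTable *brtab = CreateEmptyBlockRefTable();
+ *
+ *     BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, blkno);
+ *     BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, nblocks);
+ */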
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
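+
+/*
+ * Example (illustrative): with 'nblocks' the relation's current length
+ * in blocks, a single call with enough output space retrieves every
+ * modified block below that point. Note that the result is not
+ * necessarily sorted, because array-format chunks keep offsets in
+ * insertion order.
+ *
+ *     BlockNumber *blocks = palloc(sizeof(BlockNumber) * nblocks);
+ *     int n;
+ *
+ *     n = BlockRefTableEntryGetBlocks(entry, 0, nblocks, blocks, nblocks);
+ */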
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
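+
+/*
+ * Example write callback (illustrative, frontend-flavored; 'my_write' is
+ * hypothetical): per the contract in blkreftable.h, a write callback
+ * must complete the whole write or report an error and not return.
+ *
+ *     static int
+ *     my_write(void *callback_arg, void *data, int length)
+ *     {
+ *         FILE *f = (FILE *) callback_arg;
+ *
+ *         if (fwrite(data, 1, length, f) != (size_t) length)
+ *             pg_fatal("could not write block reference table: %m");
+ *         return length;
+ *     }
+ */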
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
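+
+/*
+ * Typical read loop (sketch): per the contract above, GetBlocks must be
+ * drained to zero before moving on to the next relation.
+ *
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator,
+ *                                            &forknum, &limit_block))
+ *     {
+ *         BlockNumber blocks[256];
+ *         unsigned n;
+ *
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 256)) > 0)
+ *             ... process n block numbers ...
+ *     }
+ */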
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
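+
+/*
+ * Incremental write sketch (illustrative; 'my_write' and 'state' are
+ * hypothetical): entries must be supplied already sorted by tablespace,
+ * database, relfilenumber, and fork number.
+ *
+ *     BlockRefTableWriter *writer;
+ *
+ *     writer = CreateBlockRefTableWriter(my_write, state);
+ *     (for each relation fork, in sorted order)
+ *         BlockRefTableWriteEntry(writer, entry);
+ *     DestroyBlockRefTableWriter(writer);
+ *
+ * The final call writes the sentinel entry and the CRC.
+ */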
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
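+
+/*
+ * Worked example (illustrative): truncating a relation to 70000 blocks
+ * gives limit_chunkno = 1 and limit_chunkoffset = 4464; chunks 2 and
+ * above are emptied outright, and within chunk 1 any recorded block at
+ * offset 4464 or beyond is discarded.
+ */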
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
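+
+/*
+ * To recap the growth strategy above with concrete numbers: a chunk
+ * starts as a 16-entry array and doubles as needed (16, 32, ..., 4096
+ * entries); once 4095 distinct offsets are stored, the next insertion
+ * converts it to an 8kB bitmap, after which its representation never
+ * changes again.
+ */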
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk directory structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * database, then tablespace, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
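
As a quick sanity check on the API above (in particular the io_callback_fn
contract and the limit-block rules), here is a minimal sketch of how a
backend-side caller might build and serialize an in-memory table. The
callback, the file descriptor plumbing, and all of the OIDs are hypothetical,
and error handling is reduced to elog():

#include "postgres.h"

#include <unistd.h>

#include "catalog/pg_tablespace_d.h"
#include "common/blkreftable.h"
#include "common/relpath.h"

/*
 * Hypothetical write callback. Per the contract above, short writes are
 * retried, so that the return value always equals the request length.
 */
static int
write_to_fd(void *callback_arg, void *data, int length)
{
	int			fd = *(int *) callback_arg;
	int			done = 0;

	while (done < length)
	{
		int			rc = write(fd, (char *) data + done, length - done);

		if (rc < 0)
			elog(ERROR, "could not write block reference table: %m");
		done += rc;
	}
	return done;
}

static void
serialize_example(int fd)
{
	BlockRefTable *brtab = CreateEmptyBlockRefTable();
	RelFileLocator rlocator;

	rlocator.spcOid = DEFAULTTABLESPACE_OID;
	rlocator.dbOid = 5;			/* made-up database OID */
	rlocator.relNumber = 16384;	/* made-up relfilenumber */

	/* The relation's main fork was truncated to 100 blocks ... */
	BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);

	/* ... and block 7 was modified by a later WAL record. */
	BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);

	/* Serialize everything through the callback. */
	WriteBlockRefTable(brtab, write_to_fd, &fd);
}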
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..4a6792e5f9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
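
And a sketch of how I'd expect a consumer such as the backup code to use
this interface. The variable names and the 60-second timeout are invented,
and I'm assuming the timeout argument is in milliseconds:

	XLogRecPtr	summarized_lsn;

	/* Wait up to 60 seconds for summarization to reach the backup start. */
	summarized_lsn = WaitForWalSummarization(backup_start_lsn, 60000);
	if (summarized_lsn < backup_start_lsn)
		ereport(ERROR,
				(errmsg("WAL summarization is not progressing")));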
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3cea73e220..7a2807a9a3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4003,3 +4003,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0007-Test-patch-Enable-summarize_wal-by-default.patch
From 4afa56e6ec82bc812fa25c8993f373e7f9bf662a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v10 7/7] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on and wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index ad8a1166d1..ca3e504f47 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a6de5aca0a..170f491d7a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0005-Add-support-for-incremental-backup.patch
From 564c4e822c350b7d17406746a56464e388212060 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v10 5/7] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 313 +++-
src/backend/backup/basebackup_incremental.c | 913 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 281 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1267 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 682 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
47 files changed, 5665 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files that contain
+ only the blocks that have changed since the earlier backup, plus enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6073b93480..d5083afc87 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove the one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
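
To make the chain handling concrete, here's a slightly fuller usage example
than the one at the top of this email: a full backup plus two incrementals,
with a tablespace relocated during reconstruction. All paths are invented:

pg_combinebackup /backups/full /backups/incr1 /backups/incr2 \
    -T /mnt/old_tblspc=/mnt/new_tblspc \
    -o /restore/pgdata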
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
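
In other words, the backup_label of an incremental backup picks up two extra
lines, along these lines (values invented):

INCREMENTAL FROM LSN: 0/5000028
INCREMENTAL FROM TLI: 1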
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..9ecce5f222 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1129,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1163,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1183,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1192,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1234,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1386,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1461,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1536,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
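
So, to make the naming concrete: a relation segment of which only a few
blocks have changed is renamed in the backup stream, with its size
recomputed via GetIncrementalFileSize(). For example (OIDs invented):

full backup:        base/5/16384
incremental backup: base/5/INCREMENTAL.16384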
@@ -1446,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
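
To spell out the on-disk format that the sendFile() changes above produce:
each INCREMENTAL.* file begins with a header, followed by the raw page
contents. Roughly this, though it's a description rather than a real struct,
since the array is variable-length, and it assumes "unsigned" is 32 bits:

	uint32		magic;				/* INCREMENTAL_MAGIC */
	uint32		num_incremental_blocks;		/* how many blocks follow */
	uint32		truncation_block_length;	/* see blkreftable.h */
	BlockNumber	blocks[num_incremental_blocks];	/* segment-relative */
	/* ... then num_incremental_blocks * BLCKSZ bytes of page data ... */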
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..2e051f297c
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,913 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
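+ *
+ * For example (hypothetical OIDs), dboid 5, spcoid 1663, relfilenumber 16384,
+ * main fork, segno 2 yields "base/5/INCREMENTAL.16384.2".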
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
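+ /* Split the path into the containing directory and the file name. */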
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE
+ * entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
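+
+ /*
+ * The prior backup might itself have been incremental, in which case
+ * its manifest would list this file under its incremental name rather
+ * than under the ordinary relation file name.
+ */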
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %lu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
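+ *
+ * For example, with the default BLCKSZ of 8192, an incremental file holding
+ * 10 blocks would occupy 3 * 4 + 10 * (4 + 8192) = 81972 bytes.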
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ const unsigned char *ss = (const unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
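+ /*
+ * appendStringInfoVA() returns 0 on success; otherwise it returns the
+ * amount of additional space needed, so enlarge the buffer and retry
+ * until the formatted message fits.
+ */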
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
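+ *
+ * We compare explicitly rather than returning (aa - bb): BlockNumber is an
+ * unsigned 32-bit type, so the subtraction could wrap instead of producing
+ * a negative result.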
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in CopyIn mode as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..2a62aa6fad
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,281 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
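+ /*
+ * Bits in "found": 1 = start LSN, 2 = start TLI, 4 = previous LSN,
+ * 8 = previous TLI.
+ */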
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse START WAL LOCATION",
+ filename);
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for START WAL LOCATION",
+ filename);
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for START TIMELINE",
+ filename);
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
+ filename);
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
+ filename);
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
+ filename);
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find START WAL LOCATION", filename);
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find START TIMELINE", filename);
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if *sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
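+ /* Temporarily NUL-terminate at e so sscanf cannot run past the line. */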
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..f2b45787e9
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..2b4e2dadff
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
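+ *
+ * For example, under this assumption a 10MB manifest is estimated to
+ * describe roughly 100,000 files, which is then used (clamped to the range
+ * [256, PG_UINT32_MAX]) as the initial hash table size.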
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..d52cc40b8b
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1267 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
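+
+/*
+ * For example, an incremental backup stores a partial copy of a relation
+ * file such as "base/5/16384" under the name "base/5/INCREMENTAL.16384";
+ * the reconstructed output file is written under the original name, with
+ * the prefix stripped.
+ */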
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ {
+ /* Without the final backup's manifest we have no WAL ranges to write. */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+ }
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
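+ *
+ * For example, -T /srv/old_ts=/srv/new_ts relocates one tablespace. An "="
+ * that is part of a directory name can be written as "\=", which is copied
+ * literally rather than treated as the separator.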
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
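+ *
+ * For example, given directories F I1 I2 (oldest first) on the command
+ * line, I2's backup_label must record I1's start TLI and LSN as its
+ * "previous" values, I1's must record F's, and F itself, being a full
+ * backup, must record no previous values at all.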
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+
+ pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s/global/pg_control: unexpected control file version",
+ backup_dirs[i]);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
+ backup_dirs[i], (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
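+ *
+ * For example, "16384" parses to OID 16384, while "0", "16384x", and the
+ * empty string are all rejected.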
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
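+
+ /*
+ * For example, a file in subdirectory "base/1" of the main data directory
+ * gets manifest_prefix "base/1/", while a file at the top level of
+ * tablespace 16385 gets "pg_tblspc/16385/".
+ */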
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ input_directory, manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ closedir(dir);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..d3d089527e
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,682 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
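+ *
+ * For example, if block 3 of the output file is stored as the second block
+ * (i == 1) of some incremental file, then sourcemap[3] points at that
+ * file's rfile and offsetmap[3] is its header_length + 1 * BLCKSZ, matching
+ * the layout used below.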
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
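+
+	/*
+	 * For example, if truncation_block_length is 4 and the newest
+	 * incremental file contains blocks 1 and 5, the output file will be 6
+	 * blocks long: blocks 1 and 5 come from that file, blocks 0, 2, and 3
+	 * must still be found in an older backup, and block 4, which is past
+	 * the truncation length and present in no source, will be zero-filled.
+	 */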
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+			 * blocks that result from the server extending the file without
+			 * taking any action on those blocks that would have generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+			 * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
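+			 *
+			 * For instance, with the default 8kB block size, a
+			 * truncation_block_length of 100 permits a full copy only when
+			 * the source file is exactly 819200 bytes long.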
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
+ prior_backup_dirs[copy_source_index],
+ manifest_path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
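+ *
+ * For example, if truncation_block_length is 4 and the incremental file
+ * contains blocks 2 and 7, the reconstructed length is 8 blocks.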
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+		{
+			pg_free(rf->filename);
+			pg_free(rf);
+			return NULL;
+		}
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+	int			rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
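+		/*
+		 * Each entry is "block:source@offset", or "first-last:source@offset"
+		 * for a run of consecutive blocks drawn from the same source;
+		 * "zero" marks blocks that will be zero-filled.
+		 */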
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+				if (current_block == start_of_range)
+					appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+									 current_block, s->filename,
+									 (uint64) offsetmap[current_block]);
+				else
+					appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+									 start_of_range, current_block,
+									 s->filename,
+									 (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+		int			wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+			/*
+			 * This block is not available from any source. It should be a
+			 * new, uninitialized block, so just zero-fill it.
+			 */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+			int			rb;
+
+			/* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();")
+  or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();")
+  or die "Timed out while waiting for apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+	'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+	'-d', $pitr2->connstr('postgres'),
+	],
+	'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+	"Checksum-Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+	[ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+	qr/is an incremental backup, but the first backup should be a full backup/,
+	"can't combine two incremental backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
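+ *
+ * Roughly, the finished manifest has this shape (checksums, sizes, and
+ * timestamps here are abbreviated, illustrative values; see
+ * add_file_to_manifest and finalize_manifest for the exact field set):
+ *
+ * { "PostgreSQL-Backup-Manifest-Version": 1,
+ * "Files": [
+ * { "Path": "backup_label", "Size": 227, "Last-Modified": "...",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "..." } ],
+ * "WAL-Ranges": [
+ * { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/2000100" } ],
+ * "Manifest-Checksum": "..."}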
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
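+ *
+ * For example, the input string say "hi" becomes "say \"hi\"" in the
+ * output buffer.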
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+	if (mwriter->fd == -1 &&
+		(mwriter->fd = open(mwriter->pathname,
+							O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+							pg_file_create_mode)) < 0)
+		pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+	if (mwriter->buf.len > 0)
+	{
+		ssize_t		wb;
+
+		wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+		if (wb != mwriter->buf.len)
+		{
+			if (wb < 0)
+				pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+			else
+				pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+						 mwriter->pathname, (int) wb, mwriter->buf.len);
+		}
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
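+ *
+ * For example, the two bytes 0xd3 and 0xae are encoded as "d3ae".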
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
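+ *
+ * WAL summary file names consist of 40 hexadecimal characters followed by
+ * ".summary"; anything else found in the directory is left alone.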
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+		pg_fatal("could not close directory \"%s\": %m", WALSUMMARYDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..72b4ecaf12 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
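+For example, the following call combines 'backup1' and 'backup2' into the
+new node's data directory before starting it:
+
+	$node->init_from_backup($primary, 'backup2',
+		combine_with_prior => [ 'backup1' ]);
+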
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7a2807a9a3..1fa5f0ed26 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4014,3 +4014,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
From e74622823707444bff190b71dec4b29d5c3882be Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:15:14 -0500
Subject: [PATCH v10 2/7] Rename pg_verifybackup's JsonManifestParseContext
callback functions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The old names were too generic, and would have applied to any binary
that made use of JsonManifestParseContext. Rename to make the names
specific to pg_verifybackup, since there are plans afoot to reuse
this infrastructure.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/pg_verifybackup.c | 36 +++++++++++------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 8526eb9bbf..d921d0f003 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -119,15 +119,15 @@ static void parse_manifest_file(char *manifest_path,
manifest_files_hash **ht_p,
manifest_wal_range **first_wal_range_p);
-static void record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length,
- uint8 *checksum_payload);
-static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+static void verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
static void report_manifest_error(JsonManifestParseContext *context,
const char *fmt,...)
pg_attribute_printf(2, 3) pg_attribute_noreturn();
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.per_file_cb = record_manifest_details_for_file;
- context.per_wal_range_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = verifybackup_per_file_cb;
+ context.per_wal_range_cb = verifybackup_per_wal_range_cb;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
@@ -475,10 +475,10 @@ report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
* Record details extracted from the backup manifest for one file.
*/
static void
-record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload)
+verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
{
parser_context *pcxt = context->private_data;
manifest_files_hash *ht = pcxt->ht;
@@ -504,9 +504,9 @@ record_manifest_details_for_file(JsonManifestParseContext *context,
* Record details extracted from the backup manifest for one WAL range.
*/
static void
-record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
parser_context *pcxt = context->private_data;
manifest_wal_range *range;
--
2.37.1 (Apple Git-137.1)
Attachment: v10-0001-Rename-JsonManifestParseContext-callbacks.patch
From ef099e66862c7c6fe03410e4d9cd4bb450d7ee94 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:10:01 -0500
Subject: [PATCH v10 1/7] Rename JsonManifestParseContext callbacks.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There is currently a worldwide oversupply of underscores, so use
some of them here as word separators. In the event of a later
underscore shortage, these can be removed again, and another of
PostgreSQL's innumerable methods of marking word boundaries can
be substituted.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/parse_manifest.c | 8 ++++----
src/bin/pg_verifybackup/parse_manifest.h | 18 +++++++++---------
src/bin/pg_verifybackup/pg_verifybackup.c | 4 ++--
src/tools/pgindent/typedefs.list | 4 ++--
4 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c
index bf0227c668..850adf90a8 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/bin/pg_verifybackup/parse_manifest.c
@@ -112,7 +112,7 @@ static bool parse_xlogrecptr(XLogRecPtr *result, char *input);
*
* Caller should set up the parsing context and then invoke this function.
* For each file whose information is extracted from the manifest,
- * context->perfile_cb is invoked. In case of trouble, context->error_cb is
+ * context->per_file_cb is invoked. In case of trouble, context->error_cb is
* invoked and is expected not to return.
*/
void
@@ -545,8 +545,8 @@ json_manifest_finalize_file(JsonManifestParseState *parse)
}
/* Invoke the callback with the details we've gathered. */
- context->perfile_cb(context, parse->pathname, size,
- checksum_type, checksum_length, checksum_payload);
+ context->per_file_cb(context, parse->pathname, size,
+ checksum_type, checksum_length, checksum_payload);
/* Free memory we no longer need. */
if (parse->size != NULL)
@@ -602,7 +602,7 @@ json_manifest_finalize_wal_range(JsonManifestParseState *parse)
"could not parse end LSN");
/* Invoke the callback with the details we've gathered. */
- context->perwalrange_cb(context, tli, start_lsn, end_lsn);
+ context->per_wal_range_cb(context, tli, start_lsn, end_lsn);
/* Free memory we no longer need. */
if (parse->timeline != NULL)
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/bin/pg_verifybackup/parse_manifest.h
index 7387a917a2..001b9a6a11 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/bin/pg_verifybackup/parse_manifest.h
@@ -21,13 +21,13 @@
struct JsonManifestParseContext;
typedef struct JsonManifestParseContext JsonManifestParseContext;
-typedef void (*json_manifest_perfile_callback) (JsonManifestParseContext *,
- char *pathname,
- size_t size, pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload);
-typedef void (*json_manifest_perwalrange_callback) (JsonManifestParseContext *,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+typedef void (*json_manifest_per_file_callback) (JsonManifestParseContext *,
+ char *pathname,
+ size_t size, pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload);
+typedef void (*json_manifest_per_wal_range_callback) (JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
const char *fmt,...) pg_attribute_printf(2, 3)
pg_attribute_noreturn();
@@ -35,8 +35,8 @@ typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
struct JsonManifestParseContext
{
void *private_data;
- json_manifest_perfile_callback perfile_cb;
- json_manifest_perwalrange_callback perwalrange_cb;
+ json_manifest_per_file_callback per_file_cb;
+ json_manifest_per_wal_range_callback per_wal_range_cb;
json_manifest_error_callback error_cb;
};
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..8526eb9bbf 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.perfile_cb = record_manifest_details_for_file;
- context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = record_manifest_details_for_file;
+ context.per_wal_range_cb = record_manifest_details_for_wal_range;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13..3cea73e220 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3441,8 +3441,8 @@ jmp_buf
join_search_hook_type
json_aelem_action
json_manifest_error_callback
-json_manifest_perfile_callback
-json_manifest_perwalrange_callback
+json_manifest_per_file_callback
+json_manifest_per_wal_range_callback
json_ofield_action
json_scalar_action
json_struct_action
--
2.37.1 (Apple Git-137.1)
I made a pass over pg_combinebackup for NLS. I propose the attached
patch.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Right now the sectors on the hard disk run clockwise, but I heard a rumor that
you can squeeze 0.2% more throughput by running them counterclockwise.
It's worth the effort. Recommended." (Gerry Pourwelle)
Attachments:
0001-do-NLS-for-pg_combinebackup.patch
From a385542ff03514885fa4e84b0485e51cdcdd04bd Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri, 17 Nov 2023 10:51:36 +0100
Subject: [PATCH] do NLS for pg_combinebackup
---
src/bin/pg_combinebackup/backup_label.c | 34 +++++++++++----------
src/bin/pg_combinebackup/nls.mk | 11 +++++++
src/bin/pg_combinebackup/pg_combinebackup.c | 24 ++++++++++-----
src/bin/pg_combinebackup/reconstruct.c | 9 ++++--
4 files changed, 52 insertions(+), 26 deletions(-)
create mode 100644 src/bin/pg_combinebackup/nls.mk
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
index 2a62aa6fad..922e00854d 100644
--- a/src/bin/pg_combinebackup/backup_label.c
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -63,18 +63,18 @@ parse_backup_label(char *filename, StringInfo buf,
if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
{
if (!parse_lsn(s, e, start_lsn, &c))
- pg_fatal("%s: could not parse START WAL LOCATION",
- filename);
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
if (c >= e || *c != ' ')
- pg_fatal("%s: improper terminator for START WAL LOCATION",
- filename);
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
found |= 1;
}
else if (line_starts_with(s, e, "START TIMELINE: ", &s))
{
if (!parse_tli(s, e, start_tli))
- pg_fatal("%s: could not parse TLI for START TIMELINE",
- filename);
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
if (*start_tli == 0)
pg_fatal("%s: invalid TLI", filename);
found |= 2;
@@ -82,18 +82,18 @@ parse_backup_label(char *filename, StringInfo buf,
else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
{
if (!parse_lsn(s, e, previous_lsn, &c))
- pg_fatal("%s: could not parse INCREMENTAL FROM LSN",
- filename);
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
if (c >= e || *c != '\n')
- pg_fatal("%s: improper terminator for INCREMENTAL FROM LSN",
- filename);
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
found |= 4;
}
else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
{
if (!parse_tli(s, e, previous_tli))
- pg_fatal("%s: could not parse INCREMENTAL FROM TLI",
- filename);
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
if (*previous_tli == 0)
pg_fatal("%s: invalid TLI", filename);
found |= 8;
@@ -103,13 +103,15 @@ parse_backup_label(char *filename, StringInfo buf,
}
if ((found & 1) == 0)
- pg_fatal("%s: could not find START WAL LOCATION", filename);
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
if ((found & 2) == 0)
- pg_fatal("%s: could not find START TIMELINE", filename);
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
if ((found & 4) != 0 && (found & 8) == 0)
- pg_fatal("%s: INCREMENTAL FROM LSN requires INCREMENTAL FROM TLI", filename);
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
if ((found & 8) != 0 && (found & 4) == 0)
- pg_fatal("%s: INCREMENTAL FROM TLI requires INCREMENTAL FROM LSN", filename);
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
}
/*
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 7bf56e57ae..618b5dd7f6 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -521,29 +521,33 @@ check_control_files(int n_backups, char **backup_dirs)
{
ControlFileData *control_file;
bool crc_ok;
+ char *controlpath;
- pg_log_debug("reading \"%s/global/pg_control\"", backup_dirs[i]);
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+
+ pg_log_debug("reading \"%s\"", controlpath);
control_file = get_controlfile(backup_dirs[i], &crc_ok);
/* Control file contents not meaningful if CRC is bad. */
if (!crc_ok)
- pg_fatal("%s/global/pg_control: crc is incorrect", backup_dirs[i]);
+ pg_fatal("%s: crc is incorrect", controlpath);
/* Can't interpret control file if not current version. */
if (control_file->pg_control_version != PG_CONTROL_VERSION)
- pg_fatal("%s/global/pg_control: unexpected control file version",
- backup_dirs[i]);
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
/* System identifiers should all match. */
if (i == n_backups - 1)
system_identifier = control_file->system_identifier;
else if (system_identifier != control_file->system_identifier)
- pg_fatal("%s/global/pg_control: expected system identifier %llu, but found %llu",
- backup_dirs[i], (unsigned long long) system_identifier,
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
(unsigned long long) control_file->system_identifier);
/* Release memory. */
pfree(control_file);
+ pfree(controlpath);
}
/*
@@ -932,12 +936,16 @@ process_directory_recursively(Oid tsoid,
manifest_path);
if (mfile == NULL)
{
+ char *fullpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+
/*
* The directory is out of sync with the backup_manifest,
* so emit a warning.
*/
- pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
- input_directory, manifest_path);
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ fullpath, manifest_path);
+ pfree(fullpath);
}
else if (mfile->checksum_type == checksum_type)
{
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index e7f0523fe9..19ab96904b 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -283,13 +283,18 @@ reconstruct_from_incremental_file(char *input_filename,
manifest_path);
if (mfile == NULL)
{
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
/*
* The directory is out of sync with the backup_manifest, so emit
* a warning.
*/
- pg_log_warning("\"%s/backup_manifest\" contains no entry for \"%s\"",
- prior_backup_dirs[copy_source_index],
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
manifest_path);
+ pfree(path);
}
else if (mfile->checksum_type == checksum_type)
{
--
2.39.2
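To make the pattern in the patch above concrete: hoisting the fixed keywords out of the format strings means several call sites share a single translatable message, and translators never see (and so cannot mangle) the keywords themselves. A minimal before-and-after sketch of the idiom (illustrative lines, not taken verbatim from the patch):

    /* Before: each keyword is baked into its own translatable string. */
    pg_fatal("%s: could not parse START WAL LOCATION", filename);

    /*
     * After: one translatable string; the keyword travels as a %s
     * argument, so the message catalog contains a single entry.
     */
    pg_fatal("%s: could not parse %s", filename, "START WAL LOCATION");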
On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I made a pass over pg_combinebackup for NLS. I propose the attached
> patch.
This doesn't quite compile for me, so I changed a few things and
incorporated it. Hopefully I didn't mess anything up.
Here's v11. In addition to incorporating Álvaro's NLS changes, with
the off-list help of Jakub Wartak, I finally tracked down two one-line
bugs in BlockRefTableEntryGetBlocks that have been causing the cfbot
to blow up on these patches. What I hadn't realized is that cfbot runs
with the relation segment size changed to 6 blocks, which tickled some
code paths that I wasn't exercising locally. Thanks a ton to Jakub for
the help running this down. cfbot was unhappy about a %lu so I've
changed that to %zu in this version, too. Finally, the previous
version of this patch set had some pgindent damage, so that is
hopefully now cleaned up as well.
I wish I had better ideas about how to thoroughly test this. I've got
a bunch of different tests for pg_combinebackup and I think those are
good, but the bugs mentioned in the previous paragraph show that those
aren't sufficient to catch all of the logic errors that can exist,
which is not great. But, as I say, I'm not quite sure how to do
better, so I guess I'll just need to keep fixing problems as we find
them.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v11-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 12544993d124cedf799108a2ffcfb66f9c84b7b5 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v11 3/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.37.1 (Apple Git-137.1)
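To illustrate the reuse this move enables, here is a sketch of a hypothetical frontend consumer of the relocated parser. The callback and state names are invented for the example; the context fields and the json_parse_manifest() call follow the patched header.

    #include "postgres_fe.h"

    #include "common/parse_manifest.h"

    static void
    my_per_file_cb(JsonManifestParseContext *context,
                   char *pathname, size_t size,
                   pg_checksum_type checksum_type,
                   int checksum_length, uint8 *checksum_payload)
    {
        /* record whatever this tool needs to know about one file */
    }

    static void
    my_per_wal_range_cb(JsonManifestParseContext *context,
                        TimeLineID tli,
                        XLogRecPtr start_lsn, XLogRecPtr end_lsn)
    {
        /* likewise for one WAL range */
    }

    static void
    my_error_cb(JsonManifestParseContext *context, const char *fmt,...)
    {
        /* report the problem; this callback must not return */
        exit(1);
    }

    static void
    parse_some_manifest(void *state, char *buffer, size_t size)
    {
        JsonManifestParseContext context;

        context.private_data = state;
        context.per_file_cb = my_per_file_cb;
        context.per_wal_range_cb = my_per_wal_range_cb;
        context.error_cb = my_error_cb;
        json_parse_manifest(&context, buffer, size);
    }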
v11-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
From 276a7bc53517e01b88c995aa0a75422d6fa9ffea Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:15:14 -0500
Subject: [PATCH v11 2/7] Rename pg_verifybackup's JsonManifestParseContext
callback functions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The old names were too generic, and would have applied to any binary
that made use of JsonManifestParseContext. Rename to make the names
specific to pg_verifybackup, since there are plans afoot to reuse
this infrastructure.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/pg_verifybackup.c | 36 +++++++++++------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 8526eb9bbf..d921d0f003 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -119,15 +119,15 @@ static void parse_manifest_file(char *manifest_path,
manifest_files_hash **ht_p,
manifest_wal_range **first_wal_range_p);
-static void record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length,
- uint8 *checksum_payload);
-static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+static void verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
static void report_manifest_error(JsonManifestParseContext *context,
const char *fmt,...)
pg_attribute_printf(2, 3) pg_attribute_noreturn();
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.per_file_cb = record_manifest_details_for_file;
- context.per_wal_range_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = verifybackup_per_file_cb;
+ context.per_wal_range_cb = verifybackup_per_wal_range_cb;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
@@ -475,10 +475,10 @@ report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
* Record details extracted from the backup manifest for one file.
*/
static void
-record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload)
+verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
{
parser_context *pcxt = context->private_data;
manifest_files_hash *ht = pcxt->ht;
@@ -504,9 +504,9 @@ record_manifest_details_for_file(JsonManifestParseContext *context,
* Record details extracted from the backup manifest for one WAL range.
*/
static void
-record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
parser_context *pcxt = context->private_data;
manifest_wal_range *range;
--
2.37.1 (Apple Git-137.1)
v11-0001-Rename-JsonManifestParseContext-callbacks.patch
From 8894a309dcc7dd43dccda0f4de5051c032b458b8 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:10:01 -0500
Subject: [PATCH v11 1/7] Rename JsonManifestParseContext callbacks.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There is currently a worldwide oversupply of underscores, so use
some of them here as word separators. In the event of a later
underscore shortage, these can be removed again, and another of
PostgreSQL's innumerable methods of marking word boundaries can
be substituted.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/parse_manifest.c | 8 ++++----
src/bin/pg_verifybackup/parse_manifest.h | 18 +++++++++---------
src/bin/pg_verifybackup/pg_verifybackup.c | 4 ++--
src/tools/pgindent/typedefs.list | 4 ++--
4 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c
index bf0227c668..850adf90a8 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/bin/pg_verifybackup/parse_manifest.c
@@ -112,7 +112,7 @@ static bool parse_xlogrecptr(XLogRecPtr *result, char *input);
*
* Caller should set up the parsing context and then invoke this function.
* For each file whose information is extracted from the manifest,
- * context->perfile_cb is invoked. In case of trouble, context->error_cb is
+ * context->per_file_cb is invoked. In case of trouble, context->error_cb is
* invoked and is expected not to return.
*/
void
@@ -545,8 +545,8 @@ json_manifest_finalize_file(JsonManifestParseState *parse)
}
/* Invoke the callback with the details we've gathered. */
- context->perfile_cb(context, parse->pathname, size,
- checksum_type, checksum_length, checksum_payload);
+ context->per_file_cb(context, parse->pathname, size,
+ checksum_type, checksum_length, checksum_payload);
/* Free memory we no longer need. */
if (parse->size != NULL)
@@ -602,7 +602,7 @@ json_manifest_finalize_wal_range(JsonManifestParseState *parse)
"could not parse end LSN");
/* Invoke the callback with the details we've gathered. */
- context->perwalrange_cb(context, tli, start_lsn, end_lsn);
+ context->per_wal_range_cb(context, tli, start_lsn, end_lsn);
/* Free memory we no longer need. */
if (parse->timeline != NULL)
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/bin/pg_verifybackup/parse_manifest.h
index 7387a917a2..001b9a6a11 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/bin/pg_verifybackup/parse_manifest.h
@@ -21,13 +21,13 @@
struct JsonManifestParseContext;
typedef struct JsonManifestParseContext JsonManifestParseContext;
-typedef void (*json_manifest_perfile_callback) (JsonManifestParseContext *,
- char *pathname,
- size_t size, pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload);
-typedef void (*json_manifest_perwalrange_callback) (JsonManifestParseContext *,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+typedef void (*json_manifest_per_file_callback) (JsonManifestParseContext *,
+ char *pathname,
+ size_t size, pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload);
+typedef void (*json_manifest_per_wal_range_callback) (JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
const char *fmt,...) pg_attribute_printf(2, 3)
pg_attribute_noreturn();
@@ -35,8 +35,8 @@ typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
struct JsonManifestParseContext
{
void *private_data;
- json_manifest_perfile_callback perfile_cb;
- json_manifest_perwalrange_callback perwalrange_cb;
+ json_manifest_per_file_callback per_file_cb;
+ json_manifest_per_wal_range_callback per_wal_range_cb;
json_manifest_error_callback error_cb;
};
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..8526eb9bbf 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.perfile_cb = record_manifest_details_for_file;
- context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = record_manifest_details_for_file;
+ context.per_wal_range_cb = record_manifest_details_for_wal_range;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13..3cea73e220 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3441,8 +3441,8 @@ jmp_buf
join_search_hook_type
json_aelem_action
json_manifest_error_callback
-json_manifest_perfile_callback
-json_manifest_perwalrange_callback
+json_manifest_per_file_callback
+json_manifest_per_wal_range_callback
json_ofield_action
json_scalar_action
json_struct_action
--
2.37.1 (Apple Git-137.1)
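The decision at the center of the patch below is made file by file in sendDir(). A condensed sketch of that hunk (declarations, the surrounding loop, path lookup, and the INCREMENTAL.<name> renaming are elided):

    unsigned    num_blocks_required = 0;
    unsigned    truncation_block_length = 0;
    FileBackupMethod method;

    /* Ask the incremental machinery how this relation file should be sent. */
    method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
                                 relfilenumber, relForkNum, segno,
                                 statbuf.st_size,
                                 &num_blocks_required,
                                 relative_block_numbers,
                                 &truncation_block_length);
    if (method == BACK_UP_FILE_INCREMENTALLY)
    {
        /* Send only the changed blocks, plus truncation metadata. */
        statbuf.st_size = GetIncrementalFileSize(num_blocks_required);
    }
    /* else BACK_UP_FILE_FULLY: the file is sent just as in a full backup. */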
v11-0005-Add-support-for-incremental-backup.patch
From eb179e89810240d208fed31811d0285c8ee76eb4 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v11 5/7] Add support for incremental backup.
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should we send the whole backup manifest to the server or, say,
just an LSN?
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar and Jakub Wartak.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 313 +++-
src/backend/backup/basebackup_incremental.c | 913 ++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 110 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1275 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
48 files changed, 5691 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6073b93480..d5083afc87 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4137,13 +4137,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove the one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..9ecce5f222 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -64,6 +66,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +79,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +112,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +231,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +282,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +305,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +346,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +356,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +364,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +626,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +702,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +781,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +982,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1006,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1051,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1129,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1163,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1183,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1192,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1234,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1386,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1461,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1536,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1552,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1565,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1575,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1608,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
+
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1895,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
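For reviewers, a note on the incremental file layout that sendFile() emits
above: it is three 4-byte integers - magic, block count, truncation block
length - followed by the sorted array of 4-byte block numbers and then the
block contents, which is exactly what GetIncrementalFileSize() accounts
for. Here's a minimal standalone sketch of a reader for that layout; it
assumes the default BLCKSZ of 8192 and the server's byte order, and treats
the magic value as opaque since INCREMENTAL_MAGIC's value isn't shown in
this excerpt:

/* incremental_header_demo.c - illustration only, not part of the patch */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_BLCKSZ 8192            /* assumes default block size */

int
main(int argc, char **argv)
{
    FILE       *f;
    uint32_t    hdr[3];             /* magic, block count, truncation length */
    uint32_t   *blocknos;
    uint32_t    i;

    if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;
    if (fread(hdr, sizeof(uint32_t), 3, f) != 3)
        return 1;
    printf("magic 0x%08x, %u blocks, truncation block length %u\n",
           hdr[0], hdr[1], hdr[2]);
    blocknos = malloc(sizeof(uint32_t) * hdr[1]);
    if (blocknos == NULL ||
        fread(blocknos, sizeof(uint32_t), hdr[1], f) != hdr[1])
        return 1;
    /* block i's contents start right after the header and block list */
    for (i = 0; i < hdr[1]; i++)
        printf("block %u at offset %zu\n", blocknos[i],
               sizeof(uint32_t) * (3 + (size_t) hdr[1]) +
               (size_t) i * DEMO_BLCKSZ);
    free(blocknos);
    fclose(f);
    return 0;
}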
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..303117e19e
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,913 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Check whether this file is part of the prior backup. If it isn't, back
+ * up the whole file.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, block count, truncation
+ * block length) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..b33b86671b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
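Putting the walsender pieces together, the wire flow a client follows is:
issue UPLOAD_MANIFEST on a replication connection, stream the manifest as
CopyData messages, end the COPY, then issue BASE_BACKUP with the
INCREMENTAL option. A minimal libpq sketch, with most error handling
elided; the manifest literal is a stand-in for real backup_manifest
contents, and the pg_basebackup changes below do all of this properly:

/* upload_manifest_demo.c - illustration only */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("replication=true");
    const char *manifest = "...";   /* contents of a backup_manifest */
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;
    if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
        return 1;
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COPY_IN)
        return 1;
    PQclear(res);
    PQputCopyData(conn, manifest, (int) strlen(manifest));
    PQputCopyEnd(conn, NULL);
    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        return 1;
    PQclear(res);
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);           /* drain remaining results */
    /* the backup then streams back as COPY data; handling elided */
    res = PQexec(conn, "BASE_BACKUP (INCREMENTAL)");
    PQclear(res);
    PQfinish(conn);
    return 0;
}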
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..26fd9ad0bc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental or differential backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,74 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* XXX add a server version check here */
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1995,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2351,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2389,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2414,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2449,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2865,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
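+ *
+ * For illustration only (the values here are made up), a backup_label from
+ * an incremental backup might contain lines like:
+ *
+ * START WAL LOCATION: 0/6000028 (file 000000010000000000000006)
+ * START TIMELINE: 1
+ * INCREMENTAL FROM LSN: 0/4000028
+ * INCREMENTAL FROM TLI: 1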
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
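+ *
+ * For example, given the text "0/2000028 ", this stores 0x2000028 into *lsn
+ * and leaves *c pointing at the trailing space.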
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..f2b45787e9
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
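+ /* Read and write in 50-block chunks; 400kB with the default 8kB BLCKSZ. */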
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..d06c3ffe0f
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
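+ *
+ * For example, a 10MB manifest gives an estimate of roughly 100,000
+ * entries, which is plenty close enough for a hash table sizing hint.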
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..6eb705c959
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1275 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
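+/* e.g. an incremental file "INCREMENTAL.16384" is written out as "16384" */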
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, argv + optind);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ mwriter = create_manifest_writer(opt.output);
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
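+ *
+ * For example, -T /srv/tblspc=/mnt/tblspc maps /srv/tblspc to /mnt/tblspc,
+ * while a literal equals sign can be escaped, as in -T /odd\=name=/mnt/odd.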
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
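+ *
+ * To illustrate the expected chain: given a full backup and two
+ * incrementals on the command line, the second incremental's INCREMENTAL
+ * FROM LSN/TLI must match the first incremental's start, and the first
+ * incremental's INCREMENTAL FROM LSN/TLI must match the full backup's
+ * start.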
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ /*
+ * Copy the payload rather than aliasing the manifest's copy,
+ * since it will be freed once the manifest entry is written.
+ */
+ checksum_length = mfile->checksum_length;
+ checksum_payload = pg_malloc(checksum_length);
+ memcpy(checksum_payload, mfile->checksum_payload,
+ checksum_length);
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g. 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found; manifests is a
+ * matching array of their parsed backup manifests, used to reuse existing
+ * checksums where possible.
+ *
+ * On return, *checksum_length and *checksum_payload hold the checksum of
+ * the reconstructed file, if a checksum of type checksum_type was obtained.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
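+ *
+ * For example (illustrative numbers only): if the newest incremental file
+ * stores block 7 as its third recorded block (array index 2), the loop
+ * below sets sourcemap[7] to latest_source and offsetmap[7] to
+ * header_length + 2 * BLCKSZ.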
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without ever
+ * generating any WAL for those blocks.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
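+ *
+ * The layout implied by the reads below: a magic number
+ * (INCREMENTAL_MAGIC), a block count, a truncation block length, and an
+ * array of num_blocks relative block numbers, followed by the actual
+ * block data starting at header_length.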
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
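+ /*
+ * Each range below is rendered as "block:source@offset" (or with a
+ * "start-end" block range), with "zero" for blocks to be zero-filled;
+ * e.g., with hypothetical paths, a plan might read:
+ * " 0-2:x/base/1/16384@0 3:zero 4:y/base/1/INCREMENTAL.16384@16".
+ */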
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source, except if dry-run. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'node3 has the expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
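+ *
+ * A generated entry looks roughly like this (values illustrative only):
+ *
+ * { "Path": "base/1/16384", "Size": 8192,
+ * "Last-Modified": "2023-06-14 18:30:00 GMT",
+ * "Checksum-Algorithm": "CRC32C", "Checksum": "d5be4a34" }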
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
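+ /*
+ * Each range is emitted as, e.g. (illustrative values):
+ * { "Timeline": 1, "Start-LSN": "0/2000028", "End-LSN": "0/2000100" }
+ */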
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
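To make the intended use of this API concrete, a caller inside
pg_combinebackup is expected to do roughly the following. This is only a
sketch: output_directory, size, mtime, and first_wal_range stand in for
values the real caller computes, and the file path is arbitrary.

    manifest_writer *mwriter = create_manifest_writer(output_directory);

    /* once per file that ends up in the synthetic full backup */
    add_file_to_manifest(mwriter, "base/1/1259", size, mtime,
                         CHECKSUM_TYPE_NONE, 0, NULL);

    /* writes out the WAL ranges and the manifest checksum, then closes */
    finalize_manifest(mwriter, first_wal_range);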
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
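Roughly speaking, basebackup.c is expected to consult this API once per
relation file, along these lines (a sketch only; the variables here are
illustrative, and relative_block_numbers must point to an array with
enough space for the file's blocks):

    FileBackupMethod method;
    unsigned    num_blocks;
    unsigned    truncation_block_length;

    method = GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                                 forknum, segno, size,
                                 &num_blocks, relative_block_numbers,
                                 &truncation_block_length);
    if (method == BACK_UP_FILE_INCREMENTALLY)
        incremental_size = GetIncrementalFileSize(num_blocks);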
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index c3d46c7c70..72b4ecaf12 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
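+
+For example, to initialize a node from an incremental backup "backup2"
+taken on top of a full backup "backup1" (hypothetical backup names):
+
+  $node->init_from_backup($root_node, 'backup2',
+      combine_with_prior => [ 'backup1' ]);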
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7a2807a9a3..1fa5f0ed26 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4014,3 +4014,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.37.1 (Apple Git-137.1)
Attachment: v11-0004-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From a22f5ed951d71d8b2520a8a3597b966ec158f9e9 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v11 4/7] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1383 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3726 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc35a46e5e..6073b93480 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4134,6 +4134,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
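As a concrete illustration of the two settings documented above, enabling
summarization with a 30-day retention window would look something like
this in postgresql.conf (assuming wal_summary_keep_time accepts the usual
time-unit suffixes, given that it is stored in minutes):

    summarize_wal = on
    wal_summary_keep_time = '30d'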
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a6..678495a64b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3574,6 +3575,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3853,8 +3891,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3897,6 +3935,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5221,9 +5279,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6940,6 +6998,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7614,6 +7691,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
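To show how these pieces are meant to fit together, the consuming code
(ultimately the incremental backup path) should end up doing something
like this; tli, start_lsn, and end_lsn stand in for the backup's actual
WAL range, and real code would produce a more detailed error:

    List       *wslist;
    XLogRecPtr  missing_lsn;

    wslist = GetWalSummaries(tli, start_lsn, end_lsn);
    if (!WalSummariesAreComplete(wslist, start_lsn, end_lsn, &missing_lsn))
        ereport(ERROR,
                errmsg("WAL summaries are incomplete; first missing LSN is %X/%X",
                       LSN_FORMAT_ARGS(missing_lsn)));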
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
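These make it easy to inspect what the summarizer has produced directly
from SQL, for example (the LSN arguments below are made up; in practice
you would take them from the first query's output):

    SELECT * FROM pg_available_wal_summaries();
    SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/2000100');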
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7b6b613c4a..7952fd5c4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..a083647c42
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1383 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which
+ * the next summary file will start. Normally, these are the TLI and LSN
+ * at which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of a WAL segment even though
+ * that position might fall in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;	/* ten days, in minutes */
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else gets there first, we can simply return the
+ * requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the requested information to the caller. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
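+
+/*
+ * Hypothetical usage sketch for the function above; 'target_lsn' and the
+ * ten-second timeout are illustrative values, not part of this patch:
+ *
+ *     XLogRecPtr reached;
+ *
+ *     reached = WaitForWalSummarization(target_lsn, 10000);
+ *     if (reached < target_lsn)
+ *         ereport(ERROR,
+ *                 errmsg("timed out waiting for WAL summarization"));
+ */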
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close temporary file and shut down xlogreader. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
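+
+/*
+ * Worked example of the summary file name generated above: a summary on
+ * TLI 1 covering WAL from 0/1000028 to 0/2000000 would be written as
+ * pg_wal/summaries/0000000100000000010000280000000002000000.summary,
+ * i.e. the TLI followed by the start and end LSNs, each half as %08X.
+ */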
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read, but never below 1 quantum,
+ * which is a fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
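+
+/*
+ * Illustrative trace of the adaptation above (quantum counts only; actual
+ * durations depend on MS_PER_SLEEP_QUANTUM, and the starting value of 1
+ * is assumed): on an idle system, sleep_quanta doubles from 1 to 2, 4,
+ * 8, ... until capped at MAX_SLEEP_QUANTA; if a burst then reads, say,
+ * 10 pages before the next sleep, sleep_quanta drops by 10, bottoming
+ * out at the minimum of 1.
+ */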
+
+/*
+ * Remove summary files for WAL that no longer exists on disk, once they
+ * are older than wal_summary_keep_time. This runs at most once per
+ * checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the
+ * summary file, provided that its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b764ef6998..a6de5aca0a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3191,6 +3204,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e48c066a5b..e732453daa 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -299,6 +299,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
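+
+/*
+ * Worked example of the representation above, with an illustrative block
+ * number: block 150000 falls in chunk 150000 / BLOCKS_PER_CHUNK = 2, at
+ * offset 150000 % BLOCKS_PER_CHUNK = 18928. In the array representation,
+ * that offset is stored as a single uint16; in the bitmap representation,
+ * it is bit 18928 % 16 = 0 of array element 18928 / 16 = 1183.
+ */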
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
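+
+/*
+ * Hypothetical usage sketch for the function above; 'nblocks_in_fork' is
+ * an assumed caller-supplied relation length, sized so that one call can
+ * return every modified block below it:
+ *
+ *     BlockNumber *blocks = palloc(nblocks_in_fork * sizeof(BlockNumber));
+ *     int         n;
+ *
+ *     n = BlockRefTableEntryGetBlocks(entry, 0, nblocks_in_fork,
+ *                                     blocks, nblocks_in_fork);
+ */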
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
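+
+/*
+ * Illustrative end-to-end sketch of the in-memory API; 'rlocator' and the
+ * WalSummaryIO setup are assumed, mirroring the caller in walsummarizer.c:
+ *
+ *     BlockRefTable *brtab = CreateEmptyBlockRefTable();
+ *
+ *     BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 42);
+ *     BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 10);
+ *     WriteBlockRefTable(brtab, WriteWalSummary, &io);
+ *
+ * Note that the truncation to 10 blocks forgets the mark on block 42,
+ * since setting the limit block discards equal-or-higher block numbers.
+ */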
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
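+
+/*
+ * Hypothetical read-loop sketch showing the required calling pattern; the
+ * I/O and error callbacks are assumed to be supplied by the caller:
+ *
+ *     reader = CreateBlockRefTableReader(read_callback, read_callback_arg,
+ *                                        filename, error_callback,
+ *                                        error_callback_arg);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                  lengthof(blocks))) > 0)
+ *             ... process n modified block numbers for this fork ...
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */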
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..4a6792e5f9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ef74f32693..ee55008082 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3cea73e220..7a2807a9a3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4003,3 +4003,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.37.1 (Apple Git-137.1)
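To make the new writer API easier to try out, here is a minimal sketch
(not part of the patch) of the incremental write path declared in
blkreftable.h. The OIDs are arbitrary example values, and write_cb /
write_cb_arg stand in for whatever io_callback_fn the caller supplies:

    #include "common/blkreftable.h"

    static void
    write_example_summary(io_callback_fn write_cb, void *write_cb_arg)
    {
        RelFileLocator rlocator = {1663, 5, 16384}; /* example OIDs */
        BlockRefTableWriter *writer;
        BlockRefTableEntry *entry;

        writer = CreateBlockRefTableWriter(write_cb, write_cb_arg);

        entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
        BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 0);
        BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 42);
        BlockRefTableEntrySetLimitBlock(entry, 100); /* truncated to 100 blocks */
        BlockRefTableWriteEntry(writer, entry);
        BlockRefTableFreeEntry(entry);

        /*
         * Further entries must follow in key order: tablespace, then
         * database, then relfilenumber, then fork number.
         */

        DestroyBlockRefTableWriter(writer); /* writes sentinel and CRC */
    }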
Attachment: v11-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From a27ebfaba3018f28ab1fc8e42495ad406f93055f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v11 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index a083647c42..7966755f22 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "f:iqw:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1fa5f0ed26..0565e160f2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4026,3 +4026,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.37.1 (Apple Git-137.1)
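For reference, the output format implemented above looks like this (the
OIDs and block numbers are invented for illustration, following the
printf formats in dump_one_relation):

    TS 1663, DB 5, REL 16384, FORK main: limit 100
    TS 1663, DB 5, REL 16384, FORK main: blocks 0..17
    TS 1663, DB 5, REL 16384, FORK main: block 42

With --individual each modified block gets its own line instead of being
folded into ranges, and with --quiet the file is parsed but nothing is
printed unless an error occurs.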
Attachment: v11-0007-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From 1d175bbeef3ab83e5c992a648e22c7c1cd3c671f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v11 7/7] Test patch: Enable summarize_wal by default.
To avoid test failures, must remove the prohibition against running
summarize_wal=off with wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7952fd5c4b..a804d07ce5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7966755f22..74a0116a13 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a6de5aca0a..170f491d7a 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.37.1 (Apple Git-137.1)
On 2023-Nov-16, Robert Haas wrote:
On Thu, Nov 16, 2023 at 12:23 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I don't understand this point. Currently, the protocol is that
UPLOAD_MANIFEST is used to send the manifest prior to requesting the
backup. You seem to be saying that you're thinking of removing support
for UPLOAD_MANIFEST and instead just give the LSN as an option to the
BASE_BACKUP command?I don't think I'd want to do exactly that, because then you could only
send one LSN, and I do think we want to send a set of LSN ranges with
the corresponding TLI for each. I was thinking about dumping
UPLOAD_MANIFEST and instead having a command like:INCREMENTAL_WAL_RANGE 1 2/462AC48 2/462C698
The client would execute this command one or more times before
starting an incremental backup.
That sounds good to me. Not having to parse the manifest server-side
sounds like a win, as does saving the transfer, for the cases where the
manifest is large.
Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?
Do I have it right that you want to rewrite this bit before considering
this ready to commit?
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)
On Mon, Nov 20, 2023 at 2:03 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
That sounds good to me. Not having to parse the manifest server-side
sounds like a win, as does saving the transfer, for the cases where the
manifest is large.
OK. I'll look into this next week, hopefully.
Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?
Correct. I don't see how non-adjacent LSN ranges could ever be a
useful thing, but adjacent ranges on different timelines are useful.
Do I have it right that you want to rewrite this bit before considering
this ready to commit?
For sure. I don't think this is the only thing that needs to be
revised before commit, but it's definitely *a* thing that needs to be
revised before commit.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 20, 2023 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
Is this meant to support multiple timelines each with non-overlapping
adjacent ranges, rather than multiple non-adjacent ranges?

Correct. I don't see how non-adjacent LSN ranges could ever be a
useful thing, but adjacent ranges on different timelines are useful.
Thinking about this a bit more, there are a couple of things we could
do here in terms of syntax. One idea is to give up on having a
separate MANIFEST-WAL-RANGE command for each range and instead just
cram everything into either a single command:
MANIFEST-WAL-RANGES {tli} {startlsn} {endlsn}...
Or even into a single option to the BASE_BACKUP command:
BASE_BACKUP yadda yadda INCREMENTAL 'tli@startlsn-endlsn,...'
Or, since we expect adjacent, non-overlapping ranges, you could even
arrange to elide the duplicated boundary LSNs, e.g.
MANIFEST-WAL-RANGES {{tli} {lsn}}... {final-lsn}
Or
BASE_BACKUP yadda yadda INCREMENTAL 'tli@lsn,...,final-lsn'
I'm not sure what's best here. Trying to trim out the duplicated
boundary LSNs feels a bit like rearrangement for the sake of
rearrangement, but maybe it isn't really.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I made a pass over pg_combinebackup for NLS. I propose the attached
patch.

This doesn't quite compile for me so I changed a few things and
incorporated it. Hopefully I didn't mess anything up.

Here's v11.
[..]
I wish I had better ideas about how to thoroughly test this. [..]
Hopefully the below adds some confidence; I've done some further
quick(?) checks today and the results are good:
make check-world #GOOD
test_full_pri__incr_stby__restore_on_pri.sh #GOOD
test_full_pri__incr_stby__restore_on_stby.sh #GOOD*
test_full_stby__incr_stby__restore_on_pri.sh #GOOD
test_full_stby__incr_stby__restore_on_stby.sh #GOOD*
test_many_incrementals_dbcreate.sh #GOOD
test_many_incrementals.sh #GOOD
test_multixact.sh #GOOD
test_pending_2pc.sh #GOOD
test_reindex_and_vacuum_full.sh #GOOD
test_truncaterollback.sh #GOOD
test_unlogged_table.sh #GOOD
test_across_wallevelminimal.sh # GOOD (expected failure: WAL summaries
are off under wal_level=minimal, so an incremental backup cannot be
taken --> a full backup needs to be taken after wal_level=minimal)
CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG: WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)
The remaining test idea is to have a longer running DB under stress
test in more real-world conditions and try to recover using chained
incremental backups (one such test was carried out on patchset v6 and
the result was good back then).
-J.
On Wed, Nov 22, 2023 at 3:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG: WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)
Robert pinged me to see if I had any ideas.
The reason it fails on Windows is because there is a special code path
for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
is incomplete: it returns early without feeding the data into the
checksum, so all the checksums have the same initial and bogus value.
I commented that part out so it took the normal path like Unix, and it
passed.
The reason it fails on Linux 32 bit with -fsanitize is because this
has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
when doing a size test that could easily be changed to non-overflowing
formulation. AFAICS it is not a live bug because it comes to the
right conclusion without deferencing the pointer due to other checks,
but the sanitizer is not wrong to complain about it and I will post a
patch to fix that in a new thread. With the draft patch I am testing,
the sanitizer is happy and this passes too.
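To make that concrete, here is a generic illustration of the two
formulations (made-up names, not the actual xlogreader.c code):

    /* overflow-prone: on a 32-bit build, ptr + record_len can wrap */
    if (ptr + record_len > end_of_buffer)
        return false;

    /* non-overflowing: compare sizes instead of addresses */
    if (record_len > (size_t) (end_of_buffer - ptr))
        return false;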
On Thu, Nov 23, 2023 at 11:18 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Robert pinged me to see if I had any ideas.
Thanks, Thomas.
The reason it fails on Windows is because there is a special code path
for WIN32 in the patch's src/bin/pg_combinebackup/copy_file.c, but it
is incomplete: it returns early without feeding the data into the
checksum, so all the checksums have the same initial and bogus value.
I commented that part out so it took the normal path like Unix, and it
passed.
Yikes, that's embarrassing. Thanks for running it down. There is logic
in the caller to figure out whether we need to recompute the checksum
or can reuse one we already have, but copy_file() doesn't understand
that it should take the slow path if a new checksum computation is
required.
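Something like this is probably the right shape for the fix, taking the
fast path only when no new checksum is needed (the function names here
are illustrative, not necessarily the identifiers used in copy_file.c):

    if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
    {
    #ifdef WIN32
        /* fast path: let the OS copy the file; no checksum to feed */
        copy_file_copyfile(src, dst);
        return;
    #endif
    }

    /* slow path: copy block by block, updating the checksum as we go */
    copy_file_blocks(src, dst, checksum_ctx);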
The reason it fails on Linux 32 bit with -fsanitize is because this
has uncovered a bug in xlogreader.c, which overflows a 32 bit pointer
when doing a size test that could easily be changed to non-overflowing
formulation. AFAICS it is not a live bug because it comes to the
right conclusion without dereferencing the pointer due to other checks,
but the sanitizer is not wrong to complain about it and I will post a
patch to fix that in a new thread. With the draft patch I am testing,
the sanitizer is happy and this passes too.
Thanks so much.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 15, 2023 at 9:14 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
So I've spent some more time playing with patchset v8 (without the
6/6 testing patch related to wal_level=minimal); anything tested
against patchset v9 instead is marked as such.
Thanks, as usual, for that.
2. Usability thing: I hit the timeout hard: "This backup requires WAL
to be summarized up to 0/90000D8, but summarizer has only reached
0/0." with summarize_wal=off (default), but apparently this is in the TODO.
Looks like an important usability issue.
All right. I'd sort of forgotten about the need to address that issue,
but apparently, I need to re-remember.
5. On v8 I've finally played a little bit with standby(s) and this
patchset, using a couple of basic scenarios while mixing the source of
the backups:

a. full on standby, incr1 on standby, full db restore (incl. incr1) on standby
# sometimes I'm getting spurious errors like these when doing
incrementals on standby with -c fast:
2023-11-15 13:49:05.721 CET [10573] LOG: recovery restart point
at 0/A000028
2023-11-15 13:49:07.591 CET [10597] WARNING: aborting backup due
to backend exiting before pg_backup_stop was called
2023-11-15 13:49:07.591 CET [10597] ERROR: manifest requires WAL
from final timeline 1 ending at 0/A0000F8, but this backup starts at
0/A000028
2023-11-15 13:49:07.591 CET [10597] STATEMENT: BASE_BACKUP (
INCREMENTAL, LABEL 'pg_basebackup base backup', PROGRESS,
CHECKPOINT 'fast', WAIT 0, MANIFEST 'yes', TARGET 'client')
# when you retry the same pg_basebackup it goes fine (looks like a
CHECKPOINT-on-standby/restartpoint <-> summarizer disconnect; I'll dig
deeper tomorrow. It seems that issuing "CHECKPOINT; pg_sleep(1);"
against the primary just before pg_basebackup --incr on the standby
works around it)

b. full on primary, incr1 on standby, full db restore (incl. incr1) on
standby # WORKS
c. full on standby, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS*
d. full on primary, incr1 on standby, full db restore (incl. incr1) on
primary # WORKS**

** - needs pg_promote() due to the controlfile having the standby bit
set, plus potential fiddling with postgresql.auto.conf, as it contains
the primary_conninfo GUC.
Well, "manifest requires WAL from final timeline 1 ending at
0/A0000F8, but this backup starts at 0/A000028" is a valid complaint,
not a spurious error. It's essentially saying that WAL replay for this
incremental backup would have to begin at a location that is earlier
than where replay for the earlier backup would have to end while
recovering that backup. It's almost like you're trying to go backwards
in time, with the incremental happening before the full backup instead
of after it. I think the reason this is happening is that when you
take a backup, recovery has to start from the previous checkpoint. On
the primary, we perform a new checkpoint and plan to start recovery
from it. But on a standby, we can't perform a new checkpoint, since we
can't write WAL, so we arrange for recovery of the backup to begin
from the most recent checkpoint. And if you do two backups on the
standby in a row without much happening in the middle, then the most
recent checkpoint will be the same for both. And that I think is
what's resulting in this error, because the end of the backup follows
the start of the backup, so if two consecutive backups have the same
start, then the start of the second one will precede the end of the
first one.
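To put numbers on that using the LSNs from the quoted error: both
backups arrange to begin recovery from the shared restartpoint at
0/A000028, and the earlier backup's manifest records its WAL range as
ending at 0/A0000F8. Since the second backup's start (0/A000028)
precedes the first backup's end (0/A0000F8), the start of the second
backup falls before the end of the first, which is exactly the
condition the error message is complaining about.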
One thing that's interesting to note here is that there is no point in
performing an incremental backup under these circumstances. You would
accrue no advantage over just letting replay continue further from the
full backup. The whole point of an incremental backup is that it lets
you "fast forward" your older backup -- you could have just replayed
all the WAL from the older backup until you got to the start LSN of
the newer backup, but reconstructing a backup that can start replay
from the newer LSN directly is, hopefully, quicker than replaying all
of that WAL. But in this scenario, you're starting from the same
checkpoint no matter what -- the amount of WAL replay required to
reach any given LSN will be unchanged. So storing an incremental
backup would be strictly a loss.
Another interesting point to consider is that you could also get this
complaint by doing something like taking the full backup from the
primary, and then trying to take an incremental backup from a standby,
maybe even a time-delayed standby that's far behind the primary. In
that case, you would really be trying to take an incremental backup
before you actually took the full backup, as far as LSN time goes.
I'm not quite sure what to do about any of this. I think the error is
correct and accurate, but understanding what it means and why it's
happening and what to do about it is probably going to be difficult
for people. Possibly we should have documentation that talks you
through all of this. Or possibly there are ways to elaborate on the
error message itself. But I'm a little skeptical about the latter
approach because it's all so complicated. I don't know that we can
summarize it in a sentence or two.
6. Sci-fi-mode-on: I was wondering about the dangers of e.g. having a
more recent pg_basebackup (e.g. from pg18 one day) running against
pg17 in the scope of this incremental backup possibility. Is
it going to be safe? (currently there seem to be no safeguards
against such use) Or should those things (core, pg_basebackup)
be running in version lockstep?
I think it should be safe, actually. pg_basebackup has no reason to
care about WAL format changes across versions. It doesn't even care
about the format of the WAL summaries, which it never sees, but only
needs the server to have. If we change the format of the incremental
files that are included in the backup, then we will need
backward-compatibility code, or we can disallow cross-version
operations. I don't currently foresee a need to do that, but you never
know. It's manageable in any event.
But note that I also didn't (and can't, without a lot of ugliness)
make pg_combinebackup version-independent. So you could think of
taking incremental backups with a different version of pg_basebackup,
but if you want to restore you're going to need a matching version of
pg_combinebackup.
--
Robert Haas
EDB: http://www.enterprisedb.com
New patch set.
0001: Rename JsonManifestParseContext callbacks, per feedback from
Álvaro. Not logically related to the rest of this, except by code
proximity. Separately committable, if nobody objects.
0002: Rename record_manifest_details_for_{file,wal_range}, per
feedback from Álvaro that the names were too generic. Separately
committable, if nobody objects.
0003: Move parse_manifest.c to src/common. No significant changes
since the previous version.
0004: Add a new WAL summarizer process. No significant changes since
the previous version.
0005: Incremental backup itself. Changes:
- Remove UPLOAD_MANIFEST replication command and instead add
INCREMENTAL_WAL_RANGE replication command.
- As a consequence, load_manifest.c, which was included in the previous
patch sets, now moves to src/fe_utils and has some adjustments.
- Actually document the new replication command which I overlooked previously.
- Error out promptly if an incremental backup is attempted with
summarize_wal = off.
- Fix test in copy_file(). We should be willing to use the fast-path
if a new checksum is *not* required, but the sense of the test was
inverted in previous versions (see the sketch after this patch list).
- Fix some handling of the missing-manifest case in pg_combinebackup.
- Fix terminology in a help message.
0006: Add pg_walsummary tool. No significant changes since the previous version.
0007: Test patch, not for commit.
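Since the copy_file() fix under 0005 may be easier to see in code than
in prose, here is a minimal, hypothetical sketch of the corrected
decision (the flag and helper names are invented for illustration; this
is not the actual pg_combinebackup code):

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for the real copy paths */
static void
copy_file_fast(const char *src, const char *dst)
{
	printf("fast copy (no checksum) %s -> %s\n", src, dst);
}

static void
copy_file_with_checksum(const char *src, const char *dst)
{
	printf("slow copy, feeding data into checksum: %s -> %s\n", src, dst);
}

static void
copy_file(const char *src, const char *dst, bool new_checksum_required)
{
	/*
	 * Earlier versions effectively tested "if (new_checksum_required)"
	 * here, taking the fast path exactly when it was forbidden. The fast
	 * path never feeds data through the checksum, so it is only usable
	 * when no new checksum computation is needed.
	 */
	if (!new_checksum_required)
		copy_file_fast(src, dst);
	else
		copy_file_with_checksum(src, dst);
}

int
main(void)
{
	copy_file("base/1/1234", "out/1234", true);
	copy_file("base/1/5678", "out/5678", false);
	return 0;
}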
As far as I know, the main commit-blockers here are (1) the timeout
when waiting for WAL summarization is still hard-coded to 60 seconds
and (2) the ubsan issue that Thomas hunted down, which would cause at
least the entire CF environment and maybe some portion of the BF to
turn red if this were committed. That issue is in xlogreader rather
than in this patch set, at least in part, but it still needs fixing
before this goes ahead. I also suspect that the slightly more
significant refactoring in this version may turn up a few new bugs in
the CF environment. I think that once the aforementioned items are
sorted out, this could be committed through 0005, and 0001 and 0002
could be committed sooner. 0006 should have some tests written before
it gets committed, but it doesn't necessarily have to be committed at
the exact same moment as everything else, and writing tests isn't that
hard, either.
Other loose ends that would be nice to tidy up at some point:
- Incremental JSON parsing so we can handle huge manifests.
- More documentation as proposed by Álvaro but I'm failing to find the
details of his proposal right now.
- More introspection facilities, maybe, or possibly rip some
stuff out of WalSummarizerCtl if we don't want it. This one might be a
higher priority to address before initial commit, but it's probably
not absolutely critical, either.
I'm not quite sure how aggressively to press forward with getting
stuff committed. I'd certainly rather debug as much as I can locally
and via cfbot before turning the buildfarm pretty colors, but I think
it generally works out better when larger features get pushed earlier
in the cycle rather than in the mad frenzy right before feature
freeze, so I'm not inclined to be too patient, either.
...Robert
Attachments:
v12-0002-Rename-pg_verifybackup-s-JsonManifestParseContex.patch
From f7c178b722381aab138ae422341897c6de8ef522 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:15:14 -0500
Subject: [PATCH v12 2/7] Rename pg_verifybackup's JsonManifestParseContext
callback functions.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The old names were too generic, and would have applied to any binary
that made use of JsonManifestParseContext. Rename to make the names
specific to pg_verifybackup, since there are plans afoot to reuse
this infrastructure.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/pg_verifybackup.c | 36 +++++++++++------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 8526eb9bbf..d921d0f003 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -119,15 +119,15 @@ static void parse_manifest_file(char *manifest_path,
manifest_files_hash **ht_p,
manifest_wal_range **first_wal_range_p);
-static void record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length,
- uint8 *checksum_payload);
-static void record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn,
- XLogRecPtr end_lsn);
+static void verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
static void report_manifest_error(JsonManifestParseContext *context,
const char *fmt,...)
pg_attribute_printf(2, 3) pg_attribute_noreturn();
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.per_file_cb = record_manifest_details_for_file;
- context.per_wal_range_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = verifybackup_per_file_cb;
+ context.per_wal_range_cb = verifybackup_per_wal_range_cb;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
@@ -475,10 +475,10 @@ report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
* Record details extracted from the backup manifest for one file.
*/
static void
-record_manifest_details_for_file(JsonManifestParseContext *context,
- char *pathname, size_t size,
- pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload)
+verifybackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
{
parser_context *pcxt = context->private_data;
manifest_files_hash *ht = pcxt->ht;
@@ -504,9 +504,9 @@ record_manifest_details_for_file(JsonManifestParseContext *context,
* Record details extracted from the backup manifest for one WAL range.
*/
static void
-record_manifest_details_for_wal_range(JsonManifestParseContext *context,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+verifybackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
{
parser_context *pcxt = context->private_data;
manifest_wal_range *range;
--
2.39.3 (Apple Git-145)
v12-0001-Rename-JsonManifestParseContext-callbacks.patch
From f059dd592e239f6e4d26c7d0e06bab04f5522977 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Nov 2023 13:10:01 -0500
Subject: [PATCH v12 1/7] Rename JsonManifestParseContext callbacks.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There is currently a worldwide oversupply of underscores, so use
some of them here as word separators. In the event of a later
underscore shortage, these can be removed again, and another of
PostgreSQL's innumerable methods of marking word boundaries can
be substituted.
Per suggestion from Álvaro Herrera.
---
src/bin/pg_verifybackup/parse_manifest.c | 8 ++++----
src/bin/pg_verifybackup/parse_manifest.h | 18 +++++++++---------
src/bin/pg_verifybackup/pg_verifybackup.c | 4 ++--
src/tools/pgindent/typedefs.list | 4 ++--
4 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c
index bf0227c668..850adf90a8 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/bin/pg_verifybackup/parse_manifest.c
@@ -112,7 +112,7 @@ static bool parse_xlogrecptr(XLogRecPtr *result, char *input);
*
* Caller should set up the parsing context and then invoke this function.
* For each file whose information is extracted from the manifest,
- * context->perfile_cb is invoked. In case of trouble, context->error_cb is
+ * context->per_file_cb is invoked. In case of trouble, context->error_cb is
* invoked and is expected not to return.
*/
void
@@ -545,8 +545,8 @@ json_manifest_finalize_file(JsonManifestParseState *parse)
}
/* Invoke the callback with the details we've gathered. */
- context->perfile_cb(context, parse->pathname, size,
- checksum_type, checksum_length, checksum_payload);
+ context->per_file_cb(context, parse->pathname, size,
+ checksum_type, checksum_length, checksum_payload);
/* Free memory we no longer need. */
if (parse->size != NULL)
@@ -602,7 +602,7 @@ json_manifest_finalize_wal_range(JsonManifestParseState *parse)
"could not parse end LSN");
/* Invoke the callback with the details we've gathered. */
- context->perwalrange_cb(context, tli, start_lsn, end_lsn);
+ context->per_wal_range_cb(context, tli, start_lsn, end_lsn);
/* Free memory we no longer need. */
if (parse->timeline != NULL)
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/bin/pg_verifybackup/parse_manifest.h
index 7387a917a2..001b9a6a11 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/bin/pg_verifybackup/parse_manifest.h
@@ -21,13 +21,13 @@
struct JsonManifestParseContext;
typedef struct JsonManifestParseContext JsonManifestParseContext;
-typedef void (*json_manifest_perfile_callback) (JsonManifestParseContext *,
- char *pathname,
- size_t size, pg_checksum_type checksum_type,
- int checksum_length, uint8 *checksum_payload);
-typedef void (*json_manifest_perwalrange_callback) (JsonManifestParseContext *,
- TimeLineID tli,
- XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+typedef void (*json_manifest_per_file_callback) (JsonManifestParseContext *,
+ char *pathname,
+ size_t size, pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload);
+typedef void (*json_manifest_per_wal_range_callback) (JsonManifestParseContext *,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
const char *fmt,...) pg_attribute_printf(2, 3)
pg_attribute_noreturn();
@@ -35,8 +35,8 @@ typedef void (*json_manifest_error_callback) (JsonManifestParseContext *,
struct JsonManifestParseContext
{
void *private_data;
- json_manifest_perfile_callback perfile_cb;
- json_manifest_perwalrange_callback perwalrange_cb;
+ json_manifest_per_file_callback per_file_cb;
+ json_manifest_per_wal_range_callback per_wal_range_cb;
json_manifest_error_callback error_cb;
};
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 059836f0e6..8526eb9bbf 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -440,8 +440,8 @@ parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
private_context.first_wal_range = NULL;
private_context.last_wal_range = NULL;
context.private_data = &private_context;
- context.perfile_cb = record_manifest_details_for_file;
- context.perwalrange_cb = record_manifest_details_for_wal_range;
+ context.per_file_cb = record_manifest_details_for_file;
+ context.per_wal_range_cb = record_manifest_details_for_wal_range;
context.error_cb = report_manifest_error;
json_parse_manifest(&context, buffer, statbuf.st_size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d659adbfd6..38a86575e1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3445,8 +3445,8 @@ jmp_buf
join_search_hook_type
json_aelem_action
json_manifest_error_callback
-json_manifest_perfile_callback
-json_manifest_perwalrange_callback
+json_manifest_per_file_callback
+json_manifest_per_wal_range_callback
json_ofield_action
json_scalar_action
json_struct_action
--
2.39.3 (Apple Git-145)
v12-0003-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 9c31fb418c22bb02be5bc0e3b76fb6dca58cf252 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v12 3/7] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
v12-0004-Add-a-new-WAL-summarizer-process.patch
From ab3002d1908634f24775af799c0e6c822d598441 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v12 4/7] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1383 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 31 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3726 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 94d1eb2b81..4fc5c64150 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6526bd4f43..72c7c86707 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3587,6 +3588,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3867,8 +3905,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3911,6 +3949,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5235,9 +5293,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6954,6 +7012,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7628,6 +7705,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index cae6feb356..0c15c1777d 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -161,6 +165,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7a5cd06c5c..1d52a2db7c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -562,6 +565,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -931,6 +935,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1833,6 +1840,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2657,6 +2667,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3010,6 +3022,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3128,6 +3141,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3523,6 +3550,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3673,6 +3706,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3699,6 +3734,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3796,6 +3832,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4017,6 +4054,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5364,6 +5403,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5500,6 +5543,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..a083647c42
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1383 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which the
+ * next summary file will start. Normally, these are the TLI and LSN at
+ * which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60; /* 10 days, in minutes */
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else does that first before we get the lock, then
+ * we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
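(Aside, not part of the patch: to make the fallback initialization above
concrete, XLogSegNoOffsetToRecPtr with a zero offset is just a multiply, so
the derived starting position is the byte offset at which the oldest
surviving segment begins. Something like:

    /* Illustration only: where segment 'oldest_segno' begins. */
    static XLogRecPtr
    oldest_segment_start_lsn(XLogSegNo oldest_segno, int segment_size)
    {
        /* e.g. segment 5 with 16MB segments yields LSN 0/5000000 */
        return (XLogRecPtr) oldest_segno * segment_size;
    }

That position can fall in the middle of a record that spilled over from the
previous segment, which is exactly why lsn_is_exact is false in this path.)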
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
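(Aside, not part of the patch: here's roughly how I imagine the incremental
backup side consuming this function; the 60-second timeout is invented for
the example.

    /* Illustration only: fail if summarization can't catch up in time. */
    XLogRecPtr  reached;

    reached = WaitForWalSummarization(backup_start_lsn, 60000);
    if (reached < backup_start_lsn)
        ereport(ERROR,
                (errmsg("timed out waiting for WAL summarization"),
                 errdetail("Summarization has reached %X/%X, but %X/%X is needed.",
                           LSN_FORMAT_ARGS(reached),
                           LSN_FORMAT_ARGS(backup_start_lsn))));

This is also why MAX_SLEEP_QUANTA is capped well below any plausible backup
timeout, per the comment near the top of this file.)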
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
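(Aside, not part of the patch: spelling out the naming scheme implemented
just above, each summary file name is 40 hex digits - 8 for the TLI and 16
each for the start and end LSNs. The real name handling lives in
walsummary.c, but decoding a name amounts to this sketch:

    /* Illustration only: recover TLI and LSN range from a summary name. */
    static bool
    parse_summary_name(const char *name, TimeLineID *tli,
                       XLogRecPtr *start_lsn, XLogRecPtr *end_lsn)
    {
        uint32      tmp_tli,
                    start_hi, start_lo, end_hi, end_lo;

        if (sscanf(name, "%08X%08X%08X%08X%08X.summary",
                   &tmp_tli, &start_hi, &start_lo, &end_hi, &end_lo) != 5)
            return false;
        *tli = tmp_tli;
        *start_lsn = ((XLogRecPtr) start_hi << 32) | start_lo;
        *end_lsn = ((XLogRecPtr) end_hi << 32) | end_lo;
        return true;
    }

Durably renaming the completed file into place means a crash can leave
behind at most temp.summary, never a torn final file.)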
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to reading from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
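(Aside, not part of the patch: restated as a pure function, the backoff rule
above is the following; on an idle system the wait doubles from 200ms up to
the 30-second cap, and a burst of page reads walks it back down.

    /* Illustration only: next sleep_quanta given recent page reads. */
    static long
    next_sleep_quanta(long sleep_quanta, long pages_read)
    {
        if (pages_read == 0)
            return Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
        if (pages_read > 1)
            return pages_read > sleep_quanta - 1 ? 1 : sleep_quanta - pages_read;
        return sleep_quanta;    /* exactly one page read: no change */
    }

Reading exactly one page is deliberately treated as noise rather than as a
trend.)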
+/*
+ * Remove old WAL summary files, if the summarized WAL has itself already
+ * been removed and the summaries are older than wal_summary_keep_time.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the summarized WAL no longer exists on disk, we can remove the
+ * summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cfc5afaa6f..ef2a3a2bfd 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -306,6 +306,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6474e35ec0..405c422db7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3201,6 +3214,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
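(Aside, not part of the patch: a worked example of these constants. Block
200005 falls in chunk 3 (200005 / 65536) at offset 3397 (200005 % 65536);
the array form stores that 2-byte offset directly, while a chunk that has
grown to MAX_ENTRIES_PER_CHUNK uint16s - 4096 of them, i.e. 65536 bits -
treats the same offset as a bit address:

    /* Illustration only: mark a block in a chunk that is in bitmap form. */
    static void
    bitmap_mark_block(BlockRefTableChunk chunk, BlockNumber blknum)
    {
        uint16      chunkoffset = blknum % BLOCKS_PER_CHUNK;

        chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
            (uint16) (1 << (chunkoffset % BLOCKS_PER_ENTRY));
    }

Either way, a fully populated chunk never needs more than 8kB, and sparse
chunks need far less.)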
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
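+
+/*
+ * A quick illustrative sketch (not part of the API documentation): after
+ *
+ *		BlockRefTable *brtab = CreateEmptyBlockRefTable();
+ *
+ *		BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 0);
+ *		BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);
+ *		BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 4);
+ *
+ * the table remembers block 0 as modified, has forgotten block 7 (it is
+ * greater than or equal to the limit block), and records a limit block of
+ * 4 for the fork, where 'rlocator' is the RelFileLocator of whatever
+ * relation is of interest.
+ */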
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ /* restrict stop_offset only if stop_blkno falls inside this chunk */
+ if (chunkno == stop_blkno / BLOCKS_PER_CHUNK)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
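+
+/*
+ * Note that unlike BlockRefTableReaderGetBlocks below, this function keeps
+ * no scan state, so a caller draining an entry in batches must advance
+ * start_blkno itself. A hypothetical sketch ('entry' as obtained from
+ * BlockRefTableGetEntry):
+ *
+ *		BlockNumber blocks[16];
+ *		BlockNumber start = 0;
+ *		int			n;
+ *
+ *		while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ *												InvalidBlockNumber,
+ *												blocks, 16)) > 0)
+ *		{
+ *			-- process blocks[0] .. blocks[n - 1] here --
+ *			start = blocks[n - 1] + 1;
+ *		}
+ */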
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
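+
+/*
+ * For reference, the file layout produced above is:
+ *
+ *		uint32 magic number (BLOCKREFTABLE_MAGIC)
+ *		for each relation fork, in sorted order:
+ *			BlockRefTableSerializedEntry (with trailing empty chunks trimmed)
+ *			uint16 chunk_usage[nchunks]
+ *			per-chunk data, chunk_usage[i] uint16s each (empty chunks skipped)
+ *		all-zeroes BlockRefTableSerializedEntry as a sentinel
+ *		uint32 CRC-32C of everything above
+ */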
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
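+
+/*
+ * A minimal read-side sketch (illustrative only; 'read_cb' and 'error_cb'
+ * stand for caller-provided callbacks meeting the contracts described in
+ * blkreftable.h, and 'io' for whatever state read_cb needs):
+ *
+ *		BlockRefTableReader *reader;
+ *		RelFileLocator rlocator;
+ *		ForkNumber forknum;
+ *		BlockNumber limit_block;
+ *		BlockNumber blocks[16];
+ *		unsigned n;
+ *
+ *		reader = CreateBlockRefTableReader(read_cb, &io, filename,
+ *										   error_cb, NULL);
+ *		while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *											   &limit_block))
+ *		{
+ *			while ((n = BlockRefTableReaderGetBlocks(reader, blocks, 16)) > 0)
+ *				-- process n block numbers --
+ *		}
+ *		DestroyBlockRefTableReader(reader);
+ *
+ * Unlike BlockRefTableEntryGetBlocks, this interface is stateful, so each
+ * call simply returns the next batch of block numbers.
+ */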
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
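+
+/*
+ * Putting the writer pieces together, an incremental write (which avoids
+ * holding a whole BlockRefTable in memory) might look roughly like this
+ * sketch, assuming the caller can produce relation forks in sorted order:
+ *
+ *		writer = CreateBlockRefTableWriter(write_cb, &io);
+ *		-- for each relation fork, in sorted order: --
+ *		entry = CreateBlockRefTableEntry(rlocator, forknum);
+ *		-- BlockRefTableEntryMarkBlockModified() and/or
+ *		   BlockRefTableEntrySetLimitBlock() as needed --
+ *		BlockRefTableWriteEntry(writer, entry);
+ *		BlockRefTableFreeEntry(entry);
+ *		-- end loop --
+ *		DestroyBlockRefTableWriter(writer);
+ */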
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
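+
+/*
+ * To summarize the representation used above: each chunk starts life as a
+ * small array of 2-byte block offsets, doubling as needed; once it would
+ * hold MAX_ENTRIES_PER_CHUNK offsets, it is converted to a bitmap with one
+ * bit per block in the chunk. Assuming the constants are defined so that
+ * MAX_ENTRIES_PER_CHUNK offsets occupy the same number of bytes as a full
+ * bitmap, a chunk never grows beyond that size no matter how many of its
+ * blocks are marked, while sparsely-modified chunks stay small.
+ */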
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
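+
+/*
+ * An illustrative read callback over a plain file descriptor (a sketch,
+ * not part of this header's API; 'callback_arg' is assumed to point to an
+ * int file descriptor, and error handling is frontend-style):
+ *
+ *		static int
+ *		my_read_callback(void *callback_arg, void *data, int length)
+ *		{
+ *			int		fd = *(int *) callback_arg;
+ *			int		rc = read(fd, data, length);
+ *
+ *			if (rc < 0)
+ *				pg_fatal("could not read file: %m");
+ *			return rc;
+ *		}
+ */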
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..4a6792e5f9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 38a86575e1..4d99b4b3f1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4007,3 +4007,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
Attachment: v12-0006-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From f88da16681f8defc813f844442f004e425a762cf Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v12 6/7] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, database OID, relation
+ filenode, and relation fork. For each relation fork, it stores the list
+ of blocks that were modified by WAL within the range summarized in the
+ file. It can also store a "limit block," which is 0 if the relation fork
+ was created or dropped within the relevant WAL range, and otherwise the
+ shortest length in blocks to which the relation fork was truncated. If
+ the relation fork was not created, dropped, or truncated within the
+ relevant WAL range, the limit block is undefined or infinite and will not
+ be printed by this tool.
+ </para>
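+
+ <para>
+  For example (with purely illustrative OIDs and block numbers), output
+  lines look like this:
+<screen>
+TS 1663, DB 5, REL 16384, FORK main: limit 0
+TS 1663, DB 5, REL 16384, FORK main: blocks 0..16
+</screen>
+ </para>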
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks, which keeps
+ the output brief: a relation where all blocks from 0 through 999 were
+ modified produces just one line of output rather than 1000 separate
+ lines. This option instead requests a separate line of output for
+ every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index a083647c42..7966755f22 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("no input files specified");
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e7d8cf5195..d2114ca161 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4029,3 +4029,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
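For reference, here is what a hypothetical run of the new tool might look like (file name and block numbers invented for illustration; the output format follows the printf calls in dump_one_relation above):

$ pg_walsummary 0000000100000000010000D80000000001028758.summary
TS 1663, DB 16384, REL 16385, FORK main: limit 0
TS 1663, DB 16384, REL 16385, FORK main: blocks 0..16
TS 1663, DB 16384, REL 16386, FORK main: block 7

With --individual, each block in a range would instead be listed on its own line.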
v12-0005-Add-support-for-incremental-backup.patch
From be158830e2454259534973d7bd2a76268ba82520 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v12 5/7] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
INCREMENTAL_WAL_RANGE to tell the server which WAL range(s) the manifest
for the prior backup covers. This prior backup could either be a full
backup or another incremental backup. You then use BASE_BACKUP with the
INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. Should the timeout when waiting for WAL summaries be configurable?
If it is, then the maximum sleep time for the WAL summarizer needs
to vary accordingly.
Patch by me. Thanks to Dilip Kumar and Andres Freund for some helpful
design discussions. Reviewed by Dilip Kumar, Jakub Wartak, and
Álvaro Herrera.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 28 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 706 +++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 21 +
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 55 +-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 87 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 51 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/meson.build | 37 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1312 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/fe_utils/Makefile | 1 +
src/fe_utils/load_manifest.c | 261 ++++
src/fe_utils/meson.build | 1 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 53 +
src/include/fe_utils/load_manifest.h | 69 +
src/include/nodes/replnodes.h | 12 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 11 +
51 files changed, 5451 insertions(+), 51 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/fe_utils/load_manifest.c
create mode 100644 src/include/backup/basebackup_incremental.h
create mode 100644 src/include/fe_utils/load_manifest.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
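As a concrete sketch of the incremental restore steps described above (paths invented): restore each backup into its own directory, combine them into the target data directory, then set up recovery as usual:

pg_combinebackup /restore/full /restore/incr1 /restore/incr2 -o /var/lib/postgresql/data
touch /var/lib/postgresql/data/recovery.signal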
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4fc5c64150..13212ba5d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..71ce5a9576 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,23 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-incremental-wal-range">
+ <term>
+ <literal>INCREMENTAL_WAL_RANGE</literal> <replaceable class="parameter">tli</replaceable> <replaceable class="parameter">start_lsn</replaceable> <replaceable class="parameter">end_lsn</replaceable>
+ <indexterm><primary>INCREMENTAL_WAL_RANGE</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Specifies a range of WAL records for a forthcoming incremental backup
+ request. The incremental backup will need to include all changes covered
+ by write-ahead log records on the specified timeline between the
+ specified start LSN and the specified end LSN. This command can be
+ issued multiple times before running <literal>BASE_BACKUP</literal>
+ with the <literal>INCREMENTAL</literal> option.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2855,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>INCREMENTAL_WAL_RANGE</literal> command must be executed
+ at least once before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_meanifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
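Given the format strings above, the two new backup_label lines for an incremental backup would look like this (values invented):

INCREMENTAL FROM LSN: 0/2000028
INCREMENTAL FROM TLI: 1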
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
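So trying to start a server directly on an unreconstructed incremental backup now fails up front, with output along these lines:

FATAL:  this is an incremental backup, not a data directory
HINT:  Use pg_combinebackup to reconstruct a valid data directory.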
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..b4c5b60eeb 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ opt->incremental = defGetBoolean(defel);
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied incremental backup information, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply such
+ * information, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must use INCREMENTAL_WAL_RANGE before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
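Piecing together the header writes in sendFile with the per-block read loop, the byte layout of an INCREMENTAL.* file comes out as follows (a sketch inferred from the code above, not a declared struct, and not valid C as written; the field names mirror the local variables):

unsigned    magic;                   /* INCREMENTAL_MAGIC */
unsigned    num_incremental_blocks;  /* number of block images that follow */
unsigned    truncation_block_length; /* used to handle truncation at reconstruction time */
BlockNumber blocks[num_incremental_blocks]; /* segment-relative block numbers, one per image */
/* ...followed by num_incremental_blocks page images of BLCKSZ bytes each */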
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..e3779bae47
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,706 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Add a WAL range to IncrementalBackupInfo.
+ *
+ * This means that we'll need WAL summary files that cover the corresponding
+ * WAL range, and will include blocks modified by that WAL range in the
+ * incremental backup.
+ */
+void
+AddIncrementalWalRange(IncrementalBackupInfo *ib, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ backup_wal_range *range;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ range = palloc(sizeof(backup_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AddIncrementalWalRange should already have
+ * been called once per WAL range.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ /* XXX make timeout configurable */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint, 60000);
+ if (summarized_lsn < backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeout waiting for WAL summarization"),
+ errdetail("This backup requires WAL to be summarized up to %X/%X, but summarizer has only reached %X/%X.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn))));
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
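+ *
+ * For example (hypothetical OIDs), segment 3 of relation 16385 in database
+ * 16384 would be sent as "base/16384/INCREMENTAL.16385.3" rather than
+ * "base/16384/16385.3", and segment 0 as "base/16384/INCREMENTAL.16385".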
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should have at least RELSEG_SIZE entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
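+ *
+ * A caller might use this roughly as follows (an illustrative sketch only;
+ * the variable names are not part of the API):
+ *
+ *		BlockNumber blocks[RELSEG_SIZE];
+ *		unsigned	nblocks;
+ *		unsigned	truncation_block_length;
+ *
+ *		if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
+ *								forknum, segno, size, &nblocks, blocks,
+ *								&truncation_block_length) == BACK_UP_FILE_FULLY)
+ *			... send the whole file ...
+ *		else
+ *			... send an incremental file with nblocks blocks ...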
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /*
+	 * dboid could be InvalidOid for a shared relation, but spcoid and
+	 * relfilenumber should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+	 * The free-space map fork is not properly WAL-logged, so we need to
+	 * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
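+	 *
+	 * For example, assuming the default BLCKSZ of 8192 and RELSEG_SIZE of
+	 * 131072 (a full 1GB segment), a segment with more than 117964 modified
+	 * blocks would be sent in full.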
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+	 * Looks like we can send an incremental file, so sort the absolute block
+	 * numbers and then transpose them into relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
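+	 *
+	 * For example, assuming the default BLCKSZ of 8192, an incremental file
+	 * containing two blocks occupies 3 * 4 + 2 * (4 + 8192) = 16404 bytes.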
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..6e0dd69c03 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,13 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_INCREMENTAL_WAL_RANGE
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
read_replication_slot timeline_history show
+ incremental_wal_range
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +116,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | incremental_wal_range
;
/*
@@ -307,6 +310,23 @@ timeline_history:
}
;
+incremental_wal_range:
+ K_INCREMENTAL_WAL_RANGE UCONST RECPTR RECPTR
+ {
+ IncrementalWalRangeCmd *cmd = makeNode(IncrementalWalRangeCmd);
+
+ if ($2 <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("invalid timeline %u", $2)));
+
+ cmd->tli = $2;
+ cmd->start_lsn = $3;
+ cmd->end_lsn = $4;
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +431,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_INCREMENTAL_WAL_RANGE { $$ = "incremental_wal_range"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..8927587cfb 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+INCREMENTAL_WAL_RANGE { return K_INCREMENTAL_WAL_RANGE; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_INCREMENTAL_WAL_RANGE:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..7d9c63a925 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,16 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the INCREMENTAL_WAL_RANGE command is used to prepare for an incremental
+ * backup request, the information thus provided will be stored in
+ * session_ib_info, and session_ib_mcxt will point to the memory context that
+ * contains that object and all of its subordinate data. Otherwise, both
+ * values will be NULL.
+ */
+static IncrementalBackupInfo *session_ib_info = NULL;
+static MemoryContext session_ib_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +244,7 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void IncrementalWalRange(IncrementalWalRangeCmd *cmd);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -326,6 +338,13 @@ WalSndErrorCleanup(void)
ReplicationSlotCleanup();
+ if (session_ib_mcxt != NULL)
+ {
+ MemoryContextDelete(session_ib_mcxt);
+ session_ib_info = NULL;
+ session_ib_mcxt = NULL;
+ }
+
replication_active = false;
/*
@@ -660,6 +679,26 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle INCREMENTAL_WAL_RANGE command.
+ */
+static void
+IncrementalWalRange(IncrementalWalRangeCmd *cmd)
+{
+ /* Create a new IncrementalBackupInfo if we don't have one yet. */
+ if (session_ib_mcxt == NULL)
+ {
+ session_ib_mcxt = AllocSetContextCreate(CacheMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ session_ib_info = CreateIncrementalBackupInfo(session_ib_mcxt);
+ }
+
+ /* Add the information to our IncrementalBackupInfo. */
+ AddIncrementalWalRange(session_ib_info, cmd->tli, cmd->start_lsn,
+ cmd->end_lsn);
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1840,13 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, session_ib_info);
+ if (session_ib_mcxt != NULL)
+ {
+ MemoryContextDelete(session_ib_mcxt);
+ session_ib_info = NULL;
+ session_ib_mcxt = NULL;
+ }
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +1908,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_IncrementalWalRangeCmd:
+ cmdtag = "INCREMENTAL_WAL_RANGE";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ IncrementalWalRange((IncrementalWalRangeCmd *) cmd_node);
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index a3d8eacb8d..3a6729003a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -136,6 +137,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -291,6 +293,7 @@ CreateSharedMemoryAndSemaphores(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..cb13c1f887 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -32,6 +32,7 @@
#include "common/file_perm.h"
#include "common/file_utils.h"
#include "common/logging.h"
+#include "fe_utils/load_manifest.h"
#include "fe_utils/option_utils.h"
#include "fe_utils/recovery_gen.h"
#include "getopt_long.h"
@@ -101,6 +102,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +223,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +397,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +697,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+	 * For newer server versions, likewise create pg_wal/summaries.
+	 */
+	if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+		if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1754,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1822,50 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+	 * If the user wants an incremental backup, we must send the server the
+	 * WAL ranges from the manifest of the previous backup upon which it is
+	 * to be based.
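+	 *
+	 * Each range becomes one INCREMENTAL_WAL_RANGE command; with
+	 * hypothetical values, such a command might look like this:
+	 *
+	 *		INCREMENTAL_WAL_RANGE 1 0/2000028 0/3000100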
+ */
+ if (incremental_manifest != NULL)
+ {
+ manifest_data *imdata;
+ manifest_wal_range *wal_range;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Extract required WAL ranges from the manifest. */
+ imdata = load_backup_manifest(incremental_manifest,
+ LBM_WAL_RANGES);
+
+ /* Send WAL ranges from manifest to server. */
+ for (wal_range = imdata->first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ {
+ char *query;
+
+ query = psprintf("INCREMENTAL_WAL_RANGE %u %X/%X %X/%X",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+ res = PQexec(conn, query);
+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not execute query \"%s\": %s",
+ query, PQerrorMessage(conn));
+ else
+ pg_fatal("could not execute query \"%s\": unexpected status %s",
+ query, PQresStatus(PQresultStatus(res)));
+ }
+ }
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1972,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2328,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2366,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2391,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2426,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2842,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..cd8bbdd275
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,51 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
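+ *
+ * For illustration, an input file might contain lines like these
+ * (hypothetical values):
+ *
+ *		START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ *		START TIMELINE: 1
+ *		INCREMENTAL FROM LSN: 0/1000028
+ *		INCREMENTAL FROM TLI: 1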
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, a pointer to the byte
+ * following the match is stored into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. Returns true on
+ * success, false otherwise. On success, stores the result into *lsn and sets
+ * *c to the first character that is not part of the LSN.
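+ *
+ * For example, the string "0/2000028" parses to the LSN 0x0000000002000028.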
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. Returns true on
+ * success, false otherwise. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..723811b5ba
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,37 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..4a753f4a3e
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1312 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+	while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
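+ *
+ * For example, a hypothetical argument "/mnt/old\=dir=/mnt/new" yields
+ * old_dir "/mnt/old=dir" and new_dir "/mnt/new": the backslash turns the
+ * first "=" into a literal character rather than a separator.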
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
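+ *
+ * The property verified here: for each backup after the first, the
+ * previous-backup TLI and LSN recorded in its backup_label must match
+ * the start TLI and LSN of the backup that precedes it on the command
+ * line.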
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+static manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+ const int flags = LBM_FILES | LBM_WAL_RANGES | LBM_MISSING_OK;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ {
+ char *pathname;
+
+ pathname = psprintf("%s/backup_manifest", backup_directories[i]);
+ result[i] = load_backup_manifest(pathname, flags);
+ pfree(pathname);
+ }
+
+ return result;
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
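+ *
+ * For example, files under relative path "base/5" of the main data
+ * directory are looked up in the manifest as "base/5/<name>", while
+ * files at the top level of a hypothetical tablespace 16384 are looked
+ * up as "pg_tblspc/16384/<name>".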
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
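+ *
+ * Incremental files carry an "INCREMENTAL." name prefix, so a
+ * hypothetical "INCREMENTAL.16384" in the input directory becomes
+ * "16384" in the output directory and in the manifest.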
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format (e.g., if PG_VERSION contains "14\n" this function
+ * will return 140000).
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version numbers (e.g., 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old", filename);
+ pg_fatal("%s: could not parse version number", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
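+ *
+ * For instance, if block 3 of the output should come from byte offset
+ * 24576 of a full file in some prior backup, then sourcemap[3] points to
+ * that file's rfile and offsetmap[3] is 24576 (3 * BLCKSZ, assuming the
+ * default 8kB block size).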
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * subsequently doing anything to those blocks that would have
+ * generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
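+ *
+ * For example, with a truncation_block_length of 10 and a stored block 12,
+ * the reconstructed file is 13 blocks long; blocks 10 and 11, not being
+ * sourced from anywhere, end up zero-filled.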
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
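+ *
+ * As the reads below imply, an incremental file starts with a magic number,
+ * a block count, and a truncation block length, followed by one BlockNumber
+ * per stored block; the stored blocks themselves come after the header.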
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
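+ /*
+ * Each plan entry below has the form "START[-END]:SOURCE@OFFSET" for
+ * blocks read from a source file, or "START[-END]:zero" for blocks
+ * that will be zero-filled; the offset shown is that of the last
+ * block in the range.
+ */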
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..5c0fb8ceef
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "fe_utils/load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+SELECT string_agg(a::text, ':'), string_agg(b, ':') FROM mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant', 'found expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that the no-checksum manifest does not mention a checksum algorithm.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..945e0d0dde
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "fe_utils/load_manifest.h"
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/fe_utils/Makefile b/src/fe_utils/Makefile
index 8accd5906d..b8d5428380 100644
--- a/src/fe_utils/Makefile
+++ b/src/fe_utils/Makefile
@@ -24,6 +24,7 @@ OBJS = \
cancel.o \
conditional.o \
connect_utils.o \
+ load_manifest.o \
mbprint.o \
option_utils.o \
parallel_slot.o \
diff --git a/src/fe_utils/load_manifest.c b/src/fe_utils/load_manifest.c
new file mode 100644
index 0000000000..2540092e1d
--- /dev/null
+++ b/src/fe_utils/load_manifest.c
@@ -0,0 +1,261 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/fe_utils/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "fe_utils/load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void loadmanifest_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void loadmanifest_per_file_noop_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void loadmanifest_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void loadmanifest_per_wal_range_noop_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *pathname, int flags)
+{
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT && (flags & LBM_MISSING_OK) != 0)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ if ((flags & LBM_FILES) != 0)
+ context.per_file_cb = loadmanifest_per_file_cb;
+ else
+ context.per_file_cb = loadmanifest_per_file_noop_cb;
+ if ((flags & LBM_WAL_RANGES) != 0)
+ context.per_wal_range_cb = loadmanifest_per_wal_range_cb;
+ else
+ context.per_wal_range_cb = loadmanifest_per_wal_range_noop_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+loadmanifest_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Do nothing at all for each file in the backup manifest.
+ */
+static void
+loadmanifest_per_file_noop_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ /* do nothing */
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+loadmanifest_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Do nothing at all for each WAL range in the backup manifest.
+ */
+static void
+loadmanifest_per_wal_range_noop_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ /* do nothing */
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/fe_utils/meson.build b/src/fe_utils/meson.build
index ea96e862ad..350bb6b55f 100644
--- a/src/fe_utils/meson.build
+++ b/src/fe_utils/meson.build
@@ -5,6 +5,7 @@ fe_utils_sources = files(
'cancel.c',
'conditional.c',
'connect_utils.c',
+ 'load_manifest.c',
'mbprint.c',
'option_utils.c',
'parallel_slot.c',
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..105df90681
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,53 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AddIncrementalWalRange(IncrementalBackupInfo *ib, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif /* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/fe_utils/load_manifest.h b/src/include/fe_utils/load_manifest.h
new file mode 100644
index 0000000000..311fd64db4
--- /dev/null
+++ b/src/include/fe_utils/load_manifest.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/fe_utils/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+#define LBM_FILES 0x0001
+#define LBM_WAL_RANGES 0x0002
+#define LBM_MISSING_OK 0x0004
+
+extern manifest_data *load_backup_manifest(char *pathname, int flags);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..fb0c72717f 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,16 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * INCREMENTAL_WAL_RANGE command
+ * ----------------------
+ */
+typedef struct IncrementalWalRangeCmd
+{
+ NodeTag type;
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} IncrementalWalRangeCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4d99b4b3f1..e7d8cf5195 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4018,3 +4018,14 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+IncrementalWalRangeCmd
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
Attachment: v12-0007-Test-patch-Enable-summarize_wal-by-default.patch
From cee201dcc7aa51e47ce734af95264a48b9f79d88 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v12 7/7] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal enabled when wal_level=minimal, because a bunch of
tests run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1d52a2db7c..37c711230b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -935,9 +935,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7966755f22..74a0116a13 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 405c422db7..d18f93d3c8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
On Thu, Nov 30, 2023 at 9:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 0005: Incremental backup itself. Changes:
> - Remove UPLOAD_MANIFEST replication command and instead add
> INCREMENTAL_WAL_RANGE replication command.
Unfortunately, I think this change is going to need to be reverted.
Jakub reported a problem to me off-list, which I think boils down
to this: take a full backup on the primary. create a database on the
primary. now take an incremental backup on the standby using the full
backup from the master as the prior backup. What happens at this point
depends on how far replay has progressed on the standby. I think there
are three scenarios: (1) If replay has not yet reached a checkpoint
later than the one at which the full backup began, then taking the
incremental backup will fail. This is correct, because it makes no
sense to take an incremental backup that goes backwards in time, and
it's pointless to take one that goes forwards but not far enough to
reach the next checkpoint, as you won't save anything. (2) If replay
has progressed far enough that the redo pointer is now beyond the
CREATE DATABASE record, then everything is fine. (3) But if the redo
pointer for the backup is a later checkpoint than the one from which
the full backup started, but also before the CREATE DATABASE record,
then the new database's files exist on disk, but are not mentioned in
the WAL summary, which covers all LSNs from the start of the prior
backup to the start of this one. Here, the start of the backup is
basically the LSN from which replay will start, and since the database
was created after that, those changes aren't in the WAL summary. This
means that we think the file is unchanged since the prior backup, and
so back up no blocks at all. But now we have an incremental file for a
relation for which no full file is present in the prior backup, and
we're in big trouble.
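To make the failing sequence concrete, here's a rough TAP-style sketch
of Jakub's scenario, using the same Cluster.pm helpers as the attached
tests. This is an illustration only, not one of the attached tests; the
node and backup names are mine, and whether you land in scenario (2) or
(3) depends on how far the standby's replay has progressed:

my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'summarize_wal = on');
$primary->start;
$primary->backup('seed');
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup($primary, 'seed', has_streaming => 1);
$standby->start;
# Step 1: full backup on the primary.
my $full = $primary->backup_dir . '/full';
$primary->command_ok(
	[ 'pg_basebackup', '-D', $full, '--no-sync', '-cfast' ],
	'full backup on primary');
# Step 2: create a database on the primary.
$primary->safe_psql('postgres', 'CREATE DATABASE created_later');
# Step 3: incremental backup on the standby, based on the primary's full
# backup. In scenario (3), the standby's redo pointer is past the full
# backup's checkpoint but before the CREATE DATABASE record, so the new
# database's files are on disk yet absent from the WAL summary, and they
# get sent incrementally with no prior full copy to apply them to.
my $incr = $standby->backup_dir . '/incr';
$standby->command_ok(
	[ 'pg_basebackup', '-D', $incr, '--no-sync', '-cfast',
	  '--incremental', $full . '/backup_manifest' ],
	'incremental backup on standby');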
If my analysis is correct, this bug should be new in v12. In v11 and
prior, I think that we always included every file that didn't appear
in the prior manifest in full. I didn't really know why I was
doing that, which is why I was willing to rip it out and thus remove
the need for the manifest, but now I think it was actually preventing
exactly this problem. The issue, in general, is files that get
created after the start of the backup. By that time, the WAL summary
that drives the backup has already been built, so it doesn't know
anything about the new files. That would be fine if we either (A)
omitted those new files from the backup completely, since replay would
recreate them or (B) backed them up in full, so that there was nothing
relying on them being there in the earlier backup. But an incremental
backup of such a file is no good.
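Put differently, the manifest buys us a one-line guard. Here's a hedged
Perl sketch of the v11-era rule being restored (not the server code;
$prior_files stands in for the hash table that load_backup_manifest
builds from the prior backup's manifest):

# A file the prior backup's manifest has never heard of must be sent in
# full, no matter what the WAL summary does or doesn't say about it.
sub file_backup_method
{
	my ($prior_files, $path) = @_;
	return 'full' unless exists $prior_files->{$path};
	# Only for files the prior backup actually contains is it safe to
	# consult the WAL summary and send just the modified blocks.
	return 'incremental';
}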
Then I started worrying about whether there were problems in cases
where a file was dropped and recreated with the same name. I *think*
it's OK. If file F is dropped and recreated after being copied into
the full backup but before being copied into the incremental backup,
then there are basically two cases. First, F might be dropped before
the start LSN of the incremental backup; if so, we'll know from the
WAL summary that the limit block is 0 and back up the whole thing.
Second, F might be dropped after the start LSN of the incremental
backup and before it's actually copied. In that case, we won't know
when backing up the file that it was dropped and recreated, so we'll
back it up incrementally as if that hadn't happened. That's OK as long
as reconstruction doesn't fail, because WAL replay will again drop and
recreate F. And I think reconstruction won't fail: blocks that are in
the incremental file will be taken from there, blocks in the prior
backup file will be taken from there, and blocks in neither place will
be zero-filled. The result is logically incoherent, but replay will
nuke the file anyway, so whatever.
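Spelled out, the per-block choice during reconstruction is a simple
three-way rule. A sketch in Perl pseudocode (the real logic is C inside
pg_combinebackup and operates on 8kB blocks, but the decision is the
same):

# $incr_blocks holds the block numbers present in the incremental file;
# $prior_nblocks is the length, in blocks, of the prior backup's copy.
sub reconstruction_source
{
	my ($blkno, $incr_blocks, $prior_nblocks) = @_;
	return 'incremental file' if exists $incr_blocks->{$blkno};
	return 'prior backup file' if $blkno < $prior_nblocks;
	return 'zero-filled';    # replay is expected to fix up the rest
}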
It bugs me a bit that we don't obey the WAL-before-data rule with file
creation, e.g. RelationCreateStorage does smgrcreate() and then
log_smgrcreate(). So in theory we could see a file on disk for which
nothing has been logged yet; it could even happen that the file gets
created before the start LSN of the backup and the log record gets
written afterward. It seems like it would be far more comfortable to
swap the order there, so that if it's on disk, it's definitely in the
WAL. But I haven't yet been able to think of a scenario in which the
current ordering causes a real problem. If we backup a stray file in
full (or, hypothetically, if we skipped it entirely) then nothing will
happen that can't already happen today with full backup; any problems
we end up having are, I think, not new problems. It's only when we
back up a file incrementally that we need to be careful, and the
analysis is basically the same as before ... whatever we put into an
incremental file will cause *something* to get reconstructed except
when there's no prior file at all. Having the manifest for the prior
backup lets us avoid the incremental-with-no-prior-file scenario. And
as long as *something* gets reconstructed, I think WAL replay will fix
up the rest.
Considering all this, what I'm inclined to do is go and put
UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
accordingly. But first: does anybody see more problems here that I may
have missed?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Considering all this, what I'm inclined to do is go and put
> UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
> accordingly. But first: does anybody see more problems here that I may
> have missed?
OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.
This version also improves (at least, IMHO) the way that we wait for
WAL summarization to finish. Previously, you either caught up fully
within 60 seconds or you died. I didn't like that, because it seemed
like some people would get timeouts when the operation was slowly
progressing and would eventually succeed. So what this version does
is:
- Every 10 seconds, it logs a warning saying that it's still waiting
for WAL summarization. That way, a human operator can understand
what's happening easily, and cancel if they want.
- If 60 seconds go by without the WAL summarizer ingesting even a
single WAL record, it times out. That way, if the WAL summarizer is
dead or totally stuck (e.g. debugger attached, hung I/O) the user
won't be left waiting forever even if they never cancel. But if it's
just slow, it probably won't time out, and the operation should
eventually succeed.
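In pseudocode, the waiting policy looks something like this. It's only a
sketch of the control flow, not the backend C code; summarized_lsn() is
an invented stand-in for asking the summarizer how far it has gotten:

sub wait_for_wal_summarization
{
	my ($target_lsn) = @_;
	my $progress_lsn  = summarized_lsn();
	my $progress_time = time();
	my $warned_time   = time();
	while (summarized_lsn() < $target_lsn)
	{
		if (summarized_lsn() > $progress_lsn)
		{
			# Ingesting even one WAL record resets the 60-second clock.
			$progress_lsn  = summarized_lsn();
			$progress_time = time();
		}
		elsif (time() - $progress_time >= 60)
		{
			die "WAL summarization is not progressing";
		}
		if (time() - $warned_time >= 10)
		{
			warn "still waiting for WAL summarization to finish";
			$warned_time = time();
		}
		sleep(1);
	}
}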
To me, this seems like a reasonable compromise. It might be
unreasonable if WAL summarization is proceeding at a very low but
non-zero rate. But it's hard for me to think of a situation where that
will happen, with the exception of when CPU or I/O are badly
overloaded. But in those cases, the WAL generation rate is probably
also not that high, because apparently the system is paralyzed, so
maybe the wait won't even be that bad, especially given that
everything else on the box should be super-slow too. Plus, even if we
did want to time out in such a case, it's hard to know how slow is too
slow. In any event, I think most failures here are likely to be
complete failures, where the WAL summarizer just doesn't run, so the fact
that this times out in those cases seems to me likely to be as much as
we need to do here. But if someone sees a problem with this or has a
clever idea how to make it better, I'm all ears.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v13-0001-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From d7ca8e0a1869688e48b8dd4844c819e80ef48f4c Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v13 1/5] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
Attachment: v13-0004-Add-new-pg_walsummary-tool.patch
From f44fd4c4f6479cf936ec3e667fd8ef0f3041b195 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v13 4/5] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, database OID, relation
+ OID, and relation fork. For each relation fork, it stores the list of
+ blocks that were modified by WAL within the range summarized in the file.
+ It can also store a "limit block," which is 0 if the relation fork was
+ created or destroyed within the relevant WAL range, and otherwise the
+ shortest length to which the relation fork was truncated. If the relation
+ fork was not created, destroyed, or truncated within the relevant WAL
+ range, the limit block is undefined or infinite and will not be printed
+ by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 02ed8ee6f5..524a671ca4 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -338,7 +338,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 48c2f6c56f..7e6d11f4a8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4030,3 +4030,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
Attachment: v13-0005-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From e94695da40330b3f31bc8fc956bc02873d8fa69a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v13 5/5] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on while wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index cbdbb6cdae..7c7ddca33e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -940,9 +940,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 524a671ca4..8d903283d8 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 405c422db7..d18f93d3c8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1796,7 +1796,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
Attachment: v13-0002-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From 0bbade2baaf9a232c0044311a25117e56762f085 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v13 2/5] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1386 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 ++++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 32 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3730 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 94d1eb2b81..4fc5c64150 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6526bd4f43..72c7c86707 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3587,6 +3588,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3867,8 +3905,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3911,6 +3949,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5235,9 +5293,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6954,6 +7012,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7628,6 +7705,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ae31d66930..cbdbb6cdae 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -115,6 +115,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -252,6 +253,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -443,6 +445,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -567,6 +570,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -936,6 +940,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1838,6 +1845,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2662,6 +2672,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3015,6 +3027,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3133,6 +3146,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3528,6 +3555,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3678,6 +3711,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3704,6 +3739,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3801,6 +3837,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4022,6 +4059,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5329,6 +5368,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5465,6 +5508,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..02ed8ee6f5
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1386 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the last LSN and TLI at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200ms = 30s). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
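+
+/*
+ * For example, on a fully idle system, successive sleeps last 200ms, 400ms,
+ * 800ms, and so on as sleep_quanta doubles, until the 150-quantum cap
+ * (thirty seconds) is reached; see summarizer_wait_for_wal().
+ */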
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
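+/* default is 10 days, expressed in minutes (the GUC uses GUC_UNIT_MIN) */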
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Initially, we acquire the lock in shared mode and try to fetch the
+ * required information. If the data structure hasn't been initialized, we
+ * reacquire the lock in exclusive mode so that we can initialize it.
+ * However, if someone else gets there first and initializes it before we
+ * reacquire the lock, then we can just return the requested information
+ * after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ /* Clean up after any ConditionVariableTimedSleep we may have started. */
+ ConditionVariableCancelSleep();
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
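+
+ /*
+ * For example, a summary on TLI 1 covering 0/01000028 through 0/01000150
+ * is named 0000000100000000010000280000000001000150.summary.
+ */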
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %u from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove WAL summary files older than wal_summary_keep_time, provided the
+ * WAL they summarize has itself already been removed. We do this at most
+ * once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the WAL this summary describes no longer exists, we can remove the
+ * summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 6474e35ec0..405c422db7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -704,6 +705,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1787,6 +1790,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3201,6 +3214,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in a block reference in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
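+
+/*
+ * As an example of the scheme above, block 150000 of a given fork falls in
+ * chunk 2 (150000 / BLOCKS_PER_CHUNK) and is represented there as offset
+ * 18928 (150000 % BLOCKS_PER_CHUNK), either as one uint16 array element or
+ * as one bit in the chunk's bitmap.
+ */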
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status; /* status field required by simplehash.h */
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
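+
+/*
+ * Instantiating simplehash.h with the settings above generates the
+ * blockreftable_hash type along with functions such as blockreftable_create,
+ * blockreftable_insert, and blockreftable_lookup that are used below.
+ */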
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ {
+ unsigned stop_mod = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * If stop_blkno is an exact multiple of BLOCKS_PER_CHUNK, the
+ * whole final chunk is within range, so leave stop_offset at
+ * BLOCKS_PER_CHUNK rather than clamping it to 0.
+ */
+ if (stop_mod != 0)
+ stop_offset = stop_mod;
+ }
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
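+ *
+ * On-disk layout, as implemented below: a 4-byte magic number; then, for
+ * each entry (sorted by tablespace, then database, then relfilenumber,
+ * then fork number), a BlockRefTableSerializedEntry followed by its
+ * chunk usage array and the used portion of each chunk; and finally an
+ * all-zeroes sentinel entry followed by a CRC-32C of everything that
+ * precedes it.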
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
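+ *
+ * As implemented below, each chunk begins life as a small array of 2-byte
+ * block offsets that doubles in size as needed; once a chunk would reach
+ * MAX_ENTRIES_PER_CHUNK entries, it is converted to a bitmap with one bit
+ * per block, after which marking a block just sets the corresponding bit.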
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..79c8f86d89 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12100,4 +12100,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
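+ *
+ * As an illustration: if a relation fork is truncated from 1000 blocks to
+ * 100, the limit block number becomes 100, any recorded modifications to
+ * blocks 100 and above are forgotten, and every block from 100 upward
+ * must be treated as modified by consumers of the table.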
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
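+
+/*
+ * As an illustration only (not part of this patch), a conforming frontend
+ * write callback might look like this, treating short writes as fatal so
+ * that it always returns the full request length:
+ *
+ *     static int
+ *     my_write_cb(void *callback_arg, void *data, int length)
+ *     {
+ *         FILE *f = (FILE *) callback_arg;
+ *
+ *         if (fwrite(data, 1, length, f) != (size_t) length)
+ *             pg_fatal("could not write block reference table: %m");
+ *         return length;
+ *     }
+ */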
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
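+
+/*
+ * Illustrative reading loop (not part of this patch; my_read_cb,
+ * my_error_cb, and process_modified_blocks are placeholders):
+ *
+ *     BlockRefTableReader *reader;
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block;
+ *     BlockNumber blocks[256];
+ *     unsigned nblocks;
+ *
+ *     reader = CreateBlockRefTableReader(my_read_cb, my_read_arg, filename,
+ *                                        my_error_cb, my_error_arg);
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                        lengthof(blocks))) > 0)
+ *             process_modified_blocks(rlocator, forknum, blocks, nblocks);
+ *     }
+ *     DestroyBlockRefTableReader(reader);
+ */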
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
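+
+/*
+ * Illustrative writing sequence (not part of this patch; my_write_cb is a
+ * placeholder, and entries must be supplied in the sort order described
+ * above):
+ *
+ *     BlockRefTableWriter *writer;
+ *     BlockRefTableEntry *entry;
+ *
+ *     writer = CreateBlockRefTableWriter(my_write_cb, my_write_arg);
+ *     entry = CreateBlockRefTableEntry(rlocator, forknum);
+ *     BlockRefTableEntryMarkBlockModified(entry, forknum, blknum);
+ *     BlockRefTableWriteEntry(writer, entry);
+ *     BlockRefTableFreeEntry(entry);
+ *     DestroyBlockRefTableWriter(writer);
+ */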
+
+#endif /* BLKREFTABLE_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f0cc651435..ab8f47379a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -340,6 +340,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -446,6 +447,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -458,6 +460,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..ebc95bd326
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 38a86575e1..4d99b4b3f1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4007,3 +4007,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
v13-0003-Add-support-for-incremental-backup.patch
From 4373fc739fe73ce43487328a6b4d49e031f0a4c2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v13 3/5] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5822 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
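+
+  <para>
+   For example, assuming a full backup already exists in
+   <filename>/backups/full</filename> (the directory names here are purely
+   illustrative), an incremental backup based on it could be taken with:
+<programlisting>
+pg_basebackup -D /backups/incr --incremental=/backups/full/backup_manifest
+</programlisting>
+  </para>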
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or
+ changes only slowly. For a small database, it's easier to ignore the
+ existence of incremental backups and just take full backups, which are
+ simpler to manage. For a large database that is heavily modified
+ throughout, incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4fc5c64150..13212ba5d9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
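+
+ <para>
+  For example (directory names here are purely illustrative), a full
+  backup in <filename>full</filename> followed by two incremental backups
+  in <filename>incr1</filename> and <filename>incr2</filename> could be
+  combined into <filename>outdir</filename> with:
+<programlisting>
+pg_combinebackup full incr1 incr2 -o outdir
+</programlisting>
+ </para>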
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the output directory. The search for files will follow
+ symbolic links for the WAL directory and each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ output directory. In that case, <command>pg_combinebackup</command>
+ will also synchronize the file systems that contain the WAL files and
+ each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
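
For illustration, a backup_label written for an incremental backup would then
contain two extra lines along these lines (the LSN and TLI values here are
hypothetical):

INCREMENTAL FROM LSN: 0/D000028
INCREMENTAL FROM TLI: 1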
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..7d2501274e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
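+ /*
+ * (Illustrative flow, stated here rather than enforced: the client issues
+ * UPLOAD_MANIFEST and transfers the manifest from the prior backup, then
+ * issues BASE_BACKUP with the incremental option; pg_basebackup's
+ * --incremental mode is expected to perform both steps automatically.)
+ */
+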
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
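+
+/*
+ * Usage sketch (illustrative only; it mirrors the header-emission code in
+ * sendFile above): push a few fields, then perform the final flush that is
+ * left to the caller, feeding any flushed bytes to the checksum:
+ *
+ *		size_t	done = 0;
+ *
+ *		push_to_sink(sink, &checksum_ctx, &done, &magic, sizeof(magic));
+ *		push_to_sink(sink, &checksum_ctx, &done, blocks,
+ *					 nblocks * sizeof(BlockNumber));
+ *		if (done > 0)
+ *		{
+ *			bbsink_archive_contents(sink, done);
+ *			if (pg_checksum_update(&checksum_ctx, (uint8 *) sink->bbs_buffer,
+ *								   done) < 0)
+ *				elog(ERROR, "could not update checksum");
+ *		}
+ */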
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
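+
+/*
+ * Rough arithmetic on the "1 bit per block" claim above (illustrative): at
+ * 8kB blocks, a 1TB database has about 134 million blocks, so even with
+ * nearly every block modified the in-memory table should converge to
+ * something in the neighborhood of 16MB.
+ */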
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to the remaining time, at most 10 seconds. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
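+
+/*
+ * For example (hypothetical OIDs, default tablespace): dboid 16384,
+ * relfilenumber 16385, main fork, segno 0 yields
+ * "base/16384/INCREMENTAL.16385", while segno 1 yields
+ * "base/16384/INCREMENTAL.16385.1".
+ */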
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them into relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
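+ *
+ * For example, with the default 8kB block size, an incremental file
+ * containing two blocks occupies 3 * 4 + 2 * (4 + 8192) = 16404 bytes.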
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
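+ *
+ * We can't just return the difference of the two values, because BlockNumber
+ * is unsigned and the difference might not fit in the int return type.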
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
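+ *
+ * The client side of this exchange can be seen in pg_basebackup's
+ * BaseBackup(): it sends UPLOAD_MANIFEST, waits for the CopyInResponse we
+ * generate below, streams the manifest as a series of CopyData messages,
+ * and finishes with CopyDone.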
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
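+ /* (format byte 0 = text; zero columns, so no per-column format codes) */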
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY IN mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2225a4a6e6..3828d1dc16 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,6 +31,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -138,6 +139,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -334,6 +336,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by a LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
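+ *
+ * For illustration, the lines of interest in a typical backup_label for an
+ * incremental backup look something like this:
+ *
+ *     START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ *     START TIMELINE: 1
+ *     INCREMENTAL FROM LSN: 0/1000028
+ *     INCREMENTAL FROM TLI: 1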
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ *
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..63dcbf329d
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/*
+ * Incremental file naming convention: the incremental version of a relation
+ * file has the same name with this prefix attached, e.g. an incremental
+ * copy of base/1/16384 is shipped as base/1/INCREMENTAL.16384.
+ */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNPT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line. Note that the array set up
+ * here also has the final backup as its last element; only the first
+ * n_prior_backups entries are prior backups, but keeping the final backup
+ * at the end of the same array is convenient when loading the manifests
+ * for all of the backups just below.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
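+ *
+ * For example, the argument "/old\=dir=/new" yields old_dir "/old=dir"
+ * and new_dir "/new".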
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH - 1)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long ul;
+ char *ep;
+
+ errno = 0;
+ ul = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || ul < 1 || ul > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) ul;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
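+ *
+ * For example, a file "16385" in "base/5" of the main data directory
+ * has manifest path "base/5/16385", while a file in a user-defined
+ * tablespace gets a "pg_tblspc/<OID>/" prefix instead.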
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ closedir(dir);
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
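+ *
+ * For example, if this file has truncation_block_length 4 and contains
+ * blocks 2 and 5, then blocks 2 and 5 come from this file, blocks 0,
+ * 1, and 3 must still be found elsewhere, block 4 will be zero-filled,
+ * and the reconstructed file will be 6 blocks long.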
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * taking any action on those new blocks that would generate WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
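+ *
+ * The header layout consumed here is: a magic number, the block count,
+ * the truncation block length, and then one relative block number per
+ * block; the actual block data follows the header.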
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ int rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
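+ /* Ranges render as " 0-3:<filename>@<offset>" or " 4-5:zero". */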
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+	'-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+	'table contains expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that the no-checksum manifest does not mention a checksum algorithm.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,291 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexademical digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif							/* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4d99b4b3f1..48c2f6c56f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4018,3 +4018,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
On Tue, Dec 5, 2023 at 7:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
[..v13 patchset]
The results with the v13 patchset are as follows:
* - requires checkpoint on primary when doing incremental on standby
when it's too idle; this was explained by Robert in [1], something AKA
too-fast-incremental backup due to the testing scenario:
test_across_wallevelminimal.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_standby_incr_just_backup.sh - GOOD*
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD*
test_full_stby__incr_stby__restore_on_stby.sh - GOOD*
test_incr_on_standby_after_promote.sh - GOOD*
test_incr_after_timelineincrease.sh (pg_ctl stop, pg_resetwal -l
00000002000000000000000E ..., pg_ctl start, pg_basebackup
--incremental) - GOOD, I've got:
pg_basebackup: error: could not initiate base backup: ERROR:
timeline 1 found in manifest, but not in this server's history
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?
test_stuck_walsummary.sh (pkill -STOP walsumm) - GOOD:
This version also improves (at least, IMHO) the way that we wait for
WAL summarization to finish. Previously, you either caught up fully
within 60 seconds or you died. I didn't like that, because it seemed
like some people would get timeouts when the operation was slowly
progressing and would eventually succeed. So what this version does
is:
WARNING: still waiting for WAL summarization through 0/A0000D8
after 10 seconds
DETAIL: Summarization has reached 0/8000028 on disk and 0/80000F8
in memory.
[..]
pg_basebackup: error: could not initiate base backup: ERROR: WAL
summarization is not progressing
DETAIL: Summarization is needed through 0/A0000D8, but is stuck
at 0/8000028 on disk and 0/80000F8 in memory.
Comment2: looks good to me!
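(As a sketch for readers following the thread, the retry behavior
described above amounts to roughly the loop below, given a target LSN
needed_lsn; GetSummarizerProgress() and the exact interval are
illustrative assumptions here, not the actual walsummarizer API.)

    XLogRecPtr  summarized = GetSummarizerProgress();   /* hypothetical */

    while (summarized < needed_lsn)
    {
        XLogRecPtr  previous = summarized;

        /* Sleep ten seconds between progress checks. */
        pg_usleep(10L * 1000000L);
        summarized = GetSummarizerProgress();

        /* Give up only if no progress at all was made while we slept. */
        if (summarized <= previous)
            ereport(ERROR,
                    (errmsg("WAL summarization is not progressing")));

        /* Otherwise warn and keep waiting, with no overall timeout. */
        ereport(WARNING,
                (errmsg("still waiting for WAL summarization through %X/%X",
                        LSN_FORMAT_ARGS(needed_lsn))));
    }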
test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
Regards,
-J.
[1]: /messages/by-id/CA+TgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA@mail.gmail.com
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?
I thought that this was a good idea so I decided to go implement it,
only to discover that it was already part of the patch set ... did you
find some case where it doesn't work as expected? The code looks like
this:
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
KillExistingWALSummaries();
WriteEmptyXLOG();
test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940
I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Dec 7, 2023 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
Hi Robert,
On Thu, Dec 7, 2023 at 9:42 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
Comment: I was wondering if it wouldn't make some sense to teach
pg_resetwal to actually delete all WAL summaries after any
WAL/controlfile alteration?
I thought that this was a good idea so I decided to go implement it,
only to discover that it was already part of the patch set ... did you
find some case where it doesn't work as expected? The code looks like
this:
Ah, my bad, with a fresh mind and coffee the error message makes it
clear and of course it did reset the summaries properly.
While we are at it, maybe around the below in PrepareForIncrementalBackup()
    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli)));
we could add
errhint("You might need to start a new full backup instead of
incremental one")
?
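i.e., putting it together with the existing check (just a sketch):

    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli),
                 errhint("You might need to start a new full backup instead of an incremental one.")));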
test_pending_2pc.sh - getting GOOD on most recent runs, but several
times during early testing (probably due to my own mishaps), I've been
hit by Abort/TRAP. I'm still investigating and trying to reproduce
those ones. TRAP: failed Assert("summary_end_lsn >=
WalSummarizerCtl->pending_lsn"), File: "walsummarizer.c", Line: 940I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.
Great, I cannot get it to fail again today; it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.
-J.
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
Considering all this, what I'm inclined to do is go and put
UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
accordingly. But first: does anybody see more problems here that I may
have missed?
OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.
I have done some testing on standby, but I am facing some issues,
although things are working fine on the primary. As shown below in
test [1], the standby is reporting errors that the manifest requires
WAL from 0/60000F8, but this backup starts at 0/6000028. Then I tried
to look into the manifest file of the full backup and it shows the
contents below [0]. Actually, from this WARNING and ERROR, I am not
clear what the problem is. I understand that the full backup ends at
"0/60000F8", so for the next incremental backup we should be looking
for a summary that has WAL starting at "0/60000F8", and we do have
those WALs. In fact, the error message says "this backup starts at
0/6000028", which is before "0/60000F8", so what's the issue?
[0]:
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }
[1]:
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1
--test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: manifest
requires WAL from final timeline 1 ending at 0/60000F8, but this
backup starts at 0/6000028
pg_basebackup: removing data directory "d1"
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2023 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Dec 5, 2023 at 11:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 4, 2023 at 3:58 PM Robert Haas <robertmhaas@gmail.com> wrote:
Considering all this, what I'm inclined to do is go and put
UPLOAD_MANIFEST back, instead of INCREMENTAL_WAL_RANGE, and adjust
accordingly. But first: does anybody see more problems here that I may
have missed?
OK, so here's a new version with UPLOAD_MANIFEST put back. I wrote a
long comment explaining why that's believed to be necessary and
sufficient. I committed 0001 and 0002 from the previous series also,
since it doesn't seem like anyone has further comments on those
renamings.
I have done some testing on standby, but I am facing some issues,
although things are working fine on the primary. As shown below in
test [1], the standby is reporting errors that the manifest requires
WAL from 0/60000F8, but this backup starts at 0/6000028. Then I tried
to look into the manifest file of the full backup and it shows the
contents below [0]. Actually, from this WARNING and ERROR, I am not
clear what the problem is. I understand that the full backup ends at
"0/60000F8", so for the next incremental backup we should be looking
for a summary that has WAL starting at "0/60000F8", and we do have
those WALs. In fact, the error message says "this backup starts at
0/6000028", which is before "0/60000F8", so what's the issue?
[0]
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "0/6000028", "End-LSN": "0/60000F8" }[1]
-- test on primary
dilipkumar@dkmac bin % ./pg_basebackup -D d
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest
-- cleanup the backup directory
dilipkumar@dkmac bin % rm -rf d
dilipkumar@dkmac bin % rm -rf d1
--test on standby
dilipkumar@dkmac bin % ./pg_basebackup -D d -p 5433
dilipkumar@dkmac bin % ./pg_basebackup -D d1 -i d/backup_manifest -p 5433
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: manifest
requires WAL from final timeline 1 ending at 0/60000F8, but this
backup starts at 0/6000028
pg_basebackup: removing data directory "d1"
Jakub pinged me offlist and pointed me to the thread [1], where it is
already explained, so I think we can ignore this.
[1]: /messages/by-id/CA+TgmoYuC27_ToGtTTNyHgpn_eJmdqrmhJ93bAbinkBtXsWHaA@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
While we are at it, maybe around the below in PrepareForIncrementalBackup()
    if (tlep[i] == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("timeline %u found in manifest, but not in this server's history",
                        range->tli)));
we could add
errhint("You might need to start a new full backup instead of
incremental one")
?
I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
incremental backup "just because," not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.
The scenario I was really concerned about when I wrote this test was
(2), because that could lead to a corrupt restore. This test isn't
strong enough to prevent that completely, because two unrelated
standbys can branch onto the same new timelines at the same LSNs, and
then these checks can't tell that something bad has happened. However,
they can detect a useful subset of problem cases. And the solution is
not so much "take a new full backup" as "keep straight which server is
which." Likewise, in case (1), the relevant hint would be "don't
manually remove timeline history files, and if you must, then at least
don't nuke timelines that you actually still care about."
I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.
Great, I cannot get it to fail again today; it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.
Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v14-0001-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch
From 02f16ee535bd4f2c501c644e33f4658de732f580 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v14 1/5] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
7 files changed, 6 insertions(+), 6 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
v14-0005-Test-patch-Enable-summarize_wal-by-default.patch
From 354a066bafe030596cc2fc9fcc290cb4bde18227 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v14 5/5] Test patch: Enable summarize_wal by default.
To avoid test failures, must remove the prohibition against running
summarize_wal=on with wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b163e89cbb..51dc517710 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -937,9 +937,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..71025b43b7 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9f59440526..f249a9fad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1795,7 +1795,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
v14-0002-Add-a-new-WAL-summarizer-process.patch
From c4eed6c4120bb245f42b8110b99cd9f6accf6b31 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v14 2/5] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to copied in case of an incremental backup
covering that range of WAL records.
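To restate the limit-block rule as a sketch (the function and argument
names here are illustrative, not the actual blkreftable API):

    /* Limit block for one relation fork over the summarized WAL range. */
    BlockNumber
    limit_block_for_range(bool created_or_destroyed, bool truncated,
                          BlockNumber shortest_truncated_length)
    {
        if (created_or_destroyed)
            return 0;           /* nothing old survives; whole file is new */
        if (truncated)
            return shortest_truncated_length;   /* later blocks are gone */
        return InvalidBlockNumber;  /* no limit; consult modified blocks */
    }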
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1398 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 33 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3743 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44cada2b40..ee98585027 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+      <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 01e0484584..421a016ca1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3589,6 +3590,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3869,8 +3907,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter two do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3913,6 +3951,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5237,9 +5295,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6956,6 +7014,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7630,6 +7707,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
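Since both xlog.c hunks above lean on segment-number arithmetic, here is
a standalone sketch of that math, assuming the default 16MB segment
size; the real code uses the XLByteToSeg and XLogFromFileName macros
rather than these spelled-out constants:

#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE    ((uint64_t) 16 * 1024 * 1024)       /* default segment size */
#define SEGS_PER_ID (UINT64_C(0x100000000) / SEG_SIZE)  /* 256 for 16MB */

int
main(void)
{
    uint64_t lsn = UINT64_C(0x03000058);    /* LSN 0/3000058 */
    uint64_t segno = lsn / SEG_SIZE;        /* XLByteToSeg: segno 3 */

    /* WAL file name: TLI, then segno split across two 8-digit fields. */
    printf("%08X%08X%08X\n",                /* 000000010000000000000003 */
           1U,                              /* TLI 1 */
           (unsigned) (segno / SEGS_PER_ID),
           (unsigned) (segno % SEGS_PER_ID));
    return 0;
}

KeepLogSeg() then just takes a minimum over segment numbers: the LSN
returned by GetOldestUnsummarizedLSN() is mapped to its containing
segment, and recycling is never allowed to advance past it.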
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..0e2de91e9f 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c'
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
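Two notes on the code above. First, a concrete instance of the naming
scheme that IsWalSummaryFilename() accepts: a summary for TLI 1 covering
0/16000A8 through 0/170EFD8 is named

0000000100000000016000A8000000000170EFD8.summary

that is, five 8-hex-digit fields (TLI, start LSN high/low, end LSN
high/low), which is where the strspn() == 40 test comes from. Second,
the frontier-advancing loop in WalSummariesAreComplete() is easiest to
see in a standalone toy model, with summaries reduced to (start, end)
pairs already sorted by start:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t start_lsn, end_lsn; } Summary;

/*
 * Toy model of WalSummariesAreComplete(): advance the known-summarized
 * frontier from start toward end; any summary that begins beyond the
 * frontier proves a gap.
 */
static bool
covers(const Summary *s, int n, uint64_t start, uint64_t end,
       uint64_t *missing)
{
    uint64_t frontier = start;

    for (int i = 0; i < n; i++)
    {
        if (s[i].start_lsn > frontier)
            break;              /* gap before this summary */
        if (s[i].end_lsn > frontier)
        {
            frontier = s[i].end_lsn;
            if (frontier >= end)
                return true;
        }
    }
    *missing = frontier;
    return false;
}

int
main(void)
{
    /* [0x100,0x200) plus [0x200,0x500) cover [0x150,0x400)... */
    Summary s[] = {{0x100, 0x200}, {0x200, 0x500}};
    uint64_t missing = 0;

    printf("%d\n", covers(s, 2, 0x150, 0x400, &missing));   /* 1 */
    /* ...but not [0x150,0x600); the first missing LSN is 0x500. */
    printf("%d\n", covers(s, 2, 0x150, 0x600, &missing));   /* 0 */
    return 0;
}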
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
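For hand-testing the two functions above, something like this works from
psql. This is a sketch: the LSNs are made up, and I'm going by the
output columns as the code above builds them, (tli, start_lsn, end_lsn)
for the first function and a final boolean named is_limit_block for the
second, per the pg_proc.dat entries:

SELECT * FROM pg_available_wal_summaries();

SELECT count(*)
  FROM pg_wal_summary_contents(1, '0/16000A8', '0/170EFD8')
 WHERE NOT is_limit_block;

The first call lists one row per file in pg_wal/summaries; the second
counts the modified blocks recorded in a single summary, excluding the
synthetic limit-block rows.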
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 651b85ea74..b163e89cbb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -564,6 +567,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -933,6 +937,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1835,6 +1842,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2659,6 +2669,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3012,6 +3024,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3130,6 +3143,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3525,6 +3552,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3675,6 +3708,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3701,6 +3736,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3798,6 +3834,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4019,6 +4056,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5326,6 +5365,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5462,6 +5505,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..7c840c36b3
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1398 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at which the
+ * next summary file will start. Normally, these are the TLI and LSN at
+ * which the last file ended; in that case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory, and ask
+ * GetOldestUnsummarizedLSN to reset pending_lsn to summarized_lsn. We
+ * might be recovering from an error, and if so, pending_lsn might have
+ * advanced past summarized_lsn, but any WAL we read previously has been
+ * lost and will need to be reread.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the returned
+ * LSN is necessarily the start of a WAL record and false if it's just the
+ * beginning of a WAL segment.
+ *
+ * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
+ * be equal to the summarized_lsn.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
+ bool reset_pending_lsn)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
+ * in shared mode and try to fetch the required information. If we acquire
+ * in shared mode and find that the data structure hasn't been
+ * initialized, we reacquire the lock in exclusive mode so that we can
+ * initialize it. However, if someone else does that first before we get
+ * the lock, then we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ if (reset_pending_lsn)
+ WalSummarizerCtl->pending_lsn =
+ WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the information to the caller, as requested. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
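+
+ /*
+ * The summary file name encodes the TLI and both LSNs as five 8-digit
+ * hex fields; for example (values hypothetical), a summary on TLI 1
+ * covering 0/1000028 through 0/100ED48 would be named
+ * 000000010000000001000028000000000100ED48.summary.
+ */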
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * At least one full block is available, so read exactly one block;
+ * the caller can come back if it needs more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We reduce the sleep time by
+ * one quantum for each page read, but never below one quantum, which
+ * is a fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
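+
+ /*
+ * For example, if sleep_quanta is 8 and three pages were read, the
+ * next sleep uses 5 quanta; if 8 or more pages were read, it drops
+ * straight to the 1-quantum minimum.
+ */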
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove old WAL summary files, if appropriate. We do this at most once
+ * per checkpoint cycle, and not at all if WAL summary removal is disabled.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
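+
+ /*
+ * For example, with the default wal_summary_keep_time of 10 days (14400
+ * minutes), files last modified more than 864000 seconds ago qualify.
+ */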
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 490d5a9ab7..8109aee6f0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -296,7 +296,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -318,6 +319,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f7c9882f7c..9f59440526 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -703,6 +704,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1786,6 +1789,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3200,6 +3213,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
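+
+/*
+ * Worked example: with 2^16 blocks per chunk, block 150000 falls in chunk 2
+ * (blocks 131072..196607) at offset 18928. In array form each modified
+ * block costs one 2-byte entry; in bitmap form a chunk costs a fixed
+ * MAX_ENTRIES_PER_CHUNK * sizeof(uint16) = 8192 bytes (one bit per block).
+ * The two forms are the same size at exactly MAX_ENTRIES_PER_CHUNK modified
+ * blocks, which is the point at which we convert.
+ */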
+typedef uint16 *BlockRefTableChunk;
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
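+
+/*
+ * Taken together with the writer code below, the on-disk layout is: a
+ * 4-byte magic number; then, for each relation fork in sorted order, one
+ * BlockRefTableSerializedEntry followed by its chunk-usage array (nchunks
+ * uint16 values) and the contents of each nonempty chunk; then an
+ * all-zeroes sentinel entry; and finally a CRC-32C of everything that
+ * precedes it.
+ */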
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
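+
+/*
+ * As a usage sketch (BATCH and process are hypothetical), a caller can
+ * retrieve every modified block by advancing the window between calls:
+ *
+ * start = 0;
+ * while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ * InvalidBlockNumber, blocks, BATCH)) > 0)
+ * {
+ * process(blocks, n);
+ * start = blocks[n - 1] + 1;
+ * }
+ */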
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
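+
+/*
+ * Typical use of the reader is a nested loop, roughly as sketched below
+ * (my_read_cb, my_error_cb, and process_block are hypothetical
+ * caller-supplied routines):
+ *
+ * reader = CreateBlockRefTableReader(my_read_cb, &io_state, filename,
+ * my_error_cb, NULL);
+ * while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ * &limit_block))
+ * while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ * lengthof(blocks))) > 0)
+ * process_block(&rlocator, forknum, blocks, n);
+ * DestroyBlockRefTableReader(reader);
+ */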
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
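+
+/*
+ * Sketch of that standalone workflow (write_cb is a hypothetical
+ * caller-supplied output callback; entries must be supplied in sorted
+ * order, as noted for BlockRefTableWriteEntry above):
+ *
+ * writer = CreateBlockRefTableWriter(write_cb, &io_state);
+ * entry = CreateBlockRefTableEntry(rlocator, forknum);
+ * BlockRefTableEntrySetLimitBlock(entry, limit_block);
+ * BlockRefTableEntryMarkBlockModified(entry, forknum, blkno);
+ * BlockRefTableWriteEntry(writer, entry);
+ * BlockRefTableFreeEntry(entry);
+ * DestroyBlockRefTableWriter(writer);
+ */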
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
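+
+ /*
+ * For example, a limit_block of 200000 gives limit_chunkno = 3 and
+ * limit_chunkoffset = 3392: chunks 4 and above are emptied outright
+ * below, and within chunk 3 only offsets less than 3392 survive,
+ * whether that chunk is stored as a bitmap or as an offset array.
+ */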
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
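
[Aside, not part of the patch: to make the io_callback hook used by
BlockRefTableFlush/BlockRefTableWrite above concrete, here is a minimal
sketch of a write callback that could be paired with this buffering layer.
The function name and the use of stdio are hypothetical; the real callers in
the patch set go through PostgreSQL's File machinery instead.]

/*
 * Illustrative write callback (sketch only; assumes <stdio.h>, <stdlib.h>,
 * <string.h>, and <errno.h>). Per the callback contract, a write callback
 * must not return short, so a short write is treated as fatal here.
 */
static int
stdio_write_callback(void *callback_arg, void *data, int length)
{
    FILE       *fp = (FILE *) callback_arg;

    if (fwrite(data, 1, length, fp) != (size_t) length)
    {
        fprintf(stderr, "could not write: %s\n", strerror(errno));
        exit(1);
    }
    return length;
}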
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
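
[Aside: WalSummaryIO above pairs a File with a position so the I/O callbacks
can carry their own state. A plausible shape for ReadWalSummary, sketched
purely from these declarations (the wait event name is a guess, and the real
implementation in the patch may differ):]

int
ReadWalSummary(void *wal_summary_io, void *data, int length)
{
    WalSummaryIO *io = (WalSummaryIO *) wal_summary_io;
    int         nbytes;

    /* Read from the current position, then advance it. */
    nbytes = FileRead(io->file, data, length, io->filepos,
                      WAIT_EVENT_WAL_SUMMARY_READ); /* hypothetical name */
    if (nbytes < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read file \"%s\": %m",
                        FilePathName(io->file))));
    io->filepos += nbytes;
    return nbytes;
}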
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77e8b13764..916c8ec8d0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12099,4 +12099,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
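
[Aside: the two functions above make the summarizer's progress visible from
SQL. A hypothetical session, with illustrative output only:]

-- Which LSN ranges have been summarized so far?
SELECT * FROM pg_available_wal_summaries();
 tli | start_lsn | end_lsn
-----+-----------+-----------
   1 | 0/1000028 | 0/1FFFFD8

-- Inspect one summary; rows with is_limit_block report truncations
-- rather than individual modified blocks.
SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1FFFFD8')
WHERE NOT is_limit_block
LIMIT 10;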
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
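
[Aside: a minimal sketch of the in-memory API declared above, as a caller
like the walsummarizer might use it. The relation locator values are made
up, and write_callback is whatever io_callback_fn the caller supplies (see
the stdio sketch earlier):]

static void
block_ref_table_example(io_callback_fn write_callback,
                        void *write_callback_arg)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();
    RelFileLocator rlocator;

    /* Hypothetical relation in the default tablespace. */
    rlocator.spcOid = DEFAULTTABLESPACE_OID;
    rlocator.dbOid = 5;
    rlocator.relNumber = 16384;

    /* The relation was truncated to 100 blocks, then block 7 changed. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);

    /* Serialize; entries are sorted into on-disk order internally. */
    WriteBlockRefTable(brtab, write_callback, write_callback_arg);
}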
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1043a4d782..74bc2f97cb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,6 +336,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -442,6 +443,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -454,6 +456,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..180d3f34b9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact,
+ bool reset_pending_lsn);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
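
[Aside: from a user's point of view, the two GUCs exported above are the
whole interface. A hypothetical postgresql.conf fragment, assuming
wal_summary_keep_time accepts the usual time units:]

summarize_wal = on              # run the walsummarizer; required before
                                # any incremental backup can be taken
wal_summary_keep_time = '7d'    # remove summary files older than a week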
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba41149b88..9390049314 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4012,3 +4012,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
[Attachment: v14-0003-Add-support-for-incremental-backup.patch (application/octet-stream)]
From 2508129d8f3f80f252d8a4d0b66f20ff893c5862 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v14 3/5] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 228 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5822 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest from an earlier backup taken on the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files that contain
+ only the blocks changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the newest summary files are unlikely to be present on disk just
+ yet, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to forgo incremental
+ backups entirely and just take full backups, which are simpler
+ to manage. For a large database that is heavily modified throughout,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
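
[Aside on the documentation above: a concrete end-to-end sketch of a chain
with two incrementals, using hypothetical directory names. Note that the
second incremental is taken against the first incremental's manifest, and
that pg_combinebackup is given the whole chain, oldest first:]

pg_basebackup -cfast -D full
pg_basebackup -cfast -D incr1 --incremental full/backup_manifest
pg_basebackup -cfast -D incr2 --incremental incr1/backup_manifest
pg_combinebackup full incr1 incr2 -o restored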
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee98585027..b5624ca884 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
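
[Aside: at the protocol level, the flow documented above is just two
commands on a replication connection; pg_basebackup issues them for you
when --incremental is given. An illustrative sketch, with the manifest
transfer itself elided:]

UPLOAD_MANIFEST
-- (the manifest contents are then streamed to the server)
BASE_BACKUP ( INCREMENTAL, LABEL 'nightly', PROGRESS )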
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, it replaces some files that would have been part of a full backup
+ with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..6cac73573f
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,228 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method</option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <variablelist>
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
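
[Aside: combining the options documented above, a reconstruction that also
relocates a tablespace might look like this; all paths are hypothetical:]

pg_combinebackup full incr1 incr2 -o /srv/pg/restored \
    -T /srv/pg/ts_old=/srv/pg/ts_new \
    --manifest-checksums=SHA256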
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a2c8fa3981..6f4f81f992 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,19 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
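
(Aside, for reviewers: the on-disk layout of an incremental file, as emitted by
the sendFile() changes above, is easiest to see in one place. The struct below
is purely illustrative - it does not exist in the patch, since the header is
pushed out field by field via push_to_sink() - but it shows the byte layout:

    typedef struct
    {
        uint32      magic;          /* INCREMENTAL_MAGIC */
        uint32      num_blocks;     /* how many blocks are included */
        uint32      truncation_block_length;   /* minimum reconstructed
                                                 * length, in blocks */
        BlockNumber blocks[FLEXIBLE_ARRAY_MEMBER];  /* relative block
                                                     * numbers, sorted */
        /* ...followed by num_blocks blocks of BLCKSZ bytes each, in order */
    } IncrementalFileHeader;        /* hypothetical name, for illustration */

With the default 8kB block size, a file carrying 10 modified blocks is
therefore 3 * 4 + 10 * (4 + 8192) = 81,972 bytes; see also
GetIncrementalFileSize() in the next file.)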
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs at which
+ * this server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to 10 seconds. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
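+
+/*
+ * For example (hypothetical OIDs): for base/13450/16384 this returns
+ * base/13450/INCREMENTAL.16384, and for segment 1 of the same relation
+ * it returns base/13450/INCREMENTAL.16384.1.
+ */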
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE
+ * entries.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose absolute block numbers to relative
+ * block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four-byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
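(To make the intended call pattern concrete, here is a simplified sketch of
how the basebackup.c changes are expected to consume GetFileBackupMethod().
The snippet itself is not from the patch - the surrounding variables just
mirror the function's signature, and the plumbing is elided:

    BlockNumber relative_block_numbers[RELSEG_SIZE];
    unsigned    num_blocks_required;
    unsigned    truncation_block_length;

    if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                            forknum, segno, size,
                            &num_blocks_required, relative_block_numbers,
                            &truncation_block_length) == BACK_UP_FILE_INCREMENTALLY)
    {
        /* send INCREMENTAL.<name>: header, then the selected blocks */
    }
    else
    {
        /* send the file in full, as in a non-incremental backup */
    }

Note that the array really must have RELSEG_SIZE entries, since the function
can hand back up to that many block numbers.)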
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..19c355ceca 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in CopyOut mode as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0e0ac22bdd..706140eb9f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -32,6 +32,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -140,6 +141,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -337,6 +339,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
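+ * For example, a typical START WAL LOCATION line looks like:
+ *     START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
+ * and only the LSN and the space that follows it are examined here.
+ *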
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and if sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
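+ /* Temporarily NUL-terminate the buffer at e so sscanf() cannot run past it. */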
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
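+ /*
+ * If no OS-specific primitive is available, strategy_name remains NULL
+ * and we fall through to the block-by-block copy below.
+ */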
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
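+ /* a multi-block buffer reduces the number of read() and write() calls */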
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
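+ *
+ * For example, with this estimate a 10 MB manifest is assumed to contain
+ * roughly 100,000 entries.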
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..63dcbf329d
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "do:nNT:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
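+ * The final backup's directory follows the prior ones in this same array,
+ * which is why load_backup_manifests() below can be asked to load all
+ * n_backups manifests starting from prior_backup_dirs.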
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
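+ *
+ * For example, "/old/ts=/new/ts" maps /old/ts to /new/ts, and an escaped
+ * "\=" in either name is copied as a literal equals sign rather than
+ * treated as the separator.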
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
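+ /* check_tli/check_lsn hold the expected start position, taken from the newer backup's INCREMENTAL FROM values. */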
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_info that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid; /* 64 bits so the range check below works on all platforms */
+ char *ep;
+
+ errno = 0;
+ oid = strtou64(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If the directory being processed is a user-defined tablespace, tsoid
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * input_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
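+ *
+ * For example, when processing "base/1" within the main data directory,
+ * manifest_prefix becomes "base/1/".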
+ */
+ if (relative_path == NULL)
+ {
+ strncpy(ifulldir, input_directory, MAXPGPATH);
+ strncpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strncpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (e.g. If PG_VERSION contains "14\n" this function
+ * will return 140000)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strncpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strncpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the second.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
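+ *
+ * As a hypothetical example: with truncation_block_length = 4 and blocks
+ * 1 and 6 present in this file, block 1 leaves only blocks 0, 2, and 3 to
+ * be found in older backups, while block 6 merely extends the output to 7
+ * blocks without changing what must be found.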
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file but
+ * never subsequently doing anything to those blocks that would
+ * have generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
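+ *
+ * Note that on a successful reuse, checksum_type is reset to
+ * CHECKSUM_TYPE_NONE below, so that no second checksum is computed when
+ * the file is subsequently copied.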
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
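+ *
+ * A hypothetical example: with truncation_block_length = 10 and blocks 12
+ * and 15 present in the incremental file, the reconstructed length is 16
+ * blocks; any blocks in between that have no source are zero-filled when
+ * the output file is written.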
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
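+ *
+ * For reference, the header layout implied by the reads below is: a magic
+ * number (INCREMENTAL_MAGIC), a count of blocks, a truncation block
+ * length, and an array of relative block numbers, one element per stored
+ * block; the block data itself follows the header.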
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ unsigned rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
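+ /*
+ * Each contiguous run of blocks is rendered as "N-M:source@offset", or
+ * with ":zero" for blocks that will be zero-filled; a hypothetical plan
+ * might read " 0-9:backup1/base/5/16384@0 10:zero".
+ */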
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+	'table contents as expected');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
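+ *
+ * For illustration, an entry as generated below looks roughly like this
+ * (hypothetical values):
+ *
+ * { "Path": "base/5/16384", "Size": 8192,
+ *   "Last-Modified": "2023-11-01 12:00:00 GMT",
+ *   "Checksum-Algorithm": "CRC32C", "Checksum": "aa151b67" }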
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
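+ *
+ * For example, the two bytes 0xaa and 0x15 encode to the four characters
+ * "aa15". The caller must ensure that dst has room for len * 2 characters;
+ * the return value is the number of characters written.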
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
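+ /*
+ * Per the walsummarizer, WAL summary file names look like
+ * ${TLI}${START_LSN}${END_LSN}.summary, i.e. 8 + 16 + 16 = 40 hexadecimal
+ * characters followed by the ".summary" suffix; a made-up example:
+ * 00000001000000000100002800000000010000D8.summary.
+ */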
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif /* BASEBACKUP_INCREMENTAL_H */
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9390049314..e37ef9aa76 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4023,3 +4023,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
Attachment: v14-0004-Add-new-pg_walsummary-tool.patch
From 30d698acb6de507b8c8a5fafda4b8a81f6ca5b5b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v14 4/5] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 477 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
Hi Robert,
On Mon, Dec 11, 2023 at 6:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 8, 2023 at 5:02 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

While we are at it, maybe around the below in PrepareForIncrementalBackup()

if (tlep[i] == NULL)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("timeline %u found in manifest, but not in this server's history",
                    range->tli)));

we could add

errhint("You might need to start a new full backup instead of an
incremental one")?
I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
incremental backup "just because,"
Did you mean take a new full backup here?
not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.
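To sketch (2) concretely - hypothetical directories and ports, and
assuming the two standbys share a WAL archive so that the second
promotion picks a new timeline:

pg_ctl -D standby1 promote # ends up on timeline 2
pg_ctl -D standby2 promote # ends up on timeline 3
pg_basebackup -c fast -D full -p 6432 # full backup from standby1
pg_basebackup -c fast -D incr -p 6433 --incremental=full/backup_manifest

The manifest references timeline 2, which is nowhere in standby2's
history, so the last command draws this error.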
Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on a fresh initdb
directory) the DB, where a cronjob then failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which
might trigger users to complain on forums, support tickets.
I have a fix for this locally, but I'm going to hold off on publishing
a new version until either there's a few more things I can address all
at once, or until Thomas commits the ubsan fix.

Great, I cannot get it to fail again today, it had to be some dirty
state of the testing env. BTW: Thomas has pushed that ubsan fix.

Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.
LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.
-J.
On Wed, Dec 13, 2023 at 5:39 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
I can't exactly say that such a hint would be inaccurate, but I think
the impulse to add it here is misguided. One of my design goals for
this system is to make it so that you never have to take a new
incremental backup "just because,"Did you mean take a new full backup here?
Yes, apologies for the typo.
not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.

Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on a fresh initdb
directory) the DB, where a cronjob then failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which
might trigger users to complain on forums, support tickets.
Hmm. In case (4), I was thinking that you'd get a complaint about the
database system identifier not matching. I'm not actually sure that's
what would happen, though, now that you mention it.
In case (3), I think you would get an error about missing WAL summary files.
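Sketching case (3) with made-up paths:

pg_ctl -D $PGDATA stop
pg_resetwal -D $PGDATA # with this patch set, this also empties pg_wal/summaries
pg_ctl -D $PGDATA start
pg_basebackup -c fast -D incr --incremental=prior/backup_manifest

and I'd expect that last command to fail because the WAL summaries
covering the range back to the prior backup are gone.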
Huzzah, the cfbot likes the patch set now. Here's a new version with
the promised fix for your non-reproducible issue. Let's see whether
you and cfbot still like this version.

LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.
Awesome, thank you so much.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi Robert,
On Wed, Dec 13, 2023 at 2:16 PM Robert Haas <robertmhaas@gmail.com> wrote:
not even in case of an intervening
timeline switch. So, all of the errors in this function are warning
you that you've done something that you really should not have done.
In this particular case, you've either (1) manually removed the
timeline history file, and not just any timeline history file but the
one for a timeline for a backup that you still intend to use as the
basis for taking an incremental backup or (2) tried to use a full
backup taken from one server as the basis for an incremental backup on
a completely different server that happens to share the same system
identifier, e.g. because you promoted two standbys derived from the
same original primary and then tried to use a full backup taken on one
as the basis for an incremental backup taken on the other.

Okay, but please consider two other possibilities:

(3) I had a corrupted DB that I fixed by running pg_resetwal, and some
cronjob just a day later attempted to take an incremental backup and
failed with that error.

(4) I had pg_upgraded (which calls pg_resetwal on a fresh initdb
directory) the DB, where a cronjob then failed with this error.

I bet that (4) is going to happen more often than (1) or (2), which
might trigger users to complain on forums, support tickets.

Hmm. In case (4), I was thinking that you'd get a complaint about the
database system identifier not matching. I'm not actually sure that's
what would happen, though, now that you mention it.
I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course the DBIDs do differ after initdb), but I get this
instead:
$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
Also, in the manifest I don't see a DBID?
Maybe it's a nuisance, and all I'm trying to say is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without an errhint() is going to
scare some Junior DBAs.
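In other words, the cronjob's log would then show something like this
(the HINT line being the hypothetical addition):

pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
HINT: You might need to start a new full backup instead of an
incremental one.

which at least points them in the right direction.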
LGTM, all quick tests work from my end too. BTW: I have also scheduled
the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
test. I'll let you know how it went.

Awesome, thank you so much.
OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did
generate pretty large incrementals (I had to abort the run for lack of
space; I was expecting the incrementals to be much smaller). I
initially suspected that the problem lies in the uniform distribution
of `\set aid random(1, 100000 * :scale)` in the tpcb-like script,
which UPDATEs the big pgbench_accounts table.
$ du -sm /backups/backups/* /backups/archive/
216205 /backups/backups/full
215207 /backups/backups/incr.1
216706 /backups/backups/incr.2
102273 /backups/archive/
So I verified the recoverability yesterday anyway - the
pg_combinebackup "full incr.1 incr.2" took 44 minutes, and the later
archive WAL recovery and promotion SUCCEEDED. The 8-way parallel
seqscan for sum(abalance) on pgbench_accounts and the other tables
worked fine. The pg_combinebackup was using 15-20% CPU (mostly %sys),
while performing mostly 60-80MB/s separately for both reads and writes
(it's slow, but that's due to the maxed-out sequential I/O of the
small Premium SSD on Azure).
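For reference, the combine step was essentially the usage pattern from
the top of the thread, i.e. something like (output path made up):

pg_combinebackup /backups/backups/full /backups/backups/incr.1
/backups/backups/incr.2 -o /path/to/combined_data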
So I've launched another improved test (to force more localized
UPDATEs) to see a more real-world space-effectiveness of the
incremental backup:
\set aid random_exponential(1, 100000 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
    VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
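I ran that with something like `pgbench -n -P 1 -R 100 -c 8 -T 259200
-f exp.sql` - the same load shape as before, just with the exponential
access pattern (the script file name here is made up).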
But then... (and I have verified the low IDs for :aid above) the same
has happened:
backups/backups$ du -sm /backups/backups/*
210229 /backups/backups/full
208299 /backups/backups/incr.1
208351 /backups/backups/incr.2
# pgbench_accounts has relfilenodeid 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173
174 175 ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L
./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ;
echo; done
005c6bbb40fca3c1a0a819376ef0e793 full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793 /var/lib/postgres/17/data/base/5/16486.5
[.. all the checksums match (!) for the above $L..]
c5117a213253035da5e5ee8a80c3ee3d full/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.1/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d ./incr.2/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d /var/lib/postgres/17/data/base/5/16486.173
47ee6b18d7f8e40352598d194b9a3c8a full/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.1/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a ./incr.2/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a /var/lib/postgres/17/data/base/5/16486.174
82dfeba58b4a1031ac12c23f9559a330 full/base/5/16486.175
21a8ac1e6fef3cf0b34546c41d59b2cc ./incr.1/base/5/16486.175
2c3d89c612b2f97d575a55c6c0204d0b ./incr.2/base/5/16486.175
73367d44d76e98276d3a6bbc14bb31f1 /var/lib/postgres/17/data/base/5/16486.175
So to me, it looks like it copied 174 out of 175 files anyway,
lowering the effectiveness of that incremental backup to nearly 0%.

The commands to generate those incremental backups were:
pg_basebackup -v -P -c fast -D /backups/backups/incr.1
--incremental=/backups/backups/full/backup_manifest
sleep 4h
pg_basebackup -v -P -c fast -D /backups/backups/incr.2
--incremental=/backups/backups/incr1/backup_manifest
The incrementals are being generated, but just for the first (0)
segment of the relation?
/backups/backups$ ls -l incr.2/base/5 | grep INCR
-rw------- 1 postgres postgres 12 Dec 14 21:33 INCREMENTAL.112
-rw------- 1 postgres postgres 12 Dec 14 21:01 INCREMENTAL.113
-rw------- 1 postgres postgres 12 Dec 14 21:36 INCREMENTAL.1247
-rw------- 1 postgres postgres 12 Dec 14 21:38 INCREMENTAL.1247_vm
[..note, no INCREMENTAL.$int.$segment files]
-rw------- 1 postgres postgres 12 Dec 14 21:24 INCREMENTAL.6238
-rw------- 1 postgres postgres 12 Dec 14 21:17 INCREMENTAL.6239
-rw------- 1 postgres postgres 12 Dec 14 21:55 INCREMENTAL.827
# 16486 is pgbench_accounts
/backups/backups$ ls -l incr.2/base/5/*16486* | grep INCR
-rw------- 1 postgres postgres 14613480 Dec 14 21:00
incr.2/base/5/INCREMENTAL.16486
-rw------- 1 postgres postgres 12 Dec 14 21:52
incr.2/base/5/INCREMENTAL.16486_vm
/backups/backups$
/backups/backups$ find incr* -name INCREMENTAL.* | wc -l
1342
/backups/backups$ find incr* -name INCREMENTAL.*_* | wc -l # VM or FSM
236
/backups/backups$ find incr* -name INCREMENTAL.*.* | wc -l # not a
single incremental file for any segment of a >1GB relation
0
I'm quickly passing this info along and I haven't really looked at
the code yet, but the problem should be somewhere around
GetFileBackupMethod(), and it should be easily reproducible with the
configure --with-segsize-blocks=X switch.
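An untested sketch, assuming summarize_wal is enabled and using tiny
4-block segments so that multi-segment relations appear right away:

./configure --with-segsize-blocks=4 ...
pgbench -i -s 1 # pgbench_accounts now spans many tiny segments
pg_basebackup -c fast -D full
pgbench -T 30
pg_basebackup -c fast -D incr --incremental=full/backup_manifest
find incr -name 'INCREMENTAL.*.*' # currently empty; should list the segments beyond the first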
-J.
I have a couple of quick fixes here.
The first fixes up some things in nls.mk related to a file move. The
second is some cleanup because some function you are using has been
removed in the meantime; you probably found that yourself while rebasing.
The pg_walsummary patch doesn't have a nls.mk, but you also comment that
it doesn't have tests yet, so I assume it's not considered complete yet
anyway.
Attachments:
0002-fixup-Move-src-bin-pg_verifybackup-parse_man.patch.nocfbot
From 04aae4ee91ddd1d4ce061c36a99b0fa18bdd98ec Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 14 Dec 2023 12:50:33 +0100
Subject: [PATCH 2/6] fixup! Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
---
src/bin/pg_verifybackup/nls.mk | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/bin/pg_verifybackup/nls.mk b/src/bin/pg_verifybackup/nls.mk
index eba73a2c05..9e6a6049ba 100644
--- a/src/bin/pg_verifybackup/nls.mk
+++ b/src/bin/pg_verifybackup/nls.mk
@@ -1,10 +1,10 @@
# src/bin/pg_verifybackup/nls.mk
CATALOG_NAME = pg_verifybackup
GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
- parse_manifest.c \
pg_verifybackup.c \
../../common/fe_memutils.c \
- ../../common/jsonapi.c
+ ../../common/jsonapi.c \
+ ../../common/parse_manifest.c
GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS) \
json_manifest_parse_failure:2 \
error_cb:2 \
--
2.43.0
0004-fixup-Add-a-new-WAL-summarizer-process.patch.nocfbot
From 25211044687a629e632ef0a2bfad30acea337266 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 14 Dec 2023 18:32:29 +0100
Subject: [PATCH 4/6] fixup! Add a new WAL summarizer process.
---
src/backend/backup/meson.build | 2 +-
src/backend/postmaster/walsummarizer.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 0e2de91e9f..5d4ebe3ebe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -13,5 +13,5 @@ backend_sources += files(
'basebackup_throttle.c',
'basebackup_zstd.c',
'walsummary.c',
- 'walsummaryfuncs.c'
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
--
2.43.0
A separate bikeshedding topic: The GUC "summarize_wal", could that be
"wal_something" instead? (wal_summarize? wal_summarizer?) It would be
nice if these setting names grouped together a bit, both with existing
wal_* ones and also with the new ones you are adding
(wal_summary_keep_time).
Another set of comments, about the patch that adds pg_combinebackup:
Make sure all the options are listed in a consistent order. We have
lately changed everything to be alphabetical. This includes:
- reference page pg_combinebackup.sgml
- long_options listing
- getopt_long() argument
- subsequent switch
- (--help output, but it looks ok as is)
Also, in pg_combinebackup.sgml, the option --sync-method is listed as if
it does not take an argument, but it does.
On Fri, Dec 15, 2023 at 6:53 AM Peter Eisentraut <peter@eisentraut.org> wrote:
The first fixes up some things in nls.mk related to a file move. The
second is some cleanup because some function you are using has been
removed in the meantime; you probably found that yourself while rebasing.
Incorporated these. As you guessed,
MemoryContextResetAndDeleteChildren -> MemoryContextReset had already
been done locally.
The pg_walsummary patch doesn't have a nls.mk, but you also comment that
it doesn't have tests yet, so I assume it's not considered complete yet
anyway.
I think this was more of a case of me just not realizing that I should
add that. I'll add something simple to the next version, but I'm not
very good at this NLS stuff.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Dec 15, 2023 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote:
A separate bikeshedding topic: The GUC "summarize_wal", could that be
"wal_something" instead? (wal_summarize? wal_summarizer?) It would be
nice if these setting names grouped together a bit, both with existing
wal_* ones and also with the new ones you are adding
(wal_summary_keep_time).
Yeah, this is highly debatable, so bikeshed away. IMHO, the question
here is whether we care more about (1) having the name of the GUC
sound nice grammatically or (2) having the GUC begin with the same
string as other, related GUCs. I think that Tom Lane tends to prefer
the former, and probably some other people do too, while some other
people tend to prefer the latter. Ideally it would be possible to
satisfy both goals at once here, but everything I thought about that
started with "wal" sounded too awkward for me to like it; hence the
current choice of name. But if there's consensus on something else, so
be it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Dec 18, 2023 at 4:10 AM Peter Eisentraut <peter@eisentraut.org> wrote:
Another set of comments, about the patch that adds pg_combinebackup:
Make sure all the options are listed in a consistent order. We have
lately changed everything to be alphabetical. This includes:

- reference page pg_combinebackup.sgml
- long_options listing
- getopt_long() argument
- subsequent switch
- (--help output, but it looks ok as is)
Also, in pg_combinebackup.sgml, the option --sync-method is listed as if
it does not take an argument, but it does.
I've attempted to clean this stuff up in the attached version. This
version also includes a fix for the bug found by Jakub that caused
things to not work properly for segment files beyond the first for any
particular relation, which turns out to be a really stupid mistake in
my earlier commit 025584a168a4b3002e193.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v15-0001-Fix-brown-paper-bag-bug-in-025584a168a4b3002e193.patch
From 6d78fd0d854425b0442f69350d2f241eb9bc7648 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 18 Dec 2023 13:16:57 -0500
Subject: [PATCH v15 1/6] Fix brown paper bag bug in
025584a168a4b3002e19350bb8db0ebf1fd10235.
The previous logic failed to work for anything other than the first
segment of a relation.
Report by Jakub Wartak. Patch by me.
---
src/backend/storage/file/reinit.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 5df2517b46..6e8eb786d0 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -431,7 +431,7 @@ parse_filename_for_nontemp_relation(const char *name, RelFileNumber *relnumber,
else
{
/* Reject leading zeroes, just like we do for RelFileNumber. */
- if (name[0] < '1' || name[0] > '9')
+ if (name[1] < '1' || name[1] > '9')
return false;
errno = 0;
--
2.39.3 (Apple Git-145)
v15-0005-Add-new-pg_walsummary-tool.patch
From 4730b37708f41174005e2016c9b0090a1bd33571 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v15 5/6] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/nls.mk | 6 +
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
12 files changed, 483 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/nls.mk
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/nls.mk b/src/bin/pg_walsummary/nls.mk
new file mode 100644
index 0000000000..f411dcfe9e
--- /dev/null
+++ b/src/bin/pg_walsummary/nls.mk
@@ -0,0 +1,6 @@
+# src/bin/pg_walsummary/nls.mk
+CATALOG_NAME = pg_walsummary
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ pg_walsummary.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
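To make the growth loop above concrete: assuming a (hypothetical) initial
block_buffer_size of 512, a relation with 1300 modified blocks first fills all
512 slots; the buffer doubles to 1024 and the next read returns another 512,
filling it again; it doubles to 2048, the final read returns the remaining
276, and the loop exits with nblocks = 1300 < 2048.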
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
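For anyone who wants to poke at the output by hand, a usage sketch (the
summary filename is invented; real names encode the TLI plus the start and end
LSNs as 40 hex digits, as described in the summarizer patch below):

    pg_walsummary -i $PGDATA/pg_wal/summaries/0000000100000000010000280000000001515788.summary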
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
Attachment: v15-0003-Add-a-new-WAL-summarizer-process.patch (application/octet-stream)
From d88014cd2be3ace022511bf715f31f05de503c74 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v15 3/6] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to be copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summary_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
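To make the limit-block rule concrete, here is a toy sketch (not code from the
patch; the struct and helper are hypothetical, only the rule itself is taken
from the description above):

    typedef struct
    {
        bool        created_or_destroyed;   /* created/destroyed in range? */
        bool        truncated;              /* truncated in range? */
        BlockNumber shortest_truncation;    /* shortest truncation length */
    } rel_summary_state;        /* hypothetical, for illustration only */

    static BlockNumber
    limit_block_for(rel_summary_state *s)
    {
        if (s->created_or_destroyed)
            return 0;                   /* copy the whole file */
        if (s->truncated)
            return s->shortest_truncation;
        return InvalidBlockNumber;      /* no truncation; no limit */
    }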
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1398 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 33 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3743 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44cada2b40..ee98585027 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
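For reference, a configuration sketch; whether wal_summary_keep_time accepts
unit suffixes depends on how the GUC is registered, which this excerpt doesn't
show, so the value below is illustrative:

    # postgresql.conf
    summarize_wal = on              # start the walsummarizer
    wal_summary_keep_time = '10d'   # retain summaries for ten days (default)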
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 01e0484584..421a016ca1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3589,6 +3590,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3869,8 +3907,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3913,6 +3951,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5237,9 +5295,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6956,6 +7014,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7630,6 +7707,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..5d4ebe3ebe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
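Since summary filenames pack the TLI and LSN bounds into five 8-hex-digit
fields, decoding one by hand mirrors the sscanf above (the filename here is
invented for illustration):

    uint32      tmp[5];
    TimeLineID  tli;
    XLogRecPtr  start_lsn;
    XLogRecPtr  end_lsn;

    sscanf("0000000100000000010000280000000001515788",
           "%08X%08X%08X%08X%08X",
           &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
    tli = tmp[0];                                   /* TLI 1 */
    start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];   /* 0/1000028 */
    end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];     /* 0/1515788 */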
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
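Once the pg_proc.dat entries from this patch are installed, these functions
can be exercised from SQL; the output columns are assumptions based on the
code here, since the catalog entries aren't shown in this excerpt:

    SELECT * FROM pg_available_wal_summaries();
    SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1515788');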
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 651b85ea74..b163e89cbb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -564,6 +567,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -933,6 +937,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1835,6 +1842,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2659,6 +2669,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3012,6 +3024,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3130,6 +3143,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3525,6 +3552,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3675,6 +3708,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3701,6 +3736,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3798,6 +3834,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4019,6 +4056,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5326,6 +5365,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5462,6 +5505,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..7c840c36b3
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1398 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the TLI and LSN at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but seems
+ * reasonable to treat like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory, and ask
+ * GetOldestUnsummarizedLSN to reset pending_lsn to summarized_lsn. We
+ * might be recovering from an error, and if so, pending_lsn might have
+ * advanced past summarized_lsn, but any WAL we read previously has been
+ * lost and will need to be reread.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the LSN is
+ * necessarily the start of a WAL record and false if it's just the beginning
+ * of a WAL segment.
+ *
+ * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
+ * be equal to the summarized_lsn.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
+ bool reset_pending_lsn)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
+ * in shared mode and try to fetch the required information. If we acquire
+ * in shared mode and find that the data structure hasn't been
+ * initialized, we reacquire the lock in exclusive mode so that we can
+ * initialize it. However, if someone else does that first before we get
+ * the lock, then we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ if (reset_pending_lsn)
+ WalSummarizerCtl->pending_lsn =
+ WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %u", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the discovered values to the caller, as required. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
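For context, a sketch of how a caller (such as the incremental backup code in
a later patch in this series) might use this; backup_start_lsn and the timeout
are illustrative:

    XLogRecPtr  pending_lsn;
    XLogRecPtr  summarized_lsn;

    summarized_lsn = WaitForWalSummarization(backup_start_lsn, 60000,
                                             &pending_lsn);
    if (summarized_lsn < backup_start_lsn)
        ereport(ERROR,
                errmsg("timed out waiting for WAL summarization"),
                errdetail("Summarization has reached %X/%X, pending up to %X/%X.",
                          LSN_FORMAT_ARGS(summarized_lsn),
                          LSN_FORMAT_ARGS(pending_lsn)));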
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Whoops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to reading from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
+/*
+ * Remove any WAL summary files that are older than wal_summary_keep_time,
+ * doing this work at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here.
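+	 *
+	 * wal_summary_keep_time is measured in minutes (GUC_UNIT_MIN), so, for
+	 * example, the default of 10 days works out to a cutoff of
+	 * 14400 * 60 = 864000 seconds before the current time.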
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the
+ * summary file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index d99ecdd4d8..0dd9b98b3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -306,7 +306,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -328,6 +329,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f7c9882f7c..9f59440526 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -703,6 +704,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1786,6 +1789,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3200,6 +3213,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
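+ *
+ * As a worked example (illustrative only, not part of the design itself):
+ * block 70000 falls in chunk 70000 / BLOCKS_PER_CHUNK = 1, at offset
+ * 70000 % BLOCKS_PER_CHUNK = 4464. In array form, that chunk stores the
+ * uint16 value 4464; in bitmap form, bit 4464 % 16 = 0 of word
+ * 4464 / 16 = 279 within the chunk is set.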
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
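+
+/*
+ * Sketch of the resulting file layout, as implemented by the reader and
+ * writer code below: a 4-byte magic number (BLOCKREFTABLE_MAGIC); then,
+ * for each relation fork in sorted order, a BlockRefTableSerializedEntry,
+ * the truncated chunk-usage array (nchunks uint16s), and the used portion
+ * of each nonempty chunk; then an all-zeroes sentinel entry; and finally a
+ * 4-byte CRC-32C covering everything that precedes it.
+ */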
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * table reference file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
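+ *
+ * For example, truncating a fork to 100 blocks sets the limit block to 100
+ * and forgets any recorded modifications to blocks 100 and above; if block
+ * 150 is later marked modified again, that implies the relation was
+ * re-extended in the meantime.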
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
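+ *
+ * A caller scanning a whole fork in batches (a sketch of hypothetical
+ * caller code, not from this patch) might loop like this:
+ *
+ *     BlockNumber blocks[16];
+ *     BlockNumber start = 0;
+ *     int n;
+ *
+ *     while ((n = BlockRefTableEntryGetBlocks(entry, start,
+ *                                             InvalidBlockNumber,
+ *                                             blocks, 16)) > 0)
+ *     {
+ *         // ... consume blocks[0 .. n-1] ...
+ *         start = blocks[n - 1] + 1;
+ *     }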
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
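+ *
+ * A typical read loop (a sketch of a hypothetical caller, not code from
+ * this patch; process_modified_blocks is imaginary) looks like:
+ *
+ *     RelFileLocator rlocator;
+ *     ForkNumber forknum;
+ *     BlockNumber limit_block;
+ *     BlockNumber blocks[256];
+ *     unsigned nblocks;
+ *
+ *     while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                            &limit_block))
+ *     {
+ *         while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                        lengthof(blocks))) > 0)
+ *             process_modified_blocks(&rlocator, forknum, limit_block,
+ *                                     blocks, nblocks);
+ *     }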
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
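+ *
+ * The expected calling pattern, sketched with a hypothetical source of
+ * pre-sorted entries (my_write_callback, my_state, and
+ * next_entry_in_sorted_order are imaginary):
+ *
+ *     writer = CreateBlockRefTableWriter(my_write_callback, &my_state);
+ *     while ((entry = next_entry_in_sorted_order()) != NULL)
+ *     {
+ *         BlockRefTableWriteEntry(writer, entry);
+ *         BlockRefTableFreeEntry(entry);
+ *     }
+ *     DestroyBlockRefTableWriter(writer);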
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be one or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
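As an aside for anyone reading the code above: the per-chunk encoding is easiest to see with concrete numbers. The following standalone sketch is not part of the patch, and the constant values are my reading of the top of blkreftable.c (not shown in this excerpt), so treat them as assumptions: each chunk covers 2^16 block numbers, each bitmap word holds 16 bits, and a chunk is promoted from an offset array to a bitmap once the array would cost as much memory as the bitmap.

/*
 * Standalone illustration of the chunk encoding; not part of the patch.
 * Constant values are assumptions based on my reading of blkreftable.c.
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCKS_PER_CHUNK (1 << 16)	/* assumed */
#define BLOCKS_PER_ENTRY (8 * (unsigned) sizeof(uint16_t))	/* 16 bits */
#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)

int
main(void)
{
    unsigned blknum = 200000;	/* arbitrary example block number */
    unsigned chunkno = blknum / BLOCKS_PER_CHUNK;
    unsigned chunkoffset = blknum % BLOCKS_PER_CHUNK;
    uint16_t word = 0;

    /* Array format stores chunkoffset itself as one uint16. */
    printf("block %u -> chunk %u, offset %u\n",
           blknum, chunkno, chunkoffset);

    /* Bitmap format stores the same fact as a single bit. */
    word |= 1 << (chunkoffset % BLOCKS_PER_ENTRY);
    printf("bitmap word %u, mask 0x%04x\n",
           chunkoffset / BLOCKS_PER_ENTRY, word);

    /*
     * An array of MAX_ENTRIES_PER_CHUNK uint16 offsets occupies exactly as
     * many bytes as the full bitmap, which is why the code above converts
     * to a bitmap when the array reaches that size.
     */
    printf("conversion threshold: %u entries\n", MAX_ENTRIES_PER_CHUNK);
    return 0;
}

With these values, block 200000 lands in chunk 3 at offset 3392, and the array-to-bitmap conversion happens at 4096 entries per chunk.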
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
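The WalSummaryIO struct and the ReadWalSummary/WriteWalSummary callbacks are shaped to plug directly into the block reference table reader and writer declared in common/blkreftable.h (added later in this patch). As a rough sketch of the intended call sequence on the read side, with error handling elided and no claim that this is exactly what the patch's own callers do:

#include "backup/walsummary.h"
#include "common/blkreftable.h"
#include "storage/fd.h"

/* Sketch only: dump the contents of one WAL summary file. */
static void
dump_wal_summary(WalSummaryFile *ws)
{
    WalSummaryIO io;
    BlockRefTableReader *reader;
    RelFileLocator rlocator;
    ForkNumber forknum;
    BlockNumber limit_block;
    BlockNumber blocks[256];

    io.file = OpenWalSummaryFile(ws, false);
    io.filepos = 0;

    /* ReadWalSummary has the io_callback_fn shape expected here. */
    reader = CreateBlockRefTableReader(ReadWalSummary, &io,
                                       FilePathName(io.file),
                                       ReportWalSummaryError, NULL);

    while (BlockRefTableReaderNextRelation(reader, &rlocator,
                                           &forknum, &limit_block))
    {
        unsigned n;

        /* Fetch the modified block numbers for this fork in batches. */
        while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
                                                 lengthof(blocks))) > 0)
        {
            /* ... consume blocks[0 .. n-1] ... */
        }
    }

    DestroyBlockRefTableReader(reader);
    FileClose(io.file);
}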
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77e8b13764..916c8ec8d0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12099,4 +12099,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
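To make the incremental-write contract above concrete, here is a sketch of how I'd expect a caller to emit a single sorted entry, pairing this API with WriteWalSummary from walsummary.h. The fork and block number are arbitrary examples and error handling is elided; this is an illustration, not code from the patch:

#include "backup/walsummary.h"
#include "common/blkreftable.h"

/* Sketch only: write one entry to an already-opened summary file. */
static void
write_one_entry(WalSummaryIO *io, RelFileLocator rlocator)
{
    BlockRefTableWriter *writer;
    BlockRefTableEntry *entry;

    /* WriteWalSummary has the io_callback_fn shape expected here. */
    writer = CreateBlockRefTableWriter(WriteWalSummary, io);

    /* Entries must be supplied in sorted order; here there is just one. */
    entry = CreateBlockRefTableEntry(rlocator, MAIN_FORKNUM);
    BlockRefTableEntrySetLimitBlock(entry, 0);	/* e.g. fork created */
    BlockRefTableEntryMarkBlockModified(entry, MAIN_FORKNUM, 42);
    BlockRefTableWriteEntry(writer, entry);
    BlockRefTableFreeEntry(entry);

    /* Emits the terminating sentinel entry and the CRC. */
    DestroyBlockRefTableWriter(writer);
}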
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1043a4d782..74bc2f97cb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,6 +336,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -442,6 +443,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -454,6 +456,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..180d3f34b9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact,
+ bool reset_pending_lsn);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
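For what it's worth, here is how I imagine a backup path consuming this interface while waiting for the summarizer to catch up to the backup start LSN. This is a sketch under stated assumptions, not code from the patch: I'm assuming the timeout is in milliseconds and that the return value is the LSN up to which WAL has been summarized so far.

#include "postgres.h"
#include "postmaster/walsummarizer.h"

/* Sketch only: block (up to an assumed 60s) until summaries reach lsn. */
static bool
summaries_have_reached(XLogRecPtr lsn)
{
    XLogRecPtr summarized;
    XLogRecPtr pending;

    summarized = WaitForWalSummarization(lsn, 60 * 1000L, &pending);

    /* pending presumably reports how far the summarizer has read. */
    return summarized >= lsn;
}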
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba41149b88..9390049314 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4012,3 +4012,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
Attachment: v15-0002-Move-src-bin-pg_verifybackup-parse_manifest.c-in.patch (application/octet-stream)
From 4d04e3756bea8b2422f161e4456defe338a2d0ad Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 3 Oct 2023 13:32:45 -0400
Subject: [PATCH v15 2/6] Move src/bin/pg_verifybackup/parse_manifest.c into
src/common.
This makes it possible for the code to be easily reused by other
client-side tools, and/or by the server.
---
src/bin/pg_verifybackup/Makefile | 1 -
src/bin/pg_verifybackup/meson.build | 1 -
src/bin/pg_verifybackup/nls.mk | 4 ++--
src/bin/pg_verifybackup/pg_verifybackup.c | 2 +-
src/common/Makefile | 1 +
src/common/meson.build | 1 +
src/{bin/pg_verifybackup => common}/parse_manifest.c | 4 ++--
src/{bin/pg_verifybackup => include/common}/parse_manifest.h | 2 +-
8 files changed, 8 insertions(+), 8 deletions(-)
rename src/{bin/pg_verifybackup => common}/parse_manifest.c (99%)
rename src/{bin/pg_verifybackup => include/common}/parse_manifest.h (97%)
diff --git a/src/bin/pg_verifybackup/Makefile b/src/bin/pg_verifybackup/Makefile
index c96323faa9..7c045f142e 100644
--- a/src/bin/pg_verifybackup/Makefile
+++ b/src/bin/pg_verifybackup/Makefile
@@ -21,7 +21,6 @@ LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
OBJS = \
$(WIN32RES) \
- parse_manifest.o \
pg_verifybackup.o
all: pg_verifybackup
diff --git a/src/bin/pg_verifybackup/meson.build b/src/bin/pg_verifybackup/meson.build
index 9369da1bc6..58f780d1a6 100644
--- a/src/bin/pg_verifybackup/meson.build
+++ b/src/bin/pg_verifybackup/meson.build
@@ -1,7 +1,6 @@
# Copyright (c) 2022-2023, PostgreSQL Global Development Group
pg_verifybackup_sources = files(
- 'parse_manifest.c',
'pg_verifybackup.c'
)
diff --git a/src/bin/pg_verifybackup/nls.mk b/src/bin/pg_verifybackup/nls.mk
index eba73a2c05..9e6a6049ba 100644
--- a/src/bin/pg_verifybackup/nls.mk
+++ b/src/bin/pg_verifybackup/nls.mk
@@ -1,10 +1,10 @@
# src/bin/pg_verifybackup/nls.mk
CATALOG_NAME = pg_verifybackup
GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
- parse_manifest.c \
pg_verifybackup.c \
../../common/fe_memutils.c \
- ../../common/jsonapi.c
+ ../../common/jsonapi.c \
+ ../../common/parse_manifest.c
GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS) \
json_manifest_parse_failure:2 \
error_cb:2 \
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index d921d0f003..88081f66f7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -20,9 +20,9 @@
#include "common/hashfn.h"
#include "common/logging.h"
+#include "common/parse_manifest.h"
#include "fe_utils/simple_list.h"
#include "getopt_long.h"
-#include "parse_manifest.h"
#include "pgtime.h"
/*
diff --git a/src/common/Makefile b/src/common/Makefile
index ce4535d7fe..1092dc63df 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -66,6 +66,7 @@ OBJS_COMMON = \
kwlookup.o \
link-canary.o \
md5_common.o \
+ parse_manifest.o \
percentrepl.o \
pg_get_line.o \
pg_lzcompress.o \
diff --git a/src/common/meson.build b/src/common/meson.build
index 8be145c0fb..d52dd12bc9 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -18,6 +18,7 @@ common_sources = files(
'kwlookup.c',
'link-canary.c',
'md5_common.c',
+ 'parse_manifest.c',
'percentrepl.c',
'pg_get_line.c',
'pg_lzcompress.c',
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/common/parse_manifest.c
similarity index 99%
rename from src/bin/pg_verifybackup/parse_manifest.c
rename to src/common/parse_manifest.c
index 850adf90a8..9f52bfa83b 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -6,15 +6,15 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.c
+ * src/common/parse_manifest.c
*
*-------------------------------------------------------------------------
*/
#include "postgres_fe.h"
-#include "parse_manifest.h"
#include "common/jsonapi.h"
+#include "common/parse_manifest.h"
/*
* Semantic states for JSON manifest parsing.
diff --git a/src/bin/pg_verifybackup/parse_manifest.h b/src/include/common/parse_manifest.h
similarity index 97%
rename from src/bin/pg_verifybackup/parse_manifest.h
rename to src/include/common/parse_manifest.h
index 001b9a6a11..811c9149f4 100644
--- a/src/bin/pg_verifybackup/parse_manifest.h
+++ b/src/include/common/parse_manifest.h
@@ -6,7 +6,7 @@
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
- * src/bin/pg_verifybackup/parse_manifest.h
+ * src/include/common/parse_manifest.h
*
*-------------------------------------------------------------------------
*/
--
2.39.3 (Apple Git-145)
Attachment: v15-0004-Add-support-for-incremental-backup.patch (application/octet-stream)
From 7b1c4c8af3182e15e6c4f31db3bd64ba3862d69a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v15 4/6] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 240 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5834 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest to an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and simply take full backups, which are simpler
+ to manage. For a large database, all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee98585027..b5624ca884 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
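(Putting the two pieces together, the wire-level sequence for an incremental backup is, roughly sketched: UPLOAD_MANIFEST, followed by the contents of the prior backup's manifest, and then BASE_BACKUP ( INCREMENTAL, ... ). pg_basebackup issues this sequence for you when --incremental is given, so few people should need to speak it by hand.)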
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..e1729671a5
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,240 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove the one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-n</option></term>
+ <term><option>--dry-run</option></term>
+ <listitem>
+ <para>
+ The <option>-n</option>/<option>--dry-run</option> option instructs
+ <command>pg_combinebackup</command> to figure out what would be done
+ without actually creating the target directory or any output files.
+ It is particularly useful in combination with <option>--debug</option>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and
+ exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command
+ line arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
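(So, concretely, an incremental backup's backup_label gains two extra lines in this format, with invented LSN and TLI values here for illustration:

INCREMENTAL FROM LSN: 0/4000028
INCREMENTAL FROM TLI: 1

and the read_backup_label() change below uses the first of them to refuse to run recovery directly on an incremental backup.)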
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a2c8fa3981..6f4f81f992 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
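+ *
+ * The header layout (which GetIncrementalFileSize() mirrors) is a uint32
+ * magic number, a uint32 count of incremental blocks, and a uint32
+ * truncation block length, followed by one BlockNumber per included
+ * block; the block contents themselves come after the header.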
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
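To make the push_to_sink() contract concrete, here's a minimal sketch of
a hypothetical caller - the function name is invented, but the flush
protocol mirrors what sendFile() does for the incremental header:

static void
emit_example_header(bbsink *sink, pg_checksum_context *ctx)
{
	uint32		magic = INCREMENTAL_MAGIC;
	uint32		nblocks = 0;
	size_t		done = 0;

	/* Copy small items into the sink's buffer, flushing when it fills. */
	push_to_sink(sink, ctx, &done, &magic, sizeof(magic));
	push_to_sink(sink, ctx, &done, &nblocks, sizeof(nblocks));

	/* push_to_sink() flushes only full buffers; the caller flushes the tail. */
	if (done > 0)
	{
		bbsink_archive_contents(sink, done);
		if (pg_checksum_update(ctx, (uint8 *) sink->bbs_buffer, done) < 0)
			elog(ERROR, "could not update checksum");
	}
}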
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * support for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
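+
+/*
+ * Instantiate a simplehash table keyed by file pathname. This generates
+ * the backup_file_create, backup_file_insert, and backup_file_lookup
+ * functions used below.
+ */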
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
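+ *
+ * (For scale: at roughly 1 bit per 8kB block, even a terabyte of heavily
+ * modified relation data should need only on the order of 16MB here.)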
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
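+ *
+ * For example, if this server's history is TLI 1 -> 2 -> 3 and the
+ * manifest contains WAL ranges for TLIs 1 and 2, readTimeLineHistory
+ * returns entries ordered 3, 2, 1, so we end up with
+ * latest_wal_range_tli = 2 and earliest_wal_range_tli = 1.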
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to 10 seconds. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
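+ *
+ * For example, "base/5/16384.1" maps to "base/5/INCREMENTAL.16384.1",
+ * and "base/5/16384" (segment 0) maps to "base/5/INCREMENTAL.16384".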
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
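+ *
+ * For example, with BLCKSZ = 8192, a 1000-block segment with 950
+ * modified blocks has 950 * 8192 > 0.9 * (1000 * 8192), so it would be
+ * backed up fully.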
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+extern size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
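As a quick cross-check of GetIncrementalFileSize(), here's a sketch
assuming the default BLCKSZ of 8192 and 4-byte block numbers (not part
of the patch):

static void
check_incremental_file_size(void)
{
	/* Header only: magic + block count + truncation block length. */
	Assert(GetIncrementalFileSize(0) == 12);

	/* Three blocks: 12 + 3 * (4 + 8192) = 24600 bytes. */
	Assert(GetIncrementalFileSize(3) == 24600);
}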
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 5d4ebe3ebe..2a6a2dc7c0 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
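For reference, here's a hedged sketch of how a client might drive the new
command over a replication connection using libpq - error handling is
omitted, the manifest bytes are assumed to already be in memory, and the
BASE_BACKUP option spelling just follows the generic option syntax:

PGresult   *res = PQexec(conn, "UPLOAD_MANIFEST");

if (PQresultStatus(res) == PGRES_COPY_IN)
{
	/* One or more CopyData messages, then CopyDone. */
	PQputCopyData(conn, manifest_data, manifest_len);
	PQputCopyEnd(conn, NULL);
	PQclear(PQgetResult(conn));
}
PQclear(res);

/* The walsender now retains the manifest for this connection. */
res = PQexec(conn, "BASE_BACKUP (INCREMENTAL)");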
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may override them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these, as we do elsewhere in COPY handling. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0e0ac22bdd..706140eb9f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -32,6 +32,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -140,6 +141,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -337,6 +339,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
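+ /* bits: 1 = start LSN, 2 = start TLI, 4 = previous LSN, 8 = previous TLI */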
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
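+ /* Use the same temporary NUL-termination trick as parse_lsn(). */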
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
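+ /* arbitrary buffer size, big enough to keep per-syscall overhead low */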
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex characters, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produces an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..85d3f4e5de
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "dnNo:T:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version * 10000, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
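+ /*
+ * As we walk the chain backwards, check_tli and check_lsn hold the
+ * expected values for the next-older backup's START TIMELINE and
+ * START WAL LOCATION lines.
+ */
+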
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the created directory to the list of directories to be cleaned up
+ * at process exit.
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ uint64 oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || *s == '0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If we are processing a user-defined tablespace, tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * input_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling, which happens elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
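+ * For example, an input file named "INCREMENTAL.16384" will be
+ * reconstructed and written out as "16384".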
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ /* Copy the payload so that the pfree below remains safe. */
+ checksum_length = mfile->checksum_length;
+ checksum_payload = pg_malloc(checksum_length);
+ memcpy(checksum_payload, mfile->checksum_payload,
+ checksum_length);
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", ifulldir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format. (E.g., if PG_VERSION contains "14\n", this function
+ * will return 140000.)
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used a multi-part version number (e.g. 9.6 or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno != 0)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
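+ * They are updated as blocks are read during reconstruction and are used
+ * afterward by debug_reconstruction() for logging and dry-run sanity
+ * checks.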
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
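+ *
+ * Slot 0 of this array is the oldest backup in the chain, and slot
+ * n_prior_backups is the newest, i.e. the incremental file that we have
+ * been asked to reconstruct.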
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
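+
+ /*
+ * For example, if block 3 of the output file is stored as the i'th block
+ * of some incremental source file f, we will end up with
+ * sourcemap[3] == f and offsetmap[3] == f->header_length + i * BLCKSZ;
+ * for a full source file, the offset is simply 3 * BLCKSZ.
+ */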
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without ever
+ * taking any action on those blocks that would have generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
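+ *
+ * The header format, as read below, is a uint32 magic number
+ * (INCREMENTAL_MAGIC), an unsigned block count, an unsigned truncation
+ * block length, and then one BlockNumber for each stored block; the
+ * block data itself follows the header.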
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != (ssize_t) length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, (int) length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block, s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
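+ /*
+ * The output blocks are written strictly in order, so plain write() calls
+ * suffice; only the reads from the source files need to be positioned,
+ * which is why pg_pread() is used below.
+ */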
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ int wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ int rb;
+
+ /* Read the block from the correct source file. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps, there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
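+# Row (3, 'crab') should be missing, since it was inserted on node1 only
+# after node2 had already been created from backup2.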
+my $result = $node3->safe_psql('postgres', <<EOM);
+select string_agg(a::text, ':'), string_agg(b, ':') from mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'node3 contains the expected rows');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
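To make the intended call sequence concrete, here is a minimal sketch
of how a frontend tool might drive this API (the metadata variables
are hypothetical placeholders; only the three functions and their
signatures come from the header above):

    manifest_writer *mwriter = create_manifest_writer(output_directory);

    /* Once per file included in the synthetic full backup. */
    add_file_to_manifest(mwriter, "base/5/16384", file_size, file_mtime,
                         CHECKSUM_TYPE_SHA256, checksum_length,
                         checksum_payload);

    /* Emit the WAL-Ranges list and manifest checksum, then close. */
    finalize_manifest(mwriter, first_wal_range);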
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
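As a rough illustration of how the server side is expected to consult
this API for each relation segment (the variable names are placeholders,
not part of the patch):

    unsigned num_blocks;
    unsigned truncation_block_length;
    FileBackupMethod method;

    method = GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber,
                                 forknum, segno, size,
                                 &num_blocks, relative_block_numbers,
                                 &truncation_block_length);
    if (method == BACK_UP_FILE_INCREMENTALLY)
        /* send GetIncrementalFileSize(num_blocks) bytes instead */ ;
    else
        /* BACK_UP_FILE_FULLY: send the file as in a full backup */ ;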
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9390049314..e37ef9aa76 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4023,3 +4023,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
v15-0006-Test-patch-Enable-summarize_wal-by-default.patch
From 962477f1009c04d3e6fe857b5341ec36b5466b95 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v15 6/6] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on when wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b163e89cbb..51dc517710 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -937,9 +937,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..71025b43b7 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9f59440526..f249a9fad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1795,7 +1795,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course they do differ after initdb), but I get this
instead:

$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also, in the manifest I don't see the DBID?

Maybe it's a nuisance, and all I'm trying to say is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without an errhint() is going to
scare some junior DBAs.
Yeah. I think we should add the system identifier to the manifest, but
I think that should be left for a future project, as I don't think the
lack of it is a good reason to stop all progress here. When we have
that, we can give more reliable error messages about system mismatches
at an earlier stage. Unfortunately, I don't think that the timeline
messages you're seeing here are going to apply in every case: suppose
you have two unrelated servers that are both on timeline 1. I think
you could use a base backup from one of those servers and use it as
the basis for the incremental from the other, and I think that if you
did it right you might fail to hit any sanity check that would block
that. pg_combinebackup will realize there's a problem, because it has
the whole cluster to work with, not just the manifest, and will notice
the mismatching system identifiers, but that's kind of late to find
out that you made a big mistake. However, right now, it's the best we
can do.
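For illustration, if a future manifest version did carry the system
identifier, the header emitted by create_manifest_writer might look
something like this (hypothetical; the current format has no such key):

    { "PostgreSQL-Backup-Manifest-Version": 1,
      "System-Identifier": 7313866241234567890,
      "Files": [ ... ] }

With that in place, pg_basebackup could compare the value against the
server before streaming anything, instead of relying on timeline sanity
checks or on pg_combinebackup noticing the mismatch much later.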
The incrementals are being generated, but just for the first (0)
segment of the relation?
I committed the first two patches from the series I posted yesterday.
The first should fix this, and the second relocates parse_manifest.c.
That patch hasn't changed in a while and seems unlikely to attract
major objections. There's no real reason to commit it until we're
ready to move forward with the main patches, but I think we're very
close to that now, so I did.
Here's a rebase for cfbot.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v16-0003-Add-new-pg_walsummary-tool.patch
From fc964ce88549e3d44f6641496569a6a67145b5b3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 13:01:06 -0400
Subject: [PATCH v16 3/4] Add new pg_walsummary tool.
This can dump the contents of WAL summary files, either those in
pg_wal/summaries, or the INCREMENTAL_BACKUP files that are part of
an incremental backup proper.
XXX. Needs tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/backend/postmaster/walsummarizer.c | 4 +-
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 42 ++++
src/bin/pg_walsummary/meson.build | 24 +++
src/bin/pg_walsummary/nls.mk | 6 +
src/bin/pg_walsummary/pg_walsummary.c | 280 +++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
12 files changed, 483 insertions(+), 2 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/nls.mk
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ destroyed within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 7c840c36b3..9fa155349e 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -290,7 +290,7 @@ WalSummarizerMain(void)
FlushErrorState();
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Now we can allow interrupts again */
RESUME_INTERRUPTS();
@@ -342,7 +342,7 @@ WalSummarizerMain(void)
XLogRecPtr end_of_summary_lsn;
/* Flush any leaked data in the top-level context */
- MemoryContextResetAndDeleteChildren(context);
+ MemoryContextReset(context);
/* Process any signals received recently. */
HandleWalSummarizerInterrupts();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..852f7208f6
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,42 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..c2092960c6
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir()
+}
diff --git a/src/bin/pg_walsummary/nls.mk b/src/bin/pg_walsummary/nls.mk
new file mode 100644
index 0000000000..f411dcfe9e
--- /dev/null
+++ b/src/bin/pg_walsummary/nls.mk
@@ -0,0 +1,6 @@
+# src/bin/pg_walsummary/nls.mk
+CATALOG_NAME = pg_walsummary
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ pg_walsummary.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "iq",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+static void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+static int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
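To give a feel for the tool's output, a summary range in which a
relation fork was truncated to 100 blocks and then had its first
thousand blocks modified would print something along these lines (all
values made up, following the printf formats in dump_one_relation
above; the "limit" line appears only when the fork was created or
truncated within the range):

    TS 1663, DB 5, REL 16384, FORK main: limit 100
    TS 1663, DB 5, REL 16384, FORK main: blocks 0..999

With --individual, the second line would instead expand into one
"block N" line per modified block.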
v16-0001-Add-a-new-WAL-summarizer-process.patch
From 033e4e3eccc82a0935eefc9a4f7fef28d0a7af40 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 25 Oct 2023 12:57:22 -0400
Subject: [PATCH v16 1/4] Add a new WAL summarizer process.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When active, this process writes WAL summary files to
$PGDATA/pg_wal/summaries. Each summary file contains information for a
certain range of LSNs on a certain TLI. For each relation, it stores a
"limit block" which is 0 if a relation is created or destroyed within
a certain range of WAL records, or otherwise the shortest length to
which the relation was truncated during that range of WAL records, or
otherwise InvalidBlockNumber. In addition, it stores a list of blocks
which have been modified during that range of WAL records, but
excluding blocks which were removed by truncation after they were
modified and never subsequently modified again. In other words, it
tells us which blocks need to copied in case of an incremental backup
covering that range of WAL records.
A new parameter summarize_wal enables or disables this new background
process. The background process also automatically deletes summary
files that are older than wal_summarize_keep_time, if that parameter
has a non-zero value and the summarizer is configured to run.
Patch by me, with some design help from Dilip Kumar. Reviewed by
Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut,
and Álvaro Herrera.
---
doc/src/sgml/config.sgml | 61 +
src/backend/access/transam/xlog.c | 101 +-
src/backend/backup/Makefile | 4 +-
src/backend/backup/meson.build | 2 +
src/backend/backup/walsummary.c | 356 +++++
src/backend/backup/walsummaryfuncs.c | 169 ++
src/backend/postmaster/Makefile | 1 +
src/backend/postmaster/auxprocess.c | 8 +
src/backend/postmaster/meson.build | 1 +
src/backend/postmaster/postmaster.c | 56 +
src/backend/postmaster/walsummarizer.c | 1398 +++++++++++++++++
src/backend/storage/lmgr/lwlocknames.txt | 1 +
src/backend/utils/activity/pgstat_io.c | 4 +-
.../utils/activity/wait_event_names.txt | 5 +
src/backend/utils/init/miscinit.c | 3 +
src/backend/utils/misc/guc_tables.c | 26 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/bin/initdb/initdb.c | 1 +
src/common/Makefile | 1 +
src/common/blkreftable.c | 1308 +++++++++++++++
src/common/meson.build | 1 +
src/include/access/xlog.h | 1 +
src/include/backup/walsummary.h | 49 +
src/include/catalog/pg_proc.dat | 19 +
src/include/common/blkreftable.h | 116 ++
src/include/miscadmin.h | 3 +
src/include/postmaster/walsummarizer.h | 33 +
src/include/storage/proc.h | 9 +-
src/include/utils/guc_tables.h | 1 +
src/tools/pgindent/typedefs.list | 11 +
30 files changed, 3743 insertions(+), 11 deletions(-)
create mode 100644 src/backend/backup/walsummary.c
create mode 100644 src/backend/backup/walsummaryfuncs.c
create mode 100644 src/backend/postmaster/walsummarizer.c
create mode 100644 src/common/blkreftable.c
create mode 100644 src/include/backup/walsummary.h
create mode 100644 src/include/common/blkreftable.h
create mode 100644 src/include/postmaster/walsummarizer.h
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44cada2b40..ee98585027 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4150,6 +4150,67 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
</variablelist>
</sect2>
+ <sect2 id="runtime-config-wal-summarization">
+ <title>WAL Summarization</title>
+
+ <!--
+ <para>
+ These settings control WAL summarization, a feature which must be
+ enabled in order to perform an
+ <link linkend="backup-incremental-backup">incremental backup</link>.
+ </para>
+ -->
+
+ <variablelist>
+ <varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
+ <term><varname>summarize_wal</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>summarize_wal</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables the WAL summarizer process. Note that WAL summarization can
+ be enabled either on a primary or on a standby. WAL summarization
+ cannot be enabled when <varname>wal_level</varname> is set to
+ <literal>minimal</literal>. This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-wal-summary-keep-time" xreflabel="wal_summary_keep_time">
+ <term><varname>wal_summary_keep_time</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>wal_summary_keep_time</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Configures the amount of time after which the WAL summarizer
+ automatically removes old WAL summaries. The file timestamp is used to
+ determine which files are old enough to remove. Typically, you should set
+ this comfortably higher than the time that could pass between a backup
+ and a later incremental backup that depends on it. WAL summaries must
+ be available for the entire range of WAL records between the preceding
+ backup and the new one being taken; if not, the incremental backup will
+ fail. If this parameter is set to zero, WAL summaries will not be
+ automatically deleted, but it is safe to manually remove files that you
+ know will not be required for future incremental backups.
+ This parameter can only be set in the
+ <filename>postgresql.conf</filename> file or on the server command line.
+ The default is 10 days. If <literal>summarize_wal = off</literal>,
+ existing WAL summaries will not be removed regardless of the value of
+ this parameter, because the WAL summarizer will not run.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </sect2>
+
</sect1>
<sect1 id="runtime-config-replication">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 01e0484584..421a016ca1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -77,6 +77,7 @@
#include "port/pg_iovec.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logical.h"
#include "replication/origin.h"
@@ -3589,6 +3590,43 @@ XLogGetLastRemovedSegno(void)
return lastRemovedSegNo;
}
+/*
+ * Return the oldest WAL segment on the given TLI that still exists in
+ * XLOGDIR, or 0 if none.
+ */
+XLogSegNo
+XLogGetOldestSegno(TimeLineID tli)
+{
+ DIR *xldir;
+ struct dirent *xlde;
+ XLogSegNo oldest_segno = 0;
+
+ xldir = AllocateDir(XLOGDIR);
+ while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+ {
+ TimeLineID file_tli;
+ XLogSegNo file_segno;
+
+ /* Ignore files that are not XLOG segments. */
+ if (!IsXLogFileName(xlde->d_name))
+ continue;
+
+ /* Parse filename to get TLI and segno. */
+ XLogFromFileName(xlde->d_name, &file_tli, &file_segno,
+ wal_segment_size);
+
+ /* Ignore anything that's not from the TLI of interest. */
+ if (tli != file_tli)
+ continue;
+
+ /* If it's the oldest so far, update oldest_segno. */
+ if (oldest_segno == 0 || file_segno < oldest_segno)
+ oldest_segno = file_segno;
+ }
+
+ FreeDir(xldir);
+ return oldest_segno;
+}
/*
* Update the last removed segno pointer in shared memory, to reflect that the
@@ -3869,8 +3907,8 @@ RemoveXlogFile(const struct dirent *segment_de,
}
/*
- * Verify whether pg_wal and pg_wal/archive_status exist.
- * If the latter does not exist, recreate it.
+ * Verify whether pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * If the latter do not exist, recreate them.
*
* It is not the goal of this function to verify the contents of these
* directories, but to help in cases where someone has performed a cluster
@@ -3913,6 +3951,26 @@ ValidateXLOGDirectoryStructure(void)
(errmsg("could not create missing directory \"%s\": %m",
path)));
}
+
+ /* Check for summaries */
+ snprintf(path, MAXPGPATH, XLOGDIR "/summaries");
+ if (stat(path, &stat_buf) == 0)
+ {
+ /* Check for weird cases where it exists but isn't a directory */
+ if (!S_ISDIR(stat_buf.st_mode))
+ ereport(FATAL,
+ (errmsg("required WAL directory \"%s\" does not exist",
+ path)));
+ }
+ else
+ {
+ ereport(LOG,
+ (errmsg("creating missing WAL directory \"%s\"", path)));
+ if (MakePGDirectory(path) < 0)
+ ereport(FATAL,
+ (errmsg("could not create missing directory \"%s\": %m",
+ path)));
+ }
}
/*
@@ -5237,9 +5295,9 @@ StartupXLOG(void)
#endif
/*
- * Verify that pg_wal and pg_wal/archive_status exist. In cases where
- * someone has performed a copy for PITR, these directories may have been
- * excluded and need to be re-created.
+ * Verify that pg_wal, pg_wal/archive_status, and pg_wal/summaries exist.
+ * In cases where someone has performed a copy for PITR, these directories
+ * may have been excluded and need to be re-created.
*/
ValidateXLOGDirectoryStructure();
@@ -6956,6 +7014,25 @@ CreateCheckPoint(int flags)
*/
END_CRIT_SECTION();
+ /*
+ * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
+ * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
+ * where (a) we're not inside of a critical section and (b) we can be
+ * certain that the relevant record has been flushed to disk, which must
+ * happen before it can be summarized.
+ *
+ * If this is a shutdown checkpoint, then this happens reasonably
+ * promptly: we've only just inserted and flushed the
+ * XLOG_CHECKPOINT_SHUTDOWN record. If this is not a shutdown checkpoint,
+ * then this might not be very prompt at all: the XLOG_CHECKPOINT_REDO
+ * record was written before we began flushing data to disk, and that
+ * could be many minutes ago at this point. However, we don't XLogFlush()
+ * after inserting that record, so we're not guaranteed that it's on disk
+ * until after the above call that flushes the XLOG_CHECKPOINT_ONLINE
+ * record.
+ */
+ SetWalSummarizerLatch();
+
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
@@ -7630,6 +7707,20 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
}
}
+ /*
+ * If WAL summarization is in use, don't remove WAL that has yet to be
+ * summarized.
+ */
+ keep = GetOldestUnsummarizedLSN(NULL, NULL, false);
+ if (keep != InvalidXLogRecPtr)
+ {
+ XLogSegNo unsummarized_segno;
+
+ XLByteToSeg(keep, unsummarized_segno, wal_segment_size);
+ if (unsummarized_segno < segno)
+ segno = unsummarized_segno;
+ }
+
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index b21bd8ff43..a67b3c58d4 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -25,6 +25,8 @@ OBJS = \
basebackup_server.o \
basebackup_sink.o \
basebackup_target.o \
- basebackup_throttle.o
+ basebackup_throttle.o \
+ walsummary.o \
+ walsummaryfuncs.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 11a79bbf80..5d4ebe3ebe 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -12,4 +12,6 @@ backend_sources += files(
'basebackup_target.c',
'basebackup_throttle.c',
'basebackup_zstd.c',
+ 'walsummary.c',
+ 'walsummaryfuncs.c',
)
diff --git a/src/backend/backup/walsummary.c b/src/backend/backup/walsummary.c
new file mode 100644
index 0000000000..271d199874
--- /dev/null
+++ b/src/backend/backup/walsummary.c
@@ -0,0 +1,356 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.c
+ * Functions for accessing and managing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "backup/walsummary.h"
+#include "utils/wait_event.h"
+
+static bool IsWalSummaryFilename(char *filename);
+static int ListComparatorForWalSummaryFiles(const ListCell *a,
+ const ListCell *b);
+
+/*
+ * Get a list of WAL summaries.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ *
+ * The intent is that you can call GetWalSummaries(tli, start_lsn, end_lsn)
+ * to get all WAL summaries on the indicated timeline that overlap the
+ * specified LSN range.
+ */
+List *
+GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ DIR *sdir;
+ struct dirent *dent;
+ List *result = NIL;
+
+ sdir = AllocateDir(XLOGDIR "/summaries");
+ while ((dent = ReadDir(sdir, XLOGDIR "/summaries")) != NULL)
+ {
+ WalSummaryFile *ws;
+ uint32 tmp[5];
+ TimeLineID file_tli;
+ XLogRecPtr file_start_lsn;
+ XLogRecPtr file_end_lsn;
+
+ /* Decode filename, or skip if it's not in the expected format. */
+ if (!IsWalSummaryFilename(dent->d_name))
+ continue;
+ sscanf(dent->d_name, "%08X%08X%08X%08X%08X",
+ &tmp[0], &tmp[1], &tmp[2], &tmp[3], &tmp[4]);
+ file_tli = tmp[0];
+ file_start_lsn = ((uint64) tmp[1]) << 32 | tmp[2];
+ file_end_lsn = ((uint64) tmp[3]) << 32 | tmp[4];
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != file_tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn >= file_end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn <= file_start_lsn)
+ continue;
+
+ /* Add it to the list. */
+ ws = palloc(sizeof(WalSummaryFile));
+ ws->tli = file_tli;
+ ws->start_lsn = file_start_lsn;
+ ws->end_lsn = file_end_lsn;
+ result = lappend(result, ws);
+ }
+ FreeDir(sdir);
+
+ return result;
+}
+
+/*
+ * Build a new list of WAL summaries based on an existing list, but filtering
+ * out summaries that don't match the search parameters.
+ *
+ * If tli != 0, only WAL summaries with the indicated TLI will be included.
+ *
+ * If start_lsn != InvalidXLogRecPtr, only summaries that end after the
+ * indicated LSN will be included.
+ *
+ * If end_lsn != InvalidXLogRecPtr, only summaries that start before the
+ * indicated LSN will be included.
+ */
+List *
+FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ List *result = NIL;
+ ListCell *lc;
+
+ /* Loop over input. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ /* Skip if it doesn't match the filter criteria. */
+ if (tli != 0 && tli != ws->tli)
+ continue;
+ if (!XLogRecPtrIsInvalid(start_lsn) && start_lsn > ws->end_lsn)
+ continue;
+ if (!XLogRecPtrIsInvalid(end_lsn) && end_lsn < ws->start_lsn)
+ continue;
+
+ /* Add it to the result list. */
+ result = lappend(result, ws);
+ }
+
+ return result;
+}
+
+/*
+ * Check whether the supplied list of WalSummaryFile objects covers the
+ * whole range of LSNs from start_lsn to end_lsn. This function ignores
+ * timelines, so the caller should probably filter using the appropriate
+ * timeline before calling this.
+ *
+ * If the whole range of LSNs is covered, returns true, otherwise false.
+ * If false is returned, *missing_lsn is set either to InvalidXLogRecPtr
+ * if there are no WAL summary files in the input list, or to the first LSN
+ * in the range that is not covered by a WAL summary file in the input list.
+ */
+bool
+WalSummariesAreComplete(List *wslist, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn, XLogRecPtr *missing_lsn)
+{
+ XLogRecPtr current_lsn = start_lsn;
+ ListCell *lc;
+
+ /* Special case for empty list. */
+ if (wslist == NIL)
+ {
+ *missing_lsn = InvalidXLogRecPtr;
+ return false;
+ }
+
+ /* Make a private copy of the list and sort it by start LSN. */
+ wslist = list_copy(wslist);
+ list_sort(wslist, ListComparatorForWalSummaryFiles);
+
+ /*
+ * Consider summary files in order of increasing start_lsn, advancing the
+ * known-summarized range from start_lsn toward end_lsn.
+ *
+ * Normally, the summary files should cover non-overlapping WAL ranges,
+ * but this algorithm is intended to be correct even in case of overlap.
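+ *
+ * For example (hypothetical LSNs), summaries covering 0/100..0/200 and
+ * 0/180..0/300 together prove the range 0/100..0/300 complete, whereas
+ * 0/100..0/200 followed by 0/220..0/300 leaves a gap, so the sweep
+ * stops at 0/200 and reports that LSN via *missing_lsn.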
+ */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->start_lsn > current_lsn)
+ {
+ /* We found a gap. */
+ break;
+ }
+ if (ws->end_lsn > current_lsn)
+ {
+ /*
+ * Next summary extends beyond end of previous summary, so extend
+ * the end of the range known to be summarized.
+ */
+ current_lsn = ws->end_lsn;
+
+ /*
+ * If the range we know to be summarized has reached the required
+ * end LSN, we have proved completeness.
+ */
+ if (current_lsn >= end_lsn)
+ return true;
+ }
+ }
+
+ /*
+ * We either ran out of summary files without reaching the end LSN, or we
+ * hit a gap in the sequence that resulted in us bailing out of the loop
+ * above.
+ */
+ *missing_lsn = current_lsn;
+ return false;
+}
+
+/*
+ * Open a WAL summary file.
+ *
+ * This will throw an error in case of trouble. As an exception, if
+ * missing_ok = true and the trouble is specifically that the file does
+ * not exist, it will not throw an error and will return a value less than 0.
+ */
+File
+OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok)
+{
+ char path[MAXPGPATH];
+ File file;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ file = PathNameOpenFile(path, O_RDONLY);
+ if (file < 0 && (errno != ENOENT || !missing_ok))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ return file;
+}
+
+/*
+ * Remove a WAL summary file if the last modification time precedes the
+ * cutoff time.
+ */
+void
+RemoveWalSummaryIfOlderThan(WalSummaryFile *ws, time_t cutoff_time)
+{
+ char path[MAXPGPATH];
+ struct stat statbuf;
+
+ snprintf(path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ ws->tli,
+ LSN_FORMAT_ARGS(ws->start_lsn),
+ LSN_FORMAT_ARGS(ws->end_lsn));
+
+ if (lstat(path, &statbuf) != 0)
+ {
+ if (errno == ENOENT)
+ return;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ }
+ if (statbuf.st_mtime >= cutoff_time)
+ return;
+ if (unlink(path) != 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not stat file \"%s\": %m", path)));
+ ereport(DEBUG2,
+ (errmsg_internal("removing file \"%s\"", path)));
+}
+
+/*
+ * Test whether a filename looks like a WAL summary file.
+ */
+static bool
+IsWalSummaryFilename(char *filename)
+{
+ return strspn(filename, "0123456789ABCDEF") == 40 &&
+ strcmp(filename + 40, ".summary") == 0;
+}
+
+/*
+ * Data read callback for use with CreateBlockRefTableReader.
+ */
+int
+ReadWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileRead(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_READ);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read file \"%s\": %m",
+ FilePathName(io->file))));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Data write callback for use with WriteBlockRefTable.
+ */
+int
+WriteWalSummary(void *wal_summary_io, void *data, int length)
+{
+ WalSummaryIO *io = wal_summary_io;
+ int nbytes;
+
+ nbytes = FileWrite(io->file, data, length, io->filepos,
+ WAIT_EVENT_WAL_SUMMARY_WRITE);
+ if (nbytes < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": %m",
+ FilePathName(io->file))));
+ if (nbytes != length)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ FilePathName(io->file), nbytes,
+ length, (unsigned) io->filepos),
+ errhint("Check free disk space.")));
+
+ io->filepos += nbytes;
+ return nbytes;
+}
+
+/*
+ * Error-reporting callback for use with CreateBlockRefTableReader.
+ */
+void
+ReportWalSummaryError(void *callback_arg, char *fmt,...)
+{
+ StringInfoData buf;
+ va_list ap;
+ int needed;
+
+ initStringInfo(&buf);
+ for (;;)
+ {
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&buf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&buf, needed);
+ }
+ ereport(ERROR,
+ errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("%s", buf.data));
+}
+
+/*
+ * Comparator to sort a List of WalSummaryFile objects by start_lsn.
+ */
+static int
+ListComparatorForWalSummaryFiles(const ListCell *a, const ListCell *b)
+{
+ WalSummaryFile *ws1 = lfirst(a);
+ WalSummaryFile *ws2 = lfirst(b);
+
+ if (ws1->start_lsn < ws2->start_lsn)
+ return -1;
+ if (ws1->start_lsn > ws2->start_lsn)
+ return 1;
+ return 0;
+}
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
new file mode 100644
index 0000000000..a1f69ad4ba
--- /dev/null
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummaryfuncs.c
+ * SQL-callable functions for accessing WAL summary data.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/backend/backup/walsummaryfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/fmgrprotos.h"
+#include "utils/pg_lsn.h"
+
+#define NUM_WS_ATTS 3
+#define NUM_SUMMARY_ATTS 6
+#define MAX_BLOCKS_PER_CALL 256
+
+/*
+ * List the WAL summary files available in pg_wal/summaries.
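+ *
+ * Example usage (hypothetical output values):
+ *
+ *     SELECT * FROM pg_available_wal_summaries();
+ *
+ * Each row reports one summary file's TLI, start LSN, and end LSN.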
+ */
+Datum
+pg_available_wal_summaries(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ List *wslist;
+ ListCell *lc;
+ Datum values[NUM_WS_ATTS];
+ bool nulls[NUM_WS_ATTS];
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+
+ memset(nulls, 0, sizeof(nulls));
+
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = (WalSummaryFile *) lfirst(lc);
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = Int64GetDatum((int64) ws->tli);
+ values[1] = LSNGetDatum(ws->start_lsn);
+ values[2] = LSNGetDatum(ws->end_lsn);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ return (Datum) 0;
+}
+
+/*
+ * List the contents of a WAL summary file identified by TLI, start LSN,
+ * and end LSN.
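+ *
+ * Example usage (hypothetical argument values):
+ *
+ *     SELECT * FROM pg_wal_summary_contents(1, '0/1000028', '0/1000128');
+ *
+ * Each output row identifies a relation fork and a block number, with the
+ * final boolean column distinguishing a limit block (the truncation point)
+ * from an ordinarily modified block.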
+ */
+Datum
+pg_wal_summary_contents(PG_FUNCTION_ARGS)
+{
+ ReturnSetInfo *rsi;
+ Datum values[NUM_SUMMARY_ATTS];
+ bool nulls[NUM_SUMMARY_ATTS];
+ WalSummaryFile ws;
+ WalSummaryIO io;
+ BlockRefTableReader *reader;
+ int64 raw_tli;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ InitMaterializedSRF(fcinfo, 0);
+ rsi = (ReturnSetInfo *) fcinfo->resultinfo;
+ memset(nulls, 0, sizeof(nulls));
+
+ /*
+ * Since the timeline could at least in theory be more than 2^31, and
+ * since we don't have unsigned types at the SQL level, it is passed as a
+ * 64-bit integer. Test whether it's out of range.
+ */
+ raw_tli = PG_GETARG_INT64(0);
+ if (raw_tli < 1 || raw_tli > PG_INT32_MAX)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid timeline %lld", (long long) raw_tli));
+
+ /* Prepare to read the specified WAL summary file. */
+ ws.tli = (TimeLineID) raw_tli;
+ ws.start_lsn = PG_GETARG_LSN(1);
+ ws.end_lsn = PG_GETARG_LSN(2);
+ io.filepos = 0;
+ io.file = OpenWalSummaryFile(&ws, false);
+ reader = CreateBlockRefTableReader(ReadWalSummary, &io,
+ FilePathName(io.file),
+ ReportWalSummaryError, NULL);
+
+ /* Loop over relation forks. */
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockNumber blocks[MAX_BLOCKS_PER_CALL];
+ HeapTuple tuple;
+
+ CHECK_FOR_INTERRUPTS();
+
+ values[0] = ObjectIdGetDatum(rlocator.relNumber);
+ values[1] = ObjectIdGetDatum(rlocator.spcOid);
+ values[2] = ObjectIdGetDatum(rlocator.dbOid);
+ values[3] = Int16GetDatum((int16) forknum);
+
+ /* Loop over blocks within the current relation fork. */
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ CHECK_FOR_INTERRUPTS();
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ MAX_BLOCKS_PER_CALL);
+ if (nblocks == 0)
+ break;
+
+ /*
+ * For each block that we specifically know to have been modified,
+ * emit a row with that block number and limit_block = false.
+ */
+ values[5] = BoolGetDatum(false);
+ for (i = 0; i < nblocks; ++i)
+ {
+ values[4] = Int64GetDatum((int64) blocks[i]);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+
+ /*
+ * If the limit block is not InvalidBlockNumber, emit an extra row
+ * with that block number and limit_block = true.
+ *
+ * There is no point in doing this when the limit_block is
+ * InvalidBlockNumber, because no block with that number or any
+ * higher number can ever exist.
+ */
+ if (BlockNumberIsValid(limit_block))
+ {
+ values[4] = Int64GetDatum((int64) limit_block);
+ values[5] = BoolGetDatum(true);
+
+ tuple = heap_form_tuple(rsi->setDesc, values, nulls);
+ tuplestore_puttuple(rsi->setResult, tuple);
+ }
+ }
+ }
+
+ /* Cleanup */
+ DestroyBlockRefTableReader(reader);
+ FileClose(io.file);
+
+ return (Datum) 0;
+}
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 047448b34e..367a46c617 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -24,6 +24,7 @@ OBJS = \
postmaster.o \
startup.o \
syslogger.o \
+ walsummarizer.o \
walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index bae6f68c40..5f244216a6 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -21,6 +21,7 @@
#include "postmaster/auxprocess.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
#include "storage/bufmgr.h"
@@ -80,6 +81,9 @@ AuxiliaryProcessMain(AuxProcType auxtype)
case WalReceiverProcess:
MyBackendType = B_WAL_RECEIVER;
break;
+ case WalSummarizerProcess:
+ MyBackendType = B_WAL_SUMMARIZER;
+ break;
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
MyBackendType = B_INVALID;
@@ -158,6 +162,10 @@ AuxiliaryProcessMain(AuxProcType auxtype)
WalReceiverMain();
proc_exit(1);
+ case WalSummarizerProcess:
+ WalSummarizerMain();
+ proc_exit(1);
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/meson.build b/src/backend/postmaster/meson.build
index cda921fd10..a30eb6692f 100644
--- a/src/backend/postmaster/meson.build
+++ b/src/backend/postmaster/meson.build
@@ -12,5 +12,6 @@ backend_sources += files(
'postmaster.c',
'startup.c',
'syslogger.c',
+ 'walsummarizer.c',
'walwriter.c',
)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 651b85ea74..b163e89cbb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -113,6 +113,7 @@
#include "postmaster/pgarch.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/walsender.h"
#include "storage/fd.h"
@@ -250,6 +251,7 @@ static pid_t StartupPID = 0,
CheckpointerPID = 0,
WalWriterPID = 0,
WalReceiverPID = 0,
+ WalSummarizerPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
SysLoggerPID = 0;
@@ -441,6 +443,7 @@ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
+static void MaybeStartWalSummarizer(void);
static void InitPostmasterDeathWatchHandle(void);
/*
@@ -564,6 +567,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartWalSummarizer() StartChildProcess(WalSummarizerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -933,6 +937,9 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
+ if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
+ ereport(ERROR,
+ (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
@@ -1835,6 +1842,9 @@ ServerLoop(void)
if (WalReceiverRequested)
MaybeStartWalReceiver();
+ /* If we need to start a WAL summarizer, try to do that now */
+ MaybeStartWalSummarizer();
+
/* Get other worker processes running, if needed */
if (StartWorkerNeeded || HaveCrashedWorker)
maybe_start_bgworkers();
@@ -2659,6 +2669,8 @@ process_pm_reload_request(void)
signal_child(WalWriterPID, SIGHUP);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGHUP);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGHUP);
if (AutoVacPID != 0)
signal_child(AutoVacPID, SIGHUP);
if (PgArchPID != 0)
@@ -3012,6 +3024,7 @@ process_pm_child_exit(void)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ MaybeStartWalSummarizer();
/*
* Likewise, start other special children as needed. In a restart
@@ -3130,6 +3143,20 @@ process_pm_child_exit(void)
continue;
}
+ /*
+ * Was it the wal summarizer? Normal exit can be ignored; we'll start
+ * a new one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == WalSummarizerPID)
+ {
+ WalSummarizerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("WAL summarizer process"));
+ continue;
+ }
+
/*
* Was it the autovacuum launcher? Normal exit can be ignored; we'll
* start a new one at the next iteration of the postmaster's main
@@ -3525,6 +3552,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
else if (WalReceiverPID != 0 && take_action)
sigquit_child(WalReceiverPID);
+ /* Take care of the walsummarizer too */
+ if (pid == WalSummarizerPID)
+ WalSummarizerPID = 0;
+ else if (WalSummarizerPID != 0 && take_action)
+ sigquit_child(WalSummarizerPID);
+
/* Take care of the autovacuum launcher too */
if (pid == AutoVacPID)
AutoVacPID = 0;
@@ -3675,6 +3708,8 @@ PostmasterStateMachine(void)
signal_child(StartupPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, SIGTERM);
/* checkpointer, archiver, stats, and syslogger may continue for now */
/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
@@ -3701,6 +3736,7 @@ PostmasterStateMachine(void)
if (CountChildren(BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND) == 0 &&
StartupPID == 0 &&
WalReceiverPID == 0 &&
+ WalSummarizerPID == 0 &&
BgWriterPID == 0 &&
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
@@ -3798,6 +3834,7 @@ PostmasterStateMachine(void)
/* These other guys should be dead already */
Assert(StartupPID == 0);
Assert(WalReceiverPID == 0);
+ Assert(WalSummarizerPID == 0);
Assert(BgWriterPID == 0);
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
@@ -4019,6 +4056,8 @@ TerminateChildren(int signal)
signal_child(WalWriterPID, signal);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, signal);
+ if (WalSummarizerPID != 0)
+ signal_child(WalSummarizerPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
if (PgArchPID != 0)
@@ -5326,6 +5365,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case WalSummarizerProcess:
+ ereport(LOG,
+ (errmsg("could not fork WAL summarizer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
@@ -5462,6 +5505,19 @@ MaybeStartWalReceiver(void)
}
}
+/*
+ * MaybeStartWalSummarizer
+ * Start the WAL summarizer process, if not running and our state allows.
+ */
+static void
+MaybeStartWalSummarizer(void)
+{
+ if (summarize_wal && WalSummarizerPID == 0 &&
+ (pmState == PM_RUN || pmState == PM_HOT_STANDBY) &&
+ Shutdown <= SmartShutdown)
+ WalSummarizerPID = StartWalSummarizer();
+}
+
/*
* Create the opts file
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
new file mode 100644
index 0000000000..7c840c36b3
--- /dev/null
+++ b/src/backend/postmaster/walsummarizer.c
@@ -0,0 +1,1398 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.c
+ *
+ * Background process to perform WAL summarization, if it is enabled.
+ * It continuously scans the write-ahead log and periodically emits a
+ * summary file which indicates which blocks in which relation forks
+ * were modified by WAL records in the LSN range covered by the summary
+ * file. See walsummary.c and blkreftable.c for more details on the
+ * naming and contents of WAL summary files.
+ *
+ * If configured to do so, this background process will also remove WAL
+ * summary files when the file timestamp is older than a configurable
+ * threshold (but only if the WAL has been removed first).
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/walsummarizer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogutils.h"
+#include "backup/walsummary.h"
+#include "catalog/storage_xlog.h"
+#include "common/blkreftable.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/walsummarizer.h"
+#include "replication/walreceiver.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/latch.h"
+#include "storage/proc.h"
+#include "storage/procsignal.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/wait_event.h"
+
+/*
+ * Data in shared memory related to WAL summarization.
+ */
+typedef struct
+{
+ /*
+ * These fields are protected by WALSummarizerLock.
+ *
+ * Until we've discovered what summary files already exist on disk and
+ * stored that information in shared memory, initialized is false and the
+ * other fields here contain no meaningful information. After that has
+ * been done, initialized is true.
+ *
+ * summarized_tli and summarized_lsn indicate the last LSN and TLI at
+ * which the next summary file will start. Normally, these are the LSN and
+ * TLI at which the last file ended; in such case, lsn_is_exact is true.
+ * If, however, the LSN is just an approximation, then lsn_is_exact is
+ * false. This can happen if, for example, there are no existing WAL
+ * summary files at startup. In that case, we have to derive the position
+ * at which to start summarizing from the WAL files that exist on disk,
+ * and so the LSN might point to the start of the next file even though
+ * that might happen to be in the middle of a WAL record.
+ *
+ * summarizer_pgprocno is the pgprocno value for the summarizer process,
+ * if one is running, or else INVALID_PGPROCNO.
+ *
+ * pending_lsn is used by the summarizer to advertise the ending LSN of a
+ * record it has recently read. It shouldn't ever be less than
+ * summarized_lsn, but might be greater, because the summarizer buffers
+ * data for a range of LSNs in memory before writing out a new file.
+ */
+ bool initialized;
+ TimeLineID summarized_tli;
+ XLogRecPtr summarized_lsn;
+ bool lsn_is_exact;
+ int summarizer_pgprocno;
+ XLogRecPtr pending_lsn;
+
+ /*
+ * This field handles its own synchronization.
+ */
+ ConditionVariable summary_file_cv;
+} WalSummarizerData;
+
+/*
+ * Private data for our xlogreader's page read callback.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ bool historic;
+ XLogRecPtr read_upto;
+ bool end_of_wal;
+} SummarizerReadLocalXLogPrivate;
+
+/* Pointer to shared memory state. */
+static WalSummarizerData *WalSummarizerCtl;
+
+/*
+ * When we reach end of WAL and need to read more, we sleep for a number of
+ * milliseconds that is an integer multiple of MS_PER_SLEEP_QUANTUM. This is
+ * the multiplier. It should vary between 1 and MAX_SLEEP_QUANTA, depending
+ * on system activity. See summarizer_wait_for_wal() for how we adjust this.
+ */
+static long sleep_quanta = 1;
+
+/*
+ * The sleep time will always be a multiple of 200ms and will not exceed
+ * thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
+ * to be substantially less than the maximum amount of time for which an
+ * incremental backup will wait for this process to catch up. Otherwise, an
+ * incremental backup might time out on an idle system just because we sleep
+ * for too long.
+ */
+#define MAX_SLEEP_QUANTA 150
+#define MS_PER_SLEEP_QUANTUM 200
+
+/*
+ * This is a count of the number of pages of WAL that we've read since the
+ * last time we waited for more WAL to appear.
+ */
+static long pages_read_since_last_sleep = 0;
+
+/*
+ * Most recent RedoRecPtr value observed by MaybeRemoveOldWalSummaries.
+ */
+static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
+
+/*
+ * GUC parameters
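+ *
+ * wal_summary_keep_time is expressed in minutes; the default is ten days,
+ * and a value of 0 disables automatic summary removal.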
+ */
+bool summarize_wal = false;
+int wal_summary_keep_time = 10 * 24 * 60;
+
+static XLogRecPtr GetLatestLSN(TimeLineID *tli);
+static void HandleWalSummarizerInterrupts(void);
+static XLogRecPtr SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn,
+ bool exact, XLogRecPtr switch_lsn,
+ XLogRecPtr maximum_lsn);
+static void SummarizeSmgrRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static void SummarizeXactRecord(XLogReaderState *xlogreader,
+ BlockRefTable *brtab);
+static bool SummarizeXlogRecord(XLogReaderState *xlogreader);
+static int summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr,
+ int reqLen,
+ XLogRecPtr targetRecPtr,
+ char *cur_page);
+static void summarizer_wait_for_wal(void);
+static void MaybeRemoveOldWalSummaries(void);
+
+/*
+ * Amount of shared memory required for this module.
+ */
+Size
+WalSummarizerShmemSize(void)
+{
+ return sizeof(WalSummarizerData);
+}
+
+/*
+ * Create or attach to shared memory segment for this module.
+ */
+void
+WalSummarizerShmemInit(void)
+{
+ bool found;
+
+ WalSummarizerCtl = (WalSummarizerData *)
+ ShmemInitStruct("Wal Summarizer Ctl", WalSummarizerShmemSize(),
+ &found);
+
+ if (!found)
+ {
+ /*
+ * First time through, so initialize.
+ *
+ * We're just filling in dummy values here -- the real initialization
+ * will happen when GetOldestUnsummarizedLSN() is called for the first
+ * time.
+ */
+ WalSummarizerCtl->initialized = false;
+ WalSummarizerCtl->summarized_tli = 0;
+ WalSummarizerCtl->summarized_lsn = InvalidXLogRecPtr;
+ WalSummarizerCtl->lsn_is_exact = false;
+ WalSummarizerCtl->summarizer_pgprocno = INVALID_PGPROCNO;
+ WalSummarizerCtl->pending_lsn = InvalidXLogRecPtr;
+ ConditionVariableInit(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Entry point for walsummarizer process.
+ */
+void
+WalSummarizerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext context;
+
+ /*
+ * Within this function, 'current_lsn' and 'current_tli' refer to the
+ * point from which the next WAL summary file should start. 'exact' is
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
+ * segment, and false if it might be in the middle of a record someplace.
+ *
+ * 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
+ * switch to a new timeline and the timeline to which we need to switch.
+ * If not set, we either haven't figured out the answers yet or we're
+ * already on the latest timeline.
+ */
+ XLogRecPtr current_lsn;
+ TimeLineID current_tli;
+ bool exact;
+ XLogRecPtr switch_lsn = InvalidXLogRecPtr;
+ TimeLineID switch_tli = 0;
+
+ ereport(DEBUG1,
+ (errmsg_internal("WAL summarizer started")));
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us
+ *
+ * We have no particular use for SIGINT at the moment, but it seems
+ * reasonable to treat it like SIGTERM.
+ */
+ pqsignal(SIGHUP, SignalHandlerForConfigReload);
+ pqsignal(SIGINT, SignalHandlerForShutdownRequest);
+ pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
+ /* SIGQUIT handler was already set up by InitPostmasterChild */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, procsignal_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN); /* not used */
+
+ /* Advertise ourselves. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ WalSummarizerCtl->summarizer_pgprocno = MyProc->pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Create and switch to a memory context that we can reset on error. */
+ context = AllocSetContextCreate(TopMemoryContext,
+ "Wal Summarizer",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextSwitchTo(context);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /* Release resources we might have acquired. */
+ LWLockReleaseAll();
+ ConditionVariableCancelSleep();
+ pgstat_report_wait_end();
+ ReleaseAuxProcessResources(false);
+ AtEOXact_Files(false);
+ AtEOXact_HashTables(false);
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+
+ /*
+ * Sleep for 10 seconds before attempting to resume operations in
+ * order to avoid excessive logging.
+ *
+ * Many of the likely error conditions are things that will repeat
+ * every time. For example, if the WAL can't be read or the summary
+ * can't be written, only administrator action will cure the problem.
+ * So a really fast retry time doesn't seem to be especially
+ * beneficial, and it will clutter the logs.
+ */
+ (void) WaitLatch(MyLatch,
+ WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ 10000,
+ WAIT_EVENT_WAL_SUMMARIZER_ERROR);
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ /*
+ * Fetch information about previous progress from shared memory, and ask
+ * GetOldestUnsummarizedLSN to reset pending_lsn to summarized_lsn. We
+ * might be recovering from an error, and if so, pending_lsn might have
+ * advanced past summarized_lsn, but any WAL we read previously has been
+ * lost and will need to be reread.
+ *
+ * If we discover that WAL summarization is not enabled, just exit.
+ */
+ current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact, true);
+ if (XLogRecPtrIsInvalid(current_lsn))
+ proc_exit(0);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+ XLogRecPtr end_of_summary_lsn;
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(context);
+
+ /* Process any signals received recently. */
+ HandleWalSummarizerInterrupts();
+
+ /* If it's time to remove any old WAL summaries, do that now. */
+ MaybeRemoveOldWalSummaries();
+
+ /* Find the LSN and TLI up to which we can safely summarize. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+
+ /*
+ * If we're summarizing a historic timeline and we haven't yet
+ * computed the point at which to switch to the next timeline, do that
+ * now.
+ *
+ * Note that if this is a standby, what was previously the current
+ * timeline could become historic at any time.
+ *
+ * We could try to make this more efficient by caching the results of
+ * readTimeLineHistory when latest_tli has not changed, but since we
+ * only have to do this once per timeline switch, we probably wouldn't
+ * save any significant amount of work in practice.
+ */
+ if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+
+ switch_lsn = tliSwitchPoint(current_tli, tles, &switch_tli);
+ ereport(DEBUG1,
+ errmsg("switch point from TLI %u to TLI %u is at %X/%X",
+ current_tli, switch_tli, LSN_FORMAT_ARGS(switch_lsn)));
+ }
+
+ /*
+ * If we've reached the switch LSN, we can't summarize anything else
+ * on this timeline. Switch to the next timeline and go around again.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) && current_lsn >= switch_lsn)
+ {
+ current_tli = switch_tli;
+ switch_lsn = InvalidXLogRecPtr;
+ switch_tli = 0;
+ continue;
+ }
+
+ /* Summarize WAL. */
+ end_of_summary_lsn = SummarizeWAL(current_tli,
+ current_lsn, exact,
+ switch_lsn, latest_lsn);
+ Assert(!XLogRecPtrIsInvalid(end_of_summary_lsn));
+ Assert(end_of_summary_lsn >= current_lsn);
+
+ /*
+ * Update state for next loop iteration.
+ *
+ * Next summary file should start from exactly where this one ended.
+ */
+ current_lsn = end_of_summary_lsn;
+ exact = true;
+
+ /* Update state in shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(WalSummarizerCtl->pending_lsn <= end_of_summary_lsn);
+ WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
+ WalSummarizerCtl->summarized_tli = current_tli;
+ WalSummarizerCtl->lsn_is_exact = true;
+ WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /* Wake up anyone waiting for more summary files to be written. */
+ ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
+ }
+}
+
+/*
+ * Get the oldest LSN in this server's timeline history that has not yet been
+ * summarized.
+ *
+ * If tli != NULL, *tli will be set to the TLI for the LSN that is returned.
+ *
+ * If lsn_is_exact != NULL, *lsn_is_exact will be set to true if the
+ * returned LSN is necessarily the start of a WAL record and false if it's
+ * just the beginning of a WAL segment.
+ *
+ * If reset_pending_lsn is true, resets the pending_lsn in shared memory to
+ * be equal to the summarized_lsn.
+ */
+XLogRecPtr
+GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
+ bool reset_pending_lsn)
+{
+ TimeLineID latest_tli;
+ LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
+ int n;
+ List *tles;
+ XLogRecPtr unsummarized_lsn;
+ TimeLineID unsummarized_tli = 0;
+ bool should_make_exact = false;
+ List *existing_summaries;
+ ListCell *lc;
+
+ /* If not summarizing WAL, do nothing. */
+ if (!summarize_wal)
+ return InvalidXLogRecPtr;
+
+ /*
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
+ * in shared mode and try to fetch the required information. If we acquire
+ * in shared mode and find that the data structure hasn't been
+ * initialized, we reacquire the lock in exclusive mode so that we can
+ * initialize it. However, if someone else does that first before we get
+ * the lock, then we can just return the requested information after all.
+ */
+ while (1)
+ {
+ LWLockAcquire(WALSummarizerLock, mode);
+
+ if (WalSummarizerCtl->initialized)
+ {
+ unsummarized_lsn = WalSummarizerCtl->summarized_lsn;
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ if (reset_pending_lsn)
+ WalSummarizerCtl->pending_lsn =
+ WalSummarizerCtl->summarized_lsn;
+ LWLockRelease(WALSummarizerLock);
+ return unsummarized_lsn;
+ }
+
+ if (mode == LW_EXCLUSIVE)
+ break;
+
+ LWLockRelease(WALSummarizerLock);
+ mode = LW_EXCLUSIVE;
+ }
+
+ /*
+ * The data structure needs to be initialized, and we are the first to
+ * obtain the lock in exclusive mode, so it's our job to do that
+ * initialization.
+ *
+ * So, find the oldest timeline on which WAL still exists, and the
+ * earliest segment for which it exists.
+ */
+ (void) GetLatestLSN(&latest_tli);
+ tles = readTimeLineHistory(latest_tli);
+ for (n = list_length(tles) - 1; n >= 0; --n)
+ {
+ TimeLineHistoryEntry *tle = list_nth(tles, n);
+ XLogSegNo oldest_segno;
+
+ oldest_segno = XLogGetOldestSegno(tle->tli);
+ if (oldest_segno != 0)
+ {
+ /* Compute oldest LSN that still exists on disk. */
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ unsummarized_lsn);
+
+ unsummarized_tli = tle->tli;
+ break;
+ }
+ }
+
+ /* It really should not be possible for us to find no WAL. */
+ if (unsummarized_tli == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("no WAL found on timeline %d", latest_tli));
+
+ /*
+ * Don't try to summarize anything older than the end LSN of the newest
+ * summary file that exists for this timeline.
+ */
+ existing_summaries =
+ GetWalSummaries(unsummarized_tli,
+ InvalidXLogRecPtr, InvalidXLogRecPtr);
+ foreach(lc, existing_summaries)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ if (ws->end_lsn > unsummarized_lsn)
+ {
+ unsummarized_lsn = ws->end_lsn;
+ should_make_exact = true;
+ }
+ }
+
+ /* Update shared memory with the discovered values. */
+ WalSummarizerCtl->initialized = true;
+ WalSummarizerCtl->summarized_lsn = unsummarized_lsn;
+ WalSummarizerCtl->summarized_tli = unsummarized_tli;
+ WalSummarizerCtl->lsn_is_exact = should_make_exact;
+ WalSummarizerCtl->pending_lsn = unsummarized_lsn;
+
+ /* Also return the requested information to the caller. */
+ if (tli != NULL)
+ *tli = WalSummarizerCtl->summarized_tli;
+ if (lsn_is_exact != NULL)
+ *lsn_is_exact = WalSummarizerCtl->lsn_is_exact;
+ LWLockRelease(WALSummarizerLock);
+
+ return unsummarized_lsn;
+}
+
+/*
+ * Attempt to set the WAL summarizer's latch.
+ *
+ * This might not work, because there's no guarantee that the WAL summarizer
+ * process was successfully started, and it also might have started but
+ * subsequently terminated. So, under normal circumstances, this will get the
+ * latch set, but there's no guarantee.
+ */
+void
+SetWalSummarizerLatch(void)
+{
+ int pgprocno;
+
+ if (WalSummarizerCtl == NULL)
+ return;
+
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ pgprocno = WalSummarizerCtl->summarizer_pgprocno;
+ LWLockRelease(WALSummarizerLock);
+
+ if (pgprocno != INVALID_PGPROCNO)
+ SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
+}
+
+/*
+ * Wait until WAL summarization reaches the given LSN, but not longer than
+ * the given timeout.
+ *
+ * The return value is the first still-unsummarized LSN. If it's greater than
+ * or equal to the passed LSN, then that LSN was reached. If not, we timed out.
+ *
+ * Either way, *pending_lsn is set to the value taken from WalSummarizerCtl.
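+ *
+ * For example (hypothetical values), a caller needing summaries through
+ * 1/0 might pass lsn = 1/0 and timeout = 10000; a return value below 1/0
+ * then means the ten-second wait timed out, and *pending_lsn shows how
+ * far the summarizer has read so far.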
+ */
+XLogRecPtr
+WaitForWalSummarization(XLogRecPtr lsn, long timeout, XLogRecPtr *pending_lsn)
+{
+ TimestampTz start_time = GetCurrentTimestamp();
+ TimestampTz deadline = TimestampTzPlusMilliseconds(start_time, timeout);
+ XLogRecPtr summarized_lsn;
+
+ Assert(!XLogRecPtrIsInvalid(lsn));
+ Assert(timeout > 0);
+
+ while (1)
+ {
+ TimestampTz now;
+ long remaining_timeout;
+
+ /*
+ * If the LSN summarized on disk has reached the target value, stop.
+ */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ summarized_lsn = WalSummarizerCtl->summarized_lsn;
+ *pending_lsn = WalSummarizerCtl->pending_lsn;
+ LWLockRelease(WALSummarizerLock);
+ if (summarized_lsn >= lsn)
+ break;
+
+ /* Timeout reached? If yes, stop. */
+ now = GetCurrentTimestamp();
+ remaining_timeout = TimestampDifferenceMilliseconds(now, deadline);
+ if (remaining_timeout <= 0)
+ break;
+
+ /* Wait and see. */
+ ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
+ remaining_timeout,
+ WAIT_EVENT_WAL_SUMMARY_READY);
+ }
+
+ return summarized_lsn;
+}
+
+/*
+ * Get the latest LSN that is eligible to be summarized, and set *tli to the
+ * corresponding timeline.
+ */
+static XLogRecPtr
+GetLatestLSN(TimeLineID *tli)
+{
+ if (!RecoveryInProgress())
+ {
+ /* Don't summarize WAL before it's flushed. */
+ return GetFlushRecPtr(tli);
+ }
+ else
+ {
+ XLogRecPtr flush_lsn;
+ TimeLineID flush_tli;
+ XLogRecPtr replay_lsn;
+ TimeLineID replay_tli;
+
+ /*
+ * What we really want to know is how much WAL has been flushed to
+ * disk, but the only flush position available is the one provided by
+ * the walreceiver, which may not be running, because this could be
+ * crash recovery or recovery via restore_command. So use either the
+ * WAL receiver's flush position or the replay position, whichever is
+ * further ahead, on the theory that if the WAL has been replayed then
+ * it must also have been flushed to disk.
+ */
+ flush_lsn = GetWalRcvFlushRecPtr(NULL, &flush_tli);
+ replay_lsn = GetXLogReplayRecPtr(&replay_tli);
+ if (flush_lsn > replay_lsn)
+ {
+ *tli = flush_tli;
+ return flush_lsn;
+ }
+ else
+ {
+ *tli = replay_tli;
+ return replay_lsn;
+ }
+ }
+}
+
+/*
+ * Interrupt handler for main loop of WAL summarizer process.
+ */
+static void
+HandleWalSummarizerInterrupts(void)
+{
+ if (ProcSignalBarrierPending)
+ ProcessProcSignalBarrier();
+
+ if (ConfigReloadPending)
+ {
+ ConfigReloadPending = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+
+ if (ShutdownRequestPending || !summarize_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("WAL summarizer shutting down"));
+ proc_exit(0);
+ }
+
+ /* Perform logging of memory contexts of this process */
+ if (LogMemoryContextPending)
+ ProcessLogMemoryContextInterrupt();
+}
+
+/*
+ * Summarize a range of WAL records on a single timeline.
+ *
+ * 'tli' is the timeline to be summarized.
+ *
+ * 'start_lsn' is the point at which we should start summarizing. If this
+ * value comes from the end LSN of the previous record as returned by the
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
+ * be false, and this function will search forward for the start of a valid
+ * WAL record.
+ *
+ * 'switch_lsn' is the point at which we should switch to a later timeline,
+ * if we're summarizing a historic timeline.
+ *
+ * 'maximum_lsn' identifies the point beyond which we can't count on being
+ * able to read any more WAL. It should be the switch point when reading a
+ * historic timeline, or the most-recently-measured end of WAL when reading
+ * the current timeline.
+ *
+ * The return value is the LSN at which the WAL summary actually ends. Most
+ * often, a summary file ends because we notice that a checkpoint has
+ * occurred and reach the redo pointer of that checkpoint, but sometimes
+ * we stop for other reasons, such as a timeline switch.
+ */
+static XLogRecPtr
+SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
+ XLogRecPtr switch_lsn, XLogRecPtr maximum_lsn)
+{
+ SummarizerReadLocalXLogPrivate *private_data;
+ XLogReaderState *xlogreader;
+ XLogRecPtr summary_start_lsn;
+ XLogRecPtr summary_end_lsn = switch_lsn;
+ char temp_path[MAXPGPATH];
+ char final_path[MAXPGPATH];
+ WalSummaryIO io;
+ BlockRefTable *brtab = CreateEmptyBlockRefTable();
+
+ /* Initialize private data for xlogreader. */
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ palloc0(sizeof(SummarizerReadLocalXLogPrivate));
+ private_data->tli = tli;
+ private_data->historic = !XLogRecPtrIsInvalid(switch_lsn);
+ private_data->read_upto = maximum_lsn;
+
+ /* Create xlogreader. */
+ xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+ XL_ROUTINE(.page_read = &summarizer_read_local_xlog_page,
+ .segment_open = &wal_segment_open,
+ .segment_close = &wal_segment_close),
+ private_data);
+ if (xlogreader == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+
+ /*
+ * When exact = false, we're starting from an arbitrary point in the WAL
+ * and must search forward for the start of the next record.
+ *
+ * When exact = true, start_lsn should be either the LSN where a record
+ * begins, or the LSN of a page where the page header is immediately
+ * followed by the start of a new record. XLogBeginRead should tolerate
+ * either case.
+ *
+ * We need to allow for both cases because the behavior of xlogreader
+ * varies. When a record spans two or more xlog pages, the ending LSN
+ * reported by xlogreader will be the starting LSN of the following
+ * record, but when an xlog page boundary falls between two records, the
+ * end LSN for the first will be reported as the first byte of the
+ * following page. We can't know until we read that page how large the
+ * header will be, but we'll have to skip over it to find the next record.
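+ *
+ * For example (hypothetical LSNs), if a record ends exactly at the page
+ * boundary 0/2000000, xlogreader reports 0/2000000 as its end LSN even
+ * though the next record actually begins after the page header, e.g. at
+ * 0/2000018. A summary ending at 0/2000000 and a successor starting there
+ * are therefore still contiguous; only a page header lies between the
+ * nominal start and the first record actually summarized.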
+ */
+ if (exact)
+ {
+ /*
+ * Even if start_lsn is the beginning of a page rather than the
+ * beginning of the first record on that page, we should still use it
+ * as the start LSN for the summary file. That's because we detect
+ * missing summary files by looking for cases where the end LSN of one
+ * file is less than the start LSN of the next file. When only a page
+ * header is skipped, nothing has been missed.
+ */
+ XLogBeginRead(xlogreader, start_lsn);
+ summary_start_lsn = start_lsn;
+ }
+ else
+ {
+ summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+ if (XLogRecPtrIsInvalid(summary_start_lsn))
+ {
+ /*
+ * If we hit end-of-WAL while trying to find the next valid
+ * record, we must be on a historic timeline that has no valid
+ * records that begin after start_lsn and before end of WAL.
+ */
+ if (private_data->end_of_wal)
+ {
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %u at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+
+ /*
+ * The timeline ends at or after start_lsn, without containing
+ * any records. Thus, we must make sure the main loop does not
+ * iterate. If start_lsn is the end of the timeline, then we
+ * won't actually emit an empty summary file, but otherwise,
+ * we must, to capture the fact that the LSN range in question
+ * contains no interesting WAL records.
+ */
+ summary_start_lsn = start_lsn;
+ summary_end_lsn = private_data->read_upto;
+ switch_lsn = xlogreader->EndRecPtr;
+ }
+ else
+ ereport(ERROR,
+ (errmsg("could not find a valid record after %X/%X",
+ LSN_FORMAT_ARGS(start_lsn))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn >= start_lsn);
+ }
+
+ /*
+ * Main loop: read xlog records one by one.
+ */
+ while (1)
+ {
+ int block_id;
+ char *errormsg;
+ XLogRecord *record;
+ bool stop_requested = false;
+
+ HandleWalSummarizerInterrupts();
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ /* Now read the next record. */
+ record = XLogReadRecord(xlogreader, &errormsg);
+ if (record == NULL)
+ {
+ if (private_data->end_of_wal)
+ {
+ /*
+ * This timeline must be historic and must end before we were
+ * able to read a complete record.
+ */
+ ereport(DEBUG1,
+ errmsg_internal("could not read WAL from timeline %d at %X/%X: end of WAL at %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ LSN_FORMAT_ARGS(private_data->read_upto)));
+ /* Summary ends at end of WAL. */
+ summary_end_lsn = private_data->read_upto;
+ break;
+ }
+ if (errormsg)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X: %s",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
+ errormsg)));
+ else
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read WAL from timeline %u at %X/%X",
+ tli, LSN_FORMAT_ARGS(xlogreader->EndRecPtr))));
+ }
+
+ /* We shouldn't go backward. */
+ Assert(summary_start_lsn <= xlogreader->EndRecPtr);
+
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->ReadRecPtr >= switch_lsn)
+ {
+ /*
+ * Woops! We've read a record that *starts* after the switch LSN,
+ * contrary to our goal of reading only until we hit the first
+ * record that ends at or after the switch LSN. Pretend we didn't
+ * read it after all by bailing out of this loop right here,
+ * before we do anything with this record.
+ *
+ * This can happen because the last record before the switch LSN
+ * might be continued across multiple pages, and then we might
+ * come to a page with XLP_FIRST_IS_OVERWRITE_CONTRECORD set. In
+ * that case, the record that was continued across multiple pages
+ * is incomplete and will be disregarded, and the read will
+ * restart from the beginning of the page that is flagged
+ * XLP_FIRST_IS_OVERWRITE_CONTRECORD.
+ *
+ * If this case occurs, we can fairly say that the current summary
+ * file ends at the switch LSN exactly. The first record on the
+ * page marked XLP_FIRST_IS_OVERWRITE_CONTRECORD will be
+ * discovered when generating the next summary file.
+ */
+ summary_end_lsn = switch_lsn;
+ break;
+ }
+
+ /* Special handling for particular types of WAL records. */
+ switch (XLogRecGetRmid(xlogreader))
+ {
+ case RM_SMGR_ID:
+ SummarizeSmgrRecord(xlogreader, brtab);
+ break;
+ case RM_XACT_ID:
+ SummarizeXactRecord(xlogreader, brtab);
+ break;
+ case RM_XLOG_ID:
+ stop_requested = SummarizeXlogRecord(xlogreader);
+ break;
+ default:
+ break;
+ }
+
+ /*
+ * If we've been told that it's time to end this WAL summary file, do
+ * so. As an exception, if there's nothing included in this WAL
+ * summary file yet, then stopping doesn't make any sense, and we
+ * should wait until the next stop point instead.
+ */
+ if (stop_requested && xlogreader->ReadRecPtr > summary_start_lsn)
+ {
+ summary_end_lsn = xlogreader->ReadRecPtr;
+ break;
+ }
+
+ /* Feed block references from xlog record to block reference table. */
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader);
+ block_id++)
+ {
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber blocknum;
+
+ if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
+ &forknum, &blocknum, NULL))
+ continue;
+
+ /*
+ * As we do elsewhere, ignore the FSM fork, because it's not fully
+ * WAL-logged.
+ */
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableMarkBlockModified(brtab, &rlocator, forknum,
+ blocknum);
+ }
+
+ /* Update our notion of where this summary file ends. */
+ summary_end_lsn = xlogreader->EndRecPtr;
+
+ /* Also update shared memory. */
+ LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
+ Assert(summary_end_lsn >= WalSummarizerCtl->pending_lsn);
+ Assert(summary_end_lsn >= WalSummarizerCtl->summarized_lsn);
+ WalSummarizerCtl->pending_lsn = summary_end_lsn;
+ LWLockRelease(WALSummarizerLock);
+
+ /*
+ * If we have a switch LSN and have reached it, stop before reading
+ * the next record.
+ */
+ if (!XLogRecPtrIsInvalid(switch_lsn) &&
+ xlogreader->EndRecPtr >= switch_lsn)
+ break;
+ }
+
+ /* Destroy xlogreader. */
+ pfree(xlogreader->private_data);
+ XLogReaderFree(xlogreader);
+
+ /*
+ * If a timeline switch occurs, we may fail to make any progress at all
+ * before exiting the loop above. If that happens, we don't write a WAL
+ * summary file at all.
+ */
+ if (summary_end_lsn > summary_start_lsn)
+ {
+ /* Generate temporary and final path name. */
+ snprintf(temp_path, MAXPGPATH,
+ XLOGDIR "/summaries/temp.summary");
+ snprintf(final_path, MAXPGPATH,
+ XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn));
+
+ /* Open the temporary file for writing. */
+ io.filepos = 0;
+ io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
+ if (io.file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create file \"%s\": %m", temp_path)));
+
+ /* Write the data. */
+ WriteBlockRefTable(brtab, WriteWalSummary, &io);
+
+ /* Close the temporary file. */
+ FileClose(io.file);
+
+ /* Tell the user what we did. */
+ ereport(DEBUG1,
+ errmsg("summarized WAL on TLI %d from %X/%X to %X/%X",
+ tli,
+ LSN_FORMAT_ARGS(summary_start_lsn),
+ LSN_FORMAT_ARGS(summary_end_lsn)));
+
+ /* Durably rename the new summary into place. */
+ durable_rename(temp_path, final_path, ERROR);
+ }
+
+ return summary_end_lsn;
+}
+
+/*
+ * Special handling for WAL records with RM_SMGR_ID.
+ */
+static void
+SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_SMGR_CREATE)
+ {
+ xl_smgr_create *xlrec;
+
+ /*
+ * If a new relation fork is created on disk, there is no point
+ * tracking anything about which blocks have been modified, because
+ * the whole thing will be new. Hence, set the limit block for this
+ * fork to 0.
+ *
+ * Ignore the FSM fork, which is not fully WAL-logged.
+ */
+ xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
+
+ if (xlrec->forkNum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ xlrec->forkNum, 0);
+ }
+ else if (info == XLOG_SMGR_TRUNCATE)
+ {
+ xl_smgr_truncate *xlrec;
+
+ xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
+
+ /*
+ * If a relation fork is truncated on disk, there is no point in
+ * tracking anything about block modifications beyond the truncation
+ * point.
+ *
+ * We ignore SMGR_TRUNCATE_FSM here because the FSM isn't fully
+ * WAL-logged and thus we can't track modified blocks for it anyway.
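+ *
+ * For example (hypothetical numbers), truncating a relation to 100
+ * blocks sets the limit block for the affected forks to 100; there is
+ * no need to track modifications at or beyond that point.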
+ */
+ if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ MAIN_FORKNUM, xlrec->blkno);
+ if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0)
+ BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
+ VISIBILITYMAP_FORKNUM, xlrec->blkno);
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XACT_ID.
+ */
+static void
+SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+ uint8 xact_info = info & XLOG_XACT_OPMASK;
+
+ if (xact_info == XLOG_XACT_COMMIT ||
+ xact_info == XLOG_XACT_COMMIT_PREPARED)
+ {
+ xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_commit parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * commit.
+ */
+ ParseCommitRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+ else if (xact_info == XLOG_XACT_ABORT ||
+ xact_info == XLOG_XACT_ABORT_PREPARED)
+ {
+ xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+ xl_xact_parsed_abort parsed;
+ int i;
+
+ /*
+ * Don't track modified blocks for any relations that were removed on
+ * abort.
+ */
+ ParseAbortRecord(XLogRecGetInfo(xlogreader), xlrec, &parsed);
+ for (i = 0; i < parsed.nrels; ++i)
+ {
+ ForkNumber forknum;
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; ++forknum)
+ if (forknum != FSM_FORKNUM)
+ BlockRefTableSetLimitBlock(brtab, &parsed.xlocators[i],
+ forknum, 0);
+ }
+ }
+}
+
+/*
+ * Special handling for WAL records with RM_XLOG_ID.
+ */
+static bool
+SummarizeXlogRecord(XLogReaderState *xlogreader)
+{
+ uint8 info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_CHECKPOINT_REDO || info == XLOG_CHECKPOINT_SHUTDOWN)
+ {
+ /*
+ * This is an LSN at which redo might begin, so we'd like
+ * summarization to stop just before this WAL record.
+ */
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Similar to read_local_xlog_page, but limited to read from one particular
+ * timeline. If the end of WAL is reached, it will wait for more if reading
+ * from the current timeline, or give up if reading from a historic timeline.
+ * In the latter case, it will also set private_data->end_of_wal = true.
+ *
+ * Caller must set private_data->tli to the TLI of interest,
+ * private_data->read_upto to the lowest LSN that is not known to be safe
+ * to read on that timeline, and private_data->historic to true if and only
+ * if the timeline is not the current timeline. This function will update
+ * private_data->read_upto and private_data->historic if more WAL appears
+ * on the current timeline or if the current timeline becomes historic.
+ */
+static int
+summarizer_read_local_xlog_page(XLogReaderState *state,
+ XLogRecPtr targetPagePtr, int reqLen,
+ XLogRecPtr targetRecPtr, char *cur_page)
+{
+ int count;
+ WALReadError errinfo;
+ SummarizerReadLocalXLogPrivate *private_data;
+
+ HandleWalSummarizerInterrupts();
+
+ private_data = (SummarizerReadLocalXLogPrivate *)
+ state->private_data;
+
+ while (1)
+ {
+ if (targetPagePtr + XLOG_BLCKSZ <= private_data->read_upto)
+ {
+ /*
+ * more than one block available; read only that block, have
+ * caller come back if they need more.
+ */
+ count = XLOG_BLCKSZ;
+ break;
+ }
+ else if (targetPagePtr + reqLen > private_data->read_upto)
+ {
+ /* We don't seem to have enough data. */
+ if (private_data->historic)
+ {
+ /*
+ * This is a historic timeline, so there will never be any
+ * more data than we have currently.
+ */
+ private_data->end_of_wal = true;
+ return -1;
+ }
+ else
+ {
+ XLogRecPtr latest_lsn;
+ TimeLineID latest_tli;
+
+ /*
+ * This is - or at least was up until very recently - the
+ * current timeline, so more data might show up. Delay here
+ * so we don't tight-loop.
+ */
+ HandleWalSummarizerInterrupts();
+ summarizer_wait_for_wal();
+
+ /* Recheck end-of-WAL. */
+ latest_lsn = GetLatestLSN(&latest_tli);
+ if (private_data->tli == latest_tli)
+ {
+ /* Still the current timeline, update max LSN. */
+ Assert(latest_lsn >= private_data->read_upto);
+ private_data->read_upto = latest_lsn;
+ }
+ else
+ {
+ List *tles = readTimeLineHistory(latest_tli);
+ XLogRecPtr switchpoint;
+
+ /*
+ * The timeline we're scanning is no longer the latest
+ * one. Figure out when it ended.
+ */
+ private_data->historic = true;
+ switchpoint = tliSwitchPoint(private_data->tli, tles,
+ NULL);
+
+ /*
+ * Allow reads up to exactly the switch point.
+ *
+ * It's possible that this will cause read_upto to move
+ * backwards, because walreceiver might have read a
+ * partial record and flushed it to disk, and we'd view
+ * that data as safe to read. However, the
+ * XLOG_END_OF_RECOVERY record will be written at the end
+ * of the last complete WAL record, not at the end of the
+ * WAL that we've flushed to disk.
+ *
+ * So switchpoint < private_data->read_upto is possible here,
+ * but switchpoint < state->EndRecPtr should not be.
+ */
+ Assert(switchpoint >= state->EndRecPtr);
+ private_data->read_upto = switchpoint;
+
+ /* Debugging output. */
+ ereport(DEBUG1,
+ errmsg("timeline %u became historic, can read up to %X/%X",
+ private_data->tli, LSN_FORMAT_ARGS(private_data->read_upto)));
+ }
+
+ /* Go around and try again. */
+ }
+ }
+ else
+ {
+ /* enough bytes available to satisfy the request */
+ count = private_data->read_upto - targetPagePtr;
+ break;
+ }
+ }
+
+ /*
+ * Even though we just determined how much of the page can be validly read
+ * as 'count', read the whole page anyway. It's guaranteed to be
+ * zero-padded up to the page boundary if it's incomplete.
+ */
+ if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+ private_data->tli, &errinfo))
+ WALReadRaiseError(&errinfo);
+
+ /* Track that we read a page, for sleep time calculation. */
+ ++pages_read_since_last_sleep;
+
+ /* number of valid bytes in the buffer */
+ return count;
+}
+
+/*
+ * Sleep for long enough that we believe it's likely that more WAL will
+ * be available afterwards.
+ */
+static void
+summarizer_wait_for_wal(void)
+{
+ if (pages_read_since_last_sleep == 0)
+ {
+ /*
+ * No pages were read since the last sleep, so double the sleep time,
+ * but not beyond the maximum allowable value.
+ */
+ sleep_quanta = Min(sleep_quanta * 2, MAX_SLEEP_QUANTA);
+ }
+ else if (pages_read_since_last_sleep > 1)
+ {
+ /*
+ * Multiple pages were read since the last sleep, so reduce the sleep
+ * time.
+ *
+ * A large burst of activity should be able to quickly reduce the
+ * sleep time to the minimum, but we don't want a handful of extra WAL
+ * records to provoke a strong reaction. We choose to reduce the sleep
+ * time by 1 quantum for each page read beyond the first, which is a
+ * fairly arbitrary way of trying to be reactive without
+ * overreacting.
+ */
+ if (pages_read_since_last_sleep > sleep_quanta - 1)
+ sleep_quanta = 1;
+ else
+ sleep_quanta -= pages_read_since_last_sleep;
+ }
+
+ /* OK, now sleep. */
+ (void) WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+ sleep_quanta * MS_PER_SLEEP_QUANTUM,
+ WAIT_EVENT_WAL_SUMMARIZER_WAL);
+ ResetLatch(MyLatch);
+
+ /* Reset count of pages read. */
+ pages_read_since_last_sleep = 0;
+}
+
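+/*
+ * An illustrative trace of the adjustment logic above (the starting values
+ * are made up for the example; only the rules from the code apply):
+ *
+ *   0 pages read,   sleep_quanta 4 -> 8   (doubled, capped at the maximum)
+ *   1 page read,    sleep_quanta 8 -> 8   (a single page is not a trend)
+ *   3 pages read,   sleep_quanta 8 -> 5   (reduced by the pages read)
+ *   100 pages read, sleep_quanta 5 -> 1   (clamped at the minimum)
+ */
+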
+/*
+ * Remove WAL summary files once the WAL they summarize no longer exists
+ * and their modification time has passed the cutoff implied by
+ * wal_summary_keep_time. We attempt this at most once per checkpoint cycle.
+ */
+static void
+MaybeRemoveOldWalSummaries(void)
+{
+ XLogRecPtr redo_pointer = GetRedoRecPtr();
+ List *wslist;
+ time_t cutoff_time;
+
+ /* If WAL summary removal is disabled, don't do anything. */
+ if (wal_summary_keep_time == 0)
+ return;
+
+ /*
+ * If the redo pointer has not advanced, don't do anything.
+ *
+ * This has the effect that we only try to remove old WAL summary files
+ * once per checkpoint cycle.
+ */
+ if (redo_pointer == redo_pointer_at_last_summary_removal)
+ return;
+ redo_pointer_at_last_summary_removal = redo_pointer;
+
+ /*
+ * Files should only be removed if the last modification time precedes the
+ * cutoff time we compute here. wal_summary_keep_time is in minutes, hence
+ * the conversion to seconds.
+ */
+ cutoff_time = time(NULL) - 60 * wal_summary_keep_time;
+
+ /* Get all the summaries that currently exist. */
+ wslist = GetWalSummaries(0, InvalidXLogRecPtr, InvalidXLogRecPtr);
+
+ /* Loop until all summaries have been considered for removal. */
+ while (wslist != NIL)
+ {
+ ListCell *lc;
+ XLogSegNo oldest_segno;
+ XLogRecPtr oldest_lsn = InvalidXLogRecPtr;
+ TimeLineID selected_tli;
+
+ HandleWalSummarizerInterrupts();
+
+ /*
+ * Pick a timeline for which some summary files still exist on disk,
+ * and find the oldest LSN that still exists on disk for that
+ * timeline.
+ */
+ selected_tli = ((WalSummaryFile *) linitial(wslist))->tli;
+ oldest_segno = XLogGetOldestSegno(selected_tli);
+ if (oldest_segno != 0)
+ XLogSegNoOffsetToRecPtr(oldest_segno, 0, wal_segment_size,
+ oldest_lsn);
+
+ /* Consider each WAL summary file on the selected timeline in turn. */
+ foreach(lc, wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+
+ HandleWalSummarizerInterrupts();
+
+ /* If it's not on this timeline, it's not time to consider it. */
+ if (selected_tli != ws->tli)
+ continue;
+
+ /*
+ * If the corresponding WAL no longer exists, we can remove the summary
+ * file, provided its modification time is old enough.
+ */
+ if (XLogRecPtrIsInvalid(oldest_lsn) || ws->end_lsn <= oldest_lsn)
+ RemoveWalSummaryIfOlderThan(ws, cutoff_time);
+
+ /*
+ * Whether we removed the file or not, we need not consider it
+ * again.
+ */
+ wslist = foreach_delete_current(wslist, lc);
+ pfree(ws);
+ }
+ }
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f72f2906ce..d621f5507f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -54,3 +54,4 @@ XactTruncationLock 44
WrapLimitsVacuumLock 46
NotifyQueueTailLock 47
WaitEventExtensionLock 48
+WALSummarizerLock 49
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index d99ecdd4d8..0dd9b98b3e 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -306,7 +306,8 @@ pgstat_io_snapshot_cb(void)
* - Syslogger because it is not connected to shared memory
* - Archiver because most relevant archiving IO is delegated to a
* specialized command or module
-* - WAL Receiver and WAL Writer IO is not tracked in pg_stat_io for now
+* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
+* pg_stat_io for now
*
* Function returns true if BackendType participates in the cumulative stats
* subsystem for IO and false if it does not.
@@ -328,6 +329,7 @@ pgstat_tracks_io_bktype(BackendType bktype)
case B_LOGGER:
case B_WAL_RECEIVER:
case B_WAL_WRITER:
+ case B_WAL_SUMMARIZER:
return false;
case B_AUTOVAC_LAUNCHER:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index d7995931bd..7e79163466 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive,
SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
WAL_RECEIVER_MAIN "Waiting in main loop of WAL receiver process."
WAL_SENDER_MAIN "Waiting in main loop of WAL sender process."
+WAL_SUMMARIZER_WAL "Waiting in WAL summarizer for more WAL to be generated."
WAL_WRITER_MAIN "Waiting in main loop of WAL writer process."
@@ -142,6 +143,7 @@ SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFER
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
+WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
XACT_GROUP_UPDATE "Waiting for the group leader to update transaction status at end of a parallel operation."
@@ -162,6 +164,7 @@ REGISTER_SYNC_REQUEST "Waiting while sending synchronization requests to the che
SPIN_DELAY "Waiting while acquiring a contended spinlock."
VACUUM_DELAY "Waiting in a cost-based vacuum delay point."
VACUUM_TRUNCATE "Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
+WAL_SUMMARIZER_ERROR "Waiting after a WAL summarizer error."
#
@@ -243,6 +246,8 @@ WAL_COPY_WRITE "Waiting for a write when creating a new WAL segment by copying a
WAL_INIT_SYNC "Waiting for a newly initialized WAL file to reach durable storage."
WAL_INIT_WRITE "Waiting for a write while initializing a new WAL file."
WAL_READ "Waiting for a read from a WAL file."
+WAL_SUMMARY_READ "Waiting for a read from a WAL summary file."
+WAL_SUMMARY_WRITE "Waiting for a write to a WAL summary file."
WAL_SYNC "Waiting for a WAL file to reach durable storage."
WAL_SYNC_METHOD_ASSIGN "Waiting for data to reach durable storage while assigning a new WAL sync method."
WAL_WRITE "Waiting for a write to a WAL file."
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 819936ec02..5c9b6f991e 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -305,6 +305,9 @@ GetBackendTypeDesc(BackendType backendType)
case B_WAL_SENDER:
backendDesc = "walsender";
break;
+ case B_WAL_SUMMARIZER:
+ backendDesc = "walsummarizer";
+ break;
case B_WAL_WRITER:
backendDesc = "walwriter";
break;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f7c9882f7c..9f59440526 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -63,6 +63,7 @@
#include "postmaster/postmaster.h"
#include "postmaster/startup.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -703,6 +704,8 @@ const char *const config_group_names[] =
gettext_noop("Write-Ahead Log / Archive Recovery"),
/* WAL_RECOVERY_TARGET */
gettext_noop("Write-Ahead Log / Recovery Target"),
+ /* WAL_SUMMARIZATION */
+ gettext_noop("Write-Ahead Log / Summarization"),
/* REPLICATION_SENDING */
gettext_noop("Replication / Sending Servers"),
/* REPLICATION_PRIMARY */
@@ -1786,6 +1789,16 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"summarize_wal", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Starts the WAL summarizer process to enable incremental backup."),
+ NULL
+ },
+ &summarize_wal,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
gettext_noop("Allows connections and queries during recovery."),
@@ -3200,6 +3213,19 @@ struct config_int ConfigureNamesInt[] =
check_wal_segment_size, NULL, NULL
},
+ {
+ {"wal_summary_keep_time", PGC_SIGHUP, WAL_SUMMARIZATION,
+ gettext_noop("Time for which WAL summary files should be kept."),
+ NULL,
+ GUC_UNIT_MIN,
+ },
+ &wal_summary_keep_time,
+ 10 * 24 * 60, /* 10 days */
+ 0,
+ INT_MAX,
+ NULL, NULL, NULL
+ },
+
{
{"autovacuum_naptime", PGC_SIGHUP, AUTOVACUUM,
gettext_noop("Time to sleep between autovacuum runs."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cf9f283cfe..b2809c711a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -302,6 +302,11 @@
#recovery_target_action = 'pause' # 'pause', 'promote', 'shutdown'
# (change requires restart)
+# - WAL Summarization -
+
+#summarize_wal = off # run WAL summarizer process?
+#wal_summary_keep_time = '10d' # when to remove old summary files, 0 = never
+
#------------------------------------------------------------------------------
# REPLICATION
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..e68b40d2b5 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -227,6 +227,7 @@ static char *extra_options = "";
static const char *const subdirs[] = {
"global",
"pg_wal/archive_status",
+ "pg_wal/summaries",
"pg_commit_ts",
"pg_dynshmem",
"pg_notify",
diff --git a/src/common/Makefile b/src/common/Makefile
index 1092dc63df..23e5a3db47 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -49,6 +49,7 @@ OBJS_COMMON = \
archive.o \
base64.o \
binaryheap.o \
+ blkreftable.o \
checksum_helper.o \
compression.o \
config_info.o \
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
new file mode 100644
index 0000000000..21ee6f5968
--- /dev/null
+++ b/src/common/blkreftable.c
@@ -0,0 +1,1308 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.c
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, we keep track of all blocks that have appeared
+ * in block references in the WAL. We also keep track of the "limit block",
+ * which is the smallest relation length in blocks known to have occurred
+ * during that range of WAL records. This should be set to 0 if the relation
+ * fork is created or destroyed, and to the post-truncation length if
+ * truncated.
+ *
+ * Whenever we set the limit block, we also forget about any modified blocks
+ * beyond that point. Those blocks don't exist any more. Such blocks can
+ * later be marked as modified again; if that happens, it means the relation
+ * was re-extended.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/common/blkreftable.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+#ifndef FRONTEND
+#include "postgres.h"
+#else
+#include "postgres_fe.h"
+#endif
+
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
+
+#include "common/blkreftable.h"
+#include "common/hashfn.h"
+#include "port/pg_crc32c.h"
+
+/*
+ * A block reference table keeps track of the status of each relation
+ * fork individually.
+ */
+typedef struct BlockRefTableKey
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+} BlockRefTableKey;
+
+/*
+ * We could need to store data either for a relation in which only a
+ * tiny fraction of the blocks have been modified or for a relation in
+ * which nearly every block has been modified, and we want a
+ * space-efficient representation in both cases. To accomplish this,
+ * we divide the relation into chunks of 2^16 blocks and choose between
+ * an array representation and a bitmap representation for each chunk.
+ *
+ * When the number of modified blocks in a given chunk is small, we
+ * essentially store an array of block numbers, but we need not store the
+ * entire block number: instead, we store each block number as a 2-byte
+ * offset from the start of the chunk.
+ *
+ * When the number of modified blocks in a given chunk is large, we switch
+ * to a bitmap representation.
+ *
+ * These same basic representational choices are used both when a block
+ * reference table is stored in memory and when it is serialized to disk.
+ *
+ * In the in-memory representation, we initially allocate each chunk with
+ * space for a number of entries given by INITIAL_ENTRIES_PER_CHUNK and
+ * increase that as necessary until we reach MAX_ENTRIES_PER_CHUNK.
+ * Any chunk whose allocated size reaches MAX_ENTRIES_PER_CHUNK is converted
+ * to a bitmap, and thus never needs to grow further.
+ */
+#define BLOCKS_PER_CHUNK (1 << 16)
+#define BLOCKS_PER_ENTRY (BITS_PER_BYTE * sizeof(uint16))
+#define MAX_ENTRIES_PER_CHUNK (BLOCKS_PER_CHUNK / BLOCKS_PER_ENTRY)
+#define INITIAL_ENTRIES_PER_CHUNK 16
+typedef uint16 *BlockRefTableChunk;
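+
+/*
+ * Worked example of the arithmetic above (nothing here is configurable; it
+ * all follows from the constants): BLOCKS_PER_CHUNK is 65536 and each uint16
+ * entry covers 16 blocks' worth of bitmap space, so MAX_ENTRIES_PER_CHUNK is
+ * 4096. An offset array that has grown to 4096 uint16 entries occupies 8 kB,
+ * exactly the size of a bitmap with one bit per block in the chunk, so the
+ * switch to bitmap format never costs space.
+ */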
+
+/*
+ * State for one relation fork.
+ *
+ * 'rlocator' and 'forknum' identify the relation fork to which this entry
+ * pertains.
+ *
+ * 'limit_block' is the shortest known length of the relation in blocks
+ * within the LSN range covered by a particular block reference table.
+ * It should be set to 0 if the relation fork is created or dropped. If the
+ * relation fork is truncated, it should be set to the number of blocks that
+ * remain after truncation.
+ *
+ * 'nchunks' is the allocated length of each of the three arrays that follow.
+ * We can only represent the status of block numbers less than nchunks *
+ * BLOCKS_PER_CHUNK.
+ *
+ * 'chunk_size' is an array storing the allocated size of each chunk.
+ *
+ * 'chunk_usage' is an array storing the number of elements used in each
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
+ * chunk is used as an array; else the corresponding chunk is used as a bitmap.
+ * When used as a bitmap, the least significant bit of the first array element
+ * is the status of the lowest-numbered block covered by this chunk.
+ *
+ * 'chunk_data' is the array of chunks.
+ */
+struct BlockRefTableEntry
+{
+ BlockRefTableKey key;
+ BlockNumber limit_block;
+ char status;
+ uint32 nchunks;
+ uint16 *chunk_size;
+ uint16 *chunk_usage;
+ BlockRefTableChunk *chunk_data;
+};
+
+/* Declare and define a hash table over type BlockRefTableEntry. */
+#define SH_PREFIX blockreftable
+#define SH_ELEMENT_TYPE BlockRefTableEntry
+#define SH_KEY_TYPE BlockRefTableKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((const unsigned char *) &key, sizeof(BlockRefTableKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(BlockRefTableKey)) == 0)
+#define SH_SCOPE static inline
+#ifdef FRONTEND
+#define SH_RAW_ALLOCATOR pg_malloc0
+#endif
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * A block reference table is basically just the hash table, but we don't
+ * want to expose that to outside callers.
+ *
+ * We keep track of the memory context in use explicitly too, so that it's
+ * easy to place all of our allocations in the same context.
+ */
+struct BlockRefTable
+{
+ blockreftable_hash *hash;
+#ifndef FRONTEND
+ MemoryContext mcxt;
+#endif
+};
+
+/*
+ * On-disk serialization format for block reference table entries.
+ */
+typedef struct BlockRefTableSerializedEntry
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ uint32 nchunks;
+} BlockRefTableSerializedEntry;
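+
+/*
+ * The resulting file layout, as produced by WriteBlockRefTable and
+ * BlockRefTableFileTerminate below, is therefore:
+ *
+ *   uint32 magic number (BLOCKREFTABLE_MAGIC)
+ *   for each relation fork, sorted by tablespace, database, relfilenumber,
+ *   and fork:
+ *     BlockRefTableSerializedEntry (with trailing all-zero chunks trimmed
+ *       from nchunks)
+ *     uint16 chunk_usage[nchunks]
+ *     for each chunk with nonzero usage: chunk_usage[i] uint16 values
+ *   all-zeroes BlockRefTableSerializedEntry, as a sentinel
+ *   pg_crc32c covering everything that precedes it
+ */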
+
+/*
+ * Buffer size, so that we avoid doing many small I/Os.
+ */
+#define BUFSIZE 65536
+
+/*
+ * Ad-hoc buffer for file I/O.
+ */
+typedef struct BlockRefTableBuffer
+{
+ io_callback_fn io_callback;
+ void *io_callback_arg;
+ char data[BUFSIZE];
+ int used;
+ int cursor;
+ pg_crc32c crc;
+} BlockRefTableBuffer;
+
+/*
+ * State for keeping track of progress while incrementally reading a block
+ * reference table file from disk.
+ *
+ * total_chunks means the number of chunks for the RelFileLocator/ForkNumber
+ * combination that is currently being read, and consumed_chunks is the number
+ * of those that have been read. (We always read all the information for
+ * a single chunk at one time, so we don't need to be able to represent the
+ * state where a chunk has been partially read.)
+ *
+ * chunk_size is the array of chunk sizes. The length is given by total_chunks.
+ *
+ * chunk_data holds the current chunk.
+ *
+ * chunk_position helps us figure out how much progress we've made in returning
+ * the block numbers for the current chunk to the caller. If the chunk is a
+ * bitmap, it's the number of bits we've scanned; otherwise, it's the number
+ * of chunk entries we've scanned.
+ */
+struct BlockRefTableReader
+{
+ BlockRefTableBuffer buffer;
+ char *error_filename;
+ report_error_fn error_callback;
+ void *error_callback_arg;
+ uint32 total_chunks;
+ uint32 consumed_chunks;
+ uint16 *chunk_size;
+ uint16 chunk_data[MAX_ENTRIES_PER_CHUNK];
+ uint32 chunk_position;
+};
+
+/*
+ * State for keeping track of progress while incrementally writing a block
+ * reference table file to disk.
+ */
+struct BlockRefTableWriter
+{
+ BlockRefTableBuffer buffer;
+};
+
+/* Function prototypes. */
+static int BlockRefTableComparator(const void *a, const void *b);
+static void BlockRefTableFlush(BlockRefTableBuffer *buffer);
+static void BlockRefTableRead(BlockRefTableReader *reader, void *data,
+ int length);
+static void BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data,
+ int length);
+static void BlockRefTableFileTerminate(BlockRefTableBuffer *buffer);
+
+/*
+ * Create an empty block reference table.
+ */
+BlockRefTable *
+CreateEmptyBlockRefTable(void)
+{
+ BlockRefTable *brtab = palloc(sizeof(BlockRefTable));
+
+ /*
+ * Even a completely empty database has a few hundred relation forks, so it
+ * seems best to size the hash on the assumption that we're going to have
+ * at least a few thousand entries.
+ */
+#ifdef FRONTEND
+ brtab->hash = blockreftable_create(4096, NULL);
+#else
+ brtab->mcxt = CurrentMemoryContext;
+ brtab->hash = blockreftable_create(brtab->mcxt, 4096, NULL);
+#endif
+
+ return brtab;
+}
+
+/*
+ * Set the "limit block" for a relation fork and forget any modified blocks
+ * with equal or higher block numbers.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We have no existing data about this relation fork, so just record
+ * the limit_block value supplied by the caller, and make sure other
+ * parts of the entry are properly initialized.
+ */
+ brtentry->limit_block = limit_block;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ return;
+ }
+
+ BlockRefTableEntrySetLimitBlock(brtentry, limit_block);
+}
+
+/*
+ * Mark a block in a given relation fork as known to have been modified.
+ */
+void
+BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ BlockRefTableEntry *brtentry;
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ bool found;
+#ifndef FRONTEND
+ MemoryContext oldcontext = MemoryContextSwitchTo(brtab->mcxt);
+#endif
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ brtentry = blockreftable_insert(brtab->hash, key, &found);
+
+ if (!found)
+ {
+ /*
+ * We want to set the initial limit block value to something higher
+ * than any legal block number. InvalidBlockNumber fits the bill.
+ */
+ brtentry->limit_block = InvalidBlockNumber;
+ brtentry->nchunks = 0;
+ brtentry->chunk_size = NULL;
+ brtentry->chunk_usage = NULL;
+ brtentry->chunk_data = NULL;
+ }
+
+ BlockRefTableEntryMarkBlockModified(brtentry, forknum, blknum);
+
+#ifndef FRONTEND
+ MemoryContextSwitchTo(oldcontext);
+#endif
+}
+
+/*
+ * Get an entry from a block reference table.
+ *
+ * If the entry does not exist, this function returns NULL. Otherwise, it
+ * returns the entry and sets *limit_block to the value from the entry.
+ */
+BlockRefTableEntry *
+BlockRefTableGetEntry(BlockRefTable *brtab, const RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber *limit_block)
+{
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ BlockRefTableEntry *entry;
+
+ Assert(limit_block != NULL);
+
+ memcpy(&key.rlocator, rlocator, sizeof(RelFileLocator));
+ key.forknum = forknum;
+ entry = blockreftable_lookup(brtab->hash, key);
+
+ if (entry != NULL)
+ *limit_block = entry->limit_block;
+
+ return entry;
+}
+
+/*
+ * Get block numbers from a table entry.
+ *
+ * 'blocks' must point to enough space to hold at least 'nblocks' block
+ * numbers, and any block numbers we manage to get will be written there.
+ * The return value is the number of block numbers actually written.
+ *
+ * We do not return block numbers unless they are greater than or equal to
+ * start_blkno and strictly less than stop_blkno.
+ */
+int
+BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ uint32 start_chunkno;
+ uint32 stop_chunkno;
+ uint32 chunkno;
+ int nresults = 0;
+
+ Assert(entry != NULL);
+
+ /*
+ * Figure out which chunks could potentially contain blocks of interest.
+ *
+ * We need to be careful about overflow here, because stop_blkno could be
+ * InvalidBlockNumber or something very close to it.
+ */
+ start_chunkno = start_blkno / BLOCKS_PER_CHUNK;
+ stop_chunkno = stop_blkno / BLOCKS_PER_CHUNK;
+ if ((stop_blkno % BLOCKS_PER_CHUNK) != 0)
+ ++stop_chunkno;
+ if (stop_chunkno > entry->nchunks)
+ stop_chunkno = entry->nchunks;
+
+ /*
+ * Loop over chunks.
+ */
+ for (chunkno = start_chunkno; chunkno < stop_chunkno; ++chunkno)
+ {
+ uint16 chunk_usage = entry->chunk_usage[chunkno];
+ BlockRefTableChunk chunk_data = entry->chunk_data[chunkno];
+ unsigned start_offset = 0;
+ unsigned stop_offset = BLOCKS_PER_CHUNK;
+
+ /*
+ * If the start and/or stop block number falls within this chunk, the
+ * whole chunk may not be of interest. Figure out which portion we
+ * care about, if it's not the whole thing.
+ */
+ if (chunkno == start_chunkno)
+ start_offset = start_blkno % BLOCKS_PER_CHUNK;
+ if (chunkno == stop_chunkno - 1)
+ stop_offset = stop_blkno % BLOCKS_PER_CHUNK;
+
+ /*
+ * Handling differs depending on whether this is an array of offsets
+ * or a bitmap.
+ */
+ if (chunk_usage == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned i;
+
+ /* It's a bitmap, so test every relevant bit. */
+ for (i = start_offset; i < stop_offset; ++i)
+ {
+ uint16 w = chunk_data[i / BLOCKS_PER_ENTRY];
+
+ if ((w & (1 << (i % BLOCKS_PER_ENTRY))) != 0)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + i;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ else
+ {
+ unsigned i;
+
+ /* It's an array of offsets, so check each one. */
+ for (i = 0; i < chunk_usage; ++i)
+ {
+ uint16 offset = chunk_data[i];
+
+ if (offset >= start_offset && offset < stop_offset)
+ {
+ BlockNumber blkno = chunkno * BLOCKS_PER_CHUNK + offset;
+
+ blocks[nresults++] = blkno;
+
+ /* Early exit if we run out of output space. */
+ if (nresults == nblocks)
+ return nresults;
+ }
+ }
+ }
+ }
+
+ return nresults;
+}
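+
+/*
+ * A minimal usage sketch (variable declarations assumed): to fetch the
+ * modified blocks below the limit block for one relation fork, a caller
+ * could write
+ *
+ *   entry = BlockRefTableGetEntry(brtab, &rlocator, forknum, &limit_block);
+ *   if (entry != NULL)
+ *       n = BlockRefTableEntryGetBlocks(entry, 0, limit_block,
+ *                                       blocks, lengthof(blocks));
+ *
+ * calling BlockRefTableEntryGetBlocks again with a higher start_blkno if
+ * the output array fills up.
+ */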
+
+/*
+ * Serialize a block reference table to a file.
+ */
+void
+WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableSerializedEntry *sdata = NULL;
+ BlockRefTableBuffer buffer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer. */
+ memset(&buffer, 0, sizeof(BlockRefTableBuffer));
+ buffer.io_callback = write_callback;
+ buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&buffer, &magic, sizeof(uint32));
+
+ /* Write the entries, assuming there are some. */
+ if (brtab->hash->members > 0)
+ {
+ unsigned i = 0;
+ blockreftable_iterator it;
+ BlockRefTableEntry *brtentry;
+
+ /* Extract entries into serializable format and sort them. */
+ sdata =
+ palloc(brtab->hash->members * sizeof(BlockRefTableSerializedEntry));
+ blockreftable_start_iterate(brtab->hash, &it);
+ while ((brtentry = blockreftable_iterate(brtab->hash, &it)) != NULL)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i++];
+
+ sentry->rlocator = brtentry->key.rlocator;
+ sentry->forknum = brtentry->key.forknum;
+ sentry->limit_block = brtentry->limit_block;
+ sentry->nchunks = brtentry->nchunks;
+
+ /* trim trailing zero entries */
+ while (sentry->nchunks > 0 &&
+ brtentry->chunk_usage[sentry->nchunks - 1] == 0)
+ sentry->nchunks--;
+ }
+ Assert(i == brtab->hash->members);
+ qsort(sdata, i, sizeof(BlockRefTableSerializedEntry),
+ BlockRefTableComparator);
+
+ /* Loop over entries in sorted order and serialize each one. */
+ for (i = 0; i < brtab->hash->members; ++i)
+ {
+ BlockRefTableSerializedEntry *sentry = &sdata[i];
+ BlockRefTableKey key = {0}; /* make sure any padding is zero */
+ unsigned j;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&buffer, sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Look up the original entry so we can access the chunks. */
+ memcpy(&key.rlocator, &sentry->rlocator, sizeof(RelFileLocator));
+ key.forknum = sentry->forknum;
+ brtentry = blockreftable_lookup(brtab->hash, key);
+ Assert(brtentry != NULL);
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry->nchunks != 0)
+ BlockRefTableWrite(&buffer, brtentry->chunk_usage,
+ sentry->nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < brtentry->nchunks; ++j)
+ {
+ if (brtentry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&buffer, brtentry->chunk_data[j],
+ brtentry->chunk_usage[j] * sizeof(uint16));
+ }
+ }
+ }
+
+ /* Write out appropriate terminator and CRC and flush buffer. */
+ BlockRefTableFileTerminate(&buffer);
+}
+
+/*
+ * Prepare to incrementally read a block reference table file.
+ *
+ * 'read_callback' is a function that can be called to read data from the
+ * underlying file (or other data source) into our internal buffer.
+ *
+ * 'read_callback_arg' is an opaque argument to be passed to read_callback.
+ *
+ * 'error_filename' is the filename that should be included in error messages
+ * if the file is found to be malformed. The value is not copied, so the
+ * caller should ensure that it remains valid until done with this
+ * BlockRefTableReader.
+ *
+ * 'error_callback' is a function to be called if the file is found to be
+ * malformed. This is not used for I/O errors, which must be handled internally
+ * by read_callback.
+ *
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
+ */
+BlockRefTableReader *
+CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg)
+{
+ BlockRefTableReader *reader;
+ uint32 magic;
+
+ /* Initialize data structure. */
+ reader = palloc0(sizeof(BlockRefTableReader));
+ reader->buffer.io_callback = read_callback;
+ reader->buffer.io_callback_arg = read_callback_arg;
+ reader->error_filename = error_filename;
+ reader->error_callback = error_callback;
+ reader->error_callback_arg = error_callback_arg;
+ INIT_CRC32C(reader->buffer.crc);
+
+ /* Verify magic number. */
+ BlockRefTableRead(reader, &magic, sizeof(uint32));
+ if (magic != BLOCKREFTABLE_MAGIC)
+ error_callback(error_callback_arg,
+ "file \"%s\" has wrong magic number: expected %u, found %u",
+ error_filename,
+ BLOCKREFTABLE_MAGIC, magic);
+
+ return reader;
+}
+
+/*
+ * Read next relation fork covered by this block reference table file.
+ *
+ * After calling this function, you must call BlockRefTableReaderGetBlocks
+ * until it returns 0 before calling it again.
+ */
+bool
+BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block)
+{
+ BlockRefTableSerializedEntry sentry;
+ BlockRefTableSerializedEntry zentry = {{0}};
+
+ /*
+ * Sanity check: caller must read all blocks from all chunks before moving
+ * on to the next relation.
+ */
+ Assert(reader->total_chunks == reader->consumed_chunks);
+
+ /* Read serialized entry. */
+ BlockRefTableRead(reader, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * If we just read the sentinel entry indicating that we've reached the
+ * end, read and check the CRC.
+ */
+ if (memcmp(&sentry, &zentry, sizeof(BlockRefTableSerializedEntry)) == 0)
+ {
+ pg_crc32c expected_crc;
+ pg_crc32c actual_crc;
+
+ /*
+ * We want to know the CRC of the file excluding the 4-byte CRC
+ * itself, so copy the current value of the CRC accumulator before
+ * reading those bytes, and use the copy to finalize the calculation.
+ */
+ expected_crc = reader->buffer.crc;
+ FIN_CRC32C(expected_crc);
+
+ /* Now we can read the actual value. */
+ BlockRefTableRead(reader, &actual_crc, sizeof(pg_crc32c));
+
+ /* Throw an error if there is a mismatch. */
+ if (!EQ_CRC32C(expected_crc, actual_crc))
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" has wrong checksum: expected %08X, found %08X",
+ reader->error_filename, expected_crc, actual_crc);
+
+ return false;
+ }
+
+ /* Read chunk size array. */
+ if (reader->chunk_size != NULL)
+ pfree(reader->chunk_size);
+ reader->chunk_size = palloc(sentry.nchunks * sizeof(uint16));
+ BlockRefTableRead(reader, reader->chunk_size,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Set up for chunk scan. */
+ reader->total_chunks = sentry.nchunks;
+ reader->consumed_chunks = 0;
+
+ /* Return data to caller. */
+ memcpy(rlocator, &sentry.rlocator, sizeof(RelFileLocator));
+ *forknum = sentry.forknum;
+ *limit_block = sentry.limit_block;
+ return true;
+}
+
+/*
+ * Get modified blocks associated with the relation fork returned by
+ * the most recent call to BlockRefTableReaderNextRelation.
+ *
+ * On return, block numbers will be written into the 'blocks' array, whose
+ * length should be passed via 'nblocks'. The return value is the number of
+ * entries actually written into the 'blocks' array, which may be less than
+ * 'nblocks' if we run out of modified blocks in the relation fork before
+ * we run out of room in the array.
+ */
+unsigned
+BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks)
+{
+ unsigned blocks_found = 0;
+
+ /* Must provide space for at least one block number to be returned. */
+ Assert(nblocks > 0);
+
+ /* Loop collecting blocks to return to caller. */
+ for (;;)
+ {
+ uint16 next_chunk_size;
+
+ /*
+ * If we've read at least one chunk, maybe it contains some block
+ * numbers that could satisfy caller's request.
+ */
+ if (reader->consumed_chunks > 0)
+ {
+ uint32 chunkno = reader->consumed_chunks - 1;
+ uint16 chunk_size = reader->chunk_size[chunkno];
+
+ if (chunk_size == MAX_ENTRIES_PER_CHUNK)
+ {
+ /* Bitmap format, so search for bits that are set. */
+ while (reader->chunk_position < BLOCKS_PER_CHUNK &&
+ blocks_found < nblocks)
+ {
+ uint16 chunkoffset = reader->chunk_position;
+ uint16 w;
+
+ w = reader->chunk_data[chunkoffset / BLOCKS_PER_ENTRY];
+ if ((w & (1u << (chunkoffset % BLOCKS_PER_ENTRY))) != 0)
+ blocks[blocks_found++] =
+ chunkno * BLOCKS_PER_CHUNK + chunkoffset;
+ ++reader->chunk_position;
+ }
+ }
+ else
+ {
+ /* Not in bitmap format, so each entry is a 2-byte offset. */
+ while (reader->chunk_position < chunk_size &&
+ blocks_found < nblocks)
+ {
+ blocks[blocks_found++] = chunkno * BLOCKS_PER_CHUNK
+ + reader->chunk_data[reader->chunk_position];
+ ++reader->chunk_position;
+ }
+ }
+ }
+
+ /* We found enough blocks, so we're done. */
+ if (blocks_found >= nblocks)
+ break;
+
+ /*
+ * We didn't find enough blocks, so we must need the next chunk. If
+ * there are none left, though, then we're done anyway.
+ */
+ if (reader->consumed_chunks == reader->total_chunks)
+ break;
+
+ /*
+ * Read data for next chunk and reset scan position to beginning of
+ * chunk. Note that the next chunk might be empty, in which case we
+ * consume the chunk without actually consuming any bytes from the
+ * underlying file.
+ */
+ next_chunk_size = reader->chunk_size[reader->consumed_chunks];
+ if (next_chunk_size > 0)
+ BlockRefTableRead(reader, reader->chunk_data,
+ next_chunk_size * sizeof(uint16));
+ ++reader->consumed_chunks;
+ reader->chunk_position = 0;
+ }
+
+ return blocks_found;
+}
+
+/*
+ * Release memory used while reading a block reference table from a file.
+ */
+void
+DestroyBlockRefTableReader(BlockRefTableReader *reader)
+{
+ if (reader->chunk_size != NULL)
+ {
+ pfree(reader->chunk_size);
+ reader->chunk_size = NULL;
+ }
+ pfree(reader);
+}
+
+/*
+ * Prepare to write a block reference table file incrementally.
+ *
+ * Caller must be able to supply BlockRefTableEntry objects sorted in the
+ * appropriate order.
+ */
+BlockRefTableWriter *
+CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg)
+{
+ BlockRefTableWriter *writer;
+ uint32 magic = BLOCKREFTABLE_MAGIC;
+
+ /* Prepare buffer and CRC check and save callbacks. */
+ writer = palloc0(sizeof(BlockRefTableWriter));
+ writer->buffer.io_callback = write_callback;
+ writer->buffer.io_callback_arg = write_callback_arg;
+ INIT_CRC32C(writer->buffer.crc);
+
+ /* Write magic number. */
+ BlockRefTableWrite(&writer->buffer, &magic, sizeof(uint32));
+
+ return writer;
+}
+
+/*
+ * Append one entry to a block reference table file.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * tablespace, then database, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+void
+BlockRefTableWriteEntry(BlockRefTableWriter *writer, BlockRefTableEntry *entry)
+{
+ BlockRefTableSerializedEntry sentry;
+ unsigned j;
+
+ /* Convert to serialized entry format. */
+ sentry.rlocator = entry->key.rlocator;
+ sentry.forknum = entry->key.forknum;
+ sentry.limit_block = entry->limit_block;
+ sentry.nchunks = entry->nchunks;
+
+ /* Trim trailing zero entries. */
+ while (sentry.nchunks > 0 && entry->chunk_usage[sentry.nchunks - 1] == 0)
+ sentry.nchunks--;
+
+ /* Write the serialized entry itself. */
+ BlockRefTableWrite(&writer->buffer, &sentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /* Write the untruncated portion of the chunk length array. */
+ if (sentry.nchunks != 0)
+ BlockRefTableWrite(&writer->buffer, entry->chunk_usage,
+ sentry.nchunks * sizeof(uint16));
+
+ /* Write the contents of each chunk. */
+ for (j = 0; j < entry->nchunks; ++j)
+ {
+ if (entry->chunk_usage[j] == 0)
+ continue;
+ BlockRefTableWrite(&writer->buffer, entry->chunk_data[j],
+ entry->chunk_usage[j] * sizeof(uint16));
+ }
+}
+
+/*
+ * Finalize an incremental write of a block reference table file.
+ */
+void
+DestroyBlockRefTableWriter(BlockRefTableWriter *writer)
+{
+ BlockRefTableFileTerminate(&writer->buffer);
+ pfree(writer);
+}
+
+/*
+ * Allocate a standalone BlockRefTableEntry.
+ *
+ * When we're manipulating a full in-memory BlockRefTable, the entries are
+ * part of the hash table and are allocated by simplehash. This routine is
+ * used by callers that want to write out a BlockRefTable to a file without
+ * needing to store the whole thing in memory at once.
+ *
+ * Entries allocated by this function can be manipulated using the functions
+ * BlockRefTableEntrySetLimitBlock and BlockRefTableEntryMarkBlockModified
+ * and then written using BlockRefTableWriteEntry and freed using
+ * BlockRefTableFreeEntry.
+ */
+BlockRefTableEntry *
+CreateBlockRefTableEntry(RelFileLocator rlocator, ForkNumber forknum)
+{
+ BlockRefTableEntry *entry = palloc0(sizeof(BlockRefTableEntry));
+
+ memcpy(&entry->key.rlocator, &rlocator, sizeof(RelFileLocator));
+ entry->key.forknum = forknum;
+ entry->limit_block = InvalidBlockNumber;
+
+ return entry;
+}
+
+/*
+ * Update a BlockRefTableEntry with a new value for the "limit block" and
+ * forget any equal-or-higher-numbered modified blocks.
+ *
+ * The "limit block" is the shortest known length of the relation within the
+ * range of WAL records covered by this block reference table.
+ */
+void
+BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block)
+{
+ unsigned chunkno;
+ unsigned limit_chunkno;
+ unsigned limit_chunkoffset;
+ BlockRefTableChunk limit_chunk;
+
+ /* If we already have an equal or lower limit block, do nothing. */
+ if (limit_block >= entry->limit_block)
+ return;
+
+ /* Record the new limit block value. */
+ entry->limit_block = limit_block;
+
+ /*
+ * Figure out which chunk would store the state of the new limit block,
+ * and which offset within that chunk.
+ */
+ limit_chunkno = limit_block / BLOCKS_PER_CHUNK;
+ limit_chunkoffset = limit_block % BLOCKS_PER_CHUNK;
+
+ /*
+ * If the number of chunks is not large enough for any blocks with equal
+ * or higher block numbers to exist, then there is nothing further to do.
+ */
+ if (limit_chunkno >= entry->nchunks)
+ return;
+
+ /* Discard entire contents of any higher-numbered chunks. */
+ for (chunkno = limit_chunkno + 1; chunkno < entry->nchunks; ++chunkno)
+ entry->chunk_usage[chunkno] = 0;
+
+ /*
+ * Next, we need to discard any offsets within the chunk that would
+ * contain the limit_block. We must handle this differently depending on
+ * whether the chunk that would contain limit_block is a bitmap or an
+ * array of offsets.
+ */
+ limit_chunk = entry->chunk_data[limit_chunkno];
+ if (entry->chunk_usage[limit_chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ unsigned chunkoffset;
+
+ /* It's a bitmap. Unset bits. */
+ for (chunkoffset = limit_chunkoffset; chunkoffset < BLOCKS_PER_CHUNK;
+ ++chunkoffset)
+ limit_chunk[chunkoffset / BLOCKS_PER_ENTRY] &=
+ ~(1 << (chunkoffset % BLOCKS_PER_ENTRY));
+ }
+ else
+ {
+ unsigned i,
+ j = 0;
+
+ /* It's an offset array. Filter out large offsets. */
+ for (i = 0; i < entry->chunk_usage[limit_chunkno]; ++i)
+ {
+ Assert(j <= i);
+ if (limit_chunk[i] < limit_chunkoffset)
+ limit_chunk[j++] = limit_chunk[i];
+ }
+ Assert(j <= entry->chunk_usage[limit_chunkno]);
+ entry->chunk_usage[limit_chunkno] = j;
+ }
+}
+
+/*
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
+ */
+void
+BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum)
+{
+ unsigned chunkno;
+ unsigned chunkoffset;
+ unsigned i;
+
+ /*
+ * Which chunk should store the state of this block? And what is the
+ * offset of this block relative to the start of that chunk?
+ */
+ chunkno = blknum / BLOCKS_PER_CHUNK;
+ chunkoffset = blknum % BLOCKS_PER_CHUNK;
+
+ /*
+ * If 'nchunks' isn't big enough for us to be able to represent the state
+ * of this block, we need to enlarge our arrays.
+ */
+ if (chunkno >= entry->nchunks)
+ {
+ unsigned max_chunks;
+ unsigned extra_chunks;
+
+ /*
+ * New array size is a power of 2, at least 16, big enough so that
+ * chunkno will be a valid array index.
+ */
+ max_chunks = Max(16, entry->nchunks);
+ while (max_chunks < chunkno + 1)
+ max_chunks *= 2;
+ Assert(max_chunks > chunkno);
+ extra_chunks = max_chunks - entry->nchunks;
+
+ if (entry->nchunks == 0)
+ {
+ entry->chunk_size = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_usage = palloc0(sizeof(uint16) * max_chunks);
+ entry->chunk_data =
+ palloc0(sizeof(BlockRefTableChunk) * max_chunks);
+ }
+ else
+ {
+ entry->chunk_size = repalloc(entry->chunk_size,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_size[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_usage = repalloc(entry->chunk_usage,
+ sizeof(uint16) * max_chunks);
+ memset(&entry->chunk_usage[entry->nchunks], 0,
+ extra_chunks * sizeof(uint16));
+ entry->chunk_data = repalloc(entry->chunk_data,
+ sizeof(BlockRefTableChunk) * max_chunks);
+ memset(&entry->chunk_data[entry->nchunks], 0,
+ extra_chunks * sizeof(BlockRefTableChunk));
+ }
+ entry->nchunks = max_chunks;
+ }
+
+ /*
+ * If the chunk that covers this block number doesn't exist yet, create it
+ * as an array and add the appropriate offset to it. We make it pretty
+ * small initially, because there might only be 1 or a few block
+ * references in this chunk and we don't want to use up too much memory.
+ */
+ if (entry->chunk_size[chunkno] == 0)
+ {
+ entry->chunk_data[chunkno] =
+ palloc(sizeof(uint16) * INITIAL_ENTRIES_PER_CHUNK);
+ entry->chunk_size[chunkno] = INITIAL_ENTRIES_PER_CHUNK;
+ entry->chunk_data[chunkno][0] = chunkoffset;
+ entry->chunk_usage[chunkno] = 1;
+ return;
+ }
+
+ /*
+ * If the number of entries in this chunk is already maximum, it must be a
+ * bitmap. Just set the appropriate bit.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK)
+ {
+ BlockRefTableChunk chunk = entry->chunk_data[chunkno];
+
+ chunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+ return;
+ }
+
+ /*
+ * There is an existing chunk and it's in array format. Let's find out
+ * whether it already has an entry for this block. If so, we do not need
+ * to do anything.
+ */
+ for (i = 0; i < entry->chunk_usage[chunkno]; ++i)
+ {
+ if (entry->chunk_data[chunkno][i] == chunkoffset)
+ return;
+ }
+
+ /*
+ * If the number of entries currently used is one less than the maximum,
+ * it's time to convert to bitmap format.
+ */
+ if (entry->chunk_usage[chunkno] == MAX_ENTRIES_PER_CHUNK - 1)
+ {
+ BlockRefTableChunk newchunk;
+ unsigned j;
+
+ /* Allocate a new chunk. */
+ newchunk = palloc0(MAX_ENTRIES_PER_CHUNK * sizeof(uint16));
+
+ /* Set the bit for each existing entry. */
+ for (j = 0; j < entry->chunk_usage[chunkno]; ++j)
+ {
+ unsigned coff = entry->chunk_data[chunkno][j];
+
+ newchunk[coff / BLOCKS_PER_ENTRY] |=
+ 1 << (coff % BLOCKS_PER_ENTRY);
+ }
+
+ /* Set the bit for the new entry. */
+ newchunk[chunkoffset / BLOCKS_PER_ENTRY] |=
+ 1 << (chunkoffset % BLOCKS_PER_ENTRY);
+
+ /* Swap the new chunk into place and update metadata. */
+ pfree(entry->chunk_data[chunkno]);
+ entry->chunk_data[chunkno] = newchunk;
+ entry->chunk_size[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ entry->chunk_usage[chunkno] = MAX_ENTRIES_PER_CHUNK;
+ return;
+ }
+
+ /*
+ * OK, we currently have an array, and we don't need to convert to a
+ * bitmap, but we do need to add a new element. If there's not enough
+ * room, we'll have to expand the array.
+ */
+ if (entry->chunk_usage[chunkno] == entry->chunk_size[chunkno])
+ {
+ unsigned newsize = entry->chunk_size[chunkno] * 2;
+
+ Assert(newsize <= MAX_ENTRIES_PER_CHUNK);
+ entry->chunk_data[chunkno] = repalloc(entry->chunk_data[chunkno],
+ newsize * sizeof(uint16));
+ entry->chunk_size[chunkno] = newsize;
+ }
+
+ /* Now we can add the new entry. */
+ entry->chunk_data[chunkno][entry->chunk_usage[chunkno]] =
+ chunkoffset;
+ entry->chunk_usage[chunkno]++;
+}
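+
+/*
+ * Growth trace for a single chunk, following the code above: the first
+ * modified block allocates a 16-entry offset array; the array doubles as
+ * needed (16, 32, ..., up to 4096 entries); when the 4096th distinct offset
+ * arrives, the chunk is converted to an 8 kB bitmap and never changes
+ * representation again.
+ */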
+
+/*
+ * Release memory for a BlockRefTableEntry that was created by
+ * CreateBlockRefTableEntry.
+ */
+void
+BlockRefTableFreeEntry(BlockRefTableEntry *entry)
+{
+ if (entry->chunk_size != NULL)
+ {
+ pfree(entry->chunk_size);
+ entry->chunk_size = NULL;
+ }
+
+ if (entry->chunk_usage != NULL)
+ {
+ pfree(entry->chunk_usage);
+ entry->chunk_usage = NULL;
+ }
+
+ if (entry->chunk_data != NULL)
+ {
+ pfree(entry->chunk_data);
+ entry->chunk_data = NULL;
+ }
+
+ pfree(entry);
+}
+
+/*
+ * Comparator for BlockRefTableSerializedEntry objects.
+ *
+ * We make the tablespace OID the first column of the sort key to match
+ * the on-disk tree structure.
+ */
+static int
+BlockRefTableComparator(const void *a, const void *b)
+{
+ const BlockRefTableSerializedEntry *sa = a;
+ const BlockRefTableSerializedEntry *sb = b;
+
+ if (sa->rlocator.spcOid > sb->rlocator.spcOid)
+ return 1;
+ if (sa->rlocator.spcOid < sb->rlocator.spcOid)
+ return -1;
+
+ if (sa->rlocator.dbOid > sb->rlocator.dbOid)
+ return 1;
+ if (sa->rlocator.dbOid < sb->rlocator.dbOid)
+ return -1;
+
+ if (sa->rlocator.relNumber > sb->rlocator.relNumber)
+ return 1;
+ if (sa->rlocator.relNumber < sb->rlocator.relNumber)
+ return -1;
+
+ if (sa->forknum > sb->forknum)
+ return 1;
+ if (sa->forknum < sb->forknum)
+ return -1;
+
+ return 0;
+}
+
+/*
+ * Flush any buffered data out of a BlockRefTableBuffer.
+ */
+static void
+BlockRefTableFlush(BlockRefTableBuffer *buffer)
+{
+ buffer->io_callback(buffer->io_callback_arg, buffer->data, buffer->used);
+ buffer->used = 0;
+}
+
+/*
+ * Read data from a BlockRefTableBuffer, and update the running CRC
+ * calculation for the returned data (but not any data that we may have
+ * buffered but not yet actually returned).
+ */
+static void
+BlockRefTableRead(BlockRefTableReader *reader, void *data, int length)
+{
+ BlockRefTableBuffer *buffer = &reader->buffer;
+
+ /* Loop until read is fully satisfied. */
+ while (length > 0)
+ {
+ if (buffer->cursor < buffer->used)
+ {
+ /*
+ * If any buffered data is available, use that to satisfy as much
+ * of the request as possible.
+ */
+ int bytes_to_copy = Min(length, buffer->used - buffer->cursor);
+
+ memcpy(data, &buffer->data[buffer->cursor], bytes_to_copy);
+ COMP_CRC32C(buffer->crc, &buffer->data[buffer->cursor],
+ bytes_to_copy);
+ buffer->cursor += bytes_to_copy;
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ }
+ else if (length >= BUFSIZE)
+ {
+ /*
+ * If the request length is long, read directly into caller's
+ * buffer.
+ */
+ int bytes_read;
+
+ bytes_read = buffer->io_callback(buffer->io_callback_arg,
+ data, length);
+ COMP_CRC32C(buffer->crc, data, bytes_read);
+ data = ((char *) data) + bytes_read;
+ length -= bytes_read;
+
+ /* If we didn't get anything, that's bad. */
+ if (bytes_read == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ else
+ {
+ /*
+ * Refill our buffer.
+ */
+ buffer->used = buffer->io_callback(buffer->io_callback_arg,
+ buffer->data, BUFSIZE);
+ buffer->cursor = 0;
+
+ /* If we didn't get anything, that's bad. */
+ if (buffer->used == 0)
+ reader->error_callback(reader->error_callback_arg,
+ "file \"%s\" ends unexpectedly",
+ reader->error_filename);
+ }
+ }
+}
+
+/*
+ * Supply data to a BlockRefTableBuffer for write to the underlying File,
+ * and update the running CRC calculation for that data.
+ */
+static void
+BlockRefTableWrite(BlockRefTableBuffer *buffer, void *data, int length)
+{
+ /* Update running CRC calculation. */
+ COMP_CRC32C(buffer->crc, data, length);
+
+ /* If the new data can't fit into the buffer, flush the buffer. */
+ if (buffer->used + length > BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, buffer->data,
+ buffer->used);
+ buffer->used = 0;
+ }
+
+ /* If the new data would fill the buffer, or more, write it directly. */
+ if (length >= BUFSIZE)
+ {
+ buffer->io_callback(buffer->io_callback_arg, data, length);
+ return;
+ }
+
+ /* Otherwise, copy the new data into the buffer. */
+ memcpy(&buffer->data[buffer->used], data, length);
+ buffer->used += length;
+ Assert(buffer->used <= BUFSIZE);
+}
+
+/*
+ * Generate the sentinel and CRC required at the end of a block reference
+ * table file and flush them out of our internal buffer.
+ */
+static void
+BlockRefTableFileTerminate(BlockRefTableBuffer *buffer)
+{
+ BlockRefTableSerializedEntry zentry = {{0}};
+ pg_crc32c crc;
+
+ /* Write a sentinel indicating that there are no more entries. */
+ BlockRefTableWrite(buffer, &zentry,
+ sizeof(BlockRefTableSerializedEntry));
+
+ /*
+ * Writing the checksum will perturb the ongoing checksum calculation, so
+ * copy the state first and finalize the computation using the copy.
+ */
+ crc = buffer->crc;
+ FIN_CRC32C(crc);
+ BlockRefTableWrite(buffer, &crc, sizeof(pg_crc32c));
+
+ /* Flush any leftover data out of our buffer. */
+ BlockRefTableFlush(buffer);
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index d52dd12bc9..7ad4270a3a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -4,6 +4,7 @@ common_sources = files(
'archive.c',
'base64.c',
'binaryheap.c',
+ 'blkreftable.c',
'checksum_helper.c',
'compression.c',
'controldata_utils.c',
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..da71580364 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -209,6 +209,7 @@ extern int XLogFileOpen(XLogSegNo segno, TimeLineID tli);
extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
extern XLogSegNo XLogGetLastRemovedSegno(void);
+extern XLogSegNo XLogGetOldestSegno(TimeLineID tli);
extern void XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN);
extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
diff --git a/src/include/backup/walsummary.h b/src/include/backup/walsummary.h
new file mode 100644
index 0000000000..8e3dc7b837
--- /dev/null
+++ b/src/include/backup/walsummary.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummary.h
+ * WAL summary management
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/backup/walsummary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARY_H
+#define WALSUMMARY_H
+
+#include <time.h>
+
+#include "access/xlogdefs.h"
+#include "nodes/pg_list.h"
+#include "storage/fd.h"
+
+typedef struct WalSummaryIO
+{
+ File file;
+ off_t filepos;
+} WalSummaryIO;
+
+typedef struct WalSummaryFile
+{
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ TimeLineID tli;
+} WalSummaryFile;
+
+extern List *GetWalSummaries(TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+extern List *FilterWalSummaries(List *wslist, TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+extern bool WalSummariesAreComplete(List *wslist,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn,
+ XLogRecPtr *missing_lsn);
+extern File OpenWalSummaryFile(WalSummaryFile *ws, bool missing_ok);
+extern void RemoveWalSummaryIfOlderThan(WalSummaryFile *ws,
+ time_t cutoff_time);
+
+extern int ReadWalSummary(void *wal_summary_io, void *data, int length);
+extern int WriteWalSummary(void *wal_summary_io, void *data, int length);
+extern void ReportWalSummaryError(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+#endif /* WALSUMMARY_H */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77e8b13764..916c8ec8d0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12099,4 +12099,23 @@
proname => 'any_value_transfn', prorettype => 'anyelement',
proargtypes => 'anyelement anyelement', prosrc => 'any_value_transfn' },
+{ oid => '8436',
+ descr => 'list of available WAL summary files',
+ proname => 'pg_available_wal_summaries', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => '',
+ proallargtypes => '{int8,pg_lsn,pg_lsn}',
+ proargmodes => '{o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn}',
+ prosrc => 'pg_available_wal_summaries' },
+{ oid => '8437',
+ descr => 'contents of a WAL summary file',
+ proname => 'pg_wal_summary_contents', prorows => '100',
+ proretset => 't', provolatile => 'v', proparallel => 's',
+ prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
+ proallargtypes => '{int8,pg_lsn,pg_lsn,oid,oid,oid,int2,int8,bool}',
+ proargmodes => '{i,i,i,o,o,o,o,o,o}',
+ proargnames => '{tli,start_lsn,end_lsn,relfilenode,reltablespace,reldatabase,relforknumber,relblocknumber,is_limit_block}',
+ prosrc => 'pg_wal_summary_contents' },
+
]
diff --git a/src/include/common/blkreftable.h b/src/include/common/blkreftable.h
new file mode 100644
index 0000000000..5141f3acd5
--- /dev/null
+++ b/src/include/common/blkreftable.h
@@ -0,0 +1,116 @@
+/*-------------------------------------------------------------------------
+ *
+ * blkreftable.h
+ * Block reference tables.
+ *
+ * A block reference table is used to keep track of which blocks have
+ * been modified by WAL records within a certain LSN range.
+ *
+ * For each relation fork, there is a "limit block number". All existing
+ * blocks greater than or equal to the limit block number must be
+ * considered modified; for those less than the limit block number,
+ * we maintain a bitmap. When a relation fork is created or dropped,
+ * the limit block number should be set to 0. When it's truncated,
+ * the limit block number should be set to the length in blocks to
+ * which it was truncated.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * src/include/common/blkreftable.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BLKREFTABLE_H
+#define BLKREFTABLE_H
+
+#include "storage/block.h"
+#include "storage/relfilelocator.h"
+
+/* Magic number for serialization file format. */
+#define BLOCKREFTABLE_MAGIC 0x652b137b
+
+typedef struct BlockRefTable BlockRefTable;
+typedef struct BlockRefTableEntry BlockRefTableEntry;
+typedef struct BlockRefTableReader BlockRefTableReader;
+typedef struct BlockRefTableWriter BlockRefTableWriter;
+
+/*
+ * The return value of io_callback_fn should be the number of bytes read
+ * or written. If an error occurs, the functions should report it and
+ * not return. When used as a write callback, short writes should be retried
+ * or treated as errors, so that if the callback returns, the return value
+ * is always the request length.
+ *
+ * report_error_fn should not return.
+ */
+typedef int (*io_callback_fn) (void *callback_arg, void *data, int length);
+typedef void (*report_error_fn) (void *callback_arg, char *msg,...) pg_attribute_printf(2, 3);
+
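+/*
+ * ReadWalSummary and WriteWalSummary in backup/walsummary.h implement this
+ * interface for the backend. A minimal frontend-style write callback might
+ * look like this (a sketch; 'fd' and the error handling are assumptions of
+ * the example, not part of this API):
+ *
+ *   static int
+ *   my_write_callback(void *callback_arg, void *data, int length)
+ *   {
+ *       int fd = *(int *) callback_arg;
+ *
+ *       if (write(fd, data, length) != length)
+ *           pg_fatal("could not write block reference table: %m");
+ *       return length;
+ *   }
+ */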
+
+/*
+ * Functions for manipulating an entire in-memory block reference table.
+ */
+extern BlockRefTable *CreateEmptyBlockRefTable(void);
+extern void BlockRefTableSetLimitBlock(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber limit_block);
+extern void BlockRefTableMarkBlockModified(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void WriteBlockRefTable(BlockRefTable *brtab,
+ io_callback_fn write_callback,
+ void *write_callback_arg);
+
+extern BlockRefTableEntry *BlockRefTableGetEntry(BlockRefTable *brtab,
+ const RelFileLocator *rlocator,
+ ForkNumber forknum,
+ BlockNumber *limit_block);
+extern int BlockRefTableEntryGetBlocks(BlockRefTableEntry *entry,
+ BlockNumber start_blkno,
+ BlockNumber stop_blkno,
+ BlockNumber *blocks,
+ int nblocks);
+
+/*
+ * Functions for reading a block reference table incrementally from disk.
+ */
+extern BlockRefTableReader *CreateBlockRefTableReader(io_callback_fn read_callback,
+ void *read_callback_arg,
+ char *error_filename,
+ report_error_fn error_callback,
+ void *error_callback_arg);
+extern bool BlockRefTableReaderNextRelation(BlockRefTableReader *reader,
+ RelFileLocator *rlocator,
+ ForkNumber *forknum,
+ BlockNumber *limit_block);
+extern unsigned BlockRefTableReaderGetBlocks(BlockRefTableReader *reader,
+ BlockNumber *blocks,
+ int nblocks);
+extern void DestroyBlockRefTableReader(BlockRefTableReader *reader);
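+
+/*
+ * Typical read loop (a sketch; the callbacks and the blocks array are
+ * assumptions of the example):
+ *
+ *   reader = CreateBlockRefTableReader(read_cb, &io, filename,
+ *                                      error_cb, NULL);
+ *   while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ *                                          &limit_block))
+ *   {
+ *       while ((n = BlockRefTableReaderGetBlocks(reader, blocks,
+ *                                                lengthof(blocks))) > 0)
+ *           ... process blocks[0 .. n-1] ...
+ *   }
+ *   DestroyBlockRefTableReader(reader);
+ */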
+
+/*
+ * Functions for writing a block reference table incrementally to disk.
+ *
+ * Note that entries must be written in the proper order, that is, sorted by
+ * database, then tablespace, then relfilenumber, then fork number. Caller
+ * is responsible for supplying data in the correct order. If that seems hard,
+ * use an in-memory BlockRefTable instead.
+ */
+extern BlockRefTableWriter *CreateBlockRefTableWriter(io_callback_fn write_callback,
+ void *write_callback_arg);
+extern void BlockRefTableWriteEntry(BlockRefTableWriter *writer,
+ BlockRefTableEntry *entry);
+extern void DestroyBlockRefTableWriter(BlockRefTableWriter *writer);
+
+extern BlockRefTableEntry *CreateBlockRefTableEntry(RelFileLocator rlocator,
+ ForkNumber forknum);
+extern void BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
+ BlockNumber limit_block);
+extern void BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
+ ForkNumber forknum,
+ BlockNumber blknum);
+extern void BlockRefTableFreeEntry(BlockRefTableEntry *entry);
+
+#endif /* BLKREFTABLE_H */
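To make the above concrete, here's a minimal sketch of how a caller
might drive this API. This isn't code from the patch, just an
illustration assuming a backend context (elog, write(2)); the
fd-based callback shows the io_callback_fn contract described above,
namely that short writes get retried and errors don't return:

static int
write_cb(void *callback_arg, void *data, int length)
{
    int         fd = *(int *) callback_arg;
    int         done = 0;

    /* Per the contract, retry short writes; on error, don't return. */
    while (done < length)
    {
        int         rc = write(fd, (char *) data + done, length - done);

        if (rc < 0)
            elog(ERROR, "could not write block reference table: %m");
        done += rc;
    }
    return length;
}

static void
blkreftable_example(int fd, RelFileLocator rlocator)
{
    BlockRefTable *brtab = CreateEmptyBlockRefTable();

    /* Fork truncated to 100 blocks; afterward, block 7 was modified. */
    BlockRefTableSetLimitBlock(brtab, &rlocator, MAIN_FORKNUM, 100);
    BlockRefTableMarkBlockModified(brtab, &rlocator, MAIN_FORKNUM, 7);

    /* Serialize; blocks >= the limit block are implicitly modified. */
    WriteBlockRefTable(brtab, write_cb, &fd);
}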
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1043a4d782..74bc2f97cb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -336,6 +336,7 @@ typedef enum BackendType
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SENDER,
+ B_WAL_SUMMARIZER,
B_WAL_WRITER,
} BackendType;
@@ -442,6 +443,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ WalSummarizerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
@@ -454,6 +456,7 @@ extern PGDLLIMPORT AuxProcType MyAuxProcType;
#define AmCheckpointerProcess() (MyAuxProcType == CheckpointerProcess)
#define AmWalWriterProcess() (MyAuxProcType == WalWriterProcess)
#define AmWalReceiverProcess() (MyAuxProcType == WalReceiverProcess)
+#define AmWalSummarizerProcess() (MyAuxProcType == WalSummarizerProcess)
/*****************************************************************************
diff --git a/src/include/postmaster/walsummarizer.h b/src/include/postmaster/walsummarizer.h
new file mode 100644
index 0000000000..180d3f34b9
--- /dev/null
+++ b/src/include/postmaster/walsummarizer.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * walsummarizer.h
+ *
+ * Header file for background WAL summarization process.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/postmaster/walsummarizer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WALSUMMARIZER_H
+#define WALSUMMARIZER_H
+
+#include "access/xlogdefs.h"
+
+extern bool summarize_wal;
+extern int wal_summary_keep_time;
+
+extern Size WalSummarizerShmemSize(void);
+extern void WalSummarizerShmemInit(void);
+extern void WalSummarizerMain(void) pg_attribute_noreturn();
+
+extern XLogRecPtr GetOldestUnsummarizedLSN(TimeLineID *tli,
+ bool *lsn_is_exact,
+ bool reset_pending_lsn);
+extern void SetWalSummarizerLatch(void);
+extern XLogRecPtr WaitForWalSummarization(XLogRecPtr lsn, long timeout,
+ XLogRecPtr *pending_lsn);
+
+#endif
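And here's roughly how I'd expect a caller to wait for summarization
to catch up to a given LSN (a hedged sketch, not from the patch; I'm
assuming the timeout is in milliseconds and that the return value is
how far summarization has actually gotten):

    XLogRecPtr  pending_lsn;
    XLogRecPtr  summarized;

    /* Wait up to 60 seconds for summaries to reach backup_start_lsn. */
    summarized = WaitForWalSummarization(backup_start_lsn, 60000,
                                         &pending_lsn);
    if (summarized < backup_start_lsn)
        ereport(ERROR,
                (errmsg("WAL summarization is not progressing"),
                 errdetail("Summarization reached %X/%X; pending LSN is %X/%X.",
                           LSN_FORMAT_ARGS(summarized),
                           LSN_FORMAT_ARGS(pending_lsn))));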
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4b25961249..e87fd25d64 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -417,11 +417,12 @@ extern PGDLLIMPORT PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer, WAL writer and archiver run during normal
- * operation. Startup process and WAL receiver also consume 2 slots, but WAL
- * writer is launched only after startup has exited, so we only need 5 slots.
+ * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
+ * run during normal operation. Startup process and WAL receiver also consume
+ * 2 slots, but WAL writer is launched only after startup has exited, so we
+ * only need 6 slots.
*/
-#define NUM_AUXILIARY_PROCS 5
+#define NUM_AUXILIARY_PROCS 6
/* configurable options */
extern PGDLLIMPORT int DeadlockTimeout;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 0c38255961..eaa8c46dda 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -72,6 +72,7 @@ enum config_group
WAL_RECOVERY,
WAL_ARCHIVE_RECOVERY,
WAL_RECOVERY_TARGET,
+ WAL_SUMMARIZATION,
REPLICATION_SENDING,
REPLICATION_PRIMARY,
REPLICATION_STANDBY,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba41149b88..9390049314 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4012,3 +4012,14 @@ yyscan_t
z_stream
z_streamp
zic_t
+BlockRefTable
+BlockRefTableBuffer
+BlockRefTableEntry
+BlockRefTableKey
+BlockRefTableReader
+BlockRefTableSerializedEntry
+BlockRefTableWriter
+SummarizerReadLocalXLogPrivate
+WalSummarizerData
+WalSummaryFile
+WalSummaryIO
--
2.39.3 (Apple Git-145)
Attachment: v16-0004-Test-patch-Enable-summarize_wal-by-default.patch (application/octet-stream)
From a5c00fe73b91d35aa4902ac1fc93acc3aac751ea Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 14 Nov 2023 13:49:28 -0500
Subject: [PATCH v16 4/4] Test patch: Enable summarize_wal by default.
To avoid test failures, we must remove the prohibition against running
with summarize_wal=on and wal_level=minimal, because a bunch of tests
run with wal_level=minimal.
Not for commit.
---
src/backend/postmaster/postmaster.c | 3 ---
src/backend/postmaster/walsummarizer.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/test/recovery/t/001_stream_rep.pl | 2 ++
src/test/recovery/t/019_replslot_limit.pl | 3 +++
src/test/recovery/t/020_archive_status.pl | 1 +
src/test/recovery/t/035_standby_logical_decoding.pl | 1 +
7 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b163e89cbb..51dc517710 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -937,9 +937,6 @@ PostmasterMain(int argc, char *argv[])
if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
- if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
- ereport(ERROR,
- (errmsg("WAL cannot be summarized when wal_level is \"minimal\"")));
/*
* Other one-time internal sanity checks can go here, if they are fast.
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..71025b43b7 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -139,7 +139,7 @@ static XLogRecPtr redo_pointer_at_last_summary_removal = InvalidXLogRecPtr;
/*
* GUC parameters
*/
-bool summarize_wal = false;
+bool summarize_wal = true;
int wal_summary_keep_time = 10 * 24 * 60;	/* ten days, in minutes */
static XLogRecPtr GetLatestLSN(TimeLineID *tli);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9f59440526..f249a9fad5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1795,7 +1795,7 @@ struct config_bool ConfigureNamesBool[] =
NULL
},
&summarize_wal,
- false,
+ true,
NULL, NULL, NULL
},
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 95f9b0d772..0d0e63b8dc 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -15,6 +15,8 @@ my $node_primary = PostgreSQL::Test::Cluster->new('primary');
$node_primary->init(
allows_streaming => 1,
auth_extra => [ '--create-role', 'repl_role' ]);
+# WAL summarization can postpone WAL recycling, leading to test failures
+$node_primary->append_conf('postgresql.conf', "summarize_wal = off");
$node_primary->start;
my $backup_name = 'my_backup';
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 7d94f15778..a8b342bb98 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -22,6 +22,7 @@ $node_primary->append_conf(
min_wal_size = 2MB
max_wal_size = 4MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary->start;
$node_primary->safe_psql('postgres',
@@ -256,6 +257,7 @@ $node_primary2->append_conf(
min_wal_size = 32MB
max_wal_size = 32MB
log_checkpoints = yes
+summarize_wal = off
));
$node_primary2->start;
$node_primary2->safe_psql('postgres',
@@ -310,6 +312,7 @@ $node_primary3->append_conf(
max_wal_size = 2MB
log_checkpoints = yes
max_slot_wal_keep_size = 1MB
+ summarize_wal = off
));
$node_primary3->start;
$node_primary3->safe_psql('postgres',
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index fa24153d4b..d0d6221368 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -15,6 +15,7 @@ $primary->init(
has_archiving => 1,
allows_streaming => 1);
$primary->append_conf('postgresql.conf', 'autovacuum = off');
+$primary->append_conf('postgresql.conf', 'summarize_wal = off');
$primary->start;
my $primary_data = $primary->data_dir;
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 9c34c0d36c..482edc57a8 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -250,6 +250,7 @@ $node_primary->append_conf(
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
+summarize_wal = off
});
$node_primary->dump_info;
$node_primary->start;
--
2.39.3 (Apple Git-145)
Attachment: v16-0002-Add-support-for-incremental-backup.patch (application/octet-stream)
From 2893716fa325f249d2a75469bcbe7df97dd204cc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 14 Jun 2023 11:31:29 -0400
Subject: [PATCH v16 2/4] Add support for incremental backup.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
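If you want to drive this from the replication protocol yourself
rather than through pg_basebackup, the flow looks roughly like the
libpq sketch below. This is illustrative only: I'm assuming here that
UPLOAD_MANIFEST switches the connection into COPY IN mode for the
manifest transfer, and error handling plus consumption of the
resulting archives is omitted. As usual for BASE_BACKUP, it needs a
replication connection.

static void
request_incremental_backup(PGconn *conn, const char *manifest,
                           int manifest_len)
{
    PGresult   *res;

    /* Ship the prior backup's manifest to the server. */
    res = PQexec(conn, "UPLOAD_MANIFEST");
    if (PQresultStatus(res) == PGRES_COPY_IN)
    {
        PQputCopyData(conn, manifest, manifest_len);
        PQputCopyEnd(conn, NULL);
        PQclear(PQgetResult(conn));     /* collect command completion */
    }
    PQclear(res);

    /* Now take the backup itself, relative to the uploaded manifest. */
    res = PQexec(conn, "BASE_BACKUP ( INCREMENTAL )");
    /* ... consume the resulting COPY data as pg_basebackup would ... */
    PQclear(res);
}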
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
XXX. It would be nice (but not essential) to do something about
incremental JSON parsing.
Patch by me. Thanks to Dilip Kumar, Andres Freund, and Álvaro Herrera
for design discussion and reviews, and to Jakub Wartak for incredibly
helpful and extensive testing.
---
doc/src/sgml/backup.sgml | 89 +-
doc/src/sgml/config.sgml | 2 -
doc/src/sgml/protocol.sgml | 24 +
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_basebackup.sgml | 37 +-
doc/src/sgml/ref/pg_combinebackup.sgml | 240 +++
doc/src/sgml/reference.sgml | 1 +
src/backend/access/transam/xlogbackup.c | 10 +
src/backend/access/transam/xlogrecovery.c | 6 +
src/backend/backup/Makefile | 1 +
src/backend/backup/basebackup.c | 319 +++-
src/backend/backup/basebackup_incremental.c | 1003 +++++++++++++
src/backend/backup/meson.build | 1 +
src/backend/replication/repl_gram.y | 14 +-
src/backend/replication/repl_scanner.l | 2 +
src/backend/replication/walsender.c | 162 ++-
src/backend/storage/ipc/ipci.c | 3 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_basebackup/bbstreamer_file.c | 1 +
src/bin/pg_basebackup/pg_basebackup.c | 112 +-
src/bin/pg_basebackup/t/010_pg_basebackup.pl | 4 +-
src/bin/pg_combinebackup/.gitignore | 1 +
src/bin/pg_combinebackup/Makefile | 52 +
src/bin/pg_combinebackup/backup_label.c | 283 ++++
src/bin/pg_combinebackup/backup_label.h | 30 +
src/bin/pg_combinebackup/copy_file.c | 169 +++
src/bin/pg_combinebackup/copy_file.h | 19 +
src/bin/pg_combinebackup/load_manifest.c | 245 ++++
src/bin/pg_combinebackup/load_manifest.h | 67 +
src/bin/pg_combinebackup/meson.build | 38 +
src/bin/pg_combinebackup/nls.mk | 11 +
src/bin/pg_combinebackup/pg_combinebackup.c | 1284 +++++++++++++++++
src/bin/pg_combinebackup/reconstruct.c | 687 +++++++++
src/bin/pg_combinebackup/reconstruct.h | 33 +
src/bin/pg_combinebackup/t/001_basic.pl | 23 +
.../pg_combinebackup/t/002_compare_backups.pl | 154 ++
src/bin/pg_combinebackup/t/003_timeline.pl | 90 ++
src/bin/pg_combinebackup/t/004_manifest.pl | 75 +
src/bin/pg_combinebackup/t/005_integrity.pl | 125 ++
src/bin/pg_combinebackup/write_manifest.c | 293 ++++
src/bin/pg_combinebackup/write_manifest.h | 33 +
src/bin/pg_resetwal/pg_resetwal.c | 36 +
src/include/access/xlogbackup.h | 2 +
src/include/backup/basebackup.h | 5 +-
src/include/backup/basebackup_incremental.h | 55 +
src/include/nodes/replnodes.h | 9 +
src/test/perl/PostgreSQL/Test/Cluster.pm | 21 +-
src/tools/pgindent/typedefs.list | 12 +
49 files changed, 5834 insertions(+), 52 deletions(-)
create mode 100644 doc/src/sgml/ref/pg_combinebackup.sgml
create mode 100644 src/backend/backup/basebackup_incremental.c
create mode 100644 src/bin/pg_combinebackup/.gitignore
create mode 100644 src/bin/pg_combinebackup/Makefile
create mode 100644 src/bin/pg_combinebackup/backup_label.c
create mode 100644 src/bin/pg_combinebackup/backup_label.h
create mode 100644 src/bin/pg_combinebackup/copy_file.c
create mode 100644 src/bin/pg_combinebackup/copy_file.h
create mode 100644 src/bin/pg_combinebackup/load_manifest.c
create mode 100644 src/bin/pg_combinebackup/load_manifest.h
create mode 100644 src/bin/pg_combinebackup/meson.build
create mode 100644 src/bin/pg_combinebackup/nls.mk
create mode 100644 src/bin/pg_combinebackup/pg_combinebackup.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.c
create mode 100644 src/bin/pg_combinebackup/reconstruct.h
create mode 100644 src/bin/pg_combinebackup/t/001_basic.pl
create mode 100644 src/bin/pg_combinebackup/t/002_compare_backups.pl
create mode 100644 src/bin/pg_combinebackup/t/003_timeline.pl
create mode 100644 src/bin/pg_combinebackup/t/004_manifest.pl
create mode 100644 src/bin/pg_combinebackup/t/005_integrity.pl
create mode 100644 src/bin/pg_combinebackup/write_manifest.c
create mode 100644 src/bin/pg_combinebackup/write_manifest.h
create mode 100644 src/include/backup/basebackup_incremental.h
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..b3468eea3c 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -857,12 +857,79 @@ test ! -f /mnt/server/archivedir/00000001000000A900000065 && cp pg_wal/0
</para>
</sect2>
+ <sect2 id="backup-incremental-backup">
+ <title>Making an Incremental Backup</title>
+
+ <para>
+ You can use <xref linkend="app-pgbasebackup"/> to take an incremental
+ backup by specifying the <literal>--incremental</literal> option. You must
+ supply, as an argument to <literal>--incremental</literal>, the backup
+ manifest of an earlier backup from the same server. In the resulting
+ backup, non-relation files will be included in their entirety, but some
+ relation files may be replaced by smaller incremental files which contain
+ only the blocks which have been changed since the earlier backup and enough
+ metadata to reconstruct the current version of the file.
+ </para>
+
+ <para>
+ To figure out which blocks need to be backed up, the server uses WAL
+ summaries, which are stored in the data directory, inside the directory
+ <literal>pg_wal/summaries</literal>. If the required summary files are not
+ present, an attempt to take an incremental backup will fail. The summaries
+ present in this directory must cover all LSNs from the start LSN of the
+ prior backup to the start LSN of the current backup. Since the server looks
+ for WAL summaries just after establishing the start LSN of the current
+ backup, the necessary summary files probably won't be instantly present
+ on disk, but the server will wait for any missing files to show up.
+ This also helps if the WAL summarization process has fallen behind.
+ However, if the necessary files have already been removed, or if the WAL
+ summarizer doesn't catch up quickly enough, the incremental backup will
+ fail.
+ </para>
+
+ <para>
+ When restoring an incremental backup, it will be necessary to have not
+ only the incremental backup itself but also all earlier backups that
+ are required to supply the blocks omitted from the incremental backup.
+ See <xref linkend="app-pgcombinebackup"/> for further information about
+ this requirement.
+ </para>
+
+ <para>
+ Note that all of the requirements for making use of a full backup also
+ apply to an incremental backup. For instance, you still need all of the
+ WAL segment files generated during and after the file system backup, and
+ any relevant WAL history files. And you still need to create a
+ <literal>recovery.signal</literal> (or <literal>standby.signal</literal>)
+ and perform recovery, as described in
+ <xref linkend="backup-pitr-recovery" />. The requirement to have earlier
+ backups available at restore time and to use
+ <literal>pg_combinebackup</literal> is an additional requirement on top of
+ everything else. Keep in mind that <application>PostgreSQL</application>
+ has no built-in mechanism to figure out which backups are still needed as
+ a basis for restoring later incremental backups. You must keep track of
+ the relationships between your full and incremental backups on your own,
+ and be certain not to remove earlier backups if they might be needed when
+ restoring later incremental backups.
+ </para>
+
+ <para>
+ Incremental backups typically only make sense for relatively large
+ databases where a significant portion of the data does not change, or only
+ changes slowly. For a small database, it's easier to ignore the existence
+ of incremental backups and just take full backups, which are simpler
+ to manage. For a large database all of which is heavily modified,
+ incremental backups won't be much smaller than full backups.
+ </para>
+ </sect2>
+
<sect2 id="backup-lowlevel-base-backup">
<title>Making a Base Backup Using the Low Level API</title>
<para>
- The procedure for making a base backup using the low level
- APIs contains a few more steps than
- the <xref linkend="app-pgbasebackup"/> method, but is relatively
+ Instead of taking a full or incremental base backup using
+ <xref linkend="app-pgbasebackup"/>, you can take a base backup using the
+ low-level API. This procedure contains a few more steps than
+ the <application>pg_basebackup</application> method, but is relatively
simple. It is very important that these steps are executed in
sequence, and that the success of a step is verified before
proceeding to the next step.
@@ -1118,7 +1185,8 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
</listitem>
<listitem>
<para>
- Restore the database files from your file system backup. Be sure that they
+ If you're restoring a full backup, you can restore the database files
+ directly into the target directories. Be sure that they
are restored with the right ownership (the database system user, not
<literal>root</literal>!) and with the right permissions. If you are using
tablespaces,
@@ -1126,6 +1194,19 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
were correctly restored.
</para>
</listitem>
+ <listitem>
+ <para>
+ If you're restoring an incremental backup, you'll need to restore the
+ incremental backup and all earlier backups upon which it directly or
+ indirectly depends to the machine where you are performing the restore.
+ These backups will need to be placed in separate directories, not the
+ target directories where you want the running server to end up.
+ Once this is done, use <xref linkend="app-pgcombinebackup"/> to pull
+ data from the full backup and all of the subsequent incremental backups
+ and write out a synthetic full backup to the target directories. As above,
+ verify that permissions and tablespace links are correct.
+ </para>
+ </listitem>
<listitem>
<para>
Remove any files present in <filename>pg_wal/</filename>; these came from the
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee98585027..b5624ca884 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4153,13 +4153,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows
<sect2 id="runtime-config-wal-summarization">
<title>WAL Summarization</title>
- <!--
<para>
These settings control WAL summarization, a feature which must be
enabled in order to perform an
<link linkend="backup-incremental-backup">incremental backup</link>.
</para>
- -->
<variablelist>
<varlistentry id="guc-summarize-wal" xreflabel="summarize_wal">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index af3f016f74..9a66918171 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2599,6 +2599,19 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</listitem>
</varlistentry>
+ <varlistentry id="protocol-replication-upload-manifest">
+ <term>
+ <literal>UPLOAD_MANIFEST</literal>
+ <indexterm><primary>UPLOAD_MANIFEST</primary></indexterm>
+ </term>
+ <listitem>
+ <para>
+ Uploads a backup manifest in preparation for taking an incremental
+ backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="protocol-replication-base-backup" xreflabel="BASE_BACKUP">
<term><literal>BASE_BACKUP</literal> [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ]
<indexterm><primary>BASE_BACKUP</primary></indexterm>
@@ -2838,6 +2851,17 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
</para>
</listitem>
</varlistentry>
+
+ <varlistentry>
+ <term><literal>INCREMENTAL</literal></term>
+ <listitem>
+ <para>
+ Requests an incremental backup. The
+ <literal>UPLOAD_MANIFEST</literal> command must be executed
+ before running a base backup with this option.
+ </para>
+ </listitem>
+ </varlistentry>
</variablelist>
</para>
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 54b5f22d6e..fda4690eab 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -202,6 +202,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgBasebackup SYSTEM "pg_basebackup.sgml">
<!ENTITY pgbench SYSTEM "pgbench.sgml">
<!ENTITY pgChecksums SYSTEM "pg_checksums.sgml">
+<!ENTITY pgCombinebackup SYSTEM "pg_combinebackup.sgml">
<!ENTITY pgConfig SYSTEM "pg_config-ref.sgml">
<!ENTITY pgControldata SYSTEM "pg_controldata.sgml">
<!ENTITY pgCtl SYSTEM "pg_ctl-ref.sgml">
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 0b87fd2d4d..7c183a5cfd 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -38,11 +38,25 @@ PostgreSQL documentation
</para>
<para>
- <application>pg_basebackup</application> makes an exact copy of the database
- cluster's files, while making sure the server is put into and
- out of backup mode automatically. Backups are always taken of the entire
- database cluster; it is not possible to back up individual databases or
- database objects. For selective backups, another tool such as
+ <application>pg_basebackup</application> can take a full or incremental
+ base backup of the database. When used to take a full backup, it makes an
+ exact copy of the database cluster's files. When used to take an incremental
+ backup, some files that would have been part of a full backup may be
+ replaced with incremental versions of the same files, containing only those
+ blocks that have been modified since the reference backup. An incremental
+ backup cannot be used directly; instead,
+ <xref linkend="app-pgcombinebackup"/> must first
+ be used to combine it with the previous backups upon which it depends.
+ See <xref linkend="backup-incremental-backup" /> for more information
+ about incremental backups, and <xref linkend="backup-pitr-recovery" />
+ for steps to recover from a backup.
+ </para>
+
+ <para>
+ In any mode, <application>pg_basebackup</application> makes sure the server
+ is put into and out of backup mode automatically. Backups are always taken of
+ the entire database cluster; it is not possible to back up individual
+ databases or database objects. For selective backups, another tool such as
<xref linkend="app-pgdump"/> must be used.
</para>
@@ -197,6 +211,19 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
+ <listitem>
+ <para>
+ Performs an <link linkend="backup-incremental-backup">incremental
+ backup</link>. The backup manifest for the reference
+ backup must be provided, and will be uploaded to the server, which will
+ respond by sending the requested incremental backup.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-R</option></term>
<term><option>--write-recovery-conf</option></term>
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 0000000000..e1729671a5
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,240 @@
+<!--
+doc/src/sgml/ref/pg_combinebackup.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgcombinebackup">
+ <indexterm zone="app-pgcombinebackup">
+ <primary>pg_combinebackup</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_combinebackup</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_combinebackup</refname>
+ <refpurpose>reconstruct a full backup from an incremental backup and dependent backups</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_combinebackup</command>
+ <arg rep="repeat"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>backup_directory</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_combinebackup</application> is used to reconstruct a
+ synthetic full backup from an
+ <link linkend="backup-incremental-backup">incremental backup</link> and the
+ earlier backups upon which it depends.
+ </para>
+
+ <para>
+ Specify all of the required backups on the command line from oldest to newest.
+ That is, the first backup directory should be the path to the full backup, and
+ the last should be the path to the final incremental backup
+ that you wish to restore. The reconstructed backup will be written to the
+ output directory specified by the <option>-o</option> option.
+ </para>
+
+ <para>
+ Although <application>pg_combinebackup</application> will attempt to verify
+ that the backups you specify form a legal backup chain from which a correct
+ full backup can be reconstructed, it is not designed to help you keep track
+ of which backups depend on which other backups. If you remove one or
+ more of the previous backups upon which your incremental
+ backup relies, you will not be able to restore it.
+ </para>
+
+ <para>
+ Since the output of <application>pg_combinebackup</application> is a
+ synthetic full backup, it can be used as an input to a future invocation of
+ <application>pg_combinebackup</application>. The synthetic full backup would
+ be specified on the command line in lieu of the chain of backups from which
+ it was reconstructed.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-d</option></term>
+ <term><option>--debug</option></term>
+ <listitem>
+ <para>
+ Print lots of debug logging output on <filename>stderr</filename>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-n</option></term>
+ <term><option>--dry-run</option></term>
+ <listitem>
+ <para>
+ The <option>-n</option>/<option>--dry-run</option> option instructs
+ <command>pg_combinebackup</command> to figure out what would be done
+ without actually creating the target directory or any output files.
+ It is particularly useful in combination with <option>--debug</option>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-N</option></term>
+ <term><option>--no-sync</option></term>
+ <listitem>
+ <para>
+ By default, <command>pg_combinebackup</command> will wait for all files
+ to be written safely to disk. This option causes
+ <command>pg_combinebackup</command> to return without waiting, which is
+ faster, but means that a subsequent operating system crash can leave
+ the output backup corrupt. Generally, this option is useful for testing
+ but should not be used when creating a production installation.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-o <replaceable class="parameter">outputdir</replaceable></option></term>
+ <term><option>--output=<replaceable class="parameter">outputdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the output directory to which the synthetic full backup
+ should be written. Currently, this argument is required.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-T <replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <term><option>--tablespace-mapping=<replaceable class="parameter">olddir</replaceable>=<replaceable class="parameter">newdir</replaceable></option></term>
+ <listitem>
+ <para>
+ Relocates the tablespace in directory <replaceable>olddir</replaceable>
+ to <replaceable>newdir</replaceable> during the backup.
+ <replaceable>olddir</replaceable> is the absolute path of the tablespace
+ as it exists in the first backup specified on the command line,
+ and <replaceable>newdir</replaceable> is the absolute path to use for the
+ tablespace in the reconstructed backup. If either path needs to contain
+ an equal sign (<literal>=</literal>), precede that with a backslash.
+ This option can be specified multiple times for multiple tablespaces.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--manifest-checksums=<replaceable class="parameter">algorithm</replaceable></option></term>
+ <listitem>
+ <para>
+ Like <xref linkend="app-pgbasebackup"/>,
+ <application>pg_combinebackup</application> writes a backup manifest
+ in the output directory. This option specifies the checksum algorithm
+ that should be applied to each file included in the backup manifest.
+ Currently, the available algorithms are <literal>NONE</literal>,
+ <literal>CRC32C</literal>, <literal>SHA224</literal>,
+ <literal>SHA256</literal>, <literal>SHA384</literal>,
+ and <literal>SHA512</literal>. The default is <literal>CRC32C</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--no-manifest</option></term>
+ <listitem>
+ <para>
+ Disables generation of a backup manifest. If this option is not
+ specified, a backup manifest for the reconstructed backup will be
+ written to the output directory.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--sync-method=<replaceable class="parameter">method</replaceable></option></term>
+ <listitem>
+ <para>
+ When set to <literal>fsync</literal>, which is the default,
+ <command>pg_combinebackup</command> will recursively open and synchronize
+ all files in the backup directory. When the plain format is used, the
+ search for files will follow symbolic links for the WAL directory and
+ each configured tablespace.
+ </para>
+ <para>
+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file system that contains the
+ backup directory. When the plain format is used,
+ <command>pg_combinebackup</command> will also synchronize the file systems
+ that contain the WAL files and each tablespace. See
+ <xref linkend="syncfs"/> for more information about using
+ <function>syncfs()</function>.
+ </para>
+ <para>
+ This option has no effect when <option>--no-sync</option> is used.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-V</option></term>
+ <term><option>--version</option></term>
+ <listitem>
+ <para>
+ Prints the <application>pg_combinebackup</application> version and
+ exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_combinebackup</application> command
+ line arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ This utility, like most other <productname>PostgreSQL</productname> utilities,
+ uses the environment variables supported by <application>libpq</application>
+ (see <xref linkend="libpq-envars"/>).
+ </para>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index e11b4b6130..a07d2b5e01 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -250,6 +250,7 @@
&pgamcheck;
&pgBasebackup;
&pgbench;
+ &pgCombinebackup;
&pgConfig;
&pgDump;
&pgDumpall;
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..f51d4282bb 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -77,6 +77,16 @@ build_backup_content(BackupState *state, bool ishistoryfile)
appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
}
+ /* either both istartpoint and istarttli should be set, or neither */
+ Assert(XLogRecPtrIsInvalid(state->istartpoint) == (state->istarttli == 0));
+ if (!XLogRecPtrIsInvalid(state->istartpoint))
+ {
+ appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
+ LSN_FORMAT_ARGS(state->istartpoint));
+ appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
+ state->istarttli);
+ }
+
data = result->data;
pfree(result);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index a2c8fa3981..6f4f81f992 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1295,6 +1295,12 @@ read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
tli_from_file, BACKUP_LABEL_FILE)));
}
+ if (fscanf(lfp, "INCREMENTAL FROM LSN: %X/%X\n", &hi, &lo) > 0)
+ ereport(FATAL,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("this is an incremental backup, not a data directory"),
+ errhint("Use pg_combinebackup to reconstruct a valid data directory.")));
+
if (ferror(lfp) || FreeFile(lfp))
ereport(FATAL,
(errcode_for_file_access(),
diff --git a/src/backend/backup/Makefile b/src/backend/backup/Makefile
index a67b3c58d4..751e6d3d5e 100644
--- a/src/backend/backup/Makefile
+++ b/src/backend/backup/Makefile
@@ -19,6 +19,7 @@ OBJS = \
basebackup.o \
basebackup_copy.o \
basebackup_gzip.o \
+ basebackup_incremental.o \
basebackup_lz4.o \
basebackup_zstd.o \
basebackup_progress.o \
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..5ee9628422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -20,8 +20,10 @@
#include "access/xlogbackup.h"
#include "backup/backup_manifest.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "backup/basebackup_sink.h"
#include "backup/basebackup_target.h"
+#include "catalog/pg_tablespace_d.h"
#include "commands/defrem.h"
#include "common/compression.h"
#include "common/file_perm.h"
@@ -33,6 +35,7 @@
#include "pgtar.h"
#include "port.h"
#include "postmaster/syslogger.h"
+#include "postmaster/walsummarizer.h"
#include "replication/walsender.h"
#include "replication/walsender_private.h"
#include "storage/bufpage.h"
@@ -64,6 +67,7 @@ typedef struct
bool fastcheckpoint;
bool nowait;
bool includewal;
+ bool incremental;
uint32 maxrate;
bool sendtblspcmapfile;
bool send_to_client;
@@ -76,21 +80,28 @@ typedef struct
} basebackup_options;
static int64 sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- struct backup_manifest_info *manifest);
+ struct backup_manifest_info *manifest,
+ IncrementalBackupInfo *ib);
static int64 sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks,
- backup_manifest_info *manifest, Oid spcoid);
+ backup_manifest_info *manifest, Oid spcoid,
+ IncrementalBackupInfo *ib);
static bool sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok,
Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
unsigned segno,
- backup_manifest_info *manifest);
+ backup_manifest_info *manifest,
+ unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks,
+ unsigned truncation_block_length);
static off_t read_file_data_into_buffer(bbsink *sink,
const char *readfilename, int fd,
off_t offset, size_t length,
BlockNumber blkno,
bool verify_checksum,
int *checksum_failures);
+static void push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length);
static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
BlockNumber blkno,
uint16 *expected_checksum);
@@ -102,7 +113,8 @@ static int64 _tarWriteHeader(bbsink *sink, const char *filename,
bool sizeonly);
static void _tarWritePadding(bbsink *sink, int len);
static void convert_link_to_directory(const char *pathbuf, struct stat *statbuf);
-static void perform_base_backup(basebackup_options *opt, bbsink *sink);
+static void perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib);
static void parse_basebackup_options(List *options, basebackup_options *opt);
static int compareWalFileNames(const ListCell *a, const ListCell *b);
static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
@@ -220,7 +232,8 @@ static const struct exclude_list_item excludeFiles[] =
* clobbered by longjmp" from stupider versions of gcc.
*/
static void
-perform_base_backup(basebackup_options *opt, bbsink *sink)
+perform_base_backup(basebackup_options *opt, bbsink *sink,
+ IncrementalBackupInfo *ib)
{
bbsink_state state;
XLogRecPtr endptr;
@@ -270,6 +283,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
ListCell *lc;
tablespaceinfo *newti;
+ /* If this is an incremental backup, execute preparatory steps. */
+ if (ib != NULL)
+ PrepareForIncrementalBackup(ib, backup_state);
+
/* Add a node for the base directory at the end */
newti = palloc0(sizeof(tablespaceinfo));
newti->size = -1;
@@ -289,10 +306,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
if (tmp->path == NULL)
tmp->size = sendDir(sink, ".", 1, true, state.tablespaces,
- true, NULL, InvalidOid);
+ true, NULL, InvalidOid, NULL);
else
tmp->size = sendTablespace(sink, tmp->path, tmp->oid, true,
- NULL);
+ NULL, NULL);
state.bytes_total += tmp->size;
}
state.bytes_total_is_valid = true;
@@ -330,7 +347,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces,
- sendtblspclinks, &manifest, InvalidOid);
+ sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
@@ -340,7 +357,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
XLOG_CONTROL_FILE)));
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
false, InvalidOid, InvalidOid,
- InvalidRelFileNumber, 0, &manifest);
+ InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
}
else
{
@@ -348,7 +365,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
bbsink_begin_archive(sink, archive_name);
- sendTablespace(sink, ti->path, ti->oid, false, &manifest);
+ sendTablespace(sink, ti->path, ti->oid, false, &manifest, ib);
}
/*
@@ -610,7 +627,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
sendFile(sink, pathbuf, pathbuf, &statbuf, false,
InvalidOid, InvalidOid, InvalidRelFileNumber, 0,
- &manifest);
+ &manifest, 0, NULL, 0);
/* unconditionally mark file as archived */
StatusFilePath(pathbuf, fname, ".done");
@@ -686,6 +703,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
bool o_checkpoint = false;
bool o_nowait = false;
bool o_wal = false;
+ bool o_incremental = false;
bool o_maxrate = false;
bool o_tablespace_map = false;
bool o_noverify_checksums = false;
@@ -764,6 +782,20 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->includewal = defGetBoolean(defel);
o_wal = true;
}
+ else if (strcmp(defel->defname, "incremental") == 0)
+ {
+ if (o_incremental)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("duplicate option \"%s\"", defel->defname)));
+ opt->incremental = defGetBoolean(defel);
+ if (opt->incremental && !summarize_wal)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
+ o_incremental = true;
+ }
else if (strcmp(defel->defname, "max_rate") == 0)
{
int64 maxrate;
@@ -956,7 +988,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
* the filesystem, bypassing the buffer cache.
*/
void
-SendBaseBackup(BaseBackupCmd *cmd)
+SendBaseBackup(BaseBackupCmd *cmd, IncrementalBackupInfo *ib)
{
basebackup_options opt;
bbsink *sink;
@@ -980,6 +1012,20 @@ SendBaseBackup(BaseBackupCmd *cmd)
set_ps_display(activitymsg);
}
+ /*
+ * If we're asked to perform an incremental backup and the user has not
+ * supplied a manifest, that's an ERROR.
+ *
+ * If we're asked to perform a full backup and the user did supply a
+ * manifest, just ignore it.
+ */
+ if (!opt.incremental)
+ ib = NULL;
+ else if (ib == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("must UPLOAD_MANIFEST before performing an incremental BASE_BACKUP")));
+
/*
* If the target is specifically 'client' then set up to stream the backup
* to the client; otherwise, it's being sent someplace else and should not
@@ -1011,7 +1057,7 @@ SendBaseBackup(BaseBackupCmd *cmd)
*/
PG_TRY();
{
- perform_base_backup(&opt, sink);
+ perform_base_backup(&opt, sink, ib);
}
PG_FINALLY();
{
@@ -1089,7 +1135,7 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
*/
static int64
sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, IncrementalBackupInfo *ib)
{
int64 size;
char pathbuf[MAXPGPATH];
@@ -1123,7 +1169,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
/* Send all the files in the tablespace version directory */
size += sendDir(sink, pathbuf, strlen(path), sizeonly, NIL, true, manifest,
- spcoid);
+ spcoid, ib);
return size;
}
@@ -1143,7 +1189,7 @@ sendTablespace(bbsink *sink, char *path, Oid spcoid, bool sizeonly,
static int64
sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
List *tablespaces, bool sendtblspclinks, backup_manifest_info *manifest,
- Oid spcoid)
+ Oid spcoid, IncrementalBackupInfo *ib)
{
DIR *dir;
struct dirent *de;
@@ -1152,7 +1198,16 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
int64 size = 0;
const char *lastDir; /* Split last dir from parent path. */
bool isRelationDir = false; /* Does directory contain relations? */
+ bool isGlobalDir = false;
Oid dboid = InvalidOid;
+ BlockNumber *relative_block_numbers = NULL;
+
+ /*
+ * Since this array is relatively large, avoid putting it on the stack.
+ * But we don't need it at all if this is not an incremental backup.
+ */
+ if (ib != NULL)
+ relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
/*
* Determine if the current path is a database directory that can contain
@@ -1185,7 +1240,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
}
}
else if (strcmp(path, "./global") == 0)
+ {
isRelationDir = true;
+ isGlobalDir = true;
+ }
dir = AllocateDir(path);
while ((de = ReadDir(dir, path)) != NULL)
@@ -1334,11 +1392,13 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
&statbuf, sizeonly);
/*
- * Also send archive_status directory (by hackishly reusing
- * statbuf from above ...).
+ * Also send archive_status and summaries directories (by
+ * hackishly reusing statbuf from above ...).
*/
size += _tarWriteHeader(sink, "./pg_wal/archive_status", NULL,
&statbuf, sizeonly);
+ size += _tarWriteHeader(sink, "./pg_wal/summaries", NULL,
+ &statbuf, sizeonly);
continue; /* don't recurse into pg_wal */
}
@@ -1407,16 +1467,64 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
if (!skip_this_dir)
size += sendDir(sink, pathbuf, basepathlen, sizeonly, tablespaces,
- sendtblspclinks, manifest, spcoid);
+ sendtblspclinks, manifest, spcoid, ib);
}
else if (S_ISREG(statbuf.st_mode))
{
bool sent = false;
+ unsigned num_blocks_required = 0;
+ unsigned truncation_block_length = 0;
+ char tarfilenamebuf[MAXPGPATH * 2];
+ char *tarfilename = pathbuf + basepathlen + 1;
+ FileBackupMethod method = BACK_UP_FILE_FULLY;
+
+ if (ib != NULL && isRelationFile)
+ {
+ Oid relspcoid;
+ char *lookup_path;
+
+ if (OidIsValid(spcoid))
+ {
+ relspcoid = spcoid;
+ lookup_path = psprintf("pg_tblspc/%u/%s", spcoid,
+ tarfilename);
+ }
+ else
+ {
+ if (isGlobalDir)
+ relspcoid = GLOBALTABLESPACE_OID;
+ else
+ relspcoid = DEFAULTTABLESPACE_OID;
+ lookup_path = pstrdup(tarfilename);
+ }
+
+ method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
+ relfilenumber, relForkNum,
+ segno, statbuf.st_size,
+ &num_blocks_required,
+ relative_block_numbers,
+ &truncation_block_length);
+ if (method == BACK_UP_FILE_INCREMENTALLY)
+ {
+ statbuf.st_size =
+ GetIncrementalFileSize(num_blocks_required);
+ snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
+ "%s/INCREMENTAL.%s",
+ path + basepathlen + 1,
+ de->d_name);
+ tarfilename = tarfilenamebuf;
+ }
+
+ pfree(lookup_path);
+ }
if (!sizeonly)
- sent = sendFile(sink, pathbuf, pathbuf + basepathlen + 1, &statbuf,
+ sent = sendFile(sink, pathbuf, tarfilename, &statbuf,
true, dboid, spcoid,
- relfilenumber, segno, manifest);
+ relfilenumber, segno, manifest,
+ num_blocks_required,
+ method == BACK_UP_FILE_INCREMENTALLY ? relative_block_numbers : NULL,
+ truncation_block_length);
if (sent || sizeonly)
{
@@ -1434,6 +1542,10 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ereport(WARNING,
(errmsg("skipping special file \"%s\"", pathbuf)));
}
+
+ if (relative_block_numbers != NULL)
+ pfree(relative_block_numbers);
+
FreeDir(dir);
return size;
}
@@ -1446,6 +1558,12 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
* If dboid is anything other than InvalidOid then any checksum failures
* detected will get reported to the cumulative stats system.
*
+ * If the file is to be sent incrementally, then num_incremental_blocks
+ * should be the number of blocks to be sent, and incremental_blocks
+ * an array of block numbers relative to the start of the current segment.
+ * If the whole file is to be sent, then incremental_blocks should be NULL,
+ * and num_incremental_blocks can have any value, as it will be ignored.
+ *
* Returns true if the file was successfully sent, false if 'missing_ok',
* and the file did not exist.
*/
@@ -1453,7 +1571,8 @@ static bool
sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
struct stat *statbuf, bool missing_ok, Oid dboid, Oid spcoid,
RelFileNumber relfilenumber, unsigned segno,
- backup_manifest_info *manifest)
+ backup_manifest_info *manifest, unsigned num_incremental_blocks,
+ BlockNumber *incremental_blocks, unsigned truncation_block_length)
{
int fd;
BlockNumber blkno = 0;
@@ -1462,6 +1581,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
pgoff_t bytes_done = 0;
bool verify_checksum = false;
pg_checksum_context checksum_ctx;
+ int ibindex = 0;
if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
elog(ERROR, "could not initialize checksum of file \"%s\"",
@@ -1494,22 +1614,111 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
RelFileNumberIsValid(relfilenumber))
verify_checksum = true;
+ /*
+ * If we're sending an incremental file, write the file header.
+ */
+ if (incremental_blocks != NULL)
+ {
+ unsigned magic = INCREMENTAL_MAGIC;
+ size_t header_bytes_done = 0;
+
+ /* Emit header data. */
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &magic, sizeof(magic));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &num_incremental_blocks, sizeof(num_incremental_blocks));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ &truncation_block_length, sizeof(truncation_block_length));
+ push_to_sink(sink, &checksum_ctx, &header_bytes_done,
+ incremental_blocks,
+ sizeof(BlockNumber) * num_incremental_blocks);
+
+ /* Flush out any data still in the buffer so it's again empty. */
+ if (header_bytes_done > 0)
+ {
+ bbsink_archive_contents(sink, header_bytes_done);
+ if (pg_checksum_update(&checksum_ctx,
+ (uint8 *) sink->bbs_buffer,
+ header_bytes_done) < 0)
+ elog(ERROR, "could not update checksum of base backup");
+ }
+
+ /* Update our notion of file position. */
+ bytes_done += sizeof(magic);
+ bytes_done += sizeof(num_incremental_blocks);
+ bytes_done += sizeof(truncation_block_length);
+ bytes_done += sizeof(BlockNumber) * num_incremental_blocks;
+ }
+
/*
* Loop until we read the amount of data the caller told us to expect. The
* file could be longer, if it was extended while we were sending it, but
* for a base backup we can ignore such extended data. It will be restored
* from WAL.
*/
- while (bytes_done < statbuf->st_size)
+ while (1)
{
- size_t remaining = statbuf->st_size - bytes_done;
+ /*
+ * Determine whether we've read all the data that we need, and if not,
+ * read some more.
+ */
+ if (incremental_blocks == NULL)
+ {
+ size_t remaining = statbuf->st_size - bytes_done;
+
+ /*
+ * If we've read the required number of bytes, then it's time to
+ * stop.
+ */
+ if (bytes_done >= statbuf->st_size)
+ break;
+
+ /*
+ * Read as many bytes as will fit in the buffer, or however many
+ * are left to read, whichever is less.
+ */
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ bytes_done, remaining,
+ blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+ }
+ else
+ {
+ BlockNumber relative_blkno;
- /* Try to read some more data. */
- cnt = read_file_data_into_buffer(sink, readfilename, fd, bytes_done,
- remaining,
- blkno + segno * RELSEG_SIZE,
- verify_checksum,
- &checksum_failures);
+ /*
+ * If we've read all the blocks, then it's time to stop.
+ */
+ if (ibindex >= num_incremental_blocks)
+ break;
+
+ /*
+ * Read just one block, whichever one is the next that we're
+ * supposed to include.
+ */
+ relative_blkno = incremental_blocks[ibindex++];
+ cnt = read_file_data_into_buffer(sink, readfilename, fd,
+ relative_blkno * BLCKSZ,
+ BLCKSZ,
+ relative_blkno + segno * RELSEG_SIZE,
+ verify_checksum,
+ &checksum_failures);
+
+ /*
+ * If we get a partial read, that must mean that the relation is
+ * being truncated. Ultimately, it should be truncated to a
+ * multiple of BLCKSZ, since this path should only be reached for
+ * relation files, but we might transiently observe an
+ * intermediate value.
+ *
+ * It should be fine to treat this just as if the entire block had
+ * been truncated away - i.e. fill this and all later blocks with
+ * zeroes. WAL replay will fix things up.
+ */
+ if (cnt < BLCKSZ)
+ break;
+ }
/*
* If the amount of data we were able to read was not a multiple of
@@ -1692,6 +1901,56 @@ read_file_data_into_buffer(bbsink *sink, const char *readfilename, int fd,
return cnt;
}
+/*
+ * Push data into a bbsink.
+ *
+ * It's better, when possible, to read data directly into the bbsink's buffer,
+ * rather than using this function to copy it into the buffer; this function is
+ * for cases where that approach is not practical.
+ *
+ * bytes_done should point to a count of the number of bytes that are
+ * currently used in the bbsink's buffer. Upon return, the bytes identified by
+ * data and length will have been copied into the bbsink's buffer, flushing
+ * as required, and *bytes_done will have been updated accordingly. If the
+ * buffer was flushed, the previous contents will also have been fed to
+ * checksum_ctx.
+ *
+ * Note that after one or more calls to this function it is the caller's
+ * responsibility to perform any required final flush.
+ */
+static void
+push_to_sink(bbsink *sink, pg_checksum_context *checksum_ctx,
+ size_t *bytes_done, void *data, size_t length)
+{
+ while (length > 0)
+ {
+ size_t bytes_to_copy;
+
+ /*
+ * We use < here rather than <= so that if the data exactly fills the
+ * remaining buffer space, we trigger a flush now.
+ */
+ if (length < sink->bbs_buffer_length - *bytes_done)
+ {
+ /* Append remaining data to buffer. */
+ memcpy(sink->bbs_buffer + *bytes_done, data, length);
+ *bytes_done += length;
+ return;
+ }
+
+ /* Copy until buffer is full and flush it. */
+ bytes_to_copy = sink->bbs_buffer_length - *bytes_done;
+ memcpy(sink->bbs_buffer + *bytes_done, data, bytes_to_copy);
+ data = ((char *) data) + bytes_to_copy;
+ length -= bytes_to_copy;
+ bbsink_archive_contents(sink, sink->bbs_buffer_length);
+ if (pg_checksum_update(checksum_ctx, (uint8 *) sink->bbs_buffer,
+ sink->bbs_buffer_length) < 0)
+ elog(ERROR, "could not update checksum");
+ *bytes_done = 0;
+ }
+}
+
/*
* Try to verify the checksum for the provided page, if it seems appropriate
* to do so.
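For anyone trying to follow the header-emitting code in sendFile()
above, the resulting incremental file layout works out to roughly
this (my own sketch, assuming 4-byte unsigned ints; the patch doesn't
declare a struct in this form):

typedef struct
{
    uint32      magic;          /* INCREMENTAL_MAGIC */
    uint32      num_blocks;     /* how many blocks this file contains */
    uint32      truncation_block_length;   /* for handling truncations */
    /* then: BlockNumber blocks[num_blocks], segment-relative numbers */
    /* then: num_blocks * BLCKSZ bytes of block contents, in that order */
} incremental_file_header_sketch;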
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
new file mode 100644
index 0000000000..1e5a5ac33a
--- /dev/null
+++ b/src/backend/backup/basebackup_incremental.c
@@ -0,0 +1,1003 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.c
+ * code for incremental backup support
+ *
+ * This code isn't actually in charge of taking an incremental backup;
+ * the actual construction of the incremental backup happens in
+ * basebackup.c. Here, we're concerned with providing the necessary
+ * supports for that operation. In particular, we need to parse the
+ * backup manifest supplied by the user taking the incremental backup
+ * and extract the required information from it.
+ *
+ * Portions Copyright (c) 2010-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/backup/basebackup_incremental.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/timeline.h"
+#include "access/xlog.h"
+#include "access/xlogrecovery.h"
+#include "backup/basebackup_incremental.h"
+#include "backup/walsummary.h"
+#include "common/blkreftable.h"
+#include "common/parse_manifest.h"
+#include "common/hashfn.h"
+#include "postmaster/walsummarizer.h"
+
+#define BLOCKS_PER_READ 512
+
+/*
+ * Details extracted from the WAL ranges present in the supplied backup manifest.
+ */
+typedef struct
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+} backup_wal_range;
+
+/*
+ * Details extracted from the file list present in the supplied backup manifest.
+ */
+typedef struct
+{
+ uint32 status;
+ const char *path;
+ size_t size;
+} backup_file_entry;
+
+static uint32 hash_string_pointer(const char *s);
+#define SH_PREFIX backup_file
+#define SH_ELEMENT_TYPE backup_file_entry
+#define SH_KEY_TYPE const char *
+#define SH_KEY path
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+struct IncrementalBackupInfo
+{
+ /* Memory context for this object and its subsidiary objects. */
+ MemoryContext mcxt;
+
+ /* Temporary buffer for storing the manifest while parsing it. */
+ StringInfoData buf;
+
+ /* WAL ranges extracted from the backup manifest. */
+ List *manifest_wal_ranges;
+
+ /*
+ * Files extracted from the backup manifest.
+ *
+ * We don't really need this information, because we use WAL summaries to
+ * figure out what's changed. It would be unsafe to just rely on the list of
+ * files that existed before, because it's possible for a file to be
+ * removed and a new one created with the same name and different
+ * contents. In such cases, the whole file must still be sent. We can tell
+ * from the WAL summaries whether that happened, but not from the file
+ * list.
+ *
+ * Nonetheless, this data is useful for sanity checking. If a file that we
+ * think we shouldn't need to send is not present in the manifest for the
+ * prior backup, something has gone terribly wrong. We retain the file
+ * names and sizes, but not the checksums or last modified times, for
+ * which we have no use.
+ *
+ * One significant downside of storing this data is that it consumes
+ * memory. If that turns out to be a problem, we might have to decide not
+ * to retain this information, or to make it optional.
+ */
+ backup_file_hash *manifest_files;
+
+ /*
+ * Block-reference table for the incremental backup.
+ *
+ * It's possible that storing the entire block-reference table in memory
+ * will be a problem for some users. The in-memory format that we're using
+ * here is pretty efficient, converging to little more than 1 bit per
+ * block for relation forks with large numbers of modified blocks. It's
+ * possible, however, that if you try to perform an incremental backup of
+ * a database with a sufficiently large number of relations on a
+ * sufficiently small machine, you could run out of memory here. If that
+ * turns out to be a problem in practice, we'll need to be more clever.
+ */
+ BlockRefTable *brtab;
+};
+
+static void manifest_process_file(JsonManifestParseContext *context,
+ char *pathname,
+ size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void manifest_report_error(JsonManifestParseContext *ib,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+static int compare_block_numbers(const void *a, const void *b);
+
+/*
+ * Create a new object for storing information extracted from the manifest
+ * supplied when creating an incremental backup.
+ */
+IncrementalBackupInfo *
+CreateIncrementalBackupInfo(MemoryContext mcxt)
+{
+ IncrementalBackupInfo *ib;
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(mcxt);
+
+ ib = palloc0(sizeof(IncrementalBackupInfo));
+ ib->mcxt = mcxt;
+ initStringInfo(&ib->buf);
+
+ /*
+ * It's hard to guess how many files a "typical" installation will have in
+ * the data directory, but a fresh initdb creates almost 1000 files as of
+ * this writing, so it seems to make sense for our estimate to be
+ * substantially higher.
+ */
+ ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return ib;
+}
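+
+/*
+ * Sketch of the expected lifecycle of one of these objects: the first three
+ * calls happen while handling UPLOAD_MANIFEST in walsender.c, and the last
+ * happens later, from the basebackup machinery, once the backup starts.
+ *
+ * ib = CreateIncrementalBackupInfo(mcxt);
+ * AppendIncrementalManifestData(ib, data, len);    (repeated per chunk)
+ * FinalizeIncrementalManifest(ib);
+ * PrepareForIncrementalBackup(ib, backup_state);
+ */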
+
+/*
+ * Before taking an incremental backup, the caller must supply the backup
+ * manifest from a prior backup. Each chunk of manifest data received
+ * from the client should be passed to this function.
+ */
+void
+AppendIncrementalManifestData(IncrementalBackupInfo *ib, const char *data,
+ int len)
+{
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * XXX. Our json parser is at present incapable of parsing json blobs
+ * incrementally, so we have to accumulate the entire backup manifest
+ * before we can do anything with it. This should really be fixed, since
+ * some users might have very large numbers of files in the data
+ * directory.
+ */
+ appendBinaryStringInfo(&ib->buf, data, len);
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Finalize an IncrementalBackupInfo object after all manifest data has
+ * been supplied via calls to AppendIncrementalManifestData.
+ */
+void
+FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
+{
+ JsonManifestParseContext context;
+ MemoryContext oldcontext;
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /* Parse the manifest. */
+ context.private_data = ib;
+ context.per_file_cb = manifest_process_file;
+ context.per_wal_range_cb = manifest_process_wal_range;
+ context.error_cb = manifest_report_error;
+ json_parse_manifest(&context, ib->buf.data, ib->buf.len);
+
+ /* Done with the buffer, so release memory. */
+ pfree(ib->buf.data);
+ ib->buf.data = NULL;
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Prepare to take an incremental backup.
+ *
+ * Before this function is called, AppendIncrementalManifestData and
+ * FinalizeIncrementalManifest should have already been called to pass all
+ * the manifest data to this object.
+ *
+ * This function performs sanity checks on the data extracted from the
+ * manifest and figures out for which WAL ranges we need summaries, and
+ * whether those summaries are available. Then, it reads and combines the
+ * data from those summary files. It also updates the backup_state with the
+ * reference TLI and LSN for the prior backup.
+ */
+void
+PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state)
+{
+ MemoryContext oldcontext;
+ List *expectedTLEs;
+ List *all_wslist,
+ *required_wslist = NIL;
+ ListCell *lc;
+ TimeLineHistoryEntry **tlep;
+ int num_wal_ranges;
+ int i;
+ bool found_backup_start_tli = false;
+ TimeLineID earliest_wal_range_tli = 0;
+ XLogRecPtr earliest_wal_range_start_lsn = InvalidXLogRecPtr;
+ TimeLineID latest_wal_range_tli = 0;
+ XLogRecPtr summarized_lsn;
+ XLogRecPtr pending_lsn;
+ XLogRecPtr prior_pending_lsn = InvalidXLogRecPtr;
+ int deadcycles = 0;
+ TimestampTz initial_time,
+ current_time;
+
+ Assert(ib->buf.data == NULL);
+
+ /* Switch to our memory context. */
+ oldcontext = MemoryContextSwitchTo(ib->mcxt);
+
+ /*
+ * A valid backup manifest must always contain at least one WAL range
+ * (usually exactly one, unless the backup spanned a timeline switch).
+ */
+ num_wal_ranges = list_length(ib->manifest_wal_ranges);
+ if (num_wal_ranges == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest contains no required WAL ranges")));
+
+ /*
+ * Match up the TLIs that appear in the WAL ranges of the backup manifest
+ * with those that appear in this server's timeline history. We expect
+ * every backup_wal_range to match to a TimeLineHistoryEntry; if it does
+ * not, that's an error.
+ *
+ * This loop also decides which of the WAL ranges in the manifest is most
+ * ancient and which one is the newest, according to the timeline history
+ * of this server, and stores TLIs of those WAL ranges into
+ * earliest_wal_range_tli and latest_wal_range_tli. It also updates
+ * earliest_wal_range_start_lsn to the start LSN of the WAL range for
+ * earliest_wal_range_tli.
+ *
+ * Note that the return value of readTimeLineHistory puts the latest
+ * timeline at the beginning of the list, not the end. Hence, the earliest
+ * TLI is the one that occurs nearest the end of the list returned by
+ * readTimeLineHistory, and the latest TLI is the one that occurs closest
+ * to the beginning.
+ */
+ expectedTLEs = readTimeLineHistory(backup_state->starttli);
+ tlep = palloc0(num_wal_ranges * sizeof(TimeLineHistoryEntry *));
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+ bool saw_earliest_wal_range_tli = false;
+ bool saw_latest_wal_range_tli = false;
+
+ /* Search this server's history for this WAL range's TLI. */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+
+ if (tle->tli == range->tli)
+ {
+ tlep[i] = tle;
+ break;
+ }
+
+ if (tle->tli == earliest_wal_range_tli)
+ saw_earliest_wal_range_tli = true;
+ if (tle->tli == latest_wal_range_tli)
+ saw_latest_wal_range_tli = true;
+ }
+
+ /*
+ * An incremental backup can only be taken relative to a backup that
+ * represents a previous state of this server. If the backup requires
+ * WAL from a timeline that's not in our history, that definitely
+ * isn't the case.
+ */
+ if (tlep[i] == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("timeline %u found in manifest, but not in this server's history",
+ range->tli)));
+
+ /*
+ * If we found this TLI in the server's history before encountering
+ * the latest TLI seen so far in the server's history, then this TLI
+ * is the latest one seen so far.
+ *
+ * If on the other hand we saw the earliest TLI seen so far before
+ * finding this TLI, this TLI is earlier than the earliest one seen so
+ * far. And if this is the first TLI for which we've searched, it's
+ * also the earliest one seen so far.
+ *
+ * On the first loop iteration, both things should necessarily be
+ * true.
+ */
+ if (!saw_latest_wal_range_tli)
+ latest_wal_range_tli = range->tli;
+ if (earliest_wal_range_tli == 0 || saw_earliest_wal_range_tli)
+ {
+ earliest_wal_range_tli = range->tli;
+ earliest_wal_range_start_lsn = range->start_lsn;
+ }
+ }
+
+ /*
+ * Propagate information about the prior backup into the backup_label that
+ * will be generated for this backup.
+ */
+ backup_state->istartpoint = earliest_wal_range_start_lsn;
+ backup_state->istarttli = earliest_wal_range_tli;
+
+ /*
+ * Sanity check start and end LSNs for the WAL ranges in the manifest.
+ *
+ * Commonly, there won't be any timeline switches during the prior backup
+ * at all, but if there are, they should happen at the same LSNs that this
+ * server switched timelines.
+ *
+ * Whether there are any timeline switches during the prior backup or not,
+ * the prior backup shouldn't require any WAL from a timeline prior to the
+ * start of that timeline. It also shouldn't require any WAL from later
+ * than the start of this backup.
+ *
+ * If any of these sanity checks fail, one possible explanation is that
+ * the user has generated WAL on the same timeline with the same LSNs more
+ * than once. For instance, if two standbys running on timeline 1 were
+ * both promoted and (due to a broken archiving setup) both selected new
+ * timeline ID 2, then it's possible that one of these checks might trip.
+ *
+ * Note that there are lots of ways for the user to do something very bad
+ * without tripping any of these checks, and they are not intended to be
+ * comprehensive. It's pretty hard to see how we could be certain of
+ * anything here. However, if there's a problem staring us right in the
+ * face, it's best to report it, so we do.
+ */
+ for (i = 0; i < num_wal_ranges; ++i)
+ {
+ backup_wal_range *range = list_nth(ib->manifest_wal_ranges, i);
+
+ if (range->tli == earliest_wal_range_tli)
+ {
+ if (range->start_lsn < tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from initial timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+ else
+ {
+ if (range->start_lsn != tlep[i]->begin)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from continuation timeline %u starting at %X/%X, but that timeline begins at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->start_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->begin))));
+ }
+
+ if (range->tli == latest_wal_range_tli)
+ {
+ if (range->end_lsn > backup_state->startpoint)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from final timeline %u ending at %X/%X, but this backup starts at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(backup_state->startpoint))));
+ }
+ else
+ {
+ if (range->end_lsn != tlep[i]->end)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("manifest requires WAL from non-final timeline %u ending at %X/%X, but this server switched timelines at %X/%X",
+ range->tli,
+ LSN_FORMAT_ARGS(range->end_lsn),
+ LSN_FORMAT_ARGS(tlep[i]->end))));
+ }
+
+ }
+
+ /*
+ * Wait for WAL summarization to catch up to the backup start LSN (but
+ * time out if it doesn't do so quickly enough).
+ */
+ initial_time = current_time = GetCurrentTimestamp();
+ while (1)
+ {
+ long timeout_in_ms = 10000;
+ unsigned elapsed_seconds;
+
+ /*
+ * Align the wait time to prevent drift. This doesn't really matter,
+ * but we'd like the warnings about how long we've been waiting to say
+ * 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
+ * drifting to something that is not a multiple of ten.
+ */
+ timeout_in_ms -=
+ TimestampDifferenceMilliseconds(initial_time, current_time) %
+ timeout_in_ms;
+
+ /* Wait for up to 10 seconds, using the aligned timeout. */
+ summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
+ timeout_in_ms, &pending_lsn);
+
+ /* If WAL summarization has progressed sufficiently, stop waiting. */
+ if (summarized_lsn >= backup_state->startpoint)
+ break;
+
+ /*
+ * Keep track of the number of cycles during which there has been no
+ * progression of pending_lsn. If pending_lsn is not advancing, that
+ * means that not only are no new files appearing on disk, but we're
+ * not even incorporating new records into the in-memory state.
+ */
+ if (pending_lsn > prior_pending_lsn)
+ {
+ prior_pending_lsn = pending_lsn;
+ deadcycles = 0;
+ }
+ else
+ ++deadcycles;
+
+ /*
+ * If we've managed to wait for an entire minute without the WAL
+ * summarizer absorbing a single WAL record, error out; probably
+ * something is wrong.
+ *
+ * We could consider also erroring out if the summarizer is taking too
+ * long to catch up, but it's not clear what rate of progress would be
+ * acceptable and what would be too slow. So instead, we just try to
+ * error out in the case where there's no progress at all. That seems
+ * likely to catch a reasonable number of the things that can go wrong
+ * in practice (e.g. the summarizer process is completely hung, say
+ * because somebody hooked up a debugger to it or something) without
+ * giving up too quickly when the system is just slow.
+ */
+ if (deadcycles >= 6)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summarization is not progressing"),
+ errdetail("Summarization is needed through %X/%X, but is stuck at %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+
+ /*
+ * Otherwise, just let the user know what's happening.
+ */
+ current_time = GetCurrentTimestamp();
+ elapsed_seconds =
+ TimestampDifferenceMilliseconds(initial_time, current_time) / 1000;
+ ereport(WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("still waiting for WAL summarization through %X/%X after %d seconds",
+ LSN_FORMAT_ARGS(backup_state->startpoint),
+ elapsed_seconds),
+ errdetail("Summarization has reached %X/%X on disk and %X/%X in memory.",
+ LSN_FORMAT_ARGS(summarized_lsn),
+ LSN_FORMAT_ARGS(pending_lsn))));
+ }
+
+ /*
+ * Retrieve a list of all WAL summaries on any timeline that overlap with
+ * the LSN range of interest. We could instead call GetWalSummaries() once
+ * per timeline in the loop that follows, but that would involve reading
+ * the directory multiple times. It should be mildly faster - and perhaps
+ * a bit safer - to do it just once.
+ */
+ all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
+ backup_state->startpoint);
+
+ /*
+ * We need WAL summaries for everything that happened during the prior
+ * backup and everything that happened afterward up until the point where
+ * the current backup started.
+ */
+ foreach(lc, expectedTLEs)
+ {
+ TimeLineHistoryEntry *tle = lfirst(lc);
+ XLogRecPtr tli_start_lsn = tle->begin;
+ XLogRecPtr tli_end_lsn = tle->end;
+ XLogRecPtr tli_missing_lsn = InvalidXLogRecPtr;
+ List *tli_wslist;
+
+ /*
+ * Working through the history of this server from the current
+ * timeline backwards, we skip everything until we find the timeline
+ * where this backup started. Most of the time, this means we won't
+ * skip anything at all, as it's unlikely that the timeline has
+ * changed since the beginning of the backup moments ago.
+ */
+ if (tle->tli == backup_state->starttli)
+ {
+ found_backup_start_tli = true;
+ tli_end_lsn = backup_state->startpoint;
+ }
+ else if (!found_backup_start_tli)
+ continue;
+
+ /*
+ * Find the summaries that overlap the LSN range of interest for this
+ * timeline. If this is the earliest timeline involved, the range of
+ * interest begins with the start LSN of the prior backup; otherwise,
+ * it begins at the LSN at which this timeline came into existence. If
+ * this is the latest TLI involved, the range of interest ends at the
+ * start LSN of the current backup; otherwise, it ends at the point
+ * where we switched from this timeline to the next one.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ tli_start_lsn = earliest_wal_range_start_lsn;
+ tli_wslist = FilterWalSummaries(all_wslist, tle->tli,
+ tli_start_lsn, tli_end_lsn);
+
+ /*
+ * There is no guarantee that the WAL summaries we found cover the
+ * entire range of LSNs for which summaries are required, or indeed
+ * that we found any WAL summaries at all. Check whether we have a
+ * problem of that sort.
+ */
+ if (!WalSummariesAreComplete(tli_wslist, tli_start_lsn, tli_end_lsn,
+ &tli_missing_lsn))
+ {
+ if (XLogRecPtrIsInvalid(tli_missing_lsn))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but no summaries for that timeline and LSN range exist",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn))));
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("WAL summaries are required on timeline %u from %X/%X to %X/%X, but the summaries for that timeline and LSN range are incomplete",
+ tle->tli,
+ LSN_FORMAT_ARGS(tli_start_lsn),
+ LSN_FORMAT_ARGS(tli_end_lsn)),
+ errdetail("The first unsummarized LSN is this range is %X/%X.",
+ LSN_FORMAT_ARGS(tli_missing_lsn))));
+ }
+
+ /*
+ * Remember that we need to read these summaries.
+ *
+ * Technically, it's possible that this could read more files than
+ * required, since tli_wslist in theory could contain redundant
+ * summaries. For instance, if we have a summary from 0/10000000 to
+ * 0/20000000 and also one from 0/00000000 to 0/30000000, then the
+ * latter subsumes the former and the former could be ignored.
+ *
+ * We ignore this possibility because the WAL summarizer only tries to
+ * generate summaries that do not overlap. If somehow they exist,
+ * we'll do a bit of extra work but the results should still be
+ * correct.
+ */
+ required_wslist = list_concat(required_wslist, tli_wslist);
+
+ /*
+ * Timelines earlier than the one in which the prior backup began are
+ * not relevant.
+ */
+ if (tle->tli == earliest_wal_range_tli)
+ break;
+ }
+
+ /*
+ * Read all of the required block reference table files and merge all of
+ * the data into a single in-memory block reference table.
+ *
+ * See the comments for struct IncrementalBackupInfo for some thoughts on
+ * memory usage.
+ */
+ ib->brtab = CreateEmptyBlockRefTable();
+ foreach(lc, required_wslist)
+ {
+ WalSummaryFile *ws = lfirst(lc);
+ WalSummaryIO wsio;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+ BlockNumber blocks[BLOCKS_PER_READ];
+
+ wsio.file = OpenWalSummaryFile(ws, false);
+ wsio.filepos = 0;
+ ereport(DEBUG1,
+ (errmsg_internal("reading WAL summary file \"%s\"",
+ FilePathName(wsio.file))));
+ reader = CreateBlockRefTableReader(ReadWalSummary, &wsio,
+ FilePathName(wsio.file),
+ ReportWalSummaryError, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ {
+ BlockRefTableSetLimitBlock(ib->brtab, &rlocator,
+ forknum, limit_block);
+
+ while (1)
+ {
+ unsigned nblocks;
+ unsigned i;
+
+ nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
+ BLOCKS_PER_READ);
+ if (nblocks == 0)
+ break;
+
+ for (i = 0; i < nblocks; ++i)
+ BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
+ forknum, blocks[i]);
+ }
+ }
+ DestroyBlockRefTableReader(reader);
+ FileClose(wsio.file);
+ }
+
+ /* Switch back to previous memory context. */
+ MemoryContextSwitchTo(oldcontext);
+}
+
+/*
+ * Get the pathname that should be used when a file is sent incrementally.
+ *
+ * The result is a palloc'd string.
+ */
+char *
+GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno)
+{
+ char *path;
+ char *lastslash;
+ char *ipath;
+
+ path = GetRelationPath(dboid, spcoid, relfilenumber, InvalidBackendId,
+ forknum);
+
+ lastslash = strrchr(path, '/');
+ Assert(lastslash != NULL);
+ *lastslash = '\0';
+
+ if (segno > 0)
+ ipath = psprintf("%s/INCREMENTAL.%s.%u", path, lastslash + 1, segno);
+ else
+ ipath = psprintf("%s/INCREMENTAL.%s", path, lastslash + 1);
+
+ pfree(path);
+
+ return ipath;
+}
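+
+/*
+ * For example, with a hypothetical relation stored at base/16384/16385,
+ * segment 2 of the main fork would be sent incrementally as
+ * base/16384/INCREMENTAL.16385.2, and segment 0 as
+ * base/16384/INCREMENTAL.16385.
+ */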
+
+/*
+ * How should we back up a particular file as part of an incremental backup?
+ *
+ * If the return value is BACK_UP_FILE_FULLY, caller should back up the whole
+ * file just as if this were not an incremental backup.
+ *
+ * If the return value is BACK_UP_FILE_INCREMENTALLY, caller should include
+ * an incremental file in the backup instead of the entire file. On return,
+ * *num_blocks_required will be set to the number of blocks that need to be
+ * sent, and the actual block numbers will have been stored in
+ * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * In addition, *truncation_block_length will be set to the value that should
+ * be included in the incremental file.
+ */
+FileBackupMethod
+GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber, ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length)
+{
+ BlockNumber absolute_block_numbers[RELSEG_SIZE];
+ BlockNumber limit_block;
+ BlockNumber start_blkno;
+ BlockNumber stop_blkno;
+ RelFileLocator rlocator;
+ BlockRefTableEntry *brtentry;
+ unsigned i;
+ unsigned nblocks;
+
+ /* Should only be called after PrepareForIncrementalBackup. */
+ Assert(ib->buf.data == NULL);
+
+ /*
+ * dboid could be InvalidOid if shared rel, but spcoid and relfilenumber
+ * should have legal values.
+ */
+ Assert(OidIsValid(spcoid));
+ Assert(RelFileNumberIsValid(relfilenumber));
+
+ /*
+ * If the file size is too large or not a multiple of BLCKSZ, then
+ * something weird is happening, so give up and send the whole file.
+ */
+ if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * The free-space map fork is not properly WAL-logged, so we need to
+ * back up the entire file every time.
+ */
+ if (forknum == FSM_FORKNUM)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * If this file was not part of the prior backup, back it up fully.
+ *
+ * If this file was created after the prior backup and before the start of
+ * the current backup, then the WAL summary information will tell us to
+ * back up the whole file. However, if this file was created after the
+ * start of the current backup, then the WAL summary won't know anything
+ * about it. Without this logic, we would erroneously conclude that it was
+ * OK to send it incrementally.
+ *
+ * Note that the file could have existed at the time of the prior backup,
+ * gotten deleted, and then a new file with the same name could have been
+ * created. In that case, this logic won't prevent the file from being
+ * backed up incrementally. But, if the deletion happened before the start
+ * of the current backup, the limit block will be 0, inducing a full
+ * backup. If the deletion happened after the start of the current backup,
+ * reconstruction will erroneously combine blocks from the current
+ * lifespan of the file with blocks from the previous lifespan -- but in
+ * this type of case, WAL replay to reach backup consistency should remove
+ * and recreate the file anyway, so the initial bogus contents should not
+ * matter.
+ */
+ if (backup_file_lookup(ib->manifest_files, path) == NULL)
+ {
+ char *ipath;
+
+ ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
+ forknum, segno);
+ if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
+ return BACK_UP_FILE_FULLY;
+ }
+
+ /* Look up the block reference table entry. */
+ rlocator.spcOid = spcoid;
+ rlocator.dbOid = dboid;
+ rlocator.relNumber = relfilenumber;
+ brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum,
+ &limit_block);
+
+ /*
+ * If there is no entry, then there have been no WAL-logged changes to the
+ * relation since the predecessor backup was taken, so we can back it up
+ * incrementally and need not include any modified blocks.
+ *
+ * However, if the file is zero-length, we should do a full backup,
+ * because an incremental file is always more than zero length, and it's
+ * silly to take an incremental backup when a full backup would be
+ * smaller.
+ */
+ if (brtentry == NULL)
+ {
+ if (size == 0)
+ return BACK_UP_FILE_FULLY;
+ *num_blocks_required = 0;
+ *truncation_block_length = size / BLCKSZ;
+ return BACK_UP_FILE_INCREMENTALLY;
+ }
+
+ /*
+ * If the limit_block is less than or equal to the point where this
+ * segment starts, send the whole file.
+ */
+ if (limit_block <= segno * RELSEG_SIZE)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Get relevant entries from the block reference table entry.
+ *
+ * We shouldn't overflow computing the start or stop block numbers, but if
+ * it manages to happen somehow, detect it and throw an error.
+ */
+ start_blkno = segno * RELSEG_SIZE;
+ stop_blkno = start_blkno + (size / BLCKSZ);
+ if (start_blkno / RELSEG_SIZE != segno || stop_blkno < start_blkno)
+ ereport(ERROR,
+ errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg_internal("overflow computing block number bounds for segment %u with size %zu",
+ segno, size));
+ nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
+ absolute_block_numbers, RELSEG_SIZE);
+ Assert(nblocks <= RELSEG_SIZE);
+
+ /*
+ * If we're going to have to send nearly all of the blocks, then just send
+ * the whole file, because that won't require much extra storage or
+ * transfer and will speed up and simplify backup restoration. It's not
+ * clear what threshold is most appropriate here and perhaps it ought to
+ * be configurable, but for now we're just going to say that if we'd need
+ * to send 90% of the blocks anyway, give up and send the whole file.
+ *
+ * NB: If you change the threshold here, at least make sure to back up the
+ * file fully when every single block must be sent, because there's
+ * nothing good about sending an incremental file in that case.
+ */
+ if (nblocks * BLCKSZ > size * 0.9)
+ return BACK_UP_FILE_FULLY;
+
+ /*
+ * Looks like we can send an incremental file, so sort the absolute
+ * block numbers and then transpose them to relative block numbers.
+ *
+ * NB: If the block reference table was using the bitmap representation
+ * for a given chunk, the block numbers in that chunk will already be
+ * sorted, but when the array-of-offsets representation is used, we can
+ * receive block numbers here out of order.
+ */
+ qsort(absolute_block_numbers, nblocks, sizeof(BlockNumber),
+ compare_block_numbers);
+ for (i = 0; i < nblocks; ++i)
+ relative_block_numbers[i] = absolute_block_numbers[i] - start_blkno;
+ *num_blocks_required = nblocks;
+
+ /*
+ * The truncation block length is the minimum length of the reconstructed
+ * file. Any block numbers below this threshold that are not present in
+ * the backup need to be fetched from the prior backup. At or above this
+ * threshold, blocks should only be included in the result if they are
+ * present in the backup. (This may require inserting zero blocks if the
+ * blocks included in the backup are non-consecutive.)
+ */
+ *truncation_block_length = size / BLCKSZ;
+ if (BlockNumberIsValid(limit_block))
+ {
+ unsigned relative_limit = limit_block - segno * RELSEG_SIZE;
+
+ if (*truncation_block_length < relative_limit)
+ *truncation_block_length = relative_limit;
+ }
+
+ /* Send it incrementally. */
+ return BACK_UP_FILE_INCREMENTALLY;
+}
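+
+/*
+ * A sketch of the intended call pattern; the real caller is the basebackup
+ * machinery, which may differ in detail:
+ *
+ * BlockNumber relative_block_numbers[RELSEG_SIZE];
+ * unsigned    num_blocks;
+ * unsigned    truncation_block_length;
+ * size_t      bytes_to_send;
+ *
+ * if (GetFileBackupMethod(ib, path, dboid, spcoid, relfilenumber, forknum,
+ *                         segno, size, &num_blocks, relative_block_numbers,
+ *                         &truncation_block_length)
+ *     == BACK_UP_FILE_INCREMENTALLY)
+ *     bytes_to_send = GetIncrementalFileSize(num_blocks);
+ */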
+
+/*
+ * Compute the size for an incremental file containing a given number of blocks.
+ */
+size_t
+GetIncrementalFileSize(unsigned num_blocks_required)
+{
+ size_t result;
+
+ /* Make sure we're not going to overflow. */
+ Assert(num_blocks_required <= RELSEG_SIZE);
+
+ /*
+ * Three four byte quantities (magic number, truncation block length,
+ * block count) followed by block numbers followed by block contents.
+ */
+ result = 3 * sizeof(uint32);
+ result += (BLCKSZ + sizeof(BlockNumber)) * num_blocks_required;
+
+ return result;
+}
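+
+/*
+ * Worked example: with the default BLCKSZ of 8192, an incremental file
+ * containing 3 blocks occupies 3 * sizeof(uint32) = 12 bytes of header,
+ * 3 * sizeof(BlockNumber) = 12 bytes of block numbers, and 3 * 8192 = 24576
+ * bytes of block contents, or 24600 bytes in total.
+ */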
+
+/*
+ * Helper function for filemap hash table.
+ */
+static uint32
+hash_string_pointer(const char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
+
+/*
+ * This callback is invoked for each file mentioned in the backup manifest.
+ *
+ * We store the path to each file and the size of each file for sanity-checking
+ * purposes. For further details, see comments for IncrementalBackupInfo.
+ */
+static void
+manifest_process_file(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_file_entry *entry;
+ bool found;
+
+ entry = backup_file_insert(ib->manifest_files, pathname, &found);
+ if (!found)
+ {
+ entry->path = MemoryContextStrdup(ib->manifest_files->ctx,
+ pathname);
+ entry->size = size;
+ }
+}
+
+/*
+ * This callback is invoked for each WAL range mentioned in the backup
+ * manifest.
+ *
+ * We're just interested in learning the oldest LSN and the corresponding TLI
+ * that appear in any WAL range.
+ */
+static void
+manifest_process_wal_range(JsonManifestParseContext *context,
+ TimeLineID tli, XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn)
+{
+ IncrementalBackupInfo *ib = context->private_data;
+ backup_wal_range *range = palloc(sizeof(backup_wal_range));
+
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
+}
+
+/*
+ * This callback is invoked if an error occurs while parsing the backup
+ * manifest.
+ */
+static void
+manifest_report_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ StringInfoData errbuf;
+
+ initStringInfo(&errbuf);
+
+ for (;;)
+ {
+ va_list ap;
+ int needed;
+
+ va_start(ap, fmt);
+ needed = appendStringInfoVA(&errbuf, fmt, ap);
+ va_end(ap);
+ if (needed == 0)
+ break;
+ enlargeStringInfo(&errbuf, needed);
+ }
+
+ ereport(ERROR,
+ errmsg_internal("%s", errbuf.data));
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
diff --git a/src/backend/backup/meson.build b/src/backend/backup/meson.build
index 5d4ebe3ebe..2a6a2dc7c0 100644
--- a/src/backend/backup/meson.build
+++ b/src/backend/backup/meson.build
@@ -5,6 +5,7 @@ backend_sources += files(
'basebackup.c',
'basebackup_copy.c',
'basebackup_gzip.c',
+ 'basebackup_incremental.c',
'basebackup_lz4.c',
'basebackup_progress.c',
'basebackup_server.c',
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index 0c874e33cf..a5d118ed68 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -76,11 +76,12 @@ Node *replication_parse_result;
%token K_EXPORT_SNAPSHOT
%token K_NOEXPORT_SNAPSHOT
%token K_USE_SNAPSHOT
+%token K_UPLOAD_MANIFEST
%type <node> command
%type <node> base_backup start_replication start_logical_replication
create_replication_slot drop_replication_slot identify_system
- read_replication_slot timeline_history show
+ read_replication_slot timeline_history show upload_manifest
%type <list> generic_option_list
%type <defelt> generic_option
%type <uintval> opt_timeline
@@ -114,6 +115,7 @@ command:
| read_replication_slot
| timeline_history
| show
+ | upload_manifest
;
/*
@@ -307,6 +309,15 @@ timeline_history:
}
;
+/* UPLOAD_MANIFEST doesn't currently accept any arguments */
+upload_manifest:
+ K_UPLOAD_MANIFEST
+ {
+ UploadManifestCmd *cmd = makeNode(UploadManifestCmd);
+
+ $$ = (Node *) cmd;
+ }
+ ;
+
opt_physical:
K_PHYSICAL
| /* EMPTY */
@@ -411,6 +422,7 @@ ident_or_keyword:
| K_EXPORT_SNAPSHOT { $$ = "export_snapshot"; }
| K_NOEXPORT_SNAPSHOT { $$ = "noexport_snapshot"; }
| K_USE_SNAPSHOT { $$ = "use_snapshot"; }
+ | K_UPLOAD_MANIFEST { $$ = "upload_manifest"; }
;
%%
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 1cc7fb858c..4805da08ee 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -136,6 +136,7 @@ EXPORT_SNAPSHOT { return K_EXPORT_SNAPSHOT; }
NOEXPORT_SNAPSHOT { return K_NOEXPORT_SNAPSHOT; }
USE_SNAPSHOT { return K_USE_SNAPSHOT; }
WAIT { return K_WAIT; }
+UPLOAD_MANIFEST { return K_UPLOAD_MANIFEST; }
{space}+ { /* do nothing */ }
@@ -303,6 +304,7 @@ replication_scanner_is_replication_command(void)
case K_DROP_REPLICATION_SLOT:
case K_READ_REPLICATION_SLOT:
case K_TIMELINE_HISTORY:
+ case K_UPLOAD_MANIFEST:
case K_SHOW:
/* Yes; push back the first token so we can parse later. */
repl_pushed_back_token = first_token;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..dbcda32554 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -58,6 +58,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "backup/basebackup.h"
+#include "backup/basebackup_incremental.h"
#include "catalog/pg_authid.h"
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
@@ -137,6 +138,17 @@ bool wake_wal_senders = false;
*/
static XLogReaderState *xlogreader = NULL;
+/*
+ * If the UPLOAD_MANIFEST command is used to provide a backup manifest in
+ * preparation for an incremental backup, uploaded_manifest will point
+ * to an object containing information about its contents, and
+ * uploaded_manifest_mcxt will point to the memory context that contains
+ * that object and all of its subordinate data. Otherwise, both values will
+ * be NULL.
+ */
+static IncrementalBackupInfo *uploaded_manifest = NULL;
+static MemoryContext uploaded_manifest_mcxt = NULL;
+
/*
* These variables keep track of the state of the timeline we're currently
* sending. sendTimeLine identifies the timeline. If sendTimeLineIsHistoric,
@@ -233,6 +245,9 @@ static void XLogSendLogical(void);
static void WalSndDone(WalSndSendDataCallback send_data);
static XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli);
static void IdentifySystem(void);
+static void UploadManifest(void);
+static bool HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib);
static void ReadReplicationSlot(ReadReplicationSlotCmd *cmd);
static void CreateReplicationSlot(CreateReplicationSlotCmd *cmd);
static void DropReplicationSlot(DropReplicationSlotCmd *cmd);
@@ -660,6 +675,143 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
pq_endmessage(&buf);
}
+/*
+ * Handle UPLOAD_MANIFEST command.
+ */
+static void
+UploadManifest(void)
+{
+ MemoryContext mcxt;
+ IncrementalBackupInfo *ib;
+ off_t offset = 0;
+ StringInfoData buf;
+
+ /*
+ * parsing the manifest will use the cryptohash stuff, which requires a
+ * resource owner
+ */
+ Assert(CurrentResourceOwner == NULL);
+ CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
+
+ /* Prepare to read manifest data into a temporary context. */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "incremental backup information",
+ ALLOCSET_DEFAULT_SIZES);
+ ib = CreateIncrementalBackupInfo(mcxt);
+
+ /* Send a CopyInResponse message */
+ pq_beginmessage(&buf, 'G');
+ pq_sendbyte(&buf, 0);
+ pq_sendint16(&buf, 0);
+ pq_endmessage_reuse(&buf);
+ pq_flush();
+
+ /* Receive packets from client until done. */
+ while (HandleUploadManifestPacket(&buf, &offset, ib))
+ ;
+
+ /* Finish up manifest processing. */
+ FinalizeIncrementalManifest(ib);
+
+ /*
+ * Discard any old manifest information and arrange to preserve the new
+ * information we just got.
+ *
+ * We assume that MemoryContextDelete and MemoryContextSetParent won't
+ * fail, and thus we shouldn't end up bailing out of here in such a way as
+ * to leave dangling pointers.
+ */
+ if (uploaded_manifest_mcxt != NULL)
+ MemoryContextDelete(uploaded_manifest_mcxt);
+ MemoryContextSetParent(mcxt, CacheMemoryContext);
+ uploaded_manifest = ib;
+ uploaded_manifest_mcxt = mcxt;
+
+ /* clean up the resource owner we created */
+ WalSndResourceCleanup(true);
+}
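+
+/*
+ * From the client's point of view, the exchange implemented here looks
+ * roughly like this (see BaseBackup() in pg_basebackup.c for the actual
+ * client-side code):
+ *
+ * => UPLOAD_MANIFEST
+ * <= CopyInResponse
+ * => CopyData (one or more manifest chunks)
+ * => CopyDone
+ * <= CommandComplete, ReadyForQuery
+ */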
+
+/*
+ * Process one packet received during the handling of an UPLOAD_MANIFEST
+ * operation.
+ *
+ * 'buf' is scratch space. This function expects it to be initialized, doesn't
+ * care what the current contents are, and may overwrite them with completely
+ * new contents.
+ *
+ * The return value is true if the caller should continue processing
+ * additional packets and false if the UPLOAD_MANIFEST operation is complete.
+ */
+static bool
+HandleUploadManifestPacket(StringInfo buf, off_t *offset,
+ IncrementalBackupInfo *ib)
+{
+ int mtype;
+ int maxmsglen;
+
+ HOLD_CANCEL_INTERRUPTS();
+
+ pq_startmsgread();
+ mtype = pq_getbyte();
+ if (mtype == EOF)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ maxmsglen = PQ_LARGE_MESSAGE_LIMIT;
+ break;
+ case 'c': /* CopyDone */
+ case 'f': /* CopyFail */
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ maxmsglen = PQ_SMALL_MESSAGE_LIMIT;
+ break;
+ default:
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg("unexpected message type 0x%02X during COPY from stdin",
+ mtype)));
+ maxmsglen = 0; /* keep compiler quiet */
+ break;
+ }
+
+ /* Now collect the message body */
+ if (pq_getmessage(buf, maxmsglen))
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("unexpected EOF on client connection with an open transaction")));
+ RESUME_CANCEL_INTERRUPTS();
+
+ /* Process the message */
+ switch (mtype)
+ {
+ case 'd': /* CopyData */
+ AppendIncrementalManifestData(ib, buf->data, buf->len);
+ return true;
+
+ case 'c': /* CopyDone */
+ return false;
+
+ case 'H': /* Flush */
+ case 'S': /* Sync */
+ /* Ignore these while in COPY mode, as we do elsewhere. */
+ return true;
+
+ case 'f':
+ ereport(ERROR,
+ (errcode(ERRCODE_QUERY_CANCELED),
+ errmsg("COPY from stdin failed: %s",
+ pq_getmsgstring(buf))));
+ }
+
+ /* Not reached. */
+ Assert(false);
+ return false;
+}
+
/*
* Handle START_REPLICATION command.
*
@@ -1801,7 +1953,7 @@ exec_replication_command(const char *cmd_string)
cmdtag = "BASE_BACKUP";
set_ps_display(cmdtag);
PreventInTransactionBlock(true, cmdtag);
- SendBaseBackup((BaseBackupCmd *) cmd_node);
+ SendBaseBackup((BaseBackupCmd *) cmd_node, uploaded_manifest);
EndReplicationCommand(cmdtag);
break;
@@ -1863,6 +2015,14 @@ exec_replication_command(const char *cmd_string)
}
break;
+ case T_UploadManifestCmd:
+ cmdtag = "UPLOAD_MANIFEST";
+ set_ps_display(cmdtag);
+ PreventInTransactionBlock(true, cmdtag);
+ UploadManifest();
+ EndReplicationCommand(cmdtag);
+ break;
+
default:
elog(ERROR, "unrecognized replication command node tag: %u",
cmd_node->type);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0e0ac22bdd..706140eb9f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -32,6 +32,7 @@
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
#include "postmaster/postmaster.h"
+#include "postmaster/walsummarizer.h"
#include "replication/logicallauncher.h"
#include "replication/origin.h"
#include "replication/slot.h"
@@ -140,6 +141,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, ReplicationOriginShmemSize());
size = add_size(size, WalSndShmemSize());
size = add_size(size, WalRcvShmemSize());
+ size = add_size(size, WalSummarizerShmemSize());
size = add_size(size, PgArchShmemSize());
size = add_size(size, ApplyLauncherShmemSize());
size = add_size(size, BTreeShmemSize());
@@ -337,6 +339,7 @@ CreateOrAttachShmemStructs(void)
ReplicationOriginShmemInit();
WalSndShmemInit();
WalRcvShmemInit();
+ WalSummarizerShmemInit();
PgArchShmemInit();
ApplyLauncherShmemInit();
diff --git a/src/bin/Makefile b/src/bin/Makefile
index 373077bf52..aa2210925e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
pg_archivecleanup \
pg_basebackup \
pg_checksums \
+ pg_combinebackup \
pg_config \
pg_controldata \
pg_ctl \
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 67cb50630c..4cb6fd59bb 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -5,6 +5,7 @@ subdir('pg_amcheck')
subdir('pg_archivecleanup')
subdir('pg_basebackup')
subdir('pg_checksums')
+subdir('pg_combinebackup')
subdir('pg_config')
subdir('pg_controldata')
subdir('pg_ctl')
diff --git a/src/bin/pg_basebackup/bbstreamer_file.c b/src/bin/pg_basebackup/bbstreamer_file.c
index 45f32974ff..6b78ee283d 100644
--- a/src/bin/pg_basebackup/bbstreamer_file.c
+++ b/src/bin/pg_basebackup/bbstreamer_file.c
@@ -296,6 +296,7 @@ should_allow_existing_directory(const char *pathname)
if (strcmp(filename, "pg_wal") == 0 ||
strcmp(filename, "pg_xlog") == 0 ||
strcmp(filename, "archive_status") == 0 ||
+ strcmp(filename, "summaries") == 0 ||
strcmp(filename, "pg_tblspc") == 0)
return true;
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index f32684a8f2..5795b91261 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -101,6 +101,11 @@ typedef void (*WriteDataCallback) (size_t nbytes, char *buf,
*/
#define MINIMUM_VERSION_FOR_TERMINATED_TARFILE 150000
+/*
+ * pg_wal/summaries exists beginning with version 17.
+ */
+#define MINIMUM_VERSION_FOR_WAL_SUMMARIES 170000
+
/*
* Different ways to include WAL
*/
@@ -217,7 +222,8 @@ static void ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
void *callback_data);
static void BaseBackup(char *compression_algorithm, char *compression_detail,
CompressionLocation compressloc,
- pg_compress_specification *client_compress);
+ pg_compress_specification *client_compress,
+ char *incremental_manifest);
static bool reached_end_position(XLogRecPtr segendpos, uint32 timeline,
bool segment_finished);
@@ -390,6 +396,8 @@ usage(void)
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -i, --incremental=OLDMANIFEST\n"));
+ printf(_(" take incremental backup\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -688,6 +696,23 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier,
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
pg_fatal("could not create directory \"%s\": %m", statusdir);
+
+ /*
+ * For newer server versions, likewise create pg_wal/summaries
+ */
+ if (PQserverVersion(conn) >= MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ {
+ char summarydir[MAXPGPATH];
+
+ snprintf(summarydir, sizeof(summarydir), "%s/%s/summaries",
+ basedir,
+ PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
+ "pg_xlog" : "pg_wal");
+
+ if (pg_mkdir_p(summarydir, pg_dir_create_mode) != 0 &&
+ errno != EEXIST)
+ pg_fatal("could not create directory \"%s\": %m", summarydir);
+ }
}
/*
@@ -1728,7 +1753,9 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
static void
BaseBackup(char *compression_algorithm, char *compression_detail,
- CompressionLocation compressloc, pg_compress_specification *client_compress)
+ CompressionLocation compressloc,
+ pg_compress_specification *client_compress,
+ char *incremental_manifest)
{
PGresult *res;
char *sysidentifier;
@@ -1794,7 +1821,76 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
exit(1);
/*
- * Start the actual backup
+ * If the user wants an incremental backup, we must upload the manifest
+ * for the previous backup upon which it is to be based.
+ */
+ if (incremental_manifest != NULL)
+ {
+ int fd;
+ char mbuf[65536];
+ int nbytes;
+
+ /* Reject if server is too old. */
+ if (serverVersion < MINIMUM_VERSION_FOR_WAL_SUMMARIES)
+ pg_fatal("server does not support incremental backup");
+
+ /* Open the file. */
+ fd = open(incremental_manifest, O_RDONLY | PG_BINARY, 0);
+ if (fd < 0)
+ pg_fatal("could not open file \"%s\": %m", incremental_manifest);
+
+ /* Tell the server what we want to do. */
+ if (PQsendQuery(conn, "UPLOAD_MANIFEST") == 0)
+ pg_fatal("could not send replication command \"%s\": %s",
+ "UPLOAD_MANIFEST", PQerrorMessage(conn));
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) != PGRES_COPY_IN)
+ {
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+ }
+
+ /* Loop, reading from the file and sending the data to the server. */
+ while ((nbytes = read(fd, mbuf, sizeof mbuf)) > 0)
+ {
+ if (PQputCopyData(conn, mbuf, nbytes) < 0)
+ pg_fatal("could not send COPY data: %s",
+ PQerrorMessage(conn));
+ }
+
+ /* Bail out if we exited the loop due to an error. */
+ if (nbytes < 0)
+ pg_fatal("could not read file \"%s\": %m", incremental_manifest);
+
+ /* Close the manifest file; we are done reading it. */
+ close(fd);
+
+ /* End the COPY operation. */
+ if (PQputCopyEnd(conn, NULL) < 0)
+ pg_fatal("could not send end-of-COPY: %s",
+ PQerrorMessage(conn));
+
+ /* See whether the server is happy with what we sent. */
+ res = PQgetResult(conn);
+ if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+ pg_fatal("could not upload manifest: %s",
+ PQerrorMessage(conn));
+ else if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ pg_fatal("could not upload manifest: unexpected status %s",
+ PQresStatus(PQresultStatus(res)));
+
+ /* Consume ReadyForQuery message from server. */
+ res = PQgetResult(conn);
+ if (res != NULL)
+ pg_fatal("unexpected extra result while sending manifest");
+
+ /* Add INCREMENTAL option to BASE_BACKUP command. */
+ AppendPlainCommandOption(&buf, use_new_option_syntax, "INCREMENTAL");
+ }
+
+ /*
+ * Continue building up the options list for the BASE_BACKUP command.
*/
AppendStringCommandOption(&buf, use_new_option_syntax, "LABEL", label);
if (estimatesize)
@@ -1901,6 +1997,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
else
basebkp = psprintf("BASE_BACKUP %s", buf.data);
+ /* OK, try to start the backup. */
if (PQsendQuery(conn, basebkp) == 0)
pg_fatal("could not send replication command \"%s\": %s",
"BASE_BACKUP", PQerrorMessage(conn));
@@ -2256,6 +2353,7 @@ main(int argc, char **argv)
{"version", no_argument, NULL, 'V'},
{"pgdata", required_argument, NULL, 'D'},
{"format", required_argument, NULL, 'F'},
+ {"incremental", required_argument, NULL, 'i'},
{"checkpoint", required_argument, NULL, 'c'},
{"create-slot", no_argument, NULL, 'C'},
{"max-rate", required_argument, NULL, 'r'},
@@ -2293,6 +2391,7 @@ main(int argc, char **argv)
int option_index;
char *compression_algorithm = "none";
char *compression_detail = NULL;
+ char *incremental_manifest = NULL;
CompressionLocation compressloc = COMPRESS_LOCATION_UNSPECIFIED;
pg_compress_specification client_compress;
@@ -2317,7 +2416,7 @@ main(int argc, char **argv)
atexit(cleanup_directories_atexit);
- while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
+ while ((c = getopt_long(argc, argv, "c:Cd:D:F:h:i:l:nNp:Pr:Rs:S:t:T:U:vwWX:zZ:",
long_options, &option_index)) != -1)
{
switch (c)
@@ -2352,6 +2451,9 @@ main(int argc, char **argv)
case 'h':
dbhost = pg_strdup(optarg);
break;
+ case 'i':
+ incremental_manifest = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2765,7 +2867,7 @@ main(int argc, char **argv)
}
BaseBackup(compression_algorithm, compression_detail, compressloc,
- &client_compress);
+ &client_compress, incremental_manifest);
success = true;
return 0;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..bf765291e7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -223,10 +223,10 @@ SKIP:
"check backup dir permissions");
}
-# Only archive_status directory should be copied in pg_wal/.
+# Only archive_status and summaries directories should be copied in pg_wal/.
is_deeply(
[ sort(slurp_dir("$tempdir/backup/pg_wal/")) ],
- [ sort qw(. .. archive_status) ],
+ [ sort qw(. .. archive_status summaries) ],
'no WAL files copied');
# Contents of these directories should not be copied.
diff --git a/src/bin/pg_combinebackup/.gitignore b/src/bin/pg_combinebackup/.gitignore
new file mode 100644
index 0000000000..d7e617438c
--- /dev/null
+++ b/src/bin/pg_combinebackup/.gitignore
@@ -0,0 +1 @@
+pg_combinebackup
diff --git a/src/bin/pg_combinebackup/Makefile b/src/bin/pg_combinebackup/Makefile
new file mode 100644
index 0000000000..78ba05e624
--- /dev/null
+++ b/src/bin/pg_combinebackup/Makefile
@@ -0,0 +1,52 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_combinebackup
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_combinebackup/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_combinebackup - combine incremental backups"
+PGAPPICON=win32
+
+subdir = src/bin/pg_combinebackup
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_combinebackup.o \
+ backup_label.o \
+ copy_file.o \
+ load_manifest.o \
+ reconstruct.o \
+ write_manifest.o
+
+all: pg_combinebackup
+
+pg_combinebackup: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_combinebackup$(X) '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_combinebackup$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_combinebackup$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_combinebackup/backup_label.c b/src/bin/pg_combinebackup/backup_label.c
new file mode 100644
index 0000000000..922e00854d
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.c
@@ -0,0 +1,283 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "access/xlogdefs.h"
+#include "backup_label.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "write_manifest.h"
+
+static int get_eol_offset(StringInfo buf);
+static bool line_starts_with(char *s, char *e, char *match, char **sout);
+static bool parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c);
+static bool parse_tli(char *s, char *e, TimeLineID *tli);
+
+/*
+ * Parse a backup label file, starting at buf->cursor.
+ *
+ * We expect to find a START WAL LOCATION line, followed by an LSN, followed
+ * by a space; the resulting LSN is stored into *start_lsn.
+ *
+ * We expect to find a START TIMELINE line, followed by a TLI, followed by
+ * a newline; the resulting TLI is stored into *start_tli.
+ *
+ * We expect to find either both INCREMENTAL FROM LSN and INCREMENTAL FROM TLI
+ * or neither. If these are found, they should be followed by an LSN or TLI
+ * respectively and then by a newline, and the values will be stored into
+ * *previous_lsn and *previous_tli, respectively.
+ *
+ * Other lines in the provided backup_label data are ignored. filename is used
+ * for error reporting; errors are fatal.
+ */
+void
+parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli, XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli, XLogRecPtr *previous_lsn)
+{
+ int found = 0;
+
+ *start_tli = 0;
+ *start_lsn = InvalidXLogRecPtr;
+ *previous_tli = 0;
+ *previous_lsn = InvalidXLogRecPtr;
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+ char *c;
+
+ if (line_starts_with(s, e, "START WAL LOCATION: ", &s))
+ {
+ if (!parse_lsn(s, e, start_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "START WAL LOCATION");
+ if (c >= e || *c != ' ')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "START WAL LOCATION");
+ found |= 1;
+ }
+ else if (line_starts_with(s, e, "START TIMELINE: ", &s))
+ {
+ if (!parse_tli(s, e, start_tli))
+ pg_fatal("%s: could not parse TLI for %s",
+ filename, "START TIMELINE");
+ if (*start_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 2;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM LSN: ", &s))
+ {
+ if (!parse_lsn(s, e, previous_lsn, &c))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM LSN");
+ if (c >= e || *c != '\n')
+ pg_fatal("%s: improper terminator for %s",
+ filename, "INCREMENTAL FROM LSN");
+ found |= 4;
+ }
+ else if (line_starts_with(s, e, "INCREMENTAL FROM TLI: ", &s))
+ {
+ if (!parse_tli(s, e, previous_tli))
+ pg_fatal("%s: could not parse %s",
+ filename, "INCREMENTAL FROM TLI");
+ if (*previous_tli == 0)
+ pg_fatal("%s: invalid TLI", filename);
+ found |= 8;
+ }
+
+ buf->cursor = eo;
+ }
+
+ if ((found & 1) == 0)
+ pg_fatal("%s: could not find %s", filename, "START WAL LOCATION");
+ if ((found & 2) == 0)
+ pg_fatal("%s: could not find %s", filename, "START TIMELINE");
+ if ((found & 4) != 0 && (found & 8) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM LSN", "INCREMENTAL FROM TLI");
+ if ((found & 8) != 0 && (found & 4) == 0)
+ pg_fatal("%s: %s requires %s", filename,
+ "INCREMENTAL FROM TLI", "INCREMENTAL FROM LSN");
+}
+
+/*
+ * Write a backup label file to the output directory.
+ *
+ * This will be identical to the provided backup_label file, except that the
+ * INCREMENTAL FROM LSN and INCREMENTAL FROM TLI lines will be omitted.
+ *
+ * The new file will be checksummed using the specified algorithm. If
+ * mwriter != NULL, it will be added to the manifest.
+ */
+void
+write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type, manifest_writer *mwriter)
+{
+ char output_filename[MAXPGPATH];
+ int output_fd;
+ pg_checksum_context checksum_ctx;
+ uint8 checksum_payload[PG_CHECKSUM_MAX_LENGTH];
+ int checksum_length;
+
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ snprintf(output_filename, MAXPGPATH, "%s/backup_label", output_directory);
+
+ if ((output_fd = open(output_filename,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ while (buf->cursor < buf->len)
+ {
+ char *s = &buf->data[buf->cursor];
+ int eo = get_eol_offset(buf);
+ char *e = &buf->data[eo];
+
+ if (!line_starts_with(s, e, "INCREMENTAL FROM LSN: ", NULL) &&
+ !line_starts_with(s, e, "INCREMENTAL FROM TLI: ", NULL))
+ {
+ ssize_t wb;
+
+ wb = write(output_fd, s, e - s);
+ if (wb != e - s)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, (int) (e - s));
+ }
+ if (pg_checksum_update(&checksum_ctx, (uint8 *) s, e - s) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ buf->cursor = eo;
+ }
+
+ if (close(output_fd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+
+ checksum_length = pg_checksum_final(&checksum_ctx, checksum_payload);
+
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * We could track the length ourselves, but must stat() to get the
+ * mtime.
+ */
+ if (stat(output_filename, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", output_filename);
+ add_file_to_manifest(mwriter, "backup_label", sb.st_size,
+ sb.st_mtime, checksum_type,
+ checksum_length, checksum_payload);
+ }
+}
+
+/*
+ * Return the offset at which the next line in the buffer starts, or, if
+ * there is none, the offset at which the buffer ends.
+ *
+ * The search begins at buf->cursor.
+ */
+static int
+get_eol_offset(StringInfo buf)
+{
+ int eo = buf->cursor;
+
+ while (eo < buf->len)
+ {
+ if (buf->data[eo] == '\n')
+ return eo + 1;
+ ++eo;
+ }
+
+ return eo;
+}
+
+/*
+ * Test whether the line that runs from s to e (inclusive of *s, but not
+ * inclusive of *e) starts with the match string provided, and return true
+ * or false according to whether or not this is the case.
+ *
+ * If the function returns true and sout != NULL, stores a pointer to the
+ * byte following the match into *sout.
+ */
+static bool
+line_starts_with(char *s, char *e, char *match, char **sout)
+{
+ while (s < e && *match != '\0' && *s == *match)
+ ++s, ++match;
+
+ if (*match == '\0' && sout != NULL)
+ *sout = s;
+
+ return (*match == '\0');
+}
+
+/*
+ * Parse an LSN starting at s and stopping at or before e. The return value
+ * is true on success and otherwise false. On success, stores the result into
+ * *lsn and sets *c to the first character that is not part of the LSN.
+ */
+static bool
+parse_lsn(char *s, char *e, XLogRecPtr *lsn, char **c)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+ unsigned hi;
+ unsigned lo;
+
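+ /*
+ * Temporarily NUL-terminate the line at 'e' so that sscanf() cannot
+ * consume bytes beyond the end of the line; the saved byte is restored
+ * below.
+ */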
+ *e = '\0';
+ success = (sscanf(s, "%X/%X%n", &hi, &lo, &nchars) == 2);
+ *e = save;
+
+ if (success)
+ {
+ *lsn = ((XLogRecPtr) hi) << 32 | (XLogRecPtr) lo;
+ *c = s + nchars;
+ }
+
+ return success;
+}
+
+/*
+ * Parse a TLI starting at s and stopping at or before e. The return value is
+ * true on success and otherwise false. On success, stores the result into
+ * *tli. If the first character that is not part of the TLI is anything other
+ * than a newline, that is deemed a failure.
+ */
+static bool
+parse_tli(char *s, char *e, TimeLineID *tli)
+{
+ char save = *e;
+ int nchars;
+ bool success;
+
+ *e = '\0';
+ success = (sscanf(s, "%u%n", tli, &nchars) == 1);
+ *e = save;
+
+ if (success && s[nchars] != '\n')
+ success = false;
+
+ return success;
+}
diff --git a/src/bin/pg_combinebackup/backup_label.h b/src/bin/pg_combinebackup/backup_label.h
new file mode 100644
index 0000000000..3af7ea274c
--- /dev/null
+++ b/src/bin/pg_combinebackup/backup_label.h
@@ -0,0 +1,30 @@
+/*-------------------------------------------------------------------------
+ *
+ * Read and manipulate backup label files
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/backup_label.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BACKUP_LABEL_H
+#define BACKUP_LABEL_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+#include "lib/stringinfo.h"
+
+struct manifest_writer;
+
+extern void parse_backup_label(char *filename, StringInfo buf,
+ TimeLineID *start_tli,
+ XLogRecPtr *start_lsn,
+ TimeLineID *previous_tli,
+ XLogRecPtr *previous_lsn);
+extern void write_backup_label(char *output_directory, StringInfo buf,
+ pg_checksum_type checksum_type,
+ struct manifest_writer *mwriter);
+
+#endif /* BACKUP_LABEL_H */
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
new file mode 100644
index 0000000000..40a55e3087
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -0,0 +1,169 @@
+/*-------------------------------------------------------------------------
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#ifdef HAVE_COPYFILE_H
+#include <copyfile.h>
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "copy_file.h"
+
+static void copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx);
+
+#ifdef WIN32
+static void copy_file_copyfile(const char *src, const char *dst);
+#endif
+
+/*
+ * Copy a regular file, optionally computing a checksum, and emitting
+ * appropriate debug messages. But if we're in dry-run mode, then just emit
+ * the messages and don't copy anything.
+ */
+void
+copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run)
+{
+ /*
+ * In dry-run mode, we don't actually copy anything, nor do we read any
+ * data from the source file, but we do verify that we can open it.
+ */
+ if (dry_run)
+ {
+ int fd;
+
+ if ((fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open \"%s\": %m", src);
+ if (close(fd) < 0)
+ pg_fatal("could not close \"%s\": %m", src);
+ }
+
+ /*
+ * If we don't need to compute a checksum, then we can use any special
+ * operating system primitives that we know about to copy the file; this
+ * may be quicker than a naive block copy.
+ */
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ {
+ char *strategy_name = NULL;
+ void (*strategy_implementation) (const char *, const char *) = NULL;
+
+#ifdef WIN32
+ strategy_name = "CopyFile";
+ strategy_implementation = copy_file_copyfile;
+#endif
+
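+ /*
+ * If no platform-specific primitive was selected above, we simply fall
+ * through to the block-by-block copy below.
+ */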
+ if (strategy_name != NULL)
+ {
+ if (dry_run)
+ pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ else
+ {
+ pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
+ src, dst, strategy_name);
+ (*strategy_implementation) (src, dst);
+ }
+ return;
+ }
+ }
+
+ /*
+ * Fall back to the simple approach of reading and writing all the blocks,
+ * feeding them into the checksum context as we go.
+ */
+ if (dry_run)
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("would copy \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ }
+ else
+ {
+ if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
+ pg_log_debug("copying \"%s\" to \"%s\"",
+ src, dst);
+ else
+ pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
+ src, dst, pg_checksum_type_name(checksum_ctx->type));
+ copy_file_blocks(src, dst, checksum_ctx);
+ }
+}
+
+/*
+ * Copy a file block by block, and optionally compute a checksum as we go.
+ */
+static void
+copy_file_blocks(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx)
+{
+ int src_fd;
+ int dest_fd;
+ uint8 *buffer;
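+ /* 50 blocks per read: 400kB with the default 8kB BLCKSZ */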
+ const int buffer_size = 50 * BLCKSZ;
+ ssize_t rb;
+ unsigned offset = 0;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", dst);
+
+ buffer = pg_malloc(buffer_size);
+
+ while ((rb = read(src_fd, buffer, buffer_size)) > 0)
+ {
+ ssize_t wb;
+
+ if ((wb = write(dest_fd, buffer, rb)) != rb)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", dst);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes at offset %u",
+ dst, (int) wb, (int) rb, offset);
+ }
+
+ if (pg_checksum_update(checksum_ctx, buffer, rb) < 0)
+ pg_fatal("could not update checksum of file \"%s\"", dst);
+
+ offset += rb;
+ }
+
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", dst);
+
+ pg_free(buffer);
+ close(src_fd);
+ close(dest_fd);
+}
+
+#ifdef WIN32
+static void
+copy_file_copyfile(const char *src, const char *dst)
+{
+ if (CopyFile(src, dst, true) == 0)
+ {
+ _dosmaperr(GetLastError());
+ pg_fatal("could not copy \"%s\" to \"%s\": %m", src, dst);
+ }
+}
+#endif /* WIN32 */
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
new file mode 100644
index 0000000000..031030bacb
--- /dev/null
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -0,0 +1,19 @@
+/*-------------------------------------------------------------------------
+ * Copy entire files.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/copy_file.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef COPY_FILE_H
+#define COPY_FILE_H
+
+#include "common/checksum_helper.h"
+
+extern void copy_file(const char *src, const char *dst,
+ pg_checksum_context *checksum_ctx, bool dry_run);
+
+#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
new file mode 100644
index 0000000000..ad32323c9c
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -0,0 +1,245 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "common/hashfn.h"
+#include "common/logging.h"
+#include "common/parse_manifest.h"
+#include "load_manifest.h"
+
+/*
+ * For efficiency, we'd like our hash table containing information about the
+ * manifest to start out with approximately the correct number of entries.
+ * There's no way to know the exact number of entries without reading the whole
+ * file, but we can get an estimate by dividing the file size by the estimated
+ * number of bytes per line.
+ *
+ * This could be off by about a factor of two in either direction, because the
+ * checksum algorithm has a big impact on the line lengths; e.g. a SHA512
+ * checksum is 128 hex bytes, whereas a CRC-32C value is only 8, and there
+ * might be no checksum at all.
+ */
+#define ESTIMATED_BYTES_PER_MANIFEST_LINE 100
+
+/*
+ * Define a hash table which we can use to store information about the files
+ * mentioned in the backup manifest.
+ */
+static uint32 hash_string_pointer(char *s);
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_KEY pathname
+#define SH_HASH_KEY(tb, key) hash_string_pointer(key)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static void combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+static void combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn,
+ XLogRecPtr end_lsn);
+static void report_manifest_error(JsonManifestParseContext *context,
+ const char *fmt,...)
+ pg_attribute_printf(2, 3) pg_attribute_noreturn();
+
+/*
+ * Load backup_manifest files from an array of backups and produce an array
+ * of manifest_data objects.
+ *
+ * NB: Since load_backup_manifest() can return NULL, the resulting array could
+ * contain NULL entries.
+ */
+manifest_data **
+load_backup_manifests(int n_backups, char **backup_directories)
+{
+ manifest_data **result;
+ int i;
+
+ result = pg_malloc(sizeof(manifest_data *) * n_backups);
+ for (i = 0; i < n_backups; ++i)
+ result[i] = load_backup_manifest(backup_directories[i]);
+
+ return result;
+}
+
+/*
+ * Parse the backup_manifest file in the named backup directory. Construct a
+ * hash table with information about all the files it mentions, and a linked
+ * list of all the WAL ranges it mentions.
+ *
+ * If the backup_manifest file simply doesn't exist, logs a warning and returns
+ * NULL. Any other error, or any error parsing the contents of the file, is
+ * fatal.
+ */
+manifest_data *
+load_backup_manifest(char *backup_directory)
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ struct stat statbuf;
+ off_t estimate;
+ uint32 initial_size;
+ manifest_files_hash *ht;
+ char *buffer;
+ int rc;
+ JsonManifestParseContext context;
+ manifest_data *result;
+
+ /* Open the manifest file. */
+ snprintf(pathname, MAXPGPATH, "%s/backup_manifest", backup_directory);
+ if ((fd = open(pathname, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (errno == ENOENT)
+ {
+ pg_log_warning("\"%s\" does not exist", pathname);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", pathname);
+ }
+
+ /* Figure out how big the manifest is. */
+ if (fstat(fd, &statbuf) != 0)
+ pg_fatal("could not stat file \"%s\": %m", pathname);
+
+ /* Guess how large to make the hash table based on the manifest size. */
+ estimate = statbuf.st_size / ESTIMATED_BYTES_PER_MANIFEST_LINE;
+ initial_size = Min(PG_UINT32_MAX, Max(estimate, 256));
+
+ /* Create the hash table. */
+ ht = manifest_files_create(initial_size, NULL);
+
+ /*
+ * Slurp in the whole file.
+ *
+ * This is not ideal, but there's currently no way to get pg_parse_json()
+ * to perform incremental parsing.
+ */
+ buffer = pg_malloc(statbuf.st_size);
+ rc = read(fd, buffer, statbuf.st_size);
+ if (rc != statbuf.st_size)
+ {
+ if (rc < 0)
+ pg_fatal("could not read file \"%s\": %m", pathname);
+ else
+ pg_fatal("could not read file \"%s\": read %d of %lld",
+ pathname, rc, (long long int) statbuf.st_size);
+ }
+
+ /* Close the manifest file. */
+ close(fd);
+
+ /* Parse the manifest. */
+ result = pg_malloc0(sizeof(manifest_data));
+ result->files = ht;
+ context.private_data = result;
+ context.per_file_cb = combinebackup_per_file_cb;
+ context.per_wal_range_cb = combinebackup_per_wal_range_cb;
+ context.error_cb = report_manifest_error;
+ json_parse_manifest(&context, buffer, statbuf.st_size);
+
+ /* All done. */
+ pfree(buffer);
+ return result;
+}
+
+/*
+ * Report an error while parsing the manifest.
+ *
+ * We consider all such errors to be fatal errors. The manifest parser
+ * expects this function not to return.
+ */
+static void
+report_manifest_error(JsonManifestParseContext *context, const char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, gettext(fmt), ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Record details extracted from the backup manifest for one file.
+ */
+static void
+combinebackup_per_file_cb(JsonManifestParseContext *context,
+ char *pathname, size_t size,
+ pg_checksum_type checksum_type,
+ int checksum_length, uint8 *checksum_payload)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_file *m;
+ bool found;
+
+ /* Make a new entry in the hash table for this file. */
+ m = manifest_files_insert(manifest->files, pathname, &found);
+ if (found)
+ pg_fatal("duplicate path name in backup manifest: \"%s\"", pathname);
+
+ /* Initialize the entry. */
+ m->size = size;
+ m->checksum_type = checksum_type;
+ m->checksum_length = checksum_length;
+ m->checksum_payload = checksum_payload;
+}
+
+/*
+ * Record details extracted from the backup manifest for one WAL range.
+ */
+static void
+combinebackup_per_wal_range_cb(JsonManifestParseContext *context,
+ TimeLineID tli,
+ XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+ manifest_data *manifest = context->private_data;
+ manifest_wal_range *range;
+
+ /* Allocate and initialize a struct describing this WAL range. */
+ range = palloc(sizeof(manifest_wal_range));
+ range->tli = tli;
+ range->start_lsn = start_lsn;
+ range->end_lsn = end_lsn;
+ range->prev = manifest->last_wal_range;
+ range->next = NULL;
+
+ /* Add it to the end of the list. */
+ if (manifest->first_wal_range == NULL)
+ manifest->first_wal_range = range;
+ else
+ manifest->last_wal_range->next = range;
+ manifest->last_wal_range = range;
+}
+
+/*
+ * Helper function for manifest_files hash table.
+ */
+static uint32
+hash_string_pointer(char *s)
+{
+ unsigned char *ss = (unsigned char *) s;
+
+ return hash_bytes(ss, strlen(s));
+}
diff --git a/src/bin/pg_combinebackup/load_manifest.h b/src/bin/pg_combinebackup/load_manifest.h
new file mode 100644
index 0000000000..2bfeeff156
--- /dev/null
+++ b/src/bin/pg_combinebackup/load_manifest.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * Load data from a backup manifest into memory.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/load_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LOAD_MANIFEST_H
+#define LOAD_MANIFEST_H
+
+#include "access/xlogdefs.h"
+#include "common/checksum_helper.h"
+
+/*
+ * Each file described by the manifest file is parsed to produce an object
+ * like this.
+ */
+typedef struct manifest_file
+{
+ uint32 status; /* hash status */
+ char *pathname;
+ size_t size;
+ pg_checksum_type checksum_type;
+ int checksum_length;
+ uint8 *checksum_payload;
+} manifest_file;
+
+#define SH_PREFIX manifest_files
+#define SH_ELEMENT_TYPE manifest_file
+#define SH_KEY_TYPE char *
+#define SH_SCOPE extern
+#define SH_RAW_ALLOCATOR pg_malloc0
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/*
+ * Each WAL range described by the manifest file is parsed to produce an
+ * object like this.
+ */
+typedef struct manifest_wal_range
+{
+ TimeLineID tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr end_lsn;
+ struct manifest_wal_range *next;
+ struct manifest_wal_range *prev;
+} manifest_wal_range;
+
+/*
+ * All the data parsed from a backup_manifest file.
+ */
+typedef struct manifest_data
+{
+ manifest_files_hash *files;
+ manifest_wal_range *first_wal_range;
+ manifest_wal_range *last_wal_range;
+} manifest_data;
+
+extern manifest_data *load_backup_manifest(char *backup_directory);
+extern manifest_data **load_backup_manifests(int n_backups,
+ char **backup_directories);
+
+#endif /* LOAD_MANIFEST_H */
diff --git a/src/bin/pg_combinebackup/meson.build b/src/bin/pg_combinebackup/meson.build
new file mode 100644
index 0000000000..e402d6f50e
--- /dev/null
+++ b/src/bin/pg_combinebackup/meson.build
@@ -0,0 +1,38 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_combinebackup_sources = files(
+ 'pg_combinebackup.c',
+ 'backup_label.c',
+ 'copy_file.c',
+ 'load_manifest.c',
+ 'reconstruct.c',
+ 'write_manifest.c',
+)
+
+if host_system == 'windows'
+ pg_combinebackup_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_combinebackup',
+ '--FILEDESC', 'pg_combinebackup - combine incremental backups',])
+endif
+
+pg_combinebackup = executable('pg_combinebackup',
+ pg_combinebackup_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_combinebackup
+
+tests += {
+ 'name': 'pg_combinebackup',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ 't/002_compare_backups.pl',
+ 't/003_timeline.pl',
+ 't/004_manifest.pl',
+ 't/005_integrity.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_combinebackup/nls.mk b/src/bin/pg_combinebackup/nls.mk
new file mode 100644
index 0000000000..c8e59d1d00
--- /dev/null
+++ b/src/bin/pg_combinebackup/nls.mk
@@ -0,0 +1,11 @@
+# src/bin/pg_combinebackup/nls.mk
+CATALOG_NAME = pg_combinebackup
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ backup_label.c \
+ copy_file.c \
+ load_manifest.c \
+ pg_combinebackup.c \
+ reconstruct.c \
+ write_manifest.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
new file mode 100644
index 0000000000..85d3f4e5de
--- /dev/null
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -0,0 +1,1284 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_combinebackup.c
+ * Combine incremental backups with prior backups.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/pg_combinebackup.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+
+#include "backup_label.h"
+#include "common/blkreftable.h"
+#include "common/checksum_helper.h"
+#include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/logging.h"
+#include "copy_file.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "getopt_long.h"
+#include "reconstruct.h"
+#include "write_manifest.h"
+
+/* Incremental file naming convention. */
+#define INCREMENTAL_PREFIX "INCREMENTAL."
+#define INCREMENTAL_PREFIX_LENGTH (sizeof(INCREMENTAL_PREFIX) - 1)
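+
+ /*
+ * For example, an incremental backup might contain a file named
+ * "INCREMENTAL.16384" where a full backup would contain "16384"; the
+ * prefix is stripped when the reconstructed file is written out.
+ */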
+
+/*
+ * Tracking for directories that need to be removed, or have their contents
+ * removed, if the operation fails.
+ */
+typedef struct cb_cleanup_dir
+{
+ char *target_path;
+ bool rmtopdir;
+ struct cb_cleanup_dir *next;
+} cb_cleanup_dir;
+
+/*
+ * Stores a tablespace mapping provided using -T, --tablespace-mapping.
+ */
+typedef struct cb_tablespace_mapping
+{
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace_mapping *next;
+} cb_tablespace_mapping;
+
+/*
+ * Stores data parsed from all command-line options.
+ */
+typedef struct cb_options
+{
+ bool debug;
+ char *output;
+ bool dry_run;
+ bool no_sync;
+ cb_tablespace_mapping *tsmappings;
+ pg_checksum_type manifest_checksums;
+ bool no_manifest;
+ DataDirSyncMethod sync_method;
+} cb_options;
+
+/*
+ * Data about a tablespace.
+ *
+ * Every normal tablespace needs a tablespace mapping, but in-place tablespaces
+ * don't, so the list of tablespaces can contain more entries than the list of
+ * tablespace mappings.
+ */
+typedef struct cb_tablespace
+{
+ Oid oid;
+ bool in_place;
+ char old_dir[MAXPGPATH];
+ char new_dir[MAXPGPATH];
+ struct cb_tablespace *next;
+} cb_tablespace;
+
+/* Directories to be removed if we exit uncleanly. */
+static cb_cleanup_dir *cleanup_dir_list = NULL;
+
+static void add_tablespace_mapping(cb_options *opt, char *arg);
+static StringInfo check_backup_label_files(int n_backups, char **backup_dirs);
+static void check_control_files(int n_backups, char **backup_dirs);
+static void check_input_dir_permissions(char *dir);
+static void cleanup_directories_atexit(void);
+static void create_output_directory(char *dirname, cb_options *opt);
+static void help(const char *progname);
+static bool parse_oid(char *s, Oid *result);
+static void process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt);
+static int read_pg_version_file(char *directory);
+static void remember_to_cleanup_directory(char *target_path, bool rmtopdir);
+static void reset_directory_cleanup_list(void);
+static cb_tablespace *scan_for_existing_tablespaces(char *pathname,
+ cb_options *opt);
+static void slurp_file(int fd, char *filename, StringInfo buf, int maxlen);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"debug", no_argument, NULL, 'd'},
+ {"dry-run", no_argument, NULL, 'n'},
+ {"no-sync", no_argument, NULL, 'N'},
+ {"output", required_argument, NULL, 'o'},
+ {"tablespace-mapping", no_argument, NULL, 'T'},
+ {"manifest-checksums", required_argument, NULL, 1},
+ {"no-manifest", no_argument, NULL, 2},
+ {"sync-method", required_argument, NULL, 3},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ char *last_input_dir;
+ int optindex;
+ int c;
+ int n_backups;
+ int n_prior_backups;
+ int version;
+ char **prior_backup_dirs;
+ cb_options opt;
+ cb_tablespace *tablespaces;
+ cb_tablespace *ts;
+ StringInfo last_backup_label;
+ manifest_data **manifests;
+ manifest_writer *mwriter;
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ memset(&opt, 0, sizeof(opt));
+ opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
+ opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "dnNPo:T:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'd':
+ opt.debug = true;
+ pg_logging_increase_verbosity();
+ break;
+ case 'n':
+ opt.dry_run = true;
+ break;
+ case 'N':
+ opt.no_sync = true;
+ break;
+ case 'o':
+ opt.output = optarg;
+ break;
+ case 'T':
+ add_tablespace_mapping(&opt, optarg);
+ break;
+ case 1:
+ if (!pg_checksum_parse_type(optarg,
+ &opt.manifest_checksums))
+ pg_fatal("unrecognized checksum algorithm: \"%s\"",
+ optarg);
+ break;
+ case 2:
+ opt.no_manifest = true;
+ break;
+ case 3:
+ if (!parse_sync_method(optarg, &opt.sync_method))
+ exit(1);
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input directories specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ if (opt.output == NULL)
+ pg_fatal("no output directory specified");
+
+ /* If no manifest is needed, no checksums are needed, either. */
+ if (opt.no_manifest)
+ opt.manifest_checksums = CHECKSUM_TYPE_NONE;
+
+ /* Read the server version from the final backup. */
+ version = read_pg_version_file(argv[argc - 1]);
+
+ /* Sanity-check control files. */
+ n_backups = argc - optind;
+ check_control_files(n_backups, argv + optind);
+
+ /* Sanity-check backup_label files, and get the contents of the last one. */
+ last_backup_label = check_backup_label_files(n_backups, argv + optind);
+
+ /*
+ * We'll need the pathnames to the prior backups. By "prior" we mean all
+ * but the last one listed on the command line.
+ */
+ n_prior_backups = argc - optind - 1;
+ prior_backup_dirs = argv + optind;
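+
+ /*
+ * Note that prior_backup_dirs actually points at all n_backups entries;
+ * the final backup's directory is the last element, which is why it can
+ * also be passed to load_backup_manifests() below.
+ */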
+
+ /* Load backup manifests. */
+ manifests = load_backup_manifests(n_backups, prior_backup_dirs);
+
+ /* Figure out which tablespaces are going to be included in the output. */
+ last_input_dir = argv[argc - 1];
+ check_input_dir_permissions(last_input_dir);
+ tablespaces = scan_for_existing_tablespaces(last_input_dir, &opt);
+
+ /*
+ * Create output directories.
+ *
+ * We create one output directory for the main data directory plus one for
+ * each non-in-place tablespace. create_output_directory() will arrange
+ * for those directories to be cleaned up on failure. In-place tablespaces
+ * aren't handled at this stage because they're located beneath the main
+ * output directory, and thus the cleanup of that directory will get rid
+ * of them. Plus, the pg_tblspc directory that needs to contain them
+ * doesn't exist yet.
+ */
+ atexit(cleanup_directories_atexit);
+ create_output_directory(opt.output, &opt);
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ if (!ts->in_place)
+ create_output_directory(ts->new_dir, &opt);
+
+ /* If we need to write a backup_manifest, prepare to do so. */
+ if (!opt.dry_run && !opt.no_manifest)
+ {
+ mwriter = create_manifest_writer(opt.output);
+
+ /*
+ * Verify that we have a backup manifest for the final backup; else we
+ * won't have the WAL ranges for the resulting manifest.
+ */
+ if (manifests[n_prior_backups] == NULL)
+ pg_fatal("can't generate a manifest because no manifest is available for the final input backup");
+ }
+ else
+ mwriter = NULL;
+
+ /* Write backup label into output directory. */
+ if (opt.dry_run)
+ pg_log_debug("would generate \"%s/backup_label\"", opt.output);
+ else
+ {
+ pg_log_debug("generating \"%s/backup_label\"", opt.output);
+ last_backup_label->cursor = 0;
+ write_backup_label(opt.output, last_backup_label,
+ opt.manifest_checksums, mwriter);
+ }
+
+ /* Process everything that's not part of a user-defined tablespace. */
+ pg_log_debug("processing backup directory \"%s\"", last_input_dir);
+ process_directory_recursively(InvalidOid, last_input_dir, opt.output,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+
+ /* Process user-defined tablespaces. */
+ for (ts = tablespaces; ts != NULL; ts = ts->next)
+ {
+ pg_log_debug("processing tablespace directory \"%s\"", ts->old_dir);
+
+ /*
+ * If it's a normal tablespace, we need to set up a symbolic link from
+ * pg_tblspc/${OID} to the target directory; if it's an in-place
+ * tablespace, we need to create a directory at pg_tblspc/${OID}.
+ */
+ if (!ts->in_place)
+ {
+ char linkpath[MAXPGPATH];
+
+ snprintf(linkpath, MAXPGPATH, "%s/pg_tblspc/%u", opt.output,
+ ts->oid);
+
+ if (opt.dry_run)
+ pg_log_debug("would create symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ else
+ {
+ pg_log_debug("creating symbolic link from \"%s\" to \"%s\"",
+ linkpath, ts->new_dir);
+ if (symlink(ts->new_dir, linkpath) != 0)
+ pg_fatal("could not create symbolic link from \"%s\" to \"%s\": %m",
+ linkpath, ts->new_dir);
+ }
+ }
+ else
+ {
+ if (opt.dry_run)
+ pg_log_debug("would create directory \"%s\"", ts->new_dir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ts->new_dir);
+ if (pg_mkdir_p(ts->new_dir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m",
+ ts->new_dir);
+ }
+ }
+
+ /* OK, now handle the directory contents. */
+ process_directory_recursively(ts->oid, ts->old_dir, ts->new_dir,
+ NULL, n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, &opt);
+ }
+
+ /* Finalize the backup_manifest, if we're generating one. */
+ if (mwriter != NULL)
+ finalize_manifest(mwriter,
+ manifests[n_prior_backups]->first_wal_range);
+
+ /* fsync that output directory unless we've been told not to do so */
+ if (!opt.no_sync)
+ {
+ if (opt.dry_run)
+ pg_log_debug("would recursively fsync \"%s\"", opt.output);
+ else
+ {
+ pg_log_debug("recursively fsyncing \"%s\"", opt.output);
+ sync_pgdata(opt.output, version, opt.sync_method);
+ }
+ }
+
+ /* It's a success, so don't remove the output directories. */
+ reset_directory_cleanup_list();
+ exit(0);
+}
+
+/*
+ * Process the option argument for the -T, --tablespace-mapping switch.
+ */
+static void
+add_tablespace_mapping(cb_options *opt, char *arg)
+{
+ cb_tablespace_mapping *tsmap = pg_malloc0(sizeof(cb_tablespace_mapping));
+ char *dst;
+ char *dst_ptr;
+ char *arg_ptr;
+
+ /*
+ * Basically, we just want to copy everything before the equals sign to
+ * tsmap->old_dir and everything afterwards to tsmap->new_dir, but if
+ * there's more or less than one equals sign, that's an error, and if
+ * there's an equals sign preceded by a backslash, don't treat it as a
+ * field separator but instead copy a literal equals sign.
+ */
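+ /*
+ * For example, the argument "/srv/old\=dir=/srv/new" yields old_dir
+ * "/srv/old=dir" and new_dir "/srv/new".
+ */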
+ dst_ptr = dst = tsmap->old_dir;
+ for (arg_ptr = arg; *arg_ptr != '\0'; arg_ptr++)
+ {
+ if (dst_ptr - dst >= MAXPGPATH)
+ pg_fatal("directory name too long");
+
+ if (*arg_ptr == '\\' && *(arg_ptr + 1) == '=')
+ ; /* skip backslash escaping = */
+ else if (*arg_ptr == '=' && (arg_ptr == arg || *(arg_ptr - 1) != '\\'))
+ {
+ if (tsmap->new_dir[0] != '\0')
+ pg_fatal("multiple \"=\" signs in tablespace mapping");
+ else
+ dst = dst_ptr = tsmap->new_dir;
+ }
+ else
+ *dst_ptr++ = *arg_ptr;
+ }
+ if (!tsmap->old_dir[0] || !tsmap->new_dir[0])
+ pg_fatal("invalid tablespace mapping format \"%s\", must be \"OLDDIR=NEWDIR\"", arg);
+
+ /*
+ * All tablespaces are created with absolute directories, so specifying a
+ * non-absolute path here would never match, possibly confusing users.
+ *
+ * In contrast to pg_basebackup, both the old and new directories are on
+ * the local machine, so the local machine's definition of an absolute
+ * path is the only relevant one.
+ */
+ if (!is_absolute_path(tsmap->old_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->old_dir);
+
+ if (!is_absolute_path(tsmap->new_dir))
+ pg_fatal("old directory is not an absolute path in tablespace mapping: %s",
+ tsmap->new_dir);
+
+ /* Canonicalize paths to avoid spurious failures when comparing. */
+ canonicalize_path(tsmap->old_dir);
+ canonicalize_path(tsmap->new_dir);
+
+ /* Add it to the list. */
+ tsmap->next = opt->tsmappings;
+ opt->tsmappings = tsmap;
+}
+
+/*
+ * Check that the backup_label files form a coherent backup chain, and return
+ * the contents of the backup_label file from the latest backup.
+ */
+static StringInfo
+check_backup_label_files(int n_backups, char **backup_dirs)
+{
+ StringInfo buf = makeStringInfo();
+ StringInfo lastbuf = buf;
+ int i;
+ TimeLineID check_tli = 0;
+ XLogRecPtr check_lsn = InvalidXLogRecPtr;
+
+ /* Try to read each backup_label file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ char pathbuf[MAXPGPATH];
+ int fd;
+ TimeLineID start_tli;
+ TimeLineID previous_tli;
+ XLogRecPtr start_lsn;
+ XLogRecPtr previous_lsn;
+
+ /* Open the backup_label file. */
+ snprintf(pathbuf, MAXPGPATH, "%s/backup_label", backup_dirs[i]);
+ pg_log_debug("reading \"%s\"", pathbuf);
+ if ((fd = open(pathbuf, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", pathbuf);
+
+ /*
+ * Slurp the whole file into memory.
+ *
+ * The exact size limit that we impose here doesn't really matter --
+ * most of what's supposed to be in the file is fixed size and quite
+ * short. However, the length of the backup_label is limited (at least
+ * by some parts of the code) to MAXPGPATH, so include that value in
+ * the maximum length that we tolerate.
+ */
+ slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", pathbuf);
+
+ /* Parse the file contents. */
+ parse_backup_label(pathbuf, buf, &start_tli, &start_lsn,
+ &previous_tli, &previous_lsn);
+
+ /*
+ * Sanity checks.
+ *
+ * XXX. It's actually not required that start_lsn == check_lsn. It
+ * would be OK if start_lsn > check_lsn provided that start_lsn is
+ * less than or equal to the relevant switchpoint. But at the moment
+ * we don't have that information.
+ */
+ if (i > 0 && previous_tli == 0)
+ pg_fatal("backup at \"%s\" is a full backup, but only the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i == 0 && previous_tli != 0)
+ pg_fatal("backup at \"%s\" is an incremental backup, but the first backup should be a full backup",
+ backup_dirs[i]);
+ if (i < n_backups - 1 && start_tli != check_tli)
+ pg_fatal("backup at \"%s\" starts on timeline %u, but expected %u",
+ backup_dirs[i], start_tli, check_tli);
+ if (i < n_backups - 1 && start_lsn != check_lsn)
+ pg_fatal("backup at \"%s\" starts at LSN %X/%X, but expected %X/%X",
+ backup_dirs[i],
+ LSN_FORMAT_ARGS(start_lsn),
+ LSN_FORMAT_ARGS(check_lsn));
+ check_tli = previous_tli;
+ check_lsn = previous_lsn;
+
+ /*
+ * The last backup label in the chain needs to be saved for later use,
+ * while the others are only needed within this loop.
+ */
+ if (lastbuf == buf)
+ buf = makeStringInfo();
+ else
+ resetStringInfo(buf);
+ }
+
+ /* Free memory that we don't need any more. */
+ if (lastbuf != buf)
+ {
+ pfree(buf->data);
+ pfree(buf);
+ }
+
+ /*
+ * Return the data from the first backup_label that we read (which is the
+ * backup_label from the last directory specified on the command line).
+ */
+ return lastbuf;
+}
+
+/*
+ * Sanity check control files.
+ */
+static void
+check_control_files(int n_backups, char **backup_dirs)
+{
+ int i;
+ uint64 system_identifier = 0; /* placate compiler */
+
+ /* Try to read each control file in turn, last to first. */
+ for (i = n_backups - 1; i >= 0; --i)
+ {
+ ControlFileData *control_file;
+ bool crc_ok;
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s", backup_dirs[i], "global/pg_control");
+ pg_log_debug("reading \"%s\"", controlpath);
+ control_file = get_controlfile(backup_dirs[i], &crc_ok);
+
+ /* Control file contents not meaningful if CRC is bad. */
+ if (!crc_ok)
+ pg_fatal("%s: crc is incorrect", controlpath);
+
+ /* Can't interpret control file if not current version. */
+ if (control_file->pg_control_version != PG_CONTROL_VERSION)
+ pg_fatal("%s: unexpected control file version",
+ controlpath);
+
+ /* System identifiers should all match. */
+ if (i == n_backups - 1)
+ system_identifier = control_file->system_identifier;
+ else if (system_identifier != control_file->system_identifier)
+ pg_fatal("%s: expected system identifier %llu, but found %llu",
+ controlpath, (unsigned long long) system_identifier,
+ (unsigned long long) control_file->system_identifier);
+
+ /* Release memory. */
+ pfree(control_file);
+ pfree(controlpath);
+ }
+
+ /*
+ * If debug output is enabled, make a note of the system identifier that
+ * we found in all of the relevant control files.
+ */
+ pg_log_debug("system identifier is %llu",
+ (unsigned long long) system_identifier);
+}
+
+/*
+ * Set default permissions for new files and directories based on the
+ * permissions of the given directory. The intent here is that the output
+ * directory should use the same permissions scheme as the final input
+ * directory.
+ */
+static void
+check_input_dir_permissions(char *dir)
+{
+ struct stat st;
+
+ if (stat(dir, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", dir);
+
+ SetDataDirectoryCreatePerm(st.st_mode);
+}
+
+/*
+ * Clean up output directories before exiting.
+ */
+static void
+cleanup_directories_atexit(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ if (dir->rmtopdir)
+ {
+ pg_log_info("removing output directory \"%s\"", dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove output directory");
+ }
+ else
+ {
+ pg_log_info("removing contents of output directory \"%s\"",
+ dir->target_path);
+ if (!rmtree(dir->target_path, dir->rmtopdir))
+ pg_log_error("failed to remove contents of output directory");
+ }
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Create the named output directory, unless it already exists or we're in
+ * dry-run mode. If it already exists but is not empty, that's a fatal error.
+ *
+ * Adds the directory to the list of directories to be cleaned up at process
+ * exit, whether we created it here or it already existed (empty).
+ */
+static void
+create_output_directory(char *dirname, cb_options *opt)
+{
+ switch (pg_check_dir(dirname))
+ {
+ case 0:
+ if (opt->dry_run)
+ {
+ pg_log_debug("would create directory \"%s\"", dirname);
+ return;
+ }
+ pg_log_debug("creating directory \"%s\"", dirname);
+ if (pg_mkdir_p(dirname, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", dirname);
+ remember_to_cleanup_directory(dirname, true);
+ break;
+
+ case 1:
+ pg_log_debug("using existing directory \"%s\"", dirname);
+ remember_to_cleanup_directory(dirname, false);
+ break;
+
+ case 2:
+ case 3:
+ case 4:
+ pg_fatal("directory \"%s\" exists but is not empty", dirname);
+
+ case -1:
+ pg_fatal("could not access directory \"%s\": %m", dirname);
+ }
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_combinebackup"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s reconstructs full backups from incrementals.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... DIRECTORY...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -d, --debug generate lots of debugging output\n"));
+ printf(_(" -n, --dry-run don't actually do anything\n"));
+ printf(_(" -N, --no-sync do not wait for changes to be written safely to disk\n"));
+ printf(_(" -o, --output output directory\n"));
+ printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"));
+ printf(_(" relocate tablespace in OLDDIR to NEWDIR\n"));
+ printf(_(" --manifest-checksums=SHA{224,256,384,512}|CRC32C|NONE\n"
+ " use algorithm for manifest checksums\n"));
+ printf(_(" --no-manifest suppress generation of backup manifest\n"));
+ printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Try to parse a string as a non-zero OID without leading zeroes.
+ *
+ * If it works, return true and set *result to the answer, else return false.
+ */
+static bool
+parse_oid(char *s, Oid *result)
+{
+ unsigned long oid;
+ char *ep;
+
+ errno = 0;
+ oid = strtoul(s, &ep, 10);
+ if (errno != 0 || *ep != '\0' || oid < 1 || oid > PG_UINT32_MAX)
+ return false;
+
+ *result = (Oid) oid;
+ return true;
+}
+
+/*
+ * Copy files from the input directory to the output directory, reconstructing
+ * full files from incremental files as required.
+ *
+ * If processing is a user-defined tablespace, the tsoid should be the OID
+ * of that tablespace and input_directory and output_directory should be the
+ * toplevel input and output directories for that tablespace. Otherwise,
+ * tsoid should be InvalidOid and input_directory and output_directory should
+ * be the main input and output directories.
+ *
+ * relative_path is the path beneath the given input and output directories
+ * that we are currently processing. If NULL, it indicates that we're
+ * processing the input and output directories themselves.
+ *
+ * n_prior_backups is the number of prior backups that we have available.
+ * This doesn't count the very last backup, which is referenced by
+ * output_directory, just the older ones. prior_backup_dirs is an array of
+ * the locations of those previous backups.
+ */
+static void
+process_directory_recursively(Oid tsoid,
+ char *input_directory,
+ char *output_directory,
+ char *relative_path,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ manifest_writer *mwriter,
+ cb_options *opt)
+{
+ char ifulldir[MAXPGPATH];
+ char ofulldir[MAXPGPATH];
+ char manifest_prefix[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ bool is_pg_tblspc;
+ bool is_pg_wal;
+ manifest_data *latest_manifest = manifests[n_prior_backups];
+ pg_checksum_type checksum_type;
+
+ /*
+ * pg_tblspc and pg_wal are special cases, so detect those here.
+ *
+ * pg_tblspc is only special at the top level, but subdirectories of
+ * pg_wal are just as special as the top level directory.
+ *
+ * Since incremental backup does not exist in pre-v10 versions, we don't
+ * have to worry about the old pg_xlog naming.
+ */
+ is_pg_tblspc = !OidIsValid(tsoid) && relative_path != NULL &&
+ strcmp(relative_path, "pg_tblspc") == 0;
+ is_pg_wal = !OidIsValid(tsoid) && relative_path != NULL &&
+ (strcmp(relative_path, "pg_wal") == 0 ||
+ strncmp(relative_path, "pg_wal/", 7) == 0);
+
+ /*
+ * If we're under pg_wal, then we don't need checksums, because these
+ * files aren't included in the backup manifest. Otherwise use whatever
+ * type of checksum is configured.
+ */
+ if (!is_pg_wal)
+ checksum_type = opt->manifest_checksums;
+ else
+ checksum_type = CHECKSUM_TYPE_NONE;
+
+ /*
+ * Append the relative path to the input and output directories, and
+ * figure out the appropriate prefix to add to files in this directory
+ * when looking them up in a backup manifest.
+ */
+ if (relative_path == NULL)
+ {
+ strlcpy(ifulldir, input_directory, MAXPGPATH);
+ strlcpy(ofulldir, output_directory, MAXPGPATH);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/", tsoid);
+ else
+ manifest_prefix[0] = '\0';
+ }
+ else
+ {
+ snprintf(ifulldir, MAXPGPATH, "%s/%s", input_directory,
+ relative_path);
+ snprintf(ofulldir, MAXPGPATH, "%s/%s", output_directory,
+ relative_path);
+ if (OidIsValid(tsoid))
+ snprintf(manifest_prefix, MAXPGPATH, "pg_tblspc/%u/%s/",
+ tsoid, relative_path);
+ else
+ snprintf(manifest_prefix, MAXPGPATH, "%s/", relative_path);
+ }
+
+ /*
+ * Toplevel output directories have already been created by the time this
+ * function is called, but any subdirectories are our responsibility.
+ */
+ if (relative_path != NULL)
+ {
+ if (opt->dry_run)
+ pg_log_debug("would create directory \"%s\"", ofulldir);
+ else
+ {
+ pg_log_debug("creating directory \"%s\"", ofulldir);
+ if (mkdir(ofulldir, pg_dir_create_mode) == -1)
+ pg_fatal("could not create directory \"%s\": %m", ofulldir);
+ }
+ }
+
+ /* It's time to scan the directory. */
+ if ((dir = opendir(ifulldir)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", ifulldir);
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ PGFileType type;
+ char ifullpath[MAXPGPATH];
+ char ofullpath[MAXPGPATH];
+ char manifest_path[MAXPGPATH];
+ Oid oid = InvalidOid;
+ int checksum_length = 0;
+ uint8 *checksum_payload = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /* Ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 ||
+ strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct input path. */
+ snprintf(ifullpath, MAXPGPATH, "%s/%s", ifulldir, de->d_name);
+
+ /* Figure out what kind of directory entry this is. */
+ type = get_dirent_type(ifullpath, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+
+ /*
+ * If we're processing pg_tblspc, then check whether the filename
+ * looks like it could be a tablespace OID. If so, and if the
+ * directory entry is a symbolic link or a directory, skip it.
+ *
+ * Our goal here is to ignore anything that would have been considered
+ * by scan_for_existing_tablespaces to be a tablespace.
+ */
+ if (is_pg_tblspc && parse_oid(de->d_name, &oid) &&
+ (type == PGFILETYPE_LNK || type == PGFILETYPE_DIR))
+ continue;
+
+ /* If it's a directory, recurse. */
+ if (type == PGFILETYPE_DIR)
+ {
+ char new_relative_path[MAXPGPATH];
+
+ /* Append new pathname component to relative path. */
+ if (relative_path == NULL)
+ strlcpy(new_relative_path, de->d_name, MAXPGPATH);
+ else
+ snprintf(new_relative_path, MAXPGPATH, "%s/%s", relative_path,
+ de->d_name);
+
+ /* And recurse. */
+ process_directory_recursively(tsoid,
+ input_directory, output_directory,
+ new_relative_path,
+ n_prior_backups, prior_backup_dirs,
+ manifests, mwriter, opt);
+ continue;
+ }
+
+ /* Skip anything that's not a regular file. */
+ if (type != PGFILETYPE_REG)
+ {
+ if (type == PGFILETYPE_LNK)
+ pg_log_warning("skipping symbolic link \"%s\"", ifullpath);
+ else
+ pg_log_warning("skipping special file \"%s\"", ifullpath);
+ continue;
+ }
+
+ /*
+ * Skip the backup_label and backup_manifest files; they require
+ * special handling and are handled elsewhere.
+ */
+ if (relative_path == NULL &&
+ (strcmp(de->d_name, "backup_label") == 0 ||
+ strcmp(de->d_name, "backup_manifest") == 0))
+ continue;
+
+ /*
+ * If it's an incremental file, hand it off to the reconstruction
+ * code, which will figure out what to do.
+ */
+ if (strncmp(de->d_name, INCREMENTAL_PREFIX,
+ INCREMENTAL_PREFIX_LENGTH) == 0)
+ {
+ /* Output path should not include "INCREMENTAL." prefix. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Manifest path likewise omits incremental prefix. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH);
+
+ /* Reconstruction logic will do the rest. */
+ reconstruct_from_incremental_file(ifullpath, ofullpath,
+ relative_path,
+ de->d_name + INCREMENTAL_PREFIX_LENGTH,
+ n_prior_backups,
+ prior_backup_dirs,
+ manifests,
+ manifest_path,
+ checksum_type,
+ &checksum_length,
+ &checksum_payload,
+ opt->debug,
+ opt->dry_run);
+ }
+ else
+ {
+ /* Construct the path that the backup_manifest will use. */
+ snprintf(manifest_path, MAXPGPATH, "%s%s", manifest_prefix,
+ de->d_name);
+
+ /*
+ * It's not an incremental file, so we need to copy the entire
+ * file to the output directory.
+ *
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the final input directory, we can save some
+ * work by reusing that checksum instead of computing a new one.
+ */
+ if (checksum_type != CHECKSUM_TYPE_NONE &&
+ latest_manifest != NULL)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(latest_manifest->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *bmpath;
+
+ /*
+ * The directory is out of sync with the backup_manifest,
+ * so emit a warning.
+ */
+ bmpath = psprintf("%s/%s", input_directory,
+ "backup_manifest");
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ bmpath, manifest_path);
+ pfree(bmpath);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ checksum_length = mfile->checksum_length;
+ checksum_payload = mfile->checksum_payload;
+ }
+ }
+
+ /*
+ * If we're reusing a checksum, then we don't need copy_file() to
+ * compute one for us, but otherwise, it needs to compute whatever
+ * type of checksum we need.
+ */
+ if (checksum_length != 0)
+ pg_checksum_init(&checksum_ctx, CHECKSUM_TYPE_NONE);
+ else
+ pg_checksum_init(&checksum_ctx, checksum_type);
+
+ /* Actually copy the file. */
+ snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+
+ /*
+ * If copy_file() performed a checksum calculation for us, then
+ * save the results (except in dry-run mode, when there's no
+ * point).
+ */
+ if (checksum_ctx.type != CHECKSUM_TYPE_NONE && !opt->dry_run)
+ {
+ checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ checksum_length = pg_checksum_final(&checksum_ctx,
+ checksum_payload);
+ }
+ }
+
+ /* Generate manifest entry, if needed. */
+ if (mwriter != NULL)
+ {
+ struct stat sb;
+
+ /*
+ * In order to generate a manifest entry, we need the file size
+ * and mtime. We have no way to know the correct mtime except to
+ * stat() the file, so just do that and get the size as well.
+ *
+ * If we didn't need the mtime here, we could try to obtain the
+ * file size from the reconstruction or file copy process above,
+ * although that is actually not convenient in all cases. If we
+ * write the file ourselves then clearly we can keep a count of
+ * bytes, but if we use something like CopyFile() then it's
+ * trickier. Since we have to stat() anyway to get the mtime,
+ * there's no point in worrying about it.
+ */
+ if (stat(ofullpath, &sb) < 0)
+ pg_fatal("could not stat file \"%s\": %m", ofullpath);
+
+ /* OK, now do the work. */
+ add_file_to_manifest(mwriter, manifest_path,
+ sb.st_size, sb.st_mtime,
+ checksum_type, checksum_length,
+ checksum_payload);
+ }
+
+ /* Avoid leaking memory. */
+ if (checksum_payload != NULL)
+ pfree(checksum_payload);
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", ifulldir);
+ closedir(dir);
+}
+
+/*
+ * Read the version number from PG_VERSION and convert it to the usual server
+ * version number format (e.g., if PG_VERSION contains "14\n", this function
+ * will return 140000).
+ */
+static int
+read_pg_version_file(char *directory)
+{
+ char filename[MAXPGPATH];
+ StringInfoData buf;
+ int fd;
+ int version;
+ char *ep;
+
+ /* Construct pathname. */
+ snprintf(filename, MAXPGPATH, "%s/PG_VERSION", directory);
+
+ /* Open file. */
+ if ((fd = open(filename, O_RDONLY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", filename);
+
+ /* Read into memory. Length limit of 128 should be more than generous. */
+ initStringInfo(&buf);
+ slurp_file(fd, filename, &buf, 128);
+
+ /* Close the file. */
+ if (close(fd) != 0)
+ pg_fatal("could not close \"%s\": %m", filename);
+
+ /* Convert to integer. */
+ errno = 0;
+ version = strtoul(buf.data, &ep, 10);
+ if (errno != 0 || *ep != '\n')
+ {
+ /*
+ * Incremental backup is not relevant to very old server versions that
+ * used multi-part version number (e.g. 9.6, or 8.4). So if we see
+ * what looks like the beginning of such a version number, just bail
+ * out.
+ */
+ if (version < 10 && *ep == '.')
+ pg_fatal("%s: server version too old\n", filename);
+ pg_fatal("%s: could not parse version number\n", filename);
+ }
+
+ /* Debugging output. */
+ pg_log_debug("read server version %d from \"%s\"", version, filename);
+
+ /* Release memory and return result. */
+ pfree(buf.data);
+ return version * 10000;
+}
+
+/*
+ * Add a directory to the list of output directories to clean up.
+ */
+static void
+remember_to_cleanup_directory(char *target_path, bool rmtopdir)
+{
+ cb_cleanup_dir *dir = pg_malloc(sizeof(cb_cleanup_dir));
+
+ dir->target_path = target_path;
+ dir->rmtopdir = rmtopdir;
+ dir->next = cleanup_dir_list;
+ cleanup_dir_list = dir;
+}
+
+/*
+ * Empty out the list of directories scheduled for cleanup at exit.
+ *
+ * We want to remove the output directories only on a failure, so call this
+ * function when we know that the operation has succeeded.
+ *
+ * Since we only expect this to be called when we're about to exit, we could
+ * just set cleanup_dir_list to NULL and be done with it, but we free the
+ * memory to be tidy.
+ */
+static void
+reset_directory_cleanup_list(void)
+{
+ while (cleanup_dir_list != NULL)
+ {
+ cb_cleanup_dir *dir = cleanup_dir_list;
+
+ cleanup_dir_list = cleanup_dir_list->next;
+ pfree(dir);
+ }
+}
+
+/*
+ * Scan the pg_tblspc directory of the final input backup to get a canonical
+ * list of what tablespaces are part of the backup.
+ *
+ * 'pathname' should be the path to the toplevel backup directory for the
+ * final backup in the backup chain.
+ */
+static cb_tablespace *
+scan_for_existing_tablespaces(char *pathname, cb_options *opt)
+{
+ char pg_tblspc[MAXPGPATH];
+ DIR *dir;
+ struct dirent *de;
+ cb_tablespace *tslist = NULL;
+
+ snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pathname);
+ pg_log_debug("scanning \"%s\"", pg_tblspc);
+
+ if ((dir = opendir(pg_tblspc)) == NULL)
+ pg_fatal("could not open directory \"%s\": %m", pathname);
+
+ while (errno = 0, (de = readdir(dir)) != NULL)
+ {
+ Oid oid;
+ char tblspcdir[MAXPGPATH];
+ char link_target[MAXPGPATH];
+ int link_length;
+ cb_tablespace *ts;
+ cb_tablespace *otherts;
+ PGFileType type;
+
+ /* Silently ignore "." and ".." entries. */
+ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+ continue;
+
+ /* Construct full pathname. */
+ snprintf(tblspcdir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+
+ /* Ignore any file name that doesn't look like a proper OID. */
+ if (!parse_oid(de->d_name, &oid))
+ {
+ pg_log_debug("skipping \"%s\" because the filename is not a legal tablespace OID",
+ tblspcdir);
+ continue;
+ }
+
+ /* Only symbolic links and directories are tablespaces. */
+ type = get_dirent_type(tblspcdir, de, false, PG_LOG_ERROR);
+ if (type == PGFILETYPE_ERROR)
+ exit(1);
+ if (type != PGFILETYPE_LNK && type != PGFILETYPE_DIR)
+ {
+ pg_log_debug("skipping \"%s\" because it is neither a symbolic link nor a directory",
+ tblspcdir);
+ continue;
+ }
+
+ /* Create a new tablespace object. */
+ ts = pg_malloc0(sizeof(cb_tablespace));
+ ts->oid = oid;
+
+ /*
+ * If it's a link, it's not an in-place tablespace. Otherwise, it must
+ * be a directory, and thus an in-place tablespace.
+ */
+ if (type == PGFILETYPE_LNK)
+ {
+ cb_tablespace_mapping *tsmap;
+
+ /* Read the link target. */
+ link_length = readlink(tblspcdir, link_target, sizeof(link_target));
+ if (link_length < 0)
+ pg_fatal("could not read symbolic link \"%s\": %m",
+ tblspcdir);
+ if (link_length >= sizeof(link_target))
+ pg_fatal("symbolic link \"%s\" is too long", tblspcdir);
+ link_target[link_length] = '\0';
+ if (!is_absolute_path(link_target))
+ pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
+
+ /* Canonicalize the link target. */
+ canonicalize_path(link_target);
+
+ /*
+ * Find the corresponding tablespace mapping and copy the relevant
+ * details into the new tablespace entry.
+ */
+ for (tsmap = opt->tsmappings; tsmap != NULL; tsmap = tsmap->next)
+ {
+ if (strcmp(tsmap->old_dir, link_target) == 0)
+ {
+ strlcpy(ts->old_dir, tsmap->old_dir, MAXPGPATH);
+ strlcpy(ts->new_dir, tsmap->new_dir, MAXPGPATH);
+ ts->in_place = false;
+ break;
+ }
+ }
+
+ /* Every non-in-place tablespace must be mapped. */
+ if (tsmap == NULL)
+ pg_fatal("tablespace at \"%s\" has no tablespace mapping",
+ link_target);
+ }
+ else
+ {
+ /*
+ * For an in-place tablespace, there's no separate directory, so
+ * we just record the paths within the data directories.
+ */
+ snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
+ de->d_name);
+ ts->in_place = true;
+ }
+
+ /* Tablespaces should not share a directory. */
+ for (otherts = tslist; otherts != NULL; otherts = otherts->next)
+ if (strcmp(ts->new_dir, otherts->new_dir) == 0)
+ pg_fatal("tablespaces with OIDs %u and %u both point at \"%s\"",
+ otherts->oid, oid, ts->new_dir);
+
+ /* Add this tablespace to the list. */
+ ts->next = tslist;
+ tslist = ts;
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", pg_tblspc);
+
+ if (closedir(dir) != 0)
+ pg_fatal("could not close directory \"%s\": %m", pg_tblspc);
+
+ return tslist;
+}
+
+/*
+ * Read a file into a StringInfo.
+ *
+ * fd is used for the actual file I/O, filename for error reporting purposes.
+ * A file longer than maxlen is a fatal error.
+ */
+static void
+slurp_file(int fd, char *filename, StringInfo buf, int maxlen)
+{
+ struct stat st;
+ ssize_t rb;
+
+ /* Check file size, and complain if it's too large. */
+ if (fstat(fd, &st) != 0)
+ pg_fatal("could not stat \"%s\": %m", filename);
+ if (st.st_size > maxlen)
+ pg_fatal("file \"%s\" is too large", filename);
+
+ /* Make sure we have enough space. */
+ enlargeStringInfo(buf, st.st_size);
+
+ /* Read the data. */
+ rb = read(fd, &buf->data[buf->len], st.st_size);
+
+ /*
+ * We don't expect any concurrent changes, so we should read exactly the
+ * expected number of bytes.
+ */
+ if (rb != st.st_size)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ filename, (int) rb, (int) st.st_size);
+ }
+
+ /* Adjust buffer length for new data and restore trailing-\0 invariant */
+ buf->len += rb;
+ buf->data[buf->len] = '\0';
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
new file mode 100644
index 0000000000..6decdd8934
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -0,0 +1,687 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.c
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <unistd.h>
+
+#include "backup/basebackup_incremental.h"
+#include "common/logging.h"
+#include "common/file_perm.h"
+#include "copy_file.h"
+#include "lib/stringinfo.h"
+#include "reconstruct.h"
+#include "storage/block.h"
+
+/*
+ * An rfile stores the data that we need in order to be able to use some file
+ * on disk for reconstruction. For any given output file, we create one rfile
+ * per backup that we need to consult when constructing that output file.
+ *
+ * If we find a full version of the file in the backup chain, then only
+ * filename and fd are initialized; the remaining fields are 0 or NULL.
+ * For an incremental file, header_length, num_blocks, relative_block_numbers,
+ * and truncation_block_length are also set.
+ *
+ * num_blocks_read and highest_offset_read always start out as 0.
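+ * They are updated as blocks are read during reconstruction, and are later
+ * used for debug output and dry-run sanity checks.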
+ */
+typedef struct rfile
+{
+ char *filename;
+ int fd;
+ size_t header_length;
+ unsigned num_blocks;
+ BlockNumber *relative_block_numbers;
+ unsigned truncation_block_length;
+ unsigned num_blocks_read;
+ off_t highest_offset_read;
+} rfile;
+
+static void debug_reconstruction(int n_source,
+ rfile **sources,
+ bool dry_run);
+static unsigned find_reconstructed_block_length(rfile *s);
+static rfile *make_incremental_rfile(char *filename);
+static rfile *make_rfile(char *filename, bool missing_ok);
+static void write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run);
+static void read_bytes(rfile *rf, void *buffer, unsigned length);
+
+/*
+ * Reconstruct a full file from an incremental file and a chain of prior
+ * backups.
+ *
+ * input_filename should be the path to the incremental file, and
+ * output_filename should be the path where the reconstructed file is to be
+ * written.
+ *
+ * relative_path should be the relative path to the directory containing this
+ * file. bare_file_name should be the name of the file within that directory,
+ * without "INCREMENTAL.".
+ *
+ * n_prior_backups is the number of prior backups, and prior_backup_dirs is
+ * an array of pathnames where those backups can be found.
+ */
+void
+reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run)
+{
+ rfile **source;
+ rfile *latest_source = NULL;
+ rfile **sourcemap;
+ off_t *offsetmap;
+ unsigned block_length;
+ unsigned i;
+ unsigned sidx = n_prior_backups;
+ bool full_copy_possible = true;
+ int copy_source_index = -1;
+ rfile *copy_source = NULL;
+ pg_checksum_context checksum_ctx;
+
+ /*
+ * Every block must come either from the latest version of the file or
+ * from one of the prior backups.
+ */
+ source = pg_malloc0(sizeof(rfile *) * (1 + n_prior_backups));
+
+ /*
+ * Use the information from the latest incremental file to figure out how
+ * long the reconstructed file should be.
+ */
+ latest_source = make_incremental_rfile(input_filename);
+ source[n_prior_backups] = latest_source;
+ block_length = find_reconstructed_block_length(latest_source);
+
+ /*
+ * For each block in the output file, we need to know from which file we
+ * need to obtain it and at what offset in that file it's stored.
+ * sourcemap gives us the first of these things, and offsetmap the latter.
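+ *
+ * For example, if block 3 of the output file is to be read from offset
+ * 24576 of some source file, then sourcemap[3] points to that file's rfile
+ * and offsetmap[3] == 24576.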
+ */
+ sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
+ offsetmap = pg_malloc0(sizeof(off_t) * block_length);
+
+ /*
+ * Every block that is present in the newest incremental file should be
+ * sourced from that file. If it precedes the truncation_block_length,
+ * it's a block that we would otherwise have had to find in an older
+ * backup and thus reduces the number of blocks remaining to be found by
+ * one; otherwise, it's an extra block that needs to be included in the
+ * output but would not have needed to be found in an older backup if it
+ * had not been present.
+ */
+ for (i = 0; i < latest_source->num_blocks; ++i)
+ {
+ BlockNumber b = latest_source->relative_block_numbers[i];
+
+ Assert(b < block_length);
+ sourcemap[b] = latest_source;
+ offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only possible if no
+ * blocks are needed from any later incremental file.
+ */
+ full_copy_possible = false;
+ }
+
+ while (1)
+ {
+ char source_filename[MAXPGPATH];
+ rfile *s;
+
+ /*
+ * Move to the next backup in the chain. If there are no more, then
+ * we're done.
+ */
+ if (sidx == 0)
+ break;
+ --sidx;
+
+ /*
+ * Look for the full file in the previous backup. If not found, then
+ * look for an incremental file instead.
+ */
+ snprintf(source_filename, MAXPGPATH, "%s/%s/%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ if ((s = make_rfile(source_filename, true)) == NULL)
+ {
+ snprintf(source_filename, MAXPGPATH, "%s/%s/INCREMENTAL.%s",
+ prior_backup_dirs[sidx], relative_path, bare_file_name);
+ s = make_incremental_rfile(source_filename);
+ }
+ source[sidx] = s;
+
+ /*
+ * If s->header_length == 0, then this is a full file; otherwise, it's
+ * an incremental file.
+ */
+ if (s->header_length == 0)
+ {
+ struct stat sb;
+ BlockNumber b;
+ BlockNumber blocklength;
+
+ /* We need to know the length of the file. */
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+
+ /*
+ * Since we found a full file, source all blocks from it that
+ * exist in the file.
+ *
+ * Note that there may be blocks that don't exist either in this
+ * file or in any incremental file but that precede
+ * truncation_block_length. These are, presumably, zero-filled
+ * blocks that result from the server extending the file without
+ * ever taking any action on those blocks that would have
+ * generated WAL.
+ *
+ * Sadly, we have no way of validating that this is really what
+ * happened, and neither does the server. From its perspective,
+ * an unmodified block that contains data looks exactly the same
+ * as a zero-filled block that never had any data: either way,
+ * it's not mentioned in any WAL summary and the server has no
+ * reason to read it. From our perspective, all we know is that
+ * nobody had a reason to back up the block. That certainly means
+ * that the block didn't exist at the time of the full backup, but
+ * the supposition that it was all zeroes at the time of every
+ * later backup is one that we can't validate.
+ */
+ blocklength = sb.st_size / BLCKSZ;
+ for (b = 0; b < latest_source->truncation_block_length; ++b)
+ {
+ if (sourcemap[b] == NULL && b < blocklength)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = b * BLCKSZ;
+ }
+ }
+
+ /*
+ * If a full copy looks possible, check whether the resulting file
+ * should be exactly as long as the source file is. If so, a full
+ * copy is acceptable, otherwise not.
+ */
+ if (full_copy_possible)
+ {
+ uint64 expected_length;
+
+ expected_length =
+ (uint64) latest_source->truncation_block_length;
+ expected_length *= BLCKSZ;
+ if (expected_length == sb.st_size)
+ {
+ copy_source = s;
+ copy_source_index = sidx;
+ }
+ }
+
+ /* We don't need to consider any further sources. */
+ break;
+ }
+
+ /*
+ * Since we found another incremental file, source all blocks from it
+ * that we need but don't yet have.
+ */
+ for (i = 0; i < s->num_blocks; ++i)
+ {
+ BlockNumber b = s->relative_block_numbers[i];
+
+ if (b < latest_source->truncation_block_length &&
+ sourcemap[b] == NULL)
+ {
+ sourcemap[b] = s;
+ offsetmap[b] = s->header_length + (i * BLCKSZ);
+
+ /*
+ * A full copy of a file from an earlier backup is only
+ * possible if no blocks are needed from any later incremental
+ * file.
+ */
+ full_copy_possible = false;
+ }
+ }
+ }
+
+ /*
+ * If a checksum of the required type already exists in the
+ * backup_manifest for the relevant input directory, we can save some work
+ * by reusing that checksum instead of computing a new one.
+ */
+ if (copy_source_index >= 0 && manifests[copy_source_index] != NULL &&
+ checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ manifest_file *mfile;
+
+ mfile = manifest_files_lookup(manifests[copy_source_index]->files,
+ manifest_path);
+ if (mfile == NULL)
+ {
+ char *path = psprintf("%s/backup_manifest",
+ prior_backup_dirs[copy_source_index]);
+
+ /*
+ * The directory is out of sync with the backup_manifest, so emit
+ * a warning.
+ */
+ /*- translator: the first %s is a backup manifest file, the second is a file absent therein */
+ pg_log_warning("\"%s\" contains no entry for \"%s\"",
+ path,
+ manifest_path);
+ pfree(path);
+ }
+ else if (mfile->checksum_type == checksum_type)
+ {
+ *checksum_length = mfile->checksum_length;
+ *checksum_payload = pg_malloc(*checksum_length);
+ memcpy(*checksum_payload, mfile->checksum_payload,
+ *checksum_length);
+ checksum_type = CHECKSUM_TYPE_NONE;
+ }
+ }
+
+ /* Prepare for checksum calculation, if required. */
+ if (pg_checksum_init(&checksum_ctx, checksum_type) < 0)
+ pg_fatal("could not initialize checksum of file \"%s\"",
+ output_filename);
+
+ /*
+ * If the full file can be created by copying a file from an older backup
+ * in the chain without needing to overwrite any blocks or truncate the
+ * result, then forget about performing reconstruction and just copy that
+ * file in its entirety.
+ *
+ * Otherwise, reconstruct.
+ */
+ if (copy_source != NULL)
+ copy_file(copy_source->filename, output_filename,
+ &checksum_ctx, dry_run);
+ else
+ {
+ write_reconstructed_file(input_filename, output_filename,
+ block_length, sourcemap, offsetmap,
+ &checksum_ctx, debug, dry_run);
+ debug_reconstruction(n_prior_backups + 1, source, dry_run);
+ }
+
+ /* Save results of checksum calculation. */
+ if (checksum_type != CHECKSUM_TYPE_NONE)
+ {
+ *checksum_payload = pg_malloc(PG_CHECKSUM_MAX_LENGTH);
+ *checksum_length = pg_checksum_final(&checksum_ctx,
+ *checksum_payload);
+ }
+
+ /*
+ * Close files and release memory.
+ */
+ for (i = 0; i <= n_prior_backups; ++i)
+ {
+ rfile *s = source[i];
+
+ if (s == NULL)
+ continue;
+ if (close(s->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", s->filename);
+ if (s->relative_block_numbers != NULL)
+ pfree(s->relative_block_numbers);
+ pg_free(s->filename);
+ }
+ pfree(sourcemap);
+ pfree(offsetmap);
+ pfree(source);
+}
+
+/*
+ * Perform post-reconstruction logging and sanity checks.
+ */
+static void
+debug_reconstruction(int n_source, rfile **sources, bool dry_run)
+{
+ unsigned i;
+
+ for (i = 0; i < n_source; ++i)
+ {
+ rfile *s = sources[i];
+
+ /* Ignore source if not used. */
+ if (s == NULL)
+ continue;
+
+ /* If no data is needed from this file, we can ignore it. */
+ if (s->num_blocks_read == 0)
+ continue;
+
+ /* Debug logging. */
+ if (dry_run)
+ pg_log_debug("would have read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+ else
+ pg_log_debug("read %u blocks from \"%s\"",
+ s->num_blocks_read, s->filename);
+
+ /*
+ * In dry-run mode, we don't actually try to read data from the file,
+ * but we do try to verify that the file is long enough that we could
+ * have read the data if we'd tried.
+ *
+ * If this fails, then it means that a non-dry-run attempt would fail,
+ * complaining of not being able to read the required bytes from the
+ * file.
+ */
+ if (dry_run)
+ {
+ struct stat sb;
+
+ if (fstat(s->fd, &sb) < 0)
+ pg_fatal("could not stat \"%s\": %m", s->filename);
+ if (sb.st_size < s->highest_offset_read)
+ pg_fatal("file \"%s\" is too short: expected %llu, found %llu",
+ s->filename,
+ (unsigned long long) s->highest_offset_read,
+ (unsigned long long) sb.st_size);
+ }
+ }
+}
+
+/*
+ * When we perform reconstruction using an incremental file, the output file
+ * should be at least as long as the truncation_block_length. Any blocks
+ * present in the incremental file increase the output length as far as is
+ * necessary to include those blocks.
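+ *
+ * For example, if the truncation_block_length is 4 but the incremental file
+ * also stores block 6, the reconstructed file must be 7 blocks long.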
+ */
+static unsigned
+find_reconstructed_block_length(rfile *s)
+{
+ unsigned block_length = s->truncation_block_length;
+ unsigned i;
+
+ for (i = 0; i < s->num_blocks; ++i)
+ if (s->relative_block_numbers[i] >= block_length)
+ block_length = s->relative_block_numbers[i] + 1;
+
+ return block_length;
+}
+
+/*
+ * Initialize an incremental rfile, reading the header so that we know which
+ * blocks it contains.
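+ *
+ * The header consists of the magic number, the block count, the truncation
+ * block length, and then the array of relative block numbers, in that
+ * order.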
+ */
+static rfile *
+make_incremental_rfile(char *filename)
+{
+ rfile *rf;
+ unsigned magic;
+
+ rf = make_rfile(filename, false);
+
+ /* Read and validate magic number. */
+ read_bytes(rf, &magic, sizeof(magic));
+ if (magic != INCREMENTAL_MAGIC)
+ pg_fatal("file \"%s\" has bad incremental magic number (0x%x not 0x%x)",
+ filename, magic, INCREMENTAL_MAGIC);
+
+ /* Read block count. */
+ read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
+ if (rf->num_blocks > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has block count %u in excess of segment size %u",
+ filename, rf->num_blocks, RELSEG_SIZE);
+
+ /* Read truncation block length. */
+ read_bytes(rf, &rf->truncation_block_length,
+ sizeof(rf->truncation_block_length));
+ if (rf->truncation_block_length > RELSEG_SIZE)
+ pg_fatal("file \"%s\" has truncation block length %u in excess of segment size %u",
+ filename, rf->truncation_block_length, RELSEG_SIZE);
+
+ /* Read block numbers if there are any. */
+ if (rf->num_blocks > 0)
+ {
+ rf->relative_block_numbers =
+ pg_malloc0(sizeof(BlockNumber) * rf->num_blocks);
+ read_bytes(rf, rf->relative_block_numbers,
+ sizeof(BlockNumber) * rf->num_blocks);
+ }
+
+ /* Remember length of header. */
+ rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
+ sizeof(rf->truncation_block_length) +
+ sizeof(BlockNumber) * rf->num_blocks;
+
+ return rf;
+}
+
+/*
+ * Allocate and perform basic initialization of an rfile.
+ */
+static rfile *
+make_rfile(char *filename, bool missing_ok)
+{
+ rfile *rf;
+
+ rf = pg_malloc0(sizeof(rfile));
+ rf->filename = pstrdup(filename);
+ if ((rf->fd = open(filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ {
+ if (missing_ok && errno == ENOENT)
+ {
+ pg_free(rf);
+ return NULL;
+ }
+ pg_fatal("could not open file \"%s\": %m", filename);
+ }
+
+ return rf;
+}
+
+/*
+ * Read the indicated number of bytes from an rfile into the buffer.
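+ *
+ * Reaching EOF or a short read is reported as a fatal error.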
+ */
+static void
+read_bytes(rfile *rf, void *buffer, unsigned length)
+{
+ ssize_t rb = read(rf->fd, buffer, length);
+
+ if (rb != length)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", rf->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes",
+ rf->filename, (int) rb, length);
+ }
+}
+
+/*
+ * Write out a reconstructed file.
+ */
+static void
+write_reconstructed_file(char *input_filename,
+ char *output_filename,
+ unsigned block_length,
+ rfile **sourcemap,
+ off_t *offsetmap,
+ pg_checksum_context *checksum_ctx,
+ bool debug,
+ bool dry_run)
+{
+ int wfd = -1;
+ unsigned i;
+ unsigned zero_blocks = 0;
+
+ /* Debugging output. */
+ if (debug)
+ {
+ StringInfoData debug_buf;
+ unsigned start_of_range = 0;
+ unsigned current_block = 0;
+
+ /* Basic information about the output file to be produced. */
+ if (dry_run)
+ pg_log_debug("would reconstruct \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+ else
+ pg_log_debug("reconstructing \"%s\" (%u blocks, checksum %s)",
+ output_filename, block_length,
+ pg_checksum_type_name(checksum_ctx->type));
+
+ /* Print out the plan for reconstructing this file. */
+ initStringInfo(&debug_buf);
+ while (current_block < block_length)
+ {
+ rfile *s = sourcemap[current_block];
+
+ /* Extend range, if possible. */
+ if (current_block + 1 < block_length &&
+ s == sourcemap[current_block + 1])
+ {
+ ++current_block;
+ continue;
+ }
+
+ /* Add details about this range. */
+ if (s == NULL)
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:zero", current_block);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:zero",
+ start_of_range, current_block);
+ }
+ else
+ {
+ if (current_block == start_of_range)
+ appendStringInfo(&debug_buf, " %u:%s@" UINT64_FORMAT,
+ current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ else
+ appendStringInfo(&debug_buf, " %u-%u:%s@" UINT64_FORMAT,
+ start_of_range, current_block,
+ s == NULL ? "ZERO" : s->filename,
+ (uint64) offsetmap[current_block]);
+ }
+
+ /* Begin new range. */
+ start_of_range = ++current_block;
+
+ /* If the output is very long or we are done, dump it now. */
+ if (current_block == block_length || debug_buf.len > 1024)
+ {
+ pg_log_debug("reconstruction plan:%s", debug_buf.data);
+ resetStringInfo(&debug_buf);
+ }
+ }
+
+ /* Free memory. */
+ pfree(debug_buf.data);
+ }
+
+ /* Open the output file, except in dry_run mode. */
+ if (!dry_run &&
+ (wfd = open(output_filename,
+ O_RDWR | PG_BINARY | O_CREAT | O_EXCL,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", output_filename);
+
+ /* Read and write the blocks as required. */
+ for (i = 0; i < block_length; ++i)
+ {
+ uint8 buffer[BLCKSZ];
+ rfile *s = sourcemap[i];
+ ssize_t wb;
+
+ /* Update accounting information. */
+ if (s == NULL)
+ ++zero_blocks;
+ else
+ {
+ s->num_blocks_read++;
+ s->highest_offset_read = Max(s->highest_offset_read,
+ offsetmap[i] + BLCKSZ);
+ }
+
+ /* Skip the rest of this in dry-run mode. */
+ if (dry_run)
+ continue;
+
+ /* Read or zero-fill the block as appropriate. */
+ if (s == NULL)
+ {
+ /*
+ * New block not mentioned in the WAL summary. Should have been an
+ * uninitialized block, so just zero-fill it.
+ */
+ memset(buffer, 0, BLCKSZ);
+ }
+ else
+ {
+ ssize_t rb;
+
+ /* Read the block from the correct source. */
+ rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
+ if (rb != BLCKSZ)
+ {
+ if (rb < 0)
+ pg_fatal("could not read file \"%s\": %m", s->filename);
+ else
+ pg_fatal("could not read file \"%s\": read only %d of %d bytes at offset %u",
+ s->filename, (int) rb, BLCKSZ,
+ (unsigned) offsetmap[i]);
+ }
+ }
+
+ /* Write out the block. */
+ if ((wb = write(wfd, buffer, BLCKSZ)) != BLCKSZ)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", output_filename);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ output_filename, (int) wb, BLCKSZ);
+ }
+
+ /* Update the checksum computation. */
+ if (pg_checksum_update(checksum_ctx, buffer, BLCKSZ) < 0)
+ pg_fatal("could not update checksum of file \"%s\"",
+ output_filename);
+ }
+
+ /* Debugging output. */
+ if (zero_blocks > 0)
+ {
+ if (dry_run)
+ pg_log_debug("would have zero-filled %u blocks", zero_blocks);
+ else
+ pg_log_debug("zero-filled %u blocks", zero_blocks);
+ }
+
+ /* Close the output file. */
+ if (wfd >= 0 && close(wfd) != 0)
+ pg_fatal("could not close \"%s\": %m", output_filename);
+}
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
new file mode 100644
index 0000000000..d689aeb5c2
--- /dev/null
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * reconstruct.h
+ * Reconstruct full file from incremental file and backup chain.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_combinebackup/reconstruct.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RECONSTRUCT_H
+#define RECONSTRUCT_H
+
+#include "common/checksum_helper.h"
+#include "load_manifest.h"
+
+extern void reconstruct_from_incremental_file(char *input_filename,
+ char *output_filename,
+ char *relative_path,
+ char *bare_file_name,
+ int n_prior_backups,
+ char **prior_backup_dirs,
+ manifest_data **manifests,
+ char *manifest_path,
+ pg_checksum_type checksum_type,
+ int *checksum_length,
+ uint8 **checksum_payload,
+ bool debug,
+ bool dry_run);
+
+#endif
diff --git a/src/bin/pg_combinebackup/t/001_basic.pl b/src/bin/pg_combinebackup/t/001_basic.pl
new file mode 100644
index 0000000000..fb66075d1a
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/001_basic.pl
@@ -0,0 +1,23 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_combinebackup');
+program_version_ok('pg_combinebackup');
+program_options_handling_ok('pg_combinebackup');
+
+command_fails_like(
+ ['pg_combinebackup'],
+ qr/no input directories specified/,
+ 'input directories must be specified');
+command_fails_like(
+ [ 'pg_combinebackup', $tempdir ],
+ qr/no output directory specified/,
+ 'output directory must be specified');
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/002_compare_backups.pl b/src/bin/pg_combinebackup/t/002_compare_backups.pl
new file mode 100644
index 0000000000..0b80455aff
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/002_compare_backups.pl
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(has_archiving => 1, allows_streaming => 1);
+$primary->append_conf('postgresql.conf', 'summarize_wal = on');
+$primary->start;
+
+# Create some test tables, each containing one row of data, plus a whole
+# extra database.
+$primary->safe_psql('postgres', <<EOM);
+CREATE TABLE will_change (a int, b text);
+INSERT INTO will_change VALUES (1, 'initial test row');
+CREATE TABLE will_grow (a int, b text);
+INSERT INTO will_grow VALUES (1, 'initial test row');
+CREATE TABLE will_shrink (a int, b text);
+INSERT INTO will_shrink VALUES (1, 'initial test row');
+CREATE TABLE will_get_vacuumed (a int, b text);
+INSERT INTO will_get_vacuumed VALUES (1, 'initial test row');
+CREATE TABLE will_get_dropped (a int, b text);
+INSERT INTO will_get_dropped VALUES (1, 'initial test row');
+CREATE TABLE will_get_rewritten (a int, b text);
+INSERT INTO will_get_rewritten VALUES (1, 'initial test row');
+CREATE DATABASE db_will_get_dropped;
+EOM
+
+# Take a full backup.
+my $backup1path = $primary->backup_dir . '/backup1';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Now make some database changes.
+$primary->safe_psql('postgres', <<EOM);
+UPDATE will_change SET b = 'modified value' WHERE a = 1;
+INSERT INTO will_grow
+ SELECT g, 'additional row' FROM generate_series(2, 5000) g;
+TRUNCATE will_shrink;
+VACUUM will_get_vacuumed;
+DROP TABLE will_get_dropped;
+CREATE TABLE newly_created (a int, b text);
+INSERT INTO newly_created VALUES (1, 'row for new table');
+VACUUM FULL will_get_rewritten;
+DROP DATABASE db_will_get_dropped;
+CREATE DATABASE db_newly_created;
+EOM
+
+# Take an incremental backup.
+my $backup2path = $primary->backup_dir . '/backup2';
+$primary->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup");
+
+# Find an LSN to which either backup can be recovered.
+my $lsn = $primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
+# Make sure that the WAL segment containing that LSN has been archived.
+# PostgreSQL won't issue two consecutive XLOG_SWITCH records, and the backup
+# just issued one, so call txid_current() to generate some WAL activity
+# before calling pg_switch_wal().
+$primary->safe_psql('postgres', 'SELECT txid_current();');
+$primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Now wait for the LSN we chose above to be archived.
+my $archive_wait_query =
+ "SELECT pg_walfile_name('$lsn') <= last_archived_wal FROM pg_stat_archiver;";
+$primary->poll_query_until('postgres', $archive_wait_query)
+ or die "Timed out while waiting for WAL segment to be archived";
+
+# Perform PITR from the full backup. Disable archive_mode so that the archive
+# doesn't find out about the new timeline; that way, the later PITR below will
+# choose the same timeline.
+my $pitr1 = PostgreSQL::Test::Cluster->new('pitr1');
+$pitr1->init_from_backup($primary, 'backup1',
+ standby => 1, has_restoring => 1);
+$pitr1->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr1->start();
+
+# Perform PITR to the same LSN from the incremental backup. Use the same
+# basic configuration as before.
+my $pitr2 = PostgreSQL::Test::Cluster->new('pitr2');
+$pitr2->init_from_backup($primary, 'backup2',
+ standby => 1, has_restoring => 1,
+ combine_with_prior => [ 'backup1' ]);
+$pitr2->append_conf('postgresql.conf', qq{
+recovery_target_lsn = '$lsn'
+recovery_target_action = 'promote'
+archive_mode = 'off'
+});
+$pitr2->start();
+
+# Wait until both servers exit recovery.
+$pitr1->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+$pitr2->poll_query_until('postgres',
+ "SELECT NOT pg_is_in_recovery();")
+ or die "Timed out while waiting for apply to reach LSN $lsn";
+
+# Perform a logical dump of each server, and check that they match.
+# It would be much nicer if we could physically compare the data files, but
+# that doesn't really work. The contents of the page hole aren't guaranteed to
+# be identical, and there can be other discrepancies as well. To make this work
+# we'd need the equivalent of each AM's rm_mask function written or at least
+# callable from Perl, and that doesn't seem practical.
+#
+# NB: We're just using the primary's backup directory for scratch space here.
+# This could equally well be any other directory we wanted to pick.
+my $backupdir = $primary->backup_dir;
+my $dump1 = $backupdir . '/pitr1.dump';
+my $dump2 = $backupdir . '/pitr2.dump';
+$pitr1->command_ok([
+ 'pg_dumpall', '-f', $dump1, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr1->connstr('postgres'),
+ ],
+ 'dump from PITR 1');
+$pitr2->command_ok([
+ 'pg_dumpall', '-f', $dump2, '--no-sync', '--no-unlogged-table-data',
+ '-d', $pitr2->connstr('postgres'),
+ ],
+ 'dump from PITR 2');
+
+# Compare the two dumps; there should be no differences.
+my $compare_res = compare($dump1, $dump2);
+note($dump1);
+note($dump2);
+is($compare_res, 0, "dumps are identical");
+
+# Provide more context if the dumps do not match.
+if ($compare_res != 0)
+{
+ my ($stdout, $stderr) =
+ run_command([ 'diff', '-u', $dump1, $dump2 ]);
+ print "=== diff of $dump1 and $dump2\n";
+ print "=== stdout ===\n";
+ print $stdout;
+ print "=== stderr ===\n";
+ print $stderr;
+ print "=== EOF ===\n";
+}
+
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/003_timeline.pl b/src/bin/pg_combinebackup/t/003_timeline.pl
new file mode 100644
index 0000000000..bc053ca5e8
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/003_timeline.pl
@@ -0,0 +1,90 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that restoring an incremental backup works
+# properly even when the reference backup is on a different timeline.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Create a table and insert a test row into it.
+$node1->safe_psql('postgres', <<EOM);
+CREATE TABLE mytable (a int, b text);
+INSERT INTO mytable VALUES (1, 'aardvark');
+EOM
+
+# Take a full backup.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Insert a second row on the original node.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (2, 'beetle');
+EOM
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Restore the incremental backup and use it to create a new node.
+my $node2 = PostgreSQL::Test::Cluster->new('node2');
+$node2->init_from_backup($node1, 'backup2',
+ combine_with_prior => [ 'backup1' ]);
+$node2->start();
+
+# Insert rows on both nodes.
+$node1->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (3, 'crab');
+EOM
+$node2->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (4, 'dingo');
+EOM
+
+# Take another incremental backup, from node2, based on backup2 from node1.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Restore the incremental backup and use it to create a new node.
+my $node3 = PostgreSQL::Test::Cluster->new('node3');
+$node3->init_from_backup($node1, 'backup3',
+ combine_with_prior => [ 'backup1', 'backup2' ]);
+$node3->start();
+
+# Let's insert one more row.
+$node3->safe_psql('postgres', <<EOM);
+INSERT INTO mytable VALUES (5, 'elephant');
+EOM
+
+# Now check that we have the expected rows.
+my $result = $node3->safe_psql('postgres', <<EOM);
+SELECT string_agg(a::text, ':'), string_agg(b, ':') FROM mytable;
+EOM
+is($result, '1:2:4:5|aardvark:beetle:dingo:elephant',
+ 'found expected rows after combining backups across timelines');
+
+# Let's also verify all the backups.
+for my $backup_name (qw(backup1 backup2 backup3))
+{
+ $node1->command_ok(
+ [ 'pg_verifybackup', $node1->backup_dir . '/' . $backup_name ],
+ "verify backup $backup_name");
+}
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
new file mode 100644
index 0000000000..37de61ac06
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -0,0 +1,75 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that pg_combinebackup works in the degenerate
+# case where it is invoked on a single full backup and that it can produce
+# a new, valid manifest when it does. Secondarily, it checks that
+# pg_combinebackup does not produce a manifest when run with --no-manifest.
+
+use strict;
+use warnings;
+use File::Compare;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Take a full backup.
+my $original_backup_path = $node->backup_dir . '/original';
+$node->command_ok(
+ [ 'pg_basebackup', '-D', $original_backup_path, '--no-sync', '-cfast' ],
+ "full backup");
+
+# Verify the full backup.
+$node->command_ok([ 'pg_verifybackup', $original_backup_path ],
+ "verify original backup");
+
+# Process the backup with pg_combinebackup using various manifest options.
+sub combine_and_test_one_backup
+{
+ my ($backup_name, $failure_pattern, @extra_options) = @_;
+ my $revised_backup_path = $node->backup_dir . '/' . $backup_name;
+ $node->command_ok(
+ [ 'pg_combinebackup', $original_backup_path, '-o', $revised_backup_path,
+ '--no-sync', @extra_options ],
+ "pg_combinebackup with @extra_options");
+ if (defined $failure_pattern)
+ {
+ $node->command_fails_like(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ $failure_pattern,
+ "unable to verify backup $backup_name");
+ }
+ else
+ {
+ $node->command_ok(
+ [ 'pg_verifybackup', $revised_backup_path ],
+ "verify backup $backup_name");
+ }
+}
+combine_and_test_one_backup('nomanifest',
+ qr/could not open file.*backup_manifest/, '--no-manifest');
+combine_and_test_one_backup('csum_none',
+ undef, '--manifest-checksums=NONE');
+combine_and_test_one_backup('csum_sha224',
+ undef, '--manifest-checksums=SHA224');
+
+# Verify that SHA224 is mentioned in the SHA224 manifest lots of times.
+my $sha224_manifest =
+ slurp_file($node->backup_dir . '/csum_sha224/backup_manifest');
+my $sha224_count = (() = $sha224_manifest =~ /SHA224/mig);
+cmp_ok($sha224_count,
+ '>', 100, "SHA224 is mentioned many times in SHA224 manifest");
+
+# Verify that no checksum algorithm is mentioned in the no-checksum manifest.
+my $nocsum_manifest =
+ slurp_file($node->backup_dir . '/csum_none/backup_manifest');
+my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
+is($nocsum_count, 0,
+ "Checksum_Algorithm is not mentioned in no-checksum manifest");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/t/005_integrity.pl b/src/bin/pg_combinebackup/t/005_integrity.pl
new file mode 100644
index 0000000000..b1f63a43e0
--- /dev/null
+++ b/src/bin/pg_combinebackup/t/005_integrity.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+#
+# This test aims to validate that an incremental backup can be combined
+# with a valid prior backup and that it cannot be combined with an invalid
+# prior backup.
+
+use strict;
+use warnings;
+use File::Compare;
+use File::Path qw(rmtree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Set up a new database instance.
+my $node1 = PostgreSQL::Test::Cluster->new('node1');
+$node1->init(has_archiving => 1, allows_streaming => 1);
+$node1->append_conf('postgresql.conf', 'summarize_wal = on');
+$node1->start;
+
+# Set up another new database instance. We don't want to use the cached
+# INITDB_TEMPLATE for this, because we want it to be a separate cluster
+# with a different system ID.
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
+
+ $node2 = PostgreSQL::Test::Cluster->new('node2');
+ $node2->init(has_archiving => 1, allows_streaming => 1);
+ $node2->append_conf('postgresql.conf', 'summarize_wal = on');
+ $node2->start;
+}
+
+# Take a full backup from node1.
+my $backup1path = $node1->backup_dir . '/backup1';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup1path, '--no-sync', '-cfast' ],
+ "full backup from node1");
+
+# Now take an incremental backup.
+my $backup2path = $node1->backup_dir . '/backup2';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup2path, '--no-sync', '-cfast',
+ '--incremental', $backup1path . '/backup_manifest' ],
+ "incremental backup from node1");
+
+# Now take another incremental backup.
+my $backup3path = $node1->backup_dir . '/backup3';
+$node1->command_ok(
+ [ 'pg_basebackup', '-D', $backup3path, '--no-sync', '-cfast',
+ '--incremental', $backup2path . '/backup_manifest' ],
+ "another incremental backup from node1");
+
+# Take a full backup from node2.
+my $backupother1path = $node1->backup_dir . '/backupother1';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother1path, '--no-sync', '-cfast' ],
+ "full backup from node2");
+
+# Take an incremental backup from node2.
+my $backupother2path = $node1->backup_dir . '/backupother2';
+$node2->command_ok(
+ [ 'pg_basebackup', '-D', $backupother2path, '--no-sync', '-cfast',
+ '--incremental', $backupother1path . '/backup_manifest' ],
+ "incremental backup from node2");
+
+# Result directory.
+my $resultpath = $node1->backup_dir . '/result';
+
+# Can't combine 2 full backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup1path, '-o', $resultpath ],
+ qr/is a full backup, but only the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine 2 incremental backups.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup2path, $backup2path, '-o', $resultpath ],
+ qr/is an incremental backup, but the first backup should be a full backup/,
+ "can't combine full backups");
+
+# Can't combine full backup with an incremental backup from a different system.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backupother2path, '-o', $resultpath ],
+ qr/expected system identifier.*but found/,
+ "can't combine backups from different nodes");
+
+# Can't omit a required backup.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't omit a required backup");
+
+# Can't combine backups in the wrong order.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $backup1path, $backup3path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine backups in the wrong order");
+
+# Can combine 3 backups that match up properly.
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, $backup3path, '-o', $resultpath ],
+ "can combine 3 matching backups");
+rmtree($resultpath);
+
+# Can combine full backup with first incremental.
+my $synthetic12path = $node1->backup_dir . '/synthetic12';
+$node1->command_ok(
+ [ 'pg_combinebackup', $backup1path, $backup2path, '-o', $synthetic12path ],
+ "can combine 2 matching backups");
+
+# Can combine result of previous step with second incremental.
+$node1->command_ok(
+ [ 'pg_combinebackup', $synthetic12path, $backup3path, '-o', $resultpath ],
+ "can combine synthetic backup with later incremental");
+rmtree($resultpath);
+
+# Can't combine result of 1+2 with 2.
+$node1->command_fails_like(
+ [ 'pg_combinebackup', $synthetic12path, $backup2path, '-o', $resultpath ],
+ qr/starts at LSN.*but expected/,
+ "can't combine synthetic backup with included incremental");
+
+# OK, that's all.
+done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
new file mode 100644
index 0000000000..82160134d8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -0,0 +1,293 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "common/checksum_helper.h"
+#include "common/file_perm.h"
+#include "common/logging.h"
+#include "lib/stringinfo.h"
+#include "load_manifest.h"
+#include "mb/pg_wchar.h"
+#include "write_manifest.h"
+
+struct manifest_writer
+{
+ char pathname[MAXPGPATH];
+ int fd;
+ StringInfoData buf;
+ bool first_file;
+ bool still_checksumming;
+ pg_checksum_context manifest_ctx;
+};
+
+static void escape_json(StringInfo buf, const char *str);
+static void flush_manifest(manifest_writer *mwriter);
+static size_t hex_encode(const uint8 *src, size_t len, char *dst);
+
+/*
+ * Create a new backup manifest writer.
+ *
+ * The backup manifest will be written into a file named backup_manifest
+ * in the specified directory.
+ */
+manifest_writer *
+create_manifest_writer(char *directory)
+{
+ manifest_writer *mwriter = pg_malloc(sizeof(manifest_writer));
+
+ snprintf(mwriter->pathname, MAXPGPATH, "%s/backup_manifest", directory);
+ mwriter->fd = -1;
+ initStringInfo(&mwriter->buf);
+ mwriter->first_file = true;
+ mwriter->still_checksumming = true;
+ pg_checksum_init(&mwriter->manifest_ctx, CHECKSUM_TYPE_SHA256);
+
+ appendStringInfo(&mwriter->buf,
+ "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
+ "\"Files\": [");
+
+ return mwriter;
+}
+
+/*
+ * Add an entry for a file to a backup manifest.
+ *
+ * This is very similar to the backend's AddFileToBackupManifest, but
+ * various adjustments are required due to frontend/backend differences
+ * and other details.
+ */
+void
+add_file_to_manifest(manifest_writer *mwriter, const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload)
+{
+ int pathlen = strlen(manifest_path);
+
+ if (mwriter->first_file)
+ {
+ appendStringInfoChar(&mwriter->buf, '\n');
+ mwriter->first_file = false;
+ }
+ else
+ appendStringInfoString(&mwriter->buf, ",\n");
+
+ if (pg_encoding_verifymbstr(PG_UTF8, manifest_path, pathlen) == pathlen)
+ {
+ appendStringInfoString(&mwriter->buf, "{ \"Path\": ");
+ escape_json(&mwriter->buf, manifest_path);
+ appendStringInfoString(&mwriter->buf, ", ");
+ }
+ else
+ {
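+ /*
+ * The path is not valid UTF-8, so emit it hex-encoded under
+ * "Encoded-Path" rather than as an ordinary JSON string.
+ */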
+ appendStringInfoString(&mwriter->buf, "{ \"Encoded-Path\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * pathlen);
+ mwriter->buf.len += hex_encode((const uint8 *) manifest_path, pathlen,
+ &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\", ");
+ }
+
+ appendStringInfo(&mwriter->buf, "\"Size\": %zu, ", size);
+
+ appendStringInfoString(&mwriter->buf, "\"Last-Modified\": \"");
+ enlargeStringInfo(&mwriter->buf, 128);
+ mwriter->buf.len += strftime(&mwriter->buf.data[mwriter->buf.len], 128,
+ "%Y-%m-%d %H:%M:%S %Z",
+ gmtime(&mtime));
+ appendStringInfoChar(&mwriter->buf, '"');
+
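+ /* Flush periodically so the in-memory buffer stays bounded. */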
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+
+ if (checksum_length > 0)
+ {
+ appendStringInfo(&mwriter->buf,
+ ", \"Checksum-Algorithm\": \"%s\", \"Checksum\": \"",
+ pg_checksum_type_name(checksum_type));
+
+ enlargeStringInfo(&mwriter->buf, 2 * checksum_length);
+ mwriter->buf.len += hex_encode(checksum_payload, checksum_length,
+ &mwriter->buf.data[mwriter->buf.len]);
+
+ appendStringInfoChar(&mwriter->buf, '"');
+ }
+
+ appendStringInfoString(&mwriter->buf, " }");
+
+ if (mwriter->buf.len > 128 * 1024)
+ flush_manifest(mwriter);
+}
+
+/*
+ * Finalize the backup_manifest.
+ */
+void
+finalize_manifest(manifest_writer *mwriter,
+ manifest_wal_range *first_wal_range)
+{
+ uint8 checksumbuf[PG_SHA256_DIGEST_LENGTH];
+ int len;
+ manifest_wal_range *wal_range;
+
+ /* Terminate the list of files. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Start a list of LSN ranges. */
+ appendStringInfoString(&mwriter->buf, "\"WAL-Ranges\": [\n");
+
+ for (wal_range = first_wal_range; wal_range != NULL;
+ wal_range = wal_range->next)
+ appendStringInfo(&mwriter->buf,
+ "%s{ \"Timeline\": %u, \"Start-LSN\": \"%X/%X\", \"End-LSN\": \"%X/%X\" }",
+ wal_range == first_wal_range ? "" : ",\n",
+ wal_range->tli,
+ LSN_FORMAT_ARGS(wal_range->start_lsn),
+ LSN_FORMAT_ARGS(wal_range->end_lsn));
+
+ /* Terminate the list of WAL ranges. */
+ appendStringInfoString(&mwriter->buf, "\n],\n");
+
+ /* Flush accumulated data and update checksum calculation. */
+ flush_manifest(mwriter);
+
+ /* Checksum only includes data up to this point. */
+ mwriter->still_checksumming = false;
+
+ /* Compute and insert manifest checksum. */
+ appendStringInfoString(&mwriter->buf, "\"Manifest-Checksum\": \"");
+ enlargeStringInfo(&mwriter->buf, 2 * PG_SHA256_DIGEST_STRING_LENGTH);
+ len = pg_checksum_final(&mwriter->manifest_ctx, checksumbuf);
+ Assert(len == PG_SHA256_DIGEST_LENGTH);
+ mwriter->buf.len +=
+ hex_encode(checksumbuf, len, &mwriter->buf.data[mwriter->buf.len]);
+ appendStringInfoString(&mwriter->buf, "\"}\n");
+
+ /* Flush the last manifest checksum itself. */
+ flush_manifest(mwriter);
+
+ /* Close the file. */
+ if (close(mwriter->fd) != 0)
+ pg_fatal("could not close \"%s\": %m", mwriter->pathname);
+ mwriter->fd = -1;
+}
+
+/*
+ * Produce a JSON string literal, properly escaping characters in the text.
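+ *
+ * Backslashes, double quotes, and control characters are escaped as
+ * required by RFC 8259.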
+ */
+static void
+escape_json(StringInfo buf, const char *str)
+{
+ const char *p;
+
+ appendStringInfoCharMacro(buf, '"');
+ for (p = str; *p; p++)
+ {
+ switch (*p)
+ {
+ case '\b':
+ appendStringInfoString(buf, "\\b");
+ break;
+ case '\f':
+ appendStringInfoString(buf, "\\f");
+ break;
+ case '\n':
+ appendStringInfoString(buf, "\\n");
+ break;
+ case '\r':
+ appendStringInfoString(buf, "\\r");
+ break;
+ case '\t':
+ appendStringInfoString(buf, "\\t");
+ break;
+ case '"':
+ appendStringInfoString(buf, "\\\"");
+ break;
+ case '\\':
+ appendStringInfoString(buf, "\\\\");
+ break;
+ default:
+ if ((unsigned char) *p < ' ')
+ appendStringInfo(buf, "\\u%04x", (int) *p);
+ else
+ appendStringInfoCharMacro(buf, *p);
+ break;
+ }
+ }
+ appendStringInfoCharMacro(buf, '"');
+}
+
+/*
+ * Flush whatever portion of the backup manifest we have generated and
+ * buffered in memory out to a file on disk.
+ *
+ * The first call to this function will create the file. After that, we
+ * keep it open and just append more data.
+ */
+static void
+flush_manifest(manifest_writer *mwriter)
+{
+ if (mwriter->fd == -1 &&
+ (mwriter->fd = open(mwriter->pathname,
+ O_WRONLY | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not open file \"%s\": %m", mwriter->pathname);
+
+ if (mwriter->buf.len > 0)
+ {
+ ssize_t wb;
+
+ wb = write(mwriter->fd, mwriter->buf.data, mwriter->buf.len);
+ if (wb != mwriter->buf.len)
+ {
+ if (wb < 0)
+ pg_fatal("could not write file \"%s\": %m", mwriter->pathname);
+ else
+ pg_fatal("could not write file \"%s\": wrote only %d of %d bytes",
+ mwriter->pathname, (int) wb, mwriter->buf.len);
+ }
+
+ if (mwriter->still_checksumming)
+ pg_checksum_update(&mwriter->manifest_ctx,
+ (uint8 *) mwriter->buf.data,
+ mwriter->buf.len);
+ resetStringInfo(&mwriter->buf);
+ }
+}
+
+/*
+ * Encode bytes using two hexadecimal digits for each one.
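+ *
+ * The caller must ensure that the destination buffer has room for 2 * len
+ * characters; no terminating NUL is written.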
+ */
+static size_t
+hex_encode(const uint8 *src, size_t len, char *dst)
+{
+ const uint8 *end = src + len;
+
+ while (src < end)
+ {
+ unsigned n1 = (*src >> 4) & 0xF;
+ unsigned n2 = *src & 0xF;
+
+ *dst++ = n1 < 10 ? '0' + n1 : 'a' + n1 - 10;
+ *dst++ = n2 < 10 ? '0' + n2 : 'a' + n2 - 10;
+ ++src;
+ }
+
+ return len * 2;
+}
diff --git a/src/bin/pg_combinebackup/write_manifest.h b/src/bin/pg_combinebackup/write_manifest.h
new file mode 100644
index 0000000000..8fd7fe02c8
--- /dev/null
+++ b/src/bin/pg_combinebackup/write_manifest.h
@@ -0,0 +1,33 @@
+/*-------------------------------------------------------------------------
+ *
+ * Write a new backup manifest.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/pg_combinebackup/write_manifest.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef WRITE_MANIFEST_H
+#define WRITE_MANIFEST_H
+
+#include "common/checksum_helper.h"
+#include "pgtime.h"
+
+struct manifest_wal_range;
+
+struct manifest_writer;
+typedef struct manifest_writer manifest_writer;
+
+extern manifest_writer *create_manifest_writer(char *directory);
+extern void add_file_to_manifest(manifest_writer *mwriter,
+ const char *manifest_path,
+ size_t size, pg_time_t mtime,
+ pg_checksum_type checksum_type,
+ int checksum_length,
+ uint8 *checksum_payload);
+extern void finalize_manifest(manifest_writer *mwriter,
+ struct manifest_wal_range *first_wal_range);
+
+#endif /* WRITE_MANIFEST_H */
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..5407f51a4e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -85,6 +85,7 @@ static void RewriteControlFile(void);
static void FindEndOfXLOG(void);
static void KillExistingXLOG(void);
static void KillExistingArchiveStatus(void);
+static void KillExistingWALSummaries(void);
static void WriteEmptyXLOG(void);
static void usage(void);
@@ -493,6 +494,7 @@ main(int argc, char *argv[])
RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
+ KillExistingWALSummaries();
WriteEmptyXLOG();
printf(_("Write-ahead log reset\n"));
@@ -1034,6 +1036,40 @@ KillExistingArchiveStatus(void)
pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
}
+/*
+ * Remove existing WAL summary files
+ */
+static void
+KillExistingWALSummaries(void)
+{
+#define WALSUMMARYDIR XLOGDIR "/summaries"
+#define WALSUMMARY_NHEXCHARS 40
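+ /*
+ * Summary file names consist of an 8-character timeline ID and two
+ * 16-character LSNs (start and end), all in hex, followed by ".summary".
+ */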
+
+ DIR *xldir;
+ struct dirent *xlde;
+ char path[MAXPGPATH + sizeof(WALSUMMARYDIR)];
+
+ xldir = opendir(WALSUMMARYDIR);
+ if (xldir == NULL)
+ pg_fatal("could not open directory \"%s\": %m", WALSUMMARYDIR);
+
+ while (errno = 0, (xlde = readdir(xldir)) != NULL)
+ {
+ if (strspn(xlde->d_name, "0123456789ABCDEF") == WALSUMMARY_NHEXCHARS &&
+ strcmp(xlde->d_name + WALSUMMARY_NHEXCHARS, ".summary") == 0)
+ {
+ snprintf(path, sizeof(path), "%s/%s", WALSUMMARYDIR, xlde->d_name);
+ if (unlink(path) < 0)
+ pg_fatal("could not delete file \"%s\": %m", path);
+ }
+ }
+
+ if (errno)
+ pg_fatal("could not read directory \"%s\": %m", WALSUMMARYDIR);
+
+ if (closedir(xldir))
+ pg_fatal("could not close directory \"%s\": %m", ARCHSTATDIR);
+}
/*
* Write an empty XLOG file, containing only the checkpoint record
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..90e04cad56 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -28,6 +28,8 @@ typedef struct BackupState
XLogRecPtr checkpointloc; /* last checkpoint location */
pg_time_t starttime; /* backup start time */
bool started_in_recovery; /* backup started in recovery? */
+ XLogRecPtr istartpoint; /* incremental based on backup at this LSN */
+ TimeLineID istarttli; /* incremental based on backup on this TLI */
/* Fields saved at the end of backup */
XLogRecPtr stoppoint; /* backup stop WAL location */
diff --git a/src/include/backup/basebackup.h b/src/include/backup/basebackup.h
index 1432d9c206..345bd22534 100644
--- a/src/include/backup/basebackup.h
+++ b/src/include/backup/basebackup.h
@@ -34,6 +34,9 @@ typedef struct
int64 size; /* total size as sent; -1 if not known */
} tablespaceinfo;
-extern void SendBaseBackup(BaseBackupCmd *cmd);
+struct IncrementalBackupInfo;
+
+extern void SendBaseBackup(BaseBackupCmd *cmd,
+ struct IncrementalBackupInfo *ib);
#endif /* _BASEBACKUP_H */
diff --git a/src/include/backup/basebackup_incremental.h b/src/include/backup/basebackup_incremental.h
new file mode 100644
index 0000000000..de99117599
--- /dev/null
+++ b/src/include/backup/basebackup_incremental.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * basebackup_incremental.h
+ * API for incremental backup support
+ *
+ * Portions Copyright (c) 2010-2023, PostgreSQL Global Development Group
+ *
+ * src/include/backup/basebackup_incremental.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BASEBACKUP_INCREMENTAL_H
+#define BASEBACKUP_INCREMENTAL_H
+
+#include "access/xlogbackup.h"
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "utils/palloc.h"
+
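+/* Magic number stored at the start of each incremental file. */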
+#define INCREMENTAL_MAGIC 0xd3ae1f0d
+
+typedef enum
+{
+ BACK_UP_FILE_FULLY,
+ BACK_UP_FILE_INCREMENTALLY
+} FileBackupMethod;
+
+struct IncrementalBackupInfo;
+typedef struct IncrementalBackupInfo IncrementalBackupInfo;
+
+extern IncrementalBackupInfo *CreateIncrementalBackupInfo(MemoryContext);
+
+extern void AppendIncrementalManifestData(IncrementalBackupInfo *ib,
+ const char *data,
+ int len);
+extern void FinalizeIncrementalManifest(IncrementalBackupInfo *ib);
+
+extern void PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
+ BackupState *backup_state);
+
+extern char *GetIncrementalFilePath(Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum, unsigned segno);
+extern FileBackupMethod GetFileBackupMethod(IncrementalBackupInfo *ib,
+ const char *path,
+ Oid dboid, Oid spcoid,
+ RelFileNumber relfilenumber,
+ ForkNumber forknum,
+ unsigned segno, size_t size,
+ unsigned *num_blocks_required,
+ BlockNumber *relative_block_numbers,
+ unsigned *truncation_block_length);
+extern size_t GetIncrementalFileSize(unsigned num_blocks_required);
+
+#endif
diff --git a/src/include/nodes/replnodes.h b/src/include/nodes/replnodes.h
index 5142a08729..c98961c329 100644
--- a/src/include/nodes/replnodes.h
+++ b/src/include/nodes/replnodes.h
@@ -108,4 +108,13 @@ typedef struct TimeLineHistoryCmd
TimeLineID timeline;
} TimeLineHistoryCmd;
+/* ----------------------
+ * UPLOAD_MANIFEST command
+ * ----------------------
+ */
+typedef struct UploadManifestCmd
+{
+ NodeTag type;
+} UploadManifestCmd;
+
#endif /* REPLNODES_H */
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a020377761..46cb2a6550 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -779,6 +779,10 @@ a tar-format backup, pass the name of the tar program to use in the
keyword parameter tar_program. Note that tablespace tar files aren't
handled here.
+To restore from an incremental backup, pass the parameter combine_with_prior
+as a reference to an array of prior backup names with which this backup
+is to be combined using pg_combinebackup.
+
Streaming replication can be enabled on this node by passing the keyword
parameter has_streaming => 1. This is disabled by default.
@@ -816,7 +820,22 @@ sub init_from_backup
mkdir $self->archive_dir;
my $data_path = $self->data_dir;
- if (defined $params{tar_program})
+ if (defined $params{combine_with_prior})
+ {
+ my @prior_backups = @{$params{combine_with_prior}};
+ my @prior_backup_path;
+
+ for my $prior_backup_name (@prior_backups)
+ {
+ push @prior_backup_path,
+ $root_node->backup_dir . '/' . $prior_backup_name;
+ }
+
+ local %ENV = $self->_get_env();
+ PostgreSQL::Test::Utils::system_or_bail('pg_combinebackup', '-d',
+ @prior_backup_path, $backup_path, '-o', $data_path);
+ }
+ elsif (defined $params{tar_program})
{
mkdir($data_path);
PostgreSQL::Test::Utils::system_or_bail($params{tar_program}, 'xf',
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9390049314..e37ef9aa76 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4023,3 +4023,15 @@ SummarizerReadLocalXLogPrivate
WalSummarizerData
WalSummaryFile
WalSummaryIO
+FileBackupMethod
+IncrementalBackupInfo
+UploadManifestCmd
+backup_file_entry
+backup_wal_range
+cb_cleanup_dir
+cb_options
+cb_tablespace
+cb_tablespace_mapping
+manifest_data
+manifest_writer
+rfile
--
2.39.3 (Apple Git-145)
Hi Robert,
On Tue, Dec 19, 2023 at 9:36 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 15, 2023 at 5:36 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course they do differ after initdb), but I get this
instead:
$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432
--incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING: aborting backup due to backend exiting before pg_backup_stop
was called
pg_basebackup: error: could not initiate base backup: ERROR: timeline
2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"
Also in the manifest I don't see DBID?
Maybe it's a nuisance and all I'm trying to say is that if an
automated cronjob with pg_basebackup --incremental hits a freshly
upgraded cluster, that error message without errhint() is going to
scare some Junior DBAs.
Yeah. I think we should add the system identifier to the manifest, but
I think that should be left for a future project, as I don't think the
lack of it is a good reason to stop all progress here. When we have
that, we can give more reliable error messages about system mismatches
at an earlier stage. Unfortunately, I don't think that the timeline
messages you're seeing here are going to apply in every case: suppose
you have two unrelated servers that are both on timeline 1. I think
you could use a base backup from one of those servers and use it as
the basis for the incremental from the other, and I think that if you
did it right you might fail to hit any sanity check that would block
that. pg_combinebackup will realize there's a problem, because it has
the whole cluster to work with, not just the manifest, and will notice
the mismatching system identifiers, but that's kind of late to find
out that you made a big mistake. However, right now, it's the best we
can do.
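Until the manifest carries the system identifier, a cautious operator can
compare identifiers by hand before taking an incremental backup. A minimal
sketch, with a hypothetical data directory path:

$ pg_controldata /var/lib/postgres/17/data | grep 'system identifier'
Database system identifier:           7315070207972093366

If the value differs from the one noted when the full backup was taken, the
backup chain does not belong to this cluster.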
OK, understood.
The incrementals are being generated, but just for the first (0)
segment of the relation?
I committed the first two patches from the series I posted yesterday.
The first should fix this, and the second relocates parse_manifest.c.
That patch hasn't changed in a while and seems unlikely to attract
major objections. There's no real reason to commit it until we're
ready to move forward with the main patches, but I think we're very
close to that now, so I did.
Here's a rebase for cfbot.
the v15 patchset (posted yesterday) test results are GOOD:
1. make check-world - GOOD
2. cfbot was GOOD
3. the devel/master bug present in
parse_filename_for_nontemp_relation() seems to be gone (in local
testing)
4. some further tests:
test_across_wallevelminimal.sh - GOOD
test_incr_after_timelineincrease.sh - GOOD
test_incr_on_standby_after_promote.sh - GOOD
test_many_incrementals_dbcreate.sh - GOOD
test_many_incrementals.sh - GOOD
test_multixact.sh - GOOD
test_pending_2pc.sh - GOOD
test_reindex_and_vacuum_full.sh - GOOD
test_repro_assert.sh
test_standby_incr_just_backup.sh - GOOD
test_stuck_walsum.sh - GOOD
test_truncaterollback.sh - GOOD
test_unlogged_table.sh - GOOD
test_full_pri__incr_stby__restore_on_pri.sh - GOOD
test_full_pri__incr_stby__restore_on_stby.sh - GOOD
test_full_stby__incr_stby__restore_on_pri.sh - GOOD
test_full_stby__incr_stby__restore_on_stby.sh - GOOD
5. the more real-world pgbench test with localized segment writes
using `\set aid random_exponential...` [1] indicates much greater
efficiency in terms of backup space use now; du -sm shows:
210229 /backups/backups/full
250 /backups/backups/incr.1
255 /backups/backups/incr.2
[..]
348 /backups/backups/incr.13
408 /backups/backups/incr.14 // latest (20th of Dec at 10:40)
6673 /backups/archive/
The DB size, as reported by \l+, was 205GB.
That pgbench was running for ~27h (19th Dec 08:39 -> 20th Dec 11:30)
at a slow 100 TPS (-R), so no insane amounts of WAL.
Time to reconstruct the 14 chained incremental backups was 45 minutes
(pg_combinebackup -o /var/lib/postgres/17/data /backups/backups/full
/backups/backups/incr.1 (..) /backups/backups/incr.14).
DB after recovering was OK and working fine.
-J.
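For reference, the skewed-access workload from point 5 can be approximated
with a custom pgbench script along these lines; the exact parameters were
elided above, so these values are hypothetical:

\set aid random_exponential(1, 100000 * :scale, 5.0)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;

run, e.g., as: pgbench -n -R 100 -T 97200 -f skewed.sql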
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
the v15 patchset (posted yesterday) test results are GOOD:
All right. I committed the main two patches, dropped the
for-testing-only patch, and added a simple test to the remaining
pg_walsummary patch. That needs more work, but here's what I have as
of now.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
v17-0001-Add-new-pg_walsummary-tool.patch (application/octet-stream)
From b1ef3268b441d7661f5277e4aa89468d957a9f5d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 20 Dec 2023 15:52:59 -0500
Subject: [PATCH v17] Add new pg_walsummary tool.
This can dump the contents of the WAL summary files found in
pg_wal/summaries. Normally, this shouldn't really be something anyone
needs to do, but it may be needed for debugging problems with
incremental backup, or could possibly be used in some useful way by
external tools.
XXX. Needs more tests.
---
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/ref/pg_walsummary.sgml | 122 +++++++++++
doc/src/sgml/reference.sgml | 1 +
src/bin/Makefile | 1 +
src/bin/meson.build | 1 +
src/bin/pg_walsummary/.gitignore | 1 +
src/bin/pg_walsummary/Makefile | 48 +++++
src/bin/pg_walsummary/meson.build | 29 +++
src/bin/pg_walsummary/nls.mk | 6 +
src/bin/pg_walsummary/pg_walsummary.c | 280 ++++++++++++++++++++++++++
src/bin/pg_walsummary/t/001_basic.pl | 19 ++
src/tools/pgindent/typedefs.list | 2 +
12 files changed, 511 insertions(+)
create mode 100644 doc/src/sgml/ref/pg_walsummary.sgml
create mode 100644 src/bin/pg_walsummary/.gitignore
create mode 100644 src/bin/pg_walsummary/Makefile
create mode 100644 src/bin/pg_walsummary/meson.build
create mode 100644 src/bin/pg_walsummary/nls.mk
create mode 100644 src/bin/pg_walsummary/pg_walsummary.c
create mode 100644 src/bin/pg_walsummary/t/001_basic.pl
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index fda4690eab..4a42999b18 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -219,6 +219,7 @@ Complete list of usable sgml source files in this directory.
<!ENTITY pgtesttiming SYSTEM "pgtesttiming.sgml">
<!ENTITY pgupgrade SYSTEM "pgupgrade.sgml">
<!ENTITY pgwaldump SYSTEM "pg_waldump.sgml">
+<!ENTITY pgwalsummary SYSTEM "pg_walsummary.sgml">
<!ENTITY postgres SYSTEM "postgres-ref.sgml">
<!ENTITY psqlRef SYSTEM "psql-ref.sgml">
<!ENTITY reindexdb SYSTEM "reindexdb.sgml">
diff --git a/doc/src/sgml/ref/pg_walsummary.sgml b/doc/src/sgml/ref/pg_walsummary.sgml
new file mode 100644
index 0000000000..93e265ead7
--- /dev/null
+++ b/doc/src/sgml/ref/pg_walsummary.sgml
@@ -0,0 +1,122 @@
+<!--
+doc/src/sgml/ref/pg_walsummary.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgwalsummary">
+ <indexterm zone="app-pgwalsummary">
+ <primary>pg_walsummary</primary>
+ </indexterm>
+
+ <refmeta>
+ <refentrytitle><application>pg_walsummary</application></refentrytitle>
+ <manvolnum>1</manvolnum>
+ <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+ <refname>pg_walsummary</refname>
+ <refpurpose>print contents of WAL summary files</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+ <cmdsynopsis>
+ <command>pg_walsummary</command>
+ <arg rep="repeat" choice="opt"><replaceable>option</replaceable></arg>
+ <arg rep="repeat"><replaceable>file</replaceable></arg>
+ </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+ <title>Description</title>
+ <para>
+ <application>pg_walsummary</application> is used to print the contents of
+ WAL summary files. These binary files are found with the
+ <literal>pg_wal/summaries</literal> subdirectory of the data directory,
+ and can be converted to text using this tool. This is not ordinarily
+ necessary, since WAL summary files primarily exist to support
+ <link linkend="backup-incremental-backup">incremental backup</link>,
+ but it may be useful for debugging purposes.
+ </para>
+
+ <para>
+ A WAL summary file is indexed by tablespace OID, relation OID, and relation
+ fork. For each relation fork, it stores the list of blocks that were
+ modified by WAL within the range summarized in the file. It can also
+ store a "limit block," which is 0 if the relation fork was created or
+ truncated within the relevant WAL range, and otherwise the shortest length
+ to which the relation fork was truncated. If the relation fork was not
+ created, deleted, or truncated within the relevant WAL range, the limit
+ block is undefined or infinite and will not be printed by this tool.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>Options</title>
+
+ <para>
+ <variablelist>
+ <varlistentry>
+ <term><option>-i</option></term>
+ <term><option>--individual</option></term>
+ <listitem>
+ <para>
+ By default, <literal>pg_walsummary</literal> prints one line of output
+ for each range of one or more consecutive modified blocks. This can
+ make the output a lot briefer, since a relation where all blocks from
+ 0 through 999 were modified will produce only one line of output rather
+ than 1000 separate lines. This option requests a separate line of
+ output for every modified block.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-q</option></term>
+ <term><option>--quiet</option></term>
+ <listitem>
+ <para>
+ Do not print any output, except for errors. This can be useful
+ when you want to know whether a WAL summary file can be successfully
+ parsed but don't care about the contents.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>-?</option></term>
+ <term><option>--help</option></term>
+ <listitem>
+ <para>
+ Shows help about <application>pg_walsummary</application> command line
+ arguments, and exits.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+ </para>
+
+ </refsect1>
+
+ <refsect1>
+ <title>Environment</title>
+
+ <para>
+ The environment variable <envar>PG_COLOR</envar> specifies whether to use
+ color in diagnostic messages. Possible values are
+ <literal>always</literal>, <literal>auto</literal> and
+ <literal>never</literal>.
+ </para>
+ </refsect1>
+
+ <refsect1>
+ <title>See Also</title>
+
+ <simplelist type="inline">
+ <member><xref linkend="app-pgbasebackup"/></member>
+ <member><xref linkend="app-pgcombinebackup"/></member>
+ </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index a07d2b5e01..aa94f6adf6 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -289,6 +289,7 @@
&pgtesttiming;
&pgupgrade;
&pgwaldump;
+ &pgwalsummary;
&postgres;
</reference>
diff --git a/src/bin/Makefile b/src/bin/Makefile
index aa2210925e..f98f58d39e 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
pg_upgrade \
pg_verifybackup \
pg_waldump \
+ pg_walsummary \
pgbench \
psql \
scripts
diff --git a/src/bin/meson.build b/src/bin/meson.build
index 4cb6fd59bb..d1e9ef4409 100644
--- a/src/bin/meson.build
+++ b/src/bin/meson.build
@@ -17,6 +17,7 @@ subdir('pg_test_timing')
subdir('pg_upgrade')
subdir('pg_verifybackup')
subdir('pg_waldump')
+subdir('pg_walsummary')
subdir('pgbench')
subdir('pgevent')
subdir('psql')
diff --git a/src/bin/pg_walsummary/.gitignore b/src/bin/pg_walsummary/.gitignore
new file mode 100644
index 0000000000..d71ec192fa
--- /dev/null
+++ b/src/bin/pg_walsummary/.gitignore
@@ -0,0 +1 @@
+pg_walsummary
diff --git a/src/bin/pg_walsummary/Makefile b/src/bin/pg_walsummary/Makefile
new file mode 100644
index 0000000000..2c24bc6db5
--- /dev/null
+++ b/src/bin/pg_walsummary/Makefile
@@ -0,0 +1,48 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_walsummary
+#
+# Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_walsummary/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_walsummary - print contents of WAL summary files"
+PGAPPICON=win32
+
+subdir = src/bin/pg_walsummary
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
+OBJS = \
+ $(WIN32RES) \
+ pg_walsummary.o
+
+all: pg_walsummary
+
+pg_walsummary: $(OBJS) | submake-libpgport submake-libpgfeutils
+ $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+ $(INSTALL_PROGRAM) pg_walsummary$(X) '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+ rm -f '$(DESTDIR)$(bindir)/pg_walsummary$(X)'
+
+clean distclean maintainer-clean:
+ rm -f pg_walsummary$(X) $(OBJS)
+
+check:
+ $(prove_check)
+
+installcheck:
+ $(prove_installcheck)
diff --git a/src/bin/pg_walsummary/meson.build b/src/bin/pg_walsummary/meson.build
new file mode 100644
index 0000000000..25cd56cda8
--- /dev/null
+++ b/src/bin/pg_walsummary/meson.build
@@ -0,0 +1,29 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+pg_walsummary_sources = files(
+ 'pg_walsummary.c',
+)
+
+if host_system == 'windows'
+ pg_walsummary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_walsummary',
+ '--FILEDESC', 'pg_walsummary - print contents of WAL summary files',])
+endif
+
+pg_walsummary = executable('pg_walsummary',
+ pg_walsummary_sources,
+ dependencies: [frontend_code],
+ kwargs: default_bin_args,
+)
+bin_targets += pg_walsummary
+
+tests += {
+ 'name': 'pg_walsummary',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'tap': {
+ 'tests': [
+ 't/001_basic.pl',
+ ],
+ }
+}
diff --git a/src/bin/pg_walsummary/nls.mk b/src/bin/pg_walsummary/nls.mk
new file mode 100644
index 0000000000..f411dcfe9e
--- /dev/null
+++ b/src/bin/pg_walsummary/nls.mk
@@ -0,0 +1,6 @@
+# src/bin/pg_walsummary/nls.mk
+CATALOG_NAME = pg_walsummary
+GETTEXT_FILES = $(FRONTEND_COMMON_GETTEXT_FILES) \
+ pg_walsummary.c
+GETTEXT_TRIGGERS = $(FRONTEND_COMMON_GETTEXT_TRIGGERS)
+GETTEXT_FLAGS = $(FRONTEND_COMMON_GETTEXT_FLAGS)
diff --git a/src/bin/pg_walsummary/pg_walsummary.c b/src/bin/pg_walsummary/pg_walsummary.c
new file mode 100644
index 0000000000..0c0225eeb8
--- /dev/null
+++ b/src/bin/pg_walsummary/pg_walsummary.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_walsummary.c
+ * Prints the contents of WAL summary files.
+ *
+ * Copyright (c) 2017-2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/bin/pg_walsummary/pg_walsummary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <fcntl.h>
+#include <limits.h>
+
+#include "common/blkreftable.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "lib/stringinfo.h"
+#include "getopt_long.h"
+
+typedef struct ws_options
+{
+ bool individual;
+ bool quiet;
+} ws_options;
+
+typedef struct ws_file_info
+{
+ int fd;
+ char *filename;
+} ws_file_info;
+
+static BlockNumber *block_buffer = NULL;
+static unsigned block_buffer_size = 512; /* Initial size. */
+
+static void dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader);
+static void help(const char *progname);
+static int compare_block_numbers(const void *a, const void *b);
+static int walsummary_read_callback(void *callback_arg, void *data,
+ int length);
+static void walsummary_error_callback(void *callback_arg, char *fmt,...) pg_attribute_printf(2, 3);
+
+/*
+ * Main program.
+ */
+int
+main(int argc, char *argv[])
+{
+ static struct option long_options[] = {
+ {"individual", no_argument, NULL, 'i'},
+ {"quiet", no_argument, NULL, 'q'},
+ {NULL, 0, NULL, 0}
+ };
+
+ const char *progname;
+ int optindex;
+ int c;
+ ws_options opt;
+
+ memset(&opt, 0, sizeof(ws_options));
+
+ pg_logging_init(argv[0]);
+ progname = get_progname(argv[0]);
+ handle_help_version_opts(argc, argv, progname, help);
+
+ /* process command-line options */
+ while ((c = getopt_long(argc, argv, "f:iqw:",
+ long_options, &optindex)) != -1)
+ {
+ switch (c)
+ {
+ case 'i':
+ opt.individual = true;
+ break;
+ case 'q':
+ opt.quiet = true;
+ break;
+ default:
+ /* getopt_long already emitted a complaint */
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+ }
+
+ if (optind >= argc)
+ {
+ pg_log_error("%s: no input files specified", progname);
+ pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+ exit(1);
+ }
+
+ while (optind < argc)
+ {
+ ws_file_info ws;
+ BlockRefTableReader *reader;
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ BlockNumber limit_block;
+
+ ws.filename = argv[optind++];
+ if ((ws.fd = open(ws.filename, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", ws.filename);
+
+ reader = CreateBlockRefTableReader(walsummary_read_callback, &ws,
+ ws.filename,
+ walsummary_error_callback, NULL);
+ while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
+ &limit_block))
+ dump_one_relation(&opt, &rlocator, forknum, limit_block, reader);
+
+ DestroyBlockRefTableReader(reader);
+ close(ws.fd);
+ }
+
+ exit(0);
+}
+
+/*
+ * Dump details for one relation.
+ */
+static void
+dump_one_relation(ws_options *opt, RelFileLocator *rlocator,
+ ForkNumber forknum, BlockNumber limit_block,
+ BlockRefTableReader *reader)
+{
+ unsigned i = 0;
+ unsigned nblocks;
+ BlockNumber startblock = InvalidBlockNumber;
+ BlockNumber endblock = InvalidBlockNumber;
+
+ /* Dump limit block, if any. */
+ if (limit_block != InvalidBlockNumber)
+ printf("TS %u, DB %u, REL %u, FORK %s: limit %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], limit_block);
+
+ /* If we haven't allocated a block buffer yet, do that now. */
+ if (block_buffer == NULL)
+ block_buffer = palloc_array(BlockNumber, block_buffer_size);
+
+ /* Try to fill the block buffer. */
+ nblocks = BlockRefTableReaderGetBlocks(reader,
+ block_buffer,
+ block_buffer_size);
+
+ /* If we filled the block buffer completely, we must enlarge it. */
+ while (nblocks >= block_buffer_size)
+ {
+ unsigned new_size;
+
+ /* Double the size, being careful about overflow. */
+ new_size = block_buffer_size * 2;
+ if (new_size < block_buffer_size)
+ new_size = PG_UINT32_MAX;
+ block_buffer = repalloc_array(block_buffer, BlockNumber, new_size);
+
+ /* Try to fill the newly-allocated space. */
+ nblocks +=
+ BlockRefTableReaderGetBlocks(reader,
+ block_buffer + block_buffer_size,
+ new_size - block_buffer_size);
+
+ /* Save the new size for later calls. */
+ block_buffer_size = new_size;
+ }
+
+ /* If we don't need to produce any output, skip the rest of this. */
+ if (opt->quiet)
+ return;
+
+ /*
+ * Sort the returned block numbers. If the block reference table was using
+ * the bitmap representation for a given chunk, the block numbers in that
+ * chunk will already be sorted, but when the array-of-offsets
+ * representation is used, we can receive block numbers here out of order.
+ */
+ qsort(block_buffer, nblocks, sizeof(BlockNumber), compare_block_numbers);
+
+ /* Dump block references. */
+ while (i < nblocks)
+ {
+ /*
+ * Find the next range of blocks to print, but if --individual was
+ * specified, then consider each block a separate range.
+ */
+ startblock = endblock = block_buffer[i++];
+ if (!opt->individual)
+ {
+ while (i < nblocks && block_buffer[i] == endblock + 1)
+ {
+ endblock++;
+ i++;
+ }
+ }
+
+ /* Format this range of block numbers as a string. */
+ if (startblock == endblock)
+ printf("TS %u, DB %u, REL %u, FORK %s: block %u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock);
+ else
+ printf("TS %u, DB %u, REL %u, FORK %s: blocks %u..%u\n",
+ rlocator->spcOid, rlocator->dbOid, rlocator->relNumber,
+ forkNames[forknum], startblock, endblock);
+ }
+}
+
+/*
+ * Quicksort comparator for block numbers.
+ */
+static int
+compare_block_numbers(const void *a, const void *b)
+{
+ BlockNumber aa = *(BlockNumber *) a;
+ BlockNumber bb = *(BlockNumber *) b;
+
+ if (aa > bb)
+ return 1;
+ else if (aa == bb)
+ return 0;
+ else
+ return -1;
+}
+
+/*
+ * Error callback.
+ */
+void
+walsummary_error_callback(void *callback_arg, char *fmt,...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ pg_log_generic_v(PG_LOG_ERROR, PG_LOG_PRIMARY, fmt, ap);
+ va_end(ap);
+
+ exit(1);
+}
+
+/*
+ * Read callback.
+ */
+int
+walsummary_read_callback(void *callback_arg, void *data, int length)
+{
+ ws_file_info *ws = callback_arg;
+ int rc;
+
+ if ((rc = read(ws->fd, data, length)) < 0)
+ pg_fatal("could not read file \"%s\": %m", ws->filename);
+
+ return rc;
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_walsummary"
+ */
+static void
+help(const char *progname)
+{
+ printf(_("%s prints the contents of a WAL summary file.\n\n"), progname);
+ printf(_("Usage:\n"));
+ printf(_(" %s [OPTION]... FILE...\n"), progname);
+ printf(_("\nOptions:\n"));
+ printf(_(" -i, --individual list block numbers individually, not as ranges\n"));
+ printf(_(" -q, --quiet don't print anything, just parse the files\n"));
+ printf(_(" -?, --help show this help, then exit\n"));
+
+ printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+ printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/bin/pg_walsummary/t/001_basic.pl b/src/bin/pg_walsummary/t/001_basic.pl
new file mode 100644
index 0000000000..10a232a150
--- /dev/null
+++ b/src/bin/pg_walsummary/t/001_basic.pl
@@ -0,0 +1,19 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+program_help_ok('pg_walsummary');
+program_version_ok('pg_walsummary');
+program_options_handling_ok('pg_walsummary');
+
+command_fails_like(
+ ['pg_walsummary'],
+ qr/no input files specified/,
+ 'input files must be specified');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..86e0a86503 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4035,3 +4035,5 @@ cb_tablespace_mapping
manifest_data
manifest_writer
rfile
+ws_options
+ws_file_info
--
2.39.3 (Apple Git-145)
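For orientation, a usage sketch for the tool above; the summary file name
and OIDs are hypothetical, and the output lines follow the printf formats
in pg_walsummary.c:

$ pg_walsummary $PGDATA/pg_wal/summaries/00000001000000000100002800000000010000F8.summary
TS 1663, DB 5, REL 16384, FORK main: limit 0
TS 1663, DB 5, REL 16384, FORK main: blocks 0..127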
Hello Robert,
20.12.2023 23:56, Robert Haas wrote:
On Wed, Dec 20, 2023 at 8:11 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
the v15 patchset (posted yesterday) test results are GOOD:
All right. I committed the main two patches, dropped the
for-testing-only patch, and added a simple test to the remaining
pg_walsummary patch. That needs more work, but here's what I have as
of now.
I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?:
s/arguent/argument/;
s/BlkRefTableEntry/BlockRefTableEntry/;
s/BlockRefTablEntry/BlockRefTableEntry/;
s/Caonicalize/Canonicalize/;
s/Checksum_Algorithm/Checksum-Algorithm/;
s/corresonding/corresponding/;
s/differenly/differently/;
s/excessing/excessive/;
s/ exta / extra /;
s/hexademical/hexadecimal/;
s/initally/initially/;
s/MAXGPATH/MAXPGPATH/;
s/overrreacting/overreacting/;
s/old_meanifest_file/old_manifest_file/;
s/pg_cominebackup/pg_combinebackup/;
s/pg_tblpc/pg_tblspc/;
s/pointrs/pointers/;
s/Recieve/Receive/;
s/recieved/received/;
s/ recod / record /;
s/ recods / records /;
s/substntially/substantially/;
s/sumamry/summary/;
s/summry/summary/;
s/synchronizaton/synchronization/;
s/sytem/system/;
s/withot/without/;
s/Woops/Whoops/;
s/xlograder/xlogreader/;
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Best regards,
Alexander
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?
That's an impressively long list of mistakes in something I thought
I'd been careful about. Sigh.
I don't suppose you could provide these corrections in the form of a
patch? I don't really want to run these sed commands across the entire
tree and then try to figure out what's what...
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Wow, yeah. Thanks, will fix.
--
Robert Haas
EDB: http://www.enterprisedb.com
21.12.2023 15:07, Robert Haas wrote:
On Wed, Dec 20, 2023 at 11:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
I've found several typos/inconsistencies introduced with 174c48050 and
dc2123400. Maybe you would want to fix them, while on it?
That's an impressively long list of mistakes in something I thought
I'd been careful about. Sigh.
I don't suppose you could provide these corrections in the form of a
patch? I don't really want to run these sed commands across the entire
tree and then try to figure out what's what...
Please look at the attached patch; it corrects all 29 items ("recods"
fixed in two places), but maybe you find some substitutions wrong...
I've also observed that those commits introduced new warnings:
$ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
reconstruct.c: In function ‘read_bytes’:
reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
511 | if (rb < 0)
| ^
reconstruct.c: In function ‘write_reconstructed_file’:
reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
650 | if (rb < 0)
| ^
reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
662 | if (wb < 0)
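A standalone reduction of that warning (a hypothetical example, not project
code): read() returns ssize_t, so storing its result in an unsigned
variable makes the error check unreachable.

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	unsigned	rb = read(-1, NULL, 0); /* read() fails and returns -1 */

	if (rb < 0)					/* always false: rb is unsigned */
		puts("error path (unreachable)");
	else
		printf("rb = %u\n", rb);	/* prints 4294967295: -1, wrapped */
	return 0;
}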
There are also two deadcode.DeadStores complaints from clang. First one is
about:
/*
* Align the wait time to prevent drift. This doesn't really matter,
* but we'd like the warnings about how long we've been waiting to say
* 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
* drifting to something that is not a multiple of ten.
*/
timeout_in_ms -=
TimestampDifferenceMilliseconds(current_time, initial_time) %
timeout_in_ms;
It looks like this timeout is really not used.
And the minor one (similar to many existing, maybe doesn't deserve fixing):
walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores]
summary_end_lsn = private_data->read_upto;
^ ~~~~~~~~~~~~~~~~~~~~~~~
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Wow, yeah. Thanks, will fix.
Thank you for paying attention to it!
Best regards,
Alexander
Attachments:
fix-typos.patch (text/x-patch; charset=UTF-8)
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 7c183a5cfd..e411ddbf45 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -213,7 +213,7 @@ PostgreSQL documentation
<varlistentry>
<term><option>-i <replaceable class="parameter">old_manifest_file</replaceable></option></term>
- <term><option>--incremental=<replaceable class="parameter">old_meanifest_file</replaceable></option></term>
+ <term><option>--incremental=<replaceable class="parameter">old_manifest_file</replaceable></option></term>
<listitem>
<para>
Performs an <link linkend="backup-incremental-backup">incremental
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index e1cb31607e..8a0a600c2b 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -83,7 +83,7 @@ PostgreSQL documentation
<listitem>
<para>
The <option>-n</option>/<option>--dry-run</option> option instructs
- <command>pg_cominebackup</command> to figure out what would be done
+ <command>pg_combinebackup</command> to figure out what would be done
without actually creating the target directory or any output files.
It is particularly useful in combination with <option>--debug</option>.
</para>
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 1e5a5ac33a..42bbe564e2 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -158,7 +158,7 @@ CreateIncrementalBackupInfo(MemoryContext mcxt)
/*
* Before taking an incremental backup, the caller must supply the backup
- * manifest from a prior backup. Each chunk of manifest data recieved
+ * manifest from a prior backup. Each chunk of manifest data received
* from the client should be passed to this function.
*/
void
@@ -462,7 +462,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
++deadcycles;
/*
- * If we've managed to wait for an entire minute withot the WAL
+ * If we've managed to wait for an entire minute without the WAL
* summarizer absorbing a single WAL record, error out; probably
* something is wrong.
*
@@ -473,7 +473,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
* likely to catch a reasonable number of the things that can go wrong
* in practice (e.g. the summarizer process is completely hung, say
* because somebody hooked up a debugger to it or something) without
- * giving up too quickly when the sytem is just slow.
+ * giving up too quickly when the system is just slow.
*/
if (deadcycles >= 6)
ereport(ERROR,
diff --git a/src/backend/backup/walsummaryfuncs.c b/src/backend/backup/walsummaryfuncs.c
index a1f69ad4ba..f96491534d 100644
--- a/src/backend/backup/walsummaryfuncs.c
+++ b/src/backend/backup/walsummaryfuncs.c
@@ -92,7 +92,7 @@ pg_wal_summary_contents(PG_FUNCTION_ARGS)
errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid timeline %lld", (long long) raw_tli));
- /* Prepare to read the specified WAL summry file. */
+ /* Prepare to read the specified WAL summary file. */
ws.tli = (TimeLineID) raw_tli;
ws.start_lsn = PG_GETARG_LSN(1);
ws.end_lsn = PG_GETARG_LSN(2);
@@ -143,7 +143,7 @@ pg_wal_summary_contents(PG_FUNCTION_ARGS)
}
/*
- * If the limit block is not InvalidBlockNumber, emit an exta row
+ * If the limit block is not InvalidBlockNumber, emit an extra row
* with that block number and limit_block = true.
*
* There is no point in doing this when the limit_block is
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9fa155349e..071d2c0d58 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -87,7 +87,7 @@ typedef struct
XLogRecPtr pending_lsn;
/*
- * This field handles its own synchronizaton.
+ * This field handles its own synchronization.
*/
ConditionVariable summary_file_cv;
} WalSummarizerData;
@@ -117,7 +117,7 @@ static long sleep_quanta = 1;
/*
* The sleep time will always be a multiple of 200ms and will not exceed
* thirty seconds (150 * 200 = 30 * 1000). Note that the timeout here needs
- * to be substntially less than the maximum amount of time for which an
+ * to be substantially less than the maximum amount of time for which an
* incremental backup will wait for this process to catch up. Otherwise, an
* incremental backup might time out on an idle system just because we sleep
* for too long.
@@ -212,7 +212,7 @@ WalSummarizerMain(void)
/*
* Within this function, 'current_lsn' and 'current_tli' refer to the
* point from which the next WAL summary file should start. 'exact' is
- * true if 'current_lsn' is known to be the start of a WAL recod or WAL
+ * true if 'current_lsn' is known to be the start of a WAL record or WAL
* segment, and false if it might be in the middle of a record someplace.
*
* 'switch_lsn' and 'switch_tli', if set, are the LSN at which we need to
@@ -297,7 +297,7 @@ WalSummarizerMain(void)
/*
* Sleep for 10 seconds before attempting to resume operations in
- * order to avoid excessing logging.
+ * order to avoid excessive logging.
*
* Many of the likely error conditions are things that will repeat
* every time. For example, if the WAL can't be read or the summary
@@ -449,7 +449,7 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
return InvalidXLogRecPtr;
/*
- * Unless we need to reset the pending_lsn, we initally acquire the lock
+ * Unless we need to reset the pending_lsn, we initially acquire the lock
* in shared mode and try to fetch the required information. If we acquire
* in shared mode and find that the data structure hasn't been
* initialized, we reacquire the lock in exclusive mode so that we can
@@ -699,7 +699,7 @@ HandleWalSummarizerInterrupts(void)
*
* 'start_lsn' is the point at which we should start summarizing. If this
* value comes from the end LSN of the previous record as returned by the
- * xlograder machinery, 'exact' should be true; otherwise, 'exact' should
+ * xlogreader machinery, 'exact' should be true; otherwise, 'exact' should
* be false, and this function will search forward for the start of a valid
* WAL record.
*
@@ -872,7 +872,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
xlogreader->ReadRecPtr >= switch_lsn)
{
/*
- * Woops! We've read a record that *starts* after the switch LSN,
+ * Whoops! We've read a record that *starts* after the switch LSN,
* contrary to our goal of reading only until we hit the first
* record that ends at or after the switch LSN. Pretend we didn't
* read it after all by bailing out of this loop right here,
@@ -1061,7 +1061,7 @@ SummarizeSmgrRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
}
/*
- * Special handling for WAL recods with RM_XACT_ID.
+ * Special handling for WAL records with RM_XACT_ID.
*/
static void
SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
@@ -1116,7 +1116,7 @@ SummarizeXactRecord(XLogReaderState *xlogreader, BlockRefTable *brtab)
}
/*
- * Special handling for WAL recods with RM_XLOG_ID.
+ * Special handling for WAL records with RM_XLOG_ID.
*/
static bool
SummarizeXlogRecord(XLogReaderState *xlogreader)
@@ -1295,7 +1295,7 @@ summarizer_wait_for_wal(void)
* records to provoke a strong reaction. We choose to reduce the sleep
* time by 1 quantum for each page read beyond the first, which is a
* fairly arbitrary way of trying to be reactive without
- * overrreacting.
+ * overreacting.
*/
if (pages_read_since_last_sleep > sleep_quanta - 1)
sleep_quanta = 1;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index dbcda32554..d4aa9e1c96 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -706,7 +706,7 @@ UploadManifest(void)
pq_endmessage_reuse(&buf);
pq_flush();
- /* Recieve packets from client until done. */
+ /* Receive packets from client until done. */
while (HandleUploadManifestPacket(&buf, &offset, ib))
;
@@ -719,7 +719,7 @@ UploadManifest(void)
*
* We assume that MemoryContextDelete and MemoryContextSetParent won't
* fail, and thus we shouldn't end up bailing out of here in such a way as
- * to leave dangling pointrs.
+ * to leave dangling pointers.
*/
if (uploaded_manifest_mcxt != NULL)
MemoryContextDelete(uploaded_manifest_mcxt);
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 85d3f4e5de..b6ae6f2aef 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -454,7 +454,7 @@ check_backup_label_files(int n_backups, char **backup_dirs)
* The exact size limit that we impose here doesn't really matter --
* most of what's supposed to be in the file is fixed size and quite
* short. However, the length of the backup_label is limited (at least
- * by some parts of the code) to MAXGPATH, so include that value in
+ * by some parts of the code) to MAXPGPATH, so include that value in
* the maximum length that we tolerate.
*/
slurp_file(fd, pathbuf, buf, 10000 + MAXPGPATH);
@@ -1192,7 +1192,7 @@ scan_for_existing_tablespaces(char *pathname, cb_options *opt)
if (!is_absolute_path(link_target))
pg_fatal("symbolic link \"%s\" is relative", tblspcdir);
- /* Caonicalize the link target. */
+ /* Canonicalize the link target. */
canonicalize_path(link_target);
/*
@@ -1222,7 +1222,7 @@ scan_for_existing_tablespaces(char *pathname, cb_options *opt)
* we just record the paths within the data directories.
*/
snprintf(ts->old_dir, MAXPGPATH, "%s/%s", pg_tblspc, de->d_name);
- snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblpc/%s", opt->output,
+ snprintf(ts->new_dir, MAXPGPATH, "%s/pg_tblspc/%s", opt->output,
de->d_name);
ts->in_place = true;
}
diff --git a/src/bin/pg_combinebackup/t/004_manifest.pl b/src/bin/pg_combinebackup/t/004_manifest.pl
index 37de61ac06..4f3779274f 100644
--- a/src/bin/pg_combinebackup/t/004_manifest.pl
+++ b/src/bin/pg_combinebackup/t/004_manifest.pl
@@ -69,7 +69,7 @@ my $nocsum_manifest =
slurp_file($node->backup_dir . '/csum_none/backup_manifest');
my $nocsum_count = (() = $nocsum_manifest =~ /Checksum-Algorithm/mig);
is($nocsum_count, 0,
- "Checksum_Algorithm is not mentioned in no-checksum manifest");
+ "Checksum-Algorithm is not mentioned in no-checksum manifest");
# OK, that's all.
done_testing();
diff --git a/src/bin/pg_combinebackup/write_manifest.c b/src/bin/pg_combinebackup/write_manifest.c
index 82160134d8..5cf36c2b05 100644
--- a/src/bin/pg_combinebackup/write_manifest.c
+++ b/src/bin/pg_combinebackup/write_manifest.c
@@ -272,7 +272,7 @@ flush_manifest(manifest_writer *mwriter)
}
/*
- * Encode bytes using two hexademical digits for each one.
+ * Encode bytes using two hexadecimal digits for each one.
*/
static size_t
hex_encode(const uint8 *src, size_t len, char *dst)
diff --git a/src/common/blkreftable.c b/src/common/blkreftable.c
index 21ee6f5968..ccbb4006bd 100644
--- a/src/common/blkreftable.c
+++ b/src/common/blkreftable.c
@@ -100,7 +100,7 @@ typedef uint16 *BlockRefTableChunk;
* 'chunk_size' is an array storing the allocated size of each chunk.
*
* 'chunk_usage' is an array storing the number of elements used in each
- * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresonding
+ * chunk. If that value is less than MAX_ENTRIES_PER_CHUNK, the corresponding
* chunk is used as an array; else the corresponding chunk is used as a bitmap.
* When used as a bitmap, the least significant bit of the first array element
* is the status of the lowest-numbered block covered by this chunk.
@@ -567,7 +567,7 @@ WriteBlockRefTable(BlockRefTable *brtab,
* malformed. This is not used for I/O errors, which must be handled internally
* by read_callback.
*
- * 'error_callback_arg' is an opaque arguent to be passed to error_callback.
+ * 'error_callback_arg' is an opaque argument to be passed to error_callback.
*/
BlockRefTableReader *
CreateBlockRefTableReader(io_callback_fn read_callback,
@@ -922,7 +922,7 @@ BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
/*
* Next, we need to discard any offsets within the chunk that would
- * contain the limit_block. We must handle this differenly depending on
+ * contain the limit_block. We must handle this differently depending on
* whether the chunk that would contain limit_block is a bitmap or an
* array of offsets.
*/
@@ -955,7 +955,7 @@ BlockRefTableEntrySetLimitBlock(BlockRefTableEntry *entry,
}
/*
- * Mark a block in a given BlkRefTableEntry as known to have been modified.
+ * Mark a block in a given BlockRefTableEntry as known to have been modified.
*/
void
BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
@@ -1112,7 +1112,7 @@ BlockRefTableEntryMarkBlockModified(BlockRefTableEntry *entry,
}
/*
- * Release memory for a BlockRefTablEntry that was created by
+ * Release memory for a BlockRefTableEntry that was created by
* CreateBlockRefTableEntry.
*/
void
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 916c8ec8d0..b8b26c263d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12109,7 +12109,7 @@
proargnames => '{tli,start_lsn,end_lsn}',
prosrc => 'pg_available_wal_summaries' },
{ oid => '8437',
- descr => 'contents of a WAL sumamry file',
+ descr => 'contents of a WAL summary file',
proname => 'pg_wal_summary_contents', prorows => '100',
proretset => 't', provolatile => 'v', proparallel => 's',
prorettype => 'record', proargtypes => 'int8 pg_lsn pg_lsn',
On Thu, Dec 21, 2023 at 10:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Please look at the attached patch; it corrects all 29 items ("recods"
fixed in two places), but maybe you find some substitutions wrong...
Thanks, committed with a few additions.
I've also observed that those commits introduced new warnings:
$ CC=gcc-12 CPPFLAGS="-Wtype-limits" ./configure -q && make -s -j8
reconstruct.c: In function ‘read_bytes’:
reconstruct.c:511:24: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
511 | if (rb < 0)
| ^
reconstruct.c: In function ‘write_reconstructed_file’:
reconstruct.c:650:40: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
650 | if (rb < 0)
| ^
reconstruct.c:662:32: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
662 | if (wb < 0)
Oops. I think the variables should be type int. See attached.
There are also two deadcode.DeadStores complaints from clang. First one is
about:
/*
* Align the wait time to prevent drift. This doesn't really matter,
* but we'd like the warnings about how long we've been waiting to say
* 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
* drifting to something that is not a multiple of ten.
*/
timeout_in_ms -=
TimestampDifferenceMilliseconds(current_time, initial_time) %
timeout_in_ms;
It looks like this timeout is really not used.
Oops. It should be. See attached.
And the minor one (similar to many existing, maybe doesn't deserve fixing):
walsummarizer.c:808:5: warning: Value stored to 'summary_end_lsn' is never read [deadcode.DeadStores]
summary_end_lsn = private_data->read_upto;
^ ~~~~~~~~~~~~~~~~~~~~~~~
It kind of surprises me that this is dead, but it seems best to keep
it there to be on the safe side, in case some change to the logic
renders it not dead in the future.
Also, a comment above MaybeRemoveOldWalSummaries() basically repeats a
comment above redo_pointer_at_last_summary_removal declaration, but
perhaps it should say about removing summaries instead?
Wow, yeah. Thanks, will fix.
Thank you for paying attention to it!
I'll fix this next.
--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments:
fix-ib-thinkos.patch (application/octet-stream)
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 42bbe564e2..aa98e1872f 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -441,7 +441,7 @@ PrepareForIncrementalBackup(IncrementalBackupInfo *ib,
/* Wait for up to 10 seconds. */
summarized_lsn = WaitForWalSummarization(backup_state->startpoint,
- 10000, &pending_lsn);
+ timeout_in_ms, &pending_lsn);
/* If WAL summarization has progressed sufficiently, stop waiting. */
if (summarized_lsn >= backup_state->startpoint)
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 6decdd8934..21cba5b33d 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -504,7 +504,7 @@ make_rfile(char *filename, bool missing_ok)
static void
read_bytes(rfile *rf, void *buffer, unsigned length)
{
- unsigned rb = read(rf->fd, buffer, length);
+ int rb = read(rf->fd, buffer, length);
if (rb != length)
{
@@ -614,7 +614,7 @@ write_reconstructed_file(char *input_filename,
{
uint8 buffer[BLCKSZ];
rfile *s = sourcemap[i];
- unsigned wb;
+ int wb;
/* Update accounting information. */
if (s == NULL)
@@ -641,7 +641,7 @@ write_reconstructed_file(char *input_filename,
}
else
{
- unsigned rb;
+ int rb;
/* Read the block from the correct source, except if dry-run. */
rb = pg_pread(s->fd, buffer, BLCKSZ, offsetmap[i]);
21.12.2023 23:43, Robert Haas wrote:
There are also two deadcode.DeadStores complaints from clang. First one is
about:
/*
* Align the wait time to prevent drift. This doesn't really matter,
* but we'd like the warnings about how long we've been waiting to say
* 10 seconds, 20 seconds, 30 seconds, 40 seconds ... without ever
* drifting to something that is not a multiple of ten.
*/
timeout_in_ms -=
TimestampDifferenceMilliseconds(current_time, initial_time) %
timeout_in_ms;
It looks like this timeout is really not used.
Oops. It should be. See attached.
My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to its arguments being swapped.
The other changes look good to me.
Thank you!
Best regards,
Alexander
My compiler has the following complaint:
../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’:
../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
I haven't looked closely to see whether there is actually a problem here,
but the attached patch at least resolves the warning.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
fix_uninitialized_warning.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9b5d3cdeb0..0cf6bbe59d 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -438,7 +438,7 @@ GetOldestUnsummarizedLSN(TimeLineID *tli, bool *lsn_is_exact,
LWLockMode mode = reset_pending_lsn ? LW_EXCLUSIVE : LW_SHARED;
int n;
List *tles;
- XLogRecPtr unsummarized_lsn;
+ XLogRecPtr unsummarized_lsn = InvalidXLogRecPtr;
TimeLineID unsummarized_tli = 0;
bool should_make_exact = false;
List *existing_summaries;
On Sat, Dec 23, 2023 at 4:51 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
My compiler has the following complaint:
../postgresql/src/backend/postmaster/walsummarizer.c: In function ‘GetOldestUnsummarizedLSN’:
../postgresql/src/backend/postmaster/walsummarizer.c:540:32: error: ‘unsummarized_lsn’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
540 | WalSummarizerCtl->pending_lsn = unsummarized_lsn;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.
Thanks! I also noticed that WALSummarizerLock probably needs a mention in
wait_event_names.txt.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 27, 2023 at 10:36 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Wed, Dec 27, 2023 at 09:11:02AM -0500, Robert Haas wrote:
Thanks. I don't think there's a real bug, but I pushed a fix, same as
what you had.
Thanks! I also noticed that WALSummarizerLock probably needs a mention in
wait_event_names.txt.
Fixed.
It seems like it would be good if there were an automated cross-check
between lwlocknames.txt and wait_event_names.txt.
--
Robert Haas
EDB: http://www.enterprisedb.com
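Purely as a sketch of that idea (no such check exists yet; the file paths
and line formats below are assumptions about lwlocknames.txt and
wait_event_names.txt as of this writing):

grep -Eo '^[A-Za-z]+Lock' src/backend/storage/lmgr/lwlocknames.txt |
  sed 's/Lock$//' | sort >/tmp/locks
awk '/^Section:.*WaitEventLWLock/ {f=1; next} /^Section:/ {f=0}
     f && NF && $1 !~ /^#/ {print $1}' \
  src/backend/utils/activity/wait_event_names.txt | sort >/tmp/events
comm -23 /tmp/locks /tmp/events   # locks with no matching wait event name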
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to it's arguments swapped.
Thanks. Tom already changed the unsigned -> int stuff in a separate
commit, so I just pushed the fixes to PrepareForIncrementalBackup,
both the one I had before, and swapping the arguments to
TimestampDifferenceMilliseconds.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 3 Jan 2024 at 15:10, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin <exclusion@gmail.com>
wrote:
My quick experiment shows that that TimestampDifferenceMilliseconds call
always returns zero, due to its arguments being swapped.
Thanks. Tom already changed the unsigned -> int stuff in a separate
commit, so I just pushed the fixes to PrepareForIncrementalBackup,
both the one I had before, and swapping the arguments to
TimestampDifferenceMilliseconds.
I would like to query the following:
--tablespace-mapping=olddir=newdir
Relocates the tablespace in directory olddir to newdir during the
backup. olddir is the absolute path of the tablespace as it exists in the
first backup specified on the command line, and newdir is the absolute path
to use for the tablespace in the reconstructed backup.
The first backup specified on the command line will be the regular, full,
non-incremental backup. But if a tablespace was introduced subsequently,
it would only appear in an incremental backup. Wouldn't this then mean
that a mapping would need to be provided based on the path to the
tablespace of that incremental backup's copy?
Regards
Thom
On Thu, Apr 25, 2024 at 6:44 PM Thom Brown <thom@linux.com> wrote:
I would like to query the following:
--tablespace-mapping=olddir=newdir
Relocates the tablespace in directory olddir to newdir during the backup. olddir is the absolute path of the tablespace as it exists in the first backup specified on the command line, and newdir is the absolute path to use for the tablespace in the reconstructed backup.
The first backup specified on the command line will be the regular, full, non-incremental backup. But if a tablespace was introduced subsequently, it would only appear in an incremental backup. Wouldn't this then mean that a mapping would need to be provided based on the path to the tablespace of that incremental backup's copy?
Yes. Tomas Vondra found the same issue, which I have fixed in
1713e3d6cd393fcc1d4873e75c7fa1f6c7023d75.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
So I am a bit confused about the status of the tar format support, and
after re-reading the thread (or at least grepping it for ' tar '), this
wasn't really much discussed here either.
On Wed, Jun 14, 2023 at 02:46:48PM -0400, Robert Haas wrote:
- We only know how to operate on directories, not tar files. I thought
about that when working on pg_verifybackup as well, but I didn't do
anything about it. It would be nice to go back and make that tool work
on tar-format backups, and this one, too.
I believe "that tool" is pg_verifybackup, while "this one" is
pg_combinebackup? However, what's up with pg_basebackup itself with
respect to tar format incremental backups?
AFAICT (see below), pg_basebackup -Ft --incremental=foo/backup_manifest
happily creates an incremental backup in tar format; however,
pg_combinebackup will not be able to restore it? If that is the case,
shouldn't there be a bigger warning in the documentation about this, or
maybe pg_basebackup should refuse to make incremental tar-format backups
in the first place?
Am I missing something here? It will be obvious to users after the first
failure (to try to restore) that this will not work, and hopefully
everybody tests a restore before they put a backup solution into
production (or even better, waits until this whole feature is included in
a wholesale solution), but I wonder whether somebody might trip over
this after all and be unhappy. If one reads the pg_combinebackup
documentation carefully, it kinda becomes obvious that it does not
concern itself with tar format backups, but it is not spelt out
explicitly either.
|postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_full
|postgres@mbanck-lin-1:~$ pg_basebackup -c fast -Ft -D backup/backup_incr_1 --incremental=backup/backup_full/backup_manifest
|postgres@mbanck-lin-1:~$ echo $?
|0
|postgres@mbanck-lin-1:~$ du -h backup/
|44M backup/backup_incr_1
|4,5G backup/backup_full
|4,5G backup/
|postgres@mbanck-lin-1:~$ tar tf backup/backup_incr_1/base.tar | grep INCR | head
|base/1/INCREMENTAL.3603
|base/1/INCREMENTAL.2187
|base/1/INCREMENTAL.13418
|base/1/INCREMENTAL.3467
|base/1/INCREMENTAL.2615_vm
|base/1/INCREMENTAL.2228
|base/1/INCREMENTAL.3503
|base/1/INCREMENTAL.2659
|base/1/INCREMENTAL.2607_vm
|base/1/INCREMENTAL.4164
|postgres@mbanck-lin-1:~$ /usr/lib/postgresql/17/bin/pg_combinebackup backup/backup_full/ backup/backup_incr_1/ -o backup/combined
|pg_combinebackup: error: could not open file "backup/backup_incr_1//PG_VERSION": No such file or directory
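One plausible, untested workaround, given that backup_manifest is a plain
file even in a tar-format output directory: extract each archive into a
directory, copy the manifest alongside, and combine the directories. The
paths below mirror the session above; pg_wal.tar would need similar
handling into the combined directory's pg_wal/.

$ mkdir backup/full_dir backup/incr_1_dir
$ tar -xf backup/backup_full/base.tar -C backup/full_dir
$ cp backup/backup_full/backup_manifest backup/full_dir
$ tar -xf backup/backup_incr_1/base.tar -C backup/incr_1_dir
$ cp backup/backup_incr_1/backup_manifest backup/incr_1_dir
$ /usr/lib/postgresql/17/bin/pg_combinebackup backup/full_dir backup/incr_1_dir -o backup/combined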
Michael